Infrastructure Overview¶

The WebGrip platform is built on a Kubernetes-based infrastructure that provides a scalable, secure, and observable foundation for application development and deployment.

Infrastructure Philosophy¶

Our infrastructure follows these core principles:

🏗️ Infrastructure as Code: All infrastructure defined in version-controlled Helm charts
🔒 Security by Default: Security controls built into every layer of the stack
📊 Observable by Design: Comprehensive monitoring and logging across all components
⚡ Self-Service: Developers can deploy and manage applications independently
🔄 GitOps: All changes flow through Git workflows with proper review and automation

Platform Stack¶

Container Orchestration¶

Technology: Kubernetes Configuration: catalog/systems/kubernetes.yaml

Kubernetes provides the foundation for container orchestration, offering:

Pod Management: Automated deployment, scaling, and health management
Service Discovery: Built-in DNS and service mesh capabilities
Resource Management: CPU, memory, and storage allocation and limits
Security Isolation: Namespace-based multi-tenancy and network policies

Package Management¶

Technology: Helm Configuration: ops/helm/

Helm manages Kubernetes application deployments through templated charts:

Chart Category	Purpose	Location
005-tainters	Node tainting and tolerations	`ops/helm/005-tainters/`
007-cluster-monitoring	Platform monitoring stack	`ops/helm/007-cluster-monitoring/`
010-cert-manager	Certificate automation	`ops/helm/010-cert-manager/`
020-cluster-issuers	Certificate issuers	`ops/helm/020-cluster-issuers/`
030-ingress-controllers	Ingress and load balancing	`ops/helm/030-ingress-controllers/`
040-gha-runners-controller	CI/CD infrastructure	`ops/helm/040-gha-runners-controller/`
045-gha-runners	Self-hosted runner instances	`ops/helm/045-gha-runners/`
060-grafana-stack	Observability dashboards	`ops/helm/060-grafana-stack/`
950-example-services	Sample applications	`ops/helm/950-example-services/`

📋 Chart Naming Convention: Charts are numbered to indicate deployment order and dependencies. Lower numbers deploy first.

Cloud Infrastructure¶

Provider: AWS (Amazon Web Services) Cluster: DigitalOcean Kubernetes (DOKS) Configuration: catalog/resources/staging-doks-cluster.yaml

The platform currently runs on a DigitalOcean Kubernetes cluster with AWS integrations for:

Identity & Access: AWS IAM integration for access control
Container Registry: AWS ECR for container image storage
Backup & Recovery: AWS S3 for persistent volume backups
External Services: AWS services for extended platform capabilities

Infrastructure Components¶

Core Platform Services¶

Component Details¶

Component	Type	Purpose	Configuration
Traefik	Ingress Controller	Load balancing and traffic routing	`ops/helm/030-ingress-controllers/ingress-traefik/`
cert-manager	Certificate Automation	TLS certificate provisioning	`ops/helm/010-cert-manager/`
kube-prometheus-stack	Monitoring	Metrics collection and alerting	`ops/helm/007-cluster-monitoring/kube-prometheus-stack/`
Grafana	Observability	Dashboard visualization	`ops/helm/060-grafana-stack/`
GitHub Actions Runners	CI/CD	Self-hosted CI/CD execution	`ops/helm/040-gha-runners-controller/`

Infrastructure Automation¶

GitOps Workflow¶

All infrastructure changes follow a GitOps workflow:

Deployment Automation¶

Workflow: .github/workflows/on_source_change.yml

Infrastructure deployments are automated through GitHub Actions:

Validation: Helm chart linting and security scanning
Staging: Deploy to staging environment for validation
Production: Deploy to production with manual approval gates
Verification: Automated health checks and rollback on failure

Secret Management¶

Technology: SOPS + Age Configuration: ops/secrets/

Secrets are encrypted at rest and managed through:

Encryption: Age-based encryption with public key distribution
Access Control: Role-based access to decrypt specific secret categories
Audit Trail: All secret changes tracked in Git history
Rotation: Structured processes for secret rotation and distribution

Secret Categories: - 007-kube-prometheus-stack-secrets: Monitoring credentials - 010-cert-manager-secrets: Certificate authority credentials
- 030-ingress-controllers: Ingress configuration secrets - 045-gha-runners-secrets: CI/CD runner credentials - 060-grafana-stack: Dashboard and alerting credentials

Infrastructure Requirements¶

Prerequisites¶

To work with this infrastructure, you need:

# Required tools (from README.md)
brew install awscli kubectl helm terraform age sops kubectx derailed/k9s/k9s

Access Requirements: - AWS CLI configured with appropriate permissions - kubectl access to the staging cluster - Age key for secret decryption (for authorized personnel) - GitHub repository access for GitOps workflows

Local Development Setup¶

# Configure AWS and cluster access
aws configure
aws eks update-kubeconfig --name staging-eks-cluster --region eu-west-1

# Verify cluster connectivity
kubectl get nodes

# Access platform dashboards
make view-grafana    # Grafana dashboards
make view-traefik    # Traefik dashboard

🔧 Makefile Commands: All operational commands are documented in the Makefile with targets for common tasks.

Infrastructure Monitoring¶

Health Indicators¶

The platform monitors key infrastructure health metrics:

Metric	Source	Dashboard
Cluster Resource Usage	kube-state-metrics	Cluster Overview
Pod Health & Restarts	kubelet	Pod Status Dashboard
Ingress Traffic & Latency	Traefik	Ingress Dashboard
Certificate Expiry	cert-manager	Certificate Dashboard
CI/CD Runner Health	GitHub Actions Controller	CI/CD Dashboard

Alerting¶

Critical infrastructure alerts are configured for:

Cluster Resource Exhaustion: CPU, memory, and storage thresholds
Component Health: Platform service availability and response times
Security Events: Certificate expiry, authentication failures
Performance Degradation: Latency and error rate thresholds

Next Steps¶

Dive deeper into specific infrastructure areas:

🌐 Network Architecture
Understand pod networking, service discovery, and ingress configuration

🔒 Security Model
Review security controls, access management, and compliance

📊 Resource Management
Learn about resource allocation, scaling, and capacity planning

🔧 Platform Components
Explore individual platform services and their configurations

🏗️ Infrastructure Evolution: Infrastructure changes follow our Architecture Decision Records (ADRs). Significant infrastructure modifications require an ADR and stakeholder review.