The Challenge
A UK retail chain with 120 physical stores and a growing e-commerce operation was running its digital infrastructure on co-located bare-metal servers managed by a single in-house sysadmin. The setup had worked at small scale but was now a liability.
Monthly deployment terror. Releases happened once a month during a scheduled maintenance window, typically on a Sunday night. Each deployment involved SSHing into production servers, manually pulling code, running database migrations, and restarting services. Rollbacks meant restoring ad-hoc tar backups. Two of the last six deployments had caused partial outages lasting 2–4 hours, directly impacting online sales.
No scaling capability. During Black Friday and Boxing Day sales, the e-commerce site consistently crashed or slowed to a crawl. The bare-metal servers had fixed capacity. There was no auto-scaling, no CDN, and no load balancing beyond a single Nginx reverse proxy. The business estimated it lost over 150,000 GBP in the previous year's peak season due to site performance issues.
Zero observability. When something broke, the team found out from customer complaints or a spike in refund requests. There were no centralised logs, no metrics dashboards, no alerting, and no distributed tracing. Debugging production issues meant manually tailing log files on individual servers.
The Approach
RG INSYS proposed a phased transformation: containerise the existing applications, migrate to AWS with Kubernetes orchestration, automate everything with CI/CD, and build an observability stack that gave the team real-time visibility into system health.
Containerisation first: We dockerised all five services (e-commerce frontend, API backend, inventory sync, payment gateway, admin panel) without rewriting application code. Each service got a Dockerfile, health checks, and environment-based configuration. We validated containers locally and in staging before touching production.
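The "no rewrite" rule meant each Dockerfile wrapped the service as-is. A minimal sketch of what one looked like — the base image, port, and `/healthz` endpoint are illustrative assumptions, not the client's actual codebase:

```dockerfile
# Illustrative Dockerfile for one service; names and ports are placeholders.
FROM node:18-alpine

WORKDIR /app

# Install dependencies in a separate layer so rebuilds are cached
COPY package*.json ./
RUN npm ci --omit=dev

COPY . .

# Configuration comes from the environment, never from baked-in files
ENV NODE_ENV=production \
    PORT=3000

EXPOSE 3000

# Health check target; the orchestrator restarts the container if it fails
HEALTHCHECK --interval=30s --timeout=3s --retries=3 \
  CMD wget -qO- http://localhost:3000/healthz || exit 1

CMD ["node", "server.js"]
```

The same three ingredients — a Dockerfile, a health endpoint, and environment-based config — were all Kubernetes needed later for probes and rolling updates.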
Kubernetes on AWS EKS: We provisioned an EKS cluster using Terraform, with separate node groups for compute-heavy and memory-heavy workloads. Horizontal pod autoscaling ensured the e-commerce frontend could scale from 3 to 30 pods during traffic spikes. RDS replaced the self-managed MySQL instance, and ElastiCache replaced the hand-configured Redis server.
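The 3-to-30-pod scaling behaviour can be sketched as a HorizontalPodAutoscaler manifest. This shows CPU-based scaling only; the request-rate signal mentioned above would come in via a custom metrics adapter. Names, namespace, and the 70% threshold are assumptions:

```yaml
# Sketch of the frontend autoscaler; names and thresholds are illustrative.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: storefront-frontend
  namespace: ecommerce
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: storefront-frontend
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```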
CI/CD with GitHub Actions: Every push to main triggered an automated pipeline: lint, test, build Docker image, push to ECR, deploy to staging. Production deploys required a manual approval gate and used rolling updates with automatic rollback on health check failure. The entire deployment process went from 3 hours of manual work to a 12-minute automated pipeline.
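An abridged sketch of such a workflow — job names, secrets, region, and the `$ECR_REPO` variable are placeholders, and the approval gate is the standard GitHub "environment" protection rule:

```yaml
# Illustrative pipeline; repository-specific values are placeholders.
name: build-and-deploy
on:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make lint test

  build-and-push:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_DEPLOY_ROLE }}
          aws-region: eu-west-2
      - uses: aws-actions/amazon-ecr-login@v2
      - run: |
          docker build -t "$ECR_REPO:${{ github.sha }}" .
          docker push "$ECR_REPO:${{ github.sha }}"

  deploy-production:
    needs: build-and-push
    environment: production   # manual approval gate lives here
    runs-on: ubuntu-latest
    steps:
      - run: |
          kubectl set image deployment/api api="$ECR_REPO:${{ github.sha }}"
          # Roll back automatically if the rollout never becomes healthy
          kubectl rollout status deployment/api --timeout=5m \
            || kubectl rollout undo deployment/api
```

The `rollout status || rollout undo` pattern is what turns a failed health check into an automatic rollback rather than a stuck deploy.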
Full observability stack: We deployed Prometheus and Grafana for metrics, Loki for log aggregation, and Jaeger for distributed tracing. Custom dashboards showed real-time request rates, error rates, latency percentiles, and resource utilisation. PagerDuty integration ensured on-call engineers were alerted within 60 seconds of an anomaly.
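The alerting side can be illustrated with a single Prometheus rule. The metric name, 5% threshold, and one-minute window are assumptions; routing the `page` severity to PagerDuty is Alertmanager's job:

```yaml
# Illustrative alert rule; labels and thresholds are assumptions.
groups:
  - name: service-health
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            /
          sum(rate(http_requests_total[5m])) by (service)
            > 0.05
        for: 1m
        labels:
          severity: page   # Alertmanager routes this severity to PagerDuty
        annotations:
          summary: "{{ $labels.service }} 5xx error rate above 5%"
```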
Timeline: Week by Week
Weeks 1–2: Infrastructure audit and migration planning. Dockerisation of all five services. Local and staging validation. Terraform modules for AWS infrastructure.
Weeks 3–4: EKS cluster provisioning with Terraform. RDS and ElastiCache setup. Networking (VPC, subnets, security groups, ALB). Container registry (ECR) configuration.
Weeks 5–6: CI/CD pipeline in GitHub Actions: build, test, stage, and production workflows. Rolling deployment strategy with health checks and automatic rollback. Secret management via AWS Secrets Manager.
Weeks 7–8: Observability stack: Prometheus, Grafana, Loki, Jaeger. Custom dashboards for each service. Alerting rules for error rates, latency thresholds, and resource limits. PagerDuty integration.
Weeks 9–10: Production migration. Traffic cutover with DNS-based blue-green switch. Load testing at 5x normal traffic. Post-migration monitoring and 2-week stabilisation support.
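The DNS-based blue-green switch in the final phase can be sketched as a pair of weighted Route 53 records in Terraform. The domain, zone, and load-balancer references are placeholders; shifting `weight` moves traffic between the old and new stacks, and setting the old side to 0 completes the cutover:

```hcl
# Sketch of the weighted record pair used for the cutover; all names
# and targets are placeholders, not the client's real infrastructure.
resource "aws_route53_record" "blue" {
  zone_id        = var.zone_id
  name           = "shop.example.co.uk"
  type           = "A"
  set_identifier = "blue-legacy"

  weighted_routing_policy {
    weight = 0 # drained: all traffic now flows to green
  }

  alias {
    name                   = var.legacy_lb_dns_name
    zone_id                = var.legacy_lb_zone_id
    evaluate_target_health = true
  }
}

resource "aws_route53_record" "green" {
  zone_id        = var.zone_id
  name           = "shop.example.co.uk"
  type           = "A"
  set_identifier = "green-eks"

  weighted_routing_policy {
    weight = 100
  }

  alias {
    name                   = aws_lb.ingress.dns_name
    zone_id                = aws_lb.ingress.zone_id
    evaluate_target_health = true
  }
}
```

Because both records stay in place, rolling back is a one-line weight change rather than a migration in reverse.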
Tech Stack
- Orchestration: Kubernetes (AWS EKS), Helm charts
- Infrastructure as Code: Terraform, AWS CloudFormation
- CI/CD: GitHub Actions, Docker, AWS ECR
- Cloud: AWS (EKS, RDS, ElastiCache, S3, CloudFront, ALB, Route 53)
- Observability: Prometheus, Grafana, Loki, Jaeger, PagerDuty
- Security: AWS Secrets Manager, IAM roles for service accounts, network policies
- AI tooling: Claude Code, Cursor IDE (for Terraform modules and pipeline scripts)
Results
Key Features Delivered
- Auto-scaling Kubernetes cluster: E-commerce frontend scales from 3 to 30 pods based on CPU and request rate. Black Friday traffic handled without manual intervention or performance degradation.
- Automated CI/CD pipeline: Push-to-deploy with automated testing, staging validation, production approval gate, rolling updates, and automatic rollback. Average pipeline execution: 12 minutes.
- Full observability: Grafana dashboards with real-time metrics per service, centralised log search via Loki, distributed request tracing via Jaeger. Anomaly alerts via PagerDuty with 60-second notification SLA.
- Infrastructure as Code: Entire AWS infrastructure defined in Terraform. New environments (staging, QA) can be spun up in under 30 minutes. Scheduled drift detection flags any manual configuration changes so the live estate never silently diverges from the code.
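Drift detection of the kind described above can be sketched as a nightly workflow: `terraform plan -detailed-exitcode` returns exit code 2 when live infrastructure differs from the code. The schedule and step names are illustrative assumptions:

```yaml
# Hypothetical nightly drift check; timings and names are placeholders.
name: terraform-drift-check
on:
  schedule:
    - cron: "0 6 * * *" # every morning before business hours

jobs:
  drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init -input=false
      - id: plan
        run: terraform plan -detailed-exitcode -input=false
        continue-on-error: true # exit code 2 means drift, not a crash
      - if: steps.plan.outcome == 'failure'
        run: echo "::error::Drift detected, live AWS state no longer matches Terraform"
```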
Struggling with deployments or downtime?
We modernise infrastructure and automate release pipelines so your team ships faster with confidence. Get a scope and estimate within 48 hours.
Book Free Consultation →