Our Work & Projects
Real-world case studies and engineering builds — the messy problems, the actual decisions made, and the outcomes.
Zero-Downtime Migration: Self-Managed K8s to EKS with 2TB S3 Transfer
Full infrastructure migration for a live product — cross-account S3 data transfer and Kubernetes cluster migration with no user-facing downtime
The Problem
A growing SaaS product needed to move from a self-managed kubeadm cluster on EC2 to fully managed EKS in an entirely new AWS account. 2TB of user-uploaded assets, ML training data, and backups had to be moved without data loss, and zero downtime was non-negotiable.
What We Did
Used rclone for the cross-account S3 transfer with 32 parallel transfers and MD5 checksum verification. Exported all Kubernetes resources, stripped server-side fields, and recreated them in an EKS cluster provisioned with eksctl. Migrated resources in dependency order: ConfigMaps → StatefulSets → Deployments → Ingress. Replaced nginx-ingress with the AWS Load Balancer Controller and ACM-managed TLS certificates. Cut over with weighted Route 53 DNS routing: 10% → 50% → 100% over 48 hours.
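The transfer step can be sketched as two rclone invocations, a parallel copy followed by an independent verification pass. The remote names and bucket names below are placeholders, assuming both AWS accounts are configured as remotes in rclone.conf:

```shell
# Cross-account copy with 32 parallel transfers. rclone compares MD5
# checksums when both sides expose them; --checksum forces hash-based
# comparison instead of size+modtime.
rclone copy old:prod-assets new:prod-assets \
  --transfers 32 \
  --checksum \
  --progress

# Separate verification pass: compares hashes on both sides without
# copying any data, and exits non-zero on any mismatch.
rclone check old:prod-assets new:prod-assets
```

Running the check as a distinct step means the "zero data loss" claim rests on an independent read of both buckets, not just the copy's own bookkeeping.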
Results
- 2TB transferred with zero data loss — verified via rclone checksum
- 12 services migrated to EKS with zero user-facing downtime
- Deploy time dropped from 25 minutes (manual) to 4 minutes (GitHub Actions → EKS)
- Infrastructure cost reduced ~20% through better node right-sizing on managed nodegroups
- Full rollback available at every stage via weighted blue-green DNS routing
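One cutover step looks like the call below, shifting part of the weight to the new cluster's load balancer. The hosted zone ID, record name, and ALB hostname are placeholders, not the client's actual values:

```shell
# Shift 50% of traffic to the new EKS ingress. Weighted records need a
# SetIdentifier plus a Weight; the old record keeps its own identifier
# and the remaining weight.
aws route53 change-resource-record-sets \
  --hosted-zone-id ZEXAMPLE123 \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "app.example.com",
        "Type": "CNAME",
        "SetIdentifier": "new-eks",
        "Weight": 50,
        "TTL": 60,
        "ResourceRecords": [{"Value": "new-alb-123.us-east-1.elb.amazonaws.com"}]
      }
    }]
  }'
```

Rollback at any stage is the same call with the weight set back to 0, which is what makes every step of the 10% → 50% → 100% ramp reversible.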
Technologies
EKS migration, S3 rclone migration, Kubernetes migration, zero downtime, cross-account AWS
Cloud Native CI/CD Automation Platform
Replaced manual deployments causing release delays and production errors with a fully automated pipeline
The Problem
An 8-person engineering team was spending 25+ minutes per deployment on manual steps — pushing Docker images, SSHing into servers, restarting services by hand. Deployment errors were frequent. New engineers couldn't deploy safely without pairing with a senior.
What We Did
Designed end-to-end CI/CD using GitHub Actions: automated test runs on PR, Docker image build with layer caching, image publishing to ECR with environment-specific tags, and Kubernetes rolling deployments via kubectl. Added Slack notifications for deploy status and automatic rollback triggers on health check failures.
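The overall shape of such a pipeline can be sketched as a single workflow file. All names here (app, deployment, test command) are illustrative, and credential/kubeconfig setup steps are omitted for brevity:

```yaml
# Sketch of the pipeline: test, build with layer caching, push to ECR,
# rolling deploy with automatic rollback on failed health checks.
name: deploy
on:
  push:
    branches: [main]

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run tests
        run: make test   # placeholder for the project's test command

      - uses: docker/setup-buildx-action@v3

      - uses: aws-actions/amazon-ecr-login@v2
        id: ecr

      - name: Build and push image with layer caching
        uses: docker/build-push-action@v5
        with:
          push: true
          tags: ${{ steps.ecr.outputs.registry }}/app:${{ github.sha }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

      - name: Rolling deploy with auto-rollback
        run: |
          kubectl set image deployment/app app=${{ steps.ecr.outputs.registry }}/app:${{ github.sha }}
          kubectl rollout status deployment/app --timeout=120s \
            || kubectl rollout undo deployment/app
```

The rollback trigger is simply `kubectl rollout status` failing its timeout when readiness probes don't pass, so no custom tooling is needed to undo a bad release.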
Results
- Deployment time cut from 25 minutes to under 4 minutes
- Any engineer on the team can deploy independently and safely
- Zero manual SSH steps remaining in release process
- Automatic rollback triggered within 60 seconds of failed health checks
Technologies
CI/CD pipeline, GitHub Actions, Docker, Kubernetes, DevOps automation
Kubernetes Infrastructure for Machine Learning Workloads
Built GPU-enabled K8s cluster for an ML platform running heavy training and inference jobs
The Problem
An ML platform needed to run GPU-intensive training jobs that were crashing on shared infrastructure. Jobs were queuing for hours, GPU utilization was under 30%, and there was no isolation between training and inference workloads.
What We Did
Deployed a Kubernetes cluster with dedicated GPU node pools, using the NVIDIA GPU Operator for driver management. Configured node selectors and taints to isolate training from inference. Added the Kubernetes Cluster Autoscaler to spin GPU nodes up and down based on job queue depth, cutting idle GPU costs significantly.
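The isolation pattern can be sketched as a training Job that both selects and tolerates the training pool, so training pods land only on tainted GPU nodes and inference pods never do. Labels, taint keys, and image names below are illustrative:

```yaml
# Assumes training nodes are labeled and tainted, e.g.:
#   kubectl label nodes <node> workload=training
#   kubectl taint nodes <node> workload=training:NoSchedule
apiVersion: batch/v1
kind: Job
metadata:
  name: train-model
spec:
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        workload: training        # only schedule onto training nodes
      tolerations:
        - key: workload           # tolerate the taint that keeps
          operator: Equal         # everything else off this pool
          value: training
          effect: NoSchedule
      containers:
        - name: trainer
          image: registry.example.com/trainer:latest
          resources:
            limits:
              nvidia.com/gpu: 1   # GPU resource exposed by the NVIDIA device plugin
```

The taint keeps inference workloads off GPU nodes even if they are misconfigured, while the explicit `nvidia.com/gpu` request is what lets the autoscaler see unschedulable GPU demand and add nodes.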
Results
- GPU utilization improved from ~30% to a consistent 85%+ during training runs
- Training jobs no longer compete with inference — separate node pools with taints
- Autoscaler reduces GPU node count to zero when queue is empty (major cost saving)
- Stable training environment — no more OOM crashes from resource contention
Technologies
Kubernetes GPU, machine learning infrastructure, GPU clusters, ML deployment
Production Monitoring and Observability Stack
Built full observability layer for a team flying blind — no dashboards, no alerts, incidents discovered by users
The Problem
The team had no infrastructure monitoring. Incidents were discovered when users complained. There was no visibility into CPU, memory, pod restarts, or error rates. Debugging required SSH access and manual log grep.
What We Did
Deployed Prometheus with custom scrape configs for all services, Grafana dashboards for system metrics and business KPIs, and AlertManager with PagerDuty routing for on-call. Added structured logging with correlation IDs to make cross-service tracing possible without a paid tool.
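The two central config pieces can be sketched as fragments of prometheus.yml and alertmanager.yml. Job names, targets, and the PagerDuty key are placeholders:

```yaml
# prometheus.yml (fragment): one scrape job per service
scrape_configs:
  - job_name: api
    scrape_interval: 15s
    static_configs:
      - targets: ['api.internal:9090']
---
# alertmanager.yml (fragment): route alerts to the PagerDuty on-call
route:
  receiver: pagerduty-oncall
  group_by: [alertname, service]   # one page per failing service, not per pod
receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - service_key: REPLACE_WITH_INTEGRATION_KEY
```

Grouping by alertname and service is what turns a burst of per-pod firings into a single actionable page rather than an alert storm.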
Results
- Mean time to detect (MTTD) went from "user reports" to under 2 minutes
- On-call team gets actionable alerts with context — not just "something is down"
- Pod restart loops caught automatically before they affect users
- Grafana dashboards used in weekly engineering reviews to track reliability trends
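The restart-loop detection above can be sketched as a single Prometheus alerting rule, assuming kube-state-metrics is scraped (it exports the restart counter used here); thresholds are illustrative:

```yaml
# Fires when a container restarts more than 3 times in 15 minutes and
# the condition persists for 5 minutes, i.e. a genuine crash loop
# rather than a one-off restart.
groups:
  - name: pod-health
    rules:
      - alert: PodRestartLoop
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} is restart-looping"
```

The `for: 5m` clause is what keeps single OOM kills or node drains from paging anyone, while sustained loops still surface well before users notice.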
Technologies
Prometheus monitoring, Grafana dashboards, observability, incident detection
Scrum CLI Tool
Command-line sprint management tool built for engineering teams who live in the terminal
The Problem
Engineering teams needed a better way to manage agile workflows directly from the terminal without switching tools.
What We Did
Built a full-featured CLI application with sprint management, task tracking, and developer-focused productivity features.
Results
- Streamlined terminal-based workflow management
- Reduced context switching for developers
- Improved team productivity and agility
Technologies
Scrum CLI, agile tool, developer productivity, command-line interface
Have a similar problem?
Whether it's a cloud migration, K8s setup, or full infrastructure build — let's talk about what you're working with.
Start a Project →