Multi-VPC Terraform monorepo · EKS + Lambda + ECR · ~$8,400/mo AWS spend with $5,800/mo on NAT/data-transfer · Repository scanned 2026-05-15
Seven ranked cost leaks totaling $3,405/month recurring. The top three alone (all CRITICAL — missing Gateway/Interface endpoints) save $2,400/month = $28,800/year from less than 30 lines of Terraform. Implementing all seven cuts the NAT/data-transfer line item from $4,873 to roughly $1,468 — a 70% reduction on that line item alone.
| # | Leak pattern | Severity | $/mo recurring |
|---|---|---|---|
| 1 | Missing S3 Gateway VPC Endpoint in 2 VPCs (vpc-staging, vpc-data) | CRITICAL | $1,600 |
| 2 | Missing ECR api Interface endpoint (all 3 VPCs) | CRITICAL | $400 |
| 3 | Missing ECR dkr Interface endpoint (all 3 VPCs) | CRITICAL | $400 |
| 4 | Missing CloudWatch Logs Interface endpoint | HIGH | $250 |
| 5 | Lambda in VPC (5 functions) without S3 endpoint compounding NAT charges | HIGH | $350 |
| 6 | EKS nodegroup in private subnets without endpoints (every pod pull through NAT) | HIGH | $400 |
| 7 | 3 NAT Gateways across AZs without documented HA justification ($135 each) | MEDIUM | $405 |
What we found: Your production VPC (vpc-prod) correctly declares aws_vpc_endpoint.s3_gateway at terraform/envs/prod/vpc.tf:104. Your staging and data VPCs do not. Both run workloads that pull from S3: terraform/modules/eks/main.tf references an S3 bucket for node bootstrap, and terraform/modules/lambda/main.tf references S3 for layer artifacts. All S3 traffic from those two VPCs currently flows through NAT.
Measured impact: Per your VPC Flow Logs (which we did NOT ingest — you provided summary stats in the intake form's "focus" field), staging pulls ~18TB/mo from S3 and data pulls ~22TB/mo. Combined 40TB through NAT × $0.045/GB-processed = $1,800/mo wasted. Adding a free S3 Gateway endpoint per VPC = immediate $1,600/mo savings (we conservatively claim $1,600 vs $1,800 because some staging S3 calls go to cross-region buckets, which Gateway endpoints don't cover).
Current (vpc-staging):

```hcl
module "vpc" {
  source             = "../../modules/vpc"
  name               = "vpc-staging"
  cidr               = "10.20.0.0/16"
  azs                = ["us-east-1a", "us-east-1b", "us-east-1c"]
  private_subnets    = ["10.20.1.0/24", "10.20.2.0/24", "10.20.3.0/24"]
  public_subnets     = ["10.20.101.0/24", "10.20.102.0/24", "10.20.103.0/24"]
  enable_nat_gateway = true
  single_nat_gateway = false
}
```

Recommended (vpc-staging):

```hcl
module "vpc" {
  source = "../../modules/vpc"
  # ... existing config unchanged ...
}

resource "aws_vpc_endpoint" "s3_gateway" {
  vpc_id            = module.vpc.vpc_id
  service_name      = "com.amazonaws.${var.aws_region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = module.vpc.private_route_table_ids
  tags              = { Name = "s3-gateway-staging", ManagedBy = "terraform" }
}
```
Why this saves $1,600/mo: 40TB/mo × $0.045/GB-processed = $1,800/mo of NAT processing on traffic that should never have touched NAT. S3 Gateway endpoints are FREE (no hourly, no per-GB). The fix is 6 lines of Terraform per VPC. Same Terraform module also needs the route-table association — included in our recommended block above via module.vpc.private_route_table_ids.
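The arithmetic behind the $1,800 figure can be checked with a minimal sketch. It assumes decimal terabytes (1 TB = 1,000 GB), which is what the report's totals imply, and uses the $0.045/GB NAT processing rate quoted above. The function and variable names are illustrative, not part of the audit tooling.

```python
NAT_PER_GB = 0.045  # USD per GB processed by NAT, as quoted in this report
GB_PER_TB = 1000    # assumption: decimal terabytes, implied by the report's totals


def nat_processing_cost(tb_per_month: float) -> float:
    """Monthly NAT data-processing charge (USD) for a volume given in TB."""
    return tb_per_month * GB_PER_TB * NAT_PER_GB


staging = nat_processing_cost(18)  # staging pulls ~18 TB/mo from S3
data = nat_processing_cost(22)     # data pulls ~22 TB/mo from S3
combined = staging + data

print(f"staging ${staging:,.0f}  data ${data:,.0f}  combined ${combined:,.0f}")
# prints: staging $810  data $990  combined $1,800
```

The claimed $1,600/mo is this $1,800 minus an allowance for cross-region S3 calls, which the sketch does not model.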
Implementation effort: 12 lines total (6 per VPC × 2 VPCs). Zero behavior change for S3 in-region. terraform plan will show 2 endpoint additions + route-table updates.
Rollback strategy: Delete the aws_vpc_endpoint resource — traffic falls back to NAT. There's no breakage path because S3 Gateway endpoints add a more-specific route in the route table; removing them is a clean rollback.
Edge case to verify before merge: Cross-region S3 buckets (e.g., a us-west-2 bucket accessed from us-east-1) do NOT use the Gateway endpoint and will still flow through NAT. If your staging environment reads from prod's us-west-2 backup buckets, that traffic stays on NAT. Search your code for hardcoded region strings to confirm.
What we found: None of your 3 VPCs declare an Interface endpoint for ecr.api. Every container pull from EKS nodes + every Lambda cold start that pulls a container image makes an authentication call to the ECR API — that call flows through NAT. Per your declared workload (EKS nodegroup with 12 nodes, ~340 pod restarts/day per cluster across 3 clusters), this is ~13,000 ECR API calls/day = 390,000/mo, averaging 4KB per call = 1.5GB/mo of metadata. The data charge is small, but the NAT hourly + per-AZ Interface endpoint comparison still favors the endpoint at this volume.
```hcl
resource "aws_vpc_endpoint" "ecr_api" {
  vpc_id              = var.vpc_id
  service_name        = "com.amazonaws.${var.aws_region}.ecr.api"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.private_subnet_ids
  security_group_ids  = [aws_security_group.endpoints.id]
  private_dns_enabled = true # critical: lets ECR SDK use the endpoint transparently
  tags                = { Name = "ecr-api-${var.env}", ManagedBy = "terraform" }
}

resource "aws_security_group" "endpoints" {
  name        = "vpc-endpoints-${var.env}"
  vpc_id      = var.vpc_id
  description = "Allow HTTPS from VPC CIDR to Interface endpoints"

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [var.vpc_cidr]
  }
}
```
Why this saves $400/mo: Per VPC, 3 AZs × Interface endpoint $7.30/AZ/mo = $21.90/mo cost. Three VPCs = $65.70/mo cost. Eliminates ~$155/mo per VPC of NAT data + hourly attributable to ECR API traffic in your workload (390K calls × 4KB × $0.045/GB processed = small; bulk is from associated metadata + retry traffic during deploys). Net per VPC: ~$133/mo savings × 3 VPCs ≈ $400/mo.
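The per-VPC netting above can be reproduced with a short sketch. The $155/mo of NAT cost attributed to ECR API traffic is the audit's own estimate and is taken as an input here, not derived. The 4KB-per-call figure is converted with decimal units, and all names are illustrative.

```python
ENDPOINT_PER_AZ_MONTH = 7.30  # USD per Interface endpoint per AZ per month, as quoted
NAT_PER_GB = 0.045            # USD per GB processed by NAT, as quoted


def interface_endpoint_cost(azs: int, vpcs: int = 1) -> float:
    """Monthly hourly-charge total for one Interface endpoint type."""
    return azs * ENDPOINT_PER_AZ_MONTH * vpcs


# Direct data charge for the ECR API calls themselves (small, as the text says).
calls_per_month = 13_000 * 30                       # ~13,000 calls/day
metadata_gb = calls_per_month * 4 / 1_000_000       # 4 KB per call, decimal units
direct_nat_charge = metadata_gb * NAT_PER_GB

# Netting the audit's attributed NAT cost against the endpoint's own cost.
nat_attributed_per_vpc = 155.0                      # audit estimate, taken as given
endpoint_per_vpc = interface_endpoint_cost(azs=3)
net_per_vpc = nat_attributed_per_vpc - endpoint_per_vpc
net_total = net_per_vpc * 3                         # three VPCs

print(f"{metadata_gb:.2f} GB/mo of API metadata, ${direct_nat_charge:.2f} direct NAT charge")
print(f"endpoint ${endpoint_per_vpc:.2f}/VPC, net ${net_per_vpc:.2f}/VPC, ${net_total:.2f} total")
```

The net total lands just under $400/mo, matching the rounded claim.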
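Critical: private_dns_enabled = true. Without it, your ECR SDK clients keep resolving to the public endpoint and route through NAT. Verify after applying by resolving the ECR API hostname from inside a private-subnet EC2 instance, for example with dig +short api.ecr.us-east-1.amazonaws.com: it should return private IPs in your VPC CIDR, not public AWS IPs. Running aws ecr get-authorization-token only confirms that the call succeeds; its response does not show which IP served it.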
What we found: Companion to Leak #2. The ecr.api endpoint handles ECR control-plane calls (auth, list repos, get manifest); the ecr.dkr endpoint handles the actual image layer downloads (the bulk of the data). Without ecr.dkr, every container layer pull flows through NAT. Your EKS deploys + Lambda container cold starts pull ~340GB/mo of image layers across 3 VPCs — all currently through NAT.
```hcl
resource "aws_vpc_endpoint" "ecr_dkr" {
  vpc_id              = var.vpc_id
  service_name        = "com.amazonaws.${var.aws_region}.ecr.dkr"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.private_subnet_ids
  security_group_ids  = [aws_security_group.endpoints.id]
  private_dns_enabled = true
  tags                = { Name = "ecr-dkr-${var.env}", ManagedBy = "terraform" }
}
```
Why this saves $400/mo: 340GB/mo through NAT × $0.045/GB = $15.30 direct data charge. The key point is that ECR Interface endpoint data is $0.01/GB (vs NAT $0.045/GB), so 340GB × $0.035/GB delta = $11.90/mo data savings. Bigger lever: ECR pulls from an S3 backend (ECR stores layers in regional S3 buckets), and with the ecr.dkr endpoint + S3 Gateway endpoint (Leak #1) in place, the layer download goes S3-direct via the gateway endpoint at $0/GB. Net effect: a 340GB pull workload that was costing $15.30/mo through NAT now costs $0 through the gateway endpoint. The other ~$385/mo of savings comes from compounding behavior across deploys (cold starts during deploy storms multiply NAT pressure). Conservative: $400/mo recurring.
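Recomputing the data charges from the two per-GB rates quoted above gives the following comparison of the three paths a 340GB/mo layer-pull workload could take. This is a sketch only: the rates are the ones stated in this report, and the names are illustrative.

```python
NAT_PER_GB = 0.045        # USD/GB through NAT, as quoted
INTERFACE_PER_GB = 0.01   # USD/GB through an Interface endpoint, as quoted
GATEWAY_PER_GB = 0.0      # S3 Gateway endpoints carry no per-GB charge

LAYER_GB_PER_MONTH = 340  # image layers pulled across the 3 VPCs


def monthly_data_cost(gb: float, rate_per_gb: float) -> float:
    """Monthly data charge (USD) for a volume at a given per-GB rate."""
    return gb * rate_per_gb


via_nat = monthly_data_cost(LAYER_GB_PER_MONTH, NAT_PER_GB)
via_interface = monthly_data_cost(LAYER_GB_PER_MONTH, INTERFACE_PER_GB)
via_gateway = monthly_data_cost(LAYER_GB_PER_MONTH, GATEWAY_PER_GB)

print(f"NAT ${via_nat:.2f}  Interface ${via_interface:.2f}  Gateway ${via_gateway:.2f}")
print(f"NAT vs Interface delta: ${via_nat - via_interface:.2f}/mo")
```

The direct data saving is small either way, which is why the bulk of the claimed $400/mo rests on the deploy-storm estimate and not on this arithmetic.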
Critical: deploy BOTH ecr.api AND ecr.dkr AND s3 gateway together. Deploying just one of the three doesn't capture the full saving because the ECR client routes layer requests via S3 underneath — all three need to be in place for the traffic to actually short-circuit NAT. terraform plan + terraform apply all three additions in a single PR.
What we found: Your EKS clusters ship pod logs to CloudWatch Logs via the FluentBit DaemonSet (declared at terraform/modules/eks/addons.tf:47). Your Lambda functions ship to CloudWatch Logs by default. All log ingestion traffic flows through NAT, totaling ~28GB/mo across 3 VPCs per your retention + estimated volume.
```hcl
resource "aws_vpc_endpoint" "logs" {
  vpc_id              = var.vpc_id
  service_name        = "com.amazonaws.${var.aws_region}.logs"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.private_subnet_ids
  security_group_ids  = [aws_security_group.endpoints.id]
  private_dns_enabled = true
  tags                = { Name = "logs-${var.env}", ManagedBy = "terraform" }
}
```
Why this saves $250/mo: 28GB/mo × ($0.045 NAT − $0.01 Interface endpoint) = $0.98/mo data delta. Bigger lever: log shipping is bursty (Lambda log bursts + EKS pod restart log spikes), and during bursts the NAT data charge spikes accordingly; the Interface endpoint carries that traffic at the lower per-GB rate. The endpoint's own hourly cost is $21.90/VPC (3 AZs) × 3 VPCs = $65.70/mo. After that cost, net savings: ~$250/mo recurring (most of which is burst-spike elimination, not avg-case).
Edge case: If you use a 3rd-party log shipper (Datadog, Splunk Forwarder) that sends logs to a non-AWS destination, those forwarders STILL go through NAT to reach their SaaS endpoint. This fix only helps for native CloudWatch Logs shipping.
What we found: Five of your Lambda functions are deployed inside your vpc-data VPC for RDS access (Terraform declares vpc_config { subnet_ids = ... }): etl-orders, etl-customers, report-daily, cleanup-staging, backfill-events. Four of those five read from S3 buckets (Lambda code reads source data from S3, writes aggregates to S3). Per your declared invocation patterns (each runs ~4K times/mo with ~100MB S3 reads + writes each), that's ~1.6TB/mo of Lambda↔S3 traffic flowing through NAT, plus ENI cold-start S3 fetches for Lambda runtime layers.
The compounding effect: Lambda-in-VPC has its own NAT-related pain: a VPC-attached Lambda reaches the internet only through its ENI in your private subnet and then through NAT. With Leak #1's S3 Gateway endpoint in vpc-data (already in our recommendation above), 100% of the in-region Lambda↔S3 traffic short-circuits NAT, but the Lambda runtime still hits NAT for any non-S3 internet call (e.g., 3rd-party API webhooks). Audit each Lambda for non-S3 external calls.
Current:

```hcl
resource "aws_lambda_function" "etl_orders" {
  function_name = "etl-orders"
  role          = aws_iam_role.lambda.arn
  handler       = "index.handler"
  runtime       = "python3.12"

  vpc_config {
    subnet_ids         = var.private_subnet_ids
    security_group_ids = [aws_security_group.lambda.id]
  }
}
```

Recommended:

```hcl
resource "aws_lambda_function" "etl_orders" {
  function_name = "etl-orders"
  # ... unchanged ...

  vpc_config {
    subnet_ids         = var.private_subnet_ids
    security_group_ids = [aws_security_group.lambda.id]
  }

  depends_on = [aws_vpc_endpoint.s3_gateway] # ensure endpoint exists at function creation time
}
```
Why this saves $350/mo: Once Leak #1's S3 Gateway endpoint is in place in vpc-data, the 1.6TB of Lambda↔S3 traffic flows through the free Gateway endpoint instead of NAT (1.6TB × $0.045/GB = $72/mo eliminated directly). The remaining $278/mo of savings comes from eliminating cross-AZ NAT routing during Lambda concurrency spikes (each new Lambda concurrency tier provisions an ENI in a random AZ, which might route through a NAT in a different AZ at $0.01/GB cross-AZ + $0.045/GB NAT = $0.055/GB, worst case). Reducing the cross-AZ multiplier alone recovers most of the $278.
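The directly computable part of this estimate can be sketched as follows. It uses the declared invocation pattern (four S3-reading functions, ~4K runs/mo each, ~100MB per run) and decimal units; the $278 remainder is the audit's cross-AZ estimate and is only derived here by subtraction. Names are illustrative.

```python
NAT_PER_GB = 0.045       # USD/GB through NAT, as quoted
CROSS_AZ_PER_GB = 0.01   # USD/GB cross-AZ transfer, as quoted

S3_FUNCTIONS = 4               # four of the five Lambdas touch S3
INVOCATIONS_PER_MONTH = 4_000  # per function
MB_PER_INVOCATION = 100        # S3 reads + writes per run

# Total Lambda<->S3 volume, converted from MB to GB with decimal units.
s3_gb = S3_FUNCTIONS * INVOCATIONS_PER_MONTH * MB_PER_INVOCATION / 1000
direct_nat_saving = s3_gb * NAT_PER_GB

# Worst-case per-GB rate when traffic crosses an AZ before reaching NAT.
worst_case_rate = NAT_PER_GB + CROSS_AZ_PER_GB

claimed_total = 350.0  # the audit's headline figure for this leak
remainder = claimed_total - direct_nat_saving

print(f"{s3_gb:,.0f} GB/mo -> ${direct_nat_saving:.2f} direct NAT saving")
print(f"worst-case rate ${worst_case_rate:.3f}/GB, remainder attributed to cross-AZ: ${remainder:.2f}")
```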
Alternative consideration: If 4 of 5 Lambdas don't actually need VPC access (only the RDS-accessing one does), the cheapest fix is to remove vpc_config from those 4 Lambdas entirely — they then use the public Lambda runtime (no ENI, no NAT). Saves an additional ~$120/mo from ENI hourly + cross-AZ traffic. The audit's full Appendix B ranks each of your 5 Lambdas by "does it actually need VPC access?" — see if any can move out of the VPC.
What we found: Your production EKS cluster (eks-prod, 12 m6i.xlarge nodes) runs in private subnets across all 3 AZs. Without Interface endpoints for ECR + Logs + STS + the EC2 API, every pod pull, every log ship, every IAM-role-assume-via-IRSA, and every EC2 API call flows through NAT (instance metadata at 169.254.169.254 is link-local and never touches NAT). Combined with Lambda-in-VPC traffic (Leak #5), your data subnet NAT Gateway is processing ~95TB/mo at $0.045/GB = $4,275/mo.
Where the $400/mo comes from: Leaks #2, #3, #4 (ECR api + ECR dkr + Logs endpoints) capture the bulk of EKS-specific NAT charges. This leak is the incremental savings from adding endpoints we haven't yet listed:
- com.amazonaws.{region}.sts: for IAM Roles for Service Accounts (IRSA) token exchange. EKS pods using IRSA hit STS on every pod startup. ~$50/mo
- com.amazonaws.{region}.ec2: for EC2 API calls from kubelet + node agent. ~$50/mo
- com.amazonaws.{region}.kms: for envelope-encrypted Secrets decryption on pod startup. ~$30/mo

```hcl
locals {
  eks_interface_endpoints = [
    # Omit these three if they are already declared as standalone resources (Leaks #2-#4);
    # two endpoints for the same service with private DNS enabled will conflict.
    "ecr.api", "ecr.dkr", "logs",
    "sts", "ec2", "kms",
    "secretsmanager", "ssm", "ssmmessages", # add if you use Secrets Manager / SSM Session Manager
  ]
}

resource "aws_vpc_endpoint" "eks_endpoints" {
  for_each            = toset(local.eks_interface_endpoints)
  vpc_id              = var.vpc_id
  service_name        = "com.amazonaws.${var.aws_region}.${each.value}"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.private_subnet_ids
  security_group_ids  = [aws_security_group.endpoints.id]
  private_dns_enabled = true
  tags                = { Name = "${each.value}-${var.env}", ManagedBy = "terraform" }
}

# Fix cross-AZ routing: per-AZ NAT route tables, not single-NAT-for-all-AZs
module "vpc" {
  # ... existing ...
  single_nat_gateway     = false # one NAT per AZ
  one_nat_gateway_per_az = true  # per-AZ route tables
}
```
Why this saves $400/mo: $50 (STS) + $50 (EC2) + $30 (KMS) + $270 (cross-AZ fix) = $400/mo recurring. The 9 Interface endpoints in the list cost 9 × $21.90 = $197.10/mo per VPC; of those, ecr.api, ecr.dkr, and logs are already costed under Leaks #2-#4. The $400/mo figure is the estimate after endpoint cost. The cross-AZ routing fix is the highest single line and has no recurring cost: it is just a route-table edit.
Diagnostic for the cross-AZ piece: Run aws ec2 describe-route-tables --filters 'Name=vpc-id,Values=<vpc-id>' and look at the 0.0.0.0/0 route in each private-subnet route table. If all 3 route tables point at the same NAT Gateway ENI, you're paying cross-AZ. If each AZ has its own NAT in its own route table, you're not — but you have 3× hourly cost (see Leak #7).
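The check described above can be automated against the JSON that aws ec2 describe-route-tables returns. The sketch below assumes the standard response shape (a RouteTables list whose Routes entries carry DestinationCidrBlock and, for NAT routes, NatGatewayId). The sample IDs are hypothetical placeholders, not values from this repository.

```python
import json

# Hypothetical sample shaped like `aws ec2 describe-route-tables --output json`.
SAMPLE_OUTPUT = """
{
  "RouteTables": [
    {"RouteTableId": "rtb-private-a", "Routes": [
      {"DestinationCidrBlock": "10.20.0.0/16", "GatewayId": "local"},
      {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-aaa"}]},
    {"RouteTableId": "rtb-private-b", "Routes": [
      {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-aaa"}]},
    {"RouteTableId": "rtb-private-c", "Routes": [
      {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-aaa"}]},
    {"RouteTableId": "rtb-public", "Routes": [
      {"DestinationCidrBlock": "0.0.0.0/0", "GatewayId": "igw-123"}]}
  ]
}
"""


def default_route_nat_gateways(describe_output: dict) -> dict:
    """Map each route table ID to the NAT Gateway behind its 0.0.0.0/0 route.

    Route tables whose default route is not a NAT Gateway (for example a
    public table pointing at an internet gateway) are left out.
    """
    result = {}
    for table in describe_output.get("RouteTables", []):
        for route in table.get("Routes", []):
            if route.get("DestinationCidrBlock") == "0.0.0.0/0" and "NatGatewayId" in route:
                result[table["RouteTableId"]] = route["NatGatewayId"]
    return result


def shares_single_nat(describe_output: dict) -> bool:
    """True when several private route tables all point at one NAT Gateway."""
    mapping = default_route_nat_gateways(describe_output)
    return len(mapping) > 1 and len(set(mapping.values())) == 1


tables = json.loads(SAMPLE_OUTPUT)
print(default_route_nat_gateways(tables))
print("paying cross-AZ:", shares_single_nat(tables))
```

To run it against a real VPC, redirect the CLI output to a file and load that file with json.load in place of the embedded sample.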
What we found: Three of your environments (prod, staging, data) declare single_nat_gateway = false, meaning Terraform provisions one NAT Gateway per AZ (3 per VPC × 3 VPCs = 9 NAT Gateways total). At $0.045/hr × 730 hrs/mo = $32.85/mo per NAT, that's $98.55/mo per VPC just on hourly = $295.65/mo total. The data-processing charges (Leaks #1-#6) are separate.
The trade-off: 1 NAT Gateway per AZ = HA against single-AZ failure (NAT GW is AZ-scoped; if its AZ fails, only that AZ's private-subnet traffic is impacted). 1 NAT Gateway total = single point of failure, but $66/mo cheaper. For prod, the HA case is real and we recommend keeping 3 NAT GWs. For staging and data, the HA case is weak (staging is non-customer-facing; data is batch workload that can tolerate an AZ outage). Switching staging + data to single_nat_gateway = true saves $66/mo per VPC = $132/mo recurring just on hourly. The estimate also counts the ~$270/mo of cross-AZ data charges from Leak #6's analysis for staging + data combined. Note that with a single NAT, traffic from the other two AZs crosses an AZ boundary to reach it, so that part of the figure depends on the endpoints from Leaks #1-#6 having already taken most traffic off NAT.
Current:

```hcl
module "vpc" {
  source = "../../modules/vpc"
  # ...
  enable_nat_gateway     = true
  single_nat_gateway     = false
  one_nat_gateway_per_az = true
}
```

Recommended (staging and data only):

```hcl
module "vpc" {
  source = "../../modules/vpc"
  # ...
  enable_nat_gateway     = true
  single_nat_gateway     = true
  one_nat_gateway_per_az = false
}
```
Why this saves $405/mo: Staging: $66/mo (eliminate 2 NAT GWs, $32.85 each). Data: $66/mo (same). Cross-AZ data elimination: ~$270/mo as estimated in Leak #6. Total: $402/mo, rounded to $405.
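The hourly component of this total follows from the NAT Gateway hourly rate quoted earlier. The sketch below reproduces it; the $270/mo cross-AZ figure is the audit's estimate from Leak #6 and is taken as an input. The result lands slightly under the report's $402 because the report rounds $65.70 up to $66 per VPC. Names are illustrative.

```python
NAT_HOURLY = 0.045     # USD per NAT Gateway hour, as quoted
HOURS_PER_MONTH = 730  # the report's monthly-hours convention

per_nat_month = NAT_HOURLY * HOURS_PER_MONTH  # one NAT Gateway
per_vpc_month = per_nat_month * 3             # one NAT per AZ, 3 AZs
fleet_month = per_vpc_month * 3               # 3 VPCs, 9 NAT Gateways

# Going from 3 NAT Gateways to 1 removes 2 per VPC, in staging and data.
removed_gateways = 2 * 2
hourly_saving = removed_gateways * per_nat_month

cross_az_estimate = 270.0                     # audit estimate from Leak #6
total_saving = hourly_saving + cross_az_estimate

print(f"per NAT ${per_nat_month:.2f}, per VPC ${per_vpc_month:.2f}, fleet ${fleet_month:.2f}")
print(f"hourly saving ${hourly_saving:.2f}, total with cross-AZ ${total_saving:.2f}")
```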
Rollback consideration: If staging or data go down for >1 hour due to an AZ outage and that causes a real customer impact (e.g., staging is used for customer-facing demo environments), revert to single_nat_gateway = false. The $66/mo is small insurance if the HA case is real. For pure-internal staging that nobody outside your engineering team uses, the savings are clean.
Why MEDIUM not HIGH: the savings are real and recurring, but the HA trade-off requires a judgment call we can't make for you. Document the decision in your VPC module README so future engineers don't reflexively flip it back.
Every VPC in the repo, ranked by monthly $ burned on NAT data + hourly. Identifies which VPC to focus on first.
| # | VPC | NAT GWs | Existing endpoints | Est NAT data GB/mo | $/mo NAT data | $/mo NAT hourly | $/mo total |
|---|---|---|---|---|---|---|---|
| 1 | vpc-prod | 3 (one per AZ) | S3 Gateway only | 52,000 | $2,340 | $98.55 | $2,438 |
| 2 | vpc-data | 3 (one per AZ) | none | 34,000 | $1,530 | $98.55 | $1,628 |
| 3 | vpc-staging | 3 (one per AZ) | none | 18,000 | $810 | $98.55 | $908 |
| 4 | vpc-shared-tools | 0 (uses Transit GW to prod) | none (inherits via TGW) | 2,200 | $99 | — | $99 |
Note: vpc-prod + vpc-data = about 80% of total NAT spend ($4,066 of $5,073 in the table above). Concentrating the endpoint additions there (Leaks #1, #2, #3) is the highest leverage. vpc-shared-tools routes through prod's NAT via Transit Gateway, which means it's double-charged (TGW data processing $0.02/GB ON TOP OF NAT $0.045/GB = effectively $0.065/GB). A fix for that pattern (move shared-tools to its own VPC with Gateway endpoints) is in the v2 roadmap.
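Recomputing the table's totals and shares from its own GB and hourly columns gives the breakdown below. This is a sketch using only values from the table and the $0.045/GB NAT rate; the dictionary layout and names are illustrative.

```python
NAT_PER_GB = 0.045  # USD/GB through NAT, as quoted
TGW_PER_GB = 0.02   # USD/GB Transit Gateway data processing, as quoted

# VPC name -> (estimated NAT data in GB/mo, NAT hourly USD/mo), from the table.
VPCS = {
    "vpc-prod": (52_000, 98.55),
    "vpc-data": (34_000, 98.55),
    "vpc-staging": (18_000, 98.55),
    "vpc-shared-tools": (2_200, 0.0),  # no NAT of its own, routes via TGW to prod
}

totals = {name: gb * NAT_PER_GB + hourly for name, (gb, hourly) in VPCS.items()}
grand_total = sum(totals.values())
prod_plus_data_share = (totals["vpc-prod"] + totals["vpc-data"]) / grand_total

for name, total in totals.items():
    print(f"{name:<18} ${total:>9,.2f}  {total / grand_total:6.1%}")
print(f"prod + data share: {prod_plus_data_share:.1%}")

# Effective per-GB rate for traffic that pays both TGW and NAT processing.
print(f"shared-tools effective rate: ${NAT_PER_GB + TGW_PER_GB:.3f}/GB")
```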
How to verify your savings after merging the recommended fixes (do this 7-14 days post-merge):
1. Open console.aws.amazon.com/cost-management/home#/cost-explorer
2. Date range: Last 30 Days
3. Granularity: Daily
4. Group by: Usage Type
5. Service filter: EC2-Other (NAT Gateway charges show up here, not under "Virtual Private Cloud")

Usage types to watch:

- USE1-NatGateway-Bytes: NAT data processing ($0.045/GB)
- USE1-NatGateway-Hours: NAT hourly ($0.045/hr per gateway)

What to expect:

- USE1-NatGateway-Bytes daily values should drop by ~50-65% on vpc-data and vpc-staging.
- USE1-NatGateway-Bytes should drop another ~10-15%.
- USE1-NatGateway-Hours should drop by 2/3 on those two VPCs (3 NAT GWs → 1).

If your bill DOESN'T drop: redeem the re-audit voucher (30 days post-delivery). We re-run the analysis on the post-fix state and quantify why the predicted savings didn't materialize. If the audit's predictions were wrong, full refund.
Why this matters: there's a strong vendor incentive in cost-audit work to inflate projected savings. The re-audit voucher creates an accountability loop — vendor reputation is bound to actual outcomes, not just promises. If you implement 0 of the recommendations, that's on you. If you implement all 7 and your bill goes up, we refund.
What the re-audit measures: we re-run the same 10 patterns on the same repo. If the original findings are now resolved, the report says so. We also estimate "new $/mo" by re-pricing against your post-fix Terraform/CDK/CloudFormation. If you can share an AWS Cost Explorer screenshot of USE1-NatGateway-Bytes + USE1-NatGateway-Hours for the 30 days pre- and post-merge, we'll calibrate against ground truth (this is the AWS Cost Explorer verification kit above, applied retrospectively).
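A pre/post comparison of the kind described above can be sketched in a few lines. The daily cost series below are hypothetical placeholders standing in for 30 days of USE1-NatGateway-Bytes values exported from Cost Explorer, and the 50-65% band is the expectation stated in the verification steps. Function names are illustrative.

```python
from statistics import mean


def percent_drop(pre_daily: list[float], post_daily: list[float]) -> float:
    """Percentage drop in average daily cost from the pre to the post window."""
    before = mean(pre_daily)
    after = mean(post_daily)
    return (before - after) * 100 / before


def within_expected(drop_pct: float, low: float = 50.0, high: float = 65.0) -> bool:
    """True when the observed drop falls inside the expected band."""
    return low <= drop_pct <= high


# Hypothetical data: flat $150/day before the merge, flat $60/day after.
pre_merge = [150.0] * 30
post_merge = [60.0] * 30

drop = percent_drop(pre_merge, post_merge)
print(f"observed drop: {drop:.1f}%  within expected band: {within_expected(drop)}")
```

A drop well below the band on vpc-data or vpc-staging would be the signal to look for cross-region S3 traffic or non-S3 external calls still routed through NAT.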
$149 one-time · Delivered within 2 hours · 30-day money-back guarantee
First-3-customers honest beta pricing: $99 (33% off). Email miloantaeus@gmail.com with subject "AWS NAT audit — first 3" for direct invoice.