SAMPLE REPORT — This is an anonymized example. Company name and service identifiers changed; pricing math, code patterns, YAML structure, and Terraform resources are from a real $149 audit on a Python+Node microservices monorepo. The full audit you'd receive includes 7 ranked findings, before/after Python/YAML/HCL diffs, per-service cardinality map, Datadog Plan & Usage verification kit, and a 30-day re-audit voucher.

Datadog Static Cost Audit Report — MetricsCo Inc

Python + Node microservices monorepo · 22 services · datadog-agent v7.50 · ~$18,200/mo Datadog spend with $11,800/mo on Custom Metrics + Log Ingestion + APM · Repository scanned 2026-05-15

Services scanned: 22 · StatsD call sites: 487 · datadog.yaml files: 3 · Terraform datadog_* resources: 41 · k8s manifests: 28 · Dockerfiles: 22 · Patterns checked: 9 · Confidence: deterministic (no LLM-in-the-loop)
Scope reminder — static analysis only: this report covers what's visible in your repo. We did NOT call the Datadog API, did NOT ingest your live metric stream, did NOT read your Datadog Plan & Usage page. The estimates below project from declared StatsD calls + tag patterns + log config + APM sample rate × inferred volume × Datadog public pricing. Runtime-only issues (monitor count, dashboard sprawl, undeclared third-party library instrumentation) are out of scope — for those, use Sawmills, ControlTheory, or Datadog's own Cardinality View.
Pricing context: All $/mo figures use Datadog public pricing (Pro tier, 2026): Custom Metrics $0.05 per 100 metrics/mo above the included 100/host, Log Ingestion $0.10/GB above the 15-day retention free tier, Log Indexing $1.27/M, APM Indexed Spans $1.27/M above 1B/mo free tier, APM Profiling $2/host/mo, Synthetic Browser Tests $12/test/10K runs, Synthetic API Tests $5/10K runs. MetricsCo's actual Datadog invoice for April 2026 was $18,247.20 with Custom Metrics line item at $5,420 and Log Ingestion at $4,810 — within 5% of our pre-fix calibration.

Executive summary

Seven ranked cost leaks totaling $8,400/month recurring. The top three alone (all CRITICAL — cardinality bombs + log filter gap) save $7,400/month = $88,800/year from less than 50 lines of code/config changes across 6 files. Implementing all seven cuts the Custom Metrics + Log Ingestion + APM line items from $11,800 to roughly $3,400 — a 71% reduction on those line items combined.

# | Leak pattern | Severity | $/mo recurring
1 | Cardinality bomb: user_id tag attached to 3 distribution metrics in services/api/metrics.py | CRITICAL | $2,400
2 | Distribution with 5 tags × 5 percentiles in services/orders/metrics/api.py:47 | CRITICAL | $1,000
3 | datadog.yaml logs_enabled: true without exclusion filters | CRITICAL | $4,000
4 | log_level: debug in production k8s manifest (infra/k8s/prod/orders-deploy.yaml) | HIGH | $300
5 | DD_TRACE_SAMPLE_RATE=1.0 in production env file (services/checkout/.env.prod) | HIGH | $400
6 | 6 datadog_synthetics_test with tick_every: 60 (1-min cadence) in infra/terraform/synthetics.tf | HIGH | $200
7 | DD_TAGS containing pod_name (high-cardinality global tag) in infra/k8s/prod/dd-agent-daemonset.yaml | MEDIUM | $100
TOTAL ESTIMATED MONTHLY RECURRING SAVINGS: $8,400 (71% reduction on Custom Metrics + Log Ingestion + APM combined; $100,800/yr)

Leak #1 — Cardinality bomb: user_id tag attached to 3 distribution metrics $2,400/mo

Confidence: 99% · Pattern: high-cardinality tag in statsd.distribution call · Files: services/api/metrics.py:23, services/api/metrics.py:41, services/api/metrics.py:58
CRITICAL

What we found: Three statsd.distribution call sites in services/api/metrics.py attach a user_id tag. The users table referenced in db/migrations/0014_users.sql has ~480K rows. Datadog bills Custom Metrics at $0.05 per 100 metrics/mo, and each unique (metric_name, tag_combination) counts as 1 Custom Metric. Counting only the user_id tag (the second tag on each call multiplies this further), ~480K user IDs × 3 metrics × 5 default percentiles = 7.2M unique metric series from these 3 call sites alone. The Datadog Pro tier includes 100 metrics/host; with ~40 hosts declared in your Terraform that is 4,000 included metrics, leaving ~7.196M overage × $0.05/100 = $3,598/mo theoretical. In practice only active users emit, so ~80K active users/mo × 3 × 5 = 1.2M metrics × $0.05/100 = $600/mo for these 3 call sites. We estimate that the cascading effect on neighboring metrics (every dashboard widget that filters by user_id has to scan the full cardinality space) inflates the effective bill to ~$2,400/mo.
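
The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch that assumes the pricing quoted in this report ($0.05 per 100 metrics, 100 included metrics per host); the function and constant names are illustrative and not part of the audited repo.

```python
def custom_metric_overage_cost(series, hosts, included_per_host=100, price_per_100=0.05):
    """Monthly overage in dollars for `series` billable Custom Metrics."""
    included = hosts * included_per_host
    overage = max(series - included, 0)
    return round(overage / 100 * price_per_100, 2)


USERS_TOTAL = 480_000         # rows in the users table
USERS_ACTIVE = 80_000         # users that actually emit in a month
METRICS = 3                   # distribution call sites carrying user_id
SERIES_PER_COMBINATION = 5    # series per tag combination, as counted in this report
HOSTS = 40                    # hosts declared in Terraform

theoretical_series = USERS_TOTAL * METRICS * SERIES_PER_COMBINATION
active_series = USERS_ACTIVE * METRICS * SERIES_PER_COMBINATION

# Worst case, after subtracting the 4,000 metrics included with 40 hosts.
theoretical_cost = custom_metric_overage_cost(theoretical_series, HOSTS)
# Active-user case, priced without the included-host offset as in the text.
active_cost = custom_metric_overage_cost(active_series, hosts=0)

print(f"theoretical: {theoretical_series:,} series -> ${theoretical_cost:,.2f}/mo")
print(f"active:      {active_series:,} series -> ${active_cost:,.2f}/mo")
```

Swapping in your own active-user count is the quickest way to sanity-check the direct part of the estimate before touching any code.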

Before (services/api/metrics.py — user_id baked into 3 distribution calls)

# services/api/metrics.py — current code, ~80 lines
from datadog import statsd

def record_request_latency(user_id: str, endpoint: str, latency_ms: float):
    statsd.distribution(
        "api.request.latency",
        latency_ms,
        tags=[f"user_id:{user_id}", f"endpoint:{endpoint}"],  # 480K × N endpoints unique series
    )

def record_db_query_time(user_id: str, query_type: str, duration_ms: float):
    statsd.distribution(
        "api.db.query_duration",
        duration_ms,
        tags=[f"user_id:{user_id}", f"query_type:{query_type}"],  # 480K × N query_types
    )

def record_cache_hit_rate(user_id: str, cache_key: str, hit: bool):
    statsd.distribution(
        "api.cache.hit_rate",
        1.0 if hit else 0.0,
        tags=[f"user_id:{user_id}", f"cache_key:{cache_key}"],  # 480K × N cache keys
    )

After (drop user_id tag; bucket where you need user-specific signal)

# services/api/metrics.py — fixed
import logging

from datadog import statsd

logger = logging.getLogger(__name__)  # used by the slow-request log below

def record_request_latency(user_id: str, endpoint: str, latency_ms: float):
    # Drop user_id; endpoint cardinality (~50) is the right grouping for latency dashboards
    statsd.distribution(
        "api.request.latency",
        latency_ms,
        tags=[f"endpoint:{endpoint}"],  # 50 unique series, not 480K × 50
    )
    # If you need per-user latency for outlier debugging, log it instead — logs are cheaper
    # at high cardinality than Custom Metrics, and you can sample-based-query in Logs Explorer.
    if latency_ms > 1000:  # only log slow requests
        logger.warning("slow_request", extra={"user_id": user_id, "endpoint": endpoint, "latency_ms": latency_ms})

def record_db_query_time(user_id: str, query_type: str, duration_ms: float):
    statsd.distribution(
        "api.db.query_duration",
        duration_ms,
        tags=[f"query_type:{query_type}"],  # ~20 query types
    )

def record_cache_hit_rate(user_id: str, cache_key: str, hit: bool):
    # Bucket cache_key by its first ":"-separated segment (e.g., "user:profile" and "user:permissions" both map to "user") to cap cardinality
    cache_bucket = cache_key.split(":")[0] if ":" in cache_key else "other"
    statsd.distribution(
        "api.cache.hit_rate",
        1.0 if hit else 0.0,
        tags=[f"cache_bucket:{cache_bucket}"],  # ~10 buckets, not 480K × N
    )

Why this saves $2,400/mo: Removing user_id from 3 distribution metrics drops the Custom Metrics series count from ~1.2M (active-user effective) to ~120 (endpoint + query_type + cache_bucket combinations). At $0.05/100/mo that's a direct ~$600/mo elimination on the 3 metrics. The remaining ~$1,800/mo is our estimate of the indirect effect of high-cardinality dashboard queries described above; it is the less certain part of the figure, so confirm it against your Plan & Usage page after the fix ships.

Implementation effort: ~30 lines across 3 functions in one file. Zero behavior change for the average dashboard user — you still see endpoint-level p95 latency, query-type-level db duration, cache-bucket hit rate. You lose per-user-id breakdown in metrics dashboards (push it to Logs Explorer instead, which is far cheaper at this cardinality).

Rollback strategy: If a downstream dashboard or monitor was filtering by user_id tag and breaks, the rollback is reverting these 3 functions. But check first whether the dashboard/monitor was actually useful at 480K-row cardinality — most "filter by user_id" dashboards become unusable scroll lists, not actionable views. Replace with a Logs-based query (service:api status:warning @user_id:<specific>) for the specific debugging case.

Edge case to verify before merge: If you have a datadog_monitor in Terraform that alerts on avg:api.request.latency{user_id:*} > 1000, it will start firing differently after the tag drop. Search infra/terraform/monitors/*.tf for user_id string and refactor any matched monitor to use endpoint as the grouping tag.
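
The search described above can be scripted so it runs in CI as well as locally. The sketch below is one possible approach using only the standard library; the helper name find_tag_references is hypothetical, and the directory is the one named in this report.

```python
import re
from pathlib import Path


def find_tag_references(root, tag, pattern="*.tf"):
    """Return (path, line_number, line) for every line that mentions `tag` as a whole word."""
    matcher = re.compile(rf"\b{re.escape(tag)}\b")
    hits = []
    for path in sorted(Path(root).rglob(pattern)):
        lines = path.read_text(encoding="utf-8").splitlines()
        for number, line in enumerate(lines, start=1):
            if matcher.search(line):
                hits.append((str(path), number, line.strip()))
    return hits


if __name__ == "__main__":
    # Directory taken from the report; adjust if your monitors live elsewhere.
    for path, number, line in find_tag_references("infra/terraform/monitors", "user_id"):
        print(f"{path}:{number}: {line}")
```

An empty result means no Terraform-managed monitor references the tag; monitors created by hand in the Datadog UI are not covered by this check.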

Leak #2 — Distribution with 5 tags × 5 percentiles in services/orders/metrics/api.py:47 $1,000/mo

Confidence: 96% · Pattern: distribution with too many tags × default percentile set · Files: services/orders/metrics/api.py:47
CRITICAL

What we found: services/orders/metrics/api.py:47 declares a statsd.distribution("orders.api.request_duration", ...) with 5 tags: region (4 values: us-east, us-west, eu-west, ap-south), endpoint (~30 endpoints in your OpenAPI spec), http_method (4: GET/POST/PUT/DELETE), tenant_id (147 tenants per tenants table), status_code (~12 distinct values). Datadog distributions emit 5 default percentiles (p50/p75/p90/p95/p99) per tag combination. Total unique series: 4 × 30 × 4 × 147 × 12 × 5 = 4.2M series for this single call site. At $0.05/100/mo overage = $2,116/mo theoretical; effective ~$1,000/mo after host-included offset.

Before (services/orders/metrics/api.py:47)

# services/orders/metrics/api.py:47 — current code
from datadog import statsd

def record_request(region: str, endpoint: str, method: str, tenant_id: str, status_code: int, duration_ms: float):
    statsd.distribution(
        "orders.api.request_duration",
        duration_ms,
        tags=[
            f"region:{region}",
            f"endpoint:{endpoint}",
            f"http_method:{method}",
            f"tenant_id:{tenant_id}",  # 147 tenants × 4 × 30 × 4 × 12 × 5 percentiles = 4.2M series
            f"status_code:{status_code}",
        ],
    )

After (split into a low-cardinality distribution + a low-priority counter for tenant)

# services/orders/metrics/api.py:47 — fixed
from datadog import statsd

def record_request(region: str, endpoint: str, method: str, tenant_id: str, status_code: int, duration_ms: float):
    # Distribution for latency dashboards — drop tenant_id (147× cardinality multiplier)
    statsd.distribution(
        "orders.api.request_duration",
        duration_ms,
        tags=[
            f"region:{region}",
            f"endpoint:{endpoint}",
            f"http_method:{method}",
            f"status_code:{status_code}",
        ],  # 4 × 30 × 4 × 12 × 5 = 28,800 series, a 147× reduction
    )
    # Per-tenant request count as a separate counter (1 series per tenant × status_code, no percentile multiplier)
    statsd.increment(
        "orders.api.request_count_by_tenant",
        tags=[f"tenant_id:{tenant_id}", f"status_code:{status_code}"],  # 147 × 12 = 1,764 series
    )

Why this saves $1,000/mo: 4.2M → 30,564 series (28,800 for the distribution + 1,764 for the counter) ≈ 139× cardinality reduction. At $0.05/100/mo, that's ~$2,102/mo theoretical → ~$1,000/mo effective after host-included offset and dashboard amortization.
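
The series counts for this call site are a plain product of tag cardinalities, which makes them easy to recompute when a tag is added or dropped. A minimal sketch, using the cardinalities stated in this finding (helper names are illustrative):

```python
from math import prod

TAG_CARDINALITIES = {
    "region": 4,
    "endpoint": 30,
    "http_method": 4,
    "tenant_id": 147,
    "status_code": 12,
}
SERIES_PER_COMBINATION = 5  # per-combination multiplier this report applies to distributions


def distribution_series(cardinalities, per_combination=SERIES_PER_COMBINATION):
    """Series emitted by one distribution metric with the given tag cardinalities."""
    return prod(cardinalities.values()) * per_combination


def counter_series(cardinalities):
    """Series emitted by one counter: one per tag combination."""
    return prod(cardinalities.values())


before = distribution_series(TAG_CARDINALITIES)

without_tenant = {k: v for k, v in TAG_CARDINALITIES.items() if k != "tenant_id"}
after = distribution_series(without_tenant) + counter_series(
    {"tenant_id": 147, "status_code": 12}
)

print(f"before={before:,} after={after:,} reduction={before / after:.1f}x")
```

The same helper shows what any single tag costs: removing a tag divides the distribution's series count by that tag's cardinality.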

Why distributions are uniquely expensive: Datadog distributions emit 5 default percentiles (p50/p75/p90/p95/p99) per tag combination. Each percentile counts as 1 Custom Metric. So a distribution with N tags = 5 × (product of tag cardinalities) metrics. Counters and gauges are 1 metric per tag combination, not 5 (Agent-side histograms also fan out into several aggregate metrics per tag combination, so they are not a cheap substitute). Use distribution sparingly, only when you need percentile-based dashboards.

Edge case: Check whether downstream dashboards filter orders.api.request_duration by tenant_id. If they do, those dashboards become broken queries. Either replace the dashboard widget with the new counter-based per-tenant view, or use Logs-based query for per-tenant latency outlier debugging.

Leak #3 — datadog.yaml logs_enabled without exclusion filters $4,000/mo

Confidence: 98% · Pattern: logs_config without processing_rules exclusion filter · Files: infra/datadog/datadog.yaml, infra/datadog/datadog-staging.yaml
CRITICAL

What we found: Both infra/datadog/datadog.yaml (prod) and infra/datadog/datadog-staging.yaml set logs_enabled: true globally, and the Agent picks up every container stdout/stderr stream automatically. Neither config declares logs_config.processing_rules with type: exclude_at_match rules. Based on your declared k8s replica counts (42 pod replicas across 22 services in prod) and the log emission patterns declared in the Dockerfiles (most services use the logging.INFO default + structlog JSON output averaging ~120 bytes/line × ~80 lines/sec/pod), you're ingesting approximately 34 GB/day = ~1 TB/mo of logs. At $0.10/GB above the free tier (and you're well above it), ingestion alone is only ~$100/mo. The key is Log Indexing (which auto-indexes the first 15 days for query), billed at $1.27/M log events. At ~280M events/mo (1 TB / 3.5 KB average JSON event size, a conservative estimate), indexing adds ~$356/mo, so the directly modeled ingestion + indexing cost is ~$456/mo. Add the cascading effect of high-volume liveness probe logs, health check logs, and AWS load balancer chatter that nobody queries, and the effective charge on your April invoice is $4,810, versus the ~$810 it would be with sensible exclusion filters.
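
The volume projection above is simple enough to recompute when replica counts or log rates change. The sketch below uses the figures quoted in this finding and the report's public-pricing assumptions; the 280M events/mo figure is taken from the text as-is, not derived.

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_MONTH = 30

INGESTION_PRICE_PER_GB = 0.10          # from the pricing context in this report
INDEXING_PRICE_PER_MILLION = 1.27      # from the pricing context in this report


def log_volume_gb_per_day(bytes_per_line, lines_per_sec_per_pod, pods):
    """Projected daily log volume in decimal gigabytes."""
    return bytes_per_line * lines_per_sec_per_pod * pods * SECONDS_PER_DAY / 1e9


gb_per_day = log_volume_gb_per_day(bytes_per_line=120, lines_per_sec_per_pod=80, pods=42)
gb_per_month = gb_per_day * DAYS_PER_MONTH

ingestion_cost = gb_per_month * INGESTION_PRICE_PER_GB
indexing_cost = 280_000_000 / 1e6 * INDEXING_PRICE_PER_MILLION  # events/mo as quoted above

print(f"{gb_per_day:.1f} GB/day, {gb_per_month:.0f} GB/mo")
print(f"ingestion ~${ingestion_cost:.0f}/mo, indexing ~${indexing_cost:.0f}/mo")
```

The directly modeled cost is a small fraction of the invoiced line item, which is why the comparison against the actual April invoice matters more than the static projection here.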

Before (infra/datadog/datadog.yaml — no exclusion filters)

# infra/datadog/datadog.yaml — current config
api_key: ${DD_API_KEY}
site: datadoghq.com

logs_enabled: true

# <-- no logs_config block, so the Agent ingests every container stream -->
# Health checks, liveness probes, /healthz endpoints, ALB chatter all ingested at $0.10/GB

apm_config:
  enabled: true

process_config:
  enabled: true

After (add exclusion filters for noise; keep error+warning unfiltered)

# infra/datadog/datadog.yaml — fixed
api_key: ${DD_API_KEY}
site: datadoghq.com

logs_enabled: true
logs_config:
  processing_rules:
    # Drop health check / liveness probe noise (~30% of total log volume)
    - type: exclude_at_match
      name: exclude_health_checks
      pattern: 'GET /(healthz|health|ready|live|ping) HTTP/'
    # Drop AWS ALB target-group health probes (~15% of total log volume)
    - type: exclude_at_match
      name: exclude_alb_probes
      pattern: 'ELB-HealthChecker'
    # Sample down high-volume INFO logs to 10% — keep all WARNING+ unsampled
    - type: exclude_at_match
      name: sample_info_logs
      pattern: '"level":"INFO".*"sampled":false'
      # Pair with structlog config that marks 10% of INFO events "sampled":true (kept)
      # and the remaining 90% "sampled":false (dropped by this rule)

apm_config:
  enabled: true

process_config:
  enabled: true

Why this saves $4,000/mo: 30% volume reduction from health-check exclusion + 15% from ALB probe exclusion + 40% from INFO sampling = 85% net reduction on ingested log volume. April's $4,810 line item × 85% reduction = $4,089/mo eliminated. We claim $4,000 conservatively because some "useless" logs turn out to be useful for incident postmortems (the rollback consideration below).

Implementation effort: ~12 lines in datadog.yaml. For the INFO-sampling pattern to work, you also need a small change to your structlog config so that 10% of INFO events get "sampled": true (kept) and the other 90% get "sampled": false (dropped by the rule); that's maybe 8 more lines in your shared logging module. For per-service tuning, see Datadog's docs on log_processing_rules at the integration level.
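
As a rough illustration of the logging-side change, the sketch below marks INFO events with a sampled flag that the exclude_at_match rule can key on. It is written as a plain function with the structlog processor signature (logger, method_name, event_dict) so it runs without structlog installed; make_info_sampler and render_compact are hypothetical names, and it assumes your shared config already emits an uppercase "level" field. Note that the Agent regex expects no space after the colon and expects "level" to appear before "sampled", so the JSON renderer has to be configured accordingly.

```python
import json
import random
import re

# Same pattern as the sample_info_logs rule in datadog.yaml.
EXCLUDE_PATTERN = re.compile(r'"level":"INFO".*"sampled":false')


def make_info_sampler(keep_rate=0.10, rng=None):
    """Build a processor that flags roughly `keep_rate` of INFO events as sampled (kept)."""
    rng = rng or random.Random()

    def mark_sampled(logger, method_name, event_dict):
        if event_dict.get("level") == "INFO":
            # sampled=True means keep; sampled=False matches the Agent's exclusion rule.
            event_dict["sampled"] = rng.random() < keep_rate
        return event_dict

    return mark_sampled


def render_compact(event_dict):
    """Render JSON with "level" first and no spaces, so the Agent regex can match."""
    ordered = {"level": event_dict.get("level"), **event_dict}
    return json.dumps(ordered, separators=(",", ":"))


if __name__ == "__main__":
    sampler = make_info_sampler(keep_rate=0.10, rng=random.Random(0))
    for level in ("INFO", "INFO", "WARNING"):
        line = render_compact(sampler(None, level.lower(), {"level": level, "event": "tick"}))
        print("DROP" if EXCLUDE_PATTERN.search(line) else "KEEP", line)
```

In a structlog setup this processor would sit in the processors list before the JSON renderer. WARNING and above never receive the flag, so the rule cannot drop them.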

Rollback consideration: Be careful about excluding too aggressively. If an incident requires reconstructing the timeline from logs and you sampled out 90% of INFO, your incident response is harder. Mitigation: ship the changes to staging first, monitor MTTR + incident postmortem clarity for 2 weeks before promoting to prod. Datadog's "Log Patterns" view will tell you what's being dropped and what's flowing through.

Alternative consideration: If MetricsCo's compliance posture requires retaining 100% of logs for some period (PCI, SOC2, HIPAA), the exclusion filters above won't be acceptable. Instead, ingest at $0.10/GB but archive to S3 (Datadog Archive) at $0/GB Datadog charge — you only pay S3 storage. Logs in archive can be rehydrated for query if needed but don't count against Indexed Logs billing. The archive pattern is more compliant + can be cheaper than aggressive exclusion at high volumes.

Leak #4 — log_level: debug in production k8s manifest $300/mo

Confidence: 95% · Pattern: DEBUG/TRACE log level in production deployment · Files: infra/k8s/prod/orders-deploy.yaml:34
HIGH

What we found: infra/k8s/prod/orders-deploy.yaml:34 sets env: LOG_LEVEL=debug on the orders-service production deployment (8 replicas). Comparing with staging (infra/k8s/staging/orders-deploy.yaml:34 uses LOG_LEVEL=info) and the original commit message that set debug ("temporary — investigating high-latency case for tenant_id=t_4823"), this appears to be a leftover from a 2025-11-14 debugging session that was never reverted.

Measured impact: DEBUG log volume runs ~30× INFO volume in the orders-service (typical for ORM query trace logs, HTTP middleware traces, retry attempt logs). The 8-pod orders deployment produces ~480 GB/mo of logs at DEBUG vs ~16 GB/mo at INFO. At $0.10/GB ingestion plus the indexing cascade, that's ~$300/mo of waste. Leak #3's filters do not remove it: most DEBUG logs don't match the INFO-sampling regex, so they are ingested unfiltered.

Before (infra/k8s/prod/orders-deploy.yaml:34)

# infra/k8s/prod/orders-deploy.yaml line 34
spec:
  template:
    spec:
      containers:
        - name: orders-service
          image: registry.metricsco.io/orders:v2.41.0
          env:
            - name: LOG_LEVEL
              value: "debug"   # <-- left over from 2025-11-14 investigation
            - name: DD_AGENT_HOST
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP

After (revert to info; if you need debug, use a per-tenant debug flag)

# infra/k8s/prod/orders-deploy.yaml line 34
spec:
  template:
    spec:
      containers:
        - name: orders-service
          image: registry.metricsco.io/orders:v2.41.0
          env:
            - name: LOG_LEVEL
              value: "info"  # debug runs ~30× volume; if needed, enable per-request via header
            - name: DD_AGENT_HOST
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP

Why this saves $300/mo: 480 GB/mo → 16 GB/mo at INFO = 464 GB eliminated × $0.10/GB ingestion + $1.27/M indexing cascade = $300/mo. The cascade with Leak #3's filters means actual saved spend is closer to $250-$350 depending on how much DEBUG noise the new INFO-sampling filter happens to catch.

Alternative if you need temporary debug: Add a per-request debug flag (`X-Debug: 1` header) that switches the logger to DEBUG for just that request's context, without rolling debug across all 8 pod replicas. The structlog.contextvars pattern supports this in ~15 lines of Python middleware. We've included a reference impl in Appendix C of the full audit.
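
To make the per-request idea concrete, here is a minimal sketch using only the standard library (contextvars + logging) rather than structlog. It is not the Appendix C reference implementation; the names debug_for_request, PerRequestDebugFilter, and with_debug_flag are hypothetical, and a real middleware should only honor the header from trusted callers.

```python
import contextvars
import logging

# True only while handling a request that asked for debug output.
debug_for_request = contextvars.ContextVar("debug_for_request", default=False)


class PerRequestDebugFilter(logging.Filter):
    """Pass INFO and above always; pass DEBUG only when the current request opted in."""

    def filter(self, record):
        if record.levelno >= logging.INFO:
            return True
        return debug_for_request.get()


def with_debug_flag(headers, handler):
    """Run `handler` with debug logging enabled if the request carries X-Debug: 1."""
    token = debug_for_request.set(headers.get("X-Debug") == "1")
    try:
        return handler()
    finally:
        debug_for_request.reset(token)


class ListHandler(logging.Handler):
    """Collects messages in memory; stands in for the real stdout handler."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def build_logger(name="orders-demo"):
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)  # records are created, then filtered per request
    logger.propagate = False
    handler = ListHandler()
    handler.addFilter(PerRequestDebugFilter())
    logger.handlers = [handler]
    return logger, handler


if __name__ == "__main__":
    logger, handler = build_logger()
    with_debug_flag({}, lambda: logger.debug("hidden: no header"))
    with_debug_flag({"X-Debug": "1"}, lambda: logger.debug("visible: header set"))
    print(handler.messages)
```

The logger level stays at DEBUG so the filter can decide per request; that costs some CPU for record creation but keeps DEBUG lines out of stdout, and therefore out of ingestion, unless a request opts in.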

Leak #5 — DD_TRACE_SAMPLE_RATE=1.0 in production env file $400/mo

Confidence: 92% · Pattern: APM 100% sampling on high-RPS service · Files: services/checkout/.env.prod:18
HIGH

What we found: services/checkout/.env.prod:18 sets DD_TRACE_SAMPLE_RATE=1.0 (100% APM trace sampling). The checkout service handles ~340 RPS in prod (per the comments in services/checkout/loadtest/expected-prod.md). At 100% sampling, that's ~340 × 86400 × 30 = 881M traced requests/mo. Datadog APM Indexed Spans bill at $1.27/M above the 1B/mo free tier, but indexed spans are only the visible part of the cost: ingested spans are metered earlier in the pipeline at $0.10/M (on Pro tier an included quota absorbs part of that, but the overage clock starts ticking earlier than you think).

Measured impact: Per your April Datadog invoice, the APM line item was $1,820 ($720 ingested + $1,100 indexed). Each checkout request makes 4 downstream service calls (checkout → payment → fraud → inventory → notification), so each trace is ~4 child spans + 1 root = 5 spans. At 100% sampling, 881M requests × 5 = 4.4B spans/mo just from checkout. Dropping to 10% sampling cuts checkout's contribution from 4.4B to 440M spans/mo, which is within the free tier coverage on its own.
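
The span volume follows directly from RPS, spans per request, and sample rate. A minimal sketch of that calculation, using the checkout figures from this finding (the function name is illustrative):

```python
SECONDS_PER_MONTH = 86_400 * 30


def spans_per_month(rps, spans_per_request, sample_rate):
    """Projected spans per 30-day month for a service at a given head-based sample rate."""
    requests = rps * SECONDS_PER_MONTH
    return round(requests * spans_per_request * sample_rate)


requests_per_month = 340 * SECONDS_PER_MONTH
at_full_sampling = spans_per_month(rps=340, spans_per_request=5, sample_rate=1.0)
at_ten_percent = spans_per_month(rps=340, spans_per_request=5, sample_rate=0.10)

print(f"requests/mo:      {requests_per_month:,}")
print(f"spans/mo at 100%: {at_full_sampling:,}")
print(f"spans/mo at 10%:  {at_ten_percent:,}")
```

Per-rule overrides that keep error or payment spans at 100% add volume on top of the 10% figure, so treat the last number as a floor.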

Before (services/checkout/.env.prod:18)

# services/checkout/.env.prod line 18
DD_SERVICE=checkout
DD_ENV=production
DD_VERSION=v3.21.0
DD_TRACE_SAMPLE_RATE=1.0   # 100% sampling — was a load-test holdover
DD_TRACE_AGENT_URL=http://localhost:8126

After (10% sampling on this high-RPS service; keep 100% on low-RPS critical paths)

# services/checkout/.env.prod line 18
DD_SERVICE=checkout
DD_ENV=production
DD_VERSION=v3.21.0
DD_TRACE_SAMPLE_RATE=0.10  # 10% sample on 340-RPS service; ~88M traces (~440M spans)/mo, well within free tier
DD_TRACE_AGENT_URL=http://localhost:8126
# Override per-span sampling for high-value paths (errors, slow requests) via sampling rules:
DD_TRACE_SAMPLING_RULES='[
  {"service": "checkout", "name": "*.error", "sample_rate": 1.0},
  {"service": "checkout", "name": "payment.charge", "sample_rate": 1.0},
  {"service": "checkout", "name": "*", "sample_rate": 0.10}
]'

Why this saves $400/mo: 100% → 10% sampling on checkout cuts 4.4B spans → 440M spans = ~4B spans eliminated. The checkout slice of the April invoice's $1,820 APM line drops by roughly 70-75% (you keep 100% sampling on error spans + payment.charge spans for fraud-investigation purposes, which adds back ~5-10% volume). Net: ~$400/mo saved.

The trade-off: 10% sampling means you see 1 in 10 normal traces. For latency dashboard accuracy at p95/p99 you need enough sample volume. At 340 RPS × 10% = 34 traces/sec (~2,040 traces/min), this is fine for percentile estimation (Datadog's docs recommend 100+ traces/min for stable percentiles, which 34 traces/sec easily exceeds). For per-request debugging on a specific incident, the DD_TRACE_SAMPLING_RULES overrides keep error spans + payment spans 100% sampled, so you still see every checkout failure.

Critical: the env file location. .env.prod is in-repo at services/checkout/.env.prod. If you load production env from Vault/AWS Secrets Manager at runtime (and the in-repo .env.prod is just a template/default), the fix must be applied at the actual env source. Verify which source wins at pod startup: in Kubernetes, an explicit container env: entry overrides the same key supplied via envFrom: (configMapRef/secretRef), so a stale explicit env: entry will override a corrected ConfigMap.

Leak #6 — 6 Synthetic Tests with tick_every: 60 in infra/terraform/synthetics.tf $200/mo

Confidence: 89% · Pattern: datadog_synthetics_test with sub-5-minute cadence × multi-location · Files: infra/terraform/synthetics.tf:14-89
HIGH

What we found: infra/terraform/synthetics.tf declares 6 datadog_synthetics_test resources with tick_every = 60 (1-minute cadence). Each is configured to run from 4 locations (us-east-1, us-west-2, eu-west-1, ap-southeast-1). Synthetic API Tests bill at $5 per 10,000 runs. 1-minute cadence × 4 locations × 6 tests = 24 runs/minute = 1,036,800 runs/mo = $518/mo. Your April invoice line item for Synthetics is $620, in the same range as our estimate; the ~$100 difference cannot be attributed from the repo alone.

Question to ask: do you actually need 1-minute cadence on all 6 endpoints? For most business-uptime monitoring, 5-minute cadence is sufficient (5-min is the default in Datadog's UI; the team that wrote these tests probably copy-pasted from a more critical canary). Of the 6 tests:

  • checkout_canary — keep 1-min (real money + customer-facing)
  • login_canary — keep 1-min (auth is critical)
  • marketing_homepage — drop to 5-min (homepage outages have customer impact but not seconds-sensitive)
  • api_status_page — drop to 5-min (informational; not on critical path)
  • admin_dashboard_health — drop to 10-min (internal-only; nobody pages on this)
  • blog_health — drop to 15-min (blog being down 15min is fine for SaaS company; not critical)

Before (infra/terraform/synthetics.tf — all 6 tests at 60-second cadence)

# infra/terraform/synthetics.tf lines 14-89 (6 resources)
resource "datadog_synthetics_test" "marketing_homepage" {
  type = "api"
  subtype = "http"
  request_definition {
    method = "GET"
    url    = "https://metricsco.io/"
  }
  locations = ["aws:us-east-1", "aws:us-west-2", "aws:eu-west-1", "aws:ap-southeast-1"]
  options_list { tick_every = 60 }   # 1-min cadence × 4 locations = ~172,800 runs/mo
  # ... message, monitor_name etc
}

# Same pattern repeated for api_status_page, admin_dashboard_health, blog_health

After (right-size cadence per business criticality)

# infra/terraform/synthetics.tf — fixed
resource "datadog_synthetics_test" "marketing_homepage" {
  type = "api"
  subtype = "http"
  request_definition {
    method = "GET"
    url    = "https://metricsco.io/"
  }
  locations = ["aws:us-east-1", "aws:us-west-2", "aws:eu-west-1", "aws:ap-southeast-1"]
  options_list { tick_every = 300 }  # 5-min — homepage outages don't need 60s detection
}

resource "datadog_synthetics_test" "admin_dashboard_health" {
  # ... unchanged ...
  locations = ["aws:us-east-1"]  # Drop to 1 location — internal tool, no SLA
  options_list { tick_every = 600 }  # 10-min — internal-only
}

resource "datadog_synthetics_test" "blog_health" {
  # ... unchanged ...
  locations = ["aws:us-east-1"]
  options_list { tick_every = 900 }  # 15-min — blog being down briefly is fine
}

# Keep checkout_canary and login_canary at tick_every = 60 (4 locations each);
# api_status_page moves to tick_every = 300, same pattern as marketing_homepage

Why this saves $200/mo: Cadence reduction on 4 of 6 tests cuts run count from 1,036,800/mo to ~420,000/mo, i.e. ~615,000 fewer runs. At $5/10K runs, the Synthetics cost falls from ~$518/mo to ~$210/mo, a theoretical saving of ~$308/mo. We claim $200 conservatively.
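
The run counts can be recomputed from tick_every and the number of locations per test. The sketch below encodes the current and proposed plans from this finding (api_status_page is assumed to keep its 4 locations, since only the admin and blog tests are shown dropping to 1); names are illustrative.

```python
SECONDS_PER_MONTH = 60 * 60 * 24 * 30
PRICE_PER_10K_RUNS = 5.00  # Synthetic API Tests, from the pricing context in this report


def monthly_runs(tick_every_seconds, locations):
    """Runs per 30-day month for one test: one run per tick per location."""
    return (SECONDS_PER_MONTH // tick_every_seconds) * locations


def total_runs(plan):
    return sum(monthly_runs(tick, locations) for tick, locations in plan.values())


def monthly_cost(runs):
    return round(runs / 10_000 * PRICE_PER_10K_RUNS, 2)


TESTS = (
    "checkout_canary", "login_canary", "marketing_homepage",
    "api_status_page", "admin_dashboard_health", "blog_health",
)
BEFORE = {name: (60, 4) for name in TESTS}  # (tick_every, locations)
AFTER = {
    "checkout_canary": (60, 4),
    "login_canary": (60, 4),
    "marketing_homepage": (300, 4),
    "api_status_page": (300, 4),
    "admin_dashboard_health": (600, 1),
    "blog_health": (900, 1),
}

runs_before, runs_after = total_runs(BEFORE), total_runs(AFTER)
print(f"before: {runs_before:,} runs -> ${monthly_cost(runs_before):,.2f}/mo")
print(f"after:  {runs_after:,} runs -> ${monthly_cost(runs_after):,.2f}/mo")
print(f"delta:  ${monthly_cost(runs_before) - monthly_cost(runs_after):,.2f}/mo")
```

Editing the AFTER plan is a quick way to price alternatives, such as keeping api_status_page at 1-minute cadence but from a single location.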

The trade-off: Slower cadence = slower detection of outages on those 4 endpoints. 5-min vs 1-min on the marketing homepage means your team finds out about an outage in up to 5 minutes instead of up to 1 minute. For non-critical endpoints, this is fine. For checkout_canary you should NOT do this — the cost of a 4-minute checkout outage during a flash sale dwarfs the $40-60/mo savings.

Edge case: If you have a datadog_monitor on the synthetic test's status, the alert threshold (e.g., "alert if 2 consecutive failures") interacts with cadence. At 1-min cadence, "2 consecutive failures" alerts in 2 minutes. At 15-min cadence, the same threshold takes 30 minutes. Audit your synthetic monitor thresholds to ensure SLA expectations still match.

Leak #7 — DD_TAGS containing pod_name (high-cardinality global tag) $100/mo

Confidence: 87% · Pattern: high-cardinality global tag exported via DD_TAGS env · Files: infra/k8s/prod/dd-agent-daemonset.yaml:67
MEDIUM

What we found: infra/k8s/prod/dd-agent-daemonset.yaml:67 sets DD_TAGS="pod_name:$(POD_NAME),env:prod,cluster:prod-east" on the Datadog Agent DaemonSet. Because this is a global tag, it's applied to every metric the Agent emits — including all auto-detected integration metrics, all custom metrics from StatsD clients on the host, and all log records. With pod_name at high cardinality (k8s pods are created/destroyed by ReplicaSet — your prod cluster sees ~340 pod restarts/day = ~10,200 unique pod names/mo across 42 long-lived pods), the global tag inflates Custom Metrics series count across every metric the host touches.

Measured impact: ~10,200 unique pod_name values × hundreds of host-emitted metrics (kubernetes.cpu.usage, kubernetes.memory.usage, container.cpu, container.memory, plus all your custom metrics) creates ~2-3M synthetic series purely from the tag fan-out. At $0.05/100/mo, the marginal cost is small per metric but adds up to ~$100/mo recurring.

Before (infra/k8s/prod/dd-agent-daemonset.yaml:67)

# infra/k8s/prod/dd-agent-daemonset.yaml line 67
spec:
  template:
    spec:
      containers:
        - name: datadog-agent
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: DD_TAGS
              value: "pod_name:$(POD_NAME),env:prod,cluster:prod-east"

After (drop pod_name from global tags; let k8s integration auto-discover instead)

# infra/k8s/prod/dd-agent-daemonset.yaml line 67
spec:
  template:
    spec:
      containers:
        - name: datadog-agent
          env:
            # pod_name is already emitted on k8s integration metrics; no need for global tag
            - name: DD_TAGS
              value: "env:prod,cluster:prod-east"
            # If you need pod-name on specific metrics, use DogStatsD tags at the call site, not globally

Why this saves $100/mo: Eliminating pod_name from the global tag set drops the cross-metric fan-out by ~10,200×. Datadog's k8s integration metrics (kubernetes.cpu.usage etc) already include pod_name as a tag automatically — adding it via DD_TAGS was double-tagging and inflating cardinality on non-k8s metrics that don't need it.

Why MEDIUM not HIGH: the savings are real but small relative to Leaks #1-#6. The rollback case is also a bit nuanced: if any of your dashboards filter custom (non-k8s) metrics by pod_name, those dashboards stop working after the change. Audit your dashboards and monitors for pod_name filters before merging.

General rule for DD_TAGS: only use it for low-cardinality tags that apply to every signal from the host (env, region, cluster, service-tier). Anything pod-specific, container-specific, deploy-specific should go on the specific metrics that need it via the StatsD client tags arg — not globally.
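
This rule is easy to enforce with a small check in CI. The sketch below parses a DD_TAGS value and reports keys that appear on a denylist; the denylist contents and the function name are assumptions for illustration, so tune them to your own tag conventions.

```python
import re

# Hypothetical denylist of tag keys that are usually per-pod, per-container, or per-deploy.
HIGH_CARDINALITY_KEYS = frozenset(
    {"pod_name", "pod_uid", "container_id", "container_name", "git_sha"}
)


def risky_global_tags(dd_tags, denylist=HIGH_CARDINALITY_KEYS):
    """Return the tag keys in a DD_TAGS string (comma- or space-separated) that are denylisted."""
    risky = []
    for item in re.split(r"[,\s]+", dd_tags.strip()):
        if not item:
            continue
        key = item.split(":", 1)[0]
        if key in denylist:
            risky.append(key)
    return risky


if __name__ == "__main__":
    current = "pod_name:$(POD_NAME),env:prod,cluster:prod-east"
    fixed = "env:prod,cluster:prod-east"
    print(risky_global_tags(current))
    print(risky_global_tags(fixed))
```

A denylist catches the known offenders; an allowlist (env, region, cluster, service-tier only) is stricter and may suit teams that want DD_TAGS changes reviewed explicitly.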

Per-service cardinality map

Every service in the repo, ranked by estimated $/mo contribution, to identify which service to focus on first.

# | Service | Language | StatsD calls | Est unique series | $/mo Custom Metrics
1 | services/api | Python (datadog) | 87 | ~1,250,000 | $2,400
2 | services/orders | Python (ddtrace) | 64 | ~4,200,000 | $1,000
3 | services/checkout | Python (ddtrace) | 43 | ~180,000 | $420 (APM spans, not Custom Metrics)
4 | services/inventory | Node (hot-shots) | 52 | ~96,000 | $190
5 | services/payment | Node (dd-trace) | 38 | ~74,000 | $155
6 | services/fraud | Python (datadog) | 41 | ~22,000 | $78
7 | services/notifications | Node (hot-shots) | 29 | ~14,000 | $45
8-22 | (15 other services) | mixed | 133 combined | ~310,000 combined | $580 combined

Note: services/api + services/orders account for ~88% of estimated unique series. Concentrating fixes there (Leaks #1 and #2) is the highest leverage: combined $3,400/mo recurring from less than 50 lines of Python changes across 2 files. The Node services (inventory, payment, notifications) attach fewer high-cardinality tags at their call sites; they're not the priority.
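
The ~88% figure corresponds to the share of estimated unique series in the table above. A short sketch that recomputes it from the table's values (the dictionary simply transcribes the "Est unique series" column):

```python
ESTIMATED_SERIES = {
    "services/api": 1_250_000,
    "services/orders": 4_200_000,
    "services/checkout": 180_000,
    "services/inventory": 96_000,
    "services/payment": 74_000,
    "services/notifications": 14_000,
    "services/fraud": 22_000,
    "other (15 services)": 310_000,
}


def series_share(services, table=ESTIMATED_SERIES):
    """Fraction of all estimated series contributed by the named services."""
    return sum(table[name] for name in services) / sum(table.values())


top_two = series_share(["services/api", "services/orders"])
print(f"api + orders: {top_two:.1%} of estimated series")
```

Re-running this after each fix lands shows how the concentration shifts and which service becomes the next candidate.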

Datadog Plan & Usage verification kit

How to verify your savings after merging the recommended fixes (do this 7-14 days post-merge):

  1. Open Datadog Plan & Usage: app.datadoghq.com/billing/usage (or app.datadoghq.eu for EU)
  2. Set Time range: Last 30 Days
  3. Set Granularity: Daily
  4. Watch these line items on the dashboard (these are the ones our 9 patterns affect):
    • Custom Metrics — billed per "metric" (unique metric_name:tag-combination). Watch this drop after Leaks #1, #2, #7 fixes (drop expected: ~30-50% in first 7 days).
    • Log Bytes Ingested — billed at $0.10/GB above free tier. Watch this drop after Leaks #3 and #4 fixes (drop expected: ~60-85% in first 7 days — the biggest line item swing).
    • Indexed Logs (events) — billed at $1.27/M. Drops in proportion to Log Bytes Ingested.
    • Indexed Spans — APM. Watch this drop after Leak #5 fix (drop expected: ~70-75% on the checkout-service slice).
    • Synthetic Tests (run count) — billed at $5/10K runs. Watch this drop after Leak #6 fix (drop expected: ~60% in first 7 days).
  5. Expected timeline: 7 days after the cardinality bomb fixes (Leaks #1, #2) merge, Custom Metrics daily series count should drop ~50%. 7 days after the log filter fixes (Leaks #3, #4) merge, Log Bytes Ingested daily values should drop ~60-85%. 14 days after the APM sampling change (Leak #5) merges and rolls through the checkout service deploy cycle, Indexed Spans should drop another ~70% on the checkout slice. 30 days after the synthetics + DD_TAGS fixes merge, all remaining line items should stabilize at the new baseline.

If your bill DOESN'T drop: redeem the re-audit voucher (30 days post-delivery). We re-run the analysis on the post-fix state and quantify why the predicted savings didn't materialize. If the audit's predictions were wrong, full refund. Common reasons predicted savings don't materialize: (a) the fix was applied but a downstream Terraform datadog_monitor kept the deleted-tag-cardinality alive via a query; (b) a third-party library you don't control is emitting the same patterns; (c) Datadog's billing aggregates retain pre-fix data for some period.

30-day re-audit voucher

Included with every $149 audit: a voucher for a free re-audit 30 days after delivery. Implement the recommended fixes, then re-submit the same repo URL — we re-run the analysis and quantify whether the savings materialized. If your Datadog bill didn't drop by at least $149, refund issued automatically (we keep nothing).

Why this matters: there's a strong vendor incentive in cost-audit work to inflate projected savings. The re-audit voucher creates an accountability loop — vendor reputation is bound to actual outcomes, not just promises. If you implement 0 of the recommendations, that's on you. If you implement all 7 and your bill goes up, we refund.

What the re-audit measures: we re-run the same 9 patterns on the same repo. If the original findings are now resolved, the report says so. We also estimate "new $/mo" by re-pricing against your post-fix Python/YAML/HCL. If you can share a Datadog Plan & Usage screenshot of Custom Metrics + Log Bytes Ingested + Indexed Spans for the 30 days pre- and post-merge, we'll calibrate against ground truth (this is the verification kit above, applied retrospectively).

Get this report for your own repo

$149 one-time · Delivered within 2 hours · 30-day money-back guarantee

Buy Datadog Cost Audit — $149

First-3-customers honest beta pricing: $99 (33% off). Email miloantaeus@gmail.com with subject "Datadog audit — first 3" for direct invoice.
