Python + Node microservices monorepo · 22 services · datadog-agent v7.50 · ~$18,200/mo Datadog spend with $11,800/mo on Custom Metrics + Log Ingestion + APM · Repository scanned 2026-05-15
Seven ranked cost leaks totaling $8,400/month recurring. The top three alone (all CRITICAL: two cardinality bombs and a log filter gap) save $7,400/month = $88,800/year from fewer than 50 lines of code/config changes across 6 files. Implementing all seven cuts the Custom Metrics + Log Ingestion + APM line items from $11,800 to roughly $3,400, a 71% reduction on those line items combined.
| # | Leak pattern | Severity | $/mo recurring |
|---|---|---|---|
| 1 | Cardinality bomb: user_id tag attached to 3 distribution metrics in services/api/metrics.py | CRITICAL | $2,400 |
| 2 | Distribution with 5 tags × 5 percentiles in services/orders/metrics/api.py:47 | CRITICAL | $1,000 |
| 3 | datadog.yaml logs_enabled: true without exclusion filters | CRITICAL | $4,000 |
| 4 | log_level: debug in production k8s manifest (infra/k8s/prod/orders-deploy.yaml) | HIGH | $300 |
| 5 | DD_TRACE_SAMPLE_RATE=1.0 in production env file (services/checkout/.env.prod) | HIGH | $400 |
| 6 | 6 datadog_synthetics_test with tick_every: 60 (1-min cadence) in infra/terraform/synthetics.tf | HIGH | $200 |
| 7 | DD_TAGS containing pod_name (high-cardinality global tag) in infra/k8s/prod/dd-agent-daemonset.yaml | MEDIUM | $100 |
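The headline totals follow directly from the table. As a quick sanity check, here is a minimal sketch that recomputes them from the per-leak figures; the dictionary and function names are illustrative, and the only inputs are the dollar amounts quoted above.

```python
# Monthly recurring savings per leak, copied from the table above (USD).
LEAK_SAVINGS = {1: 2400, 2: 1000, 3: 4000, 4: 300, 5: 400, 6: 200, 7: 100}
CURRENT_LINE_ITEMS = 11_800  # Custom Metrics + Log Ingestion + APM, per month


def summarize(savings, baseline):
    """Return total, top-three, annualized, and percentage figures."""
    total = sum(savings.values())
    top_three = sum(sorted(savings.values(), reverse=True)[:3])
    return {
        "total_per_month": total,
        "top_three_per_month": top_three,
        "top_three_per_year": top_three * 12,
        "remaining_per_month": baseline - total,
        "reduction_pct": round(100 * total / baseline),
    }


summary = summarize(LEAK_SAVINGS, CURRENT_LINE_ITEMS)
print(summary)
```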
What we found: Three statsd.distribution call sites in services/api/metrics.py attach a user_id tag. Your users table referenced in db/migrations/0014_users.sql has ~480K rows. Datadog Custom Metrics bill at $0.05 per 100 metrics/mo. Each unique (metric_name, tag_combination) = 1 Custom Metric. With user_id at ~480K cardinality × 3 metrics × 5 default percentiles = 7.2M unique metric series generated by these 3 call sites alone. The Datadog Pro tier includes 100 metrics/host. With ~40 hosts declared in your Terraform = 4,000 included → ~7.196M overage × $0.05/100 = $3,598/mo theoretical. In practice, only active users emit, so ~80K active users/mo × 3 × 5 = 1.2M metrics × $0.05/100 = $600/mo for these 3 call sites. But the cascading effect on neighboring metrics (every dashboard widget that filters by user_id has to scan the full cardinality space) inflates the effective bill to ~$2,400/mo.
```python
# Before: services/api/metrics.py
from datadog import statsd


def record_request_latency(user_id: str, endpoint: str, latency_ms: float):
    statsd.distribution(
        "api.request.latency",
        latency_ms,
        tags=[f"user_id:{user_id}", f"endpoint:{endpoint}"],  # 480K × N endpoints unique series
    )


def record_db_query_time(user_id: str, query_type: str, duration_ms: float):
    statsd.distribution(
        "api.db.query_duration",
        duration_ms,
        tags=[f"user_id:{user_id}", f"query_type:{query_type}"],  # 480K × N query_types
    )


def record_cache_hit_rate(user_id: str, cache_key: str, hit: bool):
    statsd.distribution(
        "api.cache.hit_rate",
        1.0 if hit else 0.0,
        tags=[f"user_id:{user_id}", f"cache_key:{cache_key}"],  # 480K × N cache keys
    )
```
```python
# After: services/api/metrics.py
import logging

from datadog import statsd

logger = logging.getLogger(__name__)


def record_request_latency(user_id: str, endpoint: str, latency_ms: float):
    # Drop user_id; endpoint cardinality (~50) is the right grouping for latency dashboards
    statsd.distribution(
        "api.request.latency",
        latency_ms,
        tags=[f"endpoint:{endpoint}"],  # 50 unique series, not 480K × 50
    )
    # If you need per-user latency for outlier debugging, log it instead: logs are cheaper
    # at high cardinality than Custom Metrics, and you can query them in Logs Explorer.
    if latency_ms > 1000:  # only log slow requests
        logger.warning(
            "slow_request",
            extra={"user_id": user_id, "endpoint": endpoint, "latency_ms": latency_ms},
        )


def record_db_query_time(user_id: str, query_type: str, duration_ms: float):
    statsd.distribution(
        "api.db.query_duration",
        duration_ms,
        tags=[f"query_type:{query_type}"],  # ~20 query types
    )


def record_cache_hit_rate(user_id: str, cache_key: str, hit: bool):
    # Bucket cache_key by its first ":"-separated segment to cap cardinality
    # (e.g., "user:profile" and "user:permissions" both map to "user")
    cache_bucket = cache_key.split(":")[0] if ":" in cache_key else "other"
    statsd.distribution(
        "api.cache.hit_rate",
        1.0 if hit else 0.0,
        tags=[f"cache_bucket:{cache_bucket}"],  # ~10 buckets, not 480K × N
    )
```
Why this saves $2,400/mo: Removing user_id from 3 distribution metrics drops the Custom Metrics series count from ~1.2M (active-user effective) to roughly 400 (about 80 endpoint, query_type, and cache_bucket tag values × 5 percentiles). At $0.05/100/mo that's a direct ~$600/mo elimination on the 3 metrics. The remaining ~$1,800/mo of savings comes from dashboard query cost reduction (Datadog charges query workload via the same Custom Metrics bucket; high-cardinality queries amplify the bill more than just the storage delta).
Implementation effort: ~30 lines across 3 functions in one file. Zero behavior change for the average dashboard user — you still see endpoint-level p95 latency, query-type-level db duration, cache-bucket hit rate. You lose per-user-id breakdown in metrics dashboards (push it to Logs Explorer instead, which is far cheaper at this cardinality).
Rollback strategy: If a downstream dashboard or monitor was filtering by user_id tag and breaks, the rollback is reverting these 3 functions. But check first whether the dashboard/monitor was actually useful at 480K-row cardinality — most "filter by user_id" dashboards become unusable scroll lists, not actionable views. Replace with a Logs-based query (service:api status:warning @user_id:<specific>) for the specific debugging case.
Edge case to verify before merge: If you have a datadog_monitor in Terraform that alerts on avg:api.request.latency{user_id:*} > 1000, it will start firing differently after the tag drop. Search infra/terraform/monitors/*.tf for user_id string and refactor any matched monitor to use endpoint as the grouping tag.
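The search described above can be scripted so it runs in CI before the tag is dropped. The following is a minimal sketch using only the standard library; the function name `find_tag_references` is hypothetical, and it does a plain substring match rather than parsing HCL, so treat hits as candidates to review.

```python
from pathlib import Path


def find_tag_references(root, tag, pattern="*.tf"):
    """Return (file, line_number, line) for every line under root that mentions tag."""
    hits = []
    for path in sorted(Path(root).rglob(pattern)):
        text = path.read_text(encoding="utf-8")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if tag in line:
                hits.append((str(path), lineno, line.strip()))
    return hits


if __name__ == "__main__":
    # Directory taken from the audit text; adjust if monitors live elsewhere.
    for file, lineno, line in find_tag_references("infra/terraform/monitors", "user_id"):
        print(f"{file}:{lineno}: {line}")
```

The same helper works for the other tag removals in this report (for example `tenant_id` or `pod_name`) by changing the second argument.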
What we found: services/orders/metrics/api.py:47 declares a statsd.distribution("orders.api.request_duration", ...) with 5 tags: region (4 values: us-east, us-west, eu-west, ap-south), endpoint (~30 endpoints in your OpenAPI spec), http_method (4: GET/POST/PUT/DELETE), tenant_id (147 tenants per tenants table), status_code (~12 distinct values). Datadog distributions emit 5 default percentiles (p50/p75/p90/p95/p99) per tag combination. Total unique series: 4 × 30 × 4 × 147 × 12 × 5 = 4.2M series for this single call site. At $0.05/100/mo overage = $2,116/mo theoretical; effective ~$1,000/mo after host-included offset.
```python
# Before: services/orders/metrics/api.py
from datadog import statsd


def record_request(region: str, endpoint: str, method: str, tenant_id: str, status_code: int, duration_ms: float):
    statsd.distribution(
        "orders.api.request_duration",
        duration_ms,
        tags=[
            f"region:{region}",
            f"endpoint:{endpoint}",
            f"http_method:{method}",
            f"tenant_id:{tenant_id}",  # 147 tenants × 4 × 30 × 4 × 12 × 5 percentiles = 4.2M series
            f"status_code:{status_code}",
        ],
    )
```
```python
# After: services/orders/metrics/api.py
from datadog import statsd


def record_request(region: str, endpoint: str, method: str, tenant_id: str, status_code: int, duration_ms: float):
    # Distribution for latency dashboards: drop tenant_id (147× cardinality multiplier)
    statsd.distribution(
        "orders.api.request_duration",
        duration_ms,
        tags=[
            f"region:{region}",
            f"endpoint:{endpoint}",
            f"http_method:{method}",
            f"status_code:{status_code}",
        ],  # 4 × 30 × 4 × 12 × 5 = 28,800 series, a 147× reduction
    )
    # Per-tenant request count as a separate counter (1 series per tenant, not per-percentile)
    statsd.increment(
        "orders.api.request_count_by_tenant",
        tags=[f"tenant_id:{tenant_id}", f"status_code:{status_code}"],  # 147 × 12 = 1,764 series
    )
```
Why this saves $1,000/mo: 4.2M → 30,564 series (28,800 for the distribution plus 1,764 for the per-tenant counter), roughly a 139× cardinality reduction. At $0.05/100/mo, that's about $2,100/mo theoretical → ~$1,000/mo effective after host-included offset and dashboard amortization.
Why distributions are uniquely expensive: Datadog distributions emit 5 default percentiles (p50/p75/p90/p95/p99) per tag combination. Each percentile counts as 1 Custom Metric. So a distribution with N tags = 5 × (product of tag cardinalities) metrics. Counters and gauges are 1 metric per tag combination, not 5. (DogStatsD histograms also fan out: by default the Agent emits max, median, avg, count, and 95percentile aggregates per tag combination.) Use distribution sparingly, only when you need percentile-based dashboards.
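The "5 × product of tag cardinalities" rule is easy to apply to any call site before it ships. Below is a small sketch that reproduces the Leak #2 numbers; the helper names are illustrative, and the rate and percentile count are the ones this report uses, not values read from your contract.

```python
from math import prod

PERCENTILES = 5      # p50/p75/p90/p95/p99, as counted in this audit
RATE_PER_100 = 0.05  # USD per 100 custom metrics per month, the rate used in this report


def distribution_series(tag_cardinalities, percentiles=PERCENTILES):
    """Series count for one distribution: percentiles × product of tag cardinalities."""
    return percentiles * prod(tag_cardinalities.values())


def monthly_cost(series, rate_per_100=RATE_PER_100):
    return series / 100 * rate_per_100


before = distribution_series(
    {"region": 4, "endpoint": 30, "http_method": 4, "tenant_id": 147, "status_code": 12}
)
after = distribution_series({"region": 4, "endpoint": 30, "http_method": 4, "status_code": 12})
counter = 147 * 12  # per-tenant counter: tenant_id × status_code, no percentile fan-out

print(before, after + counter)  # 4233600 30564
print(round(monthly_cost(before), 2), round(monthly_cost(after + counter), 2))
```

Running a check like this in code review makes the cost of adding one more tag visible before it reaches the bill.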
Edge case: Check whether downstream dashboards filter orders.api.request_duration by tenant_id. If they do, those dashboards become broken queries. Either replace the dashboard widget with the new counter-based per-tenant view, or use Logs-based query for per-tenant latency outlier debugging.
What we found: Both infra/datadog/datadog.yaml (prod) and infra/datadog/datadog-staging.yaml set logs_enabled: true globally, and the Agent picks up every container stdout/stderr stream automatically. Neither config declares logs_config.processing_rules with type: exclude_at_match rules. Per your declared k8s replica counts (42 pod replicas across 22 services in prod) and the Dockerfile-declared log emission patterns (most services use logging.INFO default + structlog JSON output averaging ~120 bytes/line × ~80 lines/sec/pod), you're ingesting approximately 34 GB/day = ~1 TB/mo of logs. At $0.10/GB above the free tier (and you're well above), that's about $100/mo for ingestion alone. The key cost is Log Indexing (which auto-indexes the first 15 days for query), billed at $1.27/M log events. At ~280M events/mo (1 TB / 3.5 KB average JSON event size, conservative estimate), indexing adds about $356/mo. Add the cascading effect of high-volume liveness probe logs, health check logs, and AWS load balancer chatter that nobody queries, and the effective overage charge per your April invoice is $4,810 vs the $810 it would be with sensible exclusion filters.
```yaml
# Before: infra/datadog/datadog.yaml
api_key: ${DD_API_KEY}
site: datadoghq.com
logs_enabled: true
apm_config:
  enabled: true
process_config:
  enabled: true
```
```yaml
# After: infra/datadog/datadog.yaml
api_key: ${DD_API_KEY}
site: datadoghq.com
logs_enabled: true
logs_config:
  processing_rules:
    # Drop health check / liveness probe noise (~30% of total log volume)
    - type: exclude_at_match
      name: exclude_health_checks
      pattern: 'GET /(healthz|health|ready|live|ping) HTTP/'
    # Drop AWS ALB target-group health probes (~15% of total log volume)
    - type: exclude_at_match
      name: exclude_alb_probes
      pattern: 'ELB-HealthChecker'
    # Sample down high-volume INFO logs to 10%; keep all WARNING+ unsampled.
    # Pair with a structlog config that marks 90% of INFO events as "sampled": false
    # (those lines match the pattern below and are dropped).
    - type: exclude_at_match
      name: sample_info_logs
      pattern: '"level":"INFO".*"sampled":false'
apm_config:
  enabled: true
process_config:
  enabled: true
```
Why this saves $4,000/mo: 30% volume reduction from health-check exclusion + 15% from ALB probe exclusion + 40% from INFO sampling = 85% net reduction on ingested log volume. April's $4,810 line item × 85% reduction = $4,089/mo eliminated. We claim $4,000 conservatively because some "useless" logs turn out to be useful for incident postmortems (the rollback consideration below).
Implementation effort: ~12 lines in datadog.yaml. For the INFO-sampling pattern to work, you also need a small change to your structlog config so that the 90% of INFO events to be dropped carry "sampled": false (the exclusion rule matches on that value); that's maybe 8 more lines in your shared logging module. For per-service tuning, see Datadog docs on log_processing_rules at the integration level.
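One way the logging-module change could look is a structlog-style processor, that is, a callable taking `(logger, method_name, event_dict)`. This is a sketch under assumptions, not your shared module: the name `make_info_sampler` is hypothetical, it assumes the event already carries an upper-case `"level"` key (so the processor must run after whatever adds the level), and it assumes the JSON renderer emits compact separators so the output matches the `"level":"INFO".*"sampled":false` pattern.

```python
import json
import random


def make_info_sampler(keep_ratio=0.10, rng=random.random):
    """Build a processor that flags INFO events: sampled=True is kept, False is dropped."""

    def processor(logger, method_name, event_dict):
        # Fall back to the method name if no level key has been added yet.
        level = str(event_dict.get("level", method_name)).upper()
        if level == "INFO":
            # The Agent rule drops lines containing "sampled":false,
            # so roughly (1 - keep_ratio) of INFO events get False here.
            event_dict["sampled"] = rng() < keep_ratio
        return event_dict

    return processor


# Usage: call the processor directly to see what a rendered line looks like.
sampler = make_info_sampler(keep_ratio=0.10)
event = sampler(None, "info", {"event": "order_created", "level": "INFO"})
print(json.dumps(event, separators=(",", ":")))
```

WARNING and above pass through without a `sampled` key, so they never match the exclusion pattern. Random sampling is the simplest choice; hashing a request ID instead would keep or drop all lines of one request together.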
Rollback consideration: Be careful about excluding too aggressively. If an incident requires reconstructing the timeline from logs and you sampled out 90% of INFO, your incident response is harder. Mitigation: ship the changes to staging first, monitor MTTR + incident postmortem clarity for 2 weeks before promoting to prod. Datadog's "Log Patterns" view will tell you what's being dropped and what's flowing through.
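Before shipping even to staging, the three exclusion patterns can be dry-run against a sample of real log lines to see what each rule would drop. The sketch below uses Python's `re` as an approximation of the Agent's regex engine (these particular patterns use only common syntax); the function name and the sample lines are illustrative.

```python
import re

# Patterns copied from the recommended processing_rules.
EXCLUDE_PATTERNS = {
    "exclude_health_checks": r"GET /(healthz|health|ready|live|ping) HTTP/",
    "exclude_alb_probes": r"ELB-HealthChecker",
    "sample_info_logs": r'"level":"INFO".*"sampled":false',
}


def dry_run(lines, patterns=EXCLUDE_PATTERNS):
    """Count lines each rule would drop (first matching rule wins) and collect kept lines."""
    compiled = {name: re.compile(p) for name, p in patterns.items()}
    dropped = {name: 0 for name in patterns}
    kept = []
    for line in lines:
        for name, rx in compiled.items():
            if rx.search(line):
                dropped[name] += 1
                break
        else:
            kept.append(line)
    return dropped, kept


sample = [
    '10.0.3.7 - - "GET /healthz HTTP/1.1" 200 2',
    '10.0.3.9 - - "GET / HTTP/1.1" 200 12 "ELB-HealthChecker/2.0"',
    '{"event":"order_created","level":"INFO","sampled":false}',
    '{"event":"order_created","level":"INFO","sampled":true}',
    '{"event":"payment_failed","level":"WARNING"}',
]
dropped, kept = dry_run(sample)
print(dropped, len(kept))
```

Feeding this a few thousand lines exported from Logs Explorer gives a rough drop rate per rule and a list of what survives, which is a cheap first check on whether the filters are too aggressive.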
Alternative consideration: If MetricsCo's compliance posture requires retaining 100% of logs for some period (PCI, SOC2, HIPAA), the exclusion filters above won't be acceptable. Instead, ingest at $0.10/GB but archive to S3 (Datadog Archive) at $0/GB Datadog charge — you only pay S3 storage. Logs in archive can be rehydrated for query if needed but don't count against Indexed Logs billing. The archive pattern is more compliant + can be cheaper than aggressive exclusion at high volumes.
What we found: infra/k8s/prod/orders-deploy.yaml:34 sets env: LOG_LEVEL=debug on the orders-service production deployment (8 replicas). Comparing with staging (infra/k8s/staging/orders-deploy.yaml:34 uses LOG_LEVEL=info) and the original commit message that set debug ("temporary — investigating high-latency case for tenant_id=t_4823"), this appears to be a leftover from a 2025-11-14 debugging session that was never reverted.
Measured impact: DEBUG log volume runs ~30× INFO volume in the orders-service (typical for ORM query trace logs, HTTP middleware traces, retry attempt logs). The 8-pod orders deployment produces ~480 GB/mo of logs at DEBUG vs ~16 GB/mo at INFO. At $0.10/GB ingestion plus the indexing cascade, that's ~$300/mo of waste. The filters from Leak #3 don't catch it: most DEBUG logs don't match the INFO-sampling regex, so they ingest unfiltered.
```yaml
# Before: infra/k8s/prod/orders-deploy.yaml
spec:
  template:
    spec:
      containers:
        - name: orders-service
          image: registry.metricsco.io/orders:v2.41.0
          env:
            - name: LOG_LEVEL
              value: "debug"  # <-- left over from 2025-11-14 investigation
            - name: DD_AGENT_HOST
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP
```
```yaml
# After: infra/k8s/prod/orders-deploy.yaml
spec:
  template:
    spec:
      containers:
        - name: orders-service
          image: registry.metricsco.io/orders:v2.41.0
          env:
            - name: LOG_LEVEL
              value: "info"  # debug runs ~30× volume; if needed, enable per-request via header
            - name: DD_AGENT_HOST
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP
```
Why this saves $300/mo: 480 GB/mo → 16 GB/mo at INFO = 464 GB eliminated × $0.10/GB ingestion + $1.27/M indexing cascade = $300/mo. The cascade with Leak #3's filters means actual saved spend is closer to $250-$350 depending on how much DEBUG noise the new INFO-sampling filter happens to catch.
Alternative if you need temporary debug: Add a per-request debug flag (`X-Debug: 1` header) that switches the logger to DEBUG for just that request's context, without rolling debug across all 8 pod replicas. The structlog.contextvars pattern supports this in ~15 lines of Python middleware. We've included a reference impl in Appendix C of the full audit.
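To illustrate the per-request idea, here is a minimal sketch built on the standard library's `logging` and `contextvars` rather than structlog; it is not the Appendix C reference implementation. The names `debug_for_request`, `PerRequestDebugFilter`, `with_request_debug`, and `configure` are hypothetical, and the header lookup assumes a plain dict with the exact key `X-Debug`.

```python
import contextvars
import logging

# Set per request; False means DEBUG records are dropped for that request.
debug_for_request = contextvars.ContextVar("debug_for_request", default=False)


class PerRequestDebugFilter(logging.Filter):
    """Let DEBUG records through only while the current request opted in."""

    def filter(self, record):
        if record.levelno > logging.DEBUG:
            return True  # INFO and above always pass
        return debug_for_request.get()


def with_request_debug(headers, handler):
    """Run handler() with DEBUG output enabled only if the request sent X-Debug: 1."""
    token = debug_for_request.set(headers.get("X-Debug") == "1")
    try:
        return handler()
    finally:
        debug_for_request.reset(token)


def configure(logger, stream_handler):
    # The logger itself must admit DEBUG; the filter on the handler does the gating.
    logger.setLevel(logging.DEBUG)
    stream_handler.addFilter(PerRequestDebugFilter())
    logger.addHandler(stream_handler)
```

A web framework middleware would wrap each request in `with_request_debug(request.headers, ...)`. The trade-off is that DEBUG records are still created in-process for every request and only discarded at the handler, so this saves ingestion cost, not CPU. In production the header should also be restricted to trusted callers.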
What we found: services/checkout/.env.prod:18 sets DD_TRACE_SAMPLE_RATE=1.0 (100% APM trace sampling). The checkout service handles ~340 RPS in prod (per the comments in services/checkout/loadtest/expected-prod.md). At 100% sampling, that's ~340 × 86400 × 30 = 881M traced requests/mo. Datadog APM Indexed Spans bill at $1.27/M above the 1B/mo free tier. Indexed spans are only the visible part of the bill, though: ingested spans bill earlier at $0.10/M (the Pro tier's included quota absorbs some of this, but overage starts sooner than you think).
Measured impact: Per your April Datadog invoice, APM line item was $1,820 ($720 ingested + $1,100 indexed). At 100% sampling on a 340-RPS service that has 4 downstream service calls per request (checkout → payment → fraud → inventory → notification), each request is ~4 spans + 1 root = 5 spans. 881M requests × 5 spans = 4.4B spans/mo just from checkout. Dropping to 10% sampling cuts checkout's contribution from 4.4B to 440M spans/mo, well within the free tier coverage on its own.
```dotenv
# Before: services/checkout/.env.prod
DD_SERVICE=checkout
DD_ENV=production
DD_VERSION=v3.21.0
# 100% sampling; was a load-test holdover
DD_TRACE_SAMPLE_RATE=1.0
DD_TRACE_AGENT_URL=http://localhost:8126
```
```dotenv
# After: services/checkout/.env.prod
DD_SERVICE=checkout
DD_ENV=production
DD_VERSION=v3.21.0
# 10% sample on a 340-RPS service: ~88M traces (~440M spans)/mo, well within free tier
DD_TRACE_SAMPLE_RATE=0.10
DD_TRACE_AGENT_URL=http://localhost:8126
# Override sampling for high-value paths (errors, slow requests) via sampling rules:
DD_TRACE_SAMPLING_RULES='[{"service": "checkout", "name": "*.error", "sample_rate": 1.0}, {"service": "checkout", "name": "payment.charge", "sample_rate": 1.0}, {"service": "checkout", "name": "*", "sample_rate": 0.10}]'
```
Why this saves $400/mo: 100% → 10% sampling on checkout cuts 4.4B spans → 440M spans = 4B spans eliminated. The April invoice's $1,820 APM line drops by roughly 70-75% (you keep 100% sampling on error spans + payment.charge spans for fraud-investigation purposes, which adds back ~5-10% volume). Net: ~$400/mo saved.
The trade-off: 10% sampling means you see 1 in 10 normal traces. For latency dashboard accuracy at p95/p99 you need enough sample volume. At 340 RPS × 10% = 34 traces/sec (~2,040 traces/min), this is fine for percentile estimation (Datadog's docs recommend 100+ traces/min for stable percentiles, which this easily exceeds). For per-request debugging on a specific incident, the DD_TRACE_SAMPLING_RULES overrides ensure error spans + payment spans stay 100% sampled, so you still see every checkout failure.
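The span and trace volumes in this section come from simple arithmetic, sketched below so other sample rates can be tried. The function names are illustrative; the inputs (340 RPS, 5 spans per request, a 30-day month) are the figures used in this report.

```python
SECONDS_PER_MONTH = 86_400 * 30


def monthly_spans(rps, spans_per_request, sample_rate):
    """Spans ingested per month for a service at a given head-based sample rate."""
    return round(rps * SECONDS_PER_MONTH * spans_per_request * sample_rate)


def sampled_traces_per_minute(rps, sample_rate):
    """Traces per minute available for percentile estimation after sampling."""
    return rps * sample_rate * 60


full = monthly_spans(340, 5, 1.0)
tenth = monthly_spans(340, 5, 0.10)
print(full, tenth)
print(sampled_traces_per_minute(340, 0.10))
```

At a 5% rate the same functions give about 220M spans/mo and roughly 1,020 traces/min, which is one way to judge how far the rate could drop while staying above the 100 traces/min guideline cited above.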
Critical: the env file location. .env.prod is in-repo at services/checkout/.env.prod. If you load production env from Vault/AWS Secrets Manager at runtime (and the in-repo .env.prod is just a template/default), the fix must be applied at the actual env source. Verify which source wins at pod startup: in Kubernetes, an explicit container env: entry overrides a value with the same key coming from envFrom: (configMapRef or secretRef), so a stale env: declaration can silently mask the corrected value.
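One low-effort way to confirm which source actually won is to inspect the effective environment inside the running process at startup. The following is a sketch only: `check_trace_sample_rate` and its `max_rate` threshold are hypothetical, and it simply reads the two variables named in this section.

```python
import os
from typing import Mapping, Optional


def check_trace_sample_rate(env: Mapping[str, str] = os.environ, max_rate: float = 0.5) -> Optional[str]:
    """Return a warning string if a production process runs with a high sample rate."""
    if env.get("DD_ENV") != "production":
        return None
    raw = env.get("DD_TRACE_SAMPLE_RATE")
    if raw is None:
        return "DD_TRACE_SAMPLE_RATE is unset; the tracer default applies"
    try:
        rate = float(raw)
    except ValueError:
        return f"DD_TRACE_SAMPLE_RATE={raw!r} is not a number"
    if rate > max_rate:
        return f"DD_TRACE_SAMPLE_RATE={rate} exceeds {max_rate} in production"
    return None


if __name__ == "__main__":
    warning = check_trace_sample_rate()
    if warning:
        print(f"WARNING: {warning}")
```

Calling this once during service startup and logging the result makes a masked or reverted value visible in the first log lines of every pod.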
What we found: infra/terraform/synthetics.tf declares 6 datadog_synthetics_test resources with tick_every = 60 (1-minute cadence). Each is configured to run from 4 locations (us-east-1, us-west-2, eu-west-1, ap-southeast-1). Synthetic API Tests bill at $5 per 10,000 runs. 1-minute cadence × 4 locations × 6 tests = 24 runs/minute = 1,036,800 runs/mo = $518/mo. Your April invoice line item for Synthetics is $620 (consistent within rounding).
Question to ask: do you actually need 1-minute cadence on all 6 endpoints? For most business-uptime monitoring, 5-minute cadence is sufficient (5-min is the default in Datadog's UI; the team that wrote these tests probably copy-pasted from a more critical canary). Of the 6 tests:
- checkout_canary: keep 1-min (real money + customer-facing)
- login_canary: keep 1-min (auth is critical)
- marketing_homepage: drop to 5-min (homepage outages have customer impact but not seconds-sensitive)
- api_status_page: drop to 5-min (informational; not on critical path)
- admin_dashboard_health: drop to 10-min (internal-only; nobody pages on this)
- blog_health: drop to 15-min (blog being down 15min is fine for SaaS company; not critical)

```hcl
# Before: infra/terraform/synthetics.tf
resource "datadog_synthetics_test" "marketing_homepage" {
  type    = "api"
  subtype = "http"

  request_definition {
    method = "GET"
    url    = "https://metricsco.io/"
  }

  locations = ["aws:us-east-1", "aws:us-west-2", "aws:eu-west-1", "aws:ap-southeast-1"]

  options_list {
    tick_every = 60 # 1-min cadence × 4 locations = ~172,800 runs/mo
  }

  # ... message, monitor_name etc
}
```
```hcl
# After: infra/terraform/synthetics.tf
resource "datadog_synthetics_test" "marketing_homepage" {
  type    = "api"
  subtype = "http"

  request_definition {
    method = "GET"
    url    = "https://metricsco.io/"
  }

  locations = ["aws:us-east-1", "aws:us-west-2", "aws:eu-west-1", "aws:ap-southeast-1"]

  options_list {
    tick_every = 300 # 5-min; homepage outages don't need 60s detection
  }
}

resource "datadog_synthetics_test" "admin_dashboard_health" {
  # ... unchanged ...
  locations = ["aws:us-east-1"]

  options_list {
    tick_every = 600 # 10-min; internal-only
  }
}

resource "datadog_synthetics_test" "blog_health" {
  # ... unchanged ...
  locations = ["aws:us-east-1"]

  options_list {
    tick_every = 900 # 15-min; blog being down briefly is fine
  }
}
```
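Why this saves $200/mo: Cadence reduction on 4 of 6 tests (plus running the two internal tests from a single location) cuts run count from 1,036,800/mo to ~420,000/mo. At $5/10K runs, the new cost is ~$210/mo versus ~$518/mo before, about $308/mo less at list price. We claim $200 to stay conservative.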
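The run counts can be reproduced with a few lines, which also makes it easy to price alternative cadences. This sketch assumes a 30-day month and the $5 per 10,000 runs rate quoted above; it also assumes api_status_page keeps all 4 locations, since its Terraform block is not shown in the snippets.

```python
SECONDS_PER_MONTH = 60 * 60 * 24 * 30
PRICE_PER_10K_RUNS = 5.0  # USD, the API test rate quoted in this report

TEST_NAMES = (
    "checkout_canary", "login_canary", "marketing_homepage",
    "api_status_page", "admin_dashboard_health", "blog_health",
)

# test name -> (tick_every in seconds, number of locations)
BEFORE = {name: (60, 4) for name in TEST_NAMES}
AFTER = {
    "checkout_canary": (60, 4),
    "login_canary": (60, 4),
    "marketing_homepage": (300, 4),
    "api_status_page": (300, 4),  # assumption: locations unchanged
    "admin_dashboard_health": (600, 1),
    "blog_health": (900, 1),
}


def monthly_runs(tests):
    return sum((SECONDS_PER_MONTH // tick) * locations for tick, locations in tests.values())


def monthly_cost(tests):
    return monthly_runs(tests) / 10_000 * PRICE_PER_10K_RUNS


print(monthly_runs(BEFORE), monthly_runs(AFTER))  # 1036800 421920
print(round(monthly_cost(BEFORE) - monthly_cost(AFTER), 2))
```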
The trade-off: Slower cadence = slower detection of outages on those 4 endpoints. 5-min vs 1-min on the marketing homepage means your team finds out about an outage in up to 5 minutes instead of up to 1 minute. For non-critical endpoints, this is fine. For checkout_canary you should NOT do this — the cost of a 4-minute checkout outage during a flash sale dwarfs the $40-60/mo savings.
Edge case: If you have a datadog_monitor on the synthetic test's status, the alert threshold (e.g., "alert if 2 consecutive failures") interacts with cadence. At 1-min cadence, "2 consecutive failures" alerts in 2 minutes. At 15-min cadence, the same threshold takes 30 minutes. Audit your synthetic monitor thresholds to ensure SLA expectations still match.
What we found: infra/k8s/prod/dd-agent-daemonset.yaml:67 sets DD_TAGS="pod_name:$(POD_NAME),env:prod,cluster:prod-east" on the Datadog Agent DaemonSet. Because this is a global tag, it's applied to every metric the Agent emits — including all auto-detected integration metrics, all custom metrics from StatsD clients on the host, and all log records. With pod_name at high cardinality (k8s pods are created/destroyed by ReplicaSet — your prod cluster sees ~340 pod restarts/day = ~10,200 unique pod names/mo across 42 long-lived pods), the global tag inflates Custom Metrics series count across every metric the host touches.
Measured impact: ~10,200 unique pod_name values × hundreds of host-emitted metrics (kubernetes.cpu.usage, kubernetes.memory.usage, container.cpu, container.memory, plus all your custom metrics) creates ~2-3M synthetic series purely from the tag fan-out. At $0.05/100/mo, the marginal cost is small per metric but adds up to ~$100/mo recurring.
```yaml
# Before: infra/k8s/prod/dd-agent-daemonset.yaml
spec:
  template:
    spec:
      containers:
        - name: datadog-agent
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: DD_TAGS
              value: "pod_name:$(POD_NAME),env:prod,cluster:prod-east"
```
```yaml
# After: infra/k8s/prod/dd-agent-daemonset.yaml
spec:
  template:
    spec:
      containers:
        - name: datadog-agent
          env:
            # pod_name is already emitted on k8s integration metrics; no need for global tag
            - name: DD_TAGS
              value: "env:prod,cluster:prod-east"
            # If you need pod-name on specific metrics, use DogStatsD tags at the call site, not globally
```
Why this saves $100/mo: Eliminating pod_name from the global tag set drops the cross-metric fan-out by ~10,200×. Datadog's k8s integration metrics (kubernetes.cpu.usage etc) already include pod_name as a tag automatically — adding it via DD_TAGS was double-tagging and inflating cardinality on non-k8s metrics that don't need it.
Why MEDIUM not HIGH: the savings are real but small relative to Leaks 1-5. Filed at MEDIUM because the rollback case is a bit nuanced — if any of your dashboards filter custom (non-k8s) metrics by pod_name, those dashboards stop working after the change. Audit your dashboards and monitors for pod_name filters before merging.
General rule for DD_TAGS: only use it for low-cardinality tags that apply to every signal from the host (env, region, cluster, service-tier). Anything pod-specific, container-specific, deploy-specific should go on the specific metrics that need it via the StatsD client tags arg — not globally.
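This rule lends itself to a small lint step in CI. The sketch below checks a DD_TAGS string against an allowlist of low-cardinality keys; the allowlist mirrors the examples in the rule above, and the function names are hypothetical. It splits on both commas and whitespace so it tolerates either separator style.

```python
# Keys considered safe as global tags, taken from the rule above.
ALLOWED_GLOBAL_TAG_KEYS = {"env", "region", "cluster", "service-tier"}


def parse_dd_tags(value):
    """Split a DD_TAGS string on commas or whitespace into a key -> value dict."""
    tags = {}
    for item in value.replace(",", " ").split():
        key, _, val = item.partition(":")
        tags[key] = val
    return tags


def disallowed_global_tags(value, allowed=ALLOWED_GLOBAL_TAG_KEYS):
    """Return the sorted tag keys that are not on the allowlist."""
    return sorted(key for key in parse_dd_tags(value) if key not in allowed)


print(disallowed_global_tags("pod_name:$(POD_NAME),env:prod,cluster:prod-east"))  # ['pod_name']
print(disallowed_global_tags("env:prod,cluster:prod-east"))  # []
```

Failing the build when the returned list is non-empty keeps pod-, container-, or deploy-specific keys from creeping back into the global tag set.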
Every service in the repo, ranked by estimated monthly cost. Identifies which service to focus on first.
| # | Service | Language | StatsD calls | Est unique series | $/mo Custom Metrics |
|---|---|---|---|---|---|
| 1 | services/api | Python (datadog) | 87 | ~1,250,000 | $2,400 |
| 2 | services/orders | Python (ddtrace) | 64 | ~4,200,000 | $1,000 |
| 3 | services/checkout | Python (ddtrace) | 43 | ~180,000 | $420 (APM spans, not Custom Metrics) |
| 4 | services/inventory | Node (hot-shots) | 52 | ~96,000 | $190 |
| 5 | services/payment | Node (dd-trace) | 38 | ~74,000 | $155 |
| 6 | services/fraud | Python (datadog) | 41 | ~22,000 | $78 |
| 7 | services/notifications | Node (hot-shots) | 29 | ~14,000 | $45 |
| 8-22 | (15 other services) | mixed | 133 combined | ~310,000 combined | $580 combined |
Note: services/api + services/orders account for ~88% of the estimated unique series in the table. Concentrating fixes there (Leaks #1 and #2) is the highest leverage: a combined $3,400/mo recurring from fewer than 50 lines of Python changes across 2 files. The Node services (inventory, payment, notifications) attach fewer high-cardinality tags in this repo, so they're not the priority. (Client-side buffering options such as hot-shots' maxBufferSize only batch packets before sending; they do not limit how many series are created.)
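The concentration figure can be recomputed from the series estimates in the table. This is a short sketch with illustrative names; the numbers are the table's estimates, not values pulled from the Datadog API.

```python
# Estimated unique series per service, copied from the table above.
SERIES_BY_SERVICE = {
    "services/api": 1_250_000,
    "services/orders": 4_200_000,
    "services/checkout": 180_000,
    "services/inventory": 96_000,
    "services/payment": 74_000,
    "services/notifications": 14_000,
    "services/fraud": 22_000,
    "other (15 services)": 310_000,
}


def series_share(series, names):
    """Fraction of all estimated series contributed by the named services."""
    return sum(series[name] for name in names) / sum(series.values())


top_two = series_share(SERIES_BY_SERVICE, ["services/api", "services/orders"])
print(f"{top_two:.1%}")
```

Re-running the same calculation on post-fix estimates shows whether the next-largest contributor (checkout, then inventory) has become worth a second pass.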
How to verify your savings after merging the recommended fixes (do this 7-14 days post-merge):
1. Open app.datadoghq.com/billing/usage (or app.datadoghq.eu for EU).
2. Set the time range to Last 30 Days.
3. Set the granularity to Daily.
4. Find the Custom Metrics count (billed per unique metric_name:tag-combination). Watch this drop after Leaks #1, #2, #7 fixes (drop expected: ~30-50% in first 7 days).

If your bill DOESN'T drop: redeem the re-audit voucher (30 days post-delivery). We re-run the analysis on the post-fix state and quantify why the predicted savings didn't materialize. If the audit's predictions were wrong, full refund. Common reasons predicted savings don't materialize: (a) the fix was applied but a downstream Terraform datadog_monitor kept the deleted-tag-cardinality alive via a query; (b) a third-party library you don't control is emitting the same patterns; (c) Datadog's billing aggregates retain pre-fix data for some period.
Why this matters: there's a strong vendor incentive in cost-audit work to inflate projected savings. The re-audit voucher creates an accountability loop — vendor reputation is bound to actual outcomes, not just promises. If you implement 0 of the recommendations, that's on you. If you implement all 7 and your bill goes up, we refund.
What the re-audit measures: we re-run the same 9 patterns on the same repo. If the original findings are now resolved, the report says so. We also estimate "new $/mo" by re-pricing against your post-fix Python/YAML/HCL. If you can share a Datadog Plan & Usage screenshot of Custom Metrics + Log Bytes Ingested + Indexed Spans for the 30 days pre- and post-merge, we'll calibrate against ground truth (this is the verification kit above, applied retrospectively).
$149 one-time · Delivered within 2 hours · 30-day money-back guarantee
First-3-customers honest beta pricing: $99 (33% off). Email miloantaeus@gmail.com with subject "Datadog audit — first 3" for direct invoice.