-
-
Notifications
You must be signed in to change notification settings - Fork 953
feat(runs): use metrics instead of spans in the Runs Replication service #2851
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
WalkthroughAdds OpenTelemetry metrics to the runs replication flow: a Meter is imported and threaded into RunsReplicationService, which creates and records multiple metrics (replication lag, batches flushed, batch size, task runs inserted, payloads inserted, insert retries, events processed, flush duration). Tests gain an in-memory metrics helper and are updated to assert metric emissions. A new Grafana dashboard JSON is added to visualize the metrics. Documentation clarifies Prometheus naming for OTLP-exported metrics. Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 1 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
📜 Recent review detailsConfiguration used: Repository UI Review profile: CHILL Plan: Pro 📒 Files selected for processing (3)
🧰 Additional context used📓 Path-based instructions (11)**/*.{ts,tsx}📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Files:
{packages/core,apps/webapp}/**/*.{ts,tsx}📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Files:
**/*.{ts,tsx,js,jsx}📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Files:
apps/webapp/app/**/*.{ts,tsx}📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)
Files:
apps/webapp/app/services/**/*.server.{ts,tsx}📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)
Files:
apps/webapp/**/*.{ts,tsx}📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)
Files:
**/*.{js,ts,jsx,tsx,json,md,css,scss}📄 CodeRabbit inference engine (AGENTS.md)
Files:
**/*.ts📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)
Files:
**/*.{test,spec}.{ts,tsx}📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Files:
apps/webapp/**/*.test.{ts,tsx}📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)
Files:
**/*.test.{ts,tsx,js,jsx}📄 CodeRabbit inference engine (AGENTS.md)
Files:
🧠 Learnings (12)📚 Learning: 2025-11-27T16:27:35.304ZApplied to files:
📚 Learning: 2025-11-27T16:27:35.304ZApplied to files:
📚 Learning: 2026-01-08T10:08:07.206ZApplied to files:
📚 Learning: 2025-11-27T16:27:35.304ZApplied to files:
📚 Learning: 2025-07-12T18:06:04.133ZApplied to files:
📚 Learning: 2025-11-27T16:27:35.304ZApplied to files:
📚 Learning: 2025-11-27T16:26:37.432ZApplied to files:
📚 Learning: 2025-11-27T16:27:35.304ZApplied to files:
📚 Learning: 2025-11-27T16:27:35.304ZApplied to files:
📚 Learning: 2025-11-27T16:27:48.109ZApplied to files:
📚 Learning: 2025-11-27T16:27:35.304ZApplied to files:
📚 Learning: 2025-11-27T16:27:35.304ZApplied to files:
🧬 Code graph analysis (2)apps/webapp/app/services/runsReplicationService.server.ts (1)
apps/webapp/test/runsReplicationService.part1.test.ts (1)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (25)
🔇 Additional comments (10)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
PR Review: feat(runs): use metrics instead of spans in the Runs Replication serviceThis PR replaces span-based observability with OpenTelemetry metrics in the ✅ Strengths
|
Review CompleteYour review story is ready! Comment !reviewfast on this PR to re-generate the story. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In @docker/config/grafana/provisioning/dashboards/runs-replication.json:
- Around line 421-429: The Grafana panel references
triggerdotdev_runs_replication_flush_queue_depth which is not defined; either
add a Gauge to runsReplicationService.server.ts named
triggerdotdev_runs_replication_flush_queue_depth (using the same Prom client and
naming/label conventions as the existing metrics) and update it wherever the
flush queue is mutated (increment/decrement or set to actual queue length when
enqueuing/dequeuing/processing), or remove the panel from runs-replication.json;
ensure the metric registration matches other metrics so the dashboard can
discover it.
📜 Review details
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
.cursor/rules/otel-metrics.mdcapps/webapp/app/services/runsReplicationInstance.server.tsapps/webapp/app/services/runsReplicationService.server.tsapps/webapp/test/runsReplicationService.part1.test.tsapps/webapp/test/utils/tracing.tsdocker/config/grafana/provisioning/dashboards/runs-replication.json
🧰 Additional context used
📓 Path-based instructions (11)
**/*.{js,ts,jsx,tsx,json,md,css,scss}
📄 CodeRabbit inference engine (AGENTS.md)
Format code using Prettier
Files:
docker/config/grafana/provisioning/dashboards/runs-replication.jsonapps/webapp/test/utils/tracing.tsapps/webapp/app/services/runsReplicationService.server.tsapps/webapp/test/runsReplicationService.part1.test.tsapps/webapp/app/services/runsReplicationInstance.server.ts
**/*.{ts,tsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects instead
Files:
apps/webapp/test/utils/tracing.tsapps/webapp/app/services/runsReplicationService.server.tsapps/webapp/test/runsReplicationService.part1.test.tsapps/webapp/app/services/runsReplicationInstance.server.ts
{packages/core,apps/webapp}/**/*.{ts,tsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Use zod for validation in packages/core and apps/webapp
Files:
apps/webapp/test/utils/tracing.tsapps/webapp/app/services/runsReplicationService.server.tsapps/webapp/test/runsReplicationService.part1.test.tsapps/webapp/app/services/runsReplicationInstance.server.ts
**/*.{ts,tsx,js,jsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Use function declarations instead of default exports
Files:
apps/webapp/test/utils/tracing.tsapps/webapp/app/services/runsReplicationService.server.tsapps/webapp/test/runsReplicationService.part1.test.tsapps/webapp/app/services/runsReplicationInstance.server.ts
apps/webapp/**/*.{ts,tsx}
📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)
apps/webapp/**/*.{ts,tsx}: When importing from@trigger.dev/corein the webapp, use subpath exports from the package.json instead of importing from the root path
Follow the Remix 2.1.0 and Express server conventions when updating the main trigger.dev webapp
Files:
apps/webapp/test/utils/tracing.tsapps/webapp/app/services/runsReplicationService.server.tsapps/webapp/test/runsReplicationService.part1.test.tsapps/webapp/app/services/runsReplicationInstance.server.ts
**/*.ts
📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)
**/*.ts: OpenTelemetry metric attributes must have low cardinality - use only enums, booleans, bounded error codes, or bounded shard IDs as attribute values
Do not use high-cardinality attributes in OpenTelemetry metrics: avoid UUIDs/IDs (envId, userId, runId, projectId, organizationId), unbounded integers (itemCount, batchSize, retryCount), timestamps (createdAt, startTime), or free-form strings (errorMessage, taskName, queueName)
Files:
apps/webapp/test/utils/tracing.tsapps/webapp/app/services/runsReplicationService.server.tsapps/webapp/test/runsReplicationService.part1.test.tsapps/webapp/app/services/runsReplicationInstance.server.ts
apps/webapp/app/**/*.{ts,tsx}
📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)
Access all environment variables through the
envexport ofenv.server.tsinstead of directly accessingprocess.envin the Trigger.dev webapp
Files:
apps/webapp/app/services/runsReplicationService.server.tsapps/webapp/app/services/runsReplicationInstance.server.ts
apps/webapp/app/services/**/*.server.{ts,tsx}
📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)
Separate testable services from configuration files; follow the pattern of
realtimeClient.server.ts(testable service) andrealtimeClientGlobal.server.ts(configuration) in the webapp
Files:
apps/webapp/app/services/runsReplicationService.server.tsapps/webapp/app/services/runsReplicationInstance.server.ts
**/*.{test,spec}.{ts,tsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Use vitest for all tests in the Trigger.dev repository
Files:
apps/webapp/test/runsReplicationService.part1.test.ts
apps/webapp/**/*.test.{ts,tsx}
📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)
Test files should only import classes and functions from
app/**/*.tsfiles and should not importenv.server.tsdirectly or indirectly; pass configuration through options instead
Files:
apps/webapp/test/runsReplicationService.part1.test.ts
**/*.test.{ts,tsx,js,jsx}
📄 CodeRabbit inference engine (AGENTS.md)
**/*.test.{ts,tsx,js,jsx}: Test files should live beside the files under test and use descriptivedescribeanditblocks
Avoid mocks or stubs in tests; use helpers from@internal/testcontainerswhen Redis or Postgres are needed
Use vitest for unit tests
Files:
apps/webapp/test/runsReplicationService.part1.test.ts
🧠 Learnings (14)
📚 Learning: 2026-01-08T10:08:07.206Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/otel-metrics.mdc:0-0
Timestamp: 2026-01-08T10:08:07.206Z
Learning: Applies to **/*.ts : OpenTelemetry metric attributes must have low cardinality - use only enums, booleans, bounded error codes, or bounded shard IDs as attribute values
Applied to files:
.cursor/rules/otel-metrics.mdc
📚 Learning: 2026-01-08T10:08:07.206Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/otel-metrics.mdc:0-0
Timestamp: 2026-01-08T10:08:07.206Z
Learning: Applies to **/*.ts : Do not use high-cardinality attributes in OpenTelemetry metrics: avoid UUIDs/IDs (envId, userId, runId, projectId, organizationId), unbounded integers (itemCount, batchSize, retryCount), timestamps (createdAt, startTime), or free-form strings (errorMessage, taskName, queueName)
Applied to files:
.cursor/rules/otel-metrics.mdc
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger.config.ts : Configure OpenTelemetry instrumentations and exporters in trigger.config.ts for enhanced logging
Applied to files:
.cursor/rules/otel-metrics.mdcapps/webapp/test/utils/tracing.tsapps/webapp/app/services/runsReplicationService.server.tsapps/webapp/app/services/runsReplicationInstance.server.ts
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Use metadata methods (set, del, replace, append, remove, increment, decrement, stream, flush) to update metadata during task execution
Applied to files:
apps/webapp/app/services/runsReplicationService.server.ts
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Attach metadata to task runs using the metadata option when triggering, and access/update it inside runs using metadata functions
Applied to files:
apps/webapp/app/services/runsReplicationService.server.tsapps/webapp/test/runsReplicationService.part1.test.ts
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Configure task retry behavior using the `retry` property with options like maxAttempts, factor, and timeout values
Applied to files:
apps/webapp/app/services/runsReplicationService.server.ts
📚 Learning: 2025-07-12T18:06:04.133Z
Learnt from: matt-aitken
Repo: triggerdotdev/trigger.dev PR: 2264
File: apps/webapp/app/services/runsRepository.server.ts:172-174
Timestamp: 2025-07-12T18:06:04.133Z
Learning: In apps/webapp/app/services/runsRepository.server.ts, the in-memory status filtering after fetching runs from Prisma is intentionally used as a workaround for ClickHouse data delays. This approach is acceptable because the result set is limited to a maximum of 100 runs due to pagination, making the performance impact negligible.
Applied to files:
apps/webapp/test/runsReplicationService.part1.test.ts
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Use `schemaTask()` from `trigger.dev/sdk/v3` with Zod schema for payload validation
Applied to files:
apps/webapp/test/runsReplicationService.part1.test.ts
📚 Learning: 2025-11-27T16:26:37.432Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-27T16:26:37.432Z
Learning: Applies to {packages/core,apps/webapp}/**/*.{ts,tsx} : Use zod for validation in packages/core and apps/webapp
Applied to files:
apps/webapp/test/runsReplicationService.part1.test.ts
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Subscribe to run updates using `runs.subscribeToRun()` for realtime monitoring of task execution
Applied to files:
apps/webapp/test/runsReplicationService.part1.test.ts
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Use the `task()` function from `trigger.dev/sdk/v3` to define tasks with id and run properties
Applied to files:
apps/webapp/test/runsReplicationService.part1.test.ts
📚 Learning: 2025-11-27T16:27:48.109Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-27T16:27:48.109Z
Learning: Applies to **/*.test.{ts,tsx,js,jsx} : Avoid mocks or stubs in tests; use helpers from `internal/testcontainers` when Redis or Postgres are needed
Applied to files:
apps/webapp/test/runsReplicationService.part1.test.ts
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Use `runs.subscribeToBatch()` to subscribe to changes for all runs in a batch
Applied to files:
apps/webapp/test/runsReplicationService.part1.test.ts
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Use `batch.triggerAndWait()` to batch trigger multiple different tasks and wait for results
Applied to files:
apps/webapp/test/runsReplicationService.part1.test.ts
🧬 Code graph analysis (1)
apps/webapp/app/services/runsReplicationService.server.ts (1)
internal-packages/tracing/src/index.ts (5)
Meter(16-16)Histogram(24-24)Counter(17-17)trace(39-39)getMeter(56-58)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (24)
- GitHub Check: Cursor Bugbot
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (4, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (7, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (7, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (4, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (1, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (8, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (3, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (6, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (8, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (2, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (5, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (6, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (5, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (1, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (3, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (2, 8)
- GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
- GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
- GitHub Check: units / packages / 🧪 Unit Tests: Packages (1, 1)
- GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
- GitHub Check: typecheck / typecheck
- GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
- GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (12)
docker/config/grafana/provisioning/dashboards/runs-replication.json (1)
1-593: Dashboard configuration looks well-structured.The dashboard properly:
- Uses low-cardinality attributes (
success,event_type,operation) which aligns with the coding guidelines- Provides comprehensive coverage of the new metrics
- Includes appropriate thresholds for success rates and retry alerts
- Uses histogram quantiles (p50, p95, p99) for latency metrics
apps/webapp/app/services/runsReplicationInstance.server.ts (2)
5-5: LGTM!The meter import follows the existing pattern for tracer configuration and is correctly wired into the service options.
64-65: LGTM!The meter is correctly passed to the service alongside the tracer, maintaining consistency with the existing instrumentation pattern.
apps/webapp/test/utils/tracing.ts (1)
28-65: Well-designed test utility for in-memory metrics.The implementation correctly:
- Uses cumulative aggregation temporality for easier assertion in tests
- Provides convenient helpers (
getMetrics,reset,shutdown) for test lifecycle management- Sets a fast export interval (100ms) suitable for test scenarios
One consideration: ensure tests always call
shutdown()in cleanup (e.g.,afterEachor test finally blocks) to prevent resource leaks from the periodic reader.apps/webapp/test/runsReplicationService.part1.test.ts (2)
1132-1354: Comprehensive metrics test coverage.The test properly validates:
- Counter metrics (
batches_flushed,task_runs_inserted,payloads_inserted,transactions_processed)- Histogram metrics (
batch_size,replication_lag_ms,flush_duration_ms)- Attribute values (
success,event_type)- Cleanup with
metricsHelper.shutdown()The helper functions for metric extraction are well-designed and handle various OpenTelemetry data point formats.
120-121: LGTM!The span assertion correctly changed from
handle_transactiontoflushBatchto reflect the updated instrumentation approach.apps/webapp/app/services/runsReplicationService.server.ts (5)
117-125: Well-structured metrics field declarations.The private fields for metrics are properly typed with
HistogramandCounterfrom@internal/tracing, maintaining type safety.
137-187: Metrics initialization follows best practices.All metrics are properly initialized with:
- Descriptive names using consistent
runs_replication.prefix- Clear descriptions
- Appropriate units (
ms,items,inserts)The naming conventions align with the OpenTelemetry semantic conventions.
501-507: Low-cardinality attributes correctly used.The
event_typeattribute only has three possible values (insert,update,delete), which is properly bounded and follows the coding guidelines for OpenTelemetry metrics.
657-672: Metrics recording in flush batch is well-implemented.The implementation correctly:
- Records batch size and flush duration
- Uses boolean
successattribute (low cardinality)- Only increments insert counters on successful operations
- Records counts after confirming no errors
704-706: Low-cardinalityoperationattribute.The operation attribute is bounded to two values (
task_runs,payloads), maintaining low cardinality as required by the coding guidelines..cursor/rules/otel-metrics.mdc (1)
49-54: The table documentation is accurate and aligns with OTLP-to-Prometheus exporter specifications. No changes needed.The OTLP spec confirms that unit suffixes are converted and appended before type suffixes. For counters with
unit: "inserts", the resulting Prometheus name is correctly shown as{name}_inserts_total, not just{name}_total. Similarly, histograms withunit: "items"correctly become{name}_items_bucket. These patterns are validated by the actual metrics inrunsReplicationService.server.ts(lines 149–160), which use these exact naming conventions.
docker/config/grafana/provisioning/dashboards/runs-replication.json
Outdated
Show resolved
Hide resolved
docker/config/grafana/provisioning/dashboards/runs-replication.json
Outdated
Show resolved
Hide resolved
PR Review: feat(runs): use metrics instead of spans in the Runs Replication serviceThis PR migrates from tracing spans to OpenTelemetry metrics for the Runs Replication service observability. It adds 8 new metrics (counters and histograms), removes a verbose handle_transaction span, and includes a comprehensive Grafana dashboard. Code Quality and Best PracticesStrengths:
Suggestions:
Potential Issues
PerformanceThe migration from spans to metrics is a solid performance improvement with reduced overhead and less storage. SecurityNo security concerns. Metrics record only operational data with no PII. Test CoverageThe new metrics test is comprehensive, covering all 8 metrics. Consider adding a test for insert_retries by mocking failures. Grafana DashboardWell-organized with logical sections, appropriate thresholds, and P50/P95/P99 percentiles. Verdict: Approve with minor suggestions as optional improvements. |
No description provided.