feat(runs): use metrics instead of spans in the Runs Replication service #2851

ericallam · 2026-01-08T14:50:36Z

No description provided.

changeset-bot · 2026-01-08T14:50:40Z

⚠️ No Changeset found

Latest commit: 2cf1434

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

coderabbitai · 2026-01-08T14:50:58Z

Walkthrough

Adds OpenTelemetry metrics to the runs replication flow: a Meter is imported and threaded into RunsReplicationService, which creates and records multiple metrics (replication lag, batches flushed, batch size, task runs inserted, payloads inserted, insert retries, events processed, flush duration). Tests gain an in-memory metrics helper and are updated to assert metric emissions. A new Grafana dashboard JSON is added to visualize the metrics. Documentation clarifies Prometheus naming for OTLP-exported metrics.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The PR description is entirely missing. The template requires sections for Testing, Changelog, and Screenshots, along with a checklist, but none are provided.	Add a complete PR description following the template, including the issue number, checklist items, testing steps, changelog entry, and any relevant screenshots or context.
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the main change: introducing metrics instead of spans in the Runs Replication service, which is the primary focus across all modified files.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing touches

📝 Generate docstrings

📜 Recent review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3a7ed77 and 2cf1434.

📒 Files selected for processing (3)

apps/webapp/app/services/runsReplicationService.server.ts
apps/webapp/test/runsReplicationService.part1.test.ts
docker/config/grafana/provisioning/dashboards/runs-replication.json

🧰 Additional context used

📓 Path-based instructions (11)

**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects instead

Files:

apps/webapp/app/services/runsReplicationService.server.ts
apps/webapp/test/runsReplicationService.part1.test.ts

{packages/core,apps/webapp}/**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use zod for validation in packages/core and apps/webapp

Files:

apps/webapp/app/services/runsReplicationService.server.ts
apps/webapp/test/runsReplicationService.part1.test.ts

**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use function declarations instead of default exports

Files:

apps/webapp/app/services/runsReplicationService.server.ts
apps/webapp/test/runsReplicationService.part1.test.ts

apps/webapp/app/**/*.{ts,tsx}

📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)

Access all environment variables through the env export of env.server.ts instead of directly accessing process.env in the Trigger.dev webapp

Files:

apps/webapp/app/services/runsReplicationService.server.ts

apps/webapp/app/services/**/*.server.{ts,tsx}

📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)

Separate testable services from configuration files; follow the pattern of realtimeClient.server.ts (testable service) and realtimeClientGlobal.server.ts (configuration) in the webapp

Files:

apps/webapp/app/services/runsReplicationService.server.ts

apps/webapp/**/*.{ts,tsx}

📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)

apps/webapp/**/*.{ts,tsx}: When importing from @trigger.dev/core in the webapp, use subpath exports from the package.json instead of importing from the root path
Follow the Remix 2.1.0 and Express server conventions when updating the main trigger.dev webapp

Files:

apps/webapp/app/services/runsReplicationService.server.ts
apps/webapp/test/runsReplicationService.part1.test.ts

**/*.{js,ts,jsx,tsx,json,md,css,scss}

📄 CodeRabbit inference engine (AGENTS.md)

Format code using Prettier

Files:

apps/webapp/app/services/runsReplicationService.server.ts
apps/webapp/test/runsReplicationService.part1.test.ts
docker/config/grafana/provisioning/dashboards/runs-replication.json

**/*.ts

📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)

**/*.ts: OpenTelemetry metric attributes must have low cardinality - use only enums, booleans, bounded error codes, or bounded shard IDs as attribute values
Do not use high-cardinality attributes in OpenTelemetry metrics: avoid UUIDs/IDs (envId, userId, runId, projectId, organizationId), unbounded integers (itemCount, batchSize, retryCount), timestamps (createdAt, startTime), or free-form strings (errorMessage, taskName, queueName)

Files:

apps/webapp/app/services/runsReplicationService.server.ts
apps/webapp/test/runsReplicationService.part1.test.ts

**/*.{test,spec}.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use vitest for all tests in the Trigger.dev repository

Files:

apps/webapp/test/runsReplicationService.part1.test.ts

apps/webapp/**/*.test.{ts,tsx}

📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)

Test files should only import classes and functions from app/**/*.ts files and should not import env.server.ts directly or indirectly; pass configuration through options instead

Files:

apps/webapp/test/runsReplicationService.part1.test.ts

**/*.test.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (AGENTS.md)

**/*.test.{ts,tsx,js,jsx}: Test files should live beside the files under test and use descriptive describe and it blocks
Avoid mocks or stubs in tests; use helpers from @internal/testcontainers when Redis or Postgres are needed
Use vitest for unit tests

Files:

apps/webapp/test/runsReplicationService.part1.test.ts

🧠 Learnings (12)

📚 Learning: 2025-11-27T16:27:35.304Z

Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger.config.ts : Configure OpenTelemetry instrumentations and exporters in trigger.config.ts for enhanced logging

Applied to files:

apps/webapp/app/services/runsReplicationService.server.ts

📚 Learning: 2025-11-27T16:27:35.304Z

Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Attach metadata to task runs using the metadata option when triggering, and access/update it inside runs using metadata functions

Applied to files:

apps/webapp/app/services/runsReplicationService.server.ts
apps/webapp/test/runsReplicationService.part1.test.ts

📚 Learning: 2026-01-08T10:08:07.206Z

Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/otel-metrics.mdc:0-0
Timestamp: 2026-01-08T10:08:07.206Z
Learning: Applies to **/*.ts : Do not use high-cardinality attributes in OpenTelemetry metrics: avoid UUIDs/IDs (envId, userId, runId, projectId, organizationId), unbounded integers (itemCount, batchSize, retryCount), timestamps (createdAt, startTime), or free-form strings (errorMessage, taskName, queueName)

Applied to files:

apps/webapp/app/services/runsReplicationService.server.ts

📚 Learning: 2025-11-27T16:27:35.304Z

Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Configure task retry behavior using the `retry` property with options like maxAttempts, factor, and timeout values

Applied to files:

apps/webapp/app/services/runsReplicationService.server.ts

📚 Learning: 2025-07-12T18:06:04.133Z

Learnt from: matt-aitken
Repo: triggerdotdev/trigger.dev PR: 2264
File: apps/webapp/app/services/runsRepository.server.ts:172-174
Timestamp: 2025-07-12T18:06:04.133Z
Learning: In apps/webapp/app/services/runsRepository.server.ts, the in-memory status filtering after fetching runs from Prisma is intentionally used as a workaround for ClickHouse data delays. This approach is acceptable because the result set is limited to a maximum of 100 runs due to pagination, making the performance impact negligible.

Applied to files:

apps/webapp/test/runsReplicationService.part1.test.ts

📚 Learning: 2025-11-27T16:27:35.304Z

Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Use `schemaTask()` from `trigger.dev/sdk/v3` with Zod schema for payload validation

Applied to files:

apps/webapp/test/runsReplicationService.part1.test.ts

📚 Learning: 2025-11-27T16:26:37.432Z

Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-27T16:26:37.432Z
Learning: Applies to {packages/core,apps/webapp}/**/*.{ts,tsx} : Use zod for validation in packages/core and apps/webapp

Applied to files:

apps/webapp/test/runsReplicationService.part1.test.ts

📚 Learning: 2025-11-27T16:27:35.304Z

Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Subscribe to run updates using `runs.subscribeToRun()` for realtime monitoring of task execution

Applied to files:

apps/webapp/test/runsReplicationService.part1.test.ts

📚 Learning: 2025-11-27T16:27:35.304Z

Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Use the `task()` function from `trigger.dev/sdk/v3` to define tasks with id and run properties

Applied to files:

apps/webapp/test/runsReplicationService.part1.test.ts

📚 Learning: 2025-11-27T16:27:48.109Z

Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-27T16:27:48.109Z
Learning: Applies to **/*.test.{ts,tsx,js,jsx} : Avoid mocks or stubs in tests; use helpers from `internal/testcontainers` when Redis or Postgres are needed

Applied to files:

apps/webapp/test/runsReplicationService.part1.test.ts

📚 Learning: 2025-11-27T16:27:35.304Z

Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Use `runs.subscribeToBatch()` to subscribe to changes for all runs in a batch

Applied to files:

apps/webapp/test/runsReplicationService.part1.test.ts

📚 Learning: 2025-11-27T16:27:35.304Z

Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Use `batch.triggerAndWait()` to batch trigger multiple different tasks and wait for results

Applied to files:

apps/webapp/test/runsReplicationService.part1.test.ts

🧬 Code graph analysis (2)

apps/webapp/app/services/runsReplicationService.server.ts (1)

internal-packages/tracing/src/index.ts (5)

Meter (16-16)

Histogram (24-24)

Counter (17-17)

trace (39-39)

getMeter (56-58)

apps/webapp/test/runsReplicationService.part1.test.ts (1)

apps/webapp/test/utils/tracing.ts (2)

createInMemoryTracing (11-26)

createInMemoryMetrics (28-65)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (25)

GitHub Check: Cursor Bugbot
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (7, 8)
GitHub Check: units / internal / 🧪 Unit Tests: Internal (5, 8)
GitHub Check: units / internal / 🧪 Unit Tests: Internal (8, 8)
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (8, 8)
GitHub Check: units / internal / 🧪 Unit Tests: Internal (6, 8)
GitHub Check: units / internal / 🧪 Unit Tests: Internal (7, 8)
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (3, 8)
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (1, 8)
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (2, 8)
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (4, 8)
GitHub Check: units / internal / 🧪 Unit Tests: Internal (2, 8)
GitHub Check: units / internal / 🧪 Unit Tests: Internal (1, 8)
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (5, 8)
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (6, 8)
GitHub Check: units / internal / 🧪 Unit Tests: Internal (3, 8)
GitHub Check: units / internal / 🧪 Unit Tests: Internal (4, 8)
GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
GitHub Check: units / packages / 🧪 Unit Tests: Packages (1, 1)
GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
GitHub Check: typecheck / typecheck
GitHub Check: claude-review
GitHub Check: Analyze (javascript-typescript)

🔇 Additional comments (10)

apps/webapp/test/runsReplicationService.part1.test.ts (3)

1131-1324: Well-structured metrics validation test.

The test comprehensively validates metric emission across all key metrics. The helper functions (getMetricData, sumCounterValues, histogramHasData, getCounterAttributeValues) are well-designed for flexible metric inspection.

Minor observations:

The histogramHasData function at lines 1261-1273 handles multiple histogram data point formats robustly, which is good for compatibility.

Metrics shutdown at line 1322 is properly awaited.

120-121: Span assertion updated correctly.

The change from checking for handle_transaction spans to flushBatch spans aligns with the service's transition from span-based to metrics-based instrumentation for transaction handling.

328-330: Assertion correctly validates no flush spans when no data is produced.

apps/webapp/app/services/runsReplicationService.server.ts (6)

117-125: Well-organized metrics instrumentation fields.

The private metric instrument fields are logically grouped and clearly named. Each metric has a specific purpose with appropriate types (Counter vs Histogram).

137-187: Metrics initialization follows OpenTelemetry best practices.

The metrics are properly initialized with:

Clear, descriptive names following the runs_replication.* namespace

Appropriate units (ms, items, inserts)

Helpful descriptions

The histogram choices for replication_lag_ms, batch_size, and flush_duration_ms are appropriate since these are distribution metrics where percentiles are valuable.

501-507: Low-cardinality attributes used correctly.

The event_type attribute with values limited to "insert", "update", or "delete" adheres to the coding guidelines requiring low-cardinality attributes for OpenTelemetry metrics. As per coding guidelines, high-cardinality attributes like UUIDs, unbounded integers, or free-form strings are correctly avoided.

657-672: Batch flush metrics recorded appropriately.

The metrics recording:

Uses success boolean attribute (low cardinality) ✓

Only increments insert counters on success to avoid counting failed inserts

Records batch size and duration for every flush attempt (useful for debugging failed flushes)

704-706: Retry metric uses bounded operation attribute.

The operation attribute is correctly bounded to only two values: "task_runs" or "payloads", maintaining low cardinality as required by the coding guidelines.

63-63: Good API design for testability.

The optional meter parameter allows tests to inject an in-memory meter for assertions while providing a sensible default via getMeter("runs-replication") for production use.

docker/config/grafana/provisioning/dashboards/runs-replication.json (1)

1-546: Dashboard configuration and metric naming are correctly configured.

The dashboard is well-organized with logical groupings (Throughput, Latency & Lag, Batching & Queue, Retries by Operation). The Prometheus queries correctly use histogram_quantile for percentile calculations and rate() for counter metrics.

The metric names in the dashboard (e.g., triggerdotdev_runs_replication_task_runs_inserted_inserts_total) correctly follow the OTLP-to-Prometheus conversion pattern. The triggerdotdev prefix is added by the OpenTelemetry Collector's Prometheus exporter, which has namespace: triggerdotdev configured in docker/config/otel-collector-config.yaml. The metric base names (e.g., runs_replication.task_runs_inserted) are defined in apps/webapp/app/services/runsReplicationService.server.ts, and units (e.g., inserts) are properly included, resulting in the expected Prometheus metric format with the _total suffix for counters.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

claude · 2026-01-08T14:51:52Z

PR Review: feat(runs): use metrics instead of spans in the Runs Replication service

This PR replaces span-based observability with OpenTelemetry metrics in the RunsReplicationService. This is a solid architectural improvement - metrics are more appropriate for aggregate measurements and have lower overhead than spans for high-frequency operations.

✅ Strengths

Well-designed metrics instrumentation
- Good selection of metric types: histograms for distributions (batch size, flush duration, replication lag), counters for throughput (inserts, retries, batches flushed)
- Appropriate use of attributes for dimensionality (e.g., success on batches, event_type on transactions, operation on retries)
- Follows low-cardinality attribute pattern as recommended in the codebase guidelines
Comprehensive Grafana dashboard
- Dashboard covers all key operational concerns: throughput, latency/lag, batching, and error tracking
- Good use of percentile calculations (p50, p95, p99) for latency panels
- Success rate calculation properly handles edge cases
Good test coverage
- New test case "should record metrics with correct values when replicating runs" comprehensively validates all metrics
- Helper functions for metric extraction are well-designed (getMetricData, sumCounterValues, histogramHasData)
- Tests validate both counters and histograms with appropriate assertions
Clean test utilities
- createInMemoryMetrics() helper is well-structured with proper cleanup via shutdown() method
- Uses AggregationTemporality.CUMULATIVE which is appropriate for testing

⚠️ Suggestions & Minor Issues

Metric naming inconsistency (lines 137-186 in runsReplicationService.server.ts)
- The metric runs_replication.transactions_processed is counting events not transactions. The counter increments once per event in the loop:
```
for (const event of transaction.events) {
  this._transactionsProcessedCounter.add(1, { event_type: event.tag });
}
```
Consider renaming to runs_replication.events_processed for clarity, or changing the logic to count once per transaction with a separate events count attribute.
Dashboard metric references a non-existent metric
- Line 423-424 in the dashboard references triggerdotdev_runs_replication_flush_queue_depth which is not created in this PR. This panel will show no data.
Unused helper functions in test (lines 1261-1276, 1278-1289 in test file)
- getCounterValueByAttributes and getHistogramCount are defined but never used
- Consider removing or adding test assertions that use them
Minor: Test helper could use proper TypeScript types
- The getMetricData and related functions use any types extensively. While acceptable for test utilities, typed versions would catch issues earlier.

🔒 Security

No security concerns identified. The changes are purely observability-related and don't affect data handling or access control.

⚡ Performance

This change should improve performance:

Metrics have lower overhead than spans, especially at high throughput
Removed the handle_transaction span that was created for every transaction
The flushBatch span is retained which is appropriate for tracing specific operations

📋 Summary

This is a well-executed observability improvement. The main actionable item is the naming of transactions_processed which counts events rather than transactions. The dashboard has one panel referencing a non-existent metric that should either be removed or the metric should be added.

Overall: Looks good to merge with the minor suggestions above considered.

🤖 Generated with Claude Code

vibe-kanban-cloud · 2026-01-08T14:55:50Z

Review Complete

Your review story is ready!

View Story

Comment !reviewfast on this PR to re-generate the story.

coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In @docker/config/grafana/provisioning/dashboards/runs-replication.json:
- Around line 421-429: The Grafana panel references
triggerdotdev_runs_replication_flush_queue_depth which is not defined; either
add a Gauge to runsReplicationService.server.ts named
triggerdotdev_runs_replication_flush_queue_depth (using the same Prom client and
naming/label conventions as the existing metrics) and update it wherever the
flush queue is mutated (increment/decrement or set to actual queue length when
enqueuing/dequeuing/processing), or remove the panel from runs-replication.json;
ensure the metric registration matches other metrics so the dashboard can
discover it.

📜 Review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7d29f5a and 3a7ed77.

📒 Files selected for processing (6)

.cursor/rules/otel-metrics.mdc
apps/webapp/app/services/runsReplicationInstance.server.ts
apps/webapp/app/services/runsReplicationService.server.ts
apps/webapp/test/runsReplicationService.part1.test.ts
apps/webapp/test/utils/tracing.ts
docker/config/grafana/provisioning/dashboards/runs-replication.json

🧰 Additional context used

📓 Path-based instructions (11)

**/*.{js,ts,jsx,tsx,json,md,css,scss}

📄 CodeRabbit inference engine (AGENTS.md)

Format code using Prettier

Files:

docker/config/grafana/provisioning/dashboards/runs-replication.json
apps/webapp/test/utils/tracing.ts
apps/webapp/app/services/runsReplicationService.server.ts
apps/webapp/test/runsReplicationService.part1.test.ts
apps/webapp/app/services/runsReplicationInstance.server.ts

**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects instead

Files:

apps/webapp/test/utils/tracing.ts
apps/webapp/app/services/runsReplicationService.server.ts
apps/webapp/test/runsReplicationService.part1.test.ts
apps/webapp/app/services/runsReplicationInstance.server.ts

{packages/core,apps/webapp}/**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use zod for validation in packages/core and apps/webapp

Files:

apps/webapp/test/utils/tracing.ts
apps/webapp/app/services/runsReplicationService.server.ts
apps/webapp/test/runsReplicationService.part1.test.ts
apps/webapp/app/services/runsReplicationInstance.server.ts

**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use function declarations instead of default exports

Files:

apps/webapp/test/utils/tracing.ts
apps/webapp/app/services/runsReplicationService.server.ts
apps/webapp/test/runsReplicationService.part1.test.ts
apps/webapp/app/services/runsReplicationInstance.server.ts

apps/webapp/**/*.{ts,tsx}

📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)

apps/webapp/**/*.{ts,tsx}: When importing from @trigger.dev/core in the webapp, use subpath exports from the package.json instead of importing from the root path
Follow the Remix 2.1.0 and Express server conventions when updating the main trigger.dev webapp

Files:

apps/webapp/test/utils/tracing.ts
apps/webapp/app/services/runsReplicationService.server.ts
apps/webapp/test/runsReplicationService.part1.test.ts
apps/webapp/app/services/runsReplicationInstance.server.ts

**/*.ts

📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)

**/*.ts: OpenTelemetry metric attributes must have low cardinality - use only enums, booleans, bounded error codes, or bounded shard IDs as attribute values
Do not use high-cardinality attributes in OpenTelemetry metrics: avoid UUIDs/IDs (envId, userId, runId, projectId, organizationId), unbounded integers (itemCount, batchSize, retryCount), timestamps (createdAt, startTime), or free-form strings (errorMessage, taskName, queueName)

Files:

apps/webapp/test/utils/tracing.ts
apps/webapp/app/services/runsReplicationService.server.ts
apps/webapp/test/runsReplicationService.part1.test.ts
apps/webapp/app/services/runsReplicationInstance.server.ts

apps/webapp/app/**/*.{ts,tsx}

📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)

Access all environment variables through the env export of env.server.ts instead of directly accessing process.env in the Trigger.dev webapp

Files:

apps/webapp/app/services/runsReplicationService.server.ts
apps/webapp/app/services/runsReplicationInstance.server.ts

apps/webapp/app/services/**/*.server.{ts,tsx}

📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)

Separate testable services from configuration files; follow the pattern of realtimeClient.server.ts (testable service) and realtimeClientGlobal.server.ts (configuration) in the webapp

Files:

apps/webapp/app/services/runsReplicationService.server.ts
apps/webapp/app/services/runsReplicationInstance.server.ts

**/*.{test,spec}.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use vitest for all tests in the Trigger.dev repository

Files:

apps/webapp/test/runsReplicationService.part1.test.ts

apps/webapp/**/*.test.{ts,tsx}

📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)

Test files should only import classes and functions from app/**/*.ts files and should not import env.server.ts directly or indirectly; pass configuration through options instead

Files:

apps/webapp/test/runsReplicationService.part1.test.ts

**/*.test.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (AGENTS.md)

**/*.test.{ts,tsx,js,jsx}: Test files should live beside the files under test and use descriptive describe and it blocks
Avoid mocks or stubs in tests; use helpers from @internal/testcontainers when Redis or Postgres are needed
Use vitest for unit tests

Files:

apps/webapp/test/runsReplicationService.part1.test.ts

🧠 Learnings (14)

📚 Learning: 2026-01-08T10:08:07.206Z

Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/otel-metrics.mdc:0-0
Timestamp: 2026-01-08T10:08:07.206Z
Learning: Applies to **/*.ts : OpenTelemetry metric attributes must have low cardinality - use only enums, booleans, bounded error codes, or bounded shard IDs as attribute values

Applied to files:

.cursor/rules/otel-metrics.mdc

📚 Learning: 2026-01-08T10:08:07.206Z

Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/otel-metrics.mdc:0-0
Timestamp: 2026-01-08T10:08:07.206Z
Learning: Applies to **/*.ts : Do not use high-cardinality attributes in OpenTelemetry metrics: avoid UUIDs/IDs (envId, userId, runId, projectId, organizationId), unbounded integers (itemCount, batchSize, retryCount), timestamps (createdAt, startTime), or free-form strings (errorMessage, taskName, queueName)

Applied to files:

.cursor/rules/otel-metrics.mdc

📚 Learning: 2025-11-27T16:27:35.304Z

Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger.config.ts : Configure OpenTelemetry instrumentations and exporters in trigger.config.ts for enhanced logging

Applied to files:

.cursor/rules/otel-metrics.mdc
apps/webapp/test/utils/tracing.ts
apps/webapp/app/services/runsReplicationService.server.ts
apps/webapp/app/services/runsReplicationInstance.server.ts

📚 Learning: 2025-11-27T16:27:35.304Z

Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Use metadata methods (set, del, replace, append, remove, increment, decrement, stream, flush) to update metadata during task execution

Applied to files:

apps/webapp/app/services/runsReplicationService.server.ts

📚 Learning: 2025-11-27T16:27:35.304Z

Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Attach metadata to task runs using the metadata option when triggering, and access/update it inside runs using metadata functions

Applied to files:

apps/webapp/app/services/runsReplicationService.server.ts
apps/webapp/test/runsReplicationService.part1.test.ts

📚 Learning: 2025-11-27T16:27:35.304Z

Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Configure task retry behavior using the `retry` property with options like maxAttempts, factor, and timeout values

Applied to files:

apps/webapp/app/services/runsReplicationService.server.ts

📚 Learning: 2025-07-12T18:06:04.133Z

Learnt from: matt-aitken
Repo: triggerdotdev/trigger.dev PR: 2264
File: apps/webapp/app/services/runsRepository.server.ts:172-174
Timestamp: 2025-07-12T18:06:04.133Z
Learning: In apps/webapp/app/services/runsRepository.server.ts, the in-memory status filtering after fetching runs from Prisma is intentionally used as a workaround for ClickHouse data delays. This approach is acceptable because the result set is limited to a maximum of 100 runs due to pagination, making the performance impact negligible.

Applied to files:

apps/webapp/test/runsReplicationService.part1.test.ts

📚 Learning: 2025-11-27T16:27:35.304Z

Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Use `schemaTask()` from `trigger.dev/sdk/v3` with Zod schema for payload validation

Applied to files:

apps/webapp/test/runsReplicationService.part1.test.ts

📚 Learning: 2025-11-27T16:26:37.432Z

Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-27T16:26:37.432Z
Learning: Applies to {packages/core,apps/webapp}/**/*.{ts,tsx} : Use zod for validation in packages/core and apps/webapp

Applied to files:

apps/webapp/test/runsReplicationService.part1.test.ts

📚 Learning: 2025-11-27T16:27:35.304Z

Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Subscribe to run updates using `runs.subscribeToRun()` for realtime monitoring of task execution

Applied to files:

apps/webapp/test/runsReplicationService.part1.test.ts

📚 Learning: 2025-11-27T16:27:35.304Z

Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Use the `task()` function from `trigger.dev/sdk/v3` to define tasks with id and run properties

Applied to files:

apps/webapp/test/runsReplicationService.part1.test.ts

📚 Learning: 2025-11-27T16:27:48.109Z

Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-27T16:27:48.109Z
Learning: Applies to **/*.test.{ts,tsx,js,jsx} : Avoid mocks or stubs in tests; use helpers from `internal/testcontainers` when Redis or Postgres are needed

Applied to files:

apps/webapp/test/runsReplicationService.part1.test.ts

📚 Learning: 2025-11-27T16:27:35.304Z

Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Use `runs.subscribeToBatch()` to subscribe to changes for all runs in a batch

Applied to files:

apps/webapp/test/runsReplicationService.part1.test.ts

📚 Learning: 2025-11-27T16:27:35.304Z

Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Use `batch.triggerAndWait()` to batch trigger multiple different tasks and wait for results

Applied to files:

apps/webapp/test/runsReplicationService.part1.test.ts

🧬 Code graph analysis (1)

apps/webapp/app/services/runsReplicationService.server.ts (1)

internal-packages/tracing/src/index.ts (5)

Meter (16-16)

Histogram (24-24)

Counter (17-17)

trace (39-39)

getMeter (56-58)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (24)

GitHub Check: Cursor Bugbot
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (4, 8)
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (7, 8)
GitHub Check: units / internal / 🧪 Unit Tests: Internal (7, 8)
GitHub Check: units / internal / 🧪 Unit Tests: Internal (4, 8)
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (1, 8)
GitHub Check: units / internal / 🧪 Unit Tests: Internal (8, 8)
GitHub Check: units / internal / 🧪 Unit Tests: Internal (3, 8)
GitHub Check: units / internal / 🧪 Unit Tests: Internal (6, 8)
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (8, 8)
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (2, 8)
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (5, 8)
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (6, 8)
GitHub Check: units / internal / 🧪 Unit Tests: Internal (5, 8)
GitHub Check: units / internal / 🧪 Unit Tests: Internal (1, 8)
GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (3, 8)
GitHub Check: units / internal / 🧪 Unit Tests: Internal (2, 8)
GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
GitHub Check: units / packages / 🧪 Unit Tests: Packages (1, 1)
GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
GitHub Check: typecheck / typecheck
GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
GitHub Check: Analyze (javascript-typescript)

🔇 Additional comments (12)

docker/config/grafana/provisioning/dashboards/runs-replication.json (1)

1-593: Dashboard configuration looks well-structured.

The dashboard properly:

Uses low-cardinality attributes (success, event_type, operation) which aligns with the coding guidelines

Provides comprehensive coverage of the new metrics

Includes appropriate thresholds for success rates and retry alerts

Uses histogram quantiles (p50, p95, p99) for latency metrics

apps/webapp/app/services/runsReplicationInstance.server.ts (2)

5-5: LGTM!

The meter import follows the existing pattern for tracer configuration and is correctly wired into the service options.

64-65: LGTM!

The meter is correctly passed to the service alongside the tracer, maintaining consistency with the existing instrumentation pattern.

apps/webapp/test/utils/tracing.ts (1)

28-65: Well-designed test utility for in-memory metrics.

The implementation correctly:

Uses cumulative aggregation temporality for easier assertion in tests

Provides convenient helpers (getMetrics, reset, shutdown) for test lifecycle management

Sets a fast export interval (100ms) suitable for test scenarios

One consideration: ensure tests always call shutdown() in cleanup (e.g., afterEach or test finally blocks) to prevent resource leaks from the periodic reader.

apps/webapp/test/runsReplicationService.part1.test.ts (2)

1132-1354: Comprehensive metrics test coverage.

The test properly validates:

Counter metrics (batches_flushed, task_runs_inserted, payloads_inserted, transactions_processed)

Histogram metrics (batch_size, replication_lag_ms, flush_duration_ms)

Attribute values (success, event_type)

Cleanup with metricsHelper.shutdown()

The helper functions for metric extraction are well-designed and handle various OpenTelemetry data point formats.

120-121: LGTM!

The span assertion correctly changed from handle_transaction to flushBatch to reflect the updated instrumentation approach.

apps/webapp/app/services/runsReplicationService.server.ts (5)

117-125: Well-structured metrics field declarations.

The private fields for metrics are properly typed with Histogram and Counter from @internal/tracing, maintaining type safety.

137-187: Metrics initialization follows best practices.

All metrics are properly initialized with:

Descriptive names using consistent runs_replication. prefix

Clear descriptions

Appropriate units (ms, items, inserts)

The naming conventions align with the OpenTelemetry semantic conventions.

501-507: Low-cardinality attributes correctly used.

The event_type attribute only has three possible values (insert, update, delete), which is properly bounded and follows the coding guidelines for OpenTelemetry metrics.

657-672: Metrics recording in flush batch is well-implemented.

The implementation correctly:

Records batch size and flush duration

Uses boolean success attribute (low cardinality)

Only increments insert counters on successful operations

Records counts after confirming no errors

704-706: Low-cardinality operation attribute.

The operation attribute is bounded to two values (task_runs, payloads), maintaining low cardinality as required by the coding guidelines.

.cursor/rules/otel-metrics.mdc (1)

49-54: The table documentation is accurate and aligns with OTLP-to-Prometheus exporter specifications. No changes needed.

The OTLP spec confirms that unit suffixes are converted and appended before type suffixes. For counters with unit: "inserts", the resulting Prometheus name is correctly shown as {name}_inserts_total, not just {name}_total. Similarly, histograms with unit: "items" correctly become {name}_items_bucket. These patterns are validated by the actual metrics in runsReplicationService.server.ts (lines 149–160), which use these exact naming conventions.

docker/config/grafana/provisioning/dashboards/runs-replication.json

apps/webapp/app/services/runsReplicationService.server.ts

claude · 2026-01-08T15:47:00Z

PR Review: feat(runs): use metrics instead of spans in the Runs Replication service

This PR migrates from tracing spans to OpenTelemetry metrics for the Runs Replication service observability. It adds 8 new metrics (counters and histograms), removes a verbose handle_transaction span, and includes a comprehensive Grafana dashboard.

Code Quality and Best Practices

Strengths:

Well-structured metric initialization with clear descriptions and units
Good use of low-cardinality attributes (event_type, operation, success)
Clean separation between metrics recording and business logic
The createInMemoryMetrics() test helper is well-designed
The otel-metrics.mdc documentation about Prometheus naming conventions is helpful

Suggestions:

Consider grouping metric initialization into a private method to keep the constructor cleaner
Consider extracting test helper functions to the test utils file for reuse

Potential Issues

Missing error counter: Consider adding a runs_replication.insert_errors counter for alerting
Histogram bucket defaults: Consider explicit bucket boundaries matching SLO expectations
Boolean attribute serialization: Worth noting in docs that booleans become string labels in Prometheus

Performance

The migration from spans to metrics is a solid performance improvement with reduced overhead and less storage.

Security

No security concerns. Metrics record only operational data with no PII.

Test Coverage

The new metrics test is comprehensive, covering all 8 metrics. Consider adding a test for insert_retries by mocking failures.

Grafana Dashboard

Well-organized with logical sections, appropriate thresholds, and P50/P95/P99 percentiles.

Verdict: Approve with minor suggestions as optional improvements.

feat(runs): use metrics instead of spans in the Runs Replication service

3a7ed77

coderabbitai bot reviewed Jan 8, 2026

View reviewed changes

docker/config/grafana/provisioning/dashboards/runs-replication.json Outdated Show resolved Hide resolved

matt-aitken approved these changes Jan 8, 2026

View reviewed changes

cursor bot reviewed Jan 8, 2026

View reviewed changes

docker/config/grafana/provisioning/dashboards/runs-replication.json Outdated Show resolved Hide resolved

apps/webapp/app/services/runsReplicationService.server.ts Show resolved Hide resolved

make it clear we are tracking events, not transactions

2cf1434

ericallam merged commit 57ba252 into main Jan 8, 2026
28 of 33 checks passed

ericallam deleted the ea-branch-113 branch January 8, 2026 15:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

feat(runs): use metrics instead of spans in the Runs Replication service #2851

feat(runs): use metrics instead of spans in the Runs Replication service #2851

Uh oh!

ericallam commented Jan 8, 2026

Uh oh!

changeset-bot bot commented Jan 8, 2026 •

edited

Loading

Uh oh!

coderabbitai bot commented Jan 8, 2026 •

edited

Loading

Uh oh!

claude bot commented Jan 8, 2026

Uh oh!

vibe-kanban-cloud bot commented Jan 8, 2026

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

claude bot commented Jan 8, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

feat(runs): use metrics instead of spans in the Runs Replication service #2851

feat(runs): use metrics instead of spans in the Runs Replication service #2851

Uh oh!

Conversation

ericallam commented Jan 8, 2026

Uh oh!

changeset-bot bot commented Jan 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

⚠️ No Changeset found

Uh oh!

coderabbitai bot commented Jan 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Estimated code review effort

Uh oh!

claude bot commented Jan 8, 2026

PR Review: feat(runs): use metrics instead of spans in the Runs Replication service

✅ Strengths

⚠️ Suggestions & Minor Issues

🔒 Security

⚡ Performance

📋 Summary

Uh oh!

vibe-kanban-cloud bot commented Jan 8, 2026

Review Complete

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

claude bot commented Jan 8, 2026

PR Review: feat(runs): use metrics instead of spans in the Runs Replication service

Code Quality and Best Practices

Potential Issues

Performance

Security

Test Coverage

Grafana Dashboard

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

changeset-bot bot commented Jan 8, 2026 •

edited

Loading

coderabbitai bot commented Jan 8, 2026 •

edited

Loading