
Conversation


@AkhileshNegi AkhileshNegi commented Jan 20, 2026

Summary

Target issue is #520

Checklist

Before submitting a pull request, please ensure that you complete these tasks:

  • Ran fastapi run --reload app/main.py or docker compose up in the repository root and tested.
  • If you've fixed a bug or added code, it is tested and has test cases.

Notes

  • Enhanced evaluation score handling to preserve and merge existing scores with newly fetched data, improving data consistency.
  • Standardized evaluation score storage format for better compatibility.

Summary by CodeRabbit

  • New Features

    • Evaluation endpoints reorganized into clearer evaluation and dataset routes for improved API structure.
  • Improvements

    • Scores now use a standardized summary+traces format and introduce typed/structured score data for consistent results.
    • Score fetching now caches and merges previous summary data with fresh fetches to reduce redundant work.
  • Bug Fixes

    • More consistent preservation and merging of summary score data across fetches; clearer behavior when requesting trace info for incomplete evaluations.
  • Tests

    • Tests updated to assert the new summary_scores format.



coderabbitai bot commented Jan 20, 2026

📝 Walkthrough

Walkthrough

Splits the evaluations API router into separate dataset and evaluation routers and standardizes score storage as summary_scores, adding cache-aware fetch-and-merge logic that merges Langfuse-fetched summary_scores/traces with existing summary_scores and persists the merged EvaluationScore.

Changes

Cohort / File(s) Summary
API Routing Reorganization
backend/app/api/main.py, backend/app/api/routes/evaluations/__init__.py, backend/app/api/routes/evaluations/dataset.py, backend/app/api/routes/evaluations/evaluation.py
Removes package-level evaluations APIRouter export; registers dataset and evaluation routers directly. dataset uses prefix="/evaluations/datasets", evaluation uses prefix="/evaluations", both tagged "Evaluation".
Score Schema Types
backend/app/crud/evaluations/score.py, backend/app/crud/evaluations/__init__.py
Adds TypedDict models (TraceScore, TraceData, NumericSummaryScore, CategoricalSummaryScore, SummaryScore, EvaluationScore) and re-exports them at package level.
Fetch / Merge / Persist Logic
backend/app/crud/evaluations/core.py, backend/app/crud/evaluations/langfuse.py, backend/app/services/evaluations/evaluation.py
Switches to typed EvaluationScore returns; introduces cache-aware behavior: extract existing summary_scores, fetch Langfuse score, merge (Langfuse precedence), include traces, and save merged score.
Score Storage Shape Change
backend/app/crud/evaluations/processing.py
Replaces per-item/nested score structure with compact summary_scores array (e.g., cosine_similarity entry with avg, std, total_pairs, data_type).
Tests Updated
backend/app/tests/crud/evaluations/test_processing.py
Adjusts assertions to read summary_scores list and assert the cosine_similarity entry's avg value.
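For illustration, the compact summary_scores storage shape described in the table above might look like the dict below. The field names (avg, std, total_pairs, data_type) come from this PR's change summary; the surrounding structure, including the "name" key and the traces list, is an assumption.

```python
# Illustrative example of the new summary_scores storage shape.
# Values are made up; the real entries are computed from evaluation results.
score = {
    "summary_scores": [
        {
            "name": "cosine_similarity",
            "avg": 0.87,
            "std": 0.05,
            "total_pairs": 42,
            "data_type": "NUMERIC",
        }
    ],
    "traces": [],
}

# A consumer (e.g. the updated tests) can look up a metric entry by name:
cosine = next(
    s for s in score["summary_scores"] if s["name"] == "cosine_similarity"
)
print(cosine["avg"])  # 0.87
```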

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client/API Request
    participant API as API Router
    participant Svc as Evaluation Service
    participant CRUD as CRUD Layer
    participant DB as Database
    participant Langfuse as External API (Langfuse)

    Client->>API: GET /evaluations/:id[?get_trace_info]
    API->>Svc: get_evaluation_with_scores(params)
    Svc->>CRUD: get_evaluation_run(id)
    CRUD->>DB: Query eval_run
    DB-->>CRUD: eval_run (may include cached summary_scores/traces)
    CRUD-->>Svc: eval_run

    alt Need fetch or resync
        Svc->>Langfuse: Fetch traces & summary_scores
        Langfuse-->>Svc: langfuse_score (summary_scores + traces)
        Svc->>Svc: Extract existing_summary_scores from eval_run.score
        Svc->>Svc: Merge existing_summary_scores + langfuse_summary_scores (langfuse precedence)
        Svc->>Svc: Build final score {merged_summary_scores, traces}
        Svc->>CRUD: save_score(eval_run.id, final_score)
        CRUD->>DB: Update eval_run.score
        DB-->>CRUD: Confirm
        CRUD-->>Svc: persisted
    else Cached data sufficient
        Svc->>Svc: Return cached eval_run.score
    end

    Svc-->>Client: Evaluation with merged scores (+ traces if requested)
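The merge step in the diagram can be sketched as below. The function name and the assumption that summary-score entries are keyed by a "name" field are illustrative; the real logic lives in the CRUD and service layers listed in the Changes table.

```python
def merge_summary_scores(existing: list[dict], fetched: list[dict]) -> list[dict]:
    """Merge two lists of summary-score entries keyed by metric name.

    Entries fetched from Langfuse overwrite cached entries with the same
    name, mirroring the "langfuse precedence" rule in the diagram.
    """
    merged = {entry["name"]: entry for entry in existing}
    merged.update({entry["name"]: entry for entry in fetched})
    return list(merged.values())


existing = [{"name": "cosine_similarity", "avg": 0.80}]
fetched = [
    {"name": "cosine_similarity", "avg": 0.87},
    {"name": "accuracy", "avg": 0.91},
]
# The Langfuse value (0.87) wins for cosine_similarity; accuracy is added.
result = merge_summary_scores(existing, fetched)
```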

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

enhancement, ready-for-review

Suggested reviewers

  • Prajna1999

"I hopped through routers, quick and spry,
Merged old scores with Langfuse in the sky.
Traces stitched, summaries all aligned,
Paths split tidy, types well-defined. 🥕"

🚥 Pre-merge checks: ✅ 3 passed

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title 'Evaluation: Fix score format' clearly and specifically describes the main change: fixing the evaluation score format structure throughout the codebase.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which meets the required threshold of 80.00%.


@AkhileshNegi AkhileshNegi marked this pull request as ready for review January 21, 2026 04:49
@AkhileshNegi AkhileshNegi linked an issue Jan 21, 2026 that may be closed by this pull request
@AkhileshNegi AkhileshNegi added the bug Something isn't working label Jan 21, 2026
@AkhileshNegi AkhileshNegi self-assigned this Jan 21, 2026
codecov bot commented Jan 21, 2026

Codecov Report

❌ Patch coverage is 66.66667% with 24 lines in your changes missing coverage. Please review.

Files with missing lines                          Patch %   Missing lines
backend/app/crud/evaluations/core.py              8.33%     11 ⚠️
backend/app/services/evaluations/evaluation.py    43.75%    9 ⚠️
backend/app/crud/evaluations/langfuse.py          20.00%    4 ⚠️


eval_run: EvaluationRun,
langfuse: Langfuse,
force_refetch: bool = False,
) -> dict[str, Any]:
Should we add strict type safety here?
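One way to do that: this PR's TypedDict models already allow tightening the dict[str, Any] return annotation. A minimal sketch follows; the field names here are illustrative, not the actual definitions in backend/app/crud/evaluations/score.py.

```python
from typing import Any, TypedDict


class SummaryScore(TypedDict):
    """Illustrative fields only; the real model may differ."""
    name: str
    avg: float
    data_type: str


class EvaluationScore(TypedDict):
    summary_scores: list[SummaryScore]
    traces: list[Any]


# Annotating the return as EvaluationScore instead of dict[str, Any]
# lets mypy check the payload's shape at every call site.
def example_score() -> EvaluationScore:
    return {"summary_scores": [], "traces": []}
```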

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/crud/evaluations/core.py (1)

227-263: Avoid returning legacy scores without summary_scores.

If eval_run.score has traces but lacks summary_scores (legacy rows), the early return skips the Langfuse merge and returns an incomplete EvaluationScore payload. Treat missing summary_scores as a cache miss so callers always receive the new schema.

✅ Suggested fix
-    has_traces = eval_run.score is not None and "traces" in eval_run.score
-    if not force_refetch and has_traces:
+    has_traces = eval_run.score is not None and "traces" in eval_run.score
+    has_summary_scores = (
+        eval_run.score is not None and "summary_scores" in eval_run.score
+    )
+    if not force_refetch and has_traces and has_summary_scores:
         logger.info(
             f"[get_or_fetch_score] Returning existing score | evaluation_id={eval_run.id}"
         )
         return eval_run.score

@AkhileshNegi AkhileshNegi merged commit d491488 into main Jan 22, 2026
2 of 3 checks passed
@AkhileshNegi AkhileshNegi deleted the hotfix/evaluation-score-format branch January 22, 2026 06:43
@coderabbitai coderabbitai bot mentioned this pull request Jan 22, 2026

Development

Successfully merging this pull request may close these issues.

Evaluation: Fix score format

3 participants