
Conversation


@AkhileshNegi AkhileshNegi commented Jan 20, 2026

Summary

Target issue is #520

Checklist

Before submitting a pull request, please ensure that you complete these tasks:

  • Ran fastapi run --reload app/main.py or docker compose up in the repository root and tested.
  • If you've fixed a bug or added code, it is tested and has test cases.

Notes

  • Enhanced evaluation score handling to preserve and merge existing scores with newly fetched data, improving data consistency.
  • Standardized evaluation score storage format for better compatibility.

Summary by CodeRabbit

  • New Features

    • Evaluation endpoints reorganized into clearer evaluation and dataset routes for improved API structure.
  • Improvements

    • Scores now use a standardized summary+traces format and introduce typed/structured score data for consistent results.
    • Score fetching now caches and merges previous summary data with fresh fetches to reduce redundant work.
  • Bug Fixes

    • More consistent preservation and merging of summary score data across fetches; clearer behavior when requesting trace info for incomplete evaluations.
  • Tests

    • Tests updated to assert the new summary_scores format.



coderabbitai bot commented Jan 20, 2026

📝 Walkthrough

Walkthrough

Splits the evaluations API router into separate dataset and evaluation routers and standardizes score storage as summary_scores, adding cache-aware fetch-and-merge logic that merges Langfuse-fetched summary_scores/traces with existing summary_scores and persists the merged EvaluationScore.

Changes

Cohort / File(s) Summary
API Routing Reorganization
backend/app/api/main.py, backend/app/api/routes/evaluations/__init__.py, backend/app/api/routes/evaluations/dataset.py, backend/app/api/routes/evaluations/evaluation.py
Removes package-level evaluations APIRouter export; registers dataset and evaluation routers directly. dataset uses prefix="/evaluations/datasets", evaluation uses prefix="/evaluations", both tagged "Evaluation".
Score Schema Types
backend/app/crud/evaluations/score.py, backend/app/crud/evaluations/__init__.py
Adds TypedDict models (TraceScore, TraceData, NumericSummaryScore, CategoricalSummaryScore, SummaryScore, EvaluationScore) and re-exports them at package level.
Fetch / Merge / Persist Logic
backend/app/crud/evaluations/core.py, backend/app/crud/evaluations/langfuse.py, backend/app/services/evaluations/evaluation.py
Switches to typed EvaluationScore returns; introduces cache-aware behavior: extract existing summary_scores, fetch Langfuse score, merge (Langfuse precedence), include traces, and save merged score.
Score Storage Shape Change
backend/app/crud/evaluations/processing.py
Replaces per-item/nested score structure with compact summary_scores array (e.g., cosine_similarity entry with avg, std, total_pairs, data_type).
Tests Updated
backend/app/tests/crud/evaluations/test_processing.py
Adjusts assertions to read summary_scores list and assert the cosine_similarity entry's avg value.
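For illustration, the compact summary_scores storage shape described in the table above might look like the dict below. The field names (avg, std, total_pairs, data_type) come from this PR's change summary; the surrounding structure, including the "name" key and the traces list, is an assumption.

```python
# Illustrative example of the new summary_scores storage shape.
# Values are made up; the real entries are computed from evaluation results.
score = {
    "summary_scores": [
        {
            "name": "cosine_similarity",
            "avg": 0.87,
            "std": 0.05,
            "total_pairs": 42,
            "data_type": "NUMERIC",
        }
    ],
    "traces": [],
}

# A consumer (e.g. the updated tests) can look up a metric entry by name:
cosine = next(
    s for s in score["summary_scores"] if s["name"] == "cosine_similarity"
)
print(cosine["avg"])  # 0.87
```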

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client/API Request
    participant API as API Router
    participant Svc as Evaluation Service
    participant CRUD as CRUD Layer
    participant DB as Database
    participant Langfuse as External API (Langfuse)

    Client->>API: GET /evaluations/:id[?get_trace_info]
    API->>Svc: get_evaluation_with_scores(params)
    Svc->>CRUD: get_evaluation_run(id)
    CRUD->>DB: Query eval_run
    DB-->>CRUD: eval_run (may include cached summary_scores/traces)
    CRUD-->>Svc: eval_run

    alt Need fetch or resync
        Svc->>Langfuse: Fetch traces & summary_scores
        Langfuse-->>Svc: langfuse_score (summary_scores + traces)
        Svc->>Svc: Extract existing_summary_scores from eval_run.score
        Svc->>Svc: Merge existing_summary_scores + langfuse_summary_scores (langfuse precedence)
        Svc->>Svc: Build final score {merged_summary_scores, traces}
        Svc->>CRUD: save_score(eval_run.id, final_score)
        CRUD->>DB: Update eval_run.score
        DB-->>CRUD: Confirm
        CRUD-->>Svc: persisted
    else Cached data sufficient
        Svc->>Svc: Return cached eval_run.score
    end

    Svc-->>Client: Evaluation with merged scores (+ traces if requested)
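The merge step in the diagram can be sketched as below. The function name and the assumption that summary-score entries are keyed by a "name" field are illustrative; the real logic lives in the CRUD and service layers listed in the Changes table.

```python
def merge_summary_scores(existing: list[dict], fetched: list[dict]) -> list[dict]:
    """Merge two lists of summary-score entries keyed by metric name.

    Entries fetched from Langfuse overwrite cached entries with the same
    name, mirroring the "langfuse precedence" rule in the diagram.
    """
    merged = {entry["name"]: entry for entry in existing}
    merged.update({entry["name"]: entry for entry in fetched})
    return list(merged.values())


existing = [{"name": "cosine_similarity", "avg": 0.80}]
fetched = [
    {"name": "cosine_similarity", "avg": 0.87},
    {"name": "accuracy", "avg": 0.91},
]
# The Langfuse value (0.87) wins for cosine_similarity; accuracy is added.
result = merge_summary_scores(existing, fetched)
```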

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

enhancement, ready-for-review

Suggested reviewers

  • Prajna1999

"I hopped through routers, quick and spry,
Merged old scores with Langfuse in the sky.
Traces stitched, summaries all aligned,
Paths split tidy, types well-defined. 🥕"

🚥 Pre-merge checks: ✅ 3 passed

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title 'Evaluation: Fix score format' clearly and specifically describes the main change: fixing the evaluation score format structure throughout the codebase.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which meets the required threshold of 80.00%.


@AkhileshNegi AkhileshNegi marked this pull request as ready for review January 21, 2026 04:49
@AkhileshNegi AkhileshNegi linked an issue Jan 21, 2026 that may be closed by this pull request
@AkhileshNegi AkhileshNegi added the bug Something isn't working label Jan 21, 2026
@AkhileshNegi AkhileshNegi self-assigned this Jan 21, 2026
codecov bot commented Jan 21, 2026

Codecov Report

❌ Patch coverage is 66.66667% with 24 lines in your changes missing coverage. Please review.

Files with missing lines                          Patch %   Missing lines
backend/app/crud/evaluations/core.py              8.33%     11 ⚠️
backend/app/services/evaluations/evaluation.py    43.75%    9 ⚠️
backend/app/crud/evaluations/langfuse.py          20.00%    4 ⚠️


eval_run: EvaluationRun,
langfuse: Langfuse,
force_refetch: bool = False,
) -> dict[str, Any]:
Should we add strict type safety here?
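One way to do that: this PR's TypedDict models already allow tightening the dict[str, Any] return annotation. A minimal sketch follows; the field names here are illustrative, not the actual definitions in backend/app/crud/evaluations/score.py.

```python
from typing import Any, TypedDict


class SummaryScore(TypedDict):
    """Illustrative fields only; the real model may differ."""
    name: str
    avg: float
    data_type: str


class EvaluationScore(TypedDict):
    summary_scores: list[SummaryScore]
    traces: list[Any]


# Annotating the return as EvaluationScore instead of dict[str, Any]
# lets mypy check the payload's shape at every call site.
def example_score() -> EvaluationScore:
    return {"summary_scores": [], "traces": []}
```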

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/crud/evaluations/core.py (1)

227-263: Avoid returning legacy scores without summary_scores.

If eval_run.score has traces but lacks summary_scores (legacy rows), the early return skips the Langfuse merge and returns an incomplete EvaluationScore payload. Treat missing summary_scores as a cache miss so callers always receive the new schema.

✅ Suggested fix
-    has_traces = eval_run.score is not None and "traces" in eval_run.score
-    if not force_refetch and has_traces:
+    has_traces = eval_run.score is not None and "traces" in eval_run.score
+    has_summary_scores = (
+        eval_run.score is not None and "summary_scores" in eval_run.score
+    )
+    if not force_refetch and has_traces and has_summary_scores:
         logger.info(
             f"[get_or_fetch_score] Returning existing score | evaluation_id={eval_run.id}"
         )
         return eval_run.score

@AkhileshNegi AkhileshNegi merged commit d491488 into main Jan 22, 2026
2 of 3 checks passed
@AkhileshNegi AkhileshNegi deleted the hotfix/evaluation-score-format branch January 22, 2026 06:43
@coderabbitai coderabbitai bot mentioned this pull request Jan 22, 2026

Development

Successfully merging this pull request may close these issues.

Evaluation: Fix score format

3 participants