@beastoin beastoin commented Feb 10, 2026

Summary

Closes #4712 (sub-issue 1/4 of #4651)

  • Remove explicit detect_language() Google Cloud API calls — the translate API already auto-detects source language for free in every call, but we were paying $20/M chars for redundant explicit detection
  • Return TranslationResult from translate_text() with detected_language_code captured from the translate response (zero extra cost)
  • Fix brittle same-language check — replace translated_text == segment_text with detected_language_code == translation_language (Google normalizes punctuation/spacing, causing false negatives)
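
As a rough illustration of the second and third bullets, the sketch below shows one way the result type and the same-language check could look. The field names follow the PR text; the `None` default and the helper `should_skip_translation` are assumptions made for this example, not code from the PR.

```python
from typing import NamedTuple, Optional


class TranslationResult(NamedTuple):
    """Translated text plus the source language reported by the translate call."""
    text: str
    # Assumed default: None when the translate response carries no detection.
    detected_language_code: Optional[str] = None


def should_skip_translation(result: TranslationResult, translation_language: str) -> bool:
    """Hypothetical helper: True when the segment is already in the target language."""
    # Compare language codes rather than strings. The API may normalize
    # punctuation or spacing, so text equality can fail for same-language input.
    return result.detected_language_code == translation_language
```

With this shape, a missing detection (`None`) never equals the target language, so the translation is kept by default.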

Impact

~43% reduction in Google Translate API spend — every foreign-language segment was being billed 2x (once for detect, once for translate).

Changes

| File | Change |
| --- | --- |
| backend/utils/translation.py | Remove _detect_with_google_cloud(), add TranslationResult NamedTuple, translate_text() returns TranslationResult with detected language, translate_text_by_sentence() aggregates detection across sentences (all must agree), detect_language() uses only free langdetect |
| backend/utils/translation_cache.py | Remove unused split_into_sentences import |
| backend/routers/transcribe.py | Use detected_language_code == translation_language instead of text equality |
| backend/tests/unit/test_translation.py | 32 unit tests across 8 test classes |

Test plan

  • 32 unit tests covering:
    • TranslationResult NamedTuple fields and defaults
    • detect_language: only uses langdetect (no Google Cloud API), caching, empty/non-lexical/unreliable handling
    • translate_text: returns TranslationResult, cache hit, error fallback, empty detection → None
    • translate_text_by_sentence: empty text, all-same detection, partial detection → None, mixed → None
    • split_into_sentences: empty, single, multiple, commas, whitespace
    • Detection cache: hit avoids langdetect call, eviction at MAX_DETECTION_CACHE_SIZE
    • TranscriptSegmentLanguageCache: foreign detection, sticky false, empty text, delete, undetectable, repeated calls dedup
    • Cache key normalization: cleaned text as key, None not cached
  • Existing voice_message_language tests pass (no regression)
  • Verify zero detect_language API calls in Cloud Console after deploy
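
The detection-cache behaviors listed above (a hit avoids the detector call, eviction at a size limit, `None` results not cached) can be sketched roughly as follows. This is a minimal stand-in, not the PR's implementation: the class `DetectionCache`, the injected `detector` callable, the oldest-first eviction policy, and the small limit are all assumptions for illustration.

```python
from collections import OrderedDict
from typing import Callable, Optional

# Assumed value for the sketch; the real limit is not stated in the PR.
MAX_DETECTION_CACHE_SIZE = 3


class DetectionCache:
    """Hypothetical bounded cache in front of a local language detector."""

    def __init__(
        self,
        detector: Callable[[str], Optional[str]],
        max_size: int = MAX_DETECTION_CACHE_SIZE,
    ) -> None:
        self._detector = detector
        self._max_size = max_size
        self._entries: "OrderedDict[str, str]" = OrderedDict()

    def detect(self, text: str) -> Optional[str]:
        key = text.strip()  # cleaned text is the cache key
        if not key:
            return None  # empty input is never sent to the detector
        if key in self._entries:
            return self._entries[key]  # cache hit: no detector call
        language = self._detector(key)
        if language is None:
            return None  # undetectable results are not cached
        if len(self._entries) >= self._max_size:
            self._entries.popitem(last=False)  # evict the oldest entry
        self._entries[key] = language
        return language
```

Injecting the detector keeps the sketch testable without the real library; a unit test can pass a counting stub and assert how often it was called.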

Risks / edge cases

  • langdetect is less accurate than Google Cloud for very short text (≤5 words) — but this is a pre-filter only; translation still happens and catches it via detected_language_code
  • Mixed-language sentences: detected_language_code is None when sentences disagree, so translation is preserved (safe default)
  • Aggregation requires ALL sentences to have non-null detection AND agree — prevents false skips
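
The aggregation rule in the last two bullets can be sketched as a small pure function. The name `aggregate_detected_language` is hypothetical; only the rule itself (every sentence must have a non-null code, and all codes must agree) comes from the PR text.

```python
from typing import Iterable, Optional


def aggregate_detected_language(codes: Iterable[Optional[str]]) -> Optional[str]:
    """Return the shared language code, or None if any is missing or they differ."""
    codes = list(codes)
    # No sentences, or any sentence without a detection: stay undecided.
    if not codes or any(code is None for code in codes):
        return None
    first = codes[0]
    # Only report a language when every sentence agrees on it.
    return first if all(code == first for code in codes) else None
```

Returning `None` in every ambiguous case means the caller's check against `translation_language` fails, so the translation is kept rather than skipped.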

by AI for @beastoin

Remove explicit Google Cloud detect_language() calls — the translate API
already auto-detects source language for free. This eliminates ~43% of
Translate API spend (every foreign-language segment was billed 2x).

Changes:
- Remove _detect_with_google_cloud() from translation.py
- detect_language() now uses only free langdetect library
- translate_text() returns TranslationResult(text, detected_language_code)
  capturing the free detection from the translate response
- translate_text_by_sentence() aggregates detected language across sentences
- transcribe.py: replace brittle translated_text == segment_text check with
  detected_language_code == translation_language
- Clean up unused import in translation_cache.py
- Add 16 unit tests covering all new behavior

Co-Authored-By: Claude Opus 4.6 <[email protected]>

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request makes excellent progress on optimizing API costs by removing redundant language detection calls and improving the logic for handling same-language text. The introduction of TranslationResult is a clean way to pass more data from the translation service. However, the review identifies critical issues regarding concurrency and blocking I/O. The translation methods use synchronous network calls, which will block the asyncio event loop, and shared caches are accessed without proper locking, creating race conditions. Addressing these is crucial for application stability and performance.
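
One common way to address both concerns raised in the review is to run the synchronous call in a worker thread and guard the shared cache with a lock. The sketch below is illustrative only and is not taken from the PR: `AsyncTranslator` and the injected `blocking_translate` callable are hypothetical, and `asyncio.to_thread` requires Python 3.9 or later.

```python
import asyncio
import threading
from typing import Callable, Dict, Tuple


class AsyncTranslator:
    """Hypothetical wrapper around a blocking translate function."""

    def __init__(self, blocking_translate: Callable[[str, str], str]) -> None:
        self._blocking_translate = blocking_translate
        self._cache: Dict[Tuple[str, str], str] = {}
        self._lock = threading.Lock()  # guards every access to _cache

    def _translate_cached(self, text: str, target: str) -> str:
        key = (text, target)
        with self._lock:
            if key in self._cache:
                return self._cache[key]
        # The lock is released during the network call so other threads
        # are not serialized behind it.
        translated = self._blocking_translate(text, target)
        with self._lock:
            self._cache[key] = translated
        return translated

    async def translate(self, text: str, target: str) -> str:
        # Run the blocking work in a thread so the event loop stays responsive.
        return await asyncio.to_thread(self._translate_cached, text, target)
```

A trade-off of this design: two concurrent requests for the same uncached text may both reach the API. Holding the lock across the call would prevent that at the cost of serializing all translations.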

Address review feedback: the aggregation logic now requires ALL sentences
to return a non-null detected_language_code AND agree, preventing a single
detected sentence from skipping translation when others failed detection.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@beastoin

Addressed review feedback (commit 6f858e9):

  1. Fixed aggregation logic: detected_language_code is now set only when ALL sentences return a non-null code AND they all match. Previously a single detected sentence could be treated as authoritative when others had no detection.

  2. Verified no other callers: translate_text() and translate_text_by_sentence() are only called internally within TranslationService and transcribe.py, both already updated. No external callers affected.

Added test test_partial_detection_returns_none to cover this edge case. All 17 tests pass.

by AI for @beastoin

beastoin and others added 2 commits February 10, 2026 09:28
…, split_into_sentences

Address tester feedback — 12 new tests covering:
- split_into_sentences boundary cases (empty, single, commas, whitespace)
- detection cache hit avoids re-calling langdetect
- detection cache eviction at MAX_DETECTION_CACHE_SIZE
- TranscriptSegmentLanguageCache: sticky false, empty text, delete, undetectable

29 total tests, all passing.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…tests

Address final tester feedback — 3 additional tests:
- Repeated is_in_target_language calls hit sticky cache (no re-detect)
- Cache key uses cleaned text when remove_non_lexical=True
- Undetectable (None) results are not cached

32 total tests, all passing.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@beastoin

Live local dev test results (real Google Cloud Translate API):

| Test | Input | Output | Detected | Result |
| --- | --- | --- | --- | --- |
| Spanish → EN | "Hola, ¿cómo estás?" | "Hello how are you?" | es | PASS |
| Japanese → EN | "こんにちは世界" | "Hello World" | ja | PASS |
| Same-language | "Hello world, how are you?" | "Hello world, how are you?" | en | PASS (skip) |
| Multi-sentence FR | "Bonjour le monde. Comment allez-vous?" | "Hello world. How are you doing?" | fr | PASS |
| Cache hit | repeat of test 1 | cached | | PASS (0 API calls) |

API call summary:

  • translate_text calls: 5
  • detect_language calls: 0 (confirmed eliminated)

by AI for @beastoin

Development

Successfully merging this pull request may close these issues.

perf: eliminate redundant detect_language API calls (sub-issue 1/4 of #4651)