feat: parallelize LLM-as-judge evaluation using asyncio.gather() #3960
base: main
Conversation
Refactored LlmAsJudge.evaluate_invocations() to execute all N×M LLM calls concurrently instead of serially, reducing evaluation time from ~5 minutes to ~1 minute for typical workloads.

Changes:
- Added asyncio import for concurrent execution
- Created _evaluate_single_sample() helper method to encapsulate individual LLM evaluation calls
- Refactored evaluate_invocations() to use asyncio.gather() for parallel execution of all tasks
- Results are grouped by invocation index to preserve aggregation behavior
- Added demo_parallel_performance.py to demonstrate the speedup

Performance:
- 9.98x faster in benchmark (5 invocations × 2 samples)
- All existing tests pass (5 LLM-as-judge + 25 rubric-based evaluator tests)
- 100% backward compatible, no API changes

Resolves: Performance issue with serial LLM evaluation
Tested: pytest tests/unittests/evaluation/test_llm_as_judge.py -v
Tested: pytest tests/unittests/evaluation/test_rubric_based_evaluator.py -v
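For reviewers who want the shape of the change at a glance, here is a minimal, runnable sketch of the concurrency pattern described above. It uses simplified stand-ins rather than the actual ADK code: the real _evaluate_single_sample() calls the judge model and returns PerInvocationResult objects, while the stand-in below only sleeps to simulate latency.

import asyncio


async def _evaluate_single_sample(invocation_idx: int, sample_idx: int) -> str:
  # Stand-in for one LLM judge call; the real method awaits the judge model.
  await asyncio.sleep(0.1)
  return f"verdict for invocation {invocation_idx}, sample {sample_idx}"


async def evaluate_invocations(num_invocations: int, num_samples: int):
  # Build one coroutine per (invocation, sample) pair and remember which
  # invocation each task belongs to.
  invocation_indices = []
  tasks = []
  for inv in range(num_invocations):
    for sample in range(num_samples):
      invocation_indices.append(inv)
      tasks.append(_evaluate_single_sample(inv, sample))

  # All N×M calls run concurrently instead of one after another.
  all_results = await asyncio.gather(*tasks)

  # Regroup the flat result list by invocation index so aggregation
  # behavior is unchanged.
  results_by_invocation = {}
  for idx, result in zip(invocation_indices, all_results):
    results_by_invocation.setdefault(idx, []).append(result)
  return results_by_invocation


print(asyncio.run(evaluate_invocations(num_invocations=5, num_samples=2)))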
Summary of Changes
Hello @aryanpatel2121, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a significant performance enhancement to the LLM-as-judge evaluation system by implementing parallel execution for LLM calls.
Code Review
This pull request significantly improves the performance of LLM-as-judge evaluations by parallelizing the LLM calls using asyncio.gather(). The refactoring is well-structured, isolating the single evaluation logic into a new _evaluate_single_sample helper method, which enhances readability. The inclusion of a performance demonstration script is also a great addition for verifying the speedup. My review includes a few suggestions to further improve code conciseness and add robustness through logging.
In demo_parallel_performance.py:

import asyncio
import time
from typing import Optional

The Optional type is imported from typing but is not used within this file. It's a good practice to remove unused imports to maintain code cleanliness.
In demo_parallel_performance.py:

results_by_invocation = {}
for idx, result in zip(invocation_indices, all_results):
  if idx not in results_by_invocation:
    results_by_invocation[idx] = []
  results_by_invocation[idx].append(result)
The logic for grouping results by invocation index can be made more concise by using dict.setdefault(). This avoids the explicit check for the key's existence and is a common Python idiom for this pattern, improving readability.
results_by_invocation = {}
for idx, result in zip(invocation_indices, all_results):
  results_by_invocation.setdefault(idx, []).append(result)

In src/google/adk/evaluation/llm_as_judge.py:

# This should not be reached for non-streaming calls, but added for safety
return PerInvocationResult(
    actual_invocation=actual,
    expected_invocation=expected,
    eval_status=get_eval_status(None, self._eval_metric.threshold),
)
While it's good to have a safety net for cases where the LLM judge doesn't return a response, this is an unexpected event for non-streaming calls. It would be beneficial to log a warning when this code path is executed. This will help in debugging potential issues with the LLM or the request setup, which might otherwise go unnoticed.
Note: You will need to add import logging at the top of the file.
# This should not be reached for non-streaming calls, but added for safety
logging.warning(
    "LLM judge did not yield a response for a sample evaluation."
)
return PerInvocationResult(
    actual_invocation=actual,
    expected_invocation=expected,
    eval_status=get_eval_status(None, self._eval_metric.threshold),
)

In src/google/adk/evaluation/llm_as_judge.py:

results_by_invocation = {}
for invocation_idx, result in zip(invocation_indices, all_results):
  if invocation_idx not in results_by_invocation:
    results_by_invocation[invocation_idx] = []
  results_by_invocation[invocation_idx].append(result)
This logic for grouping results by invocation can be simplified using dict.setdefault(). This makes the code more concise and is a common Python idiom for this pattern.
results_by_invocation = {}
for invocation_idx, result in zip(invocation_indices, all_results):
  results_by_invocation.setdefault(invocation_idx, []).append(result)
- Thanks for the review! I'll apply all the suggestions.
- Great points on dict.setdefault() and logging. I will implement them.
- Will remove the unused Optional import and add the suggested logging.
That's great to hear, @aryanpatel2121! I appreciate you taking the feedback into consideration. Let me know if you have any questions as you implement the changes.
Hi @aryanpatel2121, thank you for your contribution! We appreciate you taking the time to submit this pull request. After implementing the suggestions, can you also fix the lint errors by running autoformat.sh?
Thank you for the review and the helpful suggestions.
I will implement the requested changes and run autoformat.sh to fix the
lint errors. I expect to update the pull request shortly.
Thanks again for your guidance!
- Fixed import order in llm_as_judge.py
- Fixed line length formatting in llm_as_judge.py
- Removed extra blank lines in contributing samples
Hi @ryanaiagent,

What I did (formatting only):
- Corrected the import order (moved asyncio to the proper position)
- Removed an extra blank line between imports

Please let me know if anything else needs to be adjusted. Thanks again for the review and guidance!
What this change does
This PR improves the performance of LlmAsJudge.evaluate_invocations() by running LLM evaluation calls concurrently instead of executing them one-by-one.
While profiling evaluation runs, I noticed that all LLM calls were executed
serially, which caused evaluation time to scale linearly with the number of
invocations and samples. By switching to asyncio-based concurrency, the same
workload now completes significantly faster without changing any public APIs.
What changed
- Refactored evaluate_invocations() to execute all N×M LLM calls concurrently using asyncio.gather()
- Added a helper method (_evaluate_single_sample()) to keep the logic for individual LLM calls isolated and readable
- Results are grouped by invocation index to preserve aggregation behavior
- Added demo_parallel_performance.py to make the speedup easy to reproduce (a rough timing sketch follows this list)
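The demo script itself is not reproduced here; the sketch below shows one way such a serial-versus-parallel comparison can be timed. The names (fake_judge_call, run_serial, run_parallel) and the 0.2 s latency are illustrative assumptions, not code from the PR.

import asyncio
import time


async def fake_judge_call() -> None:
  # Stand-in for one LLM judge round trip.
  await asyncio.sleep(0.2)


async def run_serial(n: int) -> None:
  for _ in range(n):
    await fake_judge_call()


async def run_parallel(n: int) -> None:
  await asyncio.gather(*(fake_judge_call() for _ in range(n)))


async def main() -> None:
  n = 10  # e.g. 5 invocations × 2 samples
  start = time.perf_counter()
  await run_serial(n)
  serial = time.perf_counter() - start

  start = time.perf_counter()
  await run_parallel(n)
  parallel = time.perf_counter() - start

  print(f"serial:   {serial:.2f}s")
  print(f"parallel: {parallel:.2f}s ({serial / parallel:.1f}x faster)")


asyncio.run(main())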
Why this is useful
For typical workloads, evaluation time drops from ~5 minutes to ~1 minute.
This makes iterative development and experimentation much faster, especially
when working with multiple invocations or samples.
The change is fully backward compatible and does not modify any public
interfaces.
Performance results
- 9.98x faster in the included benchmark (5 invocations × 2 samples)

Testing
- pytest tests/unittests/evaluation/test_llm_as_judge.py -v
- pytest tests/unittests/evaluation/test_rubric_based_evaluator.py -v

Please let me know if you'd prefer this split into smaller commits or adjusted to better match project conventions.