perf: Improve performance of native row-to-columnar transition used by JVM shuffle #3289

Open

andygrove wants to merge 24 commits into apache:main from
Conversation
Use bulk-append methods for primitive types in SparkUnsafeArray:
- Non-nullable path uses append_slice() for an optimal memcpy-style copy
- Nullable path uses pointer iteration with efficient null bitset reading

Supported types: i8, i16, i32, i64, f32, f64, date32, timestamp

Benchmark results (10K elements):

| Type | Baseline | Optimized | Speedup |
|------|----------|-----------|---------|
| i32/no_nulls | 6.08µs | 0.65µs | **9.3x** |
| i32/with_nulls | 22.49µs | 16.21µs | **1.39x** |
| i64/no_nulls | 6.15µs | 1.22µs | **5x** |
| i64/with_nulls | 16.41µs | 16.41µs | 1x |
| f64/no_nulls | 8.05µs | 1.22µs | **6.6x** |
| f64/with_nulls | 16.52µs | 16.21µs | 1.02x |
| date32/no_nulls | - | 0.66µs | ~9x |
| timestamp/no_nulls | - | 1.21µs | ~5x |

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
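The bulk-append fast path is easiest to see with Arrow's builder API. The sketch below is illustrative only (the function name is made up, not the Comet code) and assumes a raw, aligned element pointer such as the one exposed by the Spark unsafe array layout; arrow-rs `PrimitiveBuilder::append_slice` does the bulk copy.

```rust
use arrow::array::{Array, Int32Builder};

/// Illustrative non-nullable fast path (not the actual Comet function):
/// copy the whole element region in one call instead of appending one
/// value at a time.
///
/// SAFETY: `ptr` must point to `len` contiguous, properly aligned i32
/// values that remain valid for the duration of this call.
unsafe fn append_i32_no_nulls(builder: &mut Int32Builder, ptr: *const i32, len: usize) {
    let values = std::slice::from_raw_parts(ptr, len);
    // append_slice performs a bulk, memcpy-style copy into the builder.
    builder.append_slice(values);
}

fn main() {
    let data: Vec<i32> = (0..10_000).collect();
    let mut builder = Int32Builder::with_capacity(data.len());
    unsafe { append_i32_no_nulls(&mut builder, data.as_ptr(), data.len()) };
    assert_eq!(builder.finish().len(), 10_000);
}
```

The nullable path cannot use append_slice directly because null slots must be skipped, which is consistent with the much smaller gains in the with_nulls rows above.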
The #[inline] attribute on functions with loops iterating over thousands of elements provides no benefit: the function call overhead is negligible compared to loop body execution, and inlining large functions causes instruction cache pressure.

Keep #[inline] only on small helper functions:
- get_header_portion_in_bytes (tiny const fn)
- is_null_at (small, hot path)
- null_bitset_ptr (tiny accessor)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
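A rough illustration of that guideline, with made-up names rather than the Comet trait: keep `#[inline]` on tiny bit-twiddling accessors where call overhead is comparable to the work done, and drop it from functions whose loop body dominates.

```rust
/// Tiny, hot accessor: keeping #[inline] is worthwhile because the call
/// overhead is on the same order as the work performed.
#[inline]
fn is_set(bitset: &[u64], idx: usize) -> bool {
    ((bitset[idx >> 6] >> (idx & 63)) & 1) == 1
}

/// Loop over many elements: no #[inline] here, since the loop body dwarfs
/// the call overhead and force-inlining a large body adds i-cache pressure.
fn count_set(bitset: &[u64], len: usize) -> usize {
    (0..len).filter(|&i| is_set(bitset, i)).count()
}

fn main() {
    let bits = vec![0b1011u64];
    assert_eq!(count_set(&bits, 4), 3);
}
```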
- Remove unused ArrayBuilder import
- Use div_ceil() instead of a manual implementation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
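For reference, a minimal example of the `div_ceil()` cleanup, assuming a 64-bit bitset word size purely for illustration:

```rust
fn main() {
    let num_elements: usize = 1001;
    // Manual ceiling division, as previously written by hand:
    let manual = (num_elements + 63) / 64;
    // Equivalent standard-library form (stable since Rust 1.73):
    let idiomatic = num_elements.div_ceil(64);
    assert_eq!(manual, idiomatic);
    println!("words needed for {num_elements} bits: {idiomatic}");
}
```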
Optimize struct field processing in native shuffle by using field-major instead of row-major order. This moves type dispatch from O(rows × fields) to O(fields), eliminating per-row type matching overhead.

Previously, for each row we iterated over all fields and called `append_field()`, which did a type match for EVERY field in EVERY row. For a struct with N fields and M rows, that's N×M type matches.

The new approach:
1. First pass: loop over rows, build struct validity
2. Second pass: for each field, get the typed builder once, then process all rows for that field

This keeps type dispatch at O(fields) instead of O(rows × fields). For complex nested types (struct, list, map), it falls back to the existing `append_field`, since they have their own recursive processing logic.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
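A minimal sketch of the field-major shape described above, using a toy cell enum and Vec-backed builders instead of SparkUnsafeRow and Arrow builders; in the real code the inner loop reads values through typed getters on the unsafe row, so it contains no per-row type match.

```rust
// Toy stand-ins: the real code dispatches on Arrow builders and reads
// values via typed getters on SparkUnsafeRow.
#[derive(Clone, Copy)]
enum Cell {
    I32(i32),
    F64(f64),
}

enum ColumnBuilder {
    I32(Vec<i32>),
    F64(Vec<f64>),
}

/// Field-major: resolve the typed builder once per field, then run a tight
/// loop over all rows for that field.
fn append_field_major(rows: &[Vec<Cell>], builders: &mut [ColumnBuilder]) {
    for (field_idx, builder) in builders.iter_mut().enumerate() {
        // Type dispatch happens here, once per field (O(fields) total)...
        match builder {
            ColumnBuilder::I32(values) => {
                for row in rows {
                    // ...the per-row `if let` below is only an artifact of the
                    // toy Cell enum; the real code uses a typed getter here.
                    if let Cell::I32(v) = row[field_idx] {
                        values.push(v);
                    }
                }
            }
            ColumnBuilder::F64(values) => {
                for row in rows {
                    if let Cell::F64(v) = row[field_idx] {
                        values.push(v);
                    }
                }
            }
        }
    }
}

fn main() {
    let rows = vec![
        vec![Cell::I32(1), Cell::F64(1.5)],
        vec![Cell::I32(2), Cell::F64(2.5)],
    ];
    let mut builders = vec![ColumnBuilder::I32(Vec::new()), ColumnBuilder::F64(Vec::new())];
    append_field_major(&rows, &mut builders);
    if let ColumnBuilder::I32(values) = &builders[0] {
        assert_eq!(values, &[1, 2]);
    }
}
```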
Add a Criterion benchmark to measure the performance of struct column processing in native shuffle. Tests various struct sizes (5, 10, 20 fields) and row counts (1K, 10K rows). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
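A sketch of how such a parameterized Criterion benchmark is typically structured; the conversion under test and the row representation are placeholders, since the real benchmark builds SparkUnsafeRow batches.

```rust
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};

// Placeholder for the row-to-columnar conversion under test.
fn convert(rows: &[Vec<i64>]) -> Vec<Vec<i64>> {
    let num_fields = rows.first().map_or(0, |r| r.len());
    (0..num_fields)
        .map(|f| rows.iter().map(|r| r[f]).collect())
        .collect()
}

fn bench_struct_conversion(c: &mut Criterion) {
    let mut group = c.benchmark_group("struct_conversion");
    for &num_fields in &[5usize, 10, 20] {
        for &num_rows in &[1_000usize, 10_000] {
            // Build one input per (field count, row count) combination.
            let rows: Vec<Vec<i64>> = (0..num_rows)
                .map(|i| (0..num_fields).map(|f| (i * num_fields + f) as i64).collect())
                .collect();
            let id = BenchmarkId::new(format!("{num_fields}_fields"), num_rows);
            group.bench_with_input(id, &rows, |b, rows| b.iter(|| convert(rows)));
        }
    }
    group.finish();
}

criterion_group!(benches, bench_struct_conversion);
criterion_main!(benches);
```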
This extends the field-major optimization (from commit 471fb2a) to recursively handle nested Struct fields. Previously, nested structs fell back to row-major processing via `append_field`, losing the benefit of field-major processing at each nesting level.

Changes:
- Add `append_nested_struct_fields_field_major` helper function that recursively processes nested struct fields using field-major order
- Update `append_struct_fields_field_major` to use field-major processing for nested Struct fields instead of falling back to `append_field`
- Add benchmarks for 2-level and 3-level nested structs

The optimization:
1. Gets the nested StructBuilder once per field
2. Builds nested struct validity in one pass
3. Recursively applies field-major processing to nested struct fields

List and Map fields continue to fall back to `append_field` since they have variable-length elements that are harder to optimize.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Codecov Report

✅ All modified and coverable lines are covered by tests.

@@             Coverage Diff              @@
##               main    #3289      +/-   ##
============================================
+ Coverage     56.12%   60.13%   +4.00%
- Complexity      976     1468     +492
============================================
  Files           119      175      +56
  Lines         11743    16085    +4342
  Branches       2251     2665     +414
============================================
+ Hits           6591     9672    +3081
- Misses         4012     5066    +1054
- Partials       1140     1347     +207

☔ View full report in Codecov by Sentry.
Add batched processing for List and Map columns that moves type dispatch outside the row loop, similar to the struct field-major optimization.

Changes:
- Add `append_list_column_batch`, which dispatches on the element type once, then processes all rows with the typed builder
- Add `append_map_column_batch`, which dispatches on the key/value types once, with optimized paths for common combinations (Int64/Int64, Int32/Int32, etc.)
- Update `append_columns` to use the new batch functions
- Add benchmark for List<Int64> column conversion

The optimization:
- List columns: type dispatch goes from O(rows) to O(1) for primitive elements
- Map columns: type dispatch goes from O(rows × 2) to O(2) for primitive keys/values
- Complex element types fall back to per-row dispatch

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
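A sketch of the dispatch-once shape for a List<Int64> column, using Arrow's ListBuilder directly; here the element data comes from plain Options rather than SparkUnsafeArray, only the Int64 arm is shown, and the function name mirrors the commit but the body is illustrative.

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, Int64Builder, ListBuilder};
use arrow::datatypes::DataType;

/// Sketch only: dispatch on the element type once per column, then run the
/// row loop with a concrete typed builder.
fn append_list_column_batch(rows: &[Option<Vec<i64>>], element_type: &DataType) -> ArrayRef {
    match element_type {
        DataType::Int64 => {
            let mut builder = ListBuilder::new(Int64Builder::new());
            for row in rows {
                match row {
                    Some(values) => {
                        // Bulk-append the element values, then close the list entry.
                        builder.values().append_slice(values);
                        builder.append(true);
                    }
                    None => builder.append(false),
                }
            }
            Arc::new(builder.finish())
        }
        // Other primitive element types get their own typed arm; complex
        // element types fall back to per-row dispatch in the real code.
        other => unimplemented!("sketch does not cover element type {other:?}"),
    }
}

fn main() {
    let rows = vec![Some(vec![1, 2, 3]), None, Some(vec![4])];
    let array = append_list_column_batch(&rows, &DataType::Int64);
    assert_eq!(array.len(), 3);
}
```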
Adds a Criterion benchmark for Map<Int64, Int64> conversion to ensure the batched map column processing optimization is covered by benchmarks.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Rename benchmark file to better reflect its scope (struct, list, map)
- Fix incorrect comment: "native shuffle" -> "JVM shuffle"

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Rename complex_type_conversion.rs to jvm_shuffle.rs
- Rename array_conversion.rs to array_element_append.rs
- Merge the row_columnar.rs primitive benchmark into jvm_shuffle.rs
- Delete the redundant row_columnar.rs
- Update comments to clarify that these benchmark the JVM shuffle path

The jvm_shuffle benchmark now covers:
- Primitive types (100 Int64 columns)
- Struct (flat, nested, deeply nested)
- List
- Map

The array_element_append benchmark is a micro-benchmark for the inner loop of array element iteration.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add macro `impl_append_to_builder` to generate bulk append methods, reducing ~190 lines of duplicated unsafe code in list.rs
- Add comprehensive safety documentation to the SparkUnsafeObject trait explaining memory layout invariants and JVM ownership guarantees
- Add safety documentation to the append_columns function
- Add #[inline] annotations to trait accessor methods for better optimization
- Keep unsafe pointer iteration for performance (benchmarks show a 7-14% regression with the safe accessor approach for some types)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
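The macro-generation idea, sketched with illustrative names; the real `impl_append_to_builder` in list.rs works against SparkUnsafeArray and carries the full safety contract described above.

```rust
use arrow::array::{Array, Float64Builder, Int32Builder, Int64Builder};

/// Illustrative stand-in for the unsafe array handle used in the real code.
struct RawColumn {
    ptr: *const u8,
    len: usize,
}

// Generate one bulk-append function per (native type, builder) pair instead
// of hand-writing near-identical unsafe bodies.
macro_rules! impl_append_to_builder {
    ($fn_name:ident, $native:ty, $builder:ty) => {
        // SAFETY contract (sketch): callers must guarantee `col.ptr` points to
        // `col.len` contiguous, properly aligned values of the native type
        // that stay valid for the duration of the call.
        unsafe fn $fn_name(col: &RawColumn, builder: &mut $builder) {
            let values = std::slice::from_raw_parts(col.ptr as *const $native, col.len);
            builder.append_slice(values);
        }
    };
}

impl_append_to_builder!(append_i32, i32, Int32Builder);
impl_append_to_builder!(append_i64, i64, Int64Builder);
impl_append_to_builder!(append_f64, f64, Float64Builder);

fn main() {
    let data: Vec<i64> = vec![1, 2, 3];
    let col = RawColumn {
        ptr: data.as_ptr() as *const u8,
        len: data.len(),
    };
    let mut builder = Int64Builder::new();
    unsafe { append_i64(&col, &mut builder) };
    assert_eq!(builder.finish().len(), 3);
}
```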
Replace .unwrap() and .expect() calls on builder downcasts with proper error handling that returns CometError::Internal with descriptive messages including:
- The expected type
- The actual type (via type_id for downcast_builder_ref)
- The field index (for get_field_builder)

Added two macros:
- downcast_builder_ref!: returns Result with type mismatch details
- get_field_builder!: returns Result with field index and expected type

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
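A simplified sketch of the error-returning downcast; `CometError::Internal` is replaced here with a plain String error, and the macro body is illustrative rather than the exact one in the PR.

```rust
use std::any::Any;

use arrow::array::{ArrayBuilder, Int64Builder, StringBuilder};

// Downcast a `&mut dyn ArrayBuilder` to a concrete builder type, producing a
// descriptive error instead of panicking when the types do not match.
macro_rules! downcast_builder_ref {
    ($builder:expr, $ty:ty) => {{
        let builder = $builder;
        let actual = builder.as_any().type_id();
        builder.as_any_mut().downcast_mut::<$ty>().ok_or_else(move || {
            format!(
                "expected builder of type {}, found builder with type id {:?}",
                stringify!($ty),
                actual
            )
        })
    }};
}

fn main() -> Result<(), String> {
    let mut builders: Vec<Box<dyn ArrayBuilder>> =
        vec![Box::new(Int64Builder::new()), Box::new(StringBuilder::new())];

    // Matching downcast: append through the typed builder.
    let int_builder = downcast_builder_ref!(builders[0].as_mut(), Int64Builder)?;
    int_builder.append_value(42);

    // Mismatched downcast: surfaces an error rather than panicking.
    assert!(downcast_builder_ref!(builders[1].as_mut(), Int64Builder).is_err());
    Ok(())
}
```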
Add SAFETY comments to:
- SparkUnsafeRow::is_null_at and set_not_null_at
- SparkUnsafeArray::new and is_null_at
- Batch processing functions (append_list_column_batch, append_map_column_batch, append_struct_fields_field_major)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace repeated unsafe pointer dereference patterns with a macro that encapsulates the safety invariants. This reduces code duplication and centralizes the safety documentation.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
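An illustration of that pattern; the actual macro and its invariants live in the Comet row/list modules, and the names here are made up.

```rust
// Wrap the repeated unsafe read in one macro so the safety argument is
// written (and reviewed) in a single place.
macro_rules! read_element {
    ($ptr:expr, $idx:expr, $ty:ty) => {
        // SAFETY (sketch): callers guarantee `$ptr` points to a live, aligned
        // buffer of the target type and that `$idx` is within the element
        // count read from the array header.
        unsafe { *($ptr as *const $ty).add($idx) }
    };
}

fn main() {
    let data: Vec<i64> = vec![10, 20, 30];
    let base = data.as_ptr() as *const u8;
    let value: i64 = read_element!(base, 2, i64);
    assert_eq!(value, 30);
}
```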
Use expect with the builder type name to provide better error messages if a downcast fails.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Use if-let instead of is_some() + unwrap() in update_metrics
- Use the local root_op variable directly instead of re-reading it from exec_context

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
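For reference, a tiny before/after sketch of the if-let cleanup; `ExecContext`, `root_op`, and `update_metrics` here are simplified stand-ins, not the real Comet types.

```rust
struct ExecContext {
    root_op: Option<String>,
}

fn update_metrics(ctx: &ExecContext) {
    // Before: if ctx.root_op.is_some() { let op = ctx.root_op.as_ref().unwrap(); ... }
    // After: bind and test in one step.
    if let Some(op) = &ctx.root_op {
        println!("updating metrics for {op}");
    }
}

fn main() {
    update_metrics(&ExecContext { root_op: Some("ScanExec".to_string()) });
    update_metrics(&ExecContext { root_op: None });
}
```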
This PR combines numerous related optimizations for the row-to-columnar conversion that happens in JVM shuffle for row-based inputs (Spark-native plans).
Benchmarks are in a separate PR: #3290
Struct Conversion Benchmarks
Flat Structs (Struct)
2-Level Nested Structs (Struct<Struct>)
3-Level Nested Structs (Struct<Struct<Struct>>)
Array Conversion Benchmarks (10K elements)
Note: Array benchmarks don't exist on main, so the comparison is based on the original PR benchmarks showing 5-9x speedup for non-nullable arrays.
Summary
Struct Processing Improvements
The improvement is more pronounced with:
Why the Improvement?
- Field-major order: type dispatch happens once per field instead of once per row per field
- Better cache locality: processing all rows for one field before moving to the next field keeps data in the CPU cache
- Recursive optimization: for nested structs, the field-major approach is applied at each nesting level, compounding the benefits