
perf: Improve performance of native row-to-columnar transition used by JVM shuffle#3289

Open
andygrove wants to merge 24 commits into apache:main from andygrove:shuffle-complex-type-perf

Conversation

@andygrove (Member) commented Jan 26, 2026

This PR combines numerous related optimizations for the row-to-columnar conversion that happens in JVM shuffle for row-based inputs (Spark-native plans).

Benchmarks are in a separate PR: #3290

Struct Conversion Benchmarks

Flat Structs (Struct)

| Benchmark | Main Branch | Optimized Branch | Speedup |
|-----------|-------------|------------------|---------|
| fields_5/rows_1000 | 77.51 µs | 65.19 µs | 1.19x |
| fields_5/rows_10000 | 382.51 µs | 253.79 µs | 1.51x |
| fields_10/rows_1000 | 109.82 µs | 87.36 µs | 1.26x |
| fields_10/rows_10000 | 687.35 µs | 443.10 µs | 1.55x |
| fields_20/rows_1000 | 183.76 µs | 136.20 µs | 1.35x |
| fields_20/rows_10000 | 1349.5 µs | 873.29 µs | 1.55x |

2-Level Nested Structs (Struct<Struct>)

| Benchmark | Main Branch | Optimized Branch | Speedup |
|-----------|-------------|------------------|---------|
| inner_fields_5/rows_1000 | 86.72 µs | 71.60 µs | 1.21x |
| inner_fields_5/rows_10000 | 470.14 µs | 275.88 µs | 1.70x |
| inner_fields_10/rows_1000 | 125.19 µs | 92.97 µs | 1.35x |
| inner_fields_10/rows_10000 | 842.73 µs | 489.55 µs | 1.72x |
| inner_fields_20/rows_1000 | 201.20 µs | 145.00 µs | 1.39x |
| inner_fields_20/rows_10000 | 1472.3 µs | 908.24 µs | 1.62x |

3-Level Nested Structs (Struct<Struct<Struct>>)

| Benchmark | Main Branch | Optimized Branch | Speedup |
|-----------|-------------|------------------|---------|
| inner_fields_5/rows_1000 | 95.64 µs | 76.28 µs | 1.25x |
| inner_fields_5/rows_10000 | 504.01 µs | 332.28 µs | 1.52x |
| inner_fields_10/rows_1000 | 132.42 µs | 96.37 µs | 1.37x |
| inner_fields_10/rows_10000 | 867.63 µs | 511.80 µs | 1.70x |
| inner_fields_20/rows_1000 | 209.67 µs | 150.69 µs | 1.39x |
| inner_fields_20/rows_10000 | 1535.0 µs | 963.04 µs | 1.59x |

Array Conversion Benchmarks (10K elements)

| Type | Optimized Branch | Notes |
|------|------------------|-------|
| i32/no_nulls | 0.67 µs | Bulk memcpy via append_slice() |
| i32/with_nulls | 16.26 µs | Pointer iteration with null check |
| i64/no_nulls | 1.24 µs | Bulk memcpy via append_slice() |
| i64/with_nulls | 16.48 µs | Pointer iteration with null check |
| f64/no_nulls | 1.24 µs | Bulk memcpy via append_slice() |
| f64/with_nulls | 16.21 µs | Pointer iteration with null check |
| date32/no_nulls | 0.63 µs | Bulk memcpy via append_slice() |
| date32/with_nulls | 16.14 µs | Pointer iteration with null check |
| timestamp/no_nulls | 1.26 µs | Bulk memcpy via append_slice() |
| timestamp/with_nulls | 16.47 µs | Pointer iteration with null check |

Note: Array benchmarks don't exist on main, so the comparison is based on the original PR benchmarks, which showed a 5-9x speedup for non-nullable arrays. The two code paths named in the Notes column are sketched below.
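
A minimal sketch of those two paths, assuming arrow's `Int32Builder`; the `element_slice` and `is_null_at` helpers are hypothetical stand-ins for the `SparkUnsafeArray` accessors, not the PR's actual functions:

```rust
use arrow::array::Int32Builder;

// Hypothetical stand-ins for the SparkUnsafeArray element/null accessors.
fn element_slice(values: &[i32]) -> &[i32] {
    values
}
fn is_null_at(nulls: &[bool], i: usize) -> bool {
    nulls[i]
}

// Fast path (no nulls): a single bulk, memcpy-style copy of the element region.
fn append_no_nulls(builder: &mut Int32Builder, values: &[i32]) {
    builder.append_slice(element_slice(values));
}

// Slow path (with nulls): per-element iteration with a null check.
fn append_with_nulls(builder: &mut Int32Builder, values: &[i32], nulls: &[bool]) {
    for (i, v) in values.iter().enumerate() {
        if is_null_at(nulls, i) {
            builder.append_null();
        } else {
            builder.append_value(*v);
        }
    }
}
```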

Summary

Struct Processing Improvements

  • Flat structs: 1.19x - 1.55x speedup
  • 2-level nested: 1.21x - 1.72x speedup
  • 3-level nested: 1.25x - 1.70x speedup

The improvement is more pronounced with:

  • More rows (1.5x-1.7x for 10K rows vs 1.2x-1.4x for 1K rows)
  • More nesting levels (benefit compounds at each level)

Why the Improvement?

  1. Field-major order: Type dispatch happens once per field instead of once per row per field (see the sketch after this list)

    • Main branch: O(rows × fields) type matches
    • Optimized: O(fields) type matches
  2. Better cache locality: Processing all rows for one field before moving to the next field keeps the data in the CPU cache

  3. Recursive optimization: For nested structs, the field-major approach is applied at each nesting level, compounding the benefits
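
To make the dispatch-count difference concrete, here is a minimal sketch of the two loop shapes. Only `Int32` is handled, and the `Row` accessor is a hypothetical stand-in for `SparkUnsafeRow`; this is not the PR's actual code.

```rust
use arrow::array::{ArrayBuilder, Int32Builder};
use arrow::datatypes::DataType;

// Hypothetical row accessor standing in for SparkUnsafeRow.
struct Row;
impl Row {
    fn get_i32(&self, _field: usize) -> i32 {
        0
    }
}

// Row-major (main branch shape): the DataType match and builder downcast run
// rows × fields times.
fn row_major(rows: &[Row], field_types: &[DataType], builders: &mut [Box<dyn ArrayBuilder>]) {
    for row in rows {
        for (idx, dt) in field_types.iter().enumerate() {
            match dt {
                DataType::Int32 => {
                    let b = builders[idx]
                        .as_any_mut()
                        .downcast_mut::<Int32Builder>()
                        .expect("Int32Builder");
                    b.append_value(row.get_i32(idx));
                }
                _ => unimplemented!("only Int32 is handled in this sketch"),
            }
        }
    }
}

// Field-major (optimized shape): the match and downcast run once per field.
fn field_major(rows: &[Row], field_types: &[DataType], builders: &mut [Box<dyn ArrayBuilder>]) {
    for (idx, dt) in field_types.iter().enumerate() {
        match dt {
            DataType::Int32 => {
                let b = builders[idx]
                    .as_any_mut()
                    .downcast_mut::<Int32Builder>()
                    .expect("Int32Builder");
                for row in rows {
                    b.append_value(row.get_i32(idx));
                }
            }
            _ => unimplemented!("only Int32 is handled in this sketch"),
        }
    }
}
```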

andygrove and others added 9 commits January 20, 2026 11:31
Use bulk-append methods for primitive types in SparkUnsafeArray:
- Non-nullable path uses append_slice() for optimal memcpy-style copy
- Nullable path uses pointer iteration with efficient null bitset reading

Supported types: i8, i16, i32, i64, f32, f64, date32, timestamp

Benchmark results (10K elements):

| Type | Baseline | Optimized | Speedup |
|------|----------|-----------|---------|
| i32/no_nulls | 6.08µs | 0.65µs | **9.3x** |
| i32/with_nulls | 22.49µs | 16.21µs | **1.39x** |
| i64/no_nulls | 6.15µs | 1.22µs | **5x** |
| i64/with_nulls | 16.41µs | 16.41µs | 1x |
| f64/no_nulls | 8.05µs | 1.22µs | **6.6x** |
| f64/with_nulls | 16.52µs | 16.21µs | 1.02x |
| date32/no_nulls | - | 0.66µs | ~9x |
| timestamp/no_nulls | - | 1.21µs | ~5x |

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The #[inline] attribute on functions with loops iterating over thousands
of elements provides no benefit - the function call overhead is negligible
compared to loop body execution, and inlining large functions causes
instruction cache pressure.

Keep #[inline] only on small helper functions:
- get_header_portion_in_bytes (tiny const fn)
- is_null_at (small, hot path)
- null_bitset_ptr (tiny accessor)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove unused ArrayBuilder import
- Use div_ceil() instead of manual implementation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Optimize struct field processing in native shuffle by using field-major
instead of row-major order. This moves type dispatch from O(rows × fields)
to O(fields), eliminating per-row type matching overhead.

Previously, for each row we iterated over all fields and called
`append_field()` which did a type match for EVERY field in EVERY row.
For a struct with N fields and M rows, that's N×M type matches.

The new approach:
1. First pass: Loop over rows, build struct validity
2. Second pass: For each field, get typed builder once, then process
   all rows for that field

This keeps type dispatch at O(fields) instead of O(rows × fields).

For complex nested types (struct, list, map), falls back to existing
`append_field` since they have their own recursive processing logic.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
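
A minimal sketch of the two-pass shape described above, using arrow's `StructBuilder`. The `Row` type is a hypothetical stand-in for `SparkUnsafeRow`, and every field is assumed to be `Int32` to keep the example short:

```rust
use arrow::array::{Int32Builder, StructBuilder};

// Hypothetical row accessor standing in for SparkUnsafeRow.
struct Row;
impl Row {
    fn is_null_at(&self, _field: usize) -> bool {
        false
    }
    fn get_i32(&self, _field: usize) -> i32 {
        0
    }
}

fn append_struct_fields_field_major(builder: &mut StructBuilder, rows: &[Row], num_fields: usize) {
    // Pass 1: build struct-level validity, one append per row.
    for _row in rows {
        builder.append(true); // struct value assumed non-null in this sketch
    }
    // Pass 2: resolve the typed field builder once per field, then process all rows.
    for field_idx in 0..num_fields {
        let field_builder = builder
            .field_builder::<Int32Builder>(field_idx)
            .expect("Int32Builder");
        for row in rows {
            if row.is_null_at(field_idx) {
                field_builder.append_null();
            } else {
                field_builder.append_value(row.get_i32(field_idx));
            }
        }
    }
}
```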
Add a Criterion benchmark to measure the performance of struct column
processing in native shuffle. Tests various struct sizes (5, 10, 20 fields)
and row counts (1K, 10K rows).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This extends the field-major optimization (from commit 471fb2a) to
recursively handle nested Struct fields. Previously, nested structs fell
back to row-major processing via `append_field`, losing the benefit of
field-major processing at each nesting level.

Changes:
- Add `append_nested_struct_fields_field_major` helper function that
  recursively processes nested struct fields using field-major order
- Update `append_struct_fields_field_major` to use field-major processing
  for nested Struct fields instead of falling back to `append_field`
- Add benchmarks for 2-level and 3-level nested structs

The optimization:
1. Gets the nested StructBuilder once per field
2. Builds nested struct validity in one pass
3. Recursively applies field-major processing to nested struct fields

List and Map fields continue to fall back to `append_field` since they
have variable-length elements that are harder to optimize.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
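
The recursive step has roughly the following shape. `append_struct_fields_field_major` is a stub standing in for the routine sketched earlier, and `Row` is again a hypothetical stand-in for `SparkUnsafeRow`:

```rust
use arrow::array::StructBuilder;
use arrow::datatypes::DataType;

// Hypothetical row accessor standing in for SparkUnsafeRow.
struct Row;

// Stub for the field-major routine sketched earlier.
fn append_struct_fields_field_major(_builder: &mut StructBuilder, _rows: &[Row], _num_fields: usize) {}

// When a field is itself a struct, fetch the nested StructBuilder once and
// apply the same field-major processing at that nesting level.
fn append_nested_struct_fields_field_major(
    parent: &mut StructBuilder,
    field_idx: usize,
    field_type: &DataType,
    rows: &[Row],
) {
    if let DataType::Struct(nested_fields) = field_type {
        let nested = parent
            .field_builder::<StructBuilder>(field_idx)
            .expect("StructBuilder");
        append_struct_fields_field_major(nested, rows, nested_fields.len());
    }
    // List and Map fields fall back to the per-row `append_field` path.
}
```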
@codecov-commenter commented Jan 26, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 60.13%. Comparing base (f09f8af) to head (4c5eb0b).
⚠️ Report is 898 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #3289      +/-   ##
============================================
+ Coverage     56.12%   60.13%   +4.00%     
- Complexity      976     1468     +492     
============================================
  Files           119      175      +56     
  Lines         11743    16085    +4342     
  Branches       2251     2665     +414     
============================================
+ Hits           6591     9672    +3081     
- Misses         4012     5066    +1054     
- Partials       1140     1347     +207     


andygrove and others added 2 commits January 26, 2026 10:19
Add batched processing for List and Map columns that moves type dispatch
outside the row loop, similar to the struct field-major optimization.

Changes:
- Add `append_list_column_batch` that dispatches on element type once,
  then processes all rows with the typed builder
- Add `append_map_column_batch` that dispatches on key/value types once,
  with optimized paths for common combinations (Int64/Int64, Int32/Int32, etc.)
- Update `append_columns` to use the new batch functions
- Add benchmark for List<Int64> column conversion

The optimization:
- List columns: Type dispatch goes from O(rows) to O(1) for primitive elements
- Map columns: Type dispatch goes from O(rows × 2) to O(2) for primitive key/values
- Complex element types fall back to per-row dispatch

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
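
A minimal sketch of the batched shape for a `List<Int64>` column, using arrow's `ListBuilder`. The `Row` accessor is a hypothetical stand-in for `SparkUnsafeRow`/`SparkUnsafeArray`, not the PR's actual code:

```rust
use arrow::array::{ArrayBuilder, Int64Builder, ListBuilder};
use arrow::datatypes::DataType;

// Hypothetical row accessor standing in for SparkUnsafeRow / SparkUnsafeArray.
struct Row;
impl Row {
    fn get_array_i64(&self, _col: usize) -> Vec<i64> {
        vec![]
    }
}

// Batched shape: the match on the element type (and the builder downcast)
// happens once per column, not once per row.
fn append_list_column_batch(
    builder: &mut dyn ArrayBuilder,
    element_type: &DataType,
    rows: &[Row],
    col: usize,
) {
    match element_type {
        DataType::Int64 => {
            let list_builder = builder
                .as_any_mut()
                .downcast_mut::<ListBuilder<Int64Builder>>()
                .expect("ListBuilder<Int64Builder>");
            for row in rows {
                list_builder.values().append_slice(&row.get_array_i64(col));
                list_builder.append(true);
            }
        }
        // Complex element types fall back to per-row dispatch in the real code.
        _ => unimplemented!("only Int64 elements are handled in this sketch"),
    }
}
```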
Adds criterion benchmark for Map<Int64, Int64> conversion to ensure
the batched map column processing optimization is covered by benchmarks.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@andygrove andygrove changed the title perf: Improve shuffle performance with complex types [WIP] perf: Improve JVM shuffle performance with complex types [WIP] Jan 26, 2026
andygrove and others added 2 commits January 26, 2026 10:34
- Rename benchmark file to better reflect its scope (struct, list, map)
- Fix incorrect comment: "native shuffle" -> "JVM shuffle"

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Rename complex_type_conversion.rs to jvm_shuffle.rs
- Rename array_conversion.rs to array_element_append.rs
- Merge row_columnar.rs primitive benchmark into jvm_shuffle.rs
- Delete redundant row_columnar.rs
- Update comments to clarify that these benchmark the JVM shuffle path

The jvm_shuffle benchmark now covers:
- Primitive types (100 Int64 columns)
- Struct (flat, nested, deeply nested)
- List
- Map

The array_element_append benchmark is a micro-benchmark for the
inner loop of array element iteration.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
andygrove and others added 11 commits January 26, 2026 10:46
- Add macro `impl_append_to_builder` to generate bulk append methods,
  reducing ~190 lines of duplicated unsafe code in list.rs
- Add comprehensive safety documentation to SparkUnsafeObject trait
  explaining memory layout invariants and JVM ownership guarantees
- Add safety documentation to append_columns function
- Add #[inline] annotations to trait accessor methods for better optimization
- Keep unsafe pointer iteration for performance (benchmarks show 7-14%
  regression with safe accessor approach for some types)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
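
A minimal sketch of a code-generating macro in the spirit of `impl_append_to_builder` (the function names and the exact shape are assumptions, not the PR's macro):

```rust
use arrow::array::{Float64Builder, Int32Builder, Int64Builder};

// Generate a bulk-append helper for one (native type, builder type) pair,
// so the unsafe/bulk copy logic is written once instead of per type.
macro_rules! impl_append_to_builder {
    ($fn_name:ident, $native:ty, $builder:ty) => {
        /// Bulk-append a slice of native values to the matching Arrow builder.
        fn $fn_name(builder: &mut $builder, values: &[$native]) {
            builder.append_slice(values);
        }
    };
}

impl_append_to_builder!(append_i32, i32, Int32Builder);
impl_append_to_builder!(append_i64, i64, Int64Builder);
impl_append_to_builder!(append_f64, f64, Float64Builder);
```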
Replace .unwrap() and .expect() calls on builder downcasts with proper
error handling that returns CometError::Internal with descriptive
messages including:
- The expected type
- The actual type (via type_id for downcast_builder_ref)
- The field index (for get_field_builder)

Added two macros:
- downcast_builder_ref!: returns Result with type mismatch details
- get_field_builder!: returns Result with field index and expected type

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
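
A sketch of the downcast macro's shape: it returns a `Result` with a descriptive message instead of panicking. The simplified `InternalError` type stands in for `CometError::Internal`, and the exact message format is assumed:

```rust
use arrow::array::{ArrayBuilder, Int32Builder};

// Simplified error type standing in for CometError::Internal.
#[derive(Debug)]
struct InternalError(String);

// Downcast a builder reference, reporting the expected type on failure.
macro_rules! downcast_builder_ref {
    ($ty:ty, $builder:expr) => {
        $builder
            .as_any_mut()
            .downcast_mut::<$ty>()
            .ok_or_else(|| {
                InternalError(format!(
                    "expected {} but the builder had a different type",
                    stringify!($ty)
                ))
            })
    };
}

fn example(builder: &mut dyn ArrayBuilder) -> Result<(), InternalError> {
    let int_builder: &mut Int32Builder = downcast_builder_ref!(Int32Builder, builder)?;
    int_builder.append_value(42);
    Ok(())
}
```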
Add SAFETY comments to:
- SparkUnsafeRow::is_null_at and set_not_null_at
- SparkUnsafeArray::new and is_null_at
- Batch processing functions (append_list_column_batch,
  append_map_column_batch, append_struct_fields_field_major)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace repeated unsafe pointer dereference patterns with a macro
that encapsulates the safety invariants. This reduces code duplication
and centralizes the safety documentation.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
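
A sketch of a macro that centralizes an unsafe unaligned pointer read and its safety contract; the macro name, offset convention, and wording of the SAFETY comment are assumptions, not the PR's exact code:

```rust
macro_rules! read_unaligned_at {
    ($base:expr, $byte_offset:expr, $ty:ty) => {{
        // SAFETY: the caller guarantees that `$base + $byte_offset` stays inside
        // the row/array region owned by the JVM for the duration of this call,
        // and that the bytes at that offset encode a valid `$ty`.
        unsafe { ($base.add($byte_offset) as *const $ty).read_unaligned() }
    }};
}

fn read_third_i64(data: &[u8]) -> i64 {
    assert!(data.len() >= 3 * std::mem::size_of::<i64>());
    let base = data.as_ptr();
    read_unaligned_at!(base, 2 * std::mem::size_of::<i64>(), i64)
}
```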
Use expect with the builder type name to provide better error messages
if a downcast fails.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Use if-let instead of is_some() + unwrap() in update_metrics
- Use local root_op variable directly instead of re-reading from exec_context

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@andygrove andygrove changed the title perf: Improve JVM shuffle performance with complex types [WIP] perf: Improve JVM shuffle performance with complex types Jan 26, 2026
@andygrove andygrove changed the title perf: Improve JVM shuffle performance with complex types perf: Improve performance of native row-to-columnar transition used by JVM shuffle Jan 26, 2026
@andygrove andygrove marked this pull request as ready for review January 26, 2026 21:44
@mbutrovich mbutrovich self-requested a review January 28, 2026 14:55
@andygrove andygrove added this to the 0.14.0 milestone Feb 6, 2026
