
perf: Improve performance of native row-to-columnar transition used by JVM shuffle#3289

Open
andygrove wants to merge 24 commits into apache:main from andygrove:shuffle-complex-type-perf

Conversation

@andygrove (Member) commented Jan 26, 2026

This PR combines numerous related optimizations for the row-to-columnar conversion that happens in JVM shuffle for row-based inputs (Spark-native plans).

Benchmarks are in a separate PR: #3290

Struct Conversion Benchmarks

Flat Structs (Struct)

| Benchmark | Main Branch | Optimized Branch | Speedup |
|-----------|-------------|------------------|---------|
| fields_5/rows_1000 | 77.51 µs | 65.19 µs | 1.19x |
| fields_5/rows_10000 | 382.51 µs | 253.79 µs | 1.51x |
| fields_10/rows_1000 | 109.82 µs | 87.36 µs | 1.26x |
| fields_10/rows_10000 | 687.35 µs | 443.10 µs | 1.55x |
| fields_20/rows_1000 | 183.76 µs | 136.20 µs | 1.35x |
| fields_20/rows_10000 | 1349.5 µs | 873.29 µs | 1.55x |

2-Level Nested Structs (Struct<Struct>)

| Benchmark | Main Branch | Optimized Branch | Speedup |
|-----------|-------------|------------------|---------|
| inner_fields_5/rows_1000 | 86.72 µs | 71.60 µs | 1.21x |
| inner_fields_5/rows_10000 | 470.14 µs | 275.88 µs | 1.70x |
| inner_fields_10/rows_1000 | 125.19 µs | 92.97 µs | 1.35x |
| inner_fields_10/rows_10000 | 842.73 µs | 489.55 µs | 1.72x |
| inner_fields_20/rows_1000 | 201.20 µs | 145.00 µs | 1.39x |
| inner_fields_20/rows_10000 | 1472.3 µs | 908.24 µs | 1.62x |

3-Level Nested Structs (Struct<Struct<Struct>>)

| Benchmark | Main Branch | Optimized Branch | Speedup |
|-----------|-------------|------------------|---------|
| inner_fields_5/rows_1000 | 95.64 µs | 76.28 µs | 1.25x |
| inner_fields_5/rows_10000 | 504.01 µs | 332.28 µs | 1.52x |
| inner_fields_10/rows_1000 | 132.42 µs | 96.37 µs | 1.37x |
| inner_fields_10/rows_10000 | 867.63 µs | 511.80 µs | 1.70x |
| inner_fields_20/rows_1000 | 209.67 µs | 150.69 µs | 1.39x |
| inner_fields_20/rows_10000 | 1535.0 µs | 963.04 µs | 1.59x |

Array Conversion Benchmarks (10K elements)

| Type | Optimized Branch | Notes |
|------|------------------|-------|
| i32/no_nulls | 0.67 µs | Bulk memcpy via append_slice() |
| i32/with_nulls | 16.26 µs | Pointer iteration with null check |
| i64/no_nulls | 1.24 µs | Bulk memcpy via append_slice() |
| i64/with_nulls | 16.48 µs | Pointer iteration with null check |
| f64/no_nulls | 1.24 µs | Bulk memcpy via append_slice() |
| f64/with_nulls | 16.21 µs | Pointer iteration with null check |
| date32/no_nulls | 0.63 µs | Bulk memcpy via append_slice() |
| date32/with_nulls | 16.14 µs | Pointer iteration with null check |
| timestamp/no_nulls | 1.26 µs | Bulk memcpy via append_slice() |
| timestamp/with_nulls | 16.47 µs | Pointer iteration with null check |

Note: Array benchmarks don't exist on main, so the comparison is based on the original PR benchmarks, which showed a 5-9x speedup for non-nullable arrays. The two code paths named in the Notes column are sketched below.
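
A minimal sketch of those two paths, assuming arrow's `Int32Builder`; the `element_slice` and `is_null_at` helpers are hypothetical stand-ins for the `SparkUnsafeArray` accessors, not the PR's actual functions:

```rust
use arrow::array::Int32Builder;

// Hypothetical stand-ins for the SparkUnsafeArray element/null accessors.
fn element_slice(values: &[i32]) -> &[i32] {
    values
}
fn is_null_at(nulls: &[bool], i: usize) -> bool {
    nulls[i]
}

// Fast path (no nulls): a single bulk, memcpy-style copy of the element region.
fn append_no_nulls(builder: &mut Int32Builder, values: &[i32]) {
    builder.append_slice(element_slice(values));
}

// Slow path (with nulls): per-element iteration with a null check.
fn append_with_nulls(builder: &mut Int32Builder, values: &[i32], nulls: &[bool]) {
    for (i, v) in values.iter().enumerate() {
        if is_null_at(nulls, i) {
            builder.append_null();
        } else {
            builder.append_value(*v);
        }
    }
}
```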

Summary

Struct Processing Improvements

  • Flat structs: 1.19x - 1.55x speedup
  • 2-level nested: 1.21x - 1.72x speedup
  • 3-level nested: 1.25x - 1.70x speedup

The improvement is more pronounced with:

  • More rows (1.5x-1.7x for 10K rows vs 1.2x-1.4x for 1K rows)
  • More nesting levels (benefit compounds at each level)

Why the Improvement?

  1. Field-major order: Type dispatch happens once per field instead of once per row per field (see the sketch after this list)

    • Main branch: O(rows × fields) type matches
    • Optimized: O(fields) type matches
  2. Better cache locality: Processing all rows for one field before moving to the next field keeps the data in the CPU cache

  3. Recursive optimization: For nested structs, the field-major approach is applied at each nesting level, compounding the benefits
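
To make the dispatch-count difference concrete, here is a minimal sketch of the two loop shapes. Only `Int32` is handled, and the `Row` accessor is a hypothetical stand-in for `SparkUnsafeRow`; this is not the PR's actual code.

```rust
use arrow::array::{ArrayBuilder, Int32Builder};
use arrow::datatypes::DataType;

// Hypothetical row accessor standing in for SparkUnsafeRow.
struct Row;
impl Row {
    fn get_i32(&self, _field: usize) -> i32 {
        0
    }
}

// Row-major (main branch shape): the DataType match and builder downcast run
// rows × fields times.
fn row_major(rows: &[Row], field_types: &[DataType], builders: &mut [Box<dyn ArrayBuilder>]) {
    for row in rows {
        for (idx, dt) in field_types.iter().enumerate() {
            match dt {
                DataType::Int32 => {
                    let b = builders[idx]
                        .as_any_mut()
                        .downcast_mut::<Int32Builder>()
                        .expect("Int32Builder");
                    b.append_value(row.get_i32(idx));
                }
                _ => unimplemented!("only Int32 is handled in this sketch"),
            }
        }
    }
}

// Field-major (optimized shape): the match and downcast run once per field.
fn field_major(rows: &[Row], field_types: &[DataType], builders: &mut [Box<dyn ArrayBuilder>]) {
    for (idx, dt) in field_types.iter().enumerate() {
        match dt {
            DataType::Int32 => {
                let b = builders[idx]
                    .as_any_mut()
                    .downcast_mut::<Int32Builder>()
                    .expect("Int32Builder");
                for row in rows {
                    b.append_value(row.get_i32(idx));
                }
            }
            _ => unimplemented!("only Int32 is handled in this sketch"),
        }
    }
}
```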

andygrove and others added 9 commits January 20, 2026 11:31
Use bulk-append methods for primitive types in SparkUnsafeArray:
- Non-nullable path uses append_slice() for optimal memcpy-style copy
- Nullable path uses pointer iteration with efficient null bitset reading

Supported types: i8, i16, i32, i64, f32, f64, date32, timestamp

Benchmark results (10K elements):

| Type | Baseline | Optimized | Speedup |
|------|----------|-----------|---------|
| i32/no_nulls | 6.08µs | 0.65µs | **9.3x** |
| i32/with_nulls | 22.49µs | 16.21µs | **1.39x** |
| i64/no_nulls | 6.15µs | 1.22µs | **5x** |
| i64/with_nulls | 16.41µs | 16.41µs | 1x |
| f64/no_nulls | 8.05µs | 1.22µs | **6.6x** |
| f64/with_nulls | 16.52µs | 16.21µs | 1.02x |
| date32/no_nulls | - | 0.66µs | ~9x |
| timestamp/no_nulls | - | 1.21µs | ~5x |

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The #[inline] attribute on functions with loops iterating over thousands
of elements provides no benefit - the function call overhead is negligible
compared to loop body execution, and inlining large functions causes
instruction cache pressure.

Keep #[inline] only on small helper functions:
- get_header_portion_in_bytes (tiny const fn)
- is_null_at (small, hot path)
- null_bitset_ptr (tiny accessor)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove unused ArrayBuilder import
- Use div_ceil() instead of manual implementation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Optimize struct field processing in native shuffle by using field-major
instead of row-major order. This moves type dispatch from O(rows × fields)
to O(fields), eliminating per-row type matching overhead.

Previously, for each row we iterated over all fields and called
`append_field()` which did a type match for EVERY field in EVERY row.
For a struct with N fields and M rows, that's N×M type matches.

The new approach:
1. First pass: Loop over rows, build struct validity
2. Second pass: For each field, get typed builder once, then process
   all rows for that field

This keeps type dispatch at O(fields) instead of O(rows × fields).

For complex nested types (struct, list, map), falls back to existing
`append_field` since they have their own recursive processing logic.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
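
A minimal sketch of the two-pass shape described above, using arrow's `StructBuilder`. The `Row` type is a hypothetical stand-in for `SparkUnsafeRow`, and every field is assumed to be `Int32` to keep the example short:

```rust
use arrow::array::{Int32Builder, StructBuilder};

// Hypothetical row accessor standing in for SparkUnsafeRow.
struct Row;
impl Row {
    fn is_null_at(&self, _field: usize) -> bool {
        false
    }
    fn get_i32(&self, _field: usize) -> i32 {
        0
    }
}

fn append_struct_fields_field_major(builder: &mut StructBuilder, rows: &[Row], num_fields: usize) {
    // Pass 1: build struct-level validity, one append per row.
    for _row in rows {
        builder.append(true); // struct value assumed non-null in this sketch
    }
    // Pass 2: resolve the typed field builder once per field, then process all rows.
    for field_idx in 0..num_fields {
        let field_builder = builder
            .field_builder::<Int32Builder>(field_idx)
            .expect("Int32Builder");
        for row in rows {
            if row.is_null_at(field_idx) {
                field_builder.append_null();
            } else {
                field_builder.append_value(row.get_i32(field_idx));
            }
        }
    }
}
```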
Add a Criterion benchmark to measure the performance of struct column
processing in native shuffle. Tests various struct sizes (5, 10, 20 fields)
and row counts (1K, 10K rows).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This extends the field-major optimization (from commit 471fb2a) to
recursively handle nested Struct fields. Previously, nested structs fell
back to row-major processing via `append_field`, losing the benefit of
field-major processing at each nesting level.

Changes:
- Add `append_nested_struct_fields_field_major` helper function that
  recursively processes nested struct fields using field-major order
- Update `append_struct_fields_field_major` to use field-major processing
  for nested Struct fields instead of falling back to `append_field`
- Add benchmarks for 2-level and 3-level nested structs

The optimization:
1. Gets the nested StructBuilder once per field
2. Builds nested struct validity in one pass
3. Recursively applies field-major processing to nested struct fields

List and Map fields continue to fall back to `append_field` since they
have variable-length elements that are harder to optimize.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
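
The recursive step has roughly the following shape. `append_struct_fields_field_major` is a stub standing in for the routine sketched earlier, and `Row` is again a hypothetical stand-in for `SparkUnsafeRow`:

```rust
use arrow::array::StructBuilder;
use arrow::datatypes::DataType;

// Hypothetical row accessor standing in for SparkUnsafeRow.
struct Row;

// Stub for the field-major routine sketched earlier.
fn append_struct_fields_field_major(_builder: &mut StructBuilder, _rows: &[Row], _num_fields: usize) {}

// When a field is itself a struct, fetch the nested StructBuilder once and
// apply the same field-major processing at that nesting level.
fn append_nested_struct_fields_field_major(
    parent: &mut StructBuilder,
    field_idx: usize,
    field_type: &DataType,
    rows: &[Row],
) {
    if let DataType::Struct(nested_fields) = field_type {
        let nested = parent
            .field_builder::<StructBuilder>(field_idx)
            .expect("StructBuilder");
        append_struct_fields_field_major(nested, rows, nested_fields.len());
    }
    // List and Map fields fall back to the per-row `append_field` path.
}
```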
@codecov-commenter commented Jan 26, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 60.13%. Comparing base (f09f8af) to head (4c5eb0b).
⚠️ Report is 898 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #3289      +/-   ##
============================================
+ Coverage     56.12%   60.13%   +4.00%     
- Complexity      976     1468     +492     
============================================
  Files           119      175      +56     
  Lines         11743    16085    +4342     
  Branches       2251     2665     +414     
============================================
+ Hits           6591     9672    +3081     
- Misses         4012     5066    +1054     
- Partials       1140     1347     +207     


andygrove and others added 2 commits January 26, 2026 10:19
Add batched processing for List and Map columns that moves type dispatch
outside the row loop, similar to the struct field-major optimization.

Changes:
- Add `append_list_column_batch` that dispatches on element type once,
  then processes all rows with the typed builder
- Add `append_map_column_batch` that dispatches on key/value types once,
  with optimized paths for common combinations (Int64/Int64, Int32/Int32, etc.)
- Update `append_columns` to use the new batch functions
- Add benchmark for List<Int64> column conversion

The optimization:
- List columns: Type dispatch goes from O(rows) to O(1) for primitive elements
- Map columns: Type dispatch goes from O(rows × 2) to O(2) for primitive key/values
- Complex element types fall back to per-row dispatch

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
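
A minimal sketch of the batched shape for a `List<Int64>` column, using arrow's `ListBuilder`. The `Row` accessor is a hypothetical stand-in for `SparkUnsafeRow`/`SparkUnsafeArray`, not the PR's actual code:

```rust
use arrow::array::{ArrayBuilder, Int64Builder, ListBuilder};
use arrow::datatypes::DataType;

// Hypothetical row accessor standing in for SparkUnsafeRow / SparkUnsafeArray.
struct Row;
impl Row {
    fn get_array_i64(&self, _col: usize) -> Vec<i64> {
        vec![]
    }
}

// Batched shape: the match on the element type (and the builder downcast)
// happens once per column, not once per row.
fn append_list_column_batch(
    builder: &mut dyn ArrayBuilder,
    element_type: &DataType,
    rows: &[Row],
    col: usize,
) {
    match element_type {
        DataType::Int64 => {
            let list_builder = builder
                .as_any_mut()
                .downcast_mut::<ListBuilder<Int64Builder>>()
                .expect("ListBuilder<Int64Builder>");
            for row in rows {
                list_builder.values().append_slice(&row.get_array_i64(col));
                list_builder.append(true);
            }
        }
        // Complex element types fall back to per-row dispatch in the real code.
        _ => unimplemented!("only Int64 elements are handled in this sketch"),
    }
}
```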
Adds criterion benchmark for Map<Int64, Int64> conversion to ensure
the batched map column processing optimization is covered by benchmarks.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@andygrove andygrove changed the title perf: Improve shuffle performance with complex types [WIP] perf: Improve JVM shuffle performance with complex types [WIP] Jan 26, 2026
andygrove and others added 2 commits January 26, 2026 10:34
- Rename benchmark file to better reflect its scope (struct, list, map)
- Fix incorrect comment: "native shuffle" -> "JVM shuffle"

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Rename complex_type_conversion.rs to jvm_shuffle.rs
- Rename array_conversion.rs to array_element_append.rs
- Merge row_columnar.rs primitive benchmark into jvm_shuffle.rs
- Delete redundant row_columnar.rs
- Update comments to clarify that these benchmark the JVM shuffle path

The jvm_shuffle benchmark now covers:
- Primitive types (100 Int64 columns)
- Struct (flat, nested, deeply nested)
- List
- Map

The array_element_append benchmark is a micro-benchmark for the
inner loop of array element iteration.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
andygrove and others added 11 commits January 26, 2026 10:46
- Add macro `impl_append_to_builder` to generate bulk append methods,
  reducing ~190 lines of duplicated unsafe code in list.rs
- Add comprehensive safety documentation to SparkUnsafeObject trait
  explaining memory layout invariants and JVM ownership guarantees
- Add safety documentation to append_columns function
- Add #[inline] annotations to trait accessor methods for better optimization
- Keep unsafe pointer iteration for performance (benchmarks show 7-14%
  regression with safe accessor approach for some types)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
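
A minimal sketch of a code-generating macro in the spirit of `impl_append_to_builder` (the function names and the exact shape are assumptions, not the PR's macro):

```rust
use arrow::array::{Float64Builder, Int32Builder, Int64Builder};

// Generate a bulk-append helper for one (native type, builder type) pair,
// so the unsafe/bulk copy logic is written once instead of per type.
macro_rules! impl_append_to_builder {
    ($fn_name:ident, $native:ty, $builder:ty) => {
        /// Bulk-append a slice of native values to the matching Arrow builder.
        fn $fn_name(builder: &mut $builder, values: &[$native]) {
            builder.append_slice(values);
        }
    };
}

impl_append_to_builder!(append_i32, i32, Int32Builder);
impl_append_to_builder!(append_i64, i64, Int64Builder);
impl_append_to_builder!(append_f64, f64, Float64Builder);
```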
Replace .unwrap() and .expect() calls on builder downcasts with proper
error handling that returns CometError::Internal with descriptive
messages including:
- The expected type
- The actual type (via type_id for downcast_builder_ref)
- The field index (for get_field_builder)

Added two macros:
- downcast_builder_ref!: returns Result with type mismatch details
- get_field_builder!: returns Result with field index and expected type

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
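
A sketch of the downcast macro's shape: it returns a `Result` with a descriptive message instead of panicking. The simplified `InternalError` type stands in for `CometError::Internal`, and the exact message format is assumed:

```rust
use arrow::array::{ArrayBuilder, Int32Builder};

// Simplified error type standing in for CometError::Internal.
#[derive(Debug)]
struct InternalError(String);

// Downcast a builder reference, reporting the expected type on failure.
macro_rules! downcast_builder_ref {
    ($ty:ty, $builder:expr) => {
        $builder
            .as_any_mut()
            .downcast_mut::<$ty>()
            .ok_or_else(|| {
                InternalError(format!(
                    "expected {} but the builder had a different type",
                    stringify!($ty)
                ))
            })
    };
}

fn example(builder: &mut dyn ArrayBuilder) -> Result<(), InternalError> {
    let int_builder: &mut Int32Builder = downcast_builder_ref!(Int32Builder, builder)?;
    int_builder.append_value(42);
    Ok(())
}
```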
Add SAFETY comments to:
- SparkUnsafeRow::is_null_at and set_not_null_at
- SparkUnsafeArray::new and is_null_at
- Batch processing functions (append_list_column_batch,
  append_map_column_batch, append_struct_fields_field_major)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace repeated unsafe pointer dereference patterns with a macro
that encapsulates the safety invariants. This reduces code duplication
and centralizes the safety documentation.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
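
A sketch of a macro that centralizes an unsafe unaligned pointer read and its safety contract; the macro name, offset convention, and wording of the SAFETY comment are assumptions, not the PR's exact code:

```rust
macro_rules! read_unaligned_at {
    ($base:expr, $byte_offset:expr, $ty:ty) => {{
        // SAFETY: the caller guarantees that `$base + $byte_offset` stays inside
        // the row/array region owned by the JVM for the duration of this call,
        // and that the bytes at that offset encode a valid `$ty`.
        unsafe { ($base.add($byte_offset) as *const $ty).read_unaligned() }
    }};
}

fn read_third_i64(data: &[u8]) -> i64 {
    assert!(data.len() >= 3 * std::mem::size_of::<i64>());
    let base = data.as_ptr();
    read_unaligned_at!(base, 2 * std::mem::size_of::<i64>(), i64)
}
```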
Use expect with the builder type name to provide better error messages
if a downcast fails.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Use if-let instead of is_some() + unwrap() in update_metrics
- Use local root_op variable directly instead of re-reading from exec_context

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@andygrove andygrove changed the title perf: Improve JVM shuffle performance with complex types [WIP] perf: Improve JVM shuffle performance with complex types Jan 26, 2026
@andygrove andygrove changed the title perf: Improve JVM shuffle performance with complex types perf: Improve performance of native row-to-columnar transition used by JVM shuffle Jan 26, 2026
@andygrove andygrove marked this pull request as ready for review January 26, 2026 21:44
@mbutrovich mbutrovich self-requested a review January 28, 2026 14:55
@andygrove andygrove added this to the 0.14.0 milestone Feb 6, 2026
