refactor: Split read benchmarks and add addParquetScanCases helper #3407

Open

andygrove wants to merge 4 commits into apache:main from andygrove:parquet-read-bench

Conversation

@andygrove
Member

Summary

  • Extract iceberg benchmarks from CometReadBaseBenchmark into new CometIcebergReadBenchmark object
  • Add addParquetScanCases helper to CometBenchmarkBase that encapsulates the repeated 3-case pattern (Spark / native_datafusion / native_iceberg_compat) used by all parquet benchmarks (see the sketch below this list)
  • Refactor CometReadBaseBenchmark and CometPartitionColumnBenchmark to use the new helper, eliminating ~270 lines of duplicated boilerplate
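
A rough Scala sketch of what such a helper might look like inside CometBenchmarkBase. The method name comes from this PR, but the signature, case labels, config keys, and the spark / withSQLConf / noop() utilities are assumptions, not the actual patch:

```scala
// Hypothetical sketch only: assumed to live in CometBenchmarkBase, where a
// SparkSession (spark), withSQLConf, and the noop() benchmark sink are in scope.
import org.apache.comet.CometConf
import org.apache.spark.benchmark.Benchmark

protected def addParquetScanCases(benchmark: Benchmark, query: String): Unit = {
  // Baseline: plain Spark Parquet scan.
  benchmark.addCase("SQL Parquet - Spark") { _ =>
    spark.sql(query).noop()
  }
  // Comet scan via the native_datafusion implementation.
  benchmark.addCase("SQL Parquet - Comet (native_datafusion)") { _ =>
    withSQLConf(
      CometConf.COMET_ENABLED.key -> "true",
      CometConf.COMET_NATIVE_SCAN_IMPL.key -> "native_datafusion") {
      spark.sql(query).noop()
    }
  }
  // Comet scan via the native_iceberg_compat implementation.
  benchmark.addCase("SQL Parquet - Comet (native_iceberg_compat)") { _ =>
    withSQLConf(
      CometConf.COMET_ENABLED.key -> "true",
      CometConf.COMET_NATIVE_SCAN_IMPL.key -> "native_iceberg_compat") {
      spark.sql(query).noop()
    }
  }
}
```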

Test plan

  • ./mvnw compile test-compile -pl spark -DskipTests passes
  • ./mvnw spotless:check -pl spark passes
  • Run CometReadBenchmark to verify parquet benchmarks work
  • Run CometIcebergReadBenchmark to verify iceberg benchmarks work
  • Run CometPartitionColumnBenchmark to verify partition benchmarks work

🤖 Generated with Claude Code

andygrove and others added 2 commits February 5, 2026 08:56
Extract iceberg benchmarks into CometIcebergReadBenchmark and add
addParquetScanCases helper to CometBenchmarkBase to eliminate the
repeated 3-case pattern (Spark / native_datafusion / native_iceberg_compat)
across all parquet benchmarks.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
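
For illustration, each refactored benchmark could then collapse its three hand-written cases into a single helper call; the benchmark name, query, and variables below are hypothetical, not taken from the patch:

```scala
// Hypothetical usage inside one of the refactored benchmarks. Previously each
// benchmark spelled out the three addCase blocks by hand.
val sqlBenchmark = new Benchmark("SQL Single Numeric Column Scan", values, output = output)
addParquetScanCases(sqlBenchmark, "SELECT sum(id) FROM parquetTable")
sqlBenchmark.run()
```
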
andygrove marked this pull request as ready for review February 5, 2026 16:44
andygrove and others added 2 commits February 5, 2026 10:44
Eliminate the JVM data round trip for data columns in native_iceberg_compat
scans. Data columns are read directly from the native BatchContext via
zero-copy Arc::clone, while only partition columns cross the JVM boundary
via Arrow FFI.

Previously, data made a wasteful round trip:
  Rust ParquetSource → per-column JNI export to JVM → JVM wraps as
  CometVector → JVM exports ALL cols back to Rust via Arrow FFI →
  Rust ScanExec deep-copies every column

Now in passthrough mode:
  Rust ParquetSource → batch stays in native BatchContext →
  Rust ScanExec reads data cols directly (zero-copy) →
  Only partition cols imported from JVM FFI (small, constant)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
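
For the JVM side of that flow, a minimal Scala sketch of the idea: only the partition columns are selected for export over Arrow FFI, while the data columns stay in the native BatchContext. The function name and the assumption that partition columns occupy the trailing slots of the batch are hypothetical, not Comet's actual API or column layout:

```scala
import org.apache.spark.sql.vectorized.{ColumnVector, ColumnarBatch}

// Hypothetical helper: pick out just the partition columns so that only these
// small, constant-size vectors cross the JVM boundary via Arrow FFI; in
// passthrough mode the data columns never leave the native BatchContext.
def partitionColumnsToExport(batch: ColumnarBatch, numDataColumns: Int): Seq[ColumnVector] =
  (numDataColumns until batch.numCols()).map(i => batch.column(i))
```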