chore: Run Spark SQL tests with native_datafusion in CI #3393

Merged
andygrove merged 14 commits into apache:main from andygrove:spark-sql-native-datafusion
Feb 5, 2026
Conversation

andygrove (Member) commented Feb 4, 2026

Which issue does this PR close?

N/A - This PR enables running Spark SQL tests with native_datafusion scan in CI.

Rationale for this change

Running Spark SQL tests with native_datafusion scan helps ensure compatibility and catch regressions. This PR enables these tests in CI while ignoring known failing tests that are tracked in separate issues.

What changes are included in this PR?

  1. CI workflow changes: Added native_datafusion scan mode to the Spark SQL test matrix

  2. Test annotations: Added IgnoreCometNativeDataFusion annotations for failing tests, linked to tracking issues:

Issue | Category | Tests
#3311 | Schema mismatch / type coercion | ParquetQuerySuite, ParquetIOSuite, ParquetSchemaSuite, ParquetFilterSuite, FileBasedDataSourceSuite
#3312 | input_file_name() not supported | UDFSuite, ExtractPythonUDFsSuite
#3313 | DPP (Dynamic Partition Pruning) | DynamicPartitionPruningSuite, ExplainSuite
#3314 | Missing files error handling differs | SQLViewSuite, HiveSQLViewSuite
#3315 | Parquet V2 / streaming sources | FileDataSourceV2FallBackSuite, StreamingQuerySuite
#3316 | Parquet field ID matching not supported | ParquetFieldIdIOSuite (6 tests)
#3317 | Row index metadata | ParquetFileMetadataStructRowIndexSuite
#3319 | Bucketed scan | BucketedReadSuite, DisableUnnecessaryBucketedScanSuite
#3320 | Predicate pushdown | ParquetFilterSuite
#3401 | Streaming self-union returns no results | StreamingSelfUnionSuite (2 tests)

How are these changes tested?

The changes are tested by the CI workflow itself: the suites should pass once the known failures are ignored.

andygrove and others added 11 commits February 4, 2026 08:18
…sion tests

Added annotations for the following tests that fail with native_datafusion scan:

DynamicPartitionPruningSuite:
- static scan metrics → apache#3313

ParquetQuerySuite, ParquetIOSuite, ParquetSchemaSuite, ParquetFilterSuite:
- SPARK-36182: can't read TimestampLTZ as TimestampNTZ → apache#3311
- SPARK-34212 Parquet should read decimals correctly → apache#3311
- row group skipping doesn't overflow when reading into larger type → apache#3311
- SPARK-35640 tests → apache#3311
- schema mismatch failure error message tests → apache#3311
- SPARK-25207: duplicate fields case-insensitive → apache#3311
- SPARK-31026: fields with dots in names → apache#3320
- Filters should be pushed down at row group level → apache#3320

FileBasedDataSourceSuite:
- Spark native readers should respect spark.sql.caseSensitive → apache#3311

BucketedReadSuite, DisableUnnecessaryBucketedScanSuite:
- disable bucketing when output doesn't contain all bucketing columns → apache#3319
- bucket coalescing tests → apache#3319
- SPARK-32859: disable unnecessary bucketed table scan tests → apache#3319
- Aggregates with no groupby over tables having 1 BUCKET → apache#3319

ParquetFileMetadataStructRowIndexSuite:
- reading _tmp_metadata_row_index tests → apache#3317

FileDataSourceV2FallBackSuite:
- Fallback Parquet V2 to V1 → apache#3315

UDFSuite:
- SPARK-8005 input_file_name → apache#3312

ExtractPythonUDFsSuite:
- Python UDF should not break column pruning/filter pushdown -- Parquet V1 → apache#3312

StreamingQuerySuite:
- SPARK-41198: input row calculation with CTE → apache#3315
- SPARK-41199: input row calculation with mixed DSv1 and DSv2 sources → apache#3315

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Added the import statement to test files that were missing it:
- FileDataSourceV2FallBackSuite.scala
- ParquetFileMetadataStructRowIndexSuite.scala
- ExtractPythonUDFsSuite.scala
- DisableUnnecessaryBucketedScanSuite.scala
- StreamingQuerySuite.scala
- UDFSuite.scala

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The method signature in IgnoreComet.scala was not properly formatted
according to scalafmt rules. This fixes the formatting to match
Spark's scalafmt configuration.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Set NOLINT_ON_COMPILE=true to skip scalastyle validation during
SBT compilation, reducing CI time for Spark SQL test runs.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
andygrove and others added 3 commits February 5, 2026 06:10
Add IgnoreCometNativeDataFusion annotations for:
- ExplainSuite: DPP subquery check (apache#3313)
- SQLViewSuite/HiveSQLViewSuite: storeAnalyzedPlanForView (apache#3314)
- ParquetFieldIdIOSuite: all 6 field ID tests (apache#3316)
- StreamingSelfUnionSuite: 2 self-union tests (apache#3401)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…in native scan rule

The fallback checks for fileConstantMetadataColumns and row index
generation were logging info but not returning None, so the native
DataFusion scan was still used even when these unsupported features
were needed (e.g. input_file_name()), causing empty results.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
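The fallback pattern this commit describes can be sketched as follows. This is a hypothetical simplification, not Comet's actual rule API: `ScanRequirements`, its fields, and `tryNativeScan` are illustrative stand-ins for the real translation logic. The point is that an unsupported feature must make the rule return `None` (decline the native scan) rather than only log and fall through.

```scala
// Hypothetical sketch of the fix: unsupported features must decline
// the native scan, not merely log. Names are illustrative, not
// Comet's real API.
case class ScanRequirements(
    fileConstantMetadataColumns: Boolean,
    generatesRowIndex: Boolean)

object NativeScanRule {
  // Returns Some(description) when the native DataFusion scan can be
  // used, or None so Spark falls back to its own Parquet scan.
  def tryNativeScan(req: ScanRequirements): Option[String] = {
    if (req.fileConstantMetadataColumns) {
      // Before the fix this branch only logged and fell through, so the
      // native scan was still chosen (e.g. for input_file_name()) and
      // produced empty results.
      return None
    }
    if (req.generatesRowIndex) {
      return None
    }
    Some("native_datafusion scan")
  }
}
```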
Fix three CI test failures in the Spark SQL test suite:

1. SQLTestUtils.test() - Move Comet tag checks before the
   DisableAdaptiveExecution handling so tests with both tags
   (like "static scan metrics") are properly skipped.

2. ColumnExpressionSuite - Skip "input_file_name...FileScanRDD" test
   since native DataFusion scan doesn't set InputFileBlockHolder.

3. HiveUDFSuite - Skip "SPARK-11522 select input_file_name" test
   for the same reason.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
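The first of the three fixes, the tag-check ordering in SQLTestUtils.test(), can be sketched like this. The dispatch function below is a hypothetical simplification of the real test registration logic; only the ordering it demonstrates comes from the commit message.

```scala
// Hypothetical sketch of the tag-ordering fix in SQLTestUtils.test().
// Tag names mirror the PR; the dispatch logic is a simplification.
object TestDispatch {
  val IgnoreCometNativeDataFusion = "IgnoreCometNativeDataFusion"
  val DisableAdaptiveExecution = "DisableAdaptiveExecution"

  // Returns the action taken for a test carrying the given tags.
  def dispatch(tags: Set[String], cometNativeDataFusion: Boolean): String = {
    // The fix: check the Comet ignore tags *before* the
    // DisableAdaptiveExecution handling, so a test carrying both tags
    // (like "static scan metrics") is skipped instead of being run
    // with AQE disabled.
    if (cometNativeDataFusion && tags.contains(IgnoreCometNativeDataFusion)) {
      "skip"
    } else if (tags.contains(DisableAdaptiveExecution)) {
      "run-without-aqe"
    } else {
      "run"
    }
  }
}
```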
@andygrove andygrove changed the title chore: Run Spark SQL tests with native_datafusion in CI [WIP] chore: Run Spark SQL tests with native_datafusion in CI Feb 5, 2026
@andygrove andygrove marked this pull request as ready for review February 5, 2026 20:10
@andygrove andygrove requested a review from mbutrovich February 5, 2026 20:11
cd apache-spark
rm -rf /root/.m2/repository/org/apache/parquet # somehow parquet cache requires cleanups
- ENABLE_COMET=true ENABLE_COMET_ONHEAP=true ${{ matrix.config.scan-env }} ENABLE_COMET_LOG_FALLBACK_REASONS=${{ github.event.inputs.collect-fallback-logs || 'false' }} \
+ NOLINT_ON_COMPILE=true ENABLE_COMET=true ENABLE_COMET_ONHEAP=true COMET_PARQUET_SCAN_IMPL=${{ matrix.config.scan-impl }} ENABLE_COMET_LOG_FALLBACK_REASONS=${{ github.event.inputs.collect-fallback-logs || 'false' }} \
A Contributor commented:
Where did NOLINT_ON_COMPILE=true come from?

andygrove (Member, Author) replied Feb 5, 2026:
I got tired of trying to run spotless scalafmt against the Spark code

andygrove (Member, Author) added:
We clone Spark and apply our patch. I don't think we care if it is well formatted just to be able to run the tests.

mbutrovich (Contributor) left a review:
Makes sense to me! Thanks @andygrove!

@andygrove andygrove merged commit 2f64b60 into apache:main Feb 5, 2026
134 of 135 checks passed
@andygrove andygrove deleted the spark-sql-native-datafusion branch February 5, 2026 22:03
