chore: Run Spark SQL tests with native_datafusion in CI #3393

Merged
andygrove merged 14 commits into apache:main from andygrove:spark-sql-native-datafusion
Feb 5, 2026
Conversation

andygrove (Member) commented Feb 4, 2026

Which issue does this PR close?

N/A - This PR enables running Spark SQL tests with native_datafusion scan in CI.

Rationale for this change

Running Spark SQL tests with native_datafusion scan helps ensure compatibility and catch regressions. This PR enables these tests in CI while ignoring known failing tests that are tracked in separate issues.

What changes are included in this PR?

  1. CI workflow changes: Added native_datafusion scan mode to the Spark SQL test matrix

  2. Test annotations: Added IgnoreCometNativeDataFusion annotations for failing tests, linked to tracking issues:

Issue | Category | Tests
#3311 | Schema mismatch / type coercion | ParquetQuerySuite, ParquetIOSuite, ParquetSchemaSuite, ParquetFilterSuite, FileBasedDataSourceSuite
#3312 | input_file_name() not supported | UDFSuite, ExtractPythonUDFsSuite
#3313 | DPP (Dynamic Partition Pruning) | DynamicPartitionPruningSuite, ExplainSuite
#3314 | Missing files error handling differs | SQLViewSuite, HiveSQLViewSuite
#3315 | Parquet V2 / streaming sources | FileDataSourceV2FallBackSuite, StreamingQuerySuite
#3316 | Parquet field ID matching not supported | ParquetFieldIdIOSuite (6 tests)
#3317 | Row index metadata | ParquetFileMetadataStructRowIndexSuite
#3319 | Bucketed scan | BucketedReadSuite, DisableUnnecessaryBucketedScanSuite
#3320 | Predicate pushdown | ParquetFilterSuite
#3401 | Streaming self-union returns no results | StreamingSelfUnionSuite (2 tests)

How are these changes tested?

The changes are tested by the CI workflow itself: the suites should pass once the known failures are ignored.

andygrove and others added 11 commits February 4, 2026 08:18
…sion tests

Added annotations for the following tests that fail with native_datafusion scan:

DynamicPartitionPruningSuite:
- static scan metrics → apache#3313

ParquetQuerySuite, ParquetIOSuite, ParquetSchemaSuite, ParquetFilterSuite:
- SPARK-36182: can't read TimestampLTZ as TimestampNTZ → apache#3311
- SPARK-34212 Parquet should read decimals correctly → apache#3311
- row group skipping doesn't overflow when reading into larger type → apache#3311
- SPARK-35640 tests → apache#3311
- schema mismatch failure error message tests → apache#3311
- SPARK-25207: duplicate fields case-insensitive → apache#3311
- SPARK-31026: fields with dots in names → apache#3320
- Filters should be pushed down at row group level → apache#3320

FileBasedDataSourceSuite:
- Spark native readers should respect spark.sql.caseSensitive → apache#3311

BucketedReadSuite, DisableUnnecessaryBucketedScanSuite:
- disable bucketing when output doesn't contain all bucketing columns → apache#3319
- bucket coalescing tests → apache#3319
- SPARK-32859: disable unnecessary bucketed table scan tests → apache#3319
- Aggregates with no groupby over tables having 1 BUCKET → apache#3319

ParquetFileMetadataStructRowIndexSuite:
- reading _tmp_metadata_row_index tests → apache#3317

FileDataSourceV2FallBackSuite:
- Fallback Parquet V2 to V1 → apache#3315

UDFSuite:
- SPARK-8005 input_file_name → apache#3312

ExtractPythonUDFsSuite:
- Python UDF should not break column pruning/filter pushdown -- Parquet V1 → apache#3312

StreamingQuerySuite:
- SPARK-41198: input row calculation with CTE → apache#3315
- SPARK-41199: input row calculation with mixed DSv1 and DSv2 sources → apache#3315

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Added the import statement to test files that were missing it:
- FileDataSourceV2FallBackSuite.scala
- ParquetFileMetadataStructRowIndexSuite.scala
- ExtractPythonUDFsSuite.scala
- DisableUnnecessaryBucketedScanSuite.scala
- StreamingQuerySuite.scala
- UDFSuite.scala

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The method signature in IgnoreComet.scala was not properly formatted
according to scalafmt rules. This fixes the formatting to match
Spark's scalafmt configuration.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Set NOLINT_ON_COMPILE=true to skip scalastyle validation during
SBT compilation, reducing CI time for Spark SQL test runs.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
andygrove and others added 3 commits February 5, 2026 06:10
Add IgnoreCometNativeDataFusion annotations for:
- ExplainSuite: DPP subquery check (apache#3313)
- SQLViewSuite/HiveSQLViewSuite: storeAnalyzedPlanForView (apache#3314)
- ParquetFieldIdIOSuite: all 6 field ID tests (apache#3316)
- StreamingSelfUnionSuite: 2 self-union tests (apache#3401)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…in native scan rule

The fallback checks for fileConstantMetadataColumns and row index
generation were logging info but not returning None, so the native
DataFusion scan was still used even when these unsupported features
were needed (e.g. input_file_name()), causing empty results.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
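The fallback pattern this commit describes can be sketched as follows. This is a hypothetical simplification, not Comet's actual rule API: `ScanRequirements`, its fields, and `tryNativeScan` are illustrative stand-ins for the real translation logic. The point is that an unsupported feature must make the rule return `None` (decline the native scan) rather than only log and fall through.

```scala
// Hypothetical sketch of the fix: unsupported features must decline
// the native scan, not merely log. Names are illustrative, not
// Comet's real API.
case class ScanRequirements(
    fileConstantMetadataColumns: Boolean,
    generatesRowIndex: Boolean)

object NativeScanRule {
  // Returns Some(description) when the native DataFusion scan can be
  // used, or None so Spark falls back to its own Parquet scan.
  def tryNativeScan(req: ScanRequirements): Option[String] = {
    if (req.fileConstantMetadataColumns) {
      // Before the fix this branch only logged and fell through, so the
      // native scan was still chosen (e.g. for input_file_name()) and
      // produced empty results.
      return None
    }
    if (req.generatesRowIndex) {
      return None
    }
    Some("native_datafusion scan")
  }
}
```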
Fix three CI test failures in the Spark SQL test suite:

1. SQLTestUtils.test() - Move Comet tag checks before the
   DisableAdaptiveExecution handling so tests with both tags
   (like "static scan metrics") are properly skipped.

2. ColumnExpressionSuite - Skip "input_file_name...FileScanRDD" test
   since native DataFusion scan doesn't set InputFileBlockHolder.

3. HiveUDFSuite - Skip "SPARK-11522 select input_file_name" test
   for the same reason.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
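The first of the three fixes, the tag-check ordering in SQLTestUtils.test(), can be sketched like this. The dispatch function below is a hypothetical simplification of the real test registration logic; only the ordering it demonstrates comes from the commit message.

```scala
// Hypothetical sketch of the tag-ordering fix in SQLTestUtils.test().
// Tag names mirror the PR; the dispatch logic is a simplification.
object TestDispatch {
  val IgnoreCometNativeDataFusion = "IgnoreCometNativeDataFusion"
  val DisableAdaptiveExecution = "DisableAdaptiveExecution"

  // Returns the action taken for a test carrying the given tags.
  def dispatch(tags: Set[String], cometNativeDataFusion: Boolean): String = {
    // The fix: check the Comet ignore tags *before* the
    // DisableAdaptiveExecution handling, so a test carrying both tags
    // (like "static scan metrics") is skipped instead of being run
    // with AQE disabled.
    if (cometNativeDataFusion && tags.contains(IgnoreCometNativeDataFusion)) {
      "skip"
    } else if (tags.contains(DisableAdaptiveExecution)) {
      "run-without-aqe"
    } else {
      "run"
    }
  }
}
```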
@andygrove andygrove changed the title chore: Run Spark SQL tests with native_datafusion in CI [WIP] chore: Run Spark SQL tests with native_datafusion in CI Feb 5, 2026
@andygrove andygrove marked this pull request as ready for review February 5, 2026 20:10
@andygrove andygrove requested a review from mbutrovich February 5, 2026 20:11
cd apache-spark
rm -rf /root/.m2/repository/org/apache/parquet # somehow parquet cache requires cleanups
- ENABLE_COMET=true ENABLE_COMET_ONHEAP=true ${{ matrix.config.scan-env }} ENABLE_COMET_LOG_FALLBACK_REASONS=${{ github.event.inputs.collect-fallback-logs || 'false' }} \
+ NOLINT_ON_COMPILE=true ENABLE_COMET=true ENABLE_COMET_ONHEAP=true COMET_PARQUET_SCAN_IMPL=${{ matrix.config.scan-impl }} ENABLE_COMET_LOG_FALLBACK_REASONS=${{ github.event.inputs.collect-fallback-logs || 'false' }} \
A Contributor commented:
Where did NOLINT_ON_COMPILE=true come from?

andygrove (Member, Author) replied Feb 5, 2026:
I got tired of trying to run spotless scalafmt against the Spark code

andygrove (Member, Author) added:
We clone Spark and apply our patch. I don't think we care if it is well formatted just to be able to run the tests.

mbutrovich (Contributor) left a review:
Makes sense to me! Thanks @andygrove!

@andygrove andygrove merged commit 2f64b60 into apache:main Feb 5, 2026
134 of 135 checks passed
@andygrove andygrove deleted the spark-sql-native-datafusion branch February 5, 2026 22:03
