chore: Run Spark SQL tests with native_datafusion in CI #3393
Merged
andygrove merged 14 commits into apache:main on Feb 5, 2026
Conversation
…sion tests

Added annotations for the following tests that fail with native_datafusion scan:

DynamicPartitionPruningSuite:
- static scan metrics → apache#3313

ParquetQuerySuite, ParquetIOSuite, ParquetSchemaSuite, ParquetFilterSuite:
- SPARK-36182: can't read TimestampLTZ as TimestampNTZ → apache#3311
- SPARK-34212 Parquet should read decimals correctly → apache#3311
- row group skipping doesn't overflow when reading into larger type → apache#3311
- SPARK-35640 tests → apache#3311
- schema mismatch failure error message tests → apache#3311
- SPARK-25207: duplicate fields case-insensitive → apache#3311
- SPARK-31026: fields with dots in names → apache#3320
- Filters should be pushed down at row group level → apache#3320

FileBasedDataSourceSuite:
- Spark native readers should respect spark.sql.caseSensitive → apache#3311

BucketedReadSuite, DisableUnnecessaryBucketedScanSuite:
- disable bucketing when output doesn't contain all bucketing columns → apache#3319
- bucket coalescing tests → apache#3319
- SPARK-32859: disable unnecessary bucketed table scan tests → apache#3319
- Aggregates with no groupby over tables having 1 BUCKET → apache#3319

ParquetFileMetadataStructRowIndexSuite:
- reading _tmp_metadata_row_index tests → apache#3317

FileDataSourceV2FallBackSuite:
- Fallback Parquet V2 to V1 → apache#3315

UDFSuite:
- SPARK-8005 input_file_name → apache#3312

ExtractPythonUDFsSuite:
- Python UDF should not break column pruning/filter pushdown -- Parquet V1 → apache#3312

StreamingQuerySuite:
- SPARK-41198: input row calculation with CTE → apache#3315
- SPARK-41199: input row calculation with mixed DSv1 and DSv2 sources → apache#3315

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Added the import statement to test files that were missing it:
- FileDataSourceV2FallBackSuite.scala
- ParquetFileMetadataStructRowIndexSuite.scala
- ExtractPythonUDFsSuite.scala
- DisableUnnecessaryBucketedScanSuite.scala
- StreamingQuerySuite.scala
- UDFSuite.scala

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
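To make the pattern behind these two commits concrete, here is a minimal sketch (not the actual Comet patch). It assumes IgnoreCometNativeDataFusion is a ScalaTest Tag, mirroring Comet's existing IgnoreComet tag, that a Comet-aware test() helper can check before running a test under the native_datafusion scan; the suite name, test name, and tag definition below are illustrative.

```scala
import org.scalatest.Tag
import org.scalatest.funsuite.AnyFunSuite

// Hypothetical stand-in for the tag added by this PR; the real definition
// lives in Comet's Spark patch (IgnoreComet.scala).
case class IgnoreCometNativeDataFusion(reason: String)
    extends Tag("org.apache.spark.sql.IgnoreCometNativeDataFusion")

class ExampleParquetSuite extends AnyFunSuite {
  // The imported tag is passed alongside the test name; a patched test() helper
  // can then skip this test only when COMET_PARQUET_SCAN_IMPL=native_datafusion.
  test("reads decimals correctly", IgnoreCometNativeDataFusion("tracked by apache#3311")) {
    assert(BigDecimal("1.23") + BigDecimal("0.77") == BigDecimal("2.00"))
  }
}
```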
The method signature in IgnoreComet.scala was not properly formatted according to scalafmt rules. This fixes the formatting to match Spark's scalafmt configuration. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Set NOLINT_ON_COMPILE=true to skip scalastyle validation during SBT compilation, reducing CI time for Spark SQL test runs. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add IgnoreCometNativeDataFusion annotations for:
- ExplainSuite: DPP subquery check (apache#3313)
- SQLViewSuite/HiveSQLViewSuite: storeAnalyzedPlanForView (apache#3314)
- ParquetFieldIdIOSuite: all 6 field ID tests (apache#3316)
- StreamingSelfUnionSuite: 2 self-union tests (apache#3401)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…in native scan rule

The fallback checks for fileConstantMetadataColumns and row index generation were logging info but not returning None, so the native DataFusion scan was still used even when these unsupported features were needed (e.g. input_file_name()), causing empty results.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
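A simplified sketch of the fix described in this commit, using stand-in types rather than Comet's actual rule API: each unsupported-feature check must return None so the planner keeps the Spark scan, instead of only logging and falling through to build the native scan anyway.

```scala
object NativeScanFallbackSketch {
  // Hypothetical, minimal view of what the rule inspects; the real rule
  // works on Spark's FileSourceScanExec.
  case class ScanInfo(fileConstantMetadataColumns: Seq[String], needsRowIndex: Boolean)
  case class NativeScan(info: ScanInfo)

  def tryNativeDataFusionScan(info: ScanInfo): Option[NativeScan] = {
    if (info.fileConstantMetadataColumns.nonEmpty) {
      println(s"fallback: constant metadata columns not supported: ${info.fileConstantMetadataColumns}")
      return None // before the fix, this branch only logged and fell through
    }
    if (info.needsRowIndex) {
      println("fallback: row index generation not supported")
      return None
    }
    Some(NativeScan(info))
  }

  def main(args: Array[String]): Unit = {
    // Per the commit message, input_file_name() was one case hit by this bug;
    // "file_name" here is just an illustrative metadata column.
    assert(tryNativeDataFusionScan(ScanInfo(Seq("file_name"), needsRowIndex = false)).isEmpty)
    assert(tryNativeDataFusionScan(ScanInfo(Nil, needsRowIndex = false)).isDefined)
  }
}
```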
Fix three CI test failures in the Spark SQL test suite:

1. SQLTestUtils.test() - Move Comet tag checks before the DisableAdaptiveExecution handling so tests with both tags (like "static scan metrics") are properly skipped.
2. ColumnExpressionSuite - Skip "input_file_name...FileScanRDD" test since native DataFusion scan doesn't set InputFileBlockHolder.
3. HiveUDFSuite - Skip "SPARK-11522 select input_file_name" test for the same reason.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
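A sketch of the ordering fix in item 1, assuming the patched SQLTestUtils.test() branches roughly like this (names and the skip mechanism are illustrative, not the actual Spark patch): the Comet tag check has to come before the DisableAdaptiveExecution branch, otherwise a test carrying both tags takes the AQE branch and is never skipped.

```scala
import org.scalatest.Tag

object TestRegistrationSketch {
  val DisableAdaptiveExecution = new Tag("DisableAdaptiveExecution")

  private def hasCometIgnoreTag(tags: Seq[Tag]): Boolean =
    tags.exists(_.name.contains("IgnoreCometNativeDataFusion"))

  def registerTest(name: String, tags: Tag*)(body: => Unit): Unit = {
    if (hasCometIgnoreTag(tags)) {
      // Checked first after the fix: "static scan metrics" carries both this tag
      // and DisableAdaptiveExecution, and previously fell into the branch below.
      println(s"ignoring '$name' under native_datafusion scan")
    } else if (tags.exists(_.name == DisableAdaptiveExecution.name)) {
      // AQE-disabled variant of the test
      body
    } else {
      body
    }
  }
}
```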
mbutrovich
reviewed
Feb 5, 2026
cd apache-spark
rm -rf /root/.m2/repository/org/apache/parquet # somehow parquet cache requires cleanups
- ENABLE_COMET=true ENABLE_COMET_ONHEAP=true ${{ matrix.config.scan-env }} ENABLE_COMET_LOG_FALLBACK_REASONS=${{ github.event.inputs.collect-fallback-logs || 'false' }} \
+ NOLINT_ON_COMPILE=true ENABLE_COMET=true ENABLE_COMET_ONHEAP=true COMET_PARQUET_SCAN_IMPL=${{ matrix.config.scan-impl }} ENABLE_COMET_LOG_FALLBACK_REASONS=${{ github.event.inputs.collect-fallback-logs || 'false' }} \
Contributor
Where did NOLINT_ON_COMPILE=true come from?
Member
Author
I got tired of trying to run spotless scalafmt against the Spark code.
Member
Author
We clone Spark and apply our patch. I don't think we care if it is well formatted just to be able to run the tests.
mbutrovich
approved these changes
Feb 5, 2026
Contributor
mbutrovich
left a comment
Makes sense to me! Thanks @andygrove!
Which issue does this PR close?
N/A - This PR enables running Spark SQL tests with native_datafusion scan in CI.

Rationale for this change

Running Spark SQL tests with native_datafusion scan helps ensure compatibility and catch regressions. This PR enables these tests in CI while ignoring known failing tests that are tracked in separate issues.

What changes are included in this PR?

- CI workflow changes: Added native_datafusion scan mode to the Spark SQL test matrix
- Test annotations: Added IgnoreCometNativeDataFusion annotations for failing tests, linked to tracking issues

How are these changes tested?

The changes are tested by the CI workflow itself - tests should pass with the known failures ignored.