fix(spark-expr): handle array length mismatch in datediff for dictionary-backed timestamps #3278
vigneshsiva11 wants to merge 4 commits into apache:main
Conversation
Pull request overview
Fixes Comet's `datediff` execution on dictionary-encoded / Iceberg-backed timestamp inputs by ensuring scalar arguments are broadcast to the correct batch length and normalizing inputs before the binary op.
Changes:
- Broadcast scalar `datediff` arguments to the columnar batch length to avoid array-length mismatches (see the sketch after this list).
- Cast inputs to `Date32` prior to subtraction to handle dictionary-backed arrays.
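For illustration, a minimal sketch of that normalization, assuming DataFusion's `ColumnarValue`/`ScalarValue` APIs and arrow's `cast` kernel; `normalize_arg` and `diff_days` are illustrative names, not the PR's exact code:

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Date32Array, Int32Array};
use arrow::compute::cast;
use arrow::datatypes::DataType;
use datafusion::common::Result;
use datafusion::logical_expr::ColumnarValue;

// Broadcast a scalar argument to `num_rows` entries (plain arrays pass
// through), then cast to Date32. The cast also flattens
// dictionary-encoded inputs to a plain Date32 array.
fn normalize_arg(arg: &ColumnarValue, num_rows: usize) -> Result<ArrayRef> {
    let array: ArrayRef = match arg {
        ColumnarValue::Array(a) => Arc::clone(a),
        // Broadcasting the scalar is what avoids the length-mismatch error.
        ColumnarValue::Scalar(s) => s.to_array_of_size(num_rows)?,
    };
    Ok(cast(&array, &DataType::Date32)?)
}

// Date32 stores days since the Unix epoch, so the day difference is plain
// integer subtraction, null if either side is null.
fn diff_days(end: &Date32Array, start: &Date32Array) -> Int32Array {
    end.iter()
        .zip(start.iter())
        .map(|(e, s)| match (e, s) {
            (Some(e), Some(s)) => Some(e - s),
            _ => None,
        })
        .collect()
}
```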
```rust
// under the License.

use arrow::array::{Array, Date32Array, Int32Array};
use arrow::compute::cast;
```
`use arrow::compute::cast;` is unused (the code calls `arrow::compute::cast(...)` with a fully-qualified path). This will fail CI because clippy is run with `-D warnings`. Either remove this import or use `cast(&end_arr, &DataType::Date32)` / `cast(&start_arr, &DataType::Date32)` so the import is actually used.
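For reference, a hedged sketch of the second option (keeping the import and calling it unqualified); `to_date32`, `end_arr`, and `start_arr` are illustrative stand-ins for the function's actual inputs:

```rust
use arrow::array::ArrayRef;
use arrow::compute::cast;
use arrow::datatypes::DataType;
use arrow::error::ArrowError;

// Calling the imported `cast` unqualified keeps the import live, so
// clippy under `-D warnings` no longer flags it as unused.
fn to_date32(end_arr: &ArrayRef, start_arr: &ArrayRef) -> Result<(ArrayRef, ArrayRef), ArrowError> {
    let end = cast(end_arr, &DataType::Date32)?;
    let start = cast(start_arr, &DataType::Date32)?;
    Ok((end, start))
}
```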
```rust
        Ok(ColumnarValue::Array(Arc::new(result)))
    }

    fn aliases(&self) -> &[String] {
        &self.aliases
    }
}
```
The `aliases` field still includes `"datediff"`, but the `fn aliases(&self) -> &[String]` implementation was removed. If `ScalarUDFImpl`'s default `aliases()` returns an empty slice (as used by other UDF impls in this crate), this will drop the `datediff` alias and can break Spark SQL/function resolution that relies on the alias rather than the primary name `date_diff`. Re-introduce `aliases()` (returning `&self.aliases`) or remove `aliases` entirely and ensure the UDF is registered under the intended name(s).
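As a sketch of the first suggestion (re-introducing the accessor), with a hypothetical minimal struct standing in for the real UDF type:

```rust
// Hypothetical minimal shape, for illustration only; the real type
// implements DataFusion's ScalarUDFImpl trait.
struct SparkDateDiff {
    aliases: Vec<String>,
}

impl SparkDateDiff {
    fn new() -> Self {
        Self {
            // Keep "datediff" resolvable alongside the primary name "date_diff".
            aliases: vec!["datediff".to_string()],
        }
    }

    // The re-introduced accessor; in the real impl this overrides
    // ScalarUDFImpl::aliases so the alias is not silently dropped.
    fn aliases(&self) -> &[String] {
        &self.aliases
    }
}

fn main() {
    assert_eq!(SparkDateDiff::new().aliases()[0], "datediff");
}
```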
Codecov Report

❌ Patch coverage is …

Additional details and impacted files:

```
@@             Coverage Diff             @@
##               main    #3278      +/-  ##
============================================
+ Coverage     56.12%   59.95%    +3.82%
- Complexity      976     1462     +486
============================================
  Files           119      175      +56
  Lines         11743    16167    +4424
  Branches       2251     2682     +431
============================================
+ Hits           6591     9693    +3102
- Misses         4012     5126    +1114
- Partials       1140     1348     +208
```

☔ View full report in Codecov by Sentry.
All CI checks are now passing.
```rust
fn aliases(&self) -> &[String] {
    &self.aliases
}
```
Could you add a test to demonstrate the problem this PR fixes? Without a supporting test, it is difficult to assess the correctness.
I've added a regression test in `ParquetDatetimeRebaseSuite` that exercises `datediff` on dictionary-encoded timestamp columns.
Force-pushed from 7994115 to 5da8d77.
```scala
    }
  }

  test("COMET-XXXX: datediff works with dictionary-encoded timestamp columns") {
```
minor: we could create an issue and reference its number here rather than leaving XXXX.
```scala
      .collect()

    // Just verify it executes correctly (no CometNativeException)
    assert(result.length == 2)
```
We might want to leverage `checkSparkAnswerAndOperator`, which also verifies the plan.
Thanks for the suggestion! For this regression test I focused on reproducing the original failure and ensuring `datediff` executes correctly without a `CometNativeException`. I can switch to `checkSparkAnswerAndOperator` if plan verification is preferred.
```rust
        )
    })?;

    // Date32 stores days since epoch, so difference is just subtraction
```
Perhaps this is unintended?
@vigneshsiva11, could you provide a snippet in the PR or the related GitHub issue to replicate the original issue? Also, I'm not sure I understand the regression test you've added.
Thanks for the suggestion. The original issue can be reproduced with a dictionary-encoded Parquet timestamp column and a `datediff(current_date(), ts)` query, which previously failed with an array-length mismatch / `CometNativeException`. The regression test added in `ParquetDatetimeRebaseSuite` writes a Parquet file with dictionary-encoded timestamps, reads it back, and runs `datediff` to ensure it executes successfully. The test fails on main and passes with this fix. Happy to adjust the test or add more detail if needed.
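At the kernel level, a minimal Rust sketch of the input shape involved, assuming arrow's `cast` kernel unpacks dictionary-encoded arrays (array contents are illustrative, not from the actual repro):

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, DictionaryArray, Int32Array, TimestampMicrosecondArray};
use arrow::compute::cast;
use arrow::datatypes::{DataType, Int32Type};

fn main() {
    // Dictionary-encoded timestamps, the shape Iceberg/Parquet readers can produce.
    let values = TimestampMicrosecondArray::from(vec![0i64, 86_400_000_000]);
    let keys = Int32Array::from(vec![0, 1, 0]);
    let ts = DictionaryArray::<Int32Type>::try_new(keys, Arc::new(values)).unwrap();

    // Casting flattens the dictionary to a plain Date32 array of the full
    // row count; before the fix, the kernel could end up subtracting
    // arrays of mismatched lengths instead.
    let dates: ArrayRef = cast(&ts, &DataType::Date32).unwrap();
    assert_eq!(dates.len(), 3);
}
```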
Which issue does this PR close?
Closes #3255
Rationale for this change
When reading Iceberg tables, timestamp columns may be dictionary-encoded in the underlying Parquet files.
In the current implementation of `datediff`, scalar and array arguments can be converted into arrays of different lengths, which leads to a runtime error. This behavior differs from Spark, which correctly broadcasts scalar inputs and handles dictionary-backed columns without error.
This change ensures Comet's `datediff` implementation aligns with Spark semantics and avoids execution failures.

What changes are included in this PR?
- `datediff` arguments are converted into arrays of the same length by correctly broadcasting scalars
- Inputs are cast to `Date32Array` before applying the binary operation

How are these changes tested?
- Existing unit tests in `datafusion-comet-spark-expr` were run locally
- Existing tests cover `to_date` and `datediff`