fix(spark-expr): handle array length mismatch in datediff for dictionary-backed timestamps #3278
vigneshsiva11 wants to merge 4 commits into apache:main
Conversation
Pull request overview
Fixes Comet's `datediff` execution on dictionary-encoded / Iceberg-backed timestamp inputs by ensuring scalar arguments are broadcast to the correct batch length and normalizing inputs before the binary op.
Changes:
- Broadcast scalar `datediff` arguments to the columnar batch length to avoid array-length mismatches (see the sketch after this list).
- Cast inputs to `Date32` prior to subtraction to handle dictionary-backed arrays.
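For illustration, a minimal sketch of that normalization, assuming DataFusion's `ColumnarValue`/`ScalarValue` APIs and arrow's `cast` kernel; `normalize_arg` and `diff_days` are illustrative names, not the PR's exact code:

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Date32Array, Int32Array};
use arrow::compute::cast;
use arrow::datatypes::DataType;
use datafusion::common::Result;
use datafusion::logical_expr::ColumnarValue;

// Broadcast a scalar argument to `num_rows` entries (plain arrays pass
// through), then cast to Date32. The cast also flattens
// dictionary-encoded inputs to a plain Date32 array.
fn normalize_arg(arg: &ColumnarValue, num_rows: usize) -> Result<ArrayRef> {
    let array: ArrayRef = match arg {
        ColumnarValue::Array(a) => Arc::clone(a),
        // Broadcasting the scalar is what avoids the length-mismatch error.
        ColumnarValue::Scalar(s) => s.to_array_of_size(num_rows)?,
    };
    Ok(cast(&array, &DataType::Date32)?)
}

// Date32 stores days since the Unix epoch, so the day difference is plain
// integer subtraction, null if either side is null.
fn diff_days(end: &Date32Array, start: &Date32Array) -> Int32Array {
    end.iter()
        .zip(start.iter())
        .map(|(e, s)| match (e, s) {
            (Some(e), Some(s)) => Some(e - s),
            _ => None,
        })
        .collect()
}
```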
```rust
// under the License.

use arrow::array::{Array, Date32Array, Int32Array};
use arrow::compute::cast;
```
`use arrow::compute::cast;` is unused (the code calls `arrow::compute::cast(...)` with a fully-qualified path). This will fail CI because clippy is run with `-D warnings`. Either remove this import or use `cast(&end_arr, &DataType::Date32)` / `cast(&start_arr, &DataType::Date32)` so the import is actually used.
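For reference, a hedged sketch of the second option (keeping the import and calling it unqualified); `to_date32`, `end_arr`, and `start_arr` are illustrative stand-ins for the function's actual inputs:

```rust
use arrow::array::ArrayRef;
use arrow::compute::cast;
use arrow::datatypes::DataType;
use arrow::error::ArrowError;

// Calling the imported `cast` unqualified keeps the import live, so
// clippy under `-D warnings` no longer flags it as unused.
fn to_date32(end_arr: &ArrayRef, start_arr: &ArrayRef) -> Result<(ArrayRef, ArrayRef), ArrowError> {
    let end = cast(end_arr, &DataType::Date32)?;
    let start = cast(start_arr, &DataType::Date32)?;
    Ok((end, start))
}
```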
```rust
        Ok(ColumnarValue::Array(Arc::new(result)))
    }

    fn aliases(&self) -> &[String] {
        &self.aliases
    }
}
```
The `aliases` field still includes `"datediff"`, but the `fn aliases(&self) -> &[String]` implementation was removed. If `ScalarUDFImpl`'s default `aliases()` returns an empty slice (as used by other UDF impls in this crate), this will drop the `datediff` alias and can break Spark SQL/function resolution that relies on the alias rather than the primary name `date_diff`. Re-introduce `aliases()` (returning `&self.aliases`) or remove `aliases` entirely and ensure the UDF is registered under the intended name(s).
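As a sketch of the first suggestion (re-introducing the accessor), with a hypothetical minimal struct standing in for the real UDF type:

```rust
// Hypothetical minimal shape, for illustration only; the real type
// implements DataFusion's ScalarUDFImpl trait.
struct SparkDateDiff {
    aliases: Vec<String>,
}

impl SparkDateDiff {
    fn new() -> Self {
        Self {
            // Keep "datediff" resolvable alongside the primary name "date_diff".
            aliases: vec!["datediff".to_string()],
        }
    }

    // The re-introduced accessor; in the real impl this overrides
    // ScalarUDFImpl::aliases so the alias is not silently dropped.
    fn aliases(&self) -> &[String] {
        &self.aliases
    }
}

fn main() {
    assert_eq!(SparkDateDiff::new().aliases()[0], "datediff");
}
```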
Codecov Report

❌ Patch coverage is …

Additional details and impacted files:

```
@@             Coverage Diff             @@
##               main    #3278      +/-  ##
============================================
+ Coverage     56.12%   59.95%    +3.82%
- Complexity      976     1462     +486
============================================
  Files           119      175      +56
  Lines         11743    16167    +4424
  Branches       2251     2682     +431
============================================
+ Hits           6591     9693    +3102
- Misses         4012     5126    +1114
- Partials       1140     1348     +208
```

☔ View full report in Codecov by Sentry.
All CI checks are now passing.
```rust
fn aliases(&self) -> &[String] {
    &self.aliases
}
```
Could you add a test to demonstrate the problem this PR fixes? Without a supporting test, it is difficult to assess the correctness.
I've added a regression test in `ParquetDatetimeRebaseSuite` that exercises `datediff` on dictionary-encoded timestamp columns.
Force-pushed from 7994115 to 5da8d77.
```scala
    }
  }

  test("COMET-XXXX: datediff works with dictionary-encoded timestamp columns") {
```
minor: we could create an issue and reference its number here rather than leaving XXXX.
```scala
      .collect()

    // Just verify it executes correctly (no CometNativeException)
    assert(result.length == 2)
```
We might want to leverage `checkSparkAnswerAndOperator`, which also verifies the plan.
Thanks for the suggestion! For this regression test I focused on reproducing the original failure and ensuring `datediff` executes correctly without a `CometNativeException`. I can switch to `checkSparkAnswerAndOperator` if plan verification is preferred.
```rust
        )
    })?;

    // Date32 stores days since epoch, so difference is just subtraction
```
Perhaps this is unintended?
@vigneshsiva11, could you provide a snippet in the PR or the related GitHub issue to replicate the original issue? Also, I'm not sure I understand the regression test you've added.
Thanks for the suggestion. The original issue can be reproduced with a dictionary-encoded Parquet timestamp column and a `datediff(current_date(), ts)` query, which previously failed with an array-length mismatch / `CometNativeException`. The regression test added in `ParquetDatetimeRebaseSuite` writes a Parquet file with dictionary-encoded timestamps, reads it back, and runs `datediff` to ensure it executes successfully. The test fails on main and passes with this fix. Happy to adjust the test or add more detail if needed.
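At the kernel level, a minimal Rust sketch of the input shape involved, assuming arrow's `cast` kernel unpacks dictionary-encoded arrays (array contents are illustrative, not from the actual repro):

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, DictionaryArray, Int32Array, TimestampMicrosecondArray};
use arrow::compute::cast;
use arrow::datatypes::{DataType, Int32Type};

fn main() {
    // Dictionary-encoded timestamps, the shape Iceberg/Parquet readers can produce.
    let values = TimestampMicrosecondArray::from(vec![0i64, 86_400_000_000]);
    let keys = Int32Array::from(vec![0, 1, 0]);
    let ts = DictionaryArray::<Int32Type>::try_new(keys, Arc::new(values)).unwrap();

    // Casting flattens the dictionary to a plain Date32 array of the full
    // row count; before the fix, the kernel could end up subtracting
    // arrays of mismatched lengths instead.
    let dates: ArrayRef = cast(&ts, &DataType::Date32).unwrap();
    assert_eq!(dates.len(), 3);
}
```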
Which issue does this PR close?
Closes #3255
Rationale for this change
When reading Iceberg tables, timestamp columns may be dictionary-encoded in the underlying Parquet files.
In the current implementation of `datediff`, scalar and array arguments can be converted into arrays of different lengths, which leads to a runtime error. This behavior differs from Spark, which correctly broadcasts scalar inputs and handles dictionary-backed columns without error.
This change ensures Comet's `datediff` implementation aligns with Spark semantics and avoids execution failures.

What changes are included in this PR?
- `datediff` arguments are converted into arrays of the same length by correctly broadcasting scalars
- Inputs are cast to `Date32Array` before applying the binary operation

How are these changes tested?
- Existing unit tests in `datafusion-comet-spark-expr` were run locally
- Existing tests cover `to_date` and `datediff`