feat: support map_contains_key expression#3369
feat: support map_contains_key expression#3369peterxcli wants to merge 4 commits intoapache:mainfrom
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3369 +/- ##
============================================
+ Coverage 56.12% 59.98% +3.85%
- Complexity 976 1475 +499
============================================
Files 119 175 +56
Lines 11743 16172 +4429
Branches 2251 2681 +430
============================================
+ Hits 6591 9701 +3110
- Misses 4012 5119 +1107
- Partials 1140 1352 +212 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
|
||
| val mapKeysExpr = scalarFunctionExprToProto("map_keys", mapExpr) | ||
|
|
||
| val mapContainsKeyExpr = scalarFunctionExprToProto("array_has", mapKeysExpr, keyExpr) |
There was a problem hiding this comment.
I was trying to add the spark reference link as comment, but the scala style check was still failing even after I ran make format. so i think I would just leave it in the PR description.
spark/src/test/scala/org/apache/comet/CometMapExpressionSuite.scala
Outdated
Show resolved
Hide resolved
coderfender
left a comment
There was a problem hiding this comment.
left test related comments
spark/src/test/scala/org/apache/comet/CometMapExpressionSuite.scala
Outdated
Show resolved
Hide resolved
|
Thanks @peterxcli! Can we add/migrate the tests to the new framework? https://datafusion.apache.org/comet/contributor-guide/sql-file-tests.html |
|
|
||
| -- ConfigMatrix: parquet.enable.dictionary=false,true | ||
|
|
||
| -- TODO: replace map_from_arrays with map whenever map is supported in Comet |
| select map_contains_key(map_from_arrays(array(1, 2), array('a', 'b')), 1) | ||
|
|
||
| -- Decimal type coercion tests | ||
| -- TODO: requires map cast to be supported in Comet |
| select map_contains_key(map_from_arrays(array(1.0, 2), array('a', 'b')), 1) | ||
|
|
||
| -- Empty map tests | ||
| -- TODO: requires casting from NullType to be supported in Comet |
thanks @mbutrovich for the heads up. I've updated them. |
andygrove
left a comment
There was a problem hiding this comment.
Thanks for implementing this, @peterxcli. The approach of rewriting map_contains_key to array_has(map_keys(map), key) follows Spark's internal implementation nicely, and I appreciate that you migrated the tests to the new SQL file test framework as requested.
I have one concern about null key handling that I wanted to raise. In Spark, ArrayContains (which underlies MapContainsKey) follows SQL three-valued logic: if the array contains null elements and no match is found, it returns null rather than false. This is because the result is indeterminate - the value might or might not match the null element.
DataFusion's array_has function historically followed DuckDB semantics instead, returning false in this case. There's an open Comet issue discussing a similar inconsistency with arrays_overlap (#2036).
Could you verify how this behaves when a map has a null key? For example:
SELECT map_contains_key(map(1, 'a', null, 'b'), 5)
-- Spark should return NULL (key not found, but null key exists = indeterminate)
-- DataFusion might return falseIf there is a difference, it might be worth either adding a note about the incompatibility or investigating whether there's a Spark-compatible implementation path in the spark-expr crate.
That said, I noticed all CI checks are passing, which suggests the test suite didn't catch any compatibility issues. The test coverage looks good for the core functionality - empty maps, null maps, different key types, and missing keys are all covered. It might be worth adding a test specifically for maps with null keys to document the expected behavior and catch any regressions if the underlying DataFusion behavior changes.
This review was generated with AI assistance.
Which issue does this PR close?
Closes: #3164
Rationale for this change
Comet does not currently support the Spark
map_contains_keyfunction, causing queries using this function to fall back to Spark's JVM execution instead of running natively on DataFusion.The MapContainsKey expression checks whether a given key exists in a map. It is implemented as a runtime-replaceable expression that internally uses ArrayContains on the map's keys to perform the lookup.
Supporting this expression would allow more Spark workloads to benefit from Comet's native acceleration.
What changes are included in this PR?
MapContainsKey->CometMapContainsKeyCometMapContainsKeyto convert the expr toarray_has(map_keys(map), key)map_keys#1788 had implemented themap_keysfunction, so I also mark it as done.How are these changes tested?
map_contains_keyempty map tests in spark: