
feat: support map_contains_key expression#3369

Open
peterxcli wants to merge 4 commits into apache:main from peterxcli:feat/map_contains_key

Conversation

@peterxcli
Member

Which issue does this PR close?

Closes: #3164

Rationale for this change

Comet does not currently support the Spark map_contains_key function, causing queries using this function to fall back to Spark's JVM execution instead of running natively on DataFusion.

The MapContainsKey expression checks whether a given key exists in a map. It is implemented as a runtime-replaceable expression that internally uses ArrayContains on the map's keys to perform the lookup.

Supporting this expression would allow more Spark workloads to benefit from Comet's native acceleration.
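As a sketch of that rewrite (with hypothetical literal values; Spark's array_contains plays the role that DataFusion's array_has does in the native plan), the following two queries should be equivalent:

```sql
-- map_contains_key(m, k) is rewritten to a membership test over the map's keys
SELECT map_contains_key(map(1, 'a', 2, 'b'), 2);
-- internally equivalent to:
SELECT array_contains(map_keys(map(1, 'a', 2, 'b')), 2);
```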

What changes are included in this PR?

How are these changes tested?

  • Generated map-type data and used it to verify the correctness of map_contains_key
  • Empty-map tests

Empty-map tests in Spark:


@peterxcli changed the title from "[Feature] Support Spark expression: map_contains_key" to "feat: support map_contains_key expression" on Feb 3, 2026
@codecov-commenter

codecov-commenter commented Feb 3, 2026

Codecov Report

❌ Patch coverage is 37.50000% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.98%. Comparing base (f09f8af) to head (105cadd).
⚠️ Report is 918 commits behind head on main.

Files with missing lines Patch % Lines
...k/src/main/scala/org/apache/comet/serde/maps.scala 16.66% 5 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #3369      +/-   ##
============================================
+ Coverage     56.12%   59.98%   +3.85%     
- Complexity      976     1475     +499     
============================================
  Files           119      175      +56     
  Lines         11743    16172    +4429     
  Branches       2251     2681     +430     
============================================
+ Hits           6591     9701    +3110     
- Misses         4012     5119    +1107     
- Partials       1140     1352     +212     



val mapKeysExpr = scalarFunctionExprToProto("map_keys", mapExpr)

val mapContainsKeyExpr = scalarFunctionExprToProto("array_has", mapKeysExpr, keyExpr)
Contributor


👍

Member Author


I was trying to add the Spark reference link as a comment, but the Scala style check kept failing even after I ran make format, so I'll just leave it in the PR description:

https://github.com/apache/spark/blob/branch-4.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L233

Contributor

@coderfender coderfender left a comment


Left test-related comments.

@mbutrovich
Contributor

Thanks @peterxcli! Can we add/migrate the tests to the new framework?

https://datafusion.apache.org/comet/contributor-guide/sql-file-tests.html


-- ConfigMatrix: parquet.enable.dictionary=false,true

-- TODO: replace map_from_arrays with map whenever map is supported in Comet
Member Author


select map_contains_key(map_from_arrays(array(1, 2), array('a', 'b')), 1)

-- Decimal type coercion tests
-- TODO: requires map cast to be supported in Comet
Member Author


select map_contains_key(map_from_arrays(array(1.0, 2), array('a', 'b')), 1)

-- Empty map tests
-- TODO: requires casting from NullType to be supported in Comet
Member Author


@peterxcli
Member Author

> Thanks @peterxcli! Can we add/migrate the tests to the new framework?

Thanks @mbutrovich for the heads-up. I've updated them.

Member

@andygrove andygrove left a comment


Thanks for implementing this, @peterxcli. The approach of rewriting map_contains_key to array_has(map_keys(map), key) follows Spark's internal implementation nicely, and I appreciate that you migrated the tests to the new SQL file test framework as requested.

I have one concern about null key handling that I wanted to raise. In Spark, ArrayContains (which underlies MapContainsKey) follows SQL three-valued logic: if the array contains null elements and no match is found, it returns null rather than false. This is because the result is indeterminate - the value might or might not match the null element.

DataFusion's array_has function historically followed DuckDB semantics instead, returning false in this case. There's an open Comet issue discussing a similar inconsistency with arrays_overlap (#2036).

Could you verify how this behaves when a map has a null key? For example:

SELECT map_contains_key(map(1, 'a', null, 'b'), 5)
-- Spark should return NULL (key not found, but null key exists = indeterminate)
-- DataFusion might return false

If there is a difference, it might be worth either adding a note about the incompatibility or investigating whether there's a Spark-compatible implementation path in the spark-expr crate.

That said, I noticed all CI checks are passing, which suggests the test suite didn't catch any compatibility issues. The test coverage looks good for the core functionality - empty maps, null maps, different key types, and missing keys are all covered. It might be worth adding a test specifically for maps with null keys to document the expected behavior and catch any regressions if the underlying DataFusion behavior changes.


This review was generated with AI assistance.


Development

Successfully merging this pull request may close these issues.

[Feature] Support Spark expression: map_contains_key
