
fix: replace() returns wrong result for empty-string search #3391

Open
wattt3 wants to merge 1 commit into apache:main from wattt3:fix/replace-empty-string

Conversation

@wattt3 commented Feb 4, 2026

Which issue does this PR close?

Closes #3344.

Rationale for this change

Native `replace()` currently returns an incorrect result when the search string is empty (see #3344); this change makes Comet's behavior consistent with Spark.

What changes are included in this PR?

  • Added a SparkStringReplace UDF that handles the empty search string case by inserting the replacement between every character and at boundaries.
  • Updated QueryPlanSerde to use the new spark_replace instead of DataFusion's replace.
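
The empty-search handling described above can be sketched as follows. This is a minimal standalone sketch of the described semantics, not the actual `SparkStringReplace` UDF; the function name is hypothetical:

```rust
// Hypothetical sketch: insert `replacement` before every character of
// `input` and once more at the trailing boundary, as described for the
// empty-search-string case.
fn replace_empty_search(input: &str, replacement: &str) -> String {
    let mut out = String::new();
    for ch in input.chars() {
        out.push_str(replacement); // boundary / between-characters insertion
        out.push(ch);
    }
    out.push_str(replacement); // trailing boundary
    out
}
```

Under this logic, `replace_empty_search("hello", "x")` yields `xhxexlxlxox`, which matches the output shown in the CI failure log later in the thread.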

How are these changes tested?

  • Rust unit test for empty search string behavior including null handling.
  • Spark integration test in CometStringExpressionSuite comparing results with Spark's output.

@wattt3 marked this pull request as ready for review February 4, 2026 14:02
}
}

test("replace with empty search string") {
Member
Could you add these tests in the new sql file testing approach instead?

https://datafusion.apache.org/comet/contributor-guide/sql-file-tests.html

@andygrove (Member)

Thanks @wattt3. LGTM. I left a comment about moving the tests to the new sql file based approach. I will be happy to approve the PR once that is done.

@andygrove (Member)

There is a test failure:

2026-02-04T16:17:16.3295718Z - replace with empty search string *** FAILED *** (166 milliseconds)
2026-02-04T16:17:16.3296584Z   Results do not match for query:
2026-02-04T16:17:16.3299606Z   Timezone: sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-28800000,dstSavings=3600000,useDaylight=true,transitions=311,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-28800000,dstSavings=3600000,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=7200000,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=7200000,endTimeMode=0]]
2026-02-04T16:17:16.3302580Z   Timezone Env: 
2026-02-04T16:17:16.3302956Z   
2026-02-04T16:17:16.3303402Z   == Parsed Logical Plan ==
2026-02-04T16:17:16.3304730Z   Project [replace(col#89982, , x) AS replace(col, , x)#89993, replace(col#89982, , ) AS replace(col, , )#89994]
2026-02-04T16:17:16.3306566Z   +- SubqueryAlias spark_catalog.default.test
2026-02-04T16:17:16.3307192Z      +- Relation spark_catalog.default.test[col#89982] parquet
2026-02-04T16:17:16.3307674Z   
2026-02-04T16:17:16.3307942Z   == Analyzed Logical Plan ==
2026-02-04T16:17:16.3308425Z   replace(col, , x): string, replace(col, , ): string
2026-02-04T16:17:16.3309283Z   Project [replace(col#89982, , x) AS replace(col, , x)#89993, replace(col#89982, , ) AS replace(col, , )#89994]
2026-02-04T16:17:16.3310070Z   +- SubqueryAlias spark_catalog.default.test
2026-02-04T16:17:16.3310839Z      +- Relation spark_catalog.default.test[col#89982] parquet
2026-02-04T16:17:16.3311330Z   
2026-02-04T16:17:16.3311631Z   == Optimized Logical Plan ==
2026-02-04T16:17:16.3312688Z   Project [replace(col#89982, , x) AS replace(col, , x)#89993, replace(col#89982, , ) AS replace(col, , )#89994]
2026-02-04T16:17:16.3313560Z   +- Relation spark_catalog.default.test[col#89982] parquet
2026-02-04T16:17:16.3314032Z   
2026-02-04T16:17:16.3314277Z   == Physical Plan ==
2026-02-04T16:17:16.3314578Z   *(1) CometColumnarToRow
2026-02-04T16:17:16.3315628Z   +- CometProject [replace(col, , x)#89993, replace(col, , )#89994], [replace(col#89982, , x) AS replace(col, , x)#89993, replace(col#89982, , ) AS replace(col, , )#89994]
2026-02-04T16:17:16.3318449Z      +- CometScan [native_iceberg_compat] parquet spark_catalog.default.test[col#89982] Batched: true, DataFilters: [], Format: CometParquet, Location: InMemoryFileIndex(1 paths)[file:/__w/datafusion-comet/datafusion-comet/spark/spark-warehouse/test], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<col:string>
2026-02-04T16:17:16.3320372Z   
2026-02-04T16:17:16.3320776Z   == Results ==
2026-02-04T16:17:16.3321058Z   
2026-02-04T16:17:16.3321294Z   == Results ==
2026-02-04T16:17:16.3321798Z   !== Correct Answer - 3 ==                                   == Spark Answer - 3 ==
2026-02-04T16:17:16.3322827Z    struct<replace(col, , x):string,replace(col, , ):string>   struct<replace(col, , x):string,replace(col, , ):string>
2026-02-04T16:17:16.3323753Z   ![,]                                                        [null,null]
2026-02-04T16:17:16.3324349Z   ![hello,hello]                                              [x,]
2026-02-04T16:17:16.3325067Z   ![null,null]                                                [xhxexlxlxox,hello] (QueryTest.scala:244)
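
For reference, the "Correct Answer" column in the log above implies that Spark leaves the input unchanged when the search string is empty. A minimal sketch of that behavior (the function name is hypothetical, and this is not Comet's actual implementation):

```rust
// Hypothetical sketch of the semantics implied by the "Correct Answer"
// column above: an empty search string leaves the input untouched.
fn spark_compatible_replace(input: &str, search: &str, replacement: &str) -> String {
    if search.is_empty() {
        // replace(col, '', 'x') => col, per the expected results in the log
        input.to_string()
    } else {
        input.replace(search, replacement)
    }
}
```

This would make `replace('hello', '', 'x')` return `hello`, matching the expected rows in the failure output.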

@wattt3 (Author) commented Feb 5, 2026

Hi @andygrove, thanks for the review.
Due to environment issues, I couldn't run the tests locally at the time.
I have switched to the SQL file-based approach, but the result differs from what the issue describes.
[screenshot of test output omitted]



Development

Successfully merging this pull request may close these issues.

Native replace() returns wrong result for empty-string search

2 participants