[python] support ray data sink to paimon #6883

FreemanDane · 2025-12-24T09:43:27Z

Purpose

Linked issue: close #xxx

Tests

API and Format

Documentation

XiaoHongbo-Hope · 2025-12-28T16:34:45Z

paimon-python/pypaimon/write/table_write.py

+        from pypaimon.write.ray_datasink import PaimonDatasink
+        datasink = PaimonDatasink(dataset, overwrite=overwrite)
+        dataset.write_datasink(datasink, concurrency=parallelism)
+


We could name it write_ray, to match write_pandas, write_arrow, and so on.

XiaoHongbo-Hope · 2025-12-28T16:39:18Z

paimon-python/pypaimon/write/ray_datasink.py

+        table_write = self.writer_builder.new_write()
+        for block in blocks:
+            block_arrow: pa.Table = BlockAccessor.for_block(block).to_arrow()
+            table_write.write_arrow(block_arrow)


I am afraid to_arrow will use a lot of memory for large blocks. Could we introduce a streaming or iterable approach instead?
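One possible shape for this, as a minimal sketch rather than the PR's implementation: hand the writer bounded record batches instead of one whole table. It assumes `table_write` exposes `write_arrow_batch` (as pypaimon's batch writer appears to) and that the block is already a `pyarrow.Table`. The helper name `write_in_batches` and the `max_rows` default are hypothetical. This only bounds how much is passed to the writer per call; it does not avoid the conversion cost of `to_arrow` itself.

```python
def write_in_batches(table_write, arrow_table, max_rows=10_000):
    """Write an Arrow table to table_write in record batches of at most max_rows rows.

    Assumptions (not confirmed by the PR): table_write has write_arrow_batch(),
    and arrow_table has to_batches(max_chunksize=...), as pyarrow.Table does.
    Returns the total number of rows handed to the writer.
    """
    if max_rows <= 0:
        raise ValueError("max_rows must be positive")
    written = 0
    # Slice the table into bounded record batches and write them one at a time.
    for batch in arrow_table.to_batches(max_chunksize=max_rows):
        table_write.write_arrow_batch(batch)
        written += batch.num_rows
    return written
```

Inside the sink's write loop this would replace the single `table_write.write_arrow(block_arrow)` call with `write_in_batches(table_write, block_arrow)`.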

XiaoHongbo-Hope · 2025-12-28T16:43:56Z

paimon-python/pypaimon/write/ray_datasink.py

+        staging bucket in S3.
+        """
+        self.writer_builder: WriteBuilder= self.table.new_batch_write_builder()
+        if self.overwrite:


Could you please add a test showing that writer_builder is serializable?
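A rough sketch of what such a check could look like, assuming a plain `pickle` round trip is an acceptable proxy. Ray serializes with cloudpickle, which as far as I know accepts more than plain `pickle`, so this would be the more conservative check. The helper name `is_picklable` is hypothetical.

```python
import pickle
import threading


def is_picklable(obj):
    """Return True if obj survives a pickle round trip, False otherwise."""
    try:
        pickle.loads(pickle.dumps(obj))
        return True
    except (pickle.PicklingError, TypeError, AttributeError):
        return False


# Plain data round-trips without trouble.
plain_state = {"table": "db.tbl", "overwrite": True}
# An object holding a lock (or an open client) typically does not.
lock_holder = threading.Lock()

print(is_picklable(plain_state), is_picklable(lock_holder))
```

In the PR's test module the assertion would then be something like `assert is_picklable(datasink.writer_builder)` after the builder has been created.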

XiaoHongbo-Hope · 2025-12-28T16:45:50Z

paimon-python/pypaimon/write/ray_datasink.py

+        """
+        table_commit = self.writer_builder.new_commit()
+        table_commit.commit([commit_message for commit_messages in write_result.write_returns for commit_message in commit_messages])
+        table_commit.close()


We should handle the write-failure case too.
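For illustration, a minimal sketch of the failure handling, not the PR's code. It assumes the commit object offers `commit`, `abort`, and `close`, and that the writer offers `write_arrow`, `prepare_commit`, and `close`; the `abort` call in particular is an assumption about pypaimon. The helper names `write_blocks` and `commit_all` are hypothetical.

```python
from itertools import chain


def write_blocks(table_write, arrow_tables):
    """Write every table, return the commit messages, and always close the writer."""
    try:
        for arrow_table in arrow_tables:
            table_write.write_arrow(arrow_table)
        return table_write.prepare_commit()
    finally:
        # Runs on success and on failure, so a failed task does not leak the writer.
        table_write.close()


def commit_all(table_commit, write_returns):
    """Flatten per-task commit messages and commit them; abort if the commit fails."""
    messages = list(chain.from_iterable(write_returns))
    try:
        table_commit.commit(messages)
    except Exception:
        # Assumed API: abort() discards the files referenced by these messages.
        table_commit.abort(messages)
        raise
    finally:
        table_commit.close()
    return len(messages)
```

`commit_all` would sit in the sink's `on_write_complete`. If a write task itself fails, Ray calls `on_write_failed` on the datasink instead, which is where any remaining cleanup or logging would go.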

XiaoHongbo-Hope · 2025-12-28T17:01:04Z

paimon-python/pypaimon/write/table_write.py

+    def write_raydata(self, dataset, overwrite=False, parallelism=1):
+        from pypaimon.write.ray_datasink import PaimonDatasink
+        datasink = PaimonDatasink(dataset, overwrite=overwrite)
+        dataset.write_datasink(datasink, concurrency=parallelism)


This passes dataset, but PaimonDatasink's __init__ method needs a table here.

XiaoHongbo-Hope · 2025-12-28T17:03:57Z

paimon-python/pypaimon/write/table_write.py

        return self

+    def write_raydata(self, dataset, overwrite=False, parallelism=1):
+        from pypaimon.write.ray_datasink import PaimonDatasink


How would a user obtain the dataset in non-test code? Could you add sample code for that?
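As a rough illustration of what that sample might look like: a Ray dataset usually comes from one of Ray Data's readers or from in-memory rows. The helper name `load_dataset` and the example path are hypothetical, and the final `write_raydata` call in the comments simply follows the signature in this diff, which may still change.

```python
def load_dataset(parquet_path=None, rows=None):
    """Build a ray.data.Dataset from a Parquet path or from in-memory rows."""
    if parquet_path is None and rows is None:
        raise ValueError("provide parquet_path or rows")
    import ray  # imported lazily so this module loads without Ray installed

    if parquet_path is not None:
        # Reads one or more Parquet files into a distributed dataset.
        return ray.data.read_parquet(parquet_path)
    # Builds a small dataset from a list of dicts, handy for examples.
    return ray.data.from_items(rows)


# Hypothetical usage, following the method signature proposed in this PR:
#   dataset = load_dataset(rows=[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}])
#   table_write = table.new_batch_write_builder().new_write()
#   table_write.write_raydata(dataset, overwrite=False, parallelism=2)
```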

longyun.fs added 2 commits December 24, 2025 17:42

[python] support ray data sink to paimon

08374b4

[python] update api, add test

f469021

XiaoHongbo-Hope reviewed Dec 28, 2025


2 participants
