Add pixel chunking to specification #24
base: main
Conversation
📝 Walkthrough

This pull request adds comprehensive chunking support to the OME-Arrow library, enabling storage and retrieval of TCZYX image data in chunked format. New metadata structures, helper functions, and integration points across the ingest, export, transform, and visualization modules allow users to build, store, and extract chunked pixel data alongside the plane-based representation.
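For orientation, here is a representative chunk record under this scheme (field names are taken from the code under review; the values are purely illustrative):

```python
chunk_grid = {"chunk_order": "ZYX"}  # only ZYX is supported for now
chunk = {
    "t": 0, "c": 0,           # timepoint and channel the chunk belongs to
    "z": 0, "y": 0, "x": 0,   # chunk origin within the image
    "shape_z": 4, "shape_y": 64, "shape_x": 64,  # chunk extents
    "pixels": [0] * (4 * 64 * 64),  # flattened pixel values in ZYX order
}
```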
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant User
participant Ingest as Ingest Pipeline
participant Chunker as Chunk Builder
participant Meta as OME-Arrow Struct
participant Export as Export / Transform
participant Extract as Plane Extractor
User->>Ingest: Provide image planes + chunk params
activate Ingest
Ingest->>Chunker: Call _build_chunks_from_planes()
activate Chunker
Chunker->>Chunker: Split planes by chunk dimensions
Chunker-->>Ingest: Return chunk list with pixel data
deactivate Chunker
Ingest->>Meta: Construct OME_ARROW_STRUCT<br/>with chunk_grid & chunks fields
activate Meta
Meta-->>Ingest: StructScalar with chunking metadata
deactivate Meta
deactivate Ingest
Ingest-->>User: OME-Arrow record
User->>Export: Export or slice chunked data
activate Export
Export->>Chunker: Rebuild chunks (if slicing)
Chunker-->>Export: Updated chunk_grid & chunks
Export-->>User: Exported format (Zarr/Parquet) or sliced record
deactivate Export
User->>Extract: Extract single plane
activate Extract
Extract->>Extract: plane_from_chunks(data, t, c, z)
Extract->>Extract: Locate chunk(s) + validate indices
Extract->>Extract: Assemble or retrieve pixel data
Extract-->>User: 2D numpy array (Y, X)
deactivate Extract
```
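A rough round-trip sketch of the flow above, using only names that appear in this review (`from_numpy` with `chunk_shape`/`chunk_order`/`build_chunks`, and the `plane_from_chunks` extractor); actual module paths and signatures may differ:

```python
import numpy as np

from ome_arrow import export, ingest  # module names as referenced in this PR

# 5D TCZYX image: 1 timepoint, 2 channels, 8 z-slices, 64x64 pixels
img = np.random.randint(0, 1000, size=(1, 2, 8, 64, 64), dtype=np.uint16)

# Ingest with chunking enabled; parameter names follow the review comments
record = ingest.from_numpy(
    img, chunk_shape=(4, 32, 32), chunk_order="ZYX", build_chunks=True
)

# Pull a single (Y, X) plane back out of the chunked representation
plane = export.plane_from_chunks(record, t=0, c=1, z=3)
assert plane.shape == (64, 64)
```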
Estimated Code Review Effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

The changes introduce a substantial feature spanning multiple modules (ingest, export, transform, view, meta) with moderate-to-dense logic. Review requires tracing the chunking flow across new helper functions, validating schema changes, and assessing integration points with the existing plane-based and export pipelines. Moderate heterogeneity (chunking logic is distributed but follows consistent patterns) and reasonable test coverage partially offset the scope.
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@.pre-commit-config.yaml`:
- Around lines 41-42: The pre-commit hook for repo https://github.com/astral-sh/ruff-pre-commit was updated to rev "v0.14.14", which changes autofix behavior (many fixes are now considered unsafe and are no longer applied by default). Verify the new behavior by running the ruff-check hook against the repo/CI and confirm the reduced set of autofixes is acceptable; if you want the previous behavior, enable unsafe fixes by adding unsafe-fixes = true to your Ruff configuration (ruff.toml) or adjust the ruff-check hook settings so CI/local workflows apply the unsafe autofixes (a hook-level sketch follows).
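A hedged sketch of the hook-level opt-in (`--fix` and `--unsafe-fixes` are standard Ruff CLI flags; verify the exact spelling against the pinned release):

```yaml
# .pre-commit-config.yaml -- sketch only
- repo: https://github.com/astral-sh/ruff-pre-commit
  rev: v0.14.14
  hooks:
    - id: ruff-check
      args: [--fix, --unsafe-fixes]  # opt back in to unsafe autofixes
```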
In `@src/ome_arrow/export.py`:
- Around line 82-126: The loop over chunks currently only checks lower bounds
but must also validate that each chunk's extents do not exceed the image bounds
before reshaping/assignment: for each chunk (variables
t,c,z,y,x,shape_z,shape_y,shape_x) add checks that z + shape_z <= sz, and that y
+ shape_y <= sy and x + shape_x <= sx (you can get sy,sx from out.shape[3] and
out.shape[4]) and raise a clear ValueError like f"chunks[{i}] extent out of
range: z+shape_z={z+shape_z} > sz={sz}" (and analogous messages for y/x) before
computing expected_len/reshape and assigning into
out[t,c,z:z+shape_z,y:y+shape_y,x:x+shape_x]; this prevents silent
shape-mismatch errors during arr3d assignment.
- Around line 214-254: The code currently returns a zero-filled plane even when
no chunk matched and does not validate chunk bounds; update the chunks handling
in the export routine (the loop that iterates over chunks and uses variables
chunk_grid, chunks, strict, _cast_plane, plane) to (1) track a boolean like
any_chunk_matched set to True whenever a chunk is applied to the plane and only
return plane if any_chunk_matched is True, otherwise fall through to the
existing fallback or raise; and (2) validate chunk origins and shapes (z0, y0,
x0 >= 0 and z0+szc <= depth, y0+syc <= sy, x0+sxc <= sx) before
slicing/assigning and raise ValueError (or honor strict) for out-of-bounds or
negative origins so malformed chunks cannot silently write/clip into the plane.
In `@src/ome_arrow/ingest.py`:
- Around line 329-347: The slab allocation currently hardcodes dtype=np.uint16
which can truncate or misrepresent input planes; modify the code that builds
slabs (the slab variable in the loop using plane_map and the chunks.append
block) to infer the dtype from an available plane (e.g., inspect
plane_map[(t,c,z0 + zi)] or a representative plane before the zi loop) or accept
a dtype parameter, then allocate slab = np.zeros((sz, sy, sx),
dtype=inferred_dtype) and ensure pixels = slab.reshape(-1).tolist() preserves
that dtype/precision when serializing. Adjust any callers of this ingest routine
if you add a parameter so dtype flows through correctly.
🧹 Nitpick comments (3)
tests/test_chunks.py (1)
73-83: Consider adding edge case tests. The happy-path test is solid. Consider adding tests for the following (a sketch of the out-of-bounds case follows the list):
- Out-of-bounds indices (should raise appropriate errors)
- Partial/boundary chunks (chunks that don't align perfectly with image dimensions)
- Missing chunks scenario
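A minimal sketch of the out-of-bounds case; `make_chunked_record` is a hypothetical stand-in for whatever fixture the suite already uses, and `plane_from_chunks` is the extractor named elsewhere in this review:

```python
import pytest

def test_plane_from_chunks_rejects_out_of_range_z():
    # Hypothetical helper: builds a small chunked TCZYX record
    record = make_chunked_record(shape=(1, 1, 4, 8, 8))
    with pytest.raises(ValueError):
        plane_from_chunks(record, t=0, c=0, z=99)  # z far past size_z=4
```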
src/ome_arrow/transform.py (1)
11-11: Importing private helper functions across modules.

The `_build_chunks_from_planes` and `_normalize_chunk_shape` functions are prefixed with an underscore, indicating they're internal/private. Consider either:
- Renaming them without the underscore prefix if they're part of the intended API
- Keeping them internal and re-exporting them through `__all__` if needed externally (see the sketch below)

This is a minor concern since both modules are within the same package.
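For the second option, a minimal sketch (assuming the helpers live in `ome_arrow.ingest`, as the review context suggests):

```python
# Public re-exports for the underscore-prefixed helpers -- sketch only.
from ome_arrow.ingest import _build_chunks_from_planes as build_chunks_from_planes
from ome_arrow.ingest import _normalize_chunk_shape as normalize_chunk_shape

__all__ = ["build_chunks_from_planes", "normalize_chunk_shape"]
```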
src/ome_arrow/ingest.py (1)
756-775: Inconsistent chunking API across `from_*` functions.

`from_numpy` exposes `chunk_shape`, `chunk_order`, and `build_chunks` parameters, but `from_tiff`, `from_ome_zarr`, and `from_stack_pattern_path` don't expose them. Since `build_chunks=True` is the default in `to_ome_arrow`, chunks will be automatically built for all these functions. Consider either:
- Adding chunking parameters to all `from_*` functions for consistency (a signature sketch follows this list)
- Documenting that chunking is always enabled for these functions
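A signature-only sketch of the first option for `from_tiff` (parameter names mirror `from_numpy`; the keyword pass-through assumes `to_ome_arrow` accepts these arguments, as the comment above implies):

```python
from typing import Optional, Tuple

def from_tiff(
    path: str,
    *,
    chunk_shape: Optional[Tuple[int, int, int]] = None,
    chunk_order: str = "ZYX",
    build_chunks: bool = True,
):
    planes = ...  # existing TIFF -> planes logic stays unchanged
    return to_ome_arrow(
        planes,
        chunk_shape=chunk_shape,
        chunk_order=chunk_order,
        build_chunks=build_chunks,
    )
```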
Context from `src/ome_arrow/export.py`:

```python
chunks = data.get("chunks") or []
if chunks:
    chunk_grid = data.get("chunk_grid") or {}
    chunk_order = str(chunk_grid.get("chunk_order") or "ZYX").upper()
    if chunk_order != "ZYX":
        raise ValueError("Only chunk_order='ZYX' is supported for now.")

    for i, ch in enumerate(chunks):
        t = int(ch["t"])
        c = int(ch["c"])
        z = int(ch["z"])
        y = int(ch["y"])
        x = int(ch["x"])
        shape_z = int(ch["shape_z"])
        shape_y = int(ch["shape_y"])
        shape_x = int(ch["shape_x"])

        if not (0 <= t < st and 0 <= c < sc and 0 <= z < sz):
            raise ValueError(
                f"chunks[{i}] index out of range: (t,c,z)=({t},{c},{z})"
            )
        if y < 0 or x < 0 or shape_z <= 0 or shape_y <= 0 or shape_x <= 0:
            raise ValueError(f"chunks[{i}] has invalid shape or origin.")

        pix = ch["pixels"]
        try:
            n = len(pix)
        except Exception as e:
            raise ValueError(f"chunks[{i}].pixels is not a sequence") from e

        expected_len = shape_z * shape_y * shape_x
        if n != expected_len:
            if strict:
                raise ValueError(
                    f"chunks[{i}].pixels length {n} != expected {expected_len}"
                )
            if n > expected_len:
                pix = pix[:expected_len]
            else:
                pix = list(pix) + [0] * (expected_len - n)

        arr3d = np.asarray(pix).reshape(shape_z, shape_y, shape_x)
        arr3d = _cast_plane(arr3d)
        out[t, c, z : z + shape_z, y : y + shape_y, x : x + shape_x] = arr3d
return out
```
Validate chunk extents against image bounds.
Only lower bounds are checked; chunks that extend past Z/Y/X can yield confusing shape-mismatch errors during assignment. Add upper-bound validation before reshaping/assignment.
🔧 Proposed fix

```diff
         if y < 0 or x < 0 or shape_z <= 0 or shape_y <= 0 or shape_x <= 0:
             raise ValueError(f"chunks[{i}] has invalid shape or origin.")
+        if z + shape_z > sz or y + shape_y > sy or x + shape_x > sx:
+            raise ValueError(
+                f"chunks[{i}] exceeds image bounds for (z,y,x)=({z},{y},{x})"
+            )
```

🧰 Tools
🪛 Ruff (0.14.13)
87-87: Avoid specifying long messages outside the exception class (TRY003)
100-102: Avoid specifying long messages outside the exception class (TRY003)
104-104: Avoid specifying long messages outside the exception class (TRY003)
110-110: Avoid specifying long messages outside the exception class (TRY003)
115-117: Avoid specifying long messages outside the exception class (TRY003)
Context from `src/ome_arrow/export.py`:

```python
chunks = data.get("chunks") or []
if chunks:
    chunk_grid = data.get("chunk_grid") or {}
    chunk_order = str(chunk_grid.get("chunk_order") or "ZYX").upper()
    if chunk_order != "ZYX":
        raise ValueError("Only chunk_order='ZYX' is supported for now.")

    plane = np.zeros((sy, sx), dtype=dtype)
    for i, ch in enumerate(chunks):
        if int(ch["t"]) != t or int(ch["c"]) != c:
            continue
        z0 = int(ch["z"])
        szc = int(ch["shape_z"])
        if not (z0 <= z < z0 + szc):
            continue
        y0 = int(ch["y"])
        x0 = int(ch["x"])
        syc = int(ch["shape_y"])
        sxc = int(ch["shape_x"])
        pix = ch["pixels"]
        try:
            n = len(pix)
        except Exception as e:
            raise ValueError(f"chunks[{i}].pixels is not a sequence") from e
        expected_len = szc * syc * sxc
        if n != expected_len:
            if strict:
                raise ValueError(
                    f"chunks[{i}].pixels length {n} != expected {expected_len}"
                )
            if n > expected_len:
                pix = pix[:expected_len]
            else:
                pix = list(pix) + [0] * (expected_len - n)

        slab = np.asarray(pix).reshape(szc, syc, sxc)
        slab = _cast_plane(slab)
        zi = z - z0
        plane[y0 : y0 + syc, x0 : x0 + sxc] = slab[zi]

    return plane
```
Avoid returning empty planes when no chunk matches; validate chunk bounds.
If chunks exist but none intersect the requested plane, this currently returns a zero-filled plane and skips the planes fallback. Also, negative origins or over-extent chunks can slip through. Track whether any chunk matched and only return when it did; otherwise fall back (or error) and validate bounds.
🔧 Proposed fix

```diff
     plane = np.zeros((sy, sx), dtype=dtype)
+    matched = False
     for i, ch in enumerate(chunks):
         if int(ch["t"]) != t or int(ch["c"]) != c:
             continue
         z0 = int(ch["z"])
         szc = int(ch["shape_z"])
         if not (z0 <= z < z0 + szc):
             continue
         y0 = int(ch["y"])
         x0 = int(ch["x"])
         syc = int(ch["shape_y"])
         sxc = int(ch["shape_x"])
+        if (
+            z0 < 0
+            or y0 < 0
+            or x0 < 0
+            or szc <= 0
+            or syc <= 0
+            or sxc <= 0
+            or z0 + szc > sz
+            or y0 + syc > sy
+            or x0 + sxc > sx
+        ):
+            raise ValueError(f"chunks[{i}] exceeds image bounds.")
         pix = ch["pixels"]
         try:
             n = len(pix)
         except Exception as e:
             raise ValueError(f"chunks[{i}].pixels is not a sequence") from e
@@
         slab = np.asarray(pix).reshape(szc, syc, sxc)
         slab = _cast_plane(slab)
         zi = z - z0
         plane[y0 : y0 + syc, x0 : x0 + sxc] = slab[zi]
-
-    return plane
+        matched = True
+
+    if matched:
+        return plane
+    # fall through to plane-based path if no chunk matched
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
chunks = data.get("chunks") or []
if chunks:
    chunk_grid = data.get("chunk_grid") or {}
    chunk_order = str(chunk_grid.get("chunk_order") or "ZYX").upper()
    if chunk_order != "ZYX":
        raise ValueError("Only chunk_order='ZYX' is supported for now.")

    plane = np.zeros((sy, sx), dtype=dtype)
    matched = False
    for i, ch in enumerate(chunks):
        if int(ch["t"]) != t or int(ch["c"]) != c:
            continue
        z0 = int(ch["z"])
        szc = int(ch["shape_z"])
        if not (z0 <= z < z0 + szc):
            continue
        y0 = int(ch["y"])
        x0 = int(ch["x"])
        syc = int(ch["shape_y"])
        sxc = int(ch["shape_x"])
        if (
            z0 < 0
            or y0 < 0
            or x0 < 0
            or szc <= 0
            or syc <= 0
            or sxc <= 0
            or z0 + szc > sz
            or y0 + syc > sy
            or x0 + sxc > sx
        ):
            raise ValueError(f"chunks[{i}] exceeds image bounds.")
        pix = ch["pixels"]
        try:
            n = len(pix)
        except Exception as e:
            raise ValueError(f"chunks[{i}].pixels is not a sequence") from e
        expected_len = szc * syc * sxc
        if n != expected_len:
            if strict:
                raise ValueError(
                    f"chunks[{i}].pixels length {n} != expected {expected_len}"
                )
            if n > expected_len:
                pix = pix[:expected_len]
            else:
                pix = list(pix) + [0] * (expected_len - n)

        slab = np.asarray(pix).reshape(szc, syc, sxc)
        slab = _cast_plane(slab)
        zi = z - z0
        plane[y0 : y0 + syc, x0 : x0 + sxc] = slab[zi]
        matched = True

    if matched:
        return plane
    # fall through to plane-based path if no chunk matched
```
🧰 Tools
🪛 Ruff (0.14.13)
219-219: Avoid specifying long messages outside the exception class (TRY003)
237-237: Avoid specifying long messages outside the exception class (TRY003)
241-243: Avoid specifying long messages outside the exception class (TRY003)
Context from `src/ome_arrow/ingest.py`:

```python
slab = np.zeros((sz, sy, sx), dtype=np.uint16)
for zi in range(sz):
    plane = plane_map.get((t, c, z0 + zi))
    if plane is None:
        continue
    slab[zi] = plane[y0 : y0 + sy, x0 : x0 + sx]
chunks.append(
    {
        "t": t,
        "c": c,
        "z": z0,
        "y": y0,
        "x": x0,
        "shape_z": sz,
        "shape_y": sy,
        "shape_x": sx,
        "pixels": slab.reshape(-1).tolist(),
    }
)
```
Hardcoded np.uint16 dtype may cause data loss.
The slab array is always allocated as np.uint16 (line 329), but the input planes could have different dtypes. This could cause:
- Data truncation for float arrays
- Incorrect values for int32/int64 data
- Unnecessary conversion overhead for uint8 data
Consider inferring the dtype from the plane data or accepting it as a parameter.
🔧 Suggested fix

```diff
 def _build_chunks_from_planes(
     *,
     planes: List[Dict[str, Any]],
     size_t: int,
     size_c: int,
     size_z: int,
     size_y: int,
     size_x: int,
     chunk_shape: Optional[Tuple[int, int, int]],
     chunk_order: str = "ZYX",
+    dtype: np.dtype = np.uint16,
 ) -> List[Dict[str, Any]]:
```

Then use `dtype` instead of `np.uint16` at line 329:

```diff
-    slab = np.zeros((sz, sy, sx), dtype=np.uint16)
+    slab = np.zeros((sz, sy, sx), dtype=dtype)
```
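For the inference route mentioned above, a minimal sketch (variable names follow the snippet; this is an illustration, not the committed fix):

```python
# Infer the dtype from the first plane present in this slab's z-range,
# falling back to uint16 when none of the planes exist.
first_plane = next(
    (p for zi in range(sz) if (p := plane_map.get((t, c, z0 + zi))) is not None),
    None,
)
inferred_dtype = first_plane.dtype if first_plane is not None else np.uint16
slab = np.zeros((sz, sy, sx), dtype=inferred_dtype)
```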
Summary by CodeRabbit
Release Notes
New Features
Chores
Tests