feat: Add NpyCodec for lazy-loading numpy arrays #1331

dimitri-yatsenko · 2026-01-12T22:04:29Z

Summary

This PR introduces the <npy@> codec for schema-addressed numpy array storage with lazy loading capabilities within the Object-Augmented Schema (OAS).

Key Features

Lazy loading: Inspect array shape and dtype without downloading data
Memory mapping: Random access to large arrays via mmap_mode parameter
NumPy integration: Transparent array operations via __array__ protocol
Safe bulk fetch: Returns NpyRef objects instead of downloading all arrays
Portable format: Standard .npy files readable by NumPy, MATLAB, etc.
Schema-addressed: Paths mirror database structure ({schema}/{table}/{pk}/{attr}.npy)
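
The lazy-loading feature relies on the fact that a `.npy` file stores shape and dtype in a small header ahead of the array data. The sketch below shows one way to read only that header with `numpy.lib.format`. It is an illustration of the idea, not the PR's `NpyRef` implementation, and `peek_npy_header` is a hypothetical helper name.

```python
import io

import numpy as np
from numpy.lib import format as npy_format


def peek_npy_header(fileobj):
    """Return (shape, dtype) from a .npy stream without reading the array data."""
    version = npy_format.read_magic(fileobj)
    if version == (1, 0):
        shape, _fortran_order, dtype = npy_format.read_array_header_1_0(fileobj)
    elif version == (2, 0):
        shape, _fortran_order, dtype = npy_format.read_array_header_2_0(fileobj)
    else:
        raise ValueError(f"unsupported .npy format version: {version}")
    return shape, dtype


# Write a small array to an in-memory buffer, then inspect only its header.
buf = io.BytesIO()
np.save(buf, np.zeros((1000,), dtype=np.float64))
total_bytes = len(buf.getvalue())

buf.seek(0)
shape, dtype = peek_npy_header(buf)
header_bytes = buf.tell()  # position after the header; array data is untouched
```

Against a remote store, the same idea would amount to a ranged read of the first few hundred bytes of the object, which is why metadata access can be far cheaper than a full download.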

OAS Codec Strategy

The Object-Augmented Schema integrates relational tables with object storage as a single system. Codecs define how Python objects are stored.

Built-in codecs (no extra dependencies):

<blob> / <blob@> - DataJoint legacy serialization (hash-addressed)
<npy@> - portable numpy arrays with lazy loading (schema-addressed) ← this PR
<object@> - files/folders (schema-addressed)
<hash@> - raw bytes (hash-addressed)

Separate packages (optional install, additional dependencies):

datajoint-zarr → <zarr@> (requires zarr)
datajoint-parquet → <parquet@> (requires pyarrow)
datajoint-tiff → <tiff@> (requires tifffile)

This keeps datajoint-python lean while allowing users to install codec packages as needed. The SchemaCodec base class introduced here enables community-contributed codecs.

OAS Addressing Schemes

| Scheme | Path Derived From | Deduplication | Codecs |
|---|---|---|---|
| Hash-addressed | Content hash (MD5) | Yes | `<hash@>`, `<blob@>`, `<attach@>` |
| Schema-addressed | Schema structure | No | `<object@>`, `<npy@>` |
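
To make the deduplication column concrete, here is a minimal sketch of the two addressing ideas. Both function names, the `_hash/` prefix, and the `key=value` primary-key layout are assumptions made for illustration; the actual path layout used by DataJoint may differ.

```python
import hashlib


def hash_addressed_path(content: bytes) -> str:
    """Path depends only on the bytes, so identical content shares one object."""
    return f"_hash/{hashlib.md5(content).hexdigest()}"  # hypothetical prefix


def schema_addressed_path(schema: str, table: str, pk: dict, attr: str) -> str:
    """Path depends on where the value lives in the schema, not on its bytes."""
    pk_part = "/".join(f"{name}={value}" for name, value in pk.items())
    return f"{schema}/{table}/{pk_part}/{attr}.npy"


payload = b"identical array bytes"

# Hash-addressed: the same payload always resolves to the same path.
hash_a = hash_addressed_path(payload)
hash_b = hash_addressed_path(payload)

# Schema-addressed: two rows get two paths even if they hold the same payload.
row_1 = schema_addressed_path("lab", "neuron", {"neuron_id": 1}, "activity")
row_2 = schema_addressed_path("lab", "neuron", {"neuron_id": 2}, "activity")
```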

Changes

New classes:

SchemaCodec - Abstract base class for schema-addressed codecs
NpyRef - Lazy reference with metadata access (shape, dtype, is_loaded)
NpyCodec - Codec implementation using .npy format

Refactoring:

ObjectCodec now inherits from SchemaCodec (DRY)
Renamed is_external → is_store throughout codebase for clarity

API:

Export SchemaCodec and NpyRef from datajoint public API

Usage Example

import datajoint as dj
import numpy as np

@schema
class Neuron(dj.Imported):
    definition = """
    -> Session
    neuron_id : int16
    ---
    activity : <npy@mystore>    # Lazy-loading array
    """

# Fetch returns NpyRef, not the array
ref = (Neuron & key).fetch1('activity')
print(ref.shape)      # (1000,) - no download!
print(ref.dtype)      # float64 - no download!

# Load when ready
array = ref.load()

# Memory-mapped for large arrays (random access without full load)
mapped = ref.load(mmap_mode='r')
chunk = mapped[100:200]  # Only reads needed portion (within the (1000,) shape)

# Or use directly with NumPy (auto-loads)
mean = np.mean(ref)
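
The auto-loading behaviour in the last line comes from NumPy's `__array__` protocol: NumPy functions call `__array__` on objects that are not already arrays. The sketch below uses a hypothetical `LazyArrayRef` class as a stand-in for `NpyRef`. It only illustrates the protocol and makes no claim about how `NpyRef` is written.

```python
import numpy as np


class LazyArrayRef:
    """Hypothetical stand-in for NpyRef: defers loading until first use."""

    def __init__(self, loader):
        self._loader = loader  # callable that returns the full ndarray
        self._cache = None

    @property
    def is_loaded(self):
        return self._cache is not None

    def load(self):
        # Load once, then reuse the cached array on later calls.
        if self._cache is None:
            self._cache = self._loader()
        return self._cache

    def __array__(self, dtype=None, copy=None):
        # Called by NumPy whenever the object is used where an array is expected.
        arr = self.load()
        if dtype is not None and arr.dtype != dtype:
            return arr.astype(dtype)
        return arr.copy() if copy else arr


ref = LazyArrayRef(lambda: np.arange(10, dtype=np.float64))
loaded_before = ref.is_loaded
mean = np.mean(ref)  # triggers __array__, which triggers load()
loaded_after = ref.is_loaded
```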

Memory Mapping

The mmap_mode parameter enables efficient random access to large arrays:

# Local filesystem stores: mmap directly (no copy)
arr = ref.load(mmap_mode='r')

# Remote stores (S3): downloads to cache first, then mmaps
arr = ref.load(mmap_mode='r')

Modes: 'r' (read-only), 'r+' (read-write), 'c' (copy-on-write)
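
These modes match those of `numpy.load`. As a rough illustration of what they mean on a local file, the following sketch uses plain NumPy rather than `NpyRef`, with a temporary file standing in for a filesystem store.

```python
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "activity.npy")
    np.save(path, np.arange(10_000, dtype=np.float64))

    # 'r': read-only mapping; slicing touches only the pages that are needed.
    read_only = np.load(path, mmap_mode="r")
    chunk = np.array(read_only[1000:2000])  # copy the slice into regular memory

    # 'c': copy-on-write; changes stay in memory and never reach the file.
    cow = np.load(path, mmap_mode="c")
    cow[0] = -1.0
    cow_first = float(cow[0])
    on_disk_first = float(np.load(path)[0])

    # Drop the mappings before the directory is removed (matters on Windows).
    del read_only, cow
```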

Comparison: `<npy@>` vs `<blob@>`

| Feature | `<npy@>` | `<blob@>` |
|---|---|---|
| Lazy loading | ✓ shape/dtype without I/O | ✗ always downloads |
| Memory mapping | ✓ via `mmap_mode` | ✗ not supported |
| OAS addressing | Schema-addressed | Hash-addressed |
| Format | Portable `.npy` | DataJoint serialization |
| Bulk fetch | Safe (returns refs) | Downloads everything |

Test Plan

Unit tests for NpyRef metadata access
Unit tests for NpyRef mmap_mode (local and remote)
Unit tests for NpyCodec validation
Integration tests for roundtrip encode/decode
Integration tests for lazy loading and caching
Integration tests for NumPy function integration
All 639 tests pass

Add migrate_external() and migrate_filepath() to datajoint.migrate module for safe migration of 0.x external storage columns to 2.0 JSON format.

Migration strategy:
1. Add new <column>_v2 columns with JSON type
2. Copy and convert data from old columns
3. User verifies data accessible via DataJoint 2.0
4. Finalize: rename columns (old → _v1, new → original)

This allows 0.x and 2.0 to coexist during migration and provides rollback capability if issues are discovered.

Functions:
- migrate_external(schema, dry_run=True, finalize=False)
- migrate_filepath(schema, dry_run=True, finalize=False)
- _find_external_columns(schema) - detect 0.x external columns
- _find_filepath_columns(schema) - detect 0.x filepath columns

Co-Authored-By: Claude Opus 4.5 <[email protected]>

Implement the `<npy@>` codec for schema-addressed numpy array storage:
- Add SchemaCodec base class for path-addressed storage codecs
- Add NpyRef class for lazy array references with metadata
- Add NpyCodec using .npy format with shape/dtype inspection
- Refactor ObjectCodec to inherit from SchemaCodec
- Rename is_external to is_store throughout codebase
- Export SchemaCodec and NpyRef from public API
- Bump version to 2.0.0a17

Key features:
- Lazy loading: inspect shape/dtype without downloading
- NumPy integration via __array__ protocol
- Safe bulk fetch: returns NpyRef objects, not arrays
- Schema-addressed paths: {schema}/{table}/{pk}/{attr}.npy

Co-Authored-By: Claude Opus 4.5 <[email protected]>

The SchemaCodec (used by NpyCodec and ObjectCodec) needs _schema, _table, _field, and primary key values to construct schema-addressed storage paths. Previously, key=None was passed, resulting in "unknown/unknown" paths. Now builds proper context dict from table metadata and row values, enabling navigable paths like: {schema}/{table}/objects/{pk_path}/{attribute}.npy Co-Authored-By: Claude Opus 4.5 <[email protected]>
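
The context-building fix described in this commit message can be pictured with the sketch below. The `_schema`, `_table`, and `_field` keys come from the message itself. The function name, the convention that all other keys are primary key values, and the `key=value` path segments are assumptions for illustration only. Raising on incomplete context, rather than substituting a placeholder such as "unknown", keeps bugs visible.

```python
def build_schema_path(context: dict, ext: str = ".npy") -> str:
    """Build {schema}/{table}/objects/{pk_path}/{attribute}{ext} from a context dict."""
    for name in ("_schema", "_table", "_field"):
        if not context.get(name):
            raise ValueError(f"storage context is missing {name!r}")

    # Assumption: every key without a leading underscore is a primary key value.
    pk = {k: v for k, v in context.items() if not k.startswith("_")}
    if not pk:
        raise ValueError("storage context has no primary key values")

    pk_path = "/".join(f"{name}={value}" for name, value in pk.items())
    return f"{context['_schema']}/{context['_table']}/objects/{pk_path}/{context['_field']}{ext}"


context = {
    "_schema": "lab",
    "_table": "neuron",
    "_field": "activity",
    "session_id": 7,
    "neuron_id": 3,
}
path = build_schema_path(context)

# An incomplete context fails loudly instead of producing a misleading path.
try:
    build_schema_path({"_schema": "lab", "neuron_id": 3})
    failed = False
except ValueError:
    failed = True
```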

Merge PR #1330 (blob preview display) into feature/npy-codec. Bump version from 2.0.0a17 to 2.0.0a18. Co-Authored-By: Claude Opus 4.5 <[email protected]>

Address reviewer feedback from PR #1330: attr should never be None since field_name comes from heading.names. Raising an error surfaces bugs immediately rather than silently returning a misleading placeholder. Co-Authored-By: Claude Opus 4.5 <[email protected]>

Support memory-mapped loading for large arrays:
- Local filesystem stores: mmap directly, no download
- Remote stores: download to cache, then mmap

Co-Authored-By: Claude Opus 4.5 <[email protected]>

github-actions bot added the enhancement and feature labels Jan 12, 2026

dimitri-yatsenko force-pushed the feature/npy-codec branch from 8d7c92e to 08d5c6a January 12, 2026 22:12

dimitri-yatsenko requested a review from d-v-b January 12, 2026 22:13

dimitri-yatsenko self-assigned this Jan 12, 2026

dimitri-yatsenko and others added 7 commits January 12, 2026 16:29

Merge remote-tracking branch 'origin/enhance/blob-preview-display' into feature/npy-codec

14d3da6

chore: Merge enhance/blob-preview-display and bump to 2.0.0a18

9f6826e

Merge remote-tracking branch 'origin/pre/v2.0' into feature/npy-codec

6b951d4

feat: Add mmap_mode parameter to NpyRef.load()

12ea814

fix: Remove unused variable in mmap test

c02a882

dimitri-yatsenko requested a review from ttngu207 January 13, 2026 00:21
