dimitri-yatsenko commented Jan 12, 2026

Summary

This PR introduces the <npy@> codec for schema-addressed numpy array storage with lazy loading capabilities within the Object-Augmented Schema (OAS).

Key Features

  • Lazy loading: Inspect array shape and dtype without downloading data
  • Memory mapping: Random access to large arrays via mmap_mode parameter
  • NumPy integration: Transparent array operations via __array__ protocol
  • Safe bulk fetch: Returns NpyRef objects instead of downloading all arrays
  • Portable format: Standard .npy files readable by NumPy, MATLAB, etc.
  • Schema-addressed: Paths mirror database structure ({schema}/{table}/{pk}/{attr}.npy)
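
The lazy-loading feature relies on a property of the .npy format: shape and dtype live in a small header at the start of the file. The sketch below shows that idea with NumPy's public `numpy.lib.format` helpers on a local file. The `peek_npy` function is a hypothetical name for illustration, not the PR's implementation, and a remote store would need to fetch only the leading bytes instead of opening a local path.

```python
import os
import tempfile

import numpy as np
from numpy.lib import format as npy_format


def peek_npy(path):
    """Return (shape, dtype) by reading only the .npy header (hypothetical helper)."""
    with open(path, "rb") as fp:
        version = npy_format.read_magic(fp)  # (major, minor) format version
        if version == (1, 0):
            shape, _fortran_order, dtype = npy_format.read_array_header_1_0(fp)
        elif version == (2, 0):
            shape, _fortran_order, dtype = npy_format.read_array_header_2_0(fp)
        else:
            raise ValueError(f"unsupported .npy version: {version}")
    return shape, dtype


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "activity.npy")
    np.save(path, np.zeros(1000, dtype=np.float64))
    # Only the header is parsed here; the 8000 bytes of array data are not read.
    shape, dtype = peek_npy(path)
```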

OAS Codec Strategy

The Object-Augmented Schema integrates relational tables with object storage as a single system. Codecs define how Python objects are stored.

Built-in codecs (no extra dependencies):

  • <blob> / <blob@> - DataJoint legacy serialization (hash-addressed)
  • <npy@> - portable numpy arrays with lazy loading (schema-addressed) ← this PR
  • <object@> - files/folders (schema-addressed)
  • <hash@> - raw bytes (hash-addressed)

Separate packages (optional install, additional dependencies):

  • datajoint-zarr → <zarr@> (requires zarr)
  • datajoint-parquet → <parquet@> (requires pyarrow)
  • datajoint-tiff → <tiff@> (requires tifffile)

This keeps datajoint-python lean while allowing users to install codec packages as needed. The SchemaCodec base class introduced here enables community-contributed codecs.

OAS Addressing Schemes

| Scheme | Path Derived From | Deduplication | Codecs |
|---|---|---|---|
| Hash-addressed | Content hash (MD5) | Yes | <hash@>, <blob@>, <attach@> |
| Schema-addressed | Schema structure | No | <object@>, <npy@> |
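
To make the difference between the two schemes concrete, here is a minimal sketch of how each kind of path could be derived. Both functions and their exact layouts (the `_hash/` prefix, the `name=value` primary-key segments) are assumptions for illustration; only the MD5 hash and the `{schema}/{table}/{pk}/{attr}.npy` pattern come from this PR.

```python
import hashlib


def hash_addressed_path(content: bytes) -> str:
    """Hypothetical layout: the path depends only on the stored bytes."""
    digest = hashlib.md5(content).hexdigest()
    return f"_hash/{digest[:2]}/{digest}"


def schema_addressed_path(schema: str, table: str, pk: dict, attr: str) -> str:
    """Hypothetical layout following {schema}/{table}/{pk}/{attr}.npy."""
    pk_part = "/".join(f"{name}={value}" for name, value in pk.items())
    return f"{schema}/{table}/{pk_part}/{attr}.npy"


# Identical content maps to one path, which is what enables deduplication.
payload = b"same bytes"
hash_a = hash_addressed_path(payload)
hash_b = hash_addressed_path(payload)

# Each row gets its own path, even if the stored arrays were identical.
path_1 = schema_addressed_path("lab", "neuron", {"session_id": 1, "neuron_id": 7}, "activity")
path_2 = schema_addressed_path("lab", "neuron", {"session_id": 1, "neuron_id": 8}, "activity")
```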

Changes

New classes:

  • SchemaCodec - Abstract base class for schema-addressed codecs
  • NpyRef - Lazy reference with metadata access (shape, dtype, is_loaded)
  • NpyCodec - Codec implementation using .npy format
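
As a rough illustration of how a lazy reference can cooperate with NumPy, the sketch below defers loading until the `__array__` protocol is invoked. `LazyArray` is a hypothetical stand-in written for this example; it is not the NpyRef class from this PR and omits store access, caching policy, and metadata.

```python
import os
import tempfile

import numpy as np


class LazyArray:
    """Hypothetical lazy reference; loads the file on first use."""

    def __init__(self, path):
        self._path = path
        self._cache = None

    @property
    def is_loaded(self):
        return self._cache is not None

    def load(self):
        if self._cache is None:
            self._cache = np.load(self._path)
        return self._cache

    def __array__(self, dtype=None, copy=None):
        # NumPy calls this when the object is passed to np.asarray, np.mean, etc.
        arr = self.load()
        if dtype is not None and arr.dtype != dtype:
            arr = arr.astype(dtype)
        elif copy:
            arr = arr.copy()
        return arr


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "activity.npy")
    np.save(path, np.arange(10, dtype=np.float64))
    ref = LazyArray(path)
    loaded_before = ref.is_loaded   # nothing read yet
    mean = float(np.mean(ref))      # triggers __array__, which loads the data
    loaded_after = ref.is_loaded
```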

Refactoring:

  • ObjectCodec now inherits from SchemaCodec (DRY)
  • Renamed is_external → is_store throughout codebase for clarity

API:

  • Export SchemaCodec and NpyRef from datajoint public API

Usage Example

import datajoint as dj
import numpy as np

@schema
class Neuron(dj.Imported):
    definition = """
    -> Session
    neuron_id : int16
    ---
    activity : <npy@mystore>    # Lazy-loading array
    """

# Fetch returns NpyRef, not the array
ref = (Neuron & key).fetch1('activity')
print(ref.shape)      # (1000,) - no download!
print(ref.dtype)      # float64 - no download!

# Load when ready
array = ref.load()

# Memory-mapped for large arrays (random access without full load)
mapped = ref.load(mmap_mode='r')
chunk = mapped[100:200]  # Only reads the needed portion

# Or use directly with NumPy (auto-loads)
mean = np.mean(ref)

Memory Mapping

The mmap_mode parameter enables efficient random access to large arrays:

# The same call works for every store; what happens depends on the backend:
#   - Local filesystem stores: mmap directly (no copy)
#   - Remote stores (S3): download to cache first, then mmap
arr = ref.load(mmap_mode='r')

Modes: 'r' (read-only), 'r+' (read-write), 'c' (copy-on-write)
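
For readers unfamiliar with memory mapping, the following sketch shows the underlying NumPy behavior on a local file using `np.load(..., mmap_mode='r')`. It is plain NumPy rather than the `ref.load()` method from this PR, and the file name and array size are made up for the example.

```python
import os
import shutil
import tempfile

import numpy as np

tmp = tempfile.mkdtemp()
path = os.path.join(tmp, "big.npy")
np.save(path, np.arange(10_000, dtype=np.float64))

# mmap_mode='r' returns a read-only np.memmap backed by the file on disk.
mapped = np.load(path, mmap_mode="r")
is_memmap = isinstance(mapped, np.memmap)
is_writeable = mapped.flags.writeable

# Slicing touches only the pages that hold these 1000 elements;
# np.array(...) copies them into ordinary memory.
chunk = np.array(mapped[1000:2000])

del mapped  # release the mapping before removing the file
shutil.rmtree(tmp, ignore_errors=True)
```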

Comparison: <npy@> vs <blob@>

| Feature | <npy@> | <blob@> |
|---|---|---|
| Lazy loading | ✓ shape/dtype without I/O | ✗ always downloads |
| Memory mapping | ✓ via mmap_mode | ✗ not supported |
| OAS addressing | Schema-addressed | Hash-addressed |
| Format | Portable .npy | DataJoint serialization |
| Bulk fetch | Safe (returns refs) | Downloads everything |

Test Plan

  • Unit tests for NpyRef metadata access
  • Unit tests for NpyRef mmap_mode (local and remote)
  • Unit tests for NpyCodec validation
  • Integration tests for roundtrip encode/decode
  • Integration tests for lazy loading and caching
  • Integration tests for NumPy function integration
  • All 639 tests pass

Related

  • Documentation: datajoint/datajoint-docs (pre/v2.0 branch)
    • how-to/use-npy-codec.md
    • reference/specs/npy-codec.md
    • tutorials/ephys-with-npy.ipynb

🤖 Generated with Claude Code

Add migrate_external() and migrate_filepath() to datajoint.migrate module
for safe migration of 0.x external storage columns to 2.0 JSON format.

Migration strategy:
1. Add new <column>_v2 columns with JSON type
2. Copy and convert data from old columns
3. User verifies data accessible via DataJoint 2.0
4. Finalize: rename columns (old → _v1, new → original)

This allows 0.x and 2.0 to coexist during migration and provides
rollback capability if issues are discovered.

Functions:
- migrate_external(schema, dry_run=True, finalize=False)
- migrate_filepath(schema, dry_run=True, finalize=False)
- _find_external_columns(schema) - detect 0.x external columns
- _find_filepath_columns(schema) - detect 0.x filepath columns
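
The column-level part of the strategy above (steps 1 and 4) can be sketched as a planner that only builds SQL strings. `plan_column_migration` is a hypothetical helper, not one of the functions in datajoint.migrate; the statements assume MySQL 8.0+ syntax for `RENAME COLUMN`, and the copy-and-convert step is left out because it depends on the stored data.

```python
def plan_column_migration(table: str, column: str, finalize: bool = False) -> list[str]:
    """Return SQL statements for one column; nothing is executed (hypothetical)."""
    new_name = f"{column}_v2"
    old_name = f"{column}_v1"
    if not finalize:
        # Step 1: add the JSON column alongside the old one so 0.x keeps working.
        return [f"ALTER TABLE `{table}` ADD COLUMN `{new_name}` JSON"]
    # Step 4: keep the old column as _v1 for rollback and promote the new one.
    return [
        f"ALTER TABLE `{table}` RENAME COLUMN `{column}` TO `{old_name}`",
        f"ALTER TABLE `{table}` RENAME COLUMN `{new_name}` TO `{column}`",
    ]


prepare = plan_column_migration("session", "image")
finish = plan_column_migration("session", "image", finalize=True)
```

Returning the statements instead of running them mirrors the intent of a dry run: the plan can be reviewed before anything changes.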

Co-Authored-By: Claude Opus 4.5 <[email protected]>
github-actions bot added the enhancement and feature labels Jan 12, 2026
Implement the `<npy@>` codec for schema-addressed numpy array storage:

- Add SchemaCodec base class for path-addressed storage codecs
- Add NpyRef class for lazy array references with metadata
- Add NpyCodec using .npy format with shape/dtype inspection
- Refactor ObjectCodec to inherit from SchemaCodec
- Rename is_external to is_store throughout codebase
- Export SchemaCodec and NpyRef from public API
- Bump version to 2.0.0a17

Key features:
- Lazy loading: inspect shape/dtype without downloading
- NumPy integration via __array__ protocol
- Safe bulk fetch: returns NpyRef objects, not arrays
- Schema-addressed paths: {schema}/{table}/{pk}/{attr}.npy

Co-Authored-By: Claude Opus 4.5 <[email protected]>
dimitri-yatsenko and others added 7 commits January 12, 2026 16:29
The SchemaCodec (used by NpyCodec and ObjectCodec) needs _schema,
_table, _field, and primary key values to construct schema-addressed
storage paths. Previously, key=None was passed, resulting in
"unknown/unknown" paths.

Now builds proper context dict from table metadata and row values,
enabling navigable paths like:
  {schema}/{table}/objects/{pk_path}/{attribute}.npy

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Merge PR #1330 (blob preview display) into feature/npy-codec.
Bump version from 2.0.0a17 to 2.0.0a18.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Address reviewer feedback from PR #1330: attr should never be None
since field_name comes from heading.names. Raising an error surfaces
bugs immediately rather than silently returning a misleading placeholder.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Support memory-mapped loading for large arrays:
- Local filesystem stores: mmap directly, no download
- Remote stores: download to cache, then mmap

Co-Authored-By: Claude Opus 4.5 <[email protected]>