ZarrStore

ZarrStore is a DataModel that points at a Zarr store root and provides methods for inspecting and reading its arrays.

Unlike File, which represents a single byte stream, a Zarr store is a tree of objects (metadata and chunks). Initializing a DataChain from Zarr stores collapses every object under each store root into a single ZarrStore row:

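import datachain as dc

# Build a chain over the Zarr stores found under the prefix.
chain = dc.read_zarr("s3://bucket-name/data/")
# Take the first row's "zarr" column (a ZarrStore) and print its summary.
for (store,) in chain.limit(1).to_iter("zarr"):
    print(store.get_info())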

There are additional models for working with Zarr stores:

  • ZarrInfo - summary metadata for a store (format, array paths, attributes).
  • ZarrArray - a single array within a store; exposes shape, chunks, dtype, and attrs, and reads data via read() or select().
  • ZarrSelection - a lazy, bounded region inside an array (e.g. one image frame) that can travel through a chain as a column and is materialized on demand via read() or rendered to image bytes via read_bytes().
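
The models above compose in a fixed order: a store yields arrays, and an array yields lazy selections. A minimal sketch of that flow follows, using only the members documented on this page (get_array, shape, select). The helper name iter_frames and whatever array path is passed to it are hypothetical, and the function is duck-typed, so it works with any object that exposes the same methods.

```python
from collections.abc import Iterator
from typing import Any


def iter_frames(store: Any, array_path: str, limit: int) -> Iterator[Any]:
    """Yield up to `limit` lazy selections along the leading axis of one array.

    `store` is expected to behave like a ZarrStore. `iter_frames` itself is a
    hypothetical helper for illustration, not part of DataChain.
    """
    # get_array() raises ValueError when the path is not a Zarr array.
    arr = store.get_array(array_path)
    # The leading axis is treated as the item (frame) axis.
    count = min(limit, arr.shape[0])
    for i in range(count):
        # select() only records the index; no chunk data is read here.
        yield arr.select(i, media="image")
```

Each yielded item can be materialized later with read(), or rendered with read_bytes().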

For a complete example of Zarr processing with DataChain, see Embedding Zarr image frames - a pipeline that reads RGB camera frames from a directory of Zarr stores and encodes them with OpenCLIP.

ZarrStore

Bases: DataModel

A Zarr store root.

Unlike File (datachain.lib.file.File), a store is a tree of objects (metadata and chunks) rather than a single byte stream, so it is modeled as a plain DataModel. The nested file points at the store root prefix and carries the storage credentials/catalog needed to read the store.

get_array

get_array(path: str = '') -> ZarrArray

Return a single array by its path within the store.

Source code in datachain/lib/zarr.py
def get_array(self, path: str = "") -> "ZarrArray":
    """Return a single array by its path within the store."""
    node = self._open()
    arr = node[path] if path else node
    if not isinstance(arr, zarr.Array):
        raise ValueError(  # noqa: TRY004
            f"'{path}' is not a Zarr array in store {self.path!r}"
        )
    return self._to_array(arr, path)

get_arrays

get_arrays() -> Iterator[ZarrArray]

Yield every array in the store (recursively).

Source code in datachain/lib/zarr.py
def get_arrays(self) -> Iterator["ZarrArray"]:
    """Yield every array in the store (recursively)."""
    yield from self._arrays(self._open())

get_info

get_info() -> ZarrInfo

Return summary metadata for the store.

Source code in datachain/lib/zarr.py
def get_info(self) -> ZarrInfo:
    """Return summary metadata for the store."""
    node = self._open()
    return ZarrInfo(
        zarr_format=int(node.metadata.zarr_format),
        arrays=[a.path for a in self._arrays(node)],
        attrs=dict(node.attrs),
    )
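
As the constructor call in get_info() shows, a ZarrInfo carries three fields: zarr_format, arrays, and attrs. A small sketch of summarizing them is given here; summarize_info is a hypothetical helper, and it accepts any object with those three attributes.

```python
from typing import Any


def summarize_info(info: Any) -> str:
    """Build a one-line summary from a ZarrInfo-like object (hypothetical helper)."""
    paths = ", ".join(info.arrays) if info.arrays else "(no arrays)"
    return (
        f"zarr v{info.zarr_format}: {len(info.arrays)} array(s) [{paths}], "
        f"{len(info.attrs)} attr(s)"
    )
```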

ZarrArray

Bases: DataModel

A single array within a ZarrStore.

read

read(selection: Any = None) -> Any

Read array data, optionally restricted to a NumPy-style selection.

Source code in datachain/lib/zarr.py
def read(self, selection: Any = None) -> Any:
    """Read array data, optionally restricted to a NumPy-style selection."""
    node = self.store._open()
    arr = node[self.path] if self.path else node
    if selection is None:
        return arr[...]
    return arr[selection]
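
The selection argument is passed straight to the array's indexing operator. The sketch below mirrors that branching on an in-memory NumPy array standing in for the Zarr array. read_like is a hypothetical name, and the sketch assumes Zarr indexing behaves like NumPy for the integer and slice selections shown.

```python
import numpy as np


def read_like(arr: np.ndarray, selection=None):
    """Mirror ZarrArray.read's branching on a NumPy stand-in (illustrative only)."""
    if selection is None:
        return arr[...]      # no selection: the whole array
    return arr[selection]    # an int, a slice, or a tuple of them


frames = np.zeros((10, 4, 6, 3), dtype=np.uint8)   # (N, H, W, C)

whole = read_like(frames)                                 # all ten frames
one_frame = read_like(frames, 2)                          # frame 2 only
crop = read_like(frames, (2, slice(0, 2), slice(1, 4)))   # a crop of frame 2
```

Passing a narrow selection keeps the result limited to the requested region instead of the full array.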

select

select(
    index: int | list[int],
    media: Literal["image", "audio", "video"] | None = None,
) -> ZarrSelection

Return a lazy ZarrSelection pointing at an item in this array.

index addresses the leading axes (e.g. i or [i] for one frame of an (N, H, W, C) array). The region is read on demand via ZarrSelection.read(), so the item can travel through a DataChain as a column without materializing its bytes.

Source code in datachain/lib/zarr.py
def select(
    self,
    index: "int | list[int]",
    media: "Literal['image', 'audio', 'video'] | None" = None,
) -> "ZarrSelection":
    """Return a lazy :class:`ZarrSelection` pointing at an item in this array.

    ``index`` addresses the leading axes (e.g. ``i`` or ``[i]`` for one
    frame of an ``(N, H, W, C)`` array).  The region is read on demand via
    :meth:`ZarrSelection.read`, so the item can travel through a DataChain
    as a column without materializing its bytes.
    """
    idx = [index] if isinstance(index, int) else list(index)
    return ZarrSelection(array=self, index=idx, media=media)
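
The normalization in select() and the tuple conversion in ZarrSelection.read() together turn index into a lookup on the leading axes. A sketch of that round trip on a NumPy stand-in follows; normalize_index is a hypothetical name for the one-line expression in the source, and the array shape is made up for illustration.

```python
import numpy as np


def normalize_index(index):
    """Same expression select() uses: a bare int becomes a one-element list."""
    return [index] if isinstance(index, int) else list(index)


video = np.arange(2 * 3 * 4 * 5).reshape(2, 3, 4, 5)   # e.g. (clip, frame, H, W)

# i and [i] address the same item along the first axis.
same = normalize_index(1) == normalize_index([1]) == [1]
item = video[tuple(normalize_index(1))]         # one clip, shape (3, 4, 5)

# Two indices address the two leading axes.
block = video[tuple(normalize_index([1, 2]))]   # one frame, shape (4, 5)
```

Going by the expression in the source, a NumPy integer scalar such as np.int64(1) is not an int instance, so list() would be called on it and raise TypeError; converting with int(i) first avoids that.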

ZarrSelection

Bases: DataModel

A lazy, bounded region inside a ZarrArray.

Points at a single item (or block) inside an array without reading it, analogous to how File (datachain.lib.file.File) points at a byte stream. index addresses the leading axes; read() materializes the region.

read

read() -> Any

Read and return the selected region.

Source code in datachain/lib/zarr.py
def read(self) -> Any:
    """Read and return the selected region."""
    return self.array.read(tuple(self.index))

read_bytes

read_bytes(format: str = 'PNG') -> bytes

Render the selected region to encoded media bytes.

Only media="image" is supported for now: the region is read and encoded with Pillow (e.g. PNG), so callers such as Studio can stream a preview without materializing the image into the row.

Source code in datachain/lib/zarr.py
def read_bytes(self, format: str = "PNG") -> bytes:
    """Render the selected region to encoded media bytes.

    Only ``media="image"`` is supported for now: the region is read and
    encoded with Pillow (e.g. PNG), so callers such as Studio can stream a
    preview without materializing the image into the row.
    """
    if self.media not in (None, "image"):
        raise ValueError(f"read_bytes() supports image media, not {self.media!r}")
    import io

    import numpy as np
    from PIL import Image

    # Normalize e.g. "jpg"/".png" to a registered Pillow format name, with a
    # plain upper-cased fallback (mirrors VideoFrame.read_bytes without the
    # optional video dependency).
    ext = format if format.startswith(".") else f".{format}"
    pil_format = Image.registered_extensions().get(ext.lower(), format.upper())

    arr = np.asarray(self.read())
    if arr.dtype != np.uint8:
        arr = arr.astype("uint8")
    buf = io.BytesIO()
    Image.fromarray(arr).save(buf, format=pil_format)
    return buf.getvalue()
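
Note that the source casts non-uint8 data with astype("uint8") and does not rescale it, so float images stored in the range [0, 1] would come out almost entirely black. A sketch of rescaling such data before encoding is shown here. to_uint8 is a hypothetical helper, and the [0, 1] input range is an assumption about the data, not something the library checks.

```python
import numpy as np


def to_uint8(arr: np.ndarray) -> np.ndarray:
    """Scale float data assumed to lie in [0, 1] to uint8 (hypothetical helper)."""
    if arr.dtype == np.uint8:
        return arr
    # Clip to the assumed range, stretch to 0..255, then round to nearest.
    scaled = np.clip(arr.astype(np.float64), 0.0, 1.0) * 255.0
    return np.rint(scaled).astype(np.uint8)


floats = np.array([0.0, 0.5, 1.0])

plain_cast = floats.astype("uint8")   # a bare cast truncates: [0, 0, 1]
rescaled = to_uint8(floats)           # rescaled first: [0, 128, 255]
```

With a helper like this, a caller could take the result of read(), rescale it, and encode it with Pillow directly instead of calling read_bytes().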

ZarrInfo

Bases: DataModel

Summary metadata for a Zarr store.