
Functions

Built-in functions manipulate and analyze data by operating directly on the underlying database that stores the chain data. They are typically used in operations such as DataChain.filter and DataChain.mutate.

Functions are organized by category and accessed through their respective modules. For example, string functions are accessed via func.string.length() and array functions via func.array.contains().

Global Function Access

Only a subset of functions is available directly from datachain.func (e.g., func.length). Access most functions through their module namespace (e.g., func.string.length) to avoid naming conflicts.
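The naming-conflict point can be illustrated without DataChain at all. The sketch below uses two hypothetical stand-in namespaces (plain Python objects, not the library's modules) that each expose a function called length with a different meaning, which is why the module prefix matters.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for two function modules.
# These are NOT DataChain's implementations, only an analogy.
string = SimpleNamespace(length=lambda value: len(str(value)))
array = SimpleNamespace(length=lambda items: len(list(items)))

tags = ["alpha", "beta"]

# Same function name, different meaning depending on the namespace.
string_result = string.length("alpha")  # number of characters in a string
array_result = array.length(tags)       # number of elements in an array

print(string_result, array_result)
```

Writing the namespace explicitly keeps the intent visible at the call site, whereas a bare length could refer to either variant.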

Function Categories

DataChain provides several categories of functions for different types of operations, each in its own module: aggregate, array, conditional, numeric, path, random, string, and window.

Func

Func(
    name: str,
    inner: Callable,
    cols: Sequence[ColT] | None = None,
    args: Sequence[Any] | None = None,
    kwargs: dict[str, Any] | None = None,
    result_type: DataType | None = None,
    type_from_args: Callable[..., DataType] | None = None,
    is_array: bool = False,
    from_array: bool = False,
    is_window: bool = False,
    window: Window | None = None,
    label: str | None = None,
)

Bases: Function

A built-in function applied to dataset columns, created by calling functions from the func module.

There are three kinds of functions:

  • Row-level — transform each row independently, used in mutate, filter, and merge: func.path.file_stem(C("file.path")), func.string.length(C("name"))
  • Aggregate — collapse rows into a single value, used in group_by: func.count(), func.sum("file.size"), func.avg("score")
  • Window — compute over a partition of rows, require .over(): func.row_number().over(window), func.rank().over(window)
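
To make the difference between the three kinds concrete, here is a plain-Python analogy on a small list of rows. It does not call DataChain. The row data and variable names are invented for illustration, and the real functions are evaluated by the database rather than in Python loops.

```python
from collections import defaultdict

rows = [
    {"name": "a.txt", "group": "x", "size": 10},
    {"name": "bb.txt", "group": "x", "size": 30},
    {"name": "c.txt", "group": "y", "size": 5},
]

# Row-level: one output value per input row, computed independently.
name_lengths = [len(row["name"]) for row in rows]

# Aggregate: the rows of each group collapse into a single value.
size_by_group = defaultdict(int)
for row in rows:
    size_by_group[row["group"]] += row["size"]

# Window: every row is kept, but its value depends on the other rows
# in the same partition (here, a row number ordered by size).
partitions = defaultdict(list)
for row in rows:
    partitions[row["group"]].append(row)

row_numbers = {}
for members in partitions.values():
    ordered = sorted(members, key=lambda r: r["size"])
    for position, row in enumerate(ordered, start=1):
        row_numbers[row["name"]] = position
```

The row-level and window results have one entry per input row, while the aggregate result has one entry per group.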

Source code in datachain/func/func.py:
def __init__(
    self,
    name: str,
    inner: Callable,
    cols: Sequence[ColT] | None = None,
    args: Sequence[Any] | None = None,
    kwargs: dict[str, Any] | None = None,
    result_type: "DataType | None" = None,
    type_from_args: Callable[..., "DataType"] | None = None,
    is_array: bool = False,
    from_array: bool = False,
    is_window: bool = False,
    window: "Window | None" = None,
    label: str | None = None,
) -> None:
    self.name = name
    self.inner = inner
    self.cols = cols or []
    self.args = args or []
    self.kwargs = kwargs or {}
    self.result_type = result_type
    self.type_from_args = type_from_args
    self.is_array = is_array
    self.from_array = from_array
    self.is_window = is_window
    self.window = window
    self.col_label = label
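
Two details of the constructor above are easy to miss: None defaults are replaced with fresh empty containers, and the label argument is stored under the attribute name col_label. The sketch below reproduces only that logic in a hypothetical, cut-down class called MiniFunc. It is not the library's Func and omits most parameters.

```python
from typing import Any, Callable, Optional, Sequence


class MiniFunc:
    """Hypothetical stand-in that copies only the defaulting logic shown above."""

    def __init__(
        self,
        name: str,
        inner: Callable,
        cols: Optional[Sequence[Any]] = None,
        args: Optional[Sequence[Any]] = None,
        kwargs: Optional[dict] = None,
        label: Optional[str] = None,
    ) -> None:
        self.name = name
        self.inner = inner
        # `x or []` yields a new list per instance, so instances never share state.
        self.cols = cols or []
        self.args = args or []
        self.kwargs = kwargs or {}
        # The constructor argument is `label`, but the attribute is `col_label`.
        self.col_label = label


first = MiniFunc("length", len, label="n")
second = MiniFunc("length", len)

first.cols.append("name")  # does not affect `second`
```

Because the defaults are None rather than mutable literals in the signature, appending to first.cols leaves second.cols empty.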

Usage

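from datachain.func import aggregate, array, conditional, numeric, path, random, string, window

# `chain` is assumed to be an existing DataChain instance; column names are placeholders.
# Access functions through their module namespaces.
chain = chain.mutate(
    text_length=string.length("text_column"),
    contains_item=array.contains("array_column", "value"),
    file_extension=path.file_ext("file_path"),
)

# Some commonly used functions are also available directly.
# Note: importing `sum` this way shadows Python's built-in `sum` in this module.
from datachain.func import sum, count, length, ifelse

# `sum` is an aggregate, so it is used in group_by ("category" is a placeholder column).
totals = chain.group_by(total=sum("amount"), partition_by="category")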
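
One consequence of the direct-import style is worth spelling out: a name such as sum replaces Python's built-in of the same name for the rest of the module. The sketch below demonstrates the effect with a hypothetical stand-in function rather than the real DataChain aggregate, so it runs without the library installed.

```python
import builtins


def sum(column):
    """Hypothetical stand-in for an imported aggregate; returns a description string."""
    return f"SUM({column})"


# After the definition (or import) above, the bare name refers to the new function.
expression = sum("amount")

# The original built-in is still reachable through the builtins module.
total = builtins.sum([1, 2, 3])

print(expression, total)
```

Using the package namespace instead (func.sum, func.count, as written elsewhere on this page) sidesteps the shadowing entirely.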