Functions
Built-in functions for data manipulation and analysis operate directly on the underlying database that stores the chain data. They are typically used in operations such as DataChain.filter and DataChain.mutate.
Functions are organized by category and accessed through their respective modules. For example, string functions live under func.string (func.string.length()) and array functions under func.array (func.array.contains()).
Global Function Access
Only a subset of functions is available directly from datachain.func (e.g., func.length). Access most functions through their specific module namespace (e.g., func.string.length) to avoid naming conflicts.
Function Categories
DataChain provides several categories of functions for different types of operations:
- Aggregate Functions - Functions for aggregating data like sum, count, avg, etc.
- Array Functions - Functions for working with arrays and lists
- Conditional Functions - Functions for conditional logic like ifelse, case, etc.
- Numeric Functions - Functions for numeric operations and computations
- Path Functions - Functions for working with file paths
- Random Functions - Functions for generating random values
- String Functions - Functions for string manipulation and processing
- Window Functions - Functions for window operations
Func
Func(
name: str,
inner: Callable,
cols: Sequence[ColT] | None = None,
args: Sequence[Any] | None = None,
kwargs: dict[str, Any] | None = None,
result_type: DataType | None = None,
type_from_args: Callable[..., DataType] | None = None,
is_array: bool = False,
from_array: bool = False,
is_window: bool = False,
window: Window | None = None,
label: str | None = None,
)
Bases: Function
A built-in function applied to dataset columns, created by calling functions
from the func module.
There are three kinds of functions:
- Row-level: transform each row independently, used in mutate, filter, and merge: func.path.file_stem(C("file.path")), func.string.length(C("name"))
- Aggregate: collapse rows into a single value, used in group_by: func.count(), func.sum("file.size"), func.avg("score")
- Window: compute over a partition of rows, require .over(): func.row_number().over(window), func.rank().over(window)
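The difference between the three kinds can be sketched in plain Python. This is only an illustration of the semantics and does not use DataChain itself; the sample rows and variable names are hypothetical, and in DataChain the equivalent work is expressed through func calls and evaluated by the database.

```python
from collections import defaultdict

# Hypothetical sample rows; not taken from any real dataset.
rows = [
    {"name": "a.txt", "kind": "text", "size": 30},
    {"name": "b.txt", "kind": "text", "size": 10},
    {"name": "c.png", "kind": "image", "size": 20},
]

# Row-level: one output per input row, computed from that row alone
# (comparable in spirit to a string length function).
name_lengths = [len(r["name"]) for r in rows]

# Aggregate: all rows in a group collapse into a single value
# (comparable in spirit to a sum inside a group-by).
totals = defaultdict(int)
for r in rows:
    totals[r["kind"]] += r["size"]
size_by_kind = dict(totals)

# Window: every row is kept, but its value depends on the other rows in
# its partition (here: position within its kind, ordered by size).
partitions = defaultdict(list)
for r in rows:
    partitions[r["kind"]].append(r)

row_numbers = {}
for part in partitions.values():
    ordered = sorted(part, key=lambda r: r["size"])
    for position, r in enumerate(ordered, start=1):
        row_numbers[r["name"]] = position
```

Row-level and window results keep one value per input row, while the aggregate result has one value per group. That is why aggregates belong in group_by and window functions need a window definition via .over().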
Source code in datachain/func/func.py
Usage
from datachain.func import aggregate, array, conditional, numeric, path, random, string, window

# Access functions through their module namespaces
# (dc is assumed to be an existing DataChain instance)
dc.mutate(
    text_length=string.length("text_column"),
    contains_item=array.contains("array_column", "value"),
    file_extension=path.file_ext("file_path"),
)

# Some commonly used functions are also available directly
from datachain.func import sum, count, length, ifelse

dc.mutate(total=sum("amount"))
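One thing to keep in mind with direct imports is that a name such as sum replaces Python's built-in of the same name in the importing scope. The sketch below shows that effect with a hypothetical stand-in for the func namespace, so it runs without DataChain installed. The stand-in returns a string purely for illustration and is not how the real func.sum behaves.

```python
import builtins
from types import SimpleNamespace

# Hypothetical stand-in for the datachain.func namespace. The real func.sum
# builds a column expression; a string is returned here only for illustration.
func = SimpleNamespace(sum=lambda col: f"SUM({col})")


def with_direct_import():
    # Same effect as `from datachain.func import sum` inside this scope:
    # the name `sum` no longer refers to Python's built-in.
    sum = func.sum
    expr = sum("amount")
    # The built-in is still reachable, but only through the builtins module.
    total = builtins.sum([1, 2, 3])
    return expr, total


def with_namespace():
    # Keeping the namespace prefix leaves the built-in `sum` untouched.
    return func.sum("amount"), sum([1, 2, 3])
```

Both helpers return the same pair of results. The namespaced form simply avoids having to reach for builtins.sum whenever ordinary Python summation is needed in the same module.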