Functions

Use built-in functions for data manipulation and analysis to operate on the underlying database storing the chain data. These functions are useful for operations like DataChain.filter and DataChain.mutate. Import these functions from datachain.func.

func

and_

and_(*args: Union[ColumnElement, Func]) -> Func

Returns a function that produces the conjunction of expressions joined by the logical AND operator.

Parameters:

  • args (ColumnElement | Func, default: () ) –

    The expressions for the AND statement. If a string is provided, it is assumed to be the name of the column. If a Column is provided, it is assumed to be a column in the dataset. If a Func is provided, it is assumed to be a function returning a value.

Returns:

  • Func ( Func ) –

    A Func object that represents the AND function.

Example
dc.mutate(
    test=ifelse(and_(isnone("name"), isnone("surname")), "Empty", "Not Empty")
)
Notes
  • The result column will always be of type bool.
Source code in datachain/func/conditional.py
def and_(*args: Union[ColumnElement, Func]) -> Func:
    """
    Returns the function that produces conjunction of expressions joined by AND
    logical operator.

    Args:
        args (ColumnElement | Func): The expressions for AND statement.
            If a string is provided, it is assumed to be the name of the column.
            If a Column is provided, it is assumed to be a column in the dataset.
            If a Func is provided, it is assumed to be a function returning a value.

    Returns:
        Func: A `Func` object that represents the AND function.

    Example:
        ```py
        dc.mutate(
            test=ifelse(and_(isnone("name"), isnone("surname")), "Empty", "Not Empty")
        )
        ```

    Notes:
        - The result column will always be of type bool.
    """
    cols, func_args = [], []

    for arg in args:
        if isinstance(arg, (str, Func)):
            cols.append(arg)
        else:
            func_args.append(arg)

    return Func("and", inner=sql_and, cols=cols, args=func_args, result_type=bool)
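
To make the AND example above more concrete, the sketch below runs a hand-written SQL analogue with Python's built-in sqlite3 module. The `people` table, its rows, and the query text are assumptions made for illustration; this is not the SQL that DataChain actually generates.

```python
import sqlite3

# Hypothetical in-memory table standing in for a dataset with two nullable columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER, name TEXT, surname TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [(1, None, None), (2, "Ada", None), (3, "Alan", "Turing")],
)

# Hand-written analogue of ifelse(and_(isnone("name"), isnone("surname")), ...).
rows = conn.execute(
    """
    SELECT id,
           CASE WHEN name IS NULL AND surname IS NULL
                THEN 'Empty' ELSE 'Not Empty' END AS test
    FROM people
    ORDER BY id
    """
).fetchall()
conn.close()

print(rows)
```

Only the row in which both columns are NULL is labelled "Empty", because AND requires every joined condition to hold.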

any_value

any_value(col: Union[str, Column]) -> Func

Returns the ANY_VALUE aggregate SQL function for the given column name.

The ANY_VALUE function returns an arbitrary value from the specified column. It is useful when you do not care which particular value is returned, as long as it comes from one of the rows in the group.

Parameters:

  • col (str | Column) –

    The name of the column from which to return an arbitrary value. Column can be specified as a string or a Column object.

Returns:

  • Func ( Func ) –

    A Func object that represents the ANY_VALUE aggregate function.

Example
dc.group_by(
    file_example=func.any_value("file.path"),
    signal_example=func.any_value(dc.C("signal.value")),
    partition_by="signal.category",
)
Notes
  • The any_value function can be used with any type of column.
  • The result column will have the same type as the input column.
  • The result of any_value is non-deterministic, meaning it may return different values for different executions.
Source code in datachain/func/aggregate.py
def any_value(col: Union[str, Column]) -> Func:
    """
    Returns the ANY_VALUE aggregate SQL function for the given column name.

    The ANY_VALUE function returns an arbitrary value from the specified column.
    It is useful when you do not care which particular value is returned,
    as long as it comes from one of the rows in the group.

    Args:
        col (str | Column): The name of the column from which to return
            an arbitrary value.
            Column can be specified as a string or a `Column` object.

    Returns:
        Func: A Func object that represents the ANY_VALUE aggregate function.

    Example:
        ```py
        dc.group_by(
            file_example=func.any_value("file.path"),
            signal_example=func.any_value(dc.C("signal.value")),
            partition_by="signal.category",
        )
        ```

    Notes:
        - The `any_value` function can be used with any type of column.
        - The result column will have the same type as the input column.
        - The result of `any_value` is non-deterministic,
          meaning it may return different values for different executions.
    """
    return Func("any_value", inner=aggregate.any_value, cols=[col])
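
Because the result is non-deterministic, code that consumes an `any_value` column should only rely on the value belonging to its group. The pure-Python sketch below illustrates that contract; the rows and the local `any_value` helper are hypothetical stand-ins, not DataChain's implementation.

```python
from collections import defaultdict

# Hypothetical rows: (category, path) pairs.
rows = [("cat", "a.jpg"), ("cat", "b.jpg"), ("dog", "c.jpg")]

# Group the paths by category, as partition_by would.
groups = defaultdict(list)
for category, path in rows:
    groups[category].append(path)


def any_value(values):
    """Return some element of a non-empty group; which one is unspecified."""
    return next(iter(values))


picked = {category: any_value(paths) for category, paths in groups.items()}

# Safe check: assert membership in the group, never a specific value.
for category, path in picked.items():
    assert path in groups[category]
```

A group with a single row has only one possible result, which is the one case where the value is predictable.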

avg

avg(col: Union[str, Column]) -> Func

Returns the AVG aggregate SQL function for the specified column.

The AVG function returns the average of a numeric column in a table. It calculates the mean of all values in the specified column.

Parameters:

  • col (str | Column) –

    The name of the column for which to calculate the average. Column can be specified as a string or a Column object.

Returns:

  • Func ( Func ) –

    A Func object that represents the AVG aggregate function.

Example
dc.group_by(
    avg_file_size=func.avg("file.size"),
    avg_signal_value=func.avg(dc.C("signal.value")),
    partition_by="signal.category",
)
Notes
  • The avg function should be used on numeric columns.
  • The result column will always be of type float.
Source code in datachain/func/aggregate.py
def avg(col: Union[str, Column]) -> Func:
    """
    Returns the AVG aggregate SQL function for the specified column.

    The AVG function returns the average of a numeric column in a table.
    It calculates the mean of all values in the specified column.

    Args:
        col (str | Column): The name of the column for which to calculate the average.
            Column can be specified as a string or a `Column` object.

    Returns:
        Func: A Func object that represents the AVG aggregate function.

    Example:
        ```py
        dc.group_by(
            avg_file_size=func.avg("file.size"),
            avg_signal_value=func.avg(dc.C("signal.value")),
            partition_by="signal.category",
        )
        ```

    Notes:
        - The `avg` function should be used on numeric columns.
        - The result column will always be of type float.
    """
    return Func("avg", inner=aggregate.avg, cols=[col], result_type=float)
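
The note that the result is always a float can be observed with plain SQL. The sketch below uses Python's built-in sqlite3 module and a hypothetical `files` table; it shows SQLite's AVG behaviour on an integer column and is only an analogy for what a DataChain backend does.

```python
import sqlite3

# Hypothetical table with an integer size column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (category TEXT, size INTEGER)")
conn.executemany(
    "INSERT INTO files VALUES (?, ?)",
    [("cat", 10), ("cat", 20), ("dog", 7)],
)

# AVG per group; SQLite returns a floating point value even for integer input.
averages = dict(
    conn.execute("SELECT category, AVG(size) FROM files GROUP BY category")
)
conn.close()

print(averages)
```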

bit_and

bit_and(*args: Union[str, Column, Func, int]) -> Func

Returns a function that computes the bitwise AND operation between two values.

Parameters:

  • args (str | Column | Func | int, default: () ) –

    Two values to compute the bitwise AND operation between. If a string is provided, it is assumed to be the name of the column. If a Column is provided, it is assumed to be a column. If a Func is provided, it is assumed to be a function returning an int. If an integer is provided, it is assumed to be a constant value.

Returns:

  • Func ( Func ) –

    A Func object that represents the bitwise AND function.

Example
dc.mutate(
    and1=func.bit_and("signal.value", 0x0F),
    and2=func.bit_and(dc.C("signal.value1"), "signal.value2"),
)
Notes
  • The result column will always be of type int.
Source code in datachain/func/numeric.py
def bit_and(*args: Union[str, Column, Func, int]) -> Func:
    """
    Returns a function that computes the bitwise AND operation between two values.

    Args:
        args (str | Column | Func | int): Two values to compute
            the bitwise AND operation between.
            If a string is provided, it is assumed to be the name of the column.
            If a Column is provided, it is assumed to be a column.
            If a Func is provided, it is assumed to be a function returning an int.
            If an integer is provided, it is assumed to be a constant value.

    Returns:
        Func: A `Func` object that represents the bitwise AND function.

    Example:
        ```py
        dc.mutate(
            and1=func.bit_and("signal.value", 0x0F),
            and2=func.bit_and(dc.C("signal.value1"), "signal.value2"),
        )
        ```

    Notes:
        - The result column will always be of type int.
    """
    cols, func_args = [], []
    for arg in args:
        if isinstance(arg, int):
            func_args.append(arg)
        else:
            cols.append(arg)

    if len(cols) + len(func_args) != 2:
        raise ValueError("bit_and() requires exactly two arguments")

    return Func(
        "bit_and",
        inner=numeric.bit_and,
        cols=cols,
        args=func_args,
        result_type=int,
    )

bit_hamming_distance

bit_hamming_distance(
    *args: Union[str, Column, Func, int]
) -> Func

Returns a function that computes the Hamming distance between two integers.

The Hamming distance is the number of positions at which the corresponding bits are different. This function returns the dissimilarity between the integers, where 0 indicates identical integers and values closer to the number of bits in the integer indicate higher dissimilarity.

Parameters:

  • args (str | Column | Func | int, default: () ) –

    Two integers to compute the Hamming distance between. If a string is provided, it is assumed to be the name of the column. If a Column is provided, it is assumed to be a column. If a Func is provided, it is assumed to be a function returning an int. If an int is provided, it is assumed to be an integer literal.

Returns:

  • Func ( Func ) –

    A Func object that represents the Hamming distance function.

Example
dc.mutate(
    hd1=func.bit_hamming_distance("signal.value1", "signal.value2"),
    hd2=func.bit_hamming_distance(dc.C("signal.value1"), 0x0F),
)
Notes
  • The result column will always be of type int.
Source code in datachain/func/numeric.py
def bit_hamming_distance(*args: Union[str, Column, Func, int]) -> Func:
    """
    Returns a function that computes the Hamming distance between two integers.

    The Hamming distance is the number of positions at which the corresponding bits
    are different. This function returns the dissimilarity between the integers,
    where 0 indicates identical integers and values closer to the number of bits
    in the integer indicate higher dissimilarity.

    Args:
        args (str | Column | Func | int): Two integers to compute
            the Hamming distance between.
            If a string is provided, it is assumed to be the name of the column.
            If a Column is provided, it is assumed to be a column.
            If a Func is provided, it is assumed to be a function returning an int.
            If an int is provided, it is assumed to be an integer literal.

    Returns:
        Func: A `Func` object that represents the Hamming distance function.

    Example:
        ```py
        dc.mutate(
            hd1=func.bit_hamming_distance("signal.value1", "signal.value2"),
            hd2=func.bit_hamming_distance(dc.C("signal.value1"), 0x0F),
        )
        ```

    Notes:
        - The result column will always be of type int.
    """
    cols, func_args = [], []
    for arg in args:
        if isinstance(arg, int):
            func_args.append(arg)
        else:
            cols.append(arg)

    if len(cols) + len(func_args) != 2:
        raise ValueError("bit_hamming_distance() requires exactly two arguments")

    return Func(
        "bit_hamming_distance",
        inner=numeric.bit_hamming_distance,
        cols=cols,
        args=func_args,
        result_type=int,
    )
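
As a minimal sketch of the definition above, the bit Hamming distance of two non-negative integers can be computed by XOR-ing them and counting the set bits. The helper below is a local illustration with the same name, not the function exported by `datachain.func`.

```python
def bit_hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two non-negative integers differ."""
    if a < 0 or b < 0:
        raise ValueError("this sketch only handles non-negative integers")
    # XOR leaves a 1 exactly where the two inputs disagree.
    return bin(a ^ b).count("1")


same = bit_hamming_distance(0b1010, 0b1010)
one_bit = bit_hamming_distance(0b1010, 0b1000)
all_low_bits = bit_hamming_distance(0xF0, 0x0F)

print(same, one_bit, all_low_bits)
```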

bit_or

bit_or(*args: Union[str, Column, Func, int]) -> Func

Returns a function that computes the bitwise OR operation between two values.

Parameters:

  • args (str | Column | Func | int, default: () ) –

    Two values to compute the bitwise OR operation between. If a string is provided, it is assumed to be the name of the column. If a Column is provided, it is assumed to be a column. If a Func is provided, it is assumed to be a function returning an int. If an integer is provided, it is assumed to be a constant value.

Returns:

  • Func ( Func ) –

    A Func object that represents the bitwise OR function.

Example
dc.mutate(
    or1=func.bit_or("signal.value", 0x0F),
    or2=func.bit_or(dc.C("signal.value1"), "signal.value2"),
)
Notes
  • The result column will always be of type int.
Source code in datachain/func/numeric.py
def bit_or(*args: Union[str, Column, Func, int]) -> Func:
    """
    Returns a function that computes the bitwise OR operation between two values.

    Args:
        args (str | Column | Func | int): Two values to compute
            the bitwise OR operation between.
            If a string is provided, it is assumed to be the name of the column.
            If a Column is provided, it is assumed to be a column.
            If a Func is provided, it is assumed to be a function returning an int.
            If an integer is provided, it is assumed to be a constant value.

    Returns:
        Func: A `Func` object that represents the bitwise OR function.

    Example:
        ```py
        dc.mutate(
            or1=func.bit_or("signal.value", 0x0F),
            or2=func.bit_or(dc.C("signal.value1"), "signal.value2"),
        )
        ```

    Notes:
        - The result column will always be of type int.
    """
    cols, func_args = [], []
    for arg in args:
        if isinstance(arg, int):
            func_args.append(arg)
        else:
            cols.append(arg)

    if len(cols) + len(func_args) != 2:
        raise ValueError("bit_or() requires exactly two arguments")

    return Func(
        "bit_or",
        inner=numeric.bit_or,
        cols=cols,
        args=func_args,
        result_type=int,
    )

bit_xor

bit_xor(*args: Union[str, Column, Func, int]) -> Func

Returns a function that computes the bitwise XOR operation between two values.

Parameters:

  • args (str | Column | Func | int, default: () ) –

    Two values to compute the bitwise XOR operation between. If a string is provided, it is assumed to be the name of the column. If a Column is provided, it is assumed to be a column. If a Func is provided, it is assumed to be a function returning an int. If an integer is provided, it is assumed to be a constant value.

Returns:

  • Func ( Func ) –

    A Func object that represents the bitwise XOR function.

Example
dc.mutate(
    xor1=func.bit_xor("signal.value", 0x0F),
    xor2=func.bit_xor(dc.C("signal.value1"), "signal.value2"),
)
Notes
  • The result column will always be of type int.
Source code in datachain/func/numeric.py
def bit_xor(*args: Union[str, Column, Func, int]) -> Func:
    """
    Returns a function that computes the bitwise XOR operation between two values.

    Args:
        args (str | Column | Func | int): Two values to compute
            the bitwise XOR operation between.
            If a string is provided, it is assumed to be the name of the column.
            If a Column is provided, it is assumed to be a column.
            If a Func is provided, it is assumed to be a function returning an int.
            If an integer is provided, it is assumed to be a constant value.

    Returns:
        Func: A `Func` object that represents the bitwise XOR function.

    Example:
        ```py
        dc.mutate(
            xor1=func.bit_xor("signal.value", 0x0F),
            xor2=func.bit_xor(dc.C("signal.value1"), "signal.value2"),
        )
        ```

    Notes:
        - The result column will always be of type int.
    """
    cols, func_args = [], []
    for arg in args:
        if isinstance(arg, int):
            func_args.append(arg)
        else:
            cols.append(arg)

    if len(cols) + len(func_args) != 2:
        raise ValueError("bit_xor() requires exactly two arguments")

    return Func(
        "bit_xor",
        inner=numeric.bit_xor,
        cols=cols,
        args=func_args,
        result_type=int,
    )
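
The three bitwise functions (`bit_and`, `bit_or`, `bit_xor`) correspond to Python's `&`, `|`, and `^` operators on integers. The sketch below applies the `0x0F` mask from the examples to one hypothetical value to show what each result column would hold for that row.

```python
value = 0b1010_1100  # hypothetical stand-in for one "signal.value" entry (172)
mask = 0x0F

and_result = value & mask  # keeps only the low four bits
or_result = value | mask   # forces the low four bits on
xor_result = value ^ mask  # flips the low four bits

print(and_result, or_result, xor_result)
```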

byte_hamming_distance

byte_hamming_distance(*args: ColT) -> Func

Computes the Hamming distance between two strings.

The Hamming distance is the number of positions at which the corresponding characters are different. This function returns the dissimilarity between the strings, where 0 indicates identical strings and values closer to the length of the strings indicate higher dissimilarity.

Parameters:

  • args (str | Column | Func | literal, default: () ) –

    Two strings to compute the Hamming distance between. If a string is provided, it is assumed to be the name of the column. If a Column is provided, it is assumed to be a column in the dataset. If a Func is provided, it is assumed to be a function returning a string. If a literal is provided, it is assumed to be a string literal.

Returns:

  • Func ( Func ) –

    A Func object that represents the Hamming distance function.

Example
dc.mutate(
    hd1=func.byte_hamming_distance("file.phash", literal("hello")),
    hd2=func.byte_hamming_distance(dc.C("file.phash"), "hello"),
    hd3=func.byte_hamming_distance(
        dc.func.literal("hi"),
        dc.func.literal("hello"),
    ),
)
Notes
  • The result column will always be of type int.
Source code in datachain/func/string.py
def byte_hamming_distance(*args: ColT) -> Func:
    """
    Computes the Hamming distance between two strings.

    The Hamming distance is the number of positions at which the corresponding
    characters are different. This function returns the dissimilarity between
    the strings, where 0 indicates identical strings and values closer to the length
    of the strings indicate higher dissimilarity.

    Args:
        args (str | Column | Func | literal): Two strings to compute
            the Hamming distance between.
            If a string is provided, it is assumed to be the name of the column.
            If a Column is provided, it is assumed to be a column in the dataset.
            If a Func is provided, it is assumed to be a function returning a string.
            If a literal is provided, it is assumed to be a string literal.

    Returns:
        Func: A `Func` object that represents the Hamming distance function.

    Example:
        ```py
        dc.mutate(
            hd1=func.byte_hamming_distance("file.phash", literal("hello")),
            hd2=func.byte_hamming_distance(dc.C("file.phash"), "hello"),
            hd3=func.byte_hamming_distance(
                dc.func.literal("hi"),
                dc.func.literal("hello"),
            ),
        )
        ```

    Notes:
        - The result column will always be of type int.
    """
    cols, func_args = [], []
    for arg in args:
        if get_origin(arg) is literal:
            func_args.append(arg)
        else:
            cols.append(arg)

    if len(cols) + len(func_args) != 2:
        raise ValueError("byte_hamming_distance() requires exactly two arguments")

    return Func(
        "byte_hamming_distance",
        inner=string.byte_hamming_distance,
        cols=cols,
        args=func_args,
        result_type=int,
    )
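
The definition above can be sketched in pure Python as follows. The treatment of strings with different lengths (each surplus character counts as one difference) is an assumption made for this sketch; the documentation above does not state how the backend handles that case.

```python
def byte_hamming_distance(a: str, b: str) -> int:
    """Count differing positions; surplus characters count as differences.

    The handling of unequal lengths is an assumption made for this sketch.
    """
    overlap = sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))
    return overlap + abs(len(a) - len(b))


identical = byte_hamming_distance("hello", "hello")
one_off = byte_hamming_distance("hello", "hallo")
unequal = byte_hamming_distance("hi", "hello")

print(identical, one_off, unequal)
```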

case

case(
    *args: tuple[Union[ColumnElement, Func, bool], CaseT],
    else_: Optional[CaseT] = None
) -> Func

Returns a case expression that evaluates a list of conditions and returns corresponding results. Results can be Python primitives (string, numbers, booleans), nested functions (including case function), or columns.

Parameters:

  • args (tuple[ColumnElement | Func | bool, CaseT], default: () ) –

    Tuples of (condition, value) pairs. Each condition is evaluated in order, and the corresponding value is returned for the first condition that evaluates to True.

  • else_ (CaseT, default: None ) –

    Value to return if no conditions are satisfied. If omitted and no conditions are satisfied, the result will be None (NULL in DB).

Returns:

  • Func ( Func ) –

    A Func object that represents the case function.

Example
dc.mutate(
    res=func.case((dc.C("num") > 0, "P"), (dc.C("num") < 0, "N"), else_="Z"),
)
Notes
  • The result type is inferred from the values provided in the case statements.
Source code in datachain/func/conditional.py
def case(
    *args: tuple[Union[ColumnElement, Func, bool], CaseT], else_: Optional[CaseT] = None
) -> Func:
    """
    Returns a case expression that evaluates a list of conditions and returns
    corresponding results. Results can be Python primitives (string, numbers, booleans),
    nested functions (including case function), or columns.

    Args:
        args (tuple[ColumnElement | Func | bool, CaseT]): Tuples of (condition, value)
            pairs. Each condition is evaluated in order, and the corresponding value
            is returned for the first condition that evaluates to True.
        else_ (CaseT, optional): Value to return if no conditions are satisfied.
            If omitted and no conditions are satisfied, the result will be None
            (NULL in DB).

    Returns:
        Func: A `Func` object that represents the case function.

    Example:
        ```py
        dc.mutate(
            res=func.case((dc.C("num") > 0, "P"), (dc.C("num") < 0, "N"), else_="Z"),
        )
        ```

    Notes:
        - The result type is inferred from the values provided in the case statements.
    """
    supported_types = [int, float, complex, str, bool]

    def _get_type(val):
        from enum import Enum

        if isinstance(val, Func):
            # nested functions
            return val.result_type
        if isinstance(val, Column):
            # at this point we cannot know what is the type of a column
            return None
        if isinstance(val, Enum):
            return type(val.value)
        return type(val)

    if not args:
        raise DataChainParamsError("Missing statements")

    type_ = _get_type(else_) if else_ is not None else None

    for arg in args:
        arg_type = _get_type(arg[1])
        if arg_type is None:
            # we couldn't figure out the type of case value
            continue
        if type_ and arg_type != type_:
            raise DataChainParamsError(
                f"Statement values must be of the same type, got {type_} and {arg_type}"
            )
        type_ = arg_type

    if type_ is not None and type_ not in supported_types:
        raise DataChainParamsError(
            f"Only python literals ({supported_types}) are supported for values"
        )

    kwargs = {"else_": else_}

    return Func("case", inner=sql_case, cols=args, kwargs=kwargs, result_type=type_)
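
The first-match semantics described above can be sketched in plain Python. `evaluate_case` and `sign_label` are hypothetical helpers that mimic the evaluation order on already-computed booleans; the real function builds a SQL CASE expression instead of evaluating anything in Python.

```python
from typing import Any, Optional


def evaluate_case(*branches: tuple[bool, Any], else_: Optional[Any] = None) -> Any:
    """Return the value of the first branch whose condition is true."""
    for condition, value in branches:
        if condition:
            return value
    # No branch matched: fall back to else_ (None when omitted).
    return else_


def sign_label(num: int) -> str:
    return evaluate_case((num > 0, "P"), (num < 0, "N"), else_="Z")


labels = [sign_label(n) for n in (5, -3, 0)]
no_match = evaluate_case((False, "unused"))
first_wins = evaluate_case((True, "first"), (True, "second"))

print(labels, no_match, first_wins)
```

When several conditions are true, only the earliest one contributes, so the order of the tuples matters.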

collect

collect(col: Union[str, Column]) -> Func

Returns the COLLECT aggregate SQL function for the given column name.

The COLLECT function gathers all values from the specified column into an array or similar structure. It is useful for combining values from a column into a collection, often for further processing or aggregation.

Parameters:

  • col (str | Column) –

    The name of the column from which to collect values. Column can be specified as a string or a Column object.

Returns:

  • Func ( Func ) –

    A Func object that represents the COLLECT aggregate function.

Example
dc.group_by(
    signals=func.collect("signal"),
    file_paths=func.collect(dc.C("file.path")),
    partition_by="signal.category",
)
Notes
  • The collect function can be used with numeric and string columns.
  • The result column will have an array type.
Source code in datachain/func/aggregate.py
def collect(col: Union[str, Column]) -> Func:
    """
    Returns the COLLECT aggregate SQL function for the given column name.

    The COLLECT function gathers all values from the specified column
    into an array or similar structure. It is useful for combining values from a column
    into a collection, often for further processing or aggregation.

    Args:
        col (str | Column): The name of the column from which to collect values.
            Column can be specified as a string or a `Column` object.

    Returns:
        Func: A Func object that represents the COLLECT aggregate function.

    Example:
        ```py
        dc.group_by(
            signals=func.collect("signal"),
            file_paths=func.collect(dc.C("file.path")),
            partition_by="signal.category",
        )
        ```

    Notes:
        - The `collect` function can be used with numeric and string columns.
        - The result column will have an array type.
    """
    return Func("collect", inner=aggregate.collect, cols=[col], is_array=True)

concat

concat(col: Union[str, Column], separator='') -> Func

Returns the CONCAT aggregate SQL function for the given column name.

The CONCAT function concatenates values from the specified column into a single string. It is useful for merging text values from multiple rows into a single combined value.

Parameters:

  • col (str | Column) –

    The name of the column from which to concatenate values. Column can be specified as a string or a Column object.

  • separator (str, default: '' ) –

    The separator to use between concatenated values. Defaults to an empty string.

Returns:

  • Func ( Func ) –

    A Func object that represents the CONCAT aggregate function.

Example
dc.group_by(
    files=func.concat("file.path", separator=", "),
    signals=func.concat(dc.C("signal.name"), separator=" | "),
    partition_by="signal.category",
)
Notes
  • The concat function can be used with string columns.
  • The result column will have a string type.
Source code in datachain/func/aggregate.py
def concat(col: Union[str, Column], separator="") -> Func:
    """
    Returns the CONCAT aggregate SQL function for the given column name.

    The CONCAT function concatenates values from the specified column
    into a single string. It is useful for merging text values from multiple rows
    into a single combined value.

    Args:
        col (str | Column): The name of the column from which to concatenate values.
            Column can be specified as a string or a `Column` object.
        separator (str, optional): The separator to use between concatenated values.
            Defaults to an empty string.

    Returns:
        Func: A Func object that represents the CONCAT aggregate function.

    Example:
        ```py
        dc.group_by(
            files=func.concat("file.path", separator=", "),
            signals=func.concat(dc.C("signal.name"), separator=" | "),
            partition_by="signal.category",
        )
        ```

    Notes:
        - The `concat` function can be used with string columns.
        - The result column will have a string type.
    """

    def inner(arg):
        return aggregate.group_concat(arg, separator)

    return Func("concat", inner=inner, cols=[col], result_type=str)
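
The source above delegates to a `group_concat` aggregate. As a rough analogy, the sketch below runs SQLite's built-in `group_concat` through Python's sqlite3 module on a hypothetical `files` table. The order of the joined elements is not guaranteed in this sketch, so the check compares sets rather than exact strings.

```python
import sqlite3

# Hypothetical table of file paths grouped by category.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (category TEXT, path TEXT)")
conn.executemany(
    "INSERT INTO files VALUES (?, ?)",
    [("cat", "a.jpg"), ("cat", "b.jpg"), ("dog", "c.jpg")],
)

# One concatenated string per category, joined with ", ".
joined = dict(
    conn.execute(
        "SELECT category, group_concat(path, ', ') FROM files GROUP BY category"
    )
)
conn.close()

# Element order within a group is unspecified, so compare as a set.
cat_paths = set(joined["cat"].split(", "))
print(joined)
```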

contains

contains(
    arr: Union[str, Column, Func, Sequence], elem: Any
) -> Func

Checks whether the array contains the specified element.

Parameters:

  • arr (str | Column | Func | Sequence) –

    Array to check for the element. If a string is provided, it is assumed to be the name of the array column. If a Column is provided, it is assumed to be an array column. If a Func is provided, it is assumed to be a function returning an array. If a sequence is provided, it is assumed to be an array of values.

  • elem (Any) –

    Element to check for in the array.

Returns:

  • Func ( Func ) –

    A Func object that represents the contains function. Result of the function will be 1 if the element is present in the array, and 0 otherwise.

Example
dc.mutate(
    contains1=func.array.contains("signal.values", 3),
    contains2=func.array.contains(dc.C("signal.values"), 7),
    contains3=func.array.contains([1, 2, 3, 4, 5], 7),
)
Notes
  • The result column will always be of type int.
Source code in datachain/func/array.py
def contains(arr: Union[str, Column, Func, Sequence], elem: Any) -> Func:
    """
    Checks whether the array contains the specified element.

    Args:
        arr (str | Column | Func | Sequence): Array to check for the element.
            If a string is provided, it is assumed to be the name of the array column.
            If a Column is provided, it is assumed to be an array column.
            If a Func is provided, it is assumed to be a function returning an array.
            If a sequence is provided, it is assumed to be an array of values.
        elem (Any): Element to check for in the array.

    Returns:
        Func: A `Func` object that represents the contains function. Result of the
            function will be `1` if the element is present in the array,
            and `0` otherwise.

    Example:
        ```py
        dc.mutate(
            contains1=func.array.contains("signal.values", 3),
            contains2=func.array.contains(dc.C("signal.values"), 7),
            contains3=func.array.contains([1, 2, 3, 4, 5], 7),
        )
        ```

    Notes:
        - The result column will always be of type int.
    """

    def inner(arg):
        is_json = type(elem) in [list, dict]
        return array.contains(arg, elem, is_json)

    if isinstance(arr, (str, Column, Func)):
        cols = [arr]
        args = None
    else:
        cols = None
        args = [arr]

    return Func("contains", inner=inner, cols=cols, args=args, result_type=int)

cosine_distance

cosine_distance(
    *args: Union[str, Column, Func, Sequence]
) -> Func

Returns the cosine distance between two vectors.

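The cosine distance is derived from the cosine similarity, which measures the angle between two vectors. This function returns the dissimilarity between the vectors: 0 indicates vectors that point in the same direction, and larger values indicate higher dissimilarity. Under the usual definition of 1 minus the cosine similarity, orthogonal vectors give 1 and opposite vectors give 2.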

Parameters:

  • args (str | Column | Func | Sequence, default: () ) –

    Two vectors to compute the cosine distance between. If a string is provided, it is assumed to be the name of the column vector. If a Column is provided, it is assumed to be an array column. If a Func is provided, it is assumed to be a function returning an array. If a sequence is provided, it is assumed to be a vector of values.

Returns:

  • Func ( Func ) –

    A Func object that represents the cosine_distance function.

Example
target_embedding = [0.1, 0.2, 0.3]
dc.mutate(
    cos_dist1=func.cosine_distance("embedding", target_embedding),
    cos_dist2=func.cosine_distance(dc.C("emb1"), "emb2"),
    cos_dist3=func.cosine_distance(target_embedding, [0.4, 0.5, 0.6]),
)
Notes
  • Ensure both vectors have the same number of elements.
  • The result column will always be of type float.
Source code in datachain/func/array.py
def cosine_distance(*args: Union[str, Column, Func, Sequence]) -> Func:
    """
    Returns the cosine distance between two vectors.

    The cosine distance is derived from the cosine similarity, which measures the angle
    between two vectors. This function returns the dissimilarity between the vectors,
    where 0 indicates identical vectors and values closer to 1
    indicate higher dissimilarity.

    Args:
        args (str | Column | Func | Sequence): Two vectors to compute the cosine
            distance between.
            If a string is provided, it is assumed to be the name of the column vector.
            If a Column is provided, it is assumed to be an array column.
            If a Func is provided, it is assumed to be a function returning an array.
            If a sequence is provided, it is assumed to be a vector of values.

    Returns:
        Func: A `Func` object that represents the cosine_distance function.

    Example:
        ```py
        target_embedding = [0.1, 0.2, 0.3]
        dc.mutate(
            cos_dist1=func.cosine_distance("embedding", target_embedding),
            cos_dist2=func.cosine_distance(dc.C("emb1"), "emb2"),
            cos_dist3=func.cosine_distance(target_embedding, [0.4, 0.5, 0.6]),
        )
        ```

    Notes:
        - Ensure both vectors have the same number of elements.
        - The result column will always be of type float.
    """
    cols, func_args = [], []
    for arg in args:
        if isinstance(arg, (str, Column, Func)):
            cols.append(arg)
        else:
            func_args.append(list(arg))

    if len(cols) + len(func_args) != 2:
        raise ValueError("cosine_distance() requires exactly two arguments")
    if not cols and len(func_args[0]) != len(func_args[1]):
        raise ValueError("cosine_distance() requires vectors of the same length")

    return Func(
        "cosine_distance",
        inner=array.cosine_distance,
        cols=cols,
        args=func_args,
        result_type=float,
    )

count

count(col: Optional[Union[str, Column]] = None) -> Func

Returns a COUNT aggregate SQL function for the specified column.

The COUNT function returns the number of rows, optionally filtered by a specific column.

Parameters:

  • col (str | Column, default: None ) –

    The column to count. If omitted, counts all rows. The column can be specified as a string or a Column object.

Returns:

  • Func ( Func ) –

    A Func object representing the COUNT aggregate function.

Example
dc.group_by(
    count1=func.count(),
    count2=func.count("signal.id"),
    count3=func.count(dc.C("signal.category")),
    partition_by="signal.category",
)
Notes
  • The result column will always have an integer type.
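The source below passes the column to SQLAlchemy's COUNT. In standard SQL, counting without a column counts every row, while counting a column skips rows where that column is NULL. The plain-Python sketch below mimics that behaviour per group, assuming it carries over unchanged; `count_by` and the sample rows are hypothetical.

```python
from typing import Any, Dict, List, Optional

rows = [
    {"category": "cat", "id": 1},
    {"category": "cat", "id": None},
    {"category": "dog", "id": 3},
]


def count_by(
    rows: List[Dict[str, Any]], key: str, col: Optional[str] = None
) -> Dict[Any, int]:
    """Count rows per group; with `col`, skip rows where `col` is None."""
    counts: Dict[Any, int] = {}
    for row in rows:
        counts.setdefault(row[key], 0)
        if col is None or row[col] is not None:
            counts[row[key]] += 1
    return counts


all_rows = count_by(rows, "category")           # like count()
non_null_ids = count_by(rows, "category", "id")  # like count("id")
```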
Source code in datachain/func/aggregate.py
def count(col: Optional[Union[str, Column]] = None) -> Func:
    """
    Returns a COUNT aggregate SQL function for the specified column.

    The COUNT function returns the number of rows, optionally filtered
    by a specific column.

    Args:
        col (str | Column, optional): The column to count.
            If omitted, counts all rows.
            The column can be specified as a string or a `Column` object.

    Returns:
        Func: A `Func` object representing the COUNT aggregate function.

    Example:
        ```py
        dc.group_by(
            count1=func.count(),
            count2=func.count("signal.id"),
            count3=func.count(dc.C("signal.category")),
            partition_by="signal.category",
        )
        ```

    Notes:
        - The result column will always have an integer type.
    """
    return Func(
        "count",
        inner=sa_func.count,
        cols=[col] if col is not None else None,
        result_type=int,
    )

dense_rank

dense_rank() -> Func

Returns the DENSE_RANK window function for SQL queries.

The DENSE_RANK function assigns a rank to each row within a partition of a result set, without gaps in the ranking for ties. Rows with equal values receive the same rank, but the next rank is assigned consecutively (i.e., if two rows are ranked 1, the next row will be ranked 2).

Returns:

  • Func ( Func ) –

    A Func object that represents the DENSE_RANK window function.

Example
window = func.window(partition_by="signal.category", order_by="created_at")
dc.mutate(
    dense_rank=func.dense_rank().over(window),
)
Notes
  • The result column will always be of type int.
  • The DENSE_RANK function differs from RANK in that it does not leave gaps in the ranking for tied values.
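The difference between the two rankings is easiest to see on a small list with a tie. The following is a plain-Python sketch of the semantics described above, not of how the database computes them; both helper names are hypothetical.

```python
from typing import List


def rank_py(values: List[int]) -> List[int]:
    """RANK-style: tied values share a rank and the following ranks are skipped."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def dense_rank_py(values: List[int]) -> List[int]:
    """DENSE_RANK-style: tied values share a rank and no ranks are skipped."""
    distinct = sorted(set(values))
    return [distinct.index(v) + 1 for v in values]


scores = [10, 10, 20]
with_gaps = rank_py(scores)           # [1, 1, 3]
without_gaps = dense_rank_py(scores)  # [1, 1, 2]
```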
Source code in datachain/func/aggregate.py
def dense_rank() -> Func:
    """
    Returns the DENSE_RANK window function for SQL queries.

    The DENSE_RANK function assigns a rank to each row within a partition
    of a result set, without gaps in the ranking for ties. Rows with equal values
    receive the same rank, but the next rank is assigned consecutively
    (i.e., if two rows are ranked 1, the next row will be ranked 2).

    Returns:
        Func: A Func object that represents the DENSE_RANK window function.

    Example:
        ```py
        window = func.window(partition_by="signal.category", order_by="created_at")
        dc.mutate(
            dense_rank=func.dense_rank().over(window),
        )
        ```

    Notes:
        - The result column will always be of type int.
        - The DENSE_RANK function differs from RANK in that it does not leave gaps
          in the ranking for tied values.
    """
    return Func("dense_rank", inner=sa_func.dense_rank, result_type=int, is_window=True)

euclidean_distance

euclidean_distance(
    *args: Union[str, Column, Func, Sequence]
) -> Func

Returns the Euclidean distance between two vectors.

The Euclidean distance is the straight-line distance between two points in Euclidean space. This function returns the distance between the two vectors.

Parameters:

  • args (str | Column | Func | Sequence, default: () ) –

    Two vectors to compute the Euclidean distance between. If a string is provided, it is assumed to be the name of the column vector. If a Column is provided, it is assumed to be an array column. If a Func is provided, it is assumed to be a function returning an array. If a sequence is provided, it is assumed to be a vector of values.

Returns:

  • Func ( Func ) –

    A Func object that represents the euclidean_distance function.

Example
target_embedding = [0.1, 0.2, 0.3]
dc.mutate(
    eu_dist1=func.euclidean_distance("embedding", target_embedding),
    eu_dist2=func.euclidean_distance(dc.C("emb1"), "emb2"),
    eu_dist3=func.euclidean_distance(target_embedding, [0.4, 0.5, 0.6]),
)
Notes
  • Ensure both vectors have the same number of elements.
  • The result column will always be of type float.
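As a reference for the value being computed, here is a plain-Python sketch of the straight-line distance. The helper name `euclidean_distance_py` is hypothetical; in datachain the computation is delegated to the database backend.

```python
import math
from typing import Sequence


def euclidean_distance_py(a: Sequence[float], b: Sequence[float]) -> float:
    """Hypothetical reference helper: square root of summed squared differences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of elements")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


distance = euclidean_distance_py([0.0, 0.0], [3.0, 4.0])  # 3-4-5 triangle
```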
Source code in datachain/func/array.py
def euclidean_distance(*args: Union[str, Column, Func, Sequence]) -> Func:
    """
    Returns the Euclidean distance between two vectors.

    The Euclidean distance is the straight-line distance between two points
    in Euclidean space. This function returns the distance between the two vectors.

    Args:
        args (str | Column | Func | Sequence): Two vectors to compute the Euclidean
            distance between.
            If a string is provided, it is assumed to be the name of the column vector.
            If a Column is provided, it is assumed to be an array column.
            If a Func is provided, it is assumed to be a function returning an array.
            If a sequence is provided, it is assumed to be a vector of values.

    Returns:
        Func: A `Func` object that represents the euclidean_distance function.

    Example:
        ```py
        target_embedding = [0.1, 0.2, 0.3]
        dc.mutate(
            eu_dist1=func.euclidean_distance("embedding", target_embedding),
            eu_dist2=func.euclidean_distance(dc.C("emb1"), "emb2"),
            eu_dist3=func.euclidean_distance(target_embedding, [0.4, 0.5, 0.6]),
        )
        ```

    Notes:
        - Ensure both vectors have the same number of elements.
        - The result column will always be of type float.
    """
    cols, func_args = [], []
    for arg in args:
        if isinstance(arg, (str, Column, Func)):
            cols.append(arg)
        else:
            func_args.append(list(arg))

    if len(cols) + len(func_args) != 2:
        raise ValueError("euclidean_distance() requires exactly two arguments")
    if not cols and len(func_args[0]) != len(func_args[1]):
        raise ValueError("euclidean_distance() requires vectors of the same length")

    return Func(
        "euclidean_distance",
        inner=array.euclidean_distance,
        cols=cols,
        args=func_args,
        result_type=float,
    )

file_ext

file_ext(col: ColT) -> Func

Returns the extension of the given path.

Parameters:

  • col (str | Column | Func | literal) –

    String to compute the file extension of. If a string is provided, it is assumed to be the name of the column. If a Column is provided, it is assumed to be a column object. If a Func is provided, it is assumed to be a function returning a string. If a literal is provided, it is assumed to be a string literal.

Returns:

  • Func ( Func ) –

    A Func object that represents the file extension function.

Example
dc.mutate(
    filestem1=func.path.file_ext("file.path"),
    filestem2=func.path.file_ext(dc.C("file.path")),
    filestem3=func.path.file_ext(dc.func.literal("/path/to/file.txt")),
)
Note
  • The result column will always be of type string.
Source code in datachain/func/path.py
def file_ext(col: ColT) -> Func:
    """
    Returns the extension of the given path.

    Args:
        col (str | Column | Func | literal): String to compute the file extension of.
            If a string is provided, it is assumed to be the name of the column.
            If a Column is provided, it is assumed to be a column object.
            If a Func is provided, it is assumed to be a function returning a string.
            If a literal is provided, it is assumed to be a string literal.

    Returns:
        Func: A `Func` object that represents the file extension function.

    Example:
        ```py
        dc.mutate(
            filestem1=func.path.file_ext("file.path"),
            filestem2=func.path.file_ext(dc.C("file.path")),
            filestem3=func.path.file_ext(dc.func.literal("/path/to/file.txt")),
        )
        ```

    Note:
        - The result column will always be of type string.
    """

    return Func("file_ext", inner=path.file_ext, cols=[col], result_type=str)

file_stem

file_stem(col: ColT) -> Func

Returns the path without the extension.

Parameters:

  • col (str | Column | Func | literal) –

    String to compute the file stem of. If a string is provided, it is assumed to be the name of the column. If a Column is provided, it is assumed to be a column object. If a Func is provided, it is assumed to be a function returning a string. If a literal is provided, it is assumed to be a string literal.

Returns:

  • Func ( Func ) –

    A Func object that represents the file stem function.

Example
dc.mutate(
    filestem1=func.path.file_stem("file.path"),
    filestem2=func.path.file_stem(dc.C("file.path")),
    filestem3=func.path.file_stem(dc.func.literal("/path/to/file.txt")),
)
Note
  • The result column will always be of type string.
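The split between stem and extension resembles what Python's `posixpath.splitext` does. The sketch below is a rough standard-library analogue, not datachain's implementation: `split_stem_ext` is a hypothetical helper, and dropping the leading dot from the extension is an assumption, since this page does not say whether `file_ext` keeps it.

```python
import posixpath
from typing import Tuple


def split_stem_ext(path: str) -> Tuple[str, str]:
    """Split a path into (path without extension, extension without the dot)."""
    stem, ext = posixpath.splitext(path)
    # Assumption: the extension is reported without its leading dot.
    return stem, ext.lstrip(".")


stem, ext = split_stem_ext("/path/to/file.txt")
nested_stem, nested_ext = split_stem_ext("archive.tar.gz")
```

Only the last suffix counts as the extension in this sketch, so `archive.tar.gz` splits into `archive.tar` and `gz`.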
Source code in datachain/func/path.py
def file_stem(col: ColT) -> Func:
    """
    Returns the path without the extension.

    Args:
        col (str | Column | Func | literal): String to compute the file stem of.
            If a string is provided, it is assumed to be the name of the column.
            If a Column is provided, it is assumed to be a column object.
            If a Func is provided, it is assumed to be a function returning a string.
            If a literal is provided, it is assumed to be a string literal.

    Returns:
        Func: A `Func` object that represents the file stem function.

    Example:
        ```py
        dc.mutate(
            filestem1=func.path.file_stem("file.path"),
            filestem2=func.path.file_stem(dc.C("file.path")),
            filestem3=func.path.file_stem(dc.func.literal("/path/to/file.txt")),
        )
        ```

    Note:
        - The result column will always be of type string.
    """

    return Func("file_stem", inner=path.file_stem, cols=[col], result_type=str)

first

first(col: Union[str, Column]) -> Func

Returns the FIRST_VALUE window function for SQL queries.

The FIRST_VALUE function returns the first value in an ordered set of values within a partition. The first value is determined by the specified order and can be useful for retrieving the leading value in a group of rows.

Parameters:

  • col (str | Column) –

    The name of the column from which to retrieve the first value. Column can be specified as a string or a Column object.

Returns:

  • Func ( Func ) –

    A Func object that represents the FIRST_VALUE window function.

Example
window = func.window(partition_by="signal.category", order_by="created_at")
dc.mutate(
    first_file=func.first("file.path").over(window),
    first_signal=func.first(dc.C("signal.value")).over(window),
)
Note
  • The result of first_value will always reflect the value of the first row in the specified order.
  • The result column will have the same type as the input column.
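The effect of FIRST_VALUE over a partitioned, ordered window can be sketched in plain Python: every row in a partition receives the value taken from that partition's first row. The sample rows and the helper `with_first_value` are hypothetical and only illustrate the semantics described above.

```python
from itertools import groupby
from operator import itemgetter
from typing import Any, Dict, List

rows = [
    {"category": "a", "created_at": 2, "path": "a2.txt"},
    {"category": "a", "created_at": 1, "path": "a1.txt"},
    {"category": "b", "created_at": 5, "path": "b5.txt"},
]


def with_first_value(
    rows: List[Dict[str, Any]], partition_by: str, order_by: str, col: str, out: str
) -> List[Dict[str, Any]]:
    """Attach to each row the `col` value of the first row in its partition."""
    result = []
    ordered = sorted(rows, key=itemgetter(partition_by, order_by))
    for _, group in groupby(ordered, key=itemgetter(partition_by)):
        members = list(group)
        first = members[0][col]  # first row according to `order_by`
        result.extend({**row, out: first} for row in members)
    return result


annotated = with_first_value(rows, "category", "created_at", "path", "first_file")
first_files = [row["first_file"] for row in annotated]
```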
Source code in datachain/func/aggregate.py
def first(col: Union[str, Column]) -> Func:
    """
    Returns the FIRST_VALUE window function for SQL queries.

    The FIRST_VALUE function returns the first value in an ordered set of values
    within a partition. The first value is determined by the specified order
    and can be useful for retrieving the leading value in a group of rows.

    Args:
        col (str | Column): The name of the column from which to retrieve
            the first value.
            Column can be specified as a string or a `Column` object.

    Returns:
        Func: A Func object that represents the FIRST_VALUE window function.

    Example:
        ```py
        window = func.window(partition_by="signal.category", order_by="created_at")
        dc.mutate(
            first_file=func.first("file.path").over(window),
            first_signal=func.first(dc.C("signal.value")).over(window),
        )
        ```

    Note:
        - The result of `first_value` will always reflect the value of the first row
          in the specified order.
        - The result column will have the same type as the input column.
    """
    return Func("first", inner=sa_func.first_value, cols=[col], is_window=True)

greatest

greatest(*args: Union[str, Column, Func, float]) -> Func

Returns the greatest (largest) value from the given input values.

Parameters:

  • args (str | Column | Func | int | float, default: () ) –

    The values to compare. If a string is provided, it is assumed to be the name of the column. If a Column is provided, it is assumed to be a column in the dataset. If a Func is provided, it is assumed to be a function returning a value. If an int or float is provided, it is assumed to be a literal.

Returns:

  • Func ( Func ) –

    A Func object that represents the greatest function.

Example
dc.mutate(
    greatest=func.greatest(dc.C("signal.value"), "signal.value2", 0.5, 1.0),
)
Notes
  • The result column will always be of the same type as the input columns.
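Unlike the aggregate max, which reduces one column across many rows, greatest compares several values within each row. A plain-Python sketch of that distinction, using hypothetical sample rows:

```python
rows = [
    {"value": 0.2, "value2": 0.9},
    {"value": 0.7, "value2": 0.1},
]

# Row-wise: one result per row, comparing two columns and the literal 0.5.
greatest_per_row = [max(row["value"], row["value2"], 0.5) for row in rows]

# Aggregate: one result for the whole column.
max_of_column = max(row["value"] for row in rows)
```

The same contrast applies to least and min.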
Source code in datachain/func/conditional.py
def greatest(*args: Union[str, Column, Func, float]) -> Func:
    """
    Returns the greatest (largest) value from the given input values.

    Args:
        args (str | Column | Func | int | float): The values to compare.
            If a string is provided, it is assumed to be the name of the column.
            If a Column is provided, it is assumed to be a column in the dataset.
            If a Func is provided, it is assumed to be a function returning a value.
            If an int or float is provided, it is assumed to be a literal.

    Returns:
        Func: A `Func` object that represents the greatest function.

    Example:
        ```py
        dc.mutate(
            greatest=func.greatest(dc.C("signal.value"), "signal.value2", 0.5, 1.0),
        )
        ```

    Notes:
        - The result column will always be of the same type as the input columns.
    """
    cols, func_args = [], []

    for arg in args:
        if isinstance(arg, (str, Column, Func)):
            cols.append(arg)
        else:
            func_args.append(arg)

    return Func(
        "greatest",
        inner=conditional.greatest,
        cols=cols,
        args=func_args,
        result_type=int,
    )

ifelse

ifelse(
    condition: Union[ColumnElement, Func],
    if_val: CaseT,
    else_val: CaseT,
) -> Func

Returns an if-else expression that evaluates a condition and returns one of two values based on the result. Values can be Python primitives (string, numbers, booleans), nested functions, or columns.

Parameters:

  • condition (ColumnElement | Func) –

    Condition to evaluate.

  • if_val (ColumnElement | Func | literal) –

    Value to return if condition is True.

  • else_val (ColumnElement | Func | literal) –

    Value to return if condition is False.

Returns:

  • Func ( Func ) –

    A Func object that represents the ifelse function.

Example
dc.mutate(
    res=func.ifelse(isnone("col"), "EMPTY", "NOT_EMPTY")
)
Notes
  • The result type is inferred from the values provided in the ifelse statement.
Source code in datachain/func/conditional.py
def ifelse(
    condition: Union[ColumnElement, Func], if_val: CaseT, else_val: CaseT
) -> Func:
    """
    Returns an if-else expression that evaluates a condition and returns one
    of two values based on the result. Values can be Python primitives
    (string, numbers, booleans), nested functions, or columns.

    Args:
        condition (ColumnElement | Func): Condition to evaluate.
        if_val (ColumnElement | Func | literal): Value to return if condition is True.
        else_val (ColumnElement | Func | literal): Value to return if condition
            is False.

    Returns:
        Func: A `Func` object that represents the ifelse function.

    Example:
        ```py
        dc.mutate(
            res=func.ifelse(isnone("col"), "EMPTY", "NOT_EMPTY")
        )
        ```

    Notes:
        - The result type is inferred from the values provided in the ifelse statement.
    """
    return case((condition, if_val), else_=else_val)

int_hash_64

int_hash_64(col: Union[str, Column, Func, int]) -> Func

Returns a function that computes the 64-bit hash of an integer.

Parameters:

  • col (str | Column | Func | int) –

    Integer to compute the hash of. If a string is provided, it is assumed to be the name of the column. If a Column is provided, it is assumed to be a column. If a Func is provided, it is assumed to be a function returning an int. If an int is provided, it is assumed to be an int literal.

Returns:

  • Func ( Func ) –

    A Func object that represents the 64-bit hash function.

Example
dc.mutate(
    val_hash=func.int_hash_64("val"),
    val_hash2=func.int_hash_64(dc.C("val2")),
)
Notes
  • The result column will always be of type int.
Source code in datachain/func/numeric.py
def int_hash_64(col: Union[str, Column, Func, int]) -> Func:
    """
    Returns a function that computes the 64-bit hash of an integer.

    Args:
        col (str | Column | Func | int): Integer to compute the hash of.
            If a string is provided, it is assumed to be the name of the column.
            If a Column is provided, it is assumed to be a column.
            If a Func is provided, it is assumed to be a function returning an int.
            If an int is provided, it is assumed to be an int literal.

    Returns:
        Func: A `Func` object that represents the 64-bit hash function.

    Example:
        ```py
        dc.mutate(
            val_hash=func.int_hash_64("val"),
            val_hash2=func.int_hash_64(dc.C("val2")),
        )
        ```

    Notes:
        - The result column will always be of type int.
    """
    cols, args = [], []
    if isinstance(col, int):
        args.append(col)
    else:
        cols.append(col)

    return Func(
        "int_hash_64", inner=numeric.int_hash_64, cols=cols, args=args, result_type=int
    )

isnone

isnone(col: Union[str, ColumnElement]) -> Func

Returns a function that checks if the column value is None (NULL in DB).

Parameters:

  • col (str | Column) –

    Column to check if it's None or not. If a string is provided, it is assumed to be the name of the column. If a Column is provided, it is assumed to be a column in the dataset.

Returns:

  • Func ( Func ) –

    A Func object that represents the isnone function. Returns True if column value is None, otherwise False.

Example
dc.mutate(test=ifelse(isnone("col"), "EMPTY", "NOT_EMPTY"))
Notes
  • The result column will always be of type bool.
Source code in datachain/func/conditional.py
def isnone(col: Union[str, ColumnElement]) -> Func:
    """
    Returns a function that checks if the column value is `None` (NULL in DB).

    Args:
        col (str | Column): Column to check if it's None or not.
            If a string is provided, it is assumed to be the name of the column.
            If a Column is provided, it is assumed to be a column in the dataset.

    Returns:
        Func: A `Func` object that represents the isnone function.
            Returns True if column value is None, otherwise False.

    Example:
        ```py
        dc.mutate(test=ifelse(isnone("col"), "EMPTY", "NOT_EMPTY"))
        ```

    Notes:
        - The result column will always be of type bool.
    """
    if isinstance(col, str):
        # if string is provided, it is assumed to be the name of the column
        col = Column(col)

    return case((col.is_(None) if col is not None else True, True), else_=False)

least

least(*args: Union[str, Column, Func, float]) -> Func

Returns the least (smallest) value from the given input values.

Parameters:

  • args (str | Column | Func | int | float, default: () ) –

    The values to compare. If a string is provided, it is assumed to be the name of the column. If a Column is provided, it is assumed to be a column in the dataset. If a Func is provided, it is assumed to be a function returning a value. If an int or float is provided, it is assumed to be a literal.

Returns:

  • Func ( Func ) –

    A Func object that represents the least function.

Example
dc.mutate(
    least=func.least(dc.C("signal.value"), "signal.value2", -1.0, 0),
)
Notes
  • The result column will always be of the same type as the input columns.
Source code in datachain/func/conditional.py
def least(*args: Union[str, Column, Func, float]) -> Func:
    """
    Returns the least (smallest) value from the given input values.

    Args:
        args (str | Column | Func | int | float): The values to compare.
            If a string is provided, it is assumed to be the name of the column.
            If a Column is provided, it is assumed to be a column in the dataset.
            If a Func is provided, it is assumed to be a function returning a value.
            If an int or float is provided, it is assumed to be a literal.

    Returns:
        Func: A `Func` object that represents the least function.

    Example:
        ```py
        dc.mutate(
            least=func.least(dc.C("signal.value"), "signal.value2", -1.0, 0),
        )
        ```

    Notes:
        - The result column will always be of the same type as the input columns.
    """
    cols, func_args = [], []

    for arg in args:
        if isinstance(arg, (str, Column, Func)):
            cols.append(arg)
        else:
            func_args.append(arg)

    return Func(
        "least", inner=conditional.least, cols=cols, args=func_args, result_type=int
    )

length

length(arg: Union[str, Column, Func, Sequence]) -> Func

Returns the length of the array.

Parameters:

  • arg (str | Column | Func | Sequence) –

    Array to compute the length of. If a string is provided, it is assumed to be the name of the array column. If a Column is provided, it is assumed to be an array column. If a Func is provided, it is assumed to be a function returning an array. If a sequence is provided, it is assumed to be an array of values.

Returns:

  • Func ( Func ) –

    A Func object that represents the array length function.

Example
dc.mutate(
    len1=func.array.length("signal.values"),
    len2=func.array.length(dc.C("signal.values")),
    len3=func.array.length([1, 2, 3, 4, 5]),
)
Notes
  • The result column will always be of type int.
Source code in datachain/func/array.py
def length(arg: Union[str, Column, Func, Sequence]) -> Func:
    """
    Returns the length of the array.

    Args:
        arg (str | Column | Func | Sequence): Array to compute the length of.
            If a string is provided, it is assumed to be the name of the array column.
            If a Column is provided, it is assumed to be an array column.
            If a Func is provided, it is assumed to be a function returning an array.
            If a sequence is provided, it is assumed to be an array of values.

    Returns:
        Func: A `Func` object that represents the array length function.

    Example:
        ```py
        dc.mutate(
            len1=func.array.length("signal.values"),
            len2=func.array.length(dc.C("signal.values")),
            len3=func.array.length([1, 2, 3, 4, 5]),
        )
        ```

    Notes:
        - The result column will always be of type int.
    """
    if isinstance(arg, (str, Column, Func)):
        cols = [arg]
        args = None
    else:
        cols = None
        args = [arg]

    return Func("length", inner=array.length, cols=cols, args=args, result_type=int)

max

max(col: Union[str, Column]) -> Func

Returns the MAX aggregate SQL function for the given column name.

The MAX function returns the largest value in the specified column. It can be used on both numeric and non-numeric columns to find the maximum value.

Parameters:

  • col (str | Column) –

    The name of the column for which to find the maximum value. Column can be specified as a string or a Column object.

Returns:

  • Func ( Func ) –

    A Func object that represents the MAX aggregate function.

Example
dc.group_by(
    largest_file=func.max("file.size"),
    max_signal=func.max(dc.C("signal")),
    partition_by="signal.category",
)
Notes
  • The max function can be used with numeric, date, and string columns.
  • The result column will have the same type as the input column.
Source code in datachain/func/aggregate.py
def max(col: Union[str, Column]) -> Func:
    """
    Returns the MAX aggregate SQL function for the given column name.

    The MAX function returns the largest value in the specified column.
    It can be used on both numeric and non-numeric columns to find the maximum value.

    Args:
        col (str | Column): The name of the column for which to find the maximum value.
            Column can be specified as a string or a `Column` object.

    Returns:
        Func: A Func object that represents the MAX aggregate function.

    Example:
        ```py
        dc.group_by(
            largest_file=func.max("file.size"),
            max_signal=func.max(dc.C("signal")),
            partition_by="signal.category",
        )
        ```

    Notes:
        - The `max` function can be used with numeric, date, and string columns.
        - The result column will have the same type as the input column.
    """
    return Func("max", inner=sa_func.max, cols=[col])

min

min(col: Union[str, Column]) -> Func

Returns the MIN aggregate SQL function for the specified column.

The MIN function returns the smallest value in the specified column. It can be used on both numeric and non-numeric columns to find the minimum value.

Parameters:

  • col (str | Column) –

    The name of the column for which to find the minimum value. Column can be specified as a string or a Column object.

Returns:

  • Func ( Func ) –

    A Func object that represents the MIN aggregate function.

Example
dc.group_by(
    smallest_file=func.min("file.size"),
    min_signal=func.min(dc.C("signal")),
    partition_by="signal.category",
)
Notes
  • The min function can be used with numeric, date, and string columns.
  • The result column will have the same type as the input column.
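The grouped examples for min and max amount to reducing each partition to a single value. Here is a plain-Python sketch of that reduction, using hypothetical (category, size) pairs; it illustrates the semantics only, not how the database evaluates the query.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

files: List[Tuple[str, int]] = [("cat", 120), ("cat", 80), ("dog", 300)]


def min_max_by_group(pairs: List[Tuple[str, int]]) -> Dict[str, Tuple[int, int]]:
    """Return (smallest, largest) size for each category."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for category, size in pairs:
        groups[category].append(size)
    return {category: (min(sizes), max(sizes)) for category, sizes in groups.items()}


summary = min_max_by_group(files)
```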
Source code in datachain/func/aggregate.py
def min(col: Union[str, Column]) -> Func:
    """
    Returns the MIN aggregate SQL function for the specified column.

    The MIN function returns the smallest value in the specified column.
    It can be used on both numeric and non-numeric columns to find the minimum value.

    Args:
        col (str | Column): The name of the column for which to find the minimum value.
            Column can be specified as a string or a `Column` object.

    Returns:
        Func: A Func object that represents the MIN aggregate function.

    Example:
        ```py
        dc.group_by(
            smallest_file=func.min("file.size"),
            min_signal=func.min(dc.C("signal")),
            partition_by="signal.category",
        )
        ```

    Notes:
        - The `min` function can be used with numeric, date, and string columns.
        - The result column will have the same type as the input column.
    """
    return Func("min", inner=sa_func.min, cols=[col])

name

name(col: ColT) -> Func

Returns the final component of a posix-style path.

Parameters:

  • col (str | Column | Func | literal) –

    String to compute the path name of. If a string is provided, it is assumed to be the name of the column. If a Column is provided, it is assumed to be a column object. If a Func is provided, it is assumed to be a function returning a string. If a literal is provided, it is assumed to be a string literal.

Returns:

  • Func ( Func ) –

    A Func object that represents the path name function.

Example
dc.mutate(
    filename1=func.path.name("file.path"),
    filename2=func.path.name(dc.C("file.path")),
    filename3=func.path.name(dc.func.literal("/path/to/file.txt")),
)
Note
  • The result column will always be of type string.
Source code in datachain/func/path.py
def name(col: ColT) -> Func:
    """
    Returns the final component of a posix-style path.

    Args:
        col (str | Column | Func | literal): String to compute the path name of.
            If a string is provided, it is assumed to be the name of the column.
            If a Column is provided, it is assumed to be a column object.
            If a Func is provided, it is assumed to be a function returning a string.
            If a literal is provided, it is assumed to be a string literal.

    Returns:
        Func: A `Func` object that represents the path name function.

    Example:
        ```py
        dc.mutate(
            filename1=func.path.name("file.path"),
            filename2=func.path.name(dc.C("file.path")),
            filename3=func.path.name(dc.func.literal("/path/to/file.txt")),
        )
        ```

    Note:
        - The result column will always be of type string.
    """

    return Func("name", inner=path.name, cols=[col], result_type=str)

or_

or_(*args: Union[ColumnElement, Func]) -> Func

Returns the function that produces disjunction of expressions joined by OR logical operator.

Parameters:

  • args (ColumnElement | Func, default: () ) –

    The expressions for OR statement. If a string is provided, it is assumed to be the name of the column. If a Column is provided, it is assumed to be a column in the dataset. If a Func is provided, it is assumed to be a function returning a value.

Returns:

  • Func ( Func ) –

    A Func object that represents the OR function.

Example
dc.mutate(
    test=ifelse(or_(isnone("name"), dc.C("name") == ''), "Empty", "Not Empty")
)
Notes
  • The result column will always be of type bool.
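The example above labels a name as empty when it is either missing or an empty string. In plain Python, the same per-row logic looks roughly like the sketch below; `label` is a hypothetical helper, and Python's `None` stands in for a database NULL.

```python
from typing import Optional


def label(name: Optional[str]) -> str:
    """Return "Empty" when the name is None or an empty string."""
    return "Empty" if (name is None or name == "") else "Not Empty"


labels = [label(name) for name in ["Ann", "", None]]
```

Checking for None alone would miss the empty string, which is why the two conditions are joined with OR.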
Source code in datachain/func/conditional.py
def or_(*args: Union[ColumnElement, Func]) -> Func:
    """
    Returns the function that produces disjunction of expressions joined by OR
    logical operator.

    Args:
        args (ColumnElement | Func): The expressions for OR statement.
            If a string is provided, it is assumed to be the name of the column.
            If a Column is provided, it is assumed to be a column in the dataset.
            If a Func is provided, it is assumed to be a function returning a value.

    Returns:
        Func: A `Func` object that represents the OR function.

    Example:
        ```py
        dc.mutate(
            test=ifelse(or_(isnone("name"), dc.C("name") == ''), "Empty", "Not Empty")
        )
        ```

    Notes:
        - The result column will always be of type bool.
    """
    cols, func_args = [], []

    for arg in args:
        if isinstance(arg, (str, Column, Func)):
            cols.append(arg)
        else:
            func_args.append(arg)

    return Func("or", inner=sql_or, cols=cols, args=func_args, result_type=bool)

parent

parent(col: ColT) -> Func

Returns the directory component of a posix-style path.

Parameters:

  • col (str | Column | Func | literal) –

    String to compute the path parent of. If a string is provided, it is assumed to be the name of the column. If a Column is provided, it is assumed to be a column object. If a Func is provided, it is assumed to be a function returning a string. If a literal is provided, it is assumed to be a string literal.

Returns:

  • Func ( Func ) –

    A Func object that represents the path parent function.

Example
dc.mutate(
    parent1=func.path.parent("file.path"),
    parent2=func.path.parent(dc.C("file.path")),
    parent3=func.path.parent(dc.func.literal("/path/to/file.txt")),
)
Note
  • The result column will always be of type string.
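The descriptions of name and parent match what Python's `posixpath.basename` and `posixpath.dirname` return for a posix-style path. The sketch below uses those standard-library calls as an analogue; it assumes the database functions agree with them on ordinary paths, which this page does not confirm for edge cases such as trailing slashes.

```python
import posixpath

path = "/path/to/file.txt"

final_component = posixpath.basename(path)  # analogue of func.path.name
directory = posixpath.dirname(path)         # analogue of func.path.parent
```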
Source code in datachain/func/path.py
def parent(col: ColT) -> Func:
    """
    Returns the directory component of a posix-style path.

    Args:
        col (str | Column | Func | literal): String to compute the path parent of.
            If a string is provided, it is assumed to be the name of the column.
            If a Column is provided, it is assumed to be a column object.
            If a Func is provided, it is assumed to be a function returning a string.
            If a literal is provided, it is assumed to be a string literal.

    Returns:
        Func: A `Func` object that represents the path parent function.

    Example:
        ```py
        dc.mutate(
            parent1=func.path.parent("file.path"),
            parent2=func.path.parent(dc.C("file.path")),
            parent3=func.path.parent(dc.func.literal("/path/to/file.txt")),
        )
        ```

    Note:
        - The result column will always be of type string.
    """
    return Func("parent", inner=path.parent, cols=[col], result_type=str)

rand

rand() -> Func

Returns a random integer value.

Returns:

  • Func ( Func ) –

    A Func object that represents the rand function.

Example
dc.mutate(
    rnd=func.random.rand(),
)
Note
  • The result column will always be of type integer.
Source code in datachain/func/random.py
def rand() -> Func:
    """
    Returns a random integer value.

    Returns:
        Func: A `Func` object that represents the rand function.

    Example:
        ```py
        dc.mutate(
            rnd=func.random.rand(),
        )
        ```

    Note:
        - The result column will always be of type integer.
    """
    return Func("rand", inner=random.rand, result_type=int)

rank

rank() -> Func

Returns the RANK window function for SQL queries.

The RANK function assigns a rank to each row within a partition of a result set, with gaps in the ranking for ties. Rows with equal values receive the same rank, and the next rank is skipped (e.g., if two rows are ranked 1, the next row is ranked 3).

Returns:

  • Func ( Func ) –

    A Func object that represents the RANK window function.

Example
window = func.window(partition_by="signal.category", order_by="created_at")
dc.mutate(
    rank=func.rank().over(window),
)
Notes
  • The result column will always be of type int.
  • The RANK function differs from ROW_NUMBER in that rows with the same value in the ordering column(s) receive the same rank.
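The difference between RANK and ROW_NUMBER on tied values can be seen by running both in plain SQL. The sketch below uses Python's built-in `sqlite3` module (window functions need SQLite 3.25 or newer); the `scores` table and its rows are hypothetical and unrelated to DataChain's own storage.

```python
import sqlite3

# Hypothetical data: players "a" and "b" are tied on score.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (player TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("a", 10), ("b", 10), ("c", 20), ("d", 30)],
)

# RANK gives tied rows the same rank and then skips; ROW_NUMBER never repeats.
# "player" is added to the ROW_NUMBER ordering only to make the output deterministic.
rows = conn.execute(
    """
    SELECT player,
           score,
           RANK() OVER (ORDER BY score) AS rnk,
           ROW_NUMBER() OVER (ORDER BY score, player) AS rn
    FROM scores
    ORDER BY score, player
    """
).fetchall()

for row in rows:
    print(row)
```

The two tied rows share rank 1 and the next row is ranked 3, while the row numbers run 1 through 4 without repeats.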
Source code in datachain/func/aggregate.py
def rank() -> Func:
    """
    Returns the RANK window function for SQL queries.

    The RANK function assigns a rank to each row within a partition of a result set,
    with gaps in the ranking for ties. Rows with equal values receive the same rank,
    and the next rank is skipped (e.g., if two rows are ranked 1,
    the next row is ranked 3).

    Returns:
        Func: A Func object that represents the RANK window function.

    Example:
        ```py
        window = func.window(partition_by="signal.category", order_by="created_at")
        dc.mutate(
            rank=func.rank().over(window),
        )
        ```

    Notes:
        - The result column will always be of type int.
        - The RANK function differs from ROW_NUMBER in that rows with the same value
          in the ordering column(s) receive the same rank.
    """
    return Func("rank", inner=sa_func.rank, result_type=int, is_window=True)

row_number

row_number() -> Func

Returns the ROW_NUMBER window function for SQL queries.

The ROW_NUMBER function assigns a unique sequential integer to rows within a partition of a result set, starting from 1 for the first row in each partition. It is commonly used to generate row numbers within partitions or ordered results.

Returns:

  • Func ( Func ) –

    A Func object that represents the ROW_NUMBER window function.

Example
window = func.window(partition_by="signal.category", order_by="created_at")
dc.mutate(
    row_number=func.row_number().over(window),
)
Note
  • The result column will always be of type int.
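The per-partition numbering described above can be checked with plain SQL. This sketch uses Python's built-in `sqlite3` module (SQLite 3.25 or newer); the `signals` table, its columns, and its rows are hypothetical and only loosely mirror the column names in the example.

```python
import sqlite3

# Hypothetical rows in two categories, inserted out of order on purpose.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE signals (category TEXT, created_at INTEGER)")
conn.executemany(
    "INSERT INTO signals VALUES (?, ?)",
    [("cat", 3), ("cat", 1), ("dog", 2), ("cat", 2), ("dog", 5)],
)

# Numbering follows created_at and restarts at 1 for each category.
rows = conn.execute(
    """
    SELECT category,
           created_at,
           ROW_NUMBER() OVER (PARTITION BY category ORDER BY created_at) AS rn
    FROM signals
    ORDER BY category, created_at
    """
).fetchall()

for row in rows:
    print(row)
```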
Source code in datachain/func/aggregate.py
def row_number() -> Func:
    """
    Returns the ROW_NUMBER window function for SQL queries.

    The ROW_NUMBER function assigns a unique sequential integer to rows
    within a partition of a result set, starting from 1 for the first row
    in each partition. It is commonly used to generate row numbers within
    partitions or ordered results.

    Returns:
        Func: A Func object that represents the ROW_NUMBER window function.

    Example:
        ```py
        window = func.window(partition_by="signal.category", order_by="created_at")
        dc.mutate(
            row_number=func.row_number().over(window),
        )
        ```

    Note:
        - The result column will always be of type int.
    """
    return Func("row_number", inner=sa_func.row_number, result_type=int, is_window=True)

sip_hash_64

sip_hash_64(
    arg: Union[str, Column, Func, Sequence],
) -> Func

Returns the SipHash-64 hash of the array.

Parameters:

  • arg (str | Column | Func | Sequence) –

    Array to compute the SipHash-64 hash of. If a string is provided, it is assumed to be the name of the array column. If a Column is provided, it is assumed to be an array column. If a Func is provided, it is assumed to be a function returning an array. If a sequence is provided, it is assumed to be an array of values.

Returns:

  • Func ( Func ) –

    A Func object that represents the sip_hash_64 function.

Example
dc.mutate(
    hash1=func.sip_hash_64("signal.values"),
    hash2=func.sip_hash_64(dc.C("signal.values")),
    hash3=func.sip_hash_64([1, 2, 3, 4, 5]),
)
Note
  • This function is only available for the ClickHouse warehouse.
  • The result column will always be of type int.
Source code in datachain/func/array.py
def sip_hash_64(arg: Union[str, Column, Func, Sequence]) -> Func:
    """
    Returns the SipHash-64 hash of the array.

    Args:
        arg (str | Column | Func | Sequence): Array to compute the SipHash-64 hash of.
            If a string is provided, it is assumed to be the name of the array column.
            If a Column is provided, it is assumed to be an array column.
            If a Func is provided, it is assumed to be a function returning an array.
            If a sequence is provided, it is assumed to be an array of values.

    Returns:
        Func: A `Func` object that represents the sip_hash_64 function.

    Example:
        ```py
        dc.mutate(
            hash1=func.sip_hash_64("signal.values"),
            hash2=func.sip_hash_64(dc.C("signal.values")),
            hash3=func.sip_hash_64([1, 2, 3, 4, 5]),
        )
        ```

    Note:
        - This function is only available for the ClickHouse warehouse.
        - The result column will always be of type int.
    """
    if isinstance(arg, (str, Column, Func)):
        cols = [arg]
        args = None
    else:
        cols = None
        args = [arg]

    return Func(
        "sip_hash_64", inner=array.sip_hash_64, cols=cols, args=args, result_type=int
    )

sum

sum(col: Union[str, Column]) -> Func

Returns the SUM aggregate SQL function for the specified column.

The SUM function returns the total of all values in the specified numeric column.

Parameters:

  • col (str | Column) –

    The name of the column for which to calculate the sum. The column can be specified as a string or a Column object.

Returns:

  • Func ( Func ) –

    A Func object that represents the SUM aggregate function.

Example
dc.group_by(
    files_size=func.sum("file.size"),
    total_size=func.sum(dc.C("size")),
    partition_by="signal.category",
)
Notes
  • The sum function should be used on numeric columns.
  • The result column type will be the same as the input column type.
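For readers more familiar with SQL, the grouped sum in the example roughly corresponds to `SUM(...)` with `GROUP BY`. The sketch below shows that with Python's built-in `sqlite3` module; the `files` table and its rows are hypothetical, and the statement is an approximation rather than the query DataChain emits.

```python
import sqlite3

# Hypothetical file sizes grouped by a category column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (category TEXT, size INTEGER)")
conn.executemany(
    "INSERT INTO files VALUES (?, ?)",
    [("image", 100), ("image", 250), ("text", 40)],
)

# One output row per category, holding the total size of its files.
totals = conn.execute(
    "SELECT category, SUM(size) FROM files GROUP BY category ORDER BY category"
).fetchall()

for category, total in totals:
    print(category, total)
```

Because the input column holds integers, the totals come back as integers as well.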
Source code in datachain/func/aggregate.py
def sum(col: Union[str, Column]) -> Func:
    """
    Returns the SUM aggregate SQL function for the specified column.

    The SUM function returns the total of all values in the specified
    numeric column.

    Args:
        col (str | Column): The name of the column for which to calculate the sum.
            The column can be specified as a string or a `Column` object.

    Returns:
        Func: A `Func` object that represents the SUM aggregate function.

    Example:
        ```py
        dc.group_by(
            files_size=func.sum("file.size"),
            total_size=func.sum(dc.C("size")),
            partition_by="signal.category",
        )
        ```

    Notes:
        - The `sum` function should be used on numeric columns.
        - The result column type will be the same as the input column type.
    """
    return Func("sum", inner=sa_func.sum, cols=[col])