SQL

Use SQL functions to operate on the underlying database that stores the chain data. They are useful in operations such as DataChain.filter and DataChain.mutate. Import these functions from datachain.sql.functions.

avg

avg(col: str) -> Func

Returns the AVG aggregate SQL function for the given column name.

The AVG function returns the average of a numeric column in a table. It calculates the mean of all values in the specified column.

Parameters:

col (str) –

The name of the column for which to calculate the average.

Returns:

Func ( Func ) –

A Func object that represents the AVG aggregate function.

Example

dc.group_by(
    avg_file_size=func.avg("file.size"),
    partition_by="signal.category",
)

Notes

The avg function should be used on numeric columns.
Result column will always be of type float.
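
To make the grouping semantics concrete, here is a plain-Python sketch of what an AVG aggregate partitioned by a column computes. It does not use DataChain; the `rows` data and the `grouped_avg` helper are hypothetical and only model the behaviour described above.

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical rows standing in for a chain with "category" and "size" columns.
rows = [
    {"category": "cat", "size": 100},
    {"category": "cat", "size": 300},
    {"category": "dog", "size": 50},
]


def grouped_avg(rows, value_key, partition_key):
    """Return {partition value: mean of value_key}, always as floats."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[partition_key]].append(row[value_key])
    return {key: fmean(values) for key, values in groups.items()}


avg_by_category = grouped_avg(rows, "size", "category")
print(avg_by_category)  # {'cat': 200.0, 'dog': 50.0}
```

Even though every input is an integer, each result is a float, which matches the note on the result type.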

Source code in datachain/func/aggregate.py

def avg(col: str) -> Func:
    """
    Returns the AVG aggregate SQL function for the given column name.

    The AVG function returns the average of a numeric column in a table.
    It calculates the mean of all values in the specified column.

    Args:
        col (str): The name of the column for which to calculate the average.

    Returns:
        Func: A Func object that represents the AVG aggregate function.

    Example:
        ```py
        dc.group_by(
            avg_file_size=func.avg("file.size"),
            partition_by="signal.category",
        )
        ```

    Notes:
        - The `avg` function should be used on numeric columns.
        - Result column will always be of type float.
    """
    return Func("avg", inner=aggregate.avg, cols=[col], result_type=float)

count

count(col: Optional[str] = None) -> Func

Returns the COUNT aggregate SQL function for the given column name.

The COUNT function returns the number of rows in a table.

Parameters:

col (str, default: None ) –

The name of the column for which to count rows. If not provided, it defaults to counting all rows.

Returns:

Func ( Func ) –

A Func object that represents the COUNT aggregate function.

Example

dc.group_by(
    count=func.count(),
    partition_by="signal.category",
)

Notes

Result column will always be of type int.
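
In standard SQL, COUNT(*) counts every row in a group, while COUNT(column) skips rows where that column is NULL. The sketch below models this distinction in plain Python, with `None` standing in for NULL. It assumes the backing database follows standard SQL here; the `rows` data and the `grouped_count` helper are hypothetical.

```python
# Hypothetical rows; None stands in for SQL NULL.
rows = [
    {"category": "cat", "label": "a"},
    {"category": "cat", "label": None},
    {"category": "dog", "label": "b"},
]


def grouped_count(rows, partition_key, col=None):
    """Count rows per partition; with col set, skip rows where col is None."""
    counts = {}
    for row in rows:
        key = row[partition_key]
        counts.setdefault(key, 0)
        if col is None or row[col] is not None:
            counts[key] += 1
    return counts


all_rows = grouped_count(rows, "category")
non_null_labels = grouped_count(rows, "category", col="label")
print(all_rows)         # {'cat': 2, 'dog': 1}
print(non_null_labels)  # {'cat': 1, 'dog': 1}
```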

Source code in datachain/func/aggregate.py

def count(col: Optional[str] = None) -> Func:
    """
    Returns the COUNT aggregate SQL function for the given column name.

    The COUNT function returns the number of rows in a table.

    Args:
        col (str, optional): The name of the column for which to count rows.
                             If not provided, it defaults to counting all rows.

    Returns:
        Func: A Func object that represents the COUNT aggregate function.

    Example:
        ```py
        dc.group_by(
            count=func.count(),
            partition_by="signal.category",
        )
        ```

    Notes:
        - Result column will always be of type int.
    """
    return Func(
        "count", inner=sa_func.count, cols=[col] if col else None, result_type=int
    )

greatest

greatest(*args: Union[ColT, float]) -> Func

Returns the greatest (largest) value from the given input values.

Parameters:

args (ColT | str | int | float | Sequence, default: () ) –

The values to compare. If a string is provided, it is assumed to be the name of the column. If a Func is provided, it is assumed to be a function returning a value. If an int, float, or Sequence is provided, it is assumed to be a literal.

Returns:

Func ( Func ) –

A Func object that represents the greatest function.

Example

dc.mutate(
    greatest=func.greatest("signal.value", 0),
)

Note

Result column will always be of the same type as the input columns.
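
Unlike the aggregate max, greatest compares values within a single row. The plain-Python sketch below models the example above, where a column is compared against the literal 0, and shows the mirror-image behaviour of least. The `values` list is hypothetical and no DataChain code is involved.

```python
# Hypothetical per-row values of a "signal.value" column.
values = [-3, 0, 7, -1]

# Row-wise "greatest(value, 0)": each row is compared against the literal 0,
# so negative values are floored at zero.
floored = [max(value, 0) for value in values]

# Row-wise "least(value, 0)" is the mirror image and caps values at zero.
capped = [min(value, 0) for value in values]

print(floored)  # [0, 0, 7, 0]
print(capped)   # [-3, 0, 0, -1]
```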

Source code in datachain/func/conditional.py

def greatest(*args: Union[ColT, float]) -> Func:
    """
    Returns the greatest (largest) value from the given input values.

    Args:
        args (ColT | str | int | float | Sequence): The values to compare.
            If a string is provided, it is assumed to be the name of the column.
            If a Func is provided, it is assumed to be a function returning a value.
            If an int, float, or Sequence is provided, it is assumed to be a literal.

    Returns:
        Func: A Func object that represents the greatest function.

    Example:
        ```py
        dc.mutate(
            greatest=func.greatest("signal.value", 0),
        )
        ```

    Note:
        - Result column will always be of the same type as the input columns.
    """
    cols, func_args = [], []

    for arg in args:
        if isinstance(arg, (str, Func)):
            cols.append(arg)
        else:
            func_args.append(arg)

    return Func(
        "greatest",
        inner=conditional.greatest,
        cols=cols,
        args=func_args,
        result_type=int,
    )

least

least(*args: Union[ColT, float]) -> Func

Returns the least (smallest) value from the given input values.

Parameters:

args (ColT | str | int | float | Sequence, default: () ) –

The values to compare. If a string is provided, it is assumed to be the name of the column. If a Func is provided, it is assumed to be a function returning a value. If an int, float, or Sequence is provided, it is assumed to be a literal.

Returns:

Func ( Func ) –

A Func object that represents the least function.

Example

dc.mutate(
    least=func.least("signal.value", 0),
)

Note

Result column will always be of the same type as the input columns.
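
The source for least (and greatest) sorts its arguments into column references and literals with an isinstance check. The sketch below reproduces only that sorting step; the `Func` class is a hypothetical stand-in and `split_arguments` is an illustrative name, not part of the library.

```python
class Func:
    """Hypothetical stand-in for datachain's Func, used only for isinstance checks."""


def split_arguments(*args):
    """Separate column references (str or Func) from literal values."""
    cols, literals = [], []
    for arg in args:
        if isinstance(arg, (str, Func)):
            cols.append(arg)
        else:
            literals.append(arg)
    return cols, literals


cols, literals = split_arguments("signal.value", 0, 1.5)
print(cols)      # ['signal.value']
print(literals)  # [0, 1.5]
```

One consequence of this check is that a plain Python string is always read as a column name, never as a string value.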

Source code in datachain/func/conditional.py

def least(*args: Union[ColT, float]) -> Func:
    """
    Returns the least (smallest) value from the given input values.

    Args:
        args (ColT | str | int | float | Sequence): The values to compare.
            If a string is provided, it is assumed to be the name of the column.
            If a Func is provided, it is assumed to be a function returning a value.
            If an int, float, or Sequence is provided, it is assumed to be a literal.

    Returns:
        Func: A Func object that represents the least function.

    Example:
        ```py
        dc.mutate(
            least=func.least("signal.value", 0),
        )
        ```

    Note:
        - Result column will always be of the same type as the input columns.
    """
    cols, func_args = [], []

    for arg in args:
        if isinstance(arg, (str, Func)):
            cols.append(arg)
        else:
            func_args.append(arg)

    return Func(
        "least", inner=conditional.least, cols=cols, args=func_args, result_type=int
    )

max

max(col: str) -> Func

Returns the MAX aggregate SQL function for the given column name.

The MAX function returns the largest value in the specified column. It can be used on both numeric and non-numeric columns to find the maximum value.

Parameters:

col (str) –

The name of the column for which to find the maximum value.

Returns:

Func ( Func ) –

A Func object that represents the MAX aggregate function.

Example

dc.group_by(
    largest_file=func.max("file.size"),
    partition_by="signal.category",
)

Notes

The max function can be used with numeric, date, and string columns.
Result column will have the same type as the input column.
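
As a rough illustration of how a maximum is defined for each of the column types listed above, the following plain-Python sketch applies the built-in max and min to hypothetical values. Python compares strings by code point, whereas a database may apply its own collation, so string results can differ in practice.

```python
from datetime import date

# Hypothetical column values of three different types.
sizes = [120, 4096, 512]
names = ["beta.txt", "alpha.txt", "gamma.txt"]
created = [date(2024, 1, 5), date(2023, 12, 31), date(2024, 3, 1)]

# Numbers compare by magnitude, strings lexicographically, dates chronologically.
largest_size = max(sizes)
last_name = max(names)
first_name = min(names)
latest_date = max(created)

print(largest_size)  # 4096
print(last_name)     # gamma.txt
print(first_name)    # alpha.txt
print(latest_date)   # 2024-03-01
```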

Source code in datachain/func/aggregate.py

def max(col: str) -> Func:
    """
    Returns the MAX aggregate SQL function for the given column name.

    The MAX function returns the largest value in the specified column.
    It can be used on both numeric and non-numeric columns to find the maximum value.

    Args:
        col (str): The name of the column for which to find the maximum value.

    Returns:
        Func: A Func object that represents the MAX aggregate function.

    Example:
        ```py
        dc.group_by(
            largest_file=func.max("file.size"),
            partition_by="signal.category",
        )
        ```

    Notes:
        - The `max` function can be used with numeric, date, and string columns.
        - Result column will have the same type as the input column.
    """
    return Func("max", inner=sa_func.max, cols=[col])

min

min(col: str) -> Func

Returns the MIN aggregate SQL function for the given column name.

The MIN function returns the smallest value in the specified column. It can be used on both numeric and non-numeric columns to find the minimum value.

Parameters:

col (str) –

The name of the column for which to find the minimum value.

Returns:

Func ( Func ) –

A Func object that represents the MIN aggregate function.

Example

dc.group_by(
    smallest_file=func.min("file.size"),
    partition_by="signal.category",
)

Notes

The min function can be used with numeric, date, and string columns.
Result column will have the same type as the input column.

Source code in datachain/func/aggregate.py

def min(col: str) -> Func:
    """
    Returns the MIN aggregate SQL function for the given column name.

    The MIN function returns the smallest value in the specified column.
    It can be used on both numeric and non-numeric columns to find the minimum value.

    Args:
        col (str): The name of the column for which to find the minimum value.

    Returns:
        Func: A Func object that represents the MIN aggregate function.

    Example:
        ```py
        dc.group_by(
            smallest_file=func.min("file.size"),
            partition_by="signal.category",
        )
        ```

    Notes:
        - The `min` function can be used with numeric, date, and string columns.
        - Result column will have the same type as the input column.
    """
    return Func("min", inner=sa_func.min, cols=[col])

rand

rand() -> Func

Returns a random integer value.

Returns:

Func ( Func ) –

A Func object that represents the rand function.

Example

dc.mutate(
    rnd=func.random.rand(),
)

Note

Result column will always be of type integer.

Source code in datachain/func/random.py

def rand() -> Func:
    """
    Returns a random integer value.

    Returns:
        Func: A Func object that represents the rand function.

    Example:
        ```py
        dc.mutate(
            rnd=func.random.rand(),
        )
        ```

    Note:
        - Result column will always be of type integer.
    """
    return Func("rand", inner=random.rand, result_type=int)

sum

sum(col: str) -> Func

Returns the SUM aggregate SQL function for the given column name.

The SUM function returns the total sum of a numeric column in a table. It sums up all the values for the specified column.

Parameters:

col (str) –

The name of the column for which to calculate the sum.

Returns:

Func ( Func ) –

A Func object that represents the SUM aggregate function.

Example

dc.group_by(
    files_size=func.sum("file.size"),
    partition_by="signal.category",
)

Notes

The sum function should be used on numeric columns.
Result column type will be the same as the input column type.

Source code in datachain/func/aggregate.py

def sum(col: str) -> Func:
    """
    Returns the SUM aggregate SQL function for the given column name.

    The SUM function returns the total sum of a numeric column in a table.
    It sums up all the values for the specified column.

    Args:
        col (str): The name of the column for which to calculate the sum.

    Returns:
        Func: A Func object that represents the SUM aggregate function.

    Example:
        ```py
        dc.group_by(
            files_size=func.sum("file.size"),
            partition_by="signal.category",
        )
        ```

    Notes:
        - The `sum` function should be used on numeric columns.
        - Result column type will be the same as the input column type.
    """
    return Func("sum", inner=sa_func.sum, cols=[col])

array

cosine_distance

cosine_distance(*args: Union[str, Sequence]) -> Func

Computes the cosine distance between two vectors.

The cosine distance is derived from the cosine similarity, which measures the angle between two vectors. This function returns the dissimilarity between the vectors, where 0 indicates identical vectors and values closer to 1 indicate higher dissimilarity.

Parameters:

args (str | Sequence, default: () ) –

Two vectors to compute the cosine distance between. If a string is provided, it is assumed to be the name of the column vector. If a sequence is provided, it is assumed to be a vector of values.

Returns:

Func ( Func ) –

A Func object that represents the cosine_distance function.

Example

target_embedding = [0.1, 0.2, 0.3]
dc.mutate(
    cos_dist1=func.cosine_distance("embedding", target_embedding),
    cos_dist2=func.cosine_distance(target_embedding, [0.4, 0.5, 0.6]),
)

Notes

Ensure both vectors have the same number of elements.
Result column will always be of type float.
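
For reference, the conventional definition of cosine distance is one minus the cosine similarity. The sketch below implements that formula in plain Python; it is not DataChain's backend implementation, and it assumes non-zero vectors of equal length.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


same_direction = cosine_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])

# Vectors pointing the same way have distance 0 (up to rounding error).
print(math.isclose(same_direction, 0.0, abs_tol=1e-9))  # True
print(orthogonal)  # 1.0
```

Note that the distance depends only on direction: scaling a vector does not change the result.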

Source code in datachain/func/array.py

def cosine_distance(*args: Union[str, Sequence]) -> Func:
    """
    Computes the cosine distance between two vectors.

    The cosine distance is derived from the cosine similarity, which measures the angle
    between two vectors. This function returns the dissimilarity between the vectors,
    where 0 indicates identical vectors and values closer to 1
    indicate higher dissimilarity.

    Args:
        args (str | Sequence): Two vectors to compute the cosine distance between.
            If a string is provided, it is assumed to be the name of the column vector.
            If a sequence is provided, it is assumed to be a vector of values.

    Returns:
        Func: A Func object that represents the cosine_distance function.

    Example:
        ```py
        target_embedding = [0.1, 0.2, 0.3]
        dc.mutate(
            cos_dist1=func.cosine_distance("embedding", target_embedding),
            cos_dist2=func.cosine_distance(target_embedding, [0.4, 0.5, 0.6]),
        )
        ```

    Notes:
        - Ensure both vectors have the same number of elements.
        - Result column will always be of type float.
    """
    cols, func_args = [], []
    for arg in args:
        if isinstance(arg, str):
            cols.append(arg)
        else:
            func_args.append(list(arg))

    if len(cols) + len(func_args) != 2:
        raise ValueError("cosine_distance() requires exactly two arguments")
    if not cols and len(func_args[0]) != len(func_args[1]):
        raise ValueError("cosine_distance() requires vectors of the same length")

    return Func(
        "cosine_distance",
        inner=array.cosine_distance,
        cols=cols,
        args=func_args,
        result_type=float,
    )

euclidean_distance

euclidean_distance(*args: Union[str, Sequence]) -> Func

Computes the Euclidean distance between two vectors.

The Euclidean distance is the straight-line distance between two points in Euclidean space. This function returns the distance between the two vectors.

Parameters:

args (str | Sequence, default: () ) –

Two vectors to compute the Euclidean distance between. If a string is provided, it is assumed to be the name of the column vector. If a sequence is provided, it is assumed to be a vector of values.

Returns:

Func ( Func ) –

A Func object that represents the euclidean_distance function.

Example

target_embedding = [0.1, 0.2, 0.3]
dc.mutate(
    eu_dist1=func.euclidean_distance("embedding", target_embedding),
    eu_dist2=func.euclidean_distance(target_embedding, [0.4, 0.5, 0.6]),
)

Notes

Ensure both vectors have the same number of elements.
Result column will always be of type float.
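
The straight-line distance described above is the square root of the summed squared differences. A minimal plain-Python sketch, independent of DataChain and assuming vectors of equal length:

```python
import math


def euclidean_distance(a, b):
    """Return the straight-line distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


distance = euclidean_distance([0.0, 0.0], [3.0, 4.0])
print(distance)  # 5.0

# The standard library offers the same computation as math.dist (Python 3.8+).
stdlib_distance = math.dist([0.0, 0.0], [3.0, 4.0])
print(stdlib_distance)  # 5.0
```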

Source code in datachain/func/array.py

def euclidean_distance(*args: Union[str, Sequence]) -> Func:
    """
    Computes the Euclidean distance between two vectors.

    The Euclidean distance is the straight-line distance between two points
    in Euclidean space. This function returns the distance between the two vectors.

    Args:
        args (str | Sequence): Two vectors to compute the Euclidean distance between.
            If a string is provided, it is assumed to be the name of the column vector.
            If a sequence is provided, it is assumed to be a vector of values.

    Returns:
        Func: A Func object that represents the euclidean_distance function.

    Example:
        ```py
        target_embedding = [0.1, 0.2, 0.3]
        dc.mutate(
            eu_dist1=func.euclidean_distance("embedding", target_embedding),
            eu_dist2=func.euclidean_distance(target_embedding, [0.4, 0.5, 0.6]),
        )
        ```

    Notes:
        - Ensure both vectors have the same number of elements.
        - Result column will always be of type float.
    """
    cols, func_args = [], []
    for arg in args:
        if isinstance(arg, str):
            cols.append(arg)
        else:
            func_args.append(list(arg))

    if len(cols) + len(func_args) != 2:
        raise ValueError("euclidean_distance() requires exactly two arguments")
    if not cols and len(func_args[0]) != len(func_args[1]):
        raise ValueError("euclidean_distance() requires vectors of the same length")

    return Func(
        "euclidean_distance",
        inner=array.euclidean_distance,
        cols=cols,
        args=func_args,
        result_type=float,
    )

length

length(arg: Union[str, Sequence, Func]) -> Func

Returns the length of the array.

Parameters:

arg (str | Sequence | Func) –

Array to compute the length of. If a string is provided, it is assumed to be the name of the array column. If a sequence is provided, it is assumed to be an array of values. If a Func is provided, it is assumed to be a function returning an array.

Returns:

Func ( Func ) –

A Func object that represents the array length function.

Example

dc.mutate(
    len1=func.array.length("signal.values"),
    len2=func.array.length([1, 2, 3, 4, 5]),
)

Note

Result column will always be of type int.

Source code in datachain/func/array.py

def length(arg: Union[str, Sequence, Func]) -> Func:
    """
    Returns the length of the array.

    Args:
        arg (str | Sequence | Func): Array to compute the length of.
            If a string is provided, it is assumed to be the name of the array column.
            If a sequence is provided, it is assumed to be an array of values.
            If a Func is provided, it is assumed to be a function returning an array.

    Returns:
        Func: A Func object that represents the array length function.

    Example:
        ```py
        dc.mutate(
            len1=func.array.length("signal.values"),
            len2=func.array.length([1, 2, 3, 4, 5]),
        )
        ```

    Note:
        - Result column will always be of type int.
    """
    if isinstance(arg, (str, Func)):
        cols = [arg]
        args = None
    else:
        cols = None
        args = [arg]

    return Func("length", inner=array.length, cols=cols, args=args, result_type=int)

sip_hash_64

sip_hash_64(arg: Union[str, Sequence]) -> Func

Computes the SipHash-64 hash of the array.

Parameters:

arg (str | Sequence) –

Array to compute the SipHash-64 hash of. If a string is provided, it is assumed to be the name of the array column. If a sequence is provided, it is assumed to be an array of values.

Returns:

Func ( Func ) –

A Func object that represents the sip_hash_64 function.

Example

dc.mutate(
    hash1=func.sip_hash_64("signal.values"),
    hash2=func.sip_hash_64([1, 2, 3, 4, 5]),
)

Note

This function is only available for the ClickHouse warehouse.
Result column will always be of type int.

Source code in datachain/func/array.py

def sip_hash_64(arg: Union[str, Sequence]) -> Func:
    """
    Computes the SipHash-64 hash of the array.

    Args:
        arg (str | Sequence): Array to compute the SipHash-64 hash of.
            If a string is provided, it is assumed to be the name of the array column.
            If a sequence is provided, it is assumed to be an array of values.

    Returns:
        Func: A Func object that represents the sip_hash_64 function.

    Example:
        ```py
        dc.mutate(
            hash1=func.sip_hash_64("signal.values"),
            hash2=func.sip_hash_64([1, 2, 3, 4, 5]),
        )
        ```

    Note:
        - This function is only available for the ClickHouse warehouse.
        - Result column will always be of type int.
    """
    if isinstance(arg, str):
        cols = [arg]
        args = None
    else:
        cols = None
        args = [arg]

    return Func(
        "sip_hash_64", inner=array.sip_hash_64, cols=cols, args=args, result_type=int
    )

path

file_ext

file_ext(col: ColT) -> Func

Returns the extension of the given path.

Parameters:

col (str | literal) –

String to compute the file extension of. If a string is provided, it is assumed to be the name of the column. If a literal is provided, it is assumed to be a string literal. If a Func is provided, it is assumed to be a function returning a string.

Returns:

Func ( Func ) –

A Func object that represents the file extension function.

Example

dc.mutate(
    file_ext=func.path.file_ext("file.path"),
)

Note

Result column will always be of type string.

Source code in datachain/func/path.py

def file_ext(col: ColT) -> Func:
    """
    Returns the extension of the given path.

    Args:
        col (str | literal): String to compute the file extension of.
            If a string is provided, it is assumed to be the name of the column.
            If a literal is provided, it is assumed to be a string literal.
            If a Func is provided, it is assumed to be a function returning a string.

    Returns:
        Func: A Func object that represents the file extension function.

    Example:
        ```py
        dc.mutate(
            file_ext=func.path.file_ext("file.path"),
        )
        ```

    Note:
        - Result column will always be of type string.
    """

    return Func("file_ext", inner=path.file_ext, cols=[col], result_type=str)

file_stem

file_stem(col: ColT) -> Func

Returns the path without the extension.

Parameters:

col (str | literal) –

String to compute the file stem of. If a string is provided, it is assumed to be the name of the column. If a literal is provided, it is assumed to be a string literal. If a Func is provided, it is assumed to be a function returning a string.

Returns:

Func ( Func ) –

A Func object that represents the file stem function.

Example

dc.mutate(
    file_stem=func.path.file_stem("file.path"),
)

Note

Result column will always be of type string.

Source code in datachain/func/path.py

def file_stem(col: ColT) -> Func:
    """
    Returns the path without the extension.

    Args:
        col (str | literal): String to compute the file stem of.
            If a string is provided, it is assumed to be the name of the column.
            If a literal is provided, it is assumed to be a string literal.
            If a Func is provided, it is assumed to be a function returning a string.

    Returns:
        Func: A Func object that represents the file stem function.

    Example:
        ```py
        dc.mutate(
            file_stem=func.path.file_stem("file.path"),
        )
        ```

    Note:
        - Result column will always be of type string.
    """

    return Func("file_stem", inner=path.file_stem, cols=[col], result_type=str)

name

name(col: ColT) -> Func

Returns the final component of a posix-style path.

Parameters:

col (str | literal) –

String to compute the path name of. If a string is provided, it is assumed to be the name of the column. If a literal is provided, it is assumed to be a string literal. If a Func is provided, it is assumed to be a function returning a string.

Returns:

Func ( Func ) –

A Func object that represents the path name function.

Example

dc.mutate(
    file_name=func.path.name("file.path"),
)

Note

Result column will always be of type string.

Source code in datachain/func/path.py

def name(col: ColT) -> Func:
    """
    Returns the final component of a posix-style path.

    Args:
        col (str | literal): String to compute the path name of.
            If a string is provided, it is assumed to be the name of the column.
            If a literal is provided, it is assumed to be a string literal.
            If a Func is provided, it is assumed to be a function returning a string.

    Returns:
        Func: A Func object that represents the path name function.

    Example:
        ```py
        dc.mutate(
            file_name=func.path.name("file.path"),
        )
        ```

    Note:
        - Result column will always be of type string.
    """

    return Func("name", inner=path.name, cols=[col], result_type=str)

parent

parent(col: ColT) -> Func

Returns the directory component of a posix-style path.

Parameters:

col (str | literal | Func) –

String to compute the path parent of. If a string is provided, it is assumed to be the name of the column. If a literal is provided, it is assumed to be a string literal. If a Func is provided, it is assumed to be a function returning a string.

Returns:

Func ( Func ) –

A Func object that represents the path parent function.

Example

dc.mutate(
    parent=func.path.parent("file.path"),
)

Note

Result column will always be of type string.
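
The four path helpers in this section (file_ext, file_stem, name, parent) have close analogues in Python's posixpath module, which can serve as a mental model. This is only an analogy: the text here does not say whether file_ext keeps the leading dot or how multi-part extensions are treated, so the DataChain results may differ from what posixpath returns.

```python
import posixpath

path = "dir/sub/image.tar.gz"

name = posixpath.basename(path)       # final component of the path
parent = posixpath.dirname(path)      # directory component of the path
stem, ext = posixpath.splitext(path)  # split off the last extension only

print(name)    # image.tar.gz
print(parent)  # dir/sub
print(stem)    # dir/sub/image.tar
print(ext)     # .gz
```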

Source code in datachain/func/path.py

def parent(col: ColT) -> Func:
    """
    Returns the directory component of a posix-style path.

    Args:
        col (str | literal | Func): String to compute the path parent of.
            If a string is provided, it is assumed to be the name of the column.
            If a literal is provided, it is assumed to be a string literal.
            If a Func is provided, it is assumed to be a function returning a string.

    Returns:
        Func: A Func object that represents the path parent function.

    Example:
        ```py
        dc.mutate(
            parent=func.path.parent("file.path"),
        )
        ```

    Note:
        - Result column will always be of type string.
    """
    return Func("parent", inner=path.parent, cols=[col], result_type=str)

string

byte_hamming_distance

byte_hamming_distance(*args: Union[str, Func]) -> Func

Computes the Hamming distance between two strings.

The Hamming distance is the number of positions at which the corresponding characters are different. This function returns the dissimilarity between the strings, where 0 indicates identical strings and values closer to the length of the strings indicate higher dissimilarity.

Parameters:

args (str | literal, default: () ) –

Two strings to compute the Hamming distance between. If a str is provided, it is assumed to be the name of the column. If a Literal is provided, it is assumed to be a string literal.

Returns:

Func ( Func ) –

A Func object that represents the Hamming distance function.

Example

dc.mutate(
    ham_dist=func.byte_hamming_distance("file.phash", literal("hello")),
)

Notes

Result column will always be of type int.
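
The definition above can be sketched in plain Python as a count of differing byte positions. This is an illustration, not the library's implementation; in particular, rejecting inputs of unequal length is an assumption of this sketch, and the real function may handle that case differently.

```python
def byte_hamming_distance(a: str, b: str) -> int:
    """Count differing byte positions of two strings of equal encoded length."""
    left, right = a.encode("utf-8"), b.encode("utf-8")
    if len(left) != len(right):
        # Assumption: this sketch rejects unequal lengths; the real function
        # may handle them differently.
        raise ValueError("inputs must have the same length in bytes")
    return sum(x != y for x, y in zip(left, right))


different = byte_hamming_distance("karolin", "kathrin")
identical = byte_hamming_distance("hello", "hello")
print(different)  # 3
print(identical)  # 0
```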

Source code in datachain/func/string.py

def byte_hamming_distance(*args: Union[str, Func]) -> Func:
    """
    Computes the Hamming distance between two strings.

    The Hamming distance is the number of positions at which the corresponding
    characters are different. This function returns the dissimilarity between
    the strings, where 0 indicates identical strings and values closer to the length
    of the strings indicate higher dissimilarity.

    Args:
        args (str | literal): Two strings to compute the Hamming distance between.
            If a str is provided, it is assumed to be the name of the column.
            If a Literal is provided, it is assumed to be a string literal.

    Returns:
        Func: A Func object that represents the Hamming distance function.

    Example:
        ```py
        dc.mutate(
            ham_dist=func.byte_hamming_distance("file.phash", literal("hello")),
        )
        ```

    Notes:
        - Result column will always be of type int.
    """
    cols, func_args = [], []
    for arg in args:
        if get_origin(arg) is literal:
            func_args.append(arg)
        else:
            cols.append(arg)

    if len(cols) + len(func_args) != 2:
        raise ValueError("byte_hamming_distance() requires exactly two arguments")

    return Func(
        "byte_hamming_distance",
        inner=string.byte_hamming_distance,
        cols=cols,
        args=func_args,
        result_type=int,
    )

length

length(col: Union[str, Func]) -> Func

Returns the length of the string.

Parameters:

col (str | literal | Func) –

String to compute the length of. If a string is provided, it is assumed to be the name of the column. If a literal is provided, it is assumed to be a string literal. If a Func is provided, it is assumed to be a function returning a string.

Returns:

Func ( Func ) –

A Func object that represents the string length function.

Example

dc.mutate(
    len1=func.string.length("file.path"),
    len2=func.string.length("Random string"),
)

Note

Result column will always be of type int.

Source code in datachain/func/string.py

def length(col: Union[str, Func]) -> Func:
    """
    Returns the length of the string.

    Args:
        col (str | literal | Func): String to compute the length of.
            If a string is provided, it is assumed to be the name of the column.
            If a literal is provided, it is assumed to be a string literal.
            If a Func is provided, it is assumed to be a function returning a string.

    Returns:
        Func: A Func object that represents the string length function.

    Example:
        ```py
        dc.mutate(
            len1=func.string.length("file.path"),
            len2=func.string.length("Random string"),
        )
        ```

    Note:
        - Result column will always be of type int.
    """
    return Func("length", inner=string.length, cols=[col], result_type=int)

regexp_replace

regexp_replace(
    col: Union[str, Func], regex: str, replacement: str
) -> Func

Replaces substrings that match a regular expression.

Parameters:

col (str | literal) –

Column to apply the replacement to. If a string is provided, it is assumed to be the name of the column. If a literal is provided, it is assumed to be a string literal. If a Func is provided, it is assumed to be a function returning a string.
regex (str) –

Regular expression pattern to replace.
replacement (str) –

Replacement string.

Returns:

Func ( Func ) –

A Func object that represents the regexp_replace function.

Example

dc.mutate(
    signal=func.string.regexp_replace("signal.name", r"\d+", "X"),
)

Note

Result column will always be of type string.
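
The effect of the pattern used in the example can be previewed with Python's re module. This is a sketch on hypothetical values; the regular-expression dialect of the backing database may differ from Python's in some details.

```python
import re

# Hypothetical values of a "signal.name" column.
names = ["sensor12", "cam3_v20", "plain"]

# Replace every run of digits with "X", mirroring the pattern in the example.
replaced = [re.sub(r"\d+", "X", name) for name in names]
print(replaced)  # ['sensorX', 'camX_vX', 'plain']
```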

Source code in datachain/func/string.py

def regexp_replace(col: Union[str, Func], regex: str, replacement: str) -> Func:
    r"""
    Replaces substrings that match a regular expression.

    Args:
        col (str | literal): Column to apply the replacement to.
            If a string is provided, it is assumed to be the name of the column.
            If a literal is provided, it is assumed to be a string literal.
            If a Func is provided, it is assumed to be a function returning a string.
        regex (str): Regular expression pattern to replace.
        replacement (str): Replacement string.

    Returns:
        Func: A Func object that represents the regexp_replace function.

    Example:
        ```py
        dc.mutate(
            signal=func.string.regexp_replace("signal.name", r"\d+", "X"),
        )
        ```

    Note:
        - Result column will always be of type string.
    """

    def inner(arg):
        return string.regexp_replace(arg, regex, replacement)

    if get_origin(col) is literal:
        cols = None
        args = [col]
    else:
        cols = [col]
        args = None

    return Func("regexp_replace", inner=inner, cols=cols, args=args, result_type=str)

replace

replace(
    col: Union[str, Func], pattern: str, replacement: str
) -> Func

Replaces substring with another string.

Parameters:

col (str | literal) –

Column to apply the replacement to. If a string is provided, it is assumed to be the name of the column. If a literal is provided, it is assumed to be a string literal. If a Func is provided, it is assumed to be a function returning a string.
pattern (str) –

Pattern to replace.
replacement (str) –

Replacement string.

Returns:

Func ( Func ) –

A Func object that represents the replace function.

Example

dc.mutate(
    signal=func.string.replace("signal.name", "pattern", "replacement"),
)

Note

Result column will always be of type string.

Source code in datachain/func/string.py

def replace(col: Union[str, Func], pattern: str, replacement: str) -> Func:
    """
    Replaces substring with another string.

    Args:
        col (str | literal): Column to apply the replacement to.
            If a string is provided, it is assumed to be the name of the column.
            If a literal is provided, it is assumed to be a string literal.
            If a Func is provided, it is assumed to be a function returning a string.
        pattern (str): Pattern to replace.
        replacement (str): Replacement string.

    Returns:
        Func: A Func object that represents the replace function.

    Example:
        ```py
        dc.mutate(
            signal=func.string.replace("signal.name", "pattern", "replacement"),
        )
        ```

    Note:
        - Result column will always be of type string.
    """

    def inner(arg):
        return string.replace(arg, pattern, replacement)

    if get_origin(col) is literal:
        cols = None
        args = [col]
    else:
        cols = [col]
        args = None

    return Func("replace", inner=inner, cols=cols, args=args, result_type=str)

split

split(
    col: Union[str, Func],
    sep: str,
    limit: Optional[int] = None,
) -> Func

Takes a column and a separator and returns an array of the parts.

Parameters:

col (str | literal) –

Column to split. If a string is provided, it is assumed to be the name of the column. If a literal is provided, it is assumed to be a string literal. If a Func is provided, it is assumed to be a function returning a string.
sep (str) –

Separator to split the string.
limit (int, default: None ) –

Maximum number of splits to perform.

Returns:

Func ( Func ) –

A Func object that represents the split function.

Example

dc.mutate(
    path_parts=func.string.split("file.path", "/"),
    str_words=func.string.split("Random string", " "),
)

Note

Result column will always be of type array of strings.
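
The effect of the limit parameter can be illustrated with Python's str.split. This sketch assumes that limit counts splits, as Python's maxsplit does and as the parameter description suggests; some SQL backends instead count the resulting parts, so the behaviour should be verified against the actual warehouse.

```python
path = "dir/sub/file.txt"

# No limit: split on every separator.
parts_all = path.split("/")
print(parts_all)  # ['dir', 'sub', 'file.txt']

# With a limit of 1, only the first separator is used (Python's maxsplit).
parts_limited = path.split("/", 1)
print(parts_limited)  # ['dir', 'sub/file.txt']
```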

Source code in datachain/func/string.py

def split(col: Union[str, Func], sep: str, limit: Optional[int] = None) -> Func:
    """
    Takes a column and a separator and returns an array of the parts.

    Args:
        col (str | literal): Column to split.
            If a string is provided, it is assumed to be the name of the column.
            If a literal is provided, it is assumed to be a string literal.
            If a Func is provided, it is assumed to be a function returning a string.
        sep (str): Separator to split the string.
        limit (int, optional): Maximum number of splits to perform.

    Returns:
        Func: A Func object that represents the split function.

    Example:
        ```py
        dc.mutate(
            path_parts=func.string.split("file.path", "/"),
            str_words=func.string.split("Random string", " "),
        )
        ```

    Note:
        - Result column will always be of type array of strings.
    """

    def inner(arg):
        if limit is not None:
            return string.split(arg, sep, limit)
        return string.split(arg, sep)

    if get_origin(col) is literal:
        cols = None
        args = [col]
    else:
        cols = [col]
        args = None

    return Func("split", inner=inner, cols=cols, args=args, result_type=list[str])