Functions
Use built-in functions for data manipulation and analysis to operate on the underlying database storing the chain data. These functions are useful for operations like DataChain.filter and DataChain.mutate. Import these functions from datachain.func.
func
and_
and_(*args: Union[ColumnElement, Func]) -> Func
Returns a function that produces the conjunction of expressions joined by the AND logical operator.
Parameters:
- args (ColumnElement | Func, default: ()) – The expressions for AND statement.
Returns:
- Func (Func) – A Func object that represents the and function.
Source code in datachain/func/conditional.py
any_value
any_value(col: str) -> Func
Returns the ANY_VALUE aggregate SQL function for the given column name.
The ANY_VALUE function returns an arbitrary value from the specified column. It is useful when you do not care which particular value is returned, as long as it comes from one of the rows in the group.
Parameters:
- col (str) – The name of the column from which to return an arbitrary value.
Returns:
- Func (Func) – A Func object that represents the ANY_VALUE aggregate function.
Notes
- The any_value function can be used with any type of column.
- Result column will have the same type as the input column.
- The result of any_value is non-deterministic, meaning it may return different values for different executions.
Source code in datachain/func/aggregate.py
avg
avg(col: str) -> Func
Returns the AVG aggregate SQL function for the given column name.
The AVG function returns the average of a numeric column in a table. It calculates the mean of all values in the specified column.
Parameters:
- col (str) – The name of the column for which to calculate the average.
Returns:
- Func (Func) – A Func object that represents the AVG aggregate function.
Notes
- The avg function should be used on numeric columns.
- Result column will always be of type float.
Source code in datachain/func/aggregate.py
bit_and
Computes the bitwise AND operation between two values.
Parameters:
- args (str | int, default: ()) – Two values to compute the bitwise AND operation between. If a string is provided, it is assumed to be the name of the column. If an integer is provided, it is assumed to be a constant value.
Returns:
- Func (Func) – A Func object that represents the bitwise AND function.
Notes
- Result column will always be of type int.
Source code in datachain/func/numeric.py
bit_hamming_distance
Computes the Hamming distance between the bit representations of two integer values.
The Hamming distance is the number of positions at which the corresponding bits are different. This function returns the dissimilarity between the integers, where 0 indicates identical integers and values closer to the number of bits in the integer indicate higher dissimilarity.
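As a plain-Python illustration of this definition (a sketch of the semantics only, not datachain's implementation; the helper name `bit_hamming` is hypothetical), the distance equals the number of set bits in the XOR of the two integers:

```python
def bit_hamming(a: int, b: int) -> int:
    """Count the bit positions at which two non-negative integers differ."""
    # XOR leaves a 1 in every position where the two inputs disagree.
    return bin(a ^ b).count("1")


print(bit_hamming(0b1010, 0b1000))  # the values differ in a single bit
print(bit_hamming(7, 7))            # identical integers
```

The sketch assumes non-negative inputs; how negative integers are handled depends on the database backend.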
Parameters:
- args (str | int, default: ()) – Two integers to compute the Hamming distance between. If a str is provided, it is assumed to be the name of the column. If an int is provided, it is assumed to be an integer literal.
Returns:
- Func (Func) – A Func object that represents the Hamming distance function.
Notes
- Result column will always be of type int.
Source code in datachain/func/numeric.py
bit_or
Computes the bitwise OR operation between two values.
Parameters:
- args (str | int, default: ()) – Two values to compute the bitwise OR operation between. If a string is provided, it is assumed to be the name of the column. If an integer is provided, it is assumed to be a constant value.
Returns:
- Func (Func) – A Func object that represents the bitwise OR function.
Notes
- Result column will always be of type int.
Source code in datachain/func/numeric.py
bit_xor
Computes the bitwise XOR operation between two values.
Parameters:
- args (str | int, default: ()) – Two values to compute the bitwise XOR operation between. If a string is provided, it is assumed to be the name of the column. If an integer is provided, it is assumed to be a constant value.
Returns:
- Func (Func) – A Func object that represents the bitwise XOR function.
Notes
- Result column will always be of type int.
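For intuition, the three bitwise functions (bit_and, bit_or, bit_xor) compute per row what Python's `&`, `|`, and `^` operators compute on two integers. The snippet below is a plain-Python sketch with made-up values and does not call datachain:

```python
a, b = 0b1100, 0b1010

print(bin(a & b))  # AND: bits set in both values
print(bin(a | b))  # OR: bits set in either value
print(bin(a ^ b))  # XOR: bits set in exactly one of the values
```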
Source code in datachain/func/numeric.py
byte_hamming_distance
Computes the Hamming distance between two strings.
The Hamming distance is the number of positions at which the corresponding characters are different. This function returns the dissimilarity between the strings, where 0 indicates identical strings and values closer to the length of the strings indicate higher dissimilarity.
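A minimal plain-Python sketch of this definition follows. The helper name `byte_hamming` is hypothetical, and rejecting strings of unequal length is an assumption of the sketch, not documented behavior of the library:

```python
def byte_hamming(s1: str, s2: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        # Assumption for this sketch only: unequal lengths are rejected.
        raise ValueError("strings must have the same length")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))


print(byte_hamming("karolin", "kathrin"))  # differs at three positions
print(byte_hamming("same", "same"))        # identical strings
```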
Parameters:
- args (str | literal, default: ()) – Two strings to compute the Hamming distance between. If a str is provided, it is assumed to be the name of the column. If a Literal is provided, it is assumed to be a string literal.
Returns:
- Func (Func) – A Func object that represents the Hamming distance function.
Notes
- Result column will always be of type int.
Source code in datachain/func/string.py
case
case(
    *args: tuple[Union[ColumnElement, Func, bool], CaseT],
    else_: Optional[CaseT] = None
) -> Func
Returns a case function that produces a CASE expression with a list of conditions and corresponding results. Results can be Python primitives such as strings, numbers, or booleans, but can also be nested functions (including case itself) or columns. The result type is inferred from the condition results.
Parameters:
- args (tuple((ColumnElement | Func | bool), (str | int | float | complex | bool, Func, ColumnElement)), default: ()) – Tuples, each pairing a condition with its value.
- else_ ((str | int | float | complex | bool, Func), default: None) – Optional else value in the case expression. If omitted and no case condition is satisfied, the result will be None (NULL in DB).
Returns:
- Func (Func) – A Func object that represents the case function.
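The evaluation rule can be sketched in plain Python. This assumes the usual SQL CASE semantics, where the first satisfied condition wins; the helper `case_value` is hypothetical and works on already-evaluated booleans for a single row, whereas the real function builds a database expression:

```python
from typing import Any, Optional


def case_value(*args: tuple[bool, Any], else_: Optional[Any] = None) -> Any:
    """Return the value paired with the first true condition, else `else_`."""
    for condition, value in args:
        if condition:
            return value
    # No condition matched: fall back to else_ (None maps to NULL in the DB).
    return else_


x = 5
print(case_value((x > 10, "big"), (x > 3, "medium"), else_="small"))
print(case_value((x > 10, "big")))  # no match and no else_ value
```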
Source code in datachain/func/conditional.py
collect
collect(col: str) -> Func
Returns the COLLECT aggregate SQL function for the given column name.
The COLLECT function gathers all values from the specified column into an array or similar structure. It is useful for combining values from a column into a collection, often for further processing or aggregation.
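To illustrate what such an aggregate produces per group, here is a plain-Python sketch over made-up rows. The column names `category` and `size` are hypothetical, and the order of values inside each array is not guaranteed by a database the way it is here:

```python
from collections import defaultdict

rows = [
    {"category": "cat", "size": 10},
    {"category": "dog", "size": 7},
    {"category": "cat", "size": 3},
]

# Gather every "size" value into a list, one list per "category" group.
collected = defaultdict(list)
for row in rows:
    collected[row["category"]].append(row["size"])

print(dict(collected))
```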
Parameters:
- col (str) – The name of the column from which to collect values.
Returns:
- Func (Func) – A Func object that represents the COLLECT aggregate function.
Notes
- The collect function can be used with numeric and string columns.
- Result column will have an array type.
Source code in datachain/func/aggregate.py
concat
concat(col: str, separator='') -> Func
Returns the CONCAT aggregate SQL function for the given column name.
The CONCAT function concatenates values from the specified column into a single string. It is useful for merging text values from multiple rows into a single combined value.
Parameters:
- col (str) – The name of the column from which to concatenate values.
- separator (str, default: '') – The separator to use between concatenated values. Defaults to an empty string.
Returns:
- Func (Func) – A Func object that represents the CONCAT aggregate function.
Example
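A plain-Python sketch of the result for one group of rows (the helper `concat_values` is hypothetical and only mirrors the documented `separator` behavior; it does not call datachain):

```python
def concat_values(values: list[str], separator: str = "") -> str:
    """Join the string values of one group into a single string."""
    return separator.join(values)


print(concat_values(["a", "b", "c"], "-"))
print(concat_values(["a", "b"]))  # default separator is an empty string
```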
Notes
- The concat function can be used with string columns.
- Result column will have a string type.
Source code in datachain/func/aggregate.py
contains
Checks whether the arr array has the elem element.
Parameters:
- arr (str | Sequence | Func) – Array to check for the element. If a string is provided, it is assumed to be the name of the array column. If a sequence is provided, it is assumed to be an array of values. If a Func is provided, it is assumed to be a function returning an array.
- elem (Any) – Element to check for in the array.
Returns:
- Func (Func) – A Func object that represents the contains function. Result of the function will be 1 if the element is present in the array, and 0 otherwise.
Example
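The per-row result can be sketched in plain Python. The helper `contains_flag` is hypothetical; it only mirrors the documented 1/0 return convention and does not call datachain:

```python
from typing import Any, Sequence


def contains_flag(arr: Sequence[Any], elem: Any) -> int:
    """Return 1 if elem is present in arr, and 0 otherwise."""
    return 1 if elem in arr else 0


print(contains_flag([1, 2, 3], 2))
print(contains_flag(["a", "b"], "z"))
```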
Source code in datachain/func/array.py
cosine_distance
Computes the cosine distance between two vectors.
The cosine distance is derived from the cosine similarity, which measures the angle between two vectors. This function returns the dissimilarity between the vectors, where 0 indicates identical vectors and values closer to 1 indicate higher dissimilarity.
Parameters:
- args (str | Sequence, default: ()) – Two vectors to compute the cosine distance between. If a string is provided, it is assumed to be the name of the column vector. If a sequence is provided, it is assumed to be a vector of values.
Returns:
- Func (Func) – A Func object that represents the cosine_distance function.
Example
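A plain-Python sketch of the underlying formula, assuming the common definition of cosine distance as one minus cosine similarity (the helper `cosine_dist` is hypothetical and is not datachain's implementation):

```python
import math
from typing import Sequence


def cosine_dist(v1: Sequence[float], v2: Sequence[float]) -> float:
    """Return 1 - cos(angle) between two non-zero vectors of equal length."""
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same number of elements")
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    # Zero vectors are not handled here; they would divide by zero.
    return 1.0 - dot / (norm1 * norm2)


print(cosine_dist([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine_dist([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```

Under this definition, vectors pointing in opposite directions give a distance of 2, so values above 1 are possible when components can be negative.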
Notes
- Ensure both vectors have the same number of elements.
- Result column will always be of type float.
Source code in datachain/func/array.py
count
Returns the COUNT aggregate SQL function for the given column name.
The COUNT function returns the number of rows in a table.
Parameters:
- col (str, default: None) – The name of the column for which to count rows. If not provided, it defaults to counting all rows.
Returns:
- Func (Func) – A Func object that represents the COUNT aggregate function.
Notes
- Result column will always be of type int.
Source code in datachain/func/aggregate.py
dense_rank
Returns the DENSE_RANK window function for SQL queries.
The DENSE_RANK function assigns a rank to each row within a partition of a result set, without gaps in the ranking for ties. Rows with equal values receive the same rank, but the next rank is assigned consecutively (i.e., if two rows are ranked 1, the next row will be ranked 2).
Returns:
- Func (Func) – A Func object that represents the DENSE_RANK window function.
Example
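The ranking rule within a single partition can be sketched in plain Python. The helper `dense_rank_values` is hypothetical and assumes ascending order; in a real query the order comes from the window definition:

```python
def dense_rank_values(values: list[int]) -> list[int]:
    """Dense-rank values in ascending order: ties share a rank, no gaps."""
    # Each distinct value gets the next consecutive rank.
    rank_of = {v: i + 1 for i, v in enumerate(sorted(set(values)))}
    return [rank_of[v] for v in values]


print(dense_rank_values([10, 10, 20, 30]))
```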
Notes
- The result column will always be of type int.
- The DENSE_RANK function differs from RANK in that it does not leave gaps in the ranking for tied values.
Source code in datachain/func/aggregate.py
euclidean_distance
Computes the Euclidean distance between two vectors.
The Euclidean distance is the straight-line distance between two points in Euclidean space. This function returns the distance between the two vectors.
Parameters:
- args (str | Sequence, default: ()) – Two vectors to compute the Euclidean distance between. If a string is provided, it is assumed to be the name of the column vector. If a sequence is provided, it is assumed to be a vector of values.
Returns:
- Func (Func) – A Func object that represents the euclidean_distance function.
Example
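A plain-Python sketch of the formula (the helper `euclidean_dist` is hypothetical and does not call datachain; the standard library's `math.dist` computes the same quantity):

```python
import math
from typing import Sequence


def euclidean_dist(v1: Sequence[float], v2: Sequence[float]) -> float:
    """Return the straight-line distance between two equal-length vectors."""
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same number of elements")
    # Square root of the sum of squared component differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))


print(euclidean_dist([0.0, 0.0], [3.0, 4.0]))
```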
Notes
- Ensure both vectors have the same number of elements.
- Result column will always be of type float.
Source code in datachain/func/array.py
first
first(col: str) -> Func
Returns the FIRST_VALUE window function for SQL queries.
The FIRST_VALUE function returns the first value in an ordered set of values within a partition. The first value is determined by the specified order and can be useful for retrieving the leading value in a group of rows.
Parameters:
- col (str) – The name of the column from which to retrieve the first value.
Returns:
- Func (Func) – A Func object that represents the FIRST_VALUE window function.
Example
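The selection rule can be sketched in plain Python over made-up rows. The helper `first_per_partition` and the column names are hypothetical; a real window function attaches the value to every row of its partition, while this sketch returns one value per partition for brevity:

```python
rows = [
    {"category": "cat", "created_at": 3, "name": "c.jpg"},
    {"category": "cat", "created_at": 1, "name": "a.jpg"},
    {"category": "dog", "created_at": 2, "name": "b.jpg"},
]


def first_per_partition(rows, partition_by, order_by, col):
    """Map each partition key to the `col` value of its first ordered row."""
    result = {}
    for row in sorted(rows, key=lambda r: r[order_by]):
        # setdefault keeps only the earliest row seen for each partition.
        result.setdefault(row[partition_by], row[col])
    return result


print(first_per_partition(rows, "category", "created_at", "name"))
```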
Note
- The result of first_value will always reflect the value of the first row in the specified order.
- The result column will have the same type as the input column.
Source code in datachain/func/aggregate.py
greatest
Returns the greatest (largest) value from the given input values.
Parameters:
- args (ColT | str | int | float | Sequence, default: ()) – The values to compare. If a string is provided, it is assumed to be the name of the column. If a Func is provided, it is assumed to be a function returning a value. If an int, float, or Sequence is provided, it is assumed to be a literal.
Returns:
- Func (Func) – A Func object that represents the greatest function.
Note
- Result column will always be of the same type as the input columns.
Source code in datachain/func/conditional.py
ifelse
ifelse(
    condition: Union[ColumnElement, Func],
    if_val: CaseT,
    else_val: CaseT,
) -> Func
Returns an ifelse function that produces an if expression with a condition and values for the true and false outcomes. Results can be Python primitives such as strings, numbers, or booleans, but can also be nested functions or columns. The result type is inferred from the values.
Parameters:
- condition ((ColumnElement, Func)) – Condition which is evaluated.
- if_val ((str | int | float | complex | bool, Func, ColumnElement)) – Value for true condition outcome.
- else_val ((str | int | float | complex | bool, Func, ColumnElement)) – Value for false condition outcome.
Returns:
- Func (Func) – A Func object that represents the ifelse function.
Source code in datachain/func/conditional.py
int_hash_64
Returns the 64-bit hash of an integer.
Parameters:
- col (str | int) – Integer to compute the hash of. If a string is provided, it is assumed to be the name of the column. If an int is provided, it is assumed to be an int literal. If a Func is provided, it is assumed to be a function returning an int.
Returns:
- Func (Func) – A Func object that represents the 64-bit hash function.
Note
- Result column will always be of type int.
Source code in datachain/func/numeric.py
isnone
Returns True if column value is None, otherwise False.
Parameters:
- col (str | Column) – Column to check if it's None or not. If a string is provided, it is assumed to be the name of the column.
Returns:
- Func (Func) – A Func object that represents the conditional to check if column is None.
Source code in datachain/func/conditional.py
least
Returns the least (smallest) value from the given input values.
Parameters:
- args (ColT | str | int | float | Sequence, default: ()) – The values to compare. If a string is provided, it is assumed to be the name of the column. If a Func is provided, it is assumed to be a function returning a value. If an int, float, or Sequence is provided, it is assumed to be a literal.
Returns:
- Func (Func) – A Func object that represents the least function.
Note
- Result column will always be of the same type as the input columns.
Source code in datachain/func/conditional.py
length
Returns the length of the array.
Parameters:
- arg (str | Sequence | Func) – Array to compute the length of. If a string is provided, it is assumed to be the name of the array column. If a sequence is provided, it is assumed to be an array of values. If a Func is provided, it is assumed to be a function returning an array.
Returns:
- Func (Func) – A Func object that represents the array length function.
Note
- Result column will always be of type int.
Source code in datachain/func/array.py
max
max(col: str) -> Func
Returns the MAX aggregate SQL function for the given column name.
The MAX function returns the largest value in the specified column. It can be used on both numeric and non-numeric columns to find the maximum value.
Parameters:
- col (str) – The name of the column for which to find the maximum value.
Returns:
- Func (Func) – A Func object that represents the MAX aggregate function.
Notes
- The max function can be used with numeric, date, and string columns.
- Result column will have the same type as the input column.
Source code in datachain/func/aggregate.py
min
min(col: str) -> Func
Returns the MIN aggregate SQL function for the given column name.
The MIN function returns the smallest value in the specified column. It can be used on both numeric and non-numeric columns to find the minimum value.
Parameters:
- col (str) – The name of the column for which to find the minimum value.
Returns:
- Func (Func) – A Func object that represents the MIN aggregate function.
Notes
- The min function can be used with numeric, date, and string columns.
- Result column will have the same type as the input column.
Source code in datachain/func/aggregate.py
or_
or_(*args: Union[ColumnElement, Func]) -> Func
Returns a function that produces the disjunction of expressions joined by the OR logical operator.
Parameters:
- args (ColumnElement | Func, default: ()) – The expressions for OR statement.
Returns:
- Func (Func) – A Func object that represents the or function.
Source code in datachain/func/conditional.py
rand
Returns a random integer value.
Returns:
- Func (Func) – A Func object that represents the rand function.
Note
- Result column will always be of type int.
Source code in datachain/func/random.py
rank
Returns the RANK window function for SQL queries.
The RANK function assigns a rank to each row within a partition of a result set, with gaps in the ranking for ties. Rows with equal values receive the same rank, and the next rank is skipped (i.e., if two rows are ranked 1, the next row is ranked 3).
Returns:
- Func (Func) – A Func object that represents the RANK window function.
Example
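The gap-leaving rule within a single partition can be sketched in plain Python. The helper `rank_values` is hypothetical and assumes ascending order; in a real query the order comes from the window definition:

```python
def rank_values(values: list[int]) -> list[int]:
    """Rank values in ascending order: ties share a rank and leave gaps."""
    ordered = sorted(values)
    # The rank is the 1-based position of the value's first occurrence
    # in sorted order, so tied values share it and later ranks skip ahead.
    return [ordered.index(v) + 1 for v in values]


print(rank_values([10, 10, 20, 30]))
```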
Notes
- The result column will always be of type int.
- The RANK function differs from ROW_NUMBER in that rows with the same value in the ordering column(s) receive the same rank.
Source code in datachain/func/aggregate.py
row_number
Returns the ROW_NUMBER window function for SQL queries.
The ROW_NUMBER function assigns a unique sequential integer to rows within a partition of a result set, starting from 1 for the first row in each partition. It is commonly used to generate row numbers within partitions or ordered results.
Returns:
- Func (Func) – A Func object that represents the ROW_NUMBER window function.
Example
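Per-partition numbering can be sketched in plain Python over made-up rows. The helper `row_numbers` and the column names are hypothetical, and the sketch does not call datachain:

```python
from collections import defaultdict


def row_numbers(rows, partition_by, order_by):
    """Attach a 1-based sequential number to each row within its partition."""
    counters = defaultdict(int)
    numbered = []
    for row in sorted(rows, key=lambda r: r[order_by]):
        # Each partition keeps its own counter, starting from 1.
        counters[row[partition_by]] += 1
        numbered.append({**row, "row_number": counters[row[partition_by]]})
    return numbered


sample = [
    {"category": "cat", "created_at": 2},
    {"category": "dog", "created_at": 1},
    {"category": "cat", "created_at": 1},
]
for row in row_numbers(sample, "category", "created_at"):
    print(row)
```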
Note
- The result column will always be of type int.
Source code in datachain/func/aggregate.py
sip_hash_64
Computes the SipHash-64 hash of the array.
Parameters:
- arg (str | Sequence) – Array to compute the SipHash-64 hash of. If a string is provided, it is assumed to be the name of the array column. If a sequence is provided, it is assumed to be an array of values.
Returns:
- Func (Func) – A Func object that represents the sip_hash_64 function.
Note
- This function is only available for the ClickHouse warehouse.
- Result column will always be of type int.
Source code in datachain/func/array.py
sum
sum(col: str) -> Func
Returns the SUM aggregate SQL function for the given column name.
The SUM function returns the total sum of a numeric column in a table. It sums up all the values for the specified column.
Parameters:
- col (str) – The name of the column for which to calculate the sum.
Returns:
- Func (Func) – A Func object that represents the SUM aggregate function.
Notes
- The sum function should be used on numeric columns.
- Result column type will be the same as the input column type.