Aggregate Functions
Aggregate functions perform calculations on sets of values and return a single result.
aggregate
any_value
Returns the ANY_VALUE aggregate SQL function for the given column name.
The ANY_VALUE function returns an arbitrary value from the specified column. It is useful when you do not care which particular value is returned, as long as it comes from one of the rows in the group.
Parameters:
-
col
(str | Column
) βThe name of the column from which to return an arbitrary value. Column can be specified as a string or a
Column
object.
Returns:
-
Func
(Func
) βA Func object that represents the ANY_VALUE aggregate function.
Example
Notes
- The
any_value
function can be used with any type of column. - The result column will have the same type as the input column.
- The result of
any_value
is non-deterministic, meaning it may return different values for different executions.
Source code in datachain/func/aggregate.py
avg
Returns the AVG aggregate SQL function for the specified column.
The AVG function returns the average of a numeric column in a table. It calculates the mean of all values in the specified column.
Parameters:
-
col
(str | Column
) βThe name of the column for which to calculate the average. Column can be specified as a string or a
Column
object.
Returns:
-
Func
(Func
) βA Func object that represents the AVG aggregate function.
Example
Notes
- The
avg
function should be used on numeric columns. - The result column will always be of type float.
Source code in datachain/func/aggregate.py
collect
Returns the COLLECT aggregate SQL function for the given column name.
The COLLECT function gathers all values from the specified column into an array or similar structure. It is useful for combining values from a column into a collection, often for further processing or aggregation.
Parameters:
-
col
(str | Column
) βThe name of the column from which to collect values. Column can be specified as a string or a
Column
object.
Returns:
-
Func
(Func
) βA Func object that represents the COLLECT aggregate function.
Example
Notes
- The
collect
function can be used with numeric and string columns. - The result column will have an array type.
Source code in datachain/func/aggregate.py
concat
Returns the CONCAT aggregate SQL function for the given column name.
The CONCAT function concatenates values from the specified column into a single string. It is useful for merging text values from multiple rows into a single combined value.
Parameters:
-
col
(str | Column
) βThe name of the column from which to concatenate values. Column can be specified as a string or a
Column
object. -
separator
(str
, default:''
) βThe separator to use between concatenated values. Defaults to an empty string.
Returns:
-
Func
(Func
) βA Func object that represents the CONCAT aggregate function.
Example
Notes
- The
concat
function can be used with string columns. - The result column will have a string type.
Source code in datachain/func/aggregate.py
count
Returns a COUNT aggregate SQL function for the specified column.
The COUNT function returns the number of rows, optionally filtered by a specific column.
Parameters:
-
col
(str | Column
, default:None
) βThe column to count. If omitted, counts all rows. The column can be specified as a string or a
Column
object.
Returns:
-
Func
(Func
) βA
Func
object representing the COUNT aggregate function.
Example
Notes
- The result column will always have an integer type.
Source code in datachain/func/aggregate.py
dense_rank
Returns the DENSE_RANK window function for SQL queries.
The DENSE_RANK function assigns a rank to each row within a partition of a result set, without gaps in the ranking for ties. Rows with equal values receive the same rank, but the next rank is assigned consecutively (i.e., if two rows are ranked 1, the next row will be ranked 2).
Returns:
-
Func
(Func
) βA Func object that represents the DENSE_RANK window function.
Example
Notes
- The result column will always be of type int.
- The DENSE_RANK function differs from RANK in that it does not leave gaps in the ranking for tied values.
Source code in datachain/func/aggregate.py
first
Returns the FIRST_VALUE window function for SQL queries.
The FIRST_VALUE function returns the first value in an ordered set of values within a partition. The first value is determined by the specified order and can be useful for retrieving the leading value in a group of rows.
Parameters:
-
col
(str | Column
) βThe name of the column from which to retrieve the first value. Column can be specified as a string or a
Column
object.
Returns:
-
Func
(Func
) βA Func object that represents the FIRST_VALUE window function.
Example
Note
- The result of
first_value
will always reflect the value of the first row in the specified order. - The result column will have the same type as the input column.
Source code in datachain/func/aggregate.py
max
Returns the MAX aggregate SQL function for the given column name.
The MAX function returns the smallest value in the specified column. It can be used on both numeric and non-numeric columns to find the maximum value.
Parameters:
-
col
(str | Column
) βThe name of the column for which to find the maximum value. Column can be specified as a string or a
Column
object.
Returns:
-
Func
(Func
) βA Func object that represents the MAX aggregate function.
Example
Notes
- The
max
function can be used with numeric, date, and string columns. - The result column will have the same type as the input column.
Source code in datachain/func/aggregate.py
min
Returns the MIN aggregate SQL function for the specified column.
The MIN function returns the smallest value in the specified column. It can be used on both numeric and non-numeric columns to find the minimum value.
Parameters:
-
col
(str | Column
) βThe name of the column for which to find the minimum value. Column can be specified as a string or a
Column
object.
Returns:
-
Func
(Func
) βA Func object that represents the MIN aggregate function.
Example
Notes
- The
min
function can be used with numeric, date, and string columns. - The result column will have the same type as the input column.
Source code in datachain/func/aggregate.py
rank
Returns the RANK window function for SQL queries.
The RANK function assigns a rank to each row within a partition of a result set, with gaps in the ranking for ties. Rows with equal values receive the same rank, and the next rank is skipped (i.e., if two rows are ranked 1, the next row is ranked 3).
Returns:
-
Func
(Func
) βA Func object that represents the RANK window function.
Example
Notes
- The result column will always be of type int.
- The RANK function differs from ROW_NUMBER in that rows with the same value in the ordering column(s) receive the same rank.
Source code in datachain/func/aggregate.py
row_number
Returns the ROW_NUMBER window function for SQL queries.
The ROW_NUMBER function assigns a unique sequential integer to rows within a partition of a result set, starting from 1 for the first row in each partition. It is commonly used to generate row numbers within partitions or ordered results.
Returns:
-
Func
(Func
) βA Func object that represents the ROW_NUMBER window function.
Example
Note
- The result column will always be of type int.
Source code in datachain/func/aggregate.py
sum
Returns the SUM aggregate SQL function for the specified column.
The SUM function returns the total sum of a numeric column in a table. It sums up all the values for the specified column.
Parameters:
-
col
(str | Column
) βThe name of the column for which to calculate the sum. The column can be specified as a string or a
Column
object.
Returns:
-
Func
(Func
) βA
Func
object that represents the SUM aggregate function.
Example
Notes
- The
sum
function should be used on numeric columns. - The result column type will be the same as the input column type.