SQL
Use SQL functions to operate on the underlying database storing the chain data. Useful
for operations like DataChain.filter
and DataChain.mutate
. Import
these functions from datachain.sql.functions
.
avg
avg(col: str) -> Func
Returns the AVG aggregate SQL function for the given column name.
The AVG function returns the average of a numeric column in a table. It calculates the mean of all values in the specified column.
Parameters:
-
col
(str
) βThe name of the column for which to calculate the average.
Returns:
-
Func
(Func
) βA Func object that represents the AVG aggregate function.
Notes
- The
avg
function should be used on numeric columns. - Result column will always be of type float.
Source code in datachain/func/aggregate.py
count
Returns the COUNT aggregate SQL function for the given column name.
The COUNT function returns the number of rows in a table.
Parameters:
-
col
(str
, default:None
) βThe name of the column for which to count rows. If not provided, it defaults to counting all rows.
Returns:
-
Func
(Func
) βA Func object that represents the COUNT aggregate function.
Notes
- Result column will always be of type int.
Source code in datachain/func/aggregate.py
greatest
Returns the greatest (largest) value from the given input values.
Parameters:
-
args
(ColT | str | int | float | Sequence
, default:()
) βThe values to compare. If a string is provided, it is assumed to be the name of the column. If a Func is provided, it is assumed to be a function returning a value. If an int, float, or Sequence is provided, it is assumed to be a literal.
Returns:
-
Func
(Func
) βA Func object that represents the greatest function.
Note
- Result column will always be of the same type as the input columns.
Source code in datachain/func/conditional.py
least
Returns the least (smallest) value from the given input values.
Parameters:
-
args
(ColT | str | int | float | Sequence
, default:()
) βThe values to compare. If a string is provided, it is assumed to be the name of the column. If a Func is provided, it is assumed to be a function returning a value. If an int, float, or Sequence is provided, it is assumed to be a literal.
Returns:
-
Func
(Func
) βA Func object that represents the least function.
Note
- Result column will always be of the same type as the input columns.
Source code in datachain/func/conditional.py
max
max(col: str) -> Func
Returns the MAX aggregate SQL function for the given column name.
The MAX function returns the smallest value in the specified column. It can be used on both numeric and non-numeric columns to find the maximum value.
Parameters:
-
col
(str
) βThe name of the column for which to find the maximum value.
Returns:
-
Func
(Func
) βA Func object that represents the MAX aggregate function.
Notes
- The
max
function can be used with numeric, date, and string columns. - Result column will have the same type as the input column.
Source code in datachain/func/aggregate.py
min
min(col: str) -> Func
Returns the MIN aggregate SQL function for the given column name.
The MIN function returns the smallest value in the specified column. It can be used on both numeric and non-numeric columns to find the minimum value.
Parameters:
-
col
(str
) βThe name of the column for which to find the minimum value.
Returns:
-
Func
(Func
) βA Func object that represents the MIN aggregate function.
Notes
- The
min
function can be used with numeric, date, and string columns. - Result column will have the same type as the input column.
Source code in datachain/func/aggregate.py
rand
Returns the random integer value.
Returns:
-
Func
(Func
) βA Func object that represents the rand function.
Note
- Result column will always be of type integer.
Source code in datachain/func/random.py
sum
sum(col: str) -> Func
Returns the SUM aggregate SQL function for the given column name.
The SUM function returns the total sum of a numeric column in a table. It sums up all the values for the specified column.
Parameters:
-
col
(str
) βThe name of the column for which to calculate the sum.
Returns:
-
Func
(Func
) βA Func object that represents the SUM aggregate function.
Notes
- The
sum
function should be used on numeric columns. - Result column type will be the same as the input column type.
Source code in datachain/func/aggregate.py
array
cosine_distance
Computes the cosine distance between two vectors.
The cosine distance is derived from the cosine similarity, which measures the angle between two vectors. This function returns the dissimilarity between the vectors, where 0 indicates identical vectors and values closer to 1 indicate higher dissimilarity.
Parameters:
-
args
(str | Sequence
, default:()
) βTwo vectors to compute the cosine distance between. If a string is provided, it is assumed to be the name of the column vector. If a sequence is provided, it is assumed to be a vector of values.
Returns:
-
Func
(Func
) βA Func object that represents the cosine_distance function.
Example
Notes
- Ensure both vectors have the same number of elements.
- Result column will always be of type float.
Source code in datachain/func/array.py
euclidean_distance
Computes the Euclidean distance between two vectors.
The Euclidean distance is the straight-line distance between two points in Euclidean space. This function returns the distance between the two vectors.
Parameters:
-
args
(str | Sequence
, default:()
) βTwo vectors to compute the Euclidean distance between. If a string is provided, it is assumed to be the name of the column vector. If a sequence is provided, it is assumed to be a vector of values.
Returns:
-
Func
(Func
) βA Func object that represents the euclidean_distance function.
Example
Notes
- Ensure both vectors have the same number of elements.
- Result column will always be of type float.
Source code in datachain/func/array.py
length
Returns the length of the array.
Parameters:
-
arg
(str | Sequence | Func
) βArray to compute the length of. If a string is provided, it is assumed to be the name of the array column. If a sequence is provided, it is assumed to be an array of values. If a Func is provided, it is assumed to be a function returning an array.
Returns:
-
Func
(Func
) βA Func object that represents the array length function.
Example
Note
- Result column will always be of type int.
Source code in datachain/func/array.py
sip_hash_64
Computes the SipHash-64 hash of the array.
Parameters:
-
arg
(str | Sequence
) βArray to compute the SipHash-64 hash of. If a string is provided, it is assumed to be the name of the array column. If a sequence is provided, it is assumed to be an array of values.
Returns:
-
Func
(Func
) βA Func object that represents the sip_hash_64 function.
Example
Note
- This function is only available for the ClickHouse warehouse.
- Result column will always be of type int.
Source code in datachain/func/array.py
path
file_ext
Returns the extension of the given path.
Parameters:
-
col
(str | literal
) βString to compute the file extension of. If a string is provided, it is assumed to be the name of the column. If a literal is provided, it is assumed to be a string literal. If a Func is provided, it is assumed to be a function returning a string.
Returns:
-
Func
(Func
) βA Func object that represents the file extension function.
Note
- Result column will always be of type string.
Source code in datachain/func/path.py
file_stem
Returns the path without the extension.
Parameters:
-
col
(str | literal
) βString to compute the file stem of. If a string is provided, it is assumed to be the name of the column. If a literal is provided, it is assumed to be a string literal. If a Func is provided, it is assumed to be a function returning a string.
Returns:
-
Func
(Func
) βA Func object that represents the file stem function.
Note
- Result column will always be of type string.
Source code in datachain/func/path.py
name
Returns the final component of a posix-style path.
Parameters:
-
col
(str | literal
) βString to compute the path name of. If a string is provided, it is assumed to be the name of the column. If a literal is provided, it is assumed to be a string literal. If a Func is provided, it is assumed to be a function returning a string.
Returns:
-
Func
(Func
) βA Func object that represents the path name function.
Note
- Result column will always be of type string.
Source code in datachain/func/path.py
parent
Returns the directory component of a posix-style path.
Parameters:
-
col
(str | literal | Func
) βString to compute the path parent of. If a string is provided, it is assumed to be the name of the column. If a literal is provided, it is assumed to be a string literal. If a Func is provided, it is assumed to be a function returning a string.
Returns:
-
Func
(Func
) βA Func object that represents the path parent function.
Note
- Result column will always be of type string.
Source code in datachain/func/path.py
string
byte_hamming_distance
Computes the Hamming distance between two strings.
The Hamming distance is the number of positions at which the corresponding characters are different. This function returns the dissimilarity between the strings, where 0 indicates identical strings and values closer to the length of the strings indicate higher dissimilarity.
Parameters:
-
args
(str | literal
, default:()
) βTwo strings to compute the Hamming distance between. If a str is provided, it is assumed to be the name of the column. If a Literal is provided, it is assumed to be a string literal.
Returns:
-
Func
(Func
) βA Func object that represents the Hamming distance function.
Notes
- Result column will always be of type int.
Source code in datachain/func/string.py
length
Returns the length of the string.
Parameters:
-
col
(str | literal | Func
) βString to compute the length of. If a string is provided, it is assumed to be the name of the column. If a literal is provided, it is assumed to be a string literal. If a Func is provided, it is assumed to be a function returning a string.
Returns:
-
Func
(Func
) βA Func object that represents the string length function.
Example
Note
- Result column will always be of type int.
Source code in datachain/func/string.py
regexp_replace
Replaces substring that match a regular expression.
Parameters:
-
col
(str | literal
) βColumn to split. If a string is provided, it is assumed to be the name of the column. If a literal is provided, it is assumed to be a string literal. If a Func is provided, it is assumed to be a function returning a string.
-
regex
(str
) βRegular expression pattern to replace.
-
replacement
(str
) βReplacement string.
Returns:
-
Func
(Func
) βA Func object that represents the regexp_replace function.
Note
- Result column will always be of type string.
Source code in datachain/func/string.py
replace
Replaces substring with another string.
Parameters:
-
col
(str | literal
) βColumn to split. If a string is provided, it is assumed to be the name of the column. If a literal is provided, it is assumed to be a string literal. If a Func is provided, it is assumed to be a function returning a string.
-
pattern
(str
) βPattern to replace.
-
replacement
(str
) βReplacement string.
Returns:
-
Func
(Func
) βA Func object that represents the replace function.
Note
- Result column will always be of type string.
Source code in datachain/func/string.py
split
Takes a column and split character and returns an array of the parts.
Parameters:
-
col
(str | literal
) βColumn to split. If a string is provided, it is assumed to be the name of the column. If a literal is provided, it is assumed to be a string literal. If a Func is provided, it is assumed to be a function returning a string.
-
sep
(str
) βSeparator to split the string.
-
limit
(int
, default:None
) βMaximum number of splits to perform.
Returns:
-
Func
(Func
) βA Func object that represents the split function.
Example
Note
- Result column will always be of type array of strings.