The build definition file, usually named terrazzo.json
, is the means for communicating the steps that Terrazzo should carry out to produce a dataset for you. Think of the build definition as a way of “programming” your dataset in a simple, JSON-based language that stays as close as possible to plain English.
Not familiar with JSON? You can find a good introduction here.
This page describes the format of the build definition file by example and provides a complete reference for the properties and directives available for use.
Anatomy of a build definition
Let’s begin by exploring a real-world example of a Terrazzo build definition that is non-trivial in scope and exercises a good portion of Terrazzo’s build functionality.
The example below (nys/census-political-overlap@1.0.0
) attempts to link United States Census tracts to New York State election districts by their overlapping geographies. A census tract is a geographic district produced by the U.S. Census as a basis for reporting demographic and economic statistics. An election district is a fundamental geographic district within New York State that is used as a building block for political districting: City Council, State Assembly, and State Senate districts are all partitionings of NYS election districts.
These two datasets originate at different levels of government and do not share a common “surrogate key” (such as a numeric id) that can be used to join them. Instead, this build definition must generate a “cross join” of all possible combinations of census tract and election district, then filter out any pairs of districts that do not overlap. Finally, for those districts that do overlap, the build definition calculates the percentage of overlap relative to the total area of the election district.
The resulting dataset can be used to generate coarse demographics for the residents of an election district, by scaling the demographics of the overlapping census tracts relative to their overlap of the district.
The raw build definition follows. Scroll below it for a closer look at each section.
{
"name": "nys/census-political-overlap",
"version": "1.0.0",
"author": "Julian Graham <joolean@undecidable.net>"
"description": "List the overlap of 2020 US Census tracts with NYS election districts"
"plugins": ["terrazzo_geometry"],
"from": {"id": "us-national/census/nyct2020@1.0.0"},
"with": {
"eds": {"path": "nyed.parquet"}
},
"transform": [
{"operation": "renameColumn", "column": "geometry", "to": "ct_geometry"},
{"operation": "renameColumn", "in": "eds", "column": "geometry", "to": "ed_geometry"},
{"operation": "join", "type": "cross", "to": "eds", "on": []},
{
"operation": "filter",
"by": {
"fun": "geom_intersects",
"args": [{"col": "ed_geometry"}, {"col": "ct_geometry"}]
}
},
{
"operation": "compute",
"newColumn": "intersection_geometry",
"as": {
"fun": "geom_intersection",
"args": [{"col": "ed_geometry"}, {"col": "ct_geometry"}]
}
},
{
"operation": "compute",
"newColumn": "ed_area",
"as": {"fun": "geom_area", "args": [{"col": "ed_geometry"}]}
},
{
"operation": "compute",
"newColumn": "intersection_area",
"as": {
"fun": "geom_area",
"args": [{"col": "intersection_geometry"}]
}
},
{
"operation": "filter",
"by": {
"left": {"col": "intersection_area"},
"op": ">",
"right": {"lit": 0}
}
},
{
"operation": "compute",
"newColumn": "overlap_pct",
"as": {
"left": {"col": "intersection_area"},
"op": "/",
"right": {"col": "ed_area"}
}
},
{"operation": "dropColumns", "except": ["GEOID", "CDTANAME", "ElectDist", "overlap_pct"]}
]
}
Let’s start at the top!
The fields at the “head” of the definition provide some general metadata about what the dataset is for, who created it, and how it can be found in a Terrazzo repository:
"name": "nys/census-political-overlap",
"version": "1.0.0",
"author": "Julian Graham <joolean@undecidable.net>"
"description": "List the overlap of 2020 US Census tracts with NYS election districts"
Together, the name
and version
fields form the public id for this dataset: nys/census-political-overlap@1.0.0
.
The nys
in nys/census-political-overlap
is the collection. A collection is like a folder in a Terrazzo repository, and allows repository users to set permissions on groups of datasets. Every dataset name must specify the collection it belongs to. Authors are free to specify additional “folders” after the collection name but this is not required.
Next, let’s look at this plugins
declaration:
"plugins": ["terrazzo_geometry"],
This tells trzo
to load the Terrazzo Geometry plugin, which provides some geospatial functions we’ll be using shortly. Stay tuned.
The next two directives are very important:
"from": {"id": "us-national/census/nyct2020@1.0.0"},
"with": {
"eds": {"path": "nyed.parquet"}
},
from
describes where the primary dataset can be found. The primary dataset is the dataset that the current dataset is derived from, and the final state of the primary dataset is the “output” of the build. This from
declaration uses the id
property to indicate that the primary dataset is another Terrazzo dataset and can be downloaded from the default Terrazzo repository at terrazzo.dev
using the public id us-national/census/nyct2020@1.0.0
. See Sources of data for an explanation of different ways to refer to datasets for use in a build.
with
describes where input datasets can be found. Input datasets are additional datasets that are used during the build. Each input dataset has a unique name that can be used to refer to it later in the build. In this case, there is a single input dataset named eds
and it uses the path
property to refer to a file on the local filesystem named nyed.parquet
. This means that in order to build this dataset, you must have a copy of that file on your computer in the folder you pass as the CONTEXT
argument to trzo build
.
The rest of this build definition consists of transformations listed in the transform
array. A transformation is a change applied to a primary or input dataset that alters its structure or contents in some way. The transformations in a build definition are applied in the order they appear in the transform
array. The content of the primary dataset after the last transformation is applied becomes the “output” of the build.
Let’s look at the transformations applied by the build definition above.
The first thing that happens is that we rename a column in both the primary and input datasets:
{"operation": "renameColumn", "column": "geometry", "to": "ct_geometry"},
{"operation": "renameColumn", "in": "eds", "column": "geometry", "to": "ed_geometry"},
Every transformation has an operation
property that identifies the type of transformation to be performed. The renameColumn
operation needs to know the name of the column being renamed (column
) and what its new name should be (to
). renameColumn
also needs to know which dataset the target column lives in. The presence of the in
property indicates that the column belongs to the named input dataset; omitting this property indicates that the column belongs to the primary dataset.
These rename transformations are necessary because both the census tract data and the election district data originate from Shapefile databases that have been converted to Parquet format. As part of the conversion, the geospatial data that describes the shape of the districts is stored in a column named geometry
. Because we’re going to combine these two datasets in the next step, we need to ensure that the column names do not conflict with each other.
Now we cross join the primary dataset with the election district data:
{"operation": "join", "type": "cross", "to": "eds", "on": []},
This may look a little strange; it’s more common to join a dataset or table by matching one or more shared columns. But these two datasets don’t have any columns that are naturally equal between them. Instead, we’re hoping to find the pairs of districts whose geometries overlap with each other, and the cross join helps us do this by generating the Cartesian product of the two datasets, i.e., every possible combination of records. The result is quite large, but that’s not a problem for Terrazzo, and we’ll be shrinking it down considerably in the next step.
{
"operation": "filter",
"by": {
"fun": "geom_intersects",
"args": [{"col": "ed_geometry"}, {"col": "ct_geometry"}]
}
},
This filter
transformation tests every row in the primary dataset by applying the geom_intersects
function to the two geometry columns we renamed earlier. geom_intersects
returns a boolean value according to whether its arguments intersect, and the filter throws away any rows for which the function doesn’t return true
. This is also an example of using column expressions to refer to the value of a column across every row in a dataset.
Next, we use the compute
transformation to generate some additional information about the intersecting districts.
{
"operation": "compute",
"newColumn": "intersection_geometry",
"as": {
"fun": "geom_intersection",
"args": [{"col": "ed_geometry"}, {"col": "ct_geometry"}]
}
},
{
"operation": "compute",
"newColumn": "ed_area",
"as": {"fun": "geom_area", "args": [{"col": "ed_geometry"}]}
},
{
"operation": "compute",
"newColumn": "intersection_area",
"as": {
"fun": "geom_area",
"args": [{"col": "intersection_geometry"}]
}
},
Each of the three transformations above adds a new column to the primary dataset by applying a function from the Terrazzo Geometry plugin across all of the rows in the dataset. The first new column is created by applying geom_intersection
to the two district geometry columns; we already know they intersect, now we want to see the exact boundaries of their intersection. Then we use geom_area
twice: First to calculate the area of the election district, then to calculate the area of the intersection.
We’ll be using those two new area columns to calculate the percentage of the election district’s area that falls within each census tract. But first we need to do a little bit of housekeeping:
{
"operation": "filter",
"by": {
"left": {"col": "intersection_area"},
"op": ">",
"right": {"lit": 0}
}
},
This filter operation, which throws away records with an empty intersection area, is necessary because even though we previously filtered to include only pairs of districts that overlap, a few districts with an empty intersection area somehow pass the filter. (This could be due to quirks in the geometry of the source datasets, which were produced by different government agencies.)
This is also an example of a binary expression, where a left and a right operand are combined with an operator, in this case a “greater than” comparison. Note also the literal expression, which is a way of including an explicit, static value in an expression, independent of the dynamic value of any columns.
The next computation produces the really interesting datapoint: The percentage of the election district that overlaps the census tract.
{
"operation": "compute",
"newColumn": "overlap_pct",
"as": {
"left": {"col": "intersection_area"},
"op": "/",
"right": {"col": "ed_area"}
}
},
This is another binary expression, but instead of performing a comparison, we’re dividing the area of intersection by the total area of the election district to produce a value for the new overlap_pct
column.
Finally we do a little bit of cleanup on the primary dataset to throw away all of the columns except the ones we want to appear in the final schema.
{"operation": "dropColumns", "except": ["GEOID", "CDTANAME", "ElectDist", "overlap_pct"]}
And we’re done! If you were to run trzo dump
on the resulting dataset, you’d see output like this:
GEOID,CDTANAME,ElectDist,overlap_pct
36061001402,MN03 Lower East Side-Chinatown (CD 3 Equivalent),65030,0.29005219366950663
36061001402,MN03 Lower East Side-Chinatown (CD 3 Equivalent),65031,0.43683679587189234
36061001402,MN03 Lower East Side-Chinatown (CD 3 Equivalent),65033,0.41387083670618263
36061001402,MN03 Lower East Side-Chinatown (CD 3 Equivalent),65034,0.4205040699342719
Build definition reference
The following section is a complete reference to the JSON schema for a Terrazzo build definition.
Header fields
The build definition begins with four properties that provide some foundational metadata for the dataset. These properties are also copied into the manifest for the dataset once it is built.
Property | Required | Type | Description |
---|---|---|---|
name | Yes | string | The “name” portion of the dataset’s public id |
version | Yes | string | The “version” portion of the dataset’s public id |
author | | string | The email address of the dataset’s author |
description | | string | A description of the content, or explanatory notes |
See the Concepts page for an explanation of public ids.
In particular, the name
property must have a “path-like” structure such that it is made up of multiple segments separated by forward-slashes, where each segment begins with a letter and is followed by a combination of zero or more letters, numbers, hyphens, and underscores. The first segment in a dataset name is its “collection,” a grouping used by a Terrazzo repository to control access to datasets.
- abc/easy-as-123 is a valid dataset name
- abc/123 is not a valid dataset name because the second part does not begin with a letter
- easy-as-123 is not a valid dataset name because it does not have a collection
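A name may also include additional folder segments after the collection. For example, this hypothetical name places the dataset in a geo folder within the abc collection:
"name": "abc/geo/easy-as-123"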
Plugins
In order to allow Terrazzo to be extended with additional functionality, the build system exposes a plugin API for registering “scalar functions,” that is, functions that are invoked for individual records in a dataset and which produce a single value as output. One plugin can register many scalar functions, making them available for invocation as part of function expressions in a compute or filter transformation.
A plugin must be declared in the build definition and loaded by Terrazzo before it can supply its functions to a build. Use the plugins
property to list the plugins containing the functions needed by your build.
The naming conventions for plugin library filenames are different on different operating systems. For example, the Terrazzo Geometry plugin has the following filenames on Linux, MacOS and Windows:
- Linux: libterrazzo_geometry.so
- MacOS: libterrazzo_geometry.dylib
- Windows: terrazzo_geometry.dll
As an affordance, to avoid special-casing your plugin declarations for different operating systems, Terrazzo lets you identify a plugin by its “library name,” i.e., without the lib
prefix or file extension. Terrazzo delegates the process of searching for a plugin library file to the underlying operating system. That means that certain environment variables will affect where plugins will be looked for. See the Installation for more information on how this works.
See the “Hello, world” example plugin source code to learn more about writing your own plugins for Terrazzo.
If your build definition does not invoke any user-defined functions, you may omit the plugins
property.
Examples
Loading the Terrazzo Geometry Plugin
"plugins": ["terrazzo_geometry"]
Sources of data
There are two ways to pull Parquet data into a Terrazzo build definition: you can include data as a primary dataset or as an input dataset. The primary dataset is the foundation from which a new dataset is derived and, over the course of being transformed during the build, becomes the content of the new dataset. Input datasets are named sources of data you can use to complement the primary dataset in various ways, often via a join transformation.
For both primary and input datasets, there are different ways to “reference” the data you wish to include. Refer to a dataset stored as a Parquet file on the local filesystem using a file reference:
Property | Required | Type | Description |
---|---|---|---|
path | Yes | string | The path to the file, relative to the build context |
sortColumns | No | array | The columns by which the dataset is sorted, if known |
If sortColumns
is given, it must be an array of “sort descriptors.” See sort for more information.
Refer to another Terrazzo dataset hosted by a Terrazzo repository using a manifest reference:
Property | Required | Type | Description |
---|---|---|---|
id | Yes | string | The public id of the dataset, in name@version form |
repository | No | string | The base URI of the repository hosting the dataset |
If repository
is omitted, it is assumed to be the base URI of the default Terrazzo repository, https://terrazzo.dev
.
Refer to a dataset available as Parquet data that can be fetched from a specified URI over HTTP using a URI reference:
Property | Required | Type | Description |
---|---|---|---|
uri | Yes | string | The URI of the dataset |
sortColumns | No | array | The columns by which the dataset is sorted, if known |
If sortColumns
is given, it must be an array of “sort descriptors.” See sort for more information.
from
Use the from
property to specify the source of the primary dataset. Every build definition must have a from
.
Using a file reference:
"from": {
  "path": "/opt/data/phonebook.parquet",
  "sortColumns": [
    {"column": "last_name", "dir": "asc"},
    {"column": "first_name", "dir": "asc"}
  ]
}
Using a manifest reference:
"from": {
  "id": "nynex/phonebook@1.0.0",
  "repository": "https://terrazzo.dev"
}
Using a URI reference:
"from": {
  "uri": "https://my-website.net/data/phonebook.parquet",
  "sortColumns": [
    {"column": "last_name", "dir": "asc"},
    {"column": "first_name", "dir": "asc"}
  ]
}
with
Use the with
property to specify the sources of any input datasets, along with the names that should be used to refer to them elsewhere in the build definition.
Using file references:
"with": {
  "phones": {
    "path": "/opt/data/phonebook.parquet",
    "sortColumns": [
      {"column": "last_name", "dir": "asc"},
      {"column": "first_name", "dir": "asc"}
    ]
  },
  "addresses": {"path": "/opt/data/addressbook.parquet"}
}
Using manifest references:
"with": {
  "phones": {
    "id": "nynex/phonebook@1.0.0",
    "repository": "https://terrazzo.dev"
  },
  "addresses": {"id": "rebny/addressbook@1.0.0"}
}
Using URI references:
"with": {
  "phones": {
    "uri": "https://my-website.net/data/phonebook.parquet",
    "sortColumns": [
      {"column": "last_name", "dir": "asc"},
      {"column": "first_name", "dir": "asc"}
    ]
  },
  "addresses": {
    "uri": "https://my-other-website.net/data/addressbook.parquet"
  }
}
Within the mapping for the with
property, you may use any of the different dataset reference types. You may omit the with
property if you are not using any input datasets.
Transform
The following section describes the different types of transformations that may be applied to primary and input datasets in a build definition. Transformations go into the transform
array and are applied in the order they appear in the array. Each transformation directive is a JSON object of the form:
{
"operation": [OPERATION TYPE],
...
}
where the operation
property identifies the type of transformation.
aggregate
The aggregate
transformation groups the rows of a dataset by the values of one or more columns and computes one or more aggregate values for each group.
Properties
Name | Required | Type | Description |
---|---|---|---|
columns | Yes | object | A mapping of new column names to aggregate expressions |
groupBy | | array | An array of column names defining the groups |
in | No | string | The dataset over which the aggregation should be performed |
The columns
object maps new column names to aggregate expressions: Applications of aggregate functions over segments of data that share the same values for the columns in the groupBy
array. Expressions that cannot be evaluated over a row group may not appear at the top level of an expression in the columns
mapping, though they can appear at a lower level, e.g. to transform a row or value before computing an aggregate.
The new schema of the target dataset will consist only of the columns from the groupBy
array, plus the new columns containing the aggregates, as given by the columns
mapping.
If in
is omitted, the aggregation will be performed over the primary dataset.
Examples
Counting records in a group
{
"operation": "aggregate",
"columns": {
"grouped_count": {
"fun": "count",
"args": [{"lit": 1}]
}
},
"groupBy": ["x", "y"]
}
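Scalar functions may still appear beneath the top level of an aggregate expression. The hypothetical example below sums the absolute values of a delta column within each group, assuming the DataFusion sum and abs functions:
{
  "operation": "aggregate",
  "columns": {
    "total_magnitude": {
      "fun": "sum",
      "args": [{"fun": "abs", "args": [{"col": "delta"}]}]
    }
  },
  "groupBy": ["region"]
}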
compute
The compute
transformation adds a new column to a dataset by evaluating a specified expression over every row in that dataset. The expression may be a simple numerical operation over columns in that dataset, a function application, or something more complex. See Expressions for more information.
Properties
Name | Required | Type | Description |
---|---|---|---|
newColumn | Yes | string | The name of the new column to create |
in | No | string | The dataset in which the column should be created |
as | Yes | object | The expression giving the values for the column |
If in
is omitted, the new column will be created in the primary dataset.
Examples
Adding two columns together
{
"operation": "compute",
"newColumn": "sum",
"as": {"left": {"col": "leftColumn"}, "op": "+", "right": {"col": "rightColumn"}}
}
Applying a function to a column
{
"operation": "compute",
"newColumn": "ln_x",
"as": {"fun": "ln", "args": [{"col": "x"}]}
}
Applying a function to a column in an input dataset
{
"operation": "compute",
"newColumn": "ln_x",
"in": "natural_logarithms",
"as": {"fun": "ln", "args": [{"col": "x"}]}
}
dropColumns
Use the dropColumns
transformation to discard unwanted columns from a primary or input dataset.
dropColumns
may be configured in one of two flavors: Inclusive, where you use the columns
property to specify which columns you want to drop; or exclusive, where you use the except
property to specify which columns you want to keep.
Properties
Name | Required | Type | Description |
---|---|---|---|
columns | | array | The columns to drop |
except | | array | The columns to preserve |
in | No | string | The target dataset |
If in
is specified, it gives the name of the input dataset from which the columns should be dropped. If it is omitted, the primary dataset is assumed to own the columns. Exactly one of columns
or except
must be specified.
Examples
Dropping specific columns from the primary dataset
{"operation": "dropColumns", "columns": ["x", "y", "z"]}
Preserving specific columns from an input dataset
{"operation": "dropColumns", "in": "noisy_input", "except": ["important_datapoint"]}
filter
Use the filter
transformation to remove any row from a target dataset for which the specified expression evaluates to false
.
Properties
Name | Required | Type | Description |
---|---|---|---|
dataset | No | string | The target dataset |
by | Yes | object | The expression to use as a filter |
If dataset
is specified, it gives the name of the input dataset to be filtered. If it is omitted, the primary dataset is assumed to be the target for filtering.
Examples
Filtering rows in the primary dataset
{"operation": "filter", "by": {"left": {"col": "threshold"}, "op": ">", "right": {"lit": 0}}}
Filtering rows in an input dataset
{
"operation": "filter",
"dataset": "ranks",
"by": {"left": {"col": "threshold"}, "op": ">", "right": {"lit": 0}}
}
join
Use the join
transformation to combine two datasets to produce a new dataset that includes columns from both of the original datasets. The nature of the combination, including which pairs of records are combined, is determined by the “join type.” Terrazzo supports five different join types, some of which may be familiar from SQL databases: CROSS
, FULL
, INNER
, LEFT
, and RIGHT
.
The Wikipedia page on SQL JOINs gives a good explanation of the semantics of these different join types.
Properties
Name | Required | Type | Description |
---|---|---|---|
type | Yes | string | The join type; one of cross, full, inner, left, right |
dataset | No | string | The dataset on the “left” side of the join |
to | Yes | string | The dataset on the “right” side of the join |
on | | array | The columns to join on; ignored for cross joins |
If dataset
is omitted, the primary dataset is taken as the left side of the join. For join types that perform matches on column values, the on
clause specifies the column names to use in the match. A column used for matching must have the same name in both the left and right datasets participating in the join. For example:
{
"operation": "join",
"type": "inner",
"dataset": "left",
"to": "right",
"on": ["id"]
}
…means that there must be a column named id
in both left
and right
.
If type
is set to cross
the on
property is ignored; the full Cartesian product of the two datasets will be generated.
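For comparison, here is a hypothetical left join that matches rows on a shared person_id column, keeping every row from the left dataset even when there is no match on the right:
{
  "operation": "join",
  "type": "left",
  "to": "addresses",
  "on": ["person_id"]
}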
renameColumn
Use the renameColumn
transformation to change the name of a single column in a dataset. The new column name must not already exist in the schema for the target dataset.
Properties
Name | Required | Type | Description |
---|---|---|---|
column | Yes | string | The column to rename |
in | No | string | The dataset that owns the column |
to | Yes | string | The new name for the column |
If in
is specified, it gives the name of the input dataset that owns the column. If it is omitted, the primary dataset is assumed to own the column.
Examples
Renaming a column in the primary dataset
{
"operation": "renameColumn",
"column": "oldName",
"to": "newName"
}
Renaming a column in an input dataset
{
"operation": "renameColumn",
"in": "myInputDataset",
"column": "oldName",
"to": "newName"
}
sort
The sort
transformation sorts a target dataset by one or more columns.
Sorting as part of a Terrazzo build is most useful for ensuring that your dataset can be processed efficiently by downstream, external tools after the build completes. But sorting can also speed up your build within Terrazzo if applied at the right point in the build and with the right ordering criteria.
Under the hood, Terrazzo uses “external sorting” techniques to allow it to sort datasets that are too large to fit in memory all at once. Terrazzo will spill portions of the dataset to disk as necessary during the sort operation.
Properties
Name | Required | Type | Description |
---|---|---|---|
dataset | No | string | The target dataset |
onColumns | Yes | array | The sort descriptors |
If dataset
is specified, it gives the name of the input dataset to be sorted. If it is omitted, the primary dataset is assumed to be the target for sorting. Each sort descriptor has the form:
Name | Required | Type | Description |
---|---|---|---|
column | Yes | string | The column to sort on |
dir | | string | The direction of the sort; either asc or desc |
Examples
Sorting by a single column
{
"operation": "sort",
"onColumns": [{ "column": "priority", "dir": "asc" }]
}
Sorting by multiple columns
{
"operation": "sort",
"onColumns": [
{ "column": "lastName", "dir": "asc" },
{ "column": "middleInitial", "dir": "asc" }
]
}
Sorting an input dataset
{
"operation": "sort",
"dataset": "workOrders",
"onColumns": [{ "column": "priority", "dir": "asc" }]
}
Expressions
The JSON data structures that are used in the compute
and filter
transformations are different from the other transformation criteria. Not only are they more dynamic, with multiple possible structures; they can also be recursively nested in order to express arbitrarily complex types and operations. This section describes the different types of expressions supported by Terrazzo and how they may be combined in a build definition.
Binary expressions
A binary expression combines two subexpressions with an operator. The subexpressions may be any type of expression.
Properties
Name | Required | Type | Description |
---|---|---|---|
left | Yes | object | The left subexpression |
op | Yes | string | The operator |
right | Yes | object | The right subexpression |
Examples
A comparison expression
{"left": {"lit": -1}, "op": "<", "right": {"lit": 0}}
A binary expression adding two function expressions
{
"left": {
"fun": "cube_root",
"args": [{"col": "x"}]
},
"op": "+",
"right": {
"fun": "natural_log",
"args": [{"col": "y"}]
}
}
N-ary expressions
In addition to the binary left / right
expression syntax, some operators—specifically, and
and or
—can be applied over a list of subexpressions. (An n-ary and
expression is true if all of its subexpressions are true; an n-ary or
is true if at least one of its subexpressions is true.) Although and
and or
can be used in binary expressions, their n-ary form can be a more concise notation in cases where there are many subexpressions.
Properties
Name | Required | Type | Description |
---|---|---|---|
and | | array | The list of subexpressions |
or | | array | The list of subexpressions |
Examples
{
"and": [
{"left": {"col": "x"}, "op": ">", "right": {"col": "y"}},
{"left": {"col": "z"}, "op": "<=", "right": {"lit": 100}},
{"left": {"col": "y"}, "op": "=", "right": {"lit": 0}}
]
}
Case expressions
Use a case expression to choose a value for each row of a dataset by evaluating a sequence of conditions. A case expression evaluates one or more “when” expressions in sequence until one evaluates to true
, at which point the associated “then” expression becomes the value of the entire case expression. If no “when” expression matches, the value of the “else” expression becomes the value of the entire case expression.
Properties
Name | Required | Type | Description |
---|---|---|---|
case | | array | An array of when-then clauses (array form) |
when | | object | A boolean condition (single-clause form) |
then | | object | The value of the expression when the condition is true |
else | Yes | object | The value of the expression when no condition matches |
Case expressions can take one of two forms: A single when
-then
-else
clause, or an array of when
-then
expressions with an else
at the top level. In both forms, all when
expressions must evaluate to a boolean value, and the types of all then
expressions and the required else
expression must either have the same type or all evaluate to numeric types.
Examples
A single, top level when
-then
-else
form
{
"when": {"left": {"col": "x"}, "op": ">", "right": {"col": "y"}},
"then": {"lit": 1},
"else": {"lit": 0}
}
An array of when
-then
clauses
{
"case": [{
"when": {"left": {"col": "x"}, "op": ">", "right": {"col": "y"}},
"then": {"lit": 1}
}, {
"when": {"left": {"col": "z"}, "op": "<=", "right": {"lit": 100}},
"then": {"lit": 2}
}, {
"when": {"left": {"col": "y"}, "op": "=", "right": {"lit": 0}}
"then": {"lit": 3}
}],
"else": {"lit": 0}
}
Cast expressions
Use a cast expression to coerce a value from one data type to another, when possible, such as between numeric types with different widths. Casting may be necessary when joining on columns that have different types or to satisfy the constraints of a function’s type signature.
The target type must be a supported Arrow / Parquet data type.
Properties
Name | Required | Type | Description |
---|---|---|---|
cast | Yes | object | The subexpression to be cast |
asType | Yes | string | The name of the target type |
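Examples
Casting a column to a string type
The sketch below assumes a hypothetical zip_code column and the Arrow Utf8 type name for strings:
{"cast": {"col": "zip_code"}, "asType": "Utf8"}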
Column expressions
A column expression is a reference to the value of a column at a particular row. Using a column expression as an argument to a function in a function expression will pass the value of the column in each row to which the function is applied, across the entire dataset.
Properties
Name | Required | Type | Description |
---|---|---|---|
col | Yes | string | The name of the column |
Examples
{"col": "my_column_name"}
Function expressions
A function expression represents the result of applying a function to a selection of values within a build definition. Functions may be scalar, in which case they apply to a list of argument expressions that correspond to a single row in a dataset; or they may be aggregate, in which case they apply to a “group” of rows as given by a group definition, producing a single value for the entire group.
Function expressions in a build definition must refer either to scalar functions built into Terrazzo or supplied by a loaded plugin library, or to aggregate functions built into Terrazzo. (There is currently no support for user-defined aggregate functions.)
Terrazzo uses the Apache DataFusion library for expression evaluation. All of the functions provided by DataFusion can be used in a function expression in a Terrazzo build. You can find a list of available scalar functions here: https://arrow.apache.org/datafusion/user-guide/sql/scalar_functions.html, and a list of available aggregate functions here: https://arrow.apache.org/datafusion/user-guide/sql/aggregate_functions.html.
Properties
Name | Required | Type | Description |
---|---|---|---|
fun | Yes | string | The name of the function to be applied |
args | Yes | array | The array of argument expressions |
The expressions in the argument array may be any type of expression, including another function expression.
Examples
Applying a function to a column and a literal expression
{
"fun": "exponent",
"args": [
{"col": "base"}
{"lit": 3}
]
}
Applying a function with no arguments
{"fun": "rand_int", "args": []}
Literal expressions
A literal expression represents an explicit, static value that is independent of any dynamic column values. Use literal expressions to represent strings or numbers in larger composite expressions, or as default values for new columns.
Properties
Name | Required | Type | Description |
---|---|---|---|
lit | Yes | any | The literal value |
The type of the lit
property may be any JSON primitive type: string, number, or boolean.
Examples
A literal string
{"lit": "Hello, world!"}
A literal number
{"lit": 1.23}
A literal boolean
{"lit": false}