
The build definition file, usually named terrazzo.json, is the means for communicating the steps that Terrazzo should carry out to produce a dataset for you. Think of the build definition as a way of “programming” your dataset in a simple, JSON-based language that stays as close as possible to plain English.

Not familiar with JSON? You can find a good introduction here.

This page describes the format of the build definition file by example and provides a complete reference for the properties and directives available for use.


Anatomy of a build definition

Let’s begin by exploring a real-world example of a Terrazzo build definition that is non-trivial in scope and exercises a good portion of Terrazzo’s build functionality.

The example below (nys/census-political-overlap@1.0.0) attempts to link United States Census tracts to New York State election districts by their overlapping geographies. A census tract is a geographic district produced by the U.S. Census as a basis for reporting demographic and economic statistics. An election district is a fundamental geographic district within New York State that is used as a building block for political districting: City Council, State Assembly, and State Senate districts are all partitionings of NYS election districts.

These two datasets originate at different levels of government and do not share a common “surrogate key” (such as a numeric id) that can be used to join them. Instead, this build definition must generate a “cross join” of all possible combinations of census tract and election district, then filter out any pairs of districts that do not overlap. Finally, for those districts that do overlap, the build definition calculates the percentage of overlap relative to the total area of the election district.

The resulting dataset can be used to generate coarse demographics for the residents of an election district, by scaling the demographics of the overlapping census tracts relative to their overlap of the district.

The raw build definition follows. Scroll below it for a closer look at each section.

{
    "name": "nys/census-political-overlap",
    "version": "1.0.0",
    "author": "Julian Graham <joolean@undecidable.net>",
    "description": "List the overlap of 2020 US Census tracts with NYS election districts",

    "plugins": ["terrazzo_geometry"],
    
    "from": {"id": "us-national/census/nyct2020@1.0.0"},
    "with": {
        "eds": {"path": "nyed.parquet"}
    },

    "transform": [
        {"operation": "renameColumn", "column": "geometry", "to": "ct_geometry"},
        {"operation": "renameColumn", "in": "eds", "column": "geometry", "to": "ed_geometry"},

        {"operation": "join", "type": "cross", "to": "eds", "on": []},
        
        {
            "operation": "filter",
            "by": {
                "fun": "geom_intersects",
                "args": [{"col": "ed_geometry"}, {"col": "ct_geometry"}]
            }
        },

        {
            "operation": "compute",
            "newColumn": "intersection_geometry",
            "as": {
                "fun": "geom_intersection",
                "args": [{"col": "ed_geometry"}, {"col": "ct_geometry"}]
            }
        },
        {
            "operation": "compute",
            "newColumn": "ed_area",
            "as": {"fun": "geom_area", "args": [{"col": "ed_geometry"}]}
        },
        {
            "operation": "compute",
            "newColumn": "intersection_area",
            "as": {
                "fun": "geom_area",
                "args": [{"col": "intersection_geometry"}]
            }
        },
        {
            "operation": "filter",
            "by": {
                "left": {"col": "intersection_area"},
                "op": ">",
                "right": {"lit": 0}
            }
        },
        {
            "operation": "compute",
            "newColumn": "overlap_pct",
            "as": {
                "left": {"col": "intersection_area"},
                "op": "/",
                "right": {"col": "ed_area"}
            }
        },

        {"operation": "dropColumns", "except": ["GEOID", "CDTANAME", "ElectDist", "overlap_pct"]}
    ]
}

Let’s start at the top!

The fields at the “head” of the definition provide some general metadata about what the dataset is for, who created it, and how it can be found in a Terrazzo repository:

    "name": "nys/census-political-overlap",
    "version": "1.0.0",
    "author": "Julian Graham <joolean@undecidable.net>",
    "description": "List the overlap of 2020 US Census tracts with NYS election districts"

Together, the name and version fields form the public id for this dataset: nys/census-political-overlap@1.0.0.

The nys in nys/census-political-overlap is the collection. A collection is like a folder in a Terrazzo repository, and allows repository users to set permissions on groups of datasets. Every dataset name must specify the collection it belongs to. Authors are free to specify additional “folders” after the collection name, but this is not required.

Next, let’s look at this plugins declaration:

    "plugins": ["terrazzo_geometry"],

This tells trzo to load the Terrazzo Geometry plugin, which provides some geospatial functions we’ll be using shortly. Stay tuned.

The next two directives are very important:

    "from": {"id": "us-national/census/nyct2020@1.0.0"},
    "with": {
        "eds": {"path": "nyed.parquet"}
    },

from describes where the primary dataset can be found. The primary dataset is the dataset that the current dataset is derived from, and the final state of the primary dataset is the “output” of the build. This from declaration uses the id property to indicate that the primary dataset is another Terrazzo dataset and can be downloaded from the default Terrazzo repository at terrazzo.dev using the public id us-national/census/nyct2020@1.0.0. See Sources of data for an explanation of different ways to refer to datasets for use in a build.

with describes where input datasets can be found. Input datasets are additional datasets that are used during the build. Each input dataset has a unique name that can be used to refer to it later in the build. In this case, there is a single input dataset named eds and it uses the path property to refer to a file on the local filesystem named nyed.parquet. This means that in order to build this dataset, you must have a copy of that file on your computer in the folder you pass as the CONTEXT argument to trzo build.

The rest of this build definition consists of transformations listed in the transform array. A transformation is a change applied to a primary or input dataset that alters its structure or contents in some way. The transformations in a build definition are applied in the order they appear in the transform array. The content of the primary dataset after the last transformation is applied becomes the “output” of the build.

Let’s look at the transformations applied by the build definition above.

The first thing that happens is that we rename a column in both the primary and input datasets:

        {"operation": "renameColumn", "column": "geometry", "to": "ct_geometry"},
        {"operation": "renameColumn", "in": "eds", "column": "geometry", "to": "ed_geometry"},

Every transformation has an operation property that identifies the type of transformation to be performed. The renameColumn operation needs to know the name of the column being renamed (column) and what its new name should be (to). renameColumn also needs to know which dataset the target column lives in. The presence of the in property indicates that the column belongs to the named input dataset; omitting this property indicates that the column belongs to the primary dataset.

These rename transformations are necessary because both the census tract data and the election district data originate from Shapefile databases that have been converted to Parquet format. As part of the conversion, the geospatial data that describes the shape of the districts is stored in a column named geometry. Because we’re going to combine these two datasets in the next step, we need to ensure that the column names do not conflict with each other.

Now we cross join the primary dataset with the election district data:

        {"operation": "join", "type": "cross", "to": "eds", "on": []},

This may look a little bit strange; it’s more common to join a dataset or table by matching one or more shared columns. But these two datasets don’t have any columns that are naturally equal between them. Instead, we’re hoping to find the pairs of districts whose geometries overlap with each other, and the cross join helps us to do this by generating the Cartesian product of the two datasets, i.e., every combination of records. The result is quite large, but that’s not a problem for Terrazzo! Anyway, we’ll be shrinking it down quite a bit in the next step.

        {
            "operation": "filter",
            "by": {
                "fun": "geom_intersects",
                "args": [{"col": "ed_geometry"}, {"col": "ct_geometry"}]
            }
        },

This filter transformation tests every row in the primary dataset by applying the geom_intersects function to the two geometry columns we renamed earlier. geom_intersects returns a boolean value according to whether its arguments intersect, and the filter throws away any rows for which the function doesn’t return true. This is also an example of using column expressions to refer to the value of a column across every row in a dataset.

Next, we use the compute transformation to generate some additional information about the intersecting districts.

        {
            "operation": "compute",
            "newColumn": "intersection_geometry",
            "as": {
                "fun": "geom_intersection",
                "args": [{"col": "ed_geometry"}, {"col": "ct_geometry"}]
            }
        },
        {
            "operation": "compute",
            "newColumn": "ed_area",
            "as": {"fun": "geom_area", "args": [{"col": "ed_geometry"}]}
        },
        {
            "operation": "compute",
            "newColumn": "intersection_area",
            "as": {
                "fun": "geom_area",
                "args": [{"col": "intersection_geometry"}]
            }
        },

Each of the three transformations above adds a new column to the primary dataset by applying a function from the Terrazzo Geometry plugin across all of the rows in the dataset. The first new column is created by applying geom_intersection to the two district geometry columns; we already know they intersect, and now we want to see the exact boundaries of their intersection. Then we use geom_area twice: First to calculate the area of the election district, then to calculate the area of the intersection.

We’ll be using those two new area columns to calculate the percentage of the election district’s area that falls within each census tract. But first we need to do a little bit of housekeeping:

        {
            "operation": "filter",
            "by": {
                "left": {"col": "intersection_area"},
                "op": ">",
                "right": {"lit": 0}
            }
        },

This filter operation, which throws away records with an empty intersection area, is necessary because even though we previously filtered to include only pairs of districts that overlap, a few districts with an empty intersection area somehow pass the filter. (This could be due to some quirks in the geometry of the source datasets, which were produced by different government agencies.)

This is also an example of a binary expression, where a left and a right operand are combined with an operator, in this case a “greater than” comparison. Note also the literal expression, which is a way of including an explicit, static value in an expression, independent of the dynamic value of any columns.

The next computation produces the really interesting datapoint: The percentage of the election district that overlaps the census tract.

        {
            "operation": "compute",
            "newColumn": "overlap_pct",
            "as": {
                "left": {"col": "intersection_area"},
                "op": "/",
                "right": {"col": "ed_area"}
            }
        },

This is another binary expression, but instead of performing a comparison, we’re dividing the area of intersection by the total area of the election district to produce a value for the new overlap_pct column.

Finally we do a little bit of cleanup on the primary dataset to throw away all of the columns except the ones we want to appear in the final schema.

        {"operation": "dropColumns", "except": ["GEOID", "CDTANAME", "ElectDist", "overlap_pct"]}

And we’re done! If you were to run trzo dump on the resulting dataset, you’d see output like this:

GEOID,CDTANAME,ElectDist,overlap_pct
36061001402,MN03 Lower East Side-Chinatown (CD 3 Equivalent),65030,0.29005219366950663
36061001402,MN03 Lower East Side-Chinatown (CD 3 Equivalent),65031,0.43683679587189234
36061001402,MN03 Lower East Side-Chinatown (CD 3 Equivalent),65033,0.41387083670618263
36061001402,MN03 Lower East Side-Chinatown (CD 3 Equivalent),65034,0.4205040699342719

Build definition reference

The following section is a complete reference to the JSON schema for a Terrazzo build definition.

Header fields

The build definition begins with four properties that provide some foundational metadata for the dataset. These properties are also copied into the manifest for the dataset once it is built.

Property     Required  Type    Description
name         yes       string  The “name” portion of the dataset’s public id
version      yes       string  The “version” portion of the dataset’s public id
author                 string  The email address of the dataset’s author
description            string  A description of the content, or explanatory notes

See the Concepts page for an explanation of public ids.

In particular, the name property must have a “path-like” structure such that it is made up of multiple segments separated by forward-slashes, where each segment begins with a letter and is followed by a combination of zero or more letters, numbers, hyphens, and underscores. The first segment in a dataset name is its “collection,” a grouping used by a Terrazzo repository to control access to datasets.

  • abc/easy-as-123 is a valid dataset name
  • abc/123 is not a valid dataset name because the second part does not begin with a letter
  • easy-as-123 is not a valid dataset name because it does not have a collection

Plugins

In order to allow Terrazzo to be extended with additional functionality, the build system exposes a plugin API for registering “scalar functions”: that is, functions that are invoked for individual records in a dataset and which produce a single value as output. One plugin can register many scalar functions, making them available for invocation as part of function expressions in a compute or filter transformation.

A plugin must be declared in the build definition and loaded by Terrazzo before it can supply its functions to a build. Use the plugins property to list the plugins containing the functions needed by your build.

The naming conventions for plugin library filenames are different on different operating systems. For example, the Terrazzo Geometry plugin has the following filenames on Linux, MacOS and Windows:

  • libterrazzo_geometry.so (Linux)
    
  • libterrazzo_geometry.dylib (MacOS)
    
  • terrazzo_geometry.dll (Windows)
    

As an affordance, to avoid special-casing your plugin declarations for different operating systems, Terrazzo lets you identify a plugin by its “library name,” i.e., without the lib prefix or file extension. Terrazzo delegates the process of searching for a plugin library file to the underlying operating system, which means that certain environment variables will affect where plugins are searched for. See the Installation page for more information on how this works.

See the “Hello, world” example plugin source code to learn more about writing your own plugins for Terrazzo.

If your build definition does not invoke any user-defined functions, you may omit the plugins property.

Examples

Loading the Terrazzo Geometry Plugin

"plugins": ["terrazzo_geometry"]

Sources of data

There are two ways to pull Parquet data into a Terrazzo build definition: You can include data as a primary dataset or as an input dataset. The primary dataset is the foundation from which a new dataset is derived and, over the course of being transformed during the build, becomes the content for the new dataset. Input datasets are named sources of data you can use to complement the primary dataset in various ways, often via a join transformation.

For both primary and input datasets, there are different ways to “reference” the data you wish to include. Refer to a dataset stored as a Parquet file on the local filesystem using a file reference:

Property     Required  Type    Description
path         yes       string  The path to the file, relative to the build context
sortColumns            array   The columns by which the dataset is sorted, if known

If sortColumns is given, it must be an array of “sort descriptors.” See sort for more information.

Refer to another Terrazzo dataset hosted by a Terrazzo repository using a manifest reference:

Property    Required  Type    Description
id          yes       string  The public id of the dataset, in name@version form
repository            string  The base URI of the repository hosting the dataset

If repository is omitted, it is assumed to be the base URI of the default Terrazzo repository, https://terrazzo.dev.

Refer to a dataset available as Parquet data that can be fetched from a specified URI over HTTP using a URI reference:

Property     Required  Type    Description
uri          yes       string  The URI of the dataset
sortColumns            array   The columns by which the dataset is sorted, if known

If sortColumns is given, it must be an array of “sort descriptors.” See sort for more information.

from

Use the from property to specify the source of the primary dataset. Every build definition must have a from.

  • "from": {
        "path": "/opt/data/phonebook.parquet",
        "sortColumns": [
            {"column": "last_name", "dir": "asc"}, 
            {"column": "first_name", "dir": "asc"}
        ]
    }
    
  • "from": {
        "id": "nynex/phonebook@1.0.0",
        "repository": "https://terrazzo.dev"
    }
    
  • "from": {
        "uri": "https://my-website.net/data/phonebook.parquet",
        "sortColumns": [
            {"column": "last_name", "dir": "asc"}, 
            {"column": "first_name", "dir": "asc"}
        ]
    }
    

with

Use the with property to specify the sources of any input datasets, along with the names that should be used to refer to them elsewhere in the build definition.

  • "with": {
        "phones": {
            "path": "/opt/data/phonebook.parquet",
            "sortColumns": [
                {"column": "last_name", "dir": "asc"}, 
                {"column": "first_name", "dir": "asc"}
            ]
        },
        "addresses": {"path": "/opt/data/addressbook.parquet"}
    }
    
  • "with": {
        "phones": {
            "id": "nynex/phonebook@1.0.0",
            "repository": "https://terrazzo.dev"
        },
        "addresses": {"id": "rebny/addressbook@1.0.0"}
    }
    
  • "with": {
        "phones": {
            "uri": "https://my-website.net/data/phonebook.parquet",
            "sortColumns": [
                {"column": "last_name", "dir": "asc"}, 
                {"column": "first_name", "dir": "asc"}
            ]
        },
        "addresses": {
            "uri": "https://my-other-website.net/data/addressbook.parquet"
        }
    }
    

Within the mapping for the with property, you may use any of the different dataset reference types. You may omit the with property if you are not using any input datasets.

Transform

The following section describes the different types of transformations that may be applied to primary and input datasets in a build definition. Transformations go into the transform array and are applied in the order they appear in the array. Each transformation directive is a JSON object of the form:

{
    "operation": [OPERATION TYPE],
    ...
}

where the operation property identifies the type of transformation.

aggregate

The aggregate transformation groups the rows of a target dataset according to shared column values and computes new columns of summary values over each group.

Properties

Name     Required  Type    Description
columns  yes       object  A mapping of new column names to aggregate expressions
groupBy  yes       array   An array of column names defining the groups
in                 string  The dataset over which the aggregation should be performed

The columns object maps new column names to aggregate expressions: Applications of aggregate functions over segments of data that share the same values for the columns in the groupBy array. Expressions that cannot be evaluated over a row group may not appear at the top level of an expression in the columns mapping, though they can appear at a lower level, e.g. to transform a row or value before computing an aggregate.

The new schema of the target dataset will consist only of the columns from the groupBy array, plus the new columns containing the aggregates, as given by the columns mapping.

If in is omitted, the aggregation will be performed over the primary dataset.

Examples

Counting records in a group

{
    "operation": "aggregate",
    "columns": {
        "grouped_count": {
            "fun": "count",
            "args": [{"lit": 1}]
        }
    },
    "groupBy": ["x", "y"]
}
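Summing a transformed value within each group

As noted above, scalar expressions may appear below the top level of an aggregate expression. The following sketch (in which the column names amount and account are hypothetical) applies the built-in abs function to each value before aggregating with sum:

{
    "operation": "aggregate",
    "columns": {
        "total_magnitude": {
            "fun": "sum",
            "args": [{"fun": "abs", "args": [{"col": "amount"}]}]
        }
    },
    "groupBy": ["account"]
}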

compute

The compute transformation adds a new column to a dataset by evaluating a specified expression over every row in that dataset. The expression may be a simple numerical operation over columns in that dataset, a function application, or something more complex. See Expressions for more information.

Properties

Name       Required  Type    Description
newColumn  yes       string  The name of the new column to create
in                   string  The dataset in which the column should be created
as         yes       object  The expression giving the values for the column

If in is omitted, the new column will be created in the primary dataset.

Examples

Adding two columns together

{
    "operation": "compute",
    "newColumn": "sum",
    "as": {"left": {"col": "leftColumn"}, "op": "+", "right": {"col": "rightColumn"}}
}

Applying a function to a column

{
    "operation": "compute",
    "newColumn": "ln_x",
    "as": {"fun": "ln", "args": [{"col": "x"}]}
}

Applying a function to a column in an input dataset

{
    "operation": "compute",
    "newColumn": "ln_x",
    "in": "natural_logarithms",
    "as": {"fun": "ln", "args": [{"col": "x"}]}
}

dropColumns

Use the dropColumns transformation to discard unwanted columns from a primary or input dataset.

dropColumns may be configured in one of two flavors: Inclusive, where you use the columns property to specify which columns you want to drop; or exclusive, where you use the except property to specify which columns you want to keep.

Properties

Name     Required  Type    Description
columns            array   The columns to drop
except             array   The columns to preserve
in                 string  The target dataset

If in is specified, it gives the name of the input dataset from which the columns should be dropped. If it is omitted, the primary dataset is assumed to own the columns. Exactly one of columns or except must be specified.

Examples

Dropping specific columns from the primary dataset

{"operation": "dropColumns", "columns": ["x", "y", "z"]}

Preserving specific columns from an input dataset

{"operation": "dropColumns", "in": "noisy_input", "except": ["important_datapoint"]}

filter

Use the filter transformation to remove any row from a target dataset for which the specified expression evaluates to false.

Properties

Name     Required  Type    Description
dataset            string  The target dataset
by       yes       object  The expression to use as a filter

If dataset is specified, it gives the name of the input dataset to be filtered. If it is omitted, the primary dataset is assumed to be the target for filtering.

Examples

Filtering rows in the primary dataset

{"operation": "filter", "by": {"left": {"col": "threshold"}, "op": ">", "right": {"lit": 0}}}

Filtering rows in an input dataset

{
    "operation": "filter", 
    "dataset": "ranks",
    "by": {"left": {"col": "threshold"}, "op": ">", "right": {"lit": 0}}
}
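Filtering rows with a function expression

The by property accepts any expression that evaluates to a boolean, not just a comparison. This sketch assumes a string column named name and uses the built-in starts_with function:

{
    "operation": "filter",
    "by": {"fun": "starts_with", "args": [{"col": "name"}, {"lit": "A"}]}
}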

join

Use the join transformation to combine two datasets to produce a new dataset that includes columns from both of the original datasets. The nature of the combination, including which pairs of records are combined, is determined by the “join type.” Terrazzo supports five different join types, some of which may be familiar from SQL databases: CROSS, FULL, INNER, LEFT, and RIGHT.

The Wikipedia page on SQL JOINs gives a good explanation of the semantics of these different join types.

Properties

Name     Required  Type    Description
type               string  The join type; one of cross, full, inner, left, right
dataset            string  The dataset on the “left” side of the join
to       yes       string  The dataset on the “right” side of the join
on       yes       array   The columns to join on

If dataset is omitted, the primary dataset is taken as the left side of the join. For join types that perform matches on column values, the on clause specifies the column names to use in the match. A column used for matching must have the same name in both the left and right datasets participating in the join. For example:

{
    "operation": "join", 
    "type": "inner",
    "dataset": "left",
    "to": "right", 
    "on": ["id"]
}

…means that there must be a column named id in both left and right.

If type is set to cross the on property is ignored; the full Cartesian product of the two datasets will be generated.
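Examples

Cross joining the primary dataset with an input dataset

This sketch assumes an input dataset named lookup; because the join type is cross, the on array is left empty:

{"operation": "join", "type": "cross", "to": "lookup", "on": []}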

renameColumn

Use the renameColumn transformation to change the name of a single column in a dataset. The new column name must not already exist in the schema for the target dataset.

Properties

Name    Required  Type    Description
column  yes       string  The column to rename
in                string  The dataset that owns the column
to      yes       string  The new name for the column

If in is specified, it gives the name of the input dataset that owns the column. If it is omitted, the primary dataset is assumed to own the column.

Examples

Renaming a column in the primary dataset

{
    "operation": "renameColumn",
    "column": "oldName",
    "to": "newName"
}

Renaming a column in an input dataset

{
    "operation": "renameColumn",
    "in": "myInputDataset",
    "column": "oldName",
    "to": "newName"
}

sort

The sort transformation sorts a target dataset by one or more columns.

Sorting as part of a Terrazzo build is most useful for ensuring that your dataset can be processed efficiently by downstream, external tools after the build completes. But sorting can also speed up your build within Terrazzo if applied at the right point in the build and with the right ordering criteria.

Under the hood, Terrazzo uses “external sorting” techniques to allow it to sort datasets that are too large to fit in memory all at once. Terrazzo will spill portions of the dataset to disk as necessary during the sort operation.

Properties

Name       Required  Type    Description
dataset              string  The target dataset
onColumns  yes       array   The sort descriptors

If dataset is specified, it gives the name of the input dataset to be sorted. If it is omitted, the primary dataset is assumed to be the target for sorting. Each sort descriptor has the form:

Name    Required  Type    Description
column  yes       string  The column to sort on
dir     yes       string  The direction of the sort; either asc or desc

Examples

Sorting by a single column

{
    "operation": "sort",
    "onColumns": [{ "column": "priority", "dir": "asc" }]
}

Sorting by multiple columns

{
    "operation": "sort",
    "onColumns": [
        { "column": "lastName", "dir": "asc" },
        { "column": "middleInitial", "dir": "asc" }
    ]
}

Sorting an input dataset

{
    "operation": "sort",
    "dataset": "workOrders",
    "onColumns": [{ "column": "priority", "dir": "asc" }]
}

Expressions

The JSON data structures that are used in the compute and filter transformations are different from the other transformation criteria. Not only are they more dynamic, with multiple possible structures; they can also be recursively nested in order to express arbitrarily complex types and operations. This section describes the different types of expressions supported by Terrazzo and how they may be combined in a build definition.

Binary expressions

A binary expression combines two subexpressions with an operator. The subexpressions may be any type of expression.

Properties

Name   Required  Type    Description
left   yes       object  The left subexpression
op     yes       string  The operator
right  yes       object  The right subexpression

Examples

A comparison expression

{"left": {"lit": -1}, "op": "<", "right": {"lit": 0}}

A binary expression adding two function expressions

{
    "left": {
        "fun": "cube_root",
        "args": [{"col": "x"}]
    },
    "op": "+",
    "right": {
        "fun": "natural_log",
        "args": [{"col": "y"}]
    }
}

N-ary expressions

In addition to the binary left/right expression syntax, some operators (specifically, and and or) can be applied over a list of subexpressions. (An n-ary and expression is true if all of its subexpressions are true; an n-ary or is true if at least one of its subexpressions is true.) Although and and or can be used in binary expressions, their n-ary form can be a more concise notation in cases where there are many subexpressions.

Properties

Name  Required  Type    Description
and             array   The list of subexpressions
or              array   The list of subexpressions

Examples

{
    "and": [
        {"left": {"col": "x"}, "op": ">", "right": {"col": "y"}},
        {"left": {"col": "z"}, "op": "<=", "right": {"lit": 100}},
        {"left": {"col": "y"}, "op": "=", "right": {"lit": 0}}
    ]
}
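An n-ary or expression, true for rows where the (assumed numeric) column x falls outside the range 0 to 100:

{
    "or": [
        {"left": {"col": "x"}, "op": "<", "right": {"lit": 0}},
        {"left": {"col": "x"}, "op": ">", "right": {"lit": 100}}
    ]
}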

Case expressions

Use a case expression to evaluate a sequence of conditions for each row in a dataset. A case expression evaluates one or more “when” expressions in sequence until one evaluates to true, at which point the associated “then” expression becomes the value of the entire case expression. If no “when” expression matches, the value of the “else” expression becomes the value of the entire case expression.

Properties

Name  Required  Type    Description
case            array   An array of when-then clauses
when            object  A boolean condition expression
then            object  The value of the expression when the condition is true
else  yes       object  The value of the expression when no condition matches

Case expressions can take one of two forms: A single when-then-else clause, or an array of when-then clauses (under the case property) with an else at the top level. In both forms, all when expressions must evaluate to a boolean value, and all then expressions and the required else expression must either have the same type or all evaluate to numeric types.

Examples

A single, top level when-then-else form

{
    "when": {"left": {"col": "x"}, "op": ">", "right": {"col": "y"}},
    "then": {"lit": 1},
    "else": {"lit": 0}
}

An array of when-then clauses

{
    "case": [{
        "when": {"left": {"col": "x"}, "op": ">", "right": {"col": "y"}},
        "then": {"lit": 1}
    }, {
        "when": {"left": {"col": "z"}, "op": "<=", "right": {"lit": 100}},
        "then": {"lit": 2}
    }, {
        "when": {"left": {"col": "y"}, "op": "=", "right": {"lit": 0}},
        "then": {"lit": 3}
    }],
    "else": {"lit": 0}
}

Cast expressions

Use a cast expression to coerce a value from one data type to another, when possible, such as between numeric types with different widths. Casting may be necessary when joining on columns that have different types or to satisfy the constraints of a function’s type signature.

The target type must be a supported Arrow / Parquet data type.

Properties

Name    Required  Type    Description
cast    yes       object  The subexpression to be cast
asType  yes       string  The name of the target type
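Examples

Casting a column to a wider numeric type

The exact spelling of the target type name depends on the underlying Arrow type; this sketch assumes a 32-bit integer column named count and a 64-bit integer target type written as int64:

{"cast": {"col": "count"}, "asType": "int64"}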

Column expressions

A column expression is a reference to the value of a column at a particular row. Using a column expression as an argument to a function in a function expression will pass the value of the column in each row to which the function is applied, across the entire dataset.

Properties

Name  Required  Type    Description
col   yes       string  The name of the column

Examples

{"col": "my_column_name"}

Function expressions

A function expression represents the result of applying a function to some selection of values within a build definition. Functions may be scalar, in which case they apply to a list of argument expressions that correspond to a single row in a dataset; or they may be aggregate, in which case they apply to a “group” of rows as given by a group definition, producing a single value for the entire group.

Function expressions in a build definition must refer either to scalar functions built into Terrazzo or supplied by a loaded plugin library, or to aggregate functions built into Terrazzo. (There is currently no support for user-defined aggregate functions.)

Terrazzo uses the Apache DataFusion library for expression evaluation. All of the functions provided by DataFusion can be used in a function expression in a Terrazzo build. You can find a list of available scalar functions here: https://arrow.apache.org/datafusion/user-guide/sql/scalar_functions.html, and a list of available aggregate functions here: https://arrow.apache.org/datafusion/user-guide/sql/aggregate_functions.html.

Properties

Name  Required  Type    Description
fun   yes       string  The name of the function to be applied
args  yes       array   The array of argument expressions

The expressions in the argument array may be any type of expression, including another function expression.

Examples

Applying a function to a column and a literal expression

{
    "fun": "exponent", 
    "args": [
        {"col": "base"},
        {"lit": 3}
    ]
}

Applying a function with no arguments

{"fun": "rand_int", "args": []}

Literal expressions

A literal expression represents an explicit, static value that is independent of any dynamic column values. Use literal expressions to represent strings or numbers in larger composite expressions, or as default values for new columns.

Properties

Name  Required  Type  Description
lit   yes       any   The literal value

The type of the lit property may be any JSON primitive type: String, number, or boolean.

Examples

A literal String

{"lit": "Hello, world!"}

A literal number

{"lit": 1.23}

A literal boolean

{"lit": false}