The following concepts are important for getting the most out of Terrazzo.
This documentation uses the term dataset to describe a collection of records organized into a table with rows and columns, such as might be represented by a single sheet in an Excel file, or a single dataframe in Pandas. Terrazzo provides tools for working with datasets in the form of Apache Parquet files. Parquet is a column-oriented data file format supported by a wide range of data tools and software systems. Being column-oriented means that all of the data for a particular column in a dataset is kept in a contiguous region of memory or disk storage, making it efficient to read and transform.
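To make the difference between row-oriented and column-oriented layouts concrete, here is a minimal sketch in plain Python. It only illustrates the idea of keeping each column's values together; it does not show how Parquet or Terrazzo store data on disk, and the sample records and the helper name to_columns are made up for the example.

```python
# Illustration only: the same tiny dataset in row-oriented and
# column-oriented form. The records and helper name are hypothetical.

rows = [
    {"tract": "A", "population": 1200},
    {"tract": "B", "population": 850},
    {"tract": "C", "population": 430},
]


def to_columns(records):
    """Regroup row-oriented records so each column's values sit together."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Summing one column reads a single list instead of visiting every record.
total_population = sum(columns["population"])
```

In the column-oriented form, an operation that needs only the population column never touches the tract values, which is the property that makes columnar formats efficient for this kind of work.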
You can learn about the Parquet file format on the official Apache Parquet site.
Terrazzo datasets are distinguished from each other by their public id, which consists of a name and a version. A dataset’s name is a series of identifiers separated by slashes, similar to a file’s path in a filesystem. A version is a label that distinguishes multiple datasets with the same name. Versions in Terrazzo follow the “semantic versioning” rules, which means they generally have the form major.minor.patch, where major, minor, and patch are all integers.
Read more about semantic versioning at semver.org.
Dataset names begin with a collection, a prefix that acts like a folder for grouping multiple related datasets. Every dataset must be part of a collection. For example, the datasets my_collection/my_political_dataset@1.0.0 and my_collection/census/my_census_tracts@2.3.0 are both part of the collection my_collection.
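The structure of a public id can be sketched as a small parser. This is not Terrazzo's own parsing code: the name@major.minor.patch shape is inferred from the examples above, the sketch handles only the plain three-integer version form, and the names PublicId and parse_public_id are hypothetical. It also assumes that a name always has at least two slash-separated parts, the first being the collection.

```python
import re
from typing import NamedTuple


class PublicId(NamedTuple):
    collection: str   # first segment of the name
    name: str         # full slash-separated name, collection included
    version: tuple    # (major, minor, patch) as integers


# Assumed shape: <name>@<major>.<minor>.<patch>, inferred from the examples.
_PUBLIC_ID = re.compile(
    r"^(?P<name>[^@\s]+)@(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)$"
)


def parse_public_id(text: str) -> PublicId:
    """Split a public id string into collection, name, and version."""
    match = _PUBLIC_ID.match(text)
    if match is None:
        raise ValueError(f"not a recognizable public id: {text!r}")
    name = match.group("name")
    parts = name.split("/")
    # Assumption: a bare name with no collection prefix is not valid.
    if len(parts) < 2 or any(part == "" for part in parts):
        raise ValueError(f"name must start with a collection: {name!r}")
    version = tuple(int(match.group(k)) for k in ("major", "minor", "patch"))
    return PublicId(collection=parts[0], name=name, version=version)


political = parse_public_id("my_collection/my_political_dataset@1.0.0")
tracts = parse_public_id("my_collection/census/my_census_tracts@2.3.0")
```

Storing the version as a tuple of integers means versions compare in the expected order, so (1, 0, 0) sorts before (2, 3, 0).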
Datasets can also be uniquely identified by their build digest, which is a cryptographic signature based on the steps used to generate a dataset, plus its public id. The build digest is used internally by Terrazzo’s tools to detect cases in which two different datasets have been mistakenly labeled with the same public id.
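The idea behind a build digest can be sketched with a standard hash function. Terrazzo's actual digest algorithm and inputs are not described here, so everything in this example is an assumption: it hashes a canonical JSON encoding of the public id and the build steps with SHA-256, and the step dictionaries and function names are hypothetical.

```python
import hashlib
import json


def build_digest(public_id: str, steps: list) -> str:
    """Hash the public id together with the build steps (assumed scheme)."""
    # Sorted keys and fixed separators give a stable encoding, so the same
    # inputs always produce the same digest.
    payload = json.dumps(
        {"public_id": public_id, "steps": steps},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def is_mislabeled(public_id_a, digest_a, public_id_b, digest_b) -> bool:
    """Same public id but different digests: two datasets share one label."""
    return public_id_a == public_id_b and digest_a != digest_b


# Hypothetical build steps; the real step format is not shown in this section.
pid = "my_collection/my_political_dataset@1.0.0"
digest_one = build_digest(pid, [{"op": "sort", "by": "population"}])
digest_two = build_digest(pid, [{"op": "sort", "by": "tract"}])
conflict = is_mislabeled(pid, digest_one, pid, digest_two)
```

Because the digest depends on the build steps as well as the public id, two builds that claim the same public id but were produced differently end up with different digests, which is what makes the mislabeling detectable.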
The steps to generate a dataset are provided by its build definition. The build definition is a special file (usually named terrazzo.json) that lists the sources of data from which the target dataset should be derived, along with zero or more transformations that should be applied along the way, such as sorting, filtering, and computing additional columns. Every built dataset must be derived from at least one set of source data, which is called the primary dataset; but other sources of data can be “linked in” as auxiliary input datasets and used to augment the primary dataset through joins or other transformations.
Regardless of what transformations are applied to the input datasets, the contents of the primary dataset at the end of the build are what become the “output” of the build.
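The relationship between the primary dataset, auxiliary inputs, and the build output can be sketched in plain Python. This is a conceptual model only, not Terrazzo's build engine or the terrazzo.json format: datasets are represented as lists of dictionaries, and the build function, the transformation functions, and the sample data are all hypothetical.

```python
def build(primary, auxiliary, transformations):
    """Apply each transformation in order; the final primary is the output."""
    for transform in transformations:
        primary = transform(primary, auxiliary)
    return primary


def keep_populated(primary, auxiliary):
    # Filtering: drop rows with no population.
    return [row for row in primary if row["population"] > 0]


def join_county_names(primary, auxiliary):
    # Join: augment the primary dataset from an auxiliary input dataset.
    names = {c["county_id"]: c["county_name"] for c in auxiliary["counties"]}
    return [{**row, "county_name": names.get(row["county_id"])} for row in primary]


def sort_by_population(primary, auxiliary):
    # Sorting: order rows by ascending population.
    return sorted(primary, key=lambda row: row["population"])


primary = [
    {"tract": "A", "county_id": 1, "population": 1200},
    {"tract": "B", "county_id": 2, "population": 0},
    {"tract": "C", "county_id": 1, "population": 430},
]
auxiliary = {
    "counties": [
        {"county_id": 1, "county_name": "Adams"},
        {"county_id": 2, "county_name": "Baker"},
    ]
}

output = build(
    primary, auxiliary, [keep_populated, join_county_names, sort_by_population]
)
```

The auxiliary data contributes a column to the result but is never itself returned; only the transformed primary dataset becomes the output.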
Source and input data can be specified in different ways:
- As a path to a Parquet file on the local filesystem
- As a URI to some Parquet data served over a network
- As a public id for a Terrazzo dataset hosted on a repository server
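The three kinds of source listed above can be told apart with a simple heuristic, sketched below. This is an assumption-laden illustration: the real build definition may well distinguish the kinds explicitly rather than by inspecting a string, and the function name classify_source, the returned labels, and the example URI are hypothetical.

```python
import re
from urllib.parse import urlparse

# Assumed public id shape: collection/name[/...]@major.minor.patch
_PUBLIC_ID_SHAPE = re.compile(r"^[^@\s/]+(/[^@\s/]+)+@\d+\.\d+\.\d+$")


def classify_source(spec: str) -> str:
    """Guess whether a source string is a URI, a public id, or a local path."""
    scheme = urlparse(spec).scheme
    # Require "://" and a multi-letter scheme so that Windows drive letters
    # such as "C:" are not mistaken for URI schemes.
    if len(scheme) > 1 and "://" in spec:
        return "uri"
    if _PUBLIC_ID_SHAPE.match(spec):
        return "public_id"
    return "local_path"


kinds = [
    classify_source("./data/tracts.parquet"),
    classify_source("https://example.com/data/tracts.parquet"),
    classify_source("my_collection/census/my_census_tracts@2.3.0"),
]
```

A heuristic like this is ambiguous at the edges, since a local file could be named so that it looks like a public id, which is one reason an explicit marker for the source kind would be more robust.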
You can read more about the format of terrazzo.json in the section of this manual titled The build definition.
Once a dataset has been built, a manifest is generated for it. The manifest contains metadata for the dataset useful for other tools in the Terrazzo toolchain, such as its public id, its build digest, and its Parquet column schema.
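As a rough picture of what a manifest carries, the sketch below models the three pieces of metadata named above as a small record. The actual manifest format is not described in this section, so the class name, field names, column type names, and JSON layout are all assumptions, and the digest value is a placeholder rather than a real digest.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Manifest:
    """Hypothetical manifest record; field names are assumptions."""
    public_id: str
    build_digest: str
    columns: tuple  # (column name, type name) pairs describing the schema


manifest = Manifest(
    public_id="my_collection/census/my_census_tracts@2.3.0",
    build_digest="<digest placeholder>",
    columns=(("tract", "string"), ("population", "int64")),
)

# Serialize to JSON so that other tools could read the metadata.
manifest_json = json.dumps(asdict(manifest), indent=2)
```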
Built datasets are automatically stored in Terrazzo’s environment, an area on your local filesystem reserved for use by Terrazzo’s tools. But you can also upload a dataset (along with its build definition) to a Terrazzo repository server. A repository is a network server that hosts Terrazzo datasets for use at runtime by Terrazzo’s build tools, plus a web frontend to allow humans to browse the manifests for available datasets.
The “default” Terrazzo repository is located at https://terrazzo.dev/