
The primary means of building, sharing, and interacting with Terrazzo datasets is trzo, the Terrazzo command-line interface. Each thing trzo can do is exposed as a “command” that represents a discrete workflow. For example, run trzo build to build a dataset from a definition file, or trzo dump to output the contents of a dataset.

You can get help from trzo itself by running trzo help, but detailed documentation on each command and how to use it can be found below.

build

Build a dataset from a definition file

Usage

trzo build [OPTIONS] <CONTEXT>

Options

Name, shorthand       Default               Description
--file, -f            terrazzo.json         The path to the definition file
--no-cache            false                 Whether to disable reuse of cached build stages
--rebuild             false                 Whether to force a rebuild
--terrazzo-home       $HOME/.terrazzo       The Terrazzo environment

The build command parses a build definition from a JSON file and carries out the steps it describes to generate a dataset. The required CONTEXT argument gives a path to use for resolving relative paths that appear in the build definition. See The build definition for a detailed description of the build definition format, or Getting started for a simple example definition.

Once the build is complete, the new dataset is registered under its public id in the current environment so that it is available for other trzo commands or to serve as the basis for another dataset definition. Built datasets are immutable; at the moment you cannot build a dataset that shares a public id with an already-built dataset but has different source or input data or different transformations. Instead of modifying an already-built dataset, consider defining a dataset with the same name but a different version.

The build command will output the steps it is taking during the build in a human-readable format:

(1/2) Fetching uri https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet
(2/2) Filter primary dataset @f90c137b by  to produce @93b63d39
example/yellow-taxi-example@1.0.0

Where possible, build will attempt to show a progress bar to report the status of long-running operations like network fetches or joins; however, the duration of some operations cannot be predicted. build will also try to reuse previously generated stages of both the primary and any input datasets if it can determine that nothing about their state has changed. (Use the --no-cache flag to disable this behavior.)

The final line of build output is the public id of the dataset, written to standard output so that you can use it as part of a command pipeline. (The other messages logged during the build are written to standard error by default.)
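Because the public id and the log messages travel on different streams, a script can capture one and file away the other. The following is a minimal sketch, assuming trzo is on your PATH; the helper name build_quietly and the log file name are hypothetical and not part of trzo.

```shell
# Hypothetical helper: build a dataset, keep the build log in a file,
# and print only the public id.
# Relies on the behavior described above: the id goes to standard output,
# the other build messages go to standard error.
build_quietly() {
  context="$1"
  log_file="${2:-build.log}"
  # 2> sends the human-readable log to a file; stdout is captured.
  public_id="$(trzo build "$context" 2> "$log_file")" || return 1
  printf '%s\n' "$public_id"
}

# Usage (not run here):
#   id="$(build_quietly . build.log)" && trzo dump --format=csv "$id" > out.csv
```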

Ordinarily, if a dataset with the same public id is found in the environment, the build will jump straight to printing out the public id and then exit. Use the --rebuild flag to force build to go through the normal sequence of analyzing and executing the build definition.

Examples

Building the default terrazzo.json definition relative to the current directory

trzo build .

Building relative to another folder

trzo build /media/Volumes/my_data

Building a definition with a non-default filename

trzo build . -f terrazzo.wip.json

Piping the result to trzo dump

trzo build . | xargs trzo dump --format=csv

clean

Clean the environment’s temp folder and work cache

Usage

trzo clean [OPTIONS]

Options

Name, shorthand       Default               Description
--terrazzo-home       $HOME/.terrazzo       The Terrazzo environment

A trzo environment includes a folder for storing temporary files that are used during the generation of a single stage of the build (such as spilling portions of a large sort operation) as well as a folder for caching the results of build stages so they do not need to be recomputed.

The temporary files are almost always cleared out automatically by trzo or by the operating system itself, but the work cache is preserved across builds because it often speeds up build operations significantly, especially when you are iterating on a build definition or debugging an error in a build. The cached files can be quite large, so run clean when you want to free up some space.
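To see how much space a clean actually frees, you can measure the environment folder before and after. This is only a sketch: the helper name clean_and_report is hypothetical, and it measures the whole environment folder because the names of the temp and cache folders inside it are not documented here.

```shell
# Hypothetical helper: report how much disk space `trzo clean` freed.
# Assumes the environment lives at $HOME/.terrazzo unless a path is given.
clean_and_report() {
  env_dir="${1:-$HOME/.terrazzo}"
  # du -sk prints the total size in 1024-byte blocks as the first field.
  before="$(du -sk "$env_dir" | awk '{ print $1 }')"
  trzo clean --terrazzo-home="$env_dir" || return 1
  after="$(du -sk "$env_dir" | awk '{ print $1 }')"
  echo "freed $((before - after)) KiB"
}

# Usage (not run here):
#   clean_and_report
#   clean_and_report /tmp/environment
```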

describe

Print a human-readable description of the build steps in a definition

Usage

trzo describe [OPTIONS]

Options

Name, shorthand       Default               Description
--file, -f            terrazzo.json         The path to the definition file

Use the describe command to print a log of the steps a build definition would take if you were to execute it using trzo build. This can be useful if you are having trouble visualizing the behavior of a definition, or if you would like to validate the structure of a definition you are working on without actually performing the steps. The output of describe is almost exactly the same as the output produced during a build.

Because it processes a build file entirely “offline” (i.e., without actual data), describe does not require an environment to be available.
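One way to use describe as a cheap pre-flight check is to run it before every build. The sketch below assumes that describe exits with a non-zero status when a definition is invalid, which this page does not confirm; the helper name validate_then_build is hypothetical.

```shell
# Hypothetical helper: check a definition with `describe` before building it.
# Assumption (not confirmed above): `trzo describe` returns a non-zero exit
# status when the definition cannot be parsed or is structurally invalid.
validate_then_build() {
  definition="$1"
  context="${2:-.}"
  if ! trzo describe -f "$definition" > /dev/null; then
    echo "definition $definition failed validation" >&2
    return 1
  fi
  trzo build -f "$definition" "$context"
}

# Usage (not run here):
#   validate_then_build terrazzo.wip.json .
```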

dump

Dump the contents of a dataset to standard output

Usage

trzo dump [OPTIONS] <PUBLIC_ID>

Options

Name, shorthand       Default               Description
--format              (none)                The output format; one of csv, json, parquet
--terrazzo-home       $HOME/.terrazzo       The Terrazzo environment

The most straightforward way to export a dataset from your Terrazzo environment is by using the dump command. dump writes the contents of the dataset specified via the PUBLIC_ID argument to standard output, where you can redirect it to a file or use it as part of a “pipeline” (at least, on a Unix-based system).

Note that depending on the types in your dataset’s schema, it may not be possible to render it as CSV or JSON, for example if it includes geospatial data in a binary column. In this case, you may need to dump it as Parquet and use Parquet-based tools to finish processing it outside of Terrazzo.
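A script that does not know a dataset's schema in advance could try CSV first and fall back to Parquet. This is a sketch under one assumption that this page does not state: that dump exits with a non-zero status when the requested format cannot represent the schema. The helper name dump_with_fallback is hypothetical.

```shell
# Hypothetical helper: dump as CSV if possible, otherwise as Parquet.
# Assumption: `trzo dump` fails with a non-zero exit status when the chosen
# format cannot render the dataset's schema.
dump_with_fallback() {
  public_id="$1"
  base="$2"   # output path without extension
  if trzo dump --format=csv "$public_id" > "$base.csv"; then
    echo "$base.csv"
  else
    rm -f "$base.csv"   # discard any partial CSV output
    trzo dump --format=parquet "$public_id" > "$base.parquet" || return 1
    echo "$base.parquet"
  fi
}

# Usage (not run here):
#   dump_with_fallback example/yellow-taxi-example@1.0.0 taxi_data
```

The function prints the path of the file it wrote, so a caller can branch on the extension.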

Examples

Dumping into a Parquet file

trzo dump --format=parquet example/yellow-taxi-example@1.0.0 > taxi_data.parquet

Dumping to CSV, then piping to csvkit

trzo dump --format=csv example/yellow-taxi-example@1.0.0 | csvgrep -c 1 -m '2022-01-01'

get

Fetch a dataset from a Terrazzo repository

Usage

trzo get [OPTIONS] <PUBLIC_ID>

Options

Name, shorthand       Default               Description
--repository-url, -r  https://terrazzo.dev  The target repository URL
--terrazzo-home       $HOME/.terrazzo       The Terrazzo environment

Use get to retrieve the data and definition files for the dataset specified by PUBLIC_ID from a Terrazzo repository. This can be handy when you want to access the content of a dataset in a way other than using it as an input in a definition, e.g., via trzo dump. You can also use get to ensure that a dataset is available in your local environment before commencing a build.
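To avoid fetching a dataset that is already present, a script can consult trzo ls first. The sketch below relies on the tab-separated ls output documented on this page, whose first column is the public id; the helper name ensure_dataset is hypothetical.

```shell
# Hypothetical helper: fetch a dataset only if it is not already in the
# local environment.
ensure_dataset() {
  public_id="$1"
  # cut -f1 keeps the first tab-separated column (the public id);
  # grep -Fx matches the whole line literally, so the header never matches.
  if trzo ls | cut -f1 | grep -Fxq -- "$public_id"; then
    return 0
  fi
  trzo get "$public_id"
}

# Usage (not run here):
#   ensure_dataset example/yellow-taxi-example@1.0.0 && trzo build .
```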

Examples

Fetching a dataset from terrazzo.dev

trzo get example/yellow-taxi-example@1.0.0

Fetching a dataset from a different Terrazzo repository

trzo get -r https://my-own-private-repo.net/ example/yellow-taxi-example@1.0.0

init

Initialize a new Terrazzo environment

Usage

trzo init [TERRAZZO_HOME]

Terrazzo’s “environment” is an area on the filesystem that Terrazzo uses for storing the data behind your datasets, along with dataset manifests and miscellaneous files needed in the course of building a dataset from a definition. Almost all trzo commands need an environment.

The easiest way to set up a new environment is to run trzo init with no additional arguments, which will create the files and folders for the environment in a default location for your user. On most systems, this location will be a subfolder named .terrazzo within your user’s “home” folder.

In the event that you need to create an environment in a different location, you can pass that location as an argument to init (see below). Note that if you do this, you will need to tell any subsequent trzo commands where to find the environment by using the --terrazzo-home option. The user who runs trzo must have “read” and “write” access to the environment folder, and must also have permission to list files in that folder. (On Unix-like systems this requires the “execute” permission.)

init will fail if the target environment location…

  • …does not exist and cannot be created
  • …exists but is not a directory
  • …exists as a directory but is not empty
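The failure conditions above can be checked ahead of time in a script. The following is a rough sketch, not how init itself decides; the helper name check_init_target is hypothetical, and it does not try to predict whether a missing folder can be created.

```shell
# Hypothetical pre-flight check mirroring the failure conditions listed above.
# It does not test whether a missing location can be created; that is left
# to `trzo init` itself.
check_init_target() {
  target="$1"
  if [ ! -e "$target" ]; then
    echo "ok: will be created"
  elif [ ! -d "$target" ]; then
    echo "error: $target exists but is not a directory" >&2
    return 1
  elif [ -n "$(ls -A "$target")" ]; then
    echo "error: $target is a directory but is not empty" >&2
    return 1
  else
    echo "ok: empty directory"
  fi
}

# Usage (not run here):
#   check_init_target /tmp/environment && trzo init /tmp/environment
```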

Examples

Creating a Terrazzo environment in the default location

trzo init

Creating a Terrazzo environment in a different location

trzo init /tmp/environment

login

Log in to a Terrazzo repository server

Usage

trzo login [OPTIONS] --username <USERNAME> [REPOSITORY_URL]

Options

Name, shorthand       Default               Description
--password-stdin      false                 Whether to read the password piped from stdin
--terrazzo-home       $HOME/.terrazzo       The Terrazzo environment
--username, -u        (none)                The username of the user to log in as

In order to perform certain actions on a Terrazzo repository server, you must authenticate yourself to the server. The login command will read your username and password and submit them over a secure channel to the remote server, establishing a session that is stored in your Terrazzo environment and can be used for subsequent interactions with that repository. Unless you specify the repository you wish to log in to via the REPOSITORY_URL argument, login will connect to the default repository server at https://terrazzo.dev. You may have distinct, active sessions with multiple repositories; just run the login command for each one to establish your session.

By default, login operates interactively, reading your password from the console as you type it, without displaying the characters. In some cases it may be more convenient or secure to pipe your password to login as part of a command pipeline. If you wish to do this, add the --password-stdin flag to direct login to read your password from standard input.

While you may use other trzo commands to interact with repositories that serve data over an unsecured channel like HTTP (e.g., fetching non-protected datasets), login will refuse to submit credentials to those servers, since doing so could reveal them to a third party monitoring your network traffic.
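For scripted logins, for instance in a CI job, the password can be piped in from a secret held in a variable. This is a sketch only: TRZO_PASSWORD is an assumed variable name that trzo itself does not read, the helper name login_noninteractive is hypothetical, and whether login tolerates the trailing newline is an assumption based on the file-redirect example on this page.

```shell
# Hypothetical helper for non-interactive logins.
# TRZO_PASSWORD is an assumed variable name, set by the caller.
login_noninteractive() {
  username="$1"
  repo_url="${2:-https://terrazzo.dev}"
  if [ -z "${TRZO_PASSWORD:-}" ]; then
    echo "TRZO_PASSWORD is not set" >&2
    return 1
  fi
  # Fail early for non-HTTPS URLs, which login would refuse anyway.
  case "$repo_url" in
    https://*) ;;
    *) echo "refusing to send credentials to $repo_url" >&2; return 1 ;;
  esac
  printf '%s\n' "$TRZO_PASSWORD" |
    trzo login --password-stdin -u "$username" "$repo_url"
}

# Usage (not run here):
#   TRZO_PASSWORD="$(cat my_password.txt)" login_noninteractive me@my-email-server.net
```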

Examples

Logging in to terrazzo.dev

trzo login -u me@my-email-server.net

Logging in to a different Terrazzo repository

trzo login -u me@my-email-server.net https://my-own-private-repo.net/

Piping credentials from standard input

trzo login --password-stdin -u me@my-email-server.net < my_password.txt

logout

Log out of a Terrazzo repository server

Usage

trzo logout [OPTIONS] [REPOSITORY_URL]

Options

Name, shorthand       Default               Description
--terrazzo-home       $HOME/.terrazzo       The Terrazzo environment

Use logout to end your session with a particular Terrazzo repository. Your sessions for other repositories will be unaffected. If the REPOSITORY_URL argument is omitted, logout logs you out of the default repository server at terrazzo.dev.

Examples

Logging out of terrazzo.dev

trzo logout

Logging out of a different Terrazzo repository

trzo logout https://my-own-private-repo.net/

ls

List the datasets available in the environment

Usage

trzo ls [OPTIONS]

Options

Name, shorthand       Default               Description
--terrazzo-home       $HOME/.terrazzo       The Terrazzo environment

Use ls to see a list of datasets you’ve built or fetched. The output of ls is a list of datasets by public id and the first eight characters of the build digest, separated by the tab character:

public_id	build_id
nyc/census-political-overlay@1.0.0	9f665a93
example/yellow-taxi-example@1.0.0	0a40612e
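Because the columns are tab-separated, the listing is easy to consume from a script. A small sketch, assuming the header row shown above is always the first line of output; the helper name build_digest_for is hypothetical.

```shell
# Hypothetical helper: print the short build digest for one public id.
# Assumes the first line of `trzo ls` output is the header row.
build_digest_for() {
  # -F '\t' splits on tabs; NR > 1 skips the header row.
  trzo ls | awk -F '\t' -v id="$1" 'NR > 1 && $1 == id { print $2 }'
}

# Usage (not run here):
#   build_digest_for example/yellow-taxi-example@1.0.0
```

The function prints nothing when the dataset is not in the environment, so callers can test for an empty result.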

purge

Permanently delete a dataset from a Terrazzo repository

Usage

trzo purge [OPTIONS] <PUBLIC_ID>

Options

Name, shorthand       Default               Description
--repository-url, -r  https://terrazzo.dev  The target repository URL
--terrazzo-home       $HOME/.terrazzo       The Terrazzo environment

The purge command allows you to permanently remove a dataset from a Terrazzo repository server. Note that a dataset may only be purged by its original uploader or by users with superuser permissions on the target repository server.

Examples

trzo purge example/yellow-taxi-example@1.0.0

push

Publish a dataset to a Terrazzo repository

Usage

trzo push [OPTIONS] <PUBLIC_ID>

Options

Name, shorthand       Default               Description
--repository-url, -r  https://terrazzo.dev  The target repository URL
--terrazzo-home       $HOME/.terrazzo       The Terrazzo environment

In order to share your dataset with others, you may wish to upload it to a Terrazzo repository server. The push command uploads the Parquet data that backs the dataset given by PUBLIC_ID along with its definition file and a manifest describing its contents. The dataset must have been built (via trzo build) before it can be pushed.

Because Terrazzo datasets may be hundreds of megabytes or even gigabytes in size, push will display a progress bar showing the status of the upload, and will attempt to resume uploads that have been interrupted by changes in network connectivity.

push will fail if the repository rejects your dataset for some reason, for example if it has already been shared by someone else or if you don’t have permission to add to the collection the dataset belongs to. Note that you will need to authenticate to the repository server via trzo login before it will accept your upload.
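Since build prints the public id that push expects, the two commands chain naturally. A minimal sketch, assuming you have already logged in to the target repository; the helper name build_and_push is hypothetical.

```shell
# Hypothetical helper: build a definition and publish the result.
# Assumes `trzo login` has already been run for the target repository.
build_and_push() {
  context="${1:-.}"
  repo_url="${2:-https://terrazzo.dev}"
  # build prints the public id on standard output; stop if the build fails.
  public_id="$(trzo build "$context")" || return 1
  trzo push -r "$repo_url" "$public_id"
}

# Usage (not run here):
#   build_and_push . https://my-own-private-repo.net/
```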

Examples

Pushing a dataset to terrazzo.dev

trzo push example/yellow-taxi-example@1.0.0

Pushing a dataset to a different Terrazzo repository

trzo push -r https://my-own-private-repo.net/ example/yellow-taxi-example@1.0.0

rm

Permanently delete a dataset from the environment

Usage

trzo rm [OPTIONS] <PUBLIC_ID>

Options

Name, shorthand       Default               Description
--terrazzo-home       $HOME/.terrazzo       The Terrazzo environment

Examples

trzo rm example/yellow-taxi-example@1.0.0