The primary means of building, sharing, and interacting with Terrazzo datasets is trzo
, the Terrazzo command-line interface. The various things trzo
can do are organized into “commands”, each representing a discrete workflow. For example, run trzo build
to build a dataset from a definition file; run trzo dump
to output the contents of a dataset.
You can get help from trzo
itself by running trzo help
, but detailed documentation on each command and how to use it can be found below.
build
Build a dataset from a definition file
Usage
trzo build [OPTIONS] <CONTEXT>
Options
Name, shorthand | Default | Description |
---|---|---|
--file , -f | terrazzo.json | The path to the definition file |
--no-cache | false | Disable use of the cache for this build |
--rebuild | false | Whether to force a rebuild |
--terrazzo-home | $HOME/.terrazzo | The Terrazzo environment |
The build
command parses a build definition from a JSON file and carries out the steps it describes to generate a dataset. The required CONTEXT
argument gives a path to use for resolving relative paths that appear in the build definition. See The build definition for a detailed description of the build definition format, or Getting started for a simple example definition.
Once the build is complete, the new dataset is registered under its public id in the current environment so that it is available for other trzo
commands or to serve as the basis for another dataset definition. Built datasets are immutable; at the moment you cannot build a dataset that shares a public id with an already-built dataset but has different source or input data or different transformations. Instead of modifying an already-built dataset, consider defining a dataset with the same name but a different version.
The build
command will output the steps it is taking during the build in a human-readable format:
(1/2) Fetching uri https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet
(2/2) Filter primary dataset @f90c137b by to produce @93b63d39
example/yellow-taxi-example@1.0.0
Where possible, build
will attempt to show a progress bar to report the status of long-running operations like network fetches or joins; however, the duration of some operations cannot be predicted. build
will also try to reuse previously-generated stages of both the primary and any input datasets if it can determine that nothing about their state has changed. (Use the --no-cache
flag to disable this behavior.)
The final line of build output is the public id of the dataset, written to standard output so that you can use it as part of a command pipeline. (The other messages logged during the build are written to standard error by default.)
Ordinarily, if a dataset with the same public id is found in the environment, the build will jump straight to printing out the public id and then exit. Use the --rebuild
flag to force build
to go through the normal sequence of analyzing and executing the build definition.
Examples
Building the default terrazzo.json
definition relative to the current directory
trzo build .
Building relative to another folder
trzo build /media/Volumes/my_data
Building a definition with a non-default filename
trzo build . -f terrazzo.wip.json
Piping the result to trzo dump
trzo build . | xargs trzo dump --format=csv
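Because the public id is the only thing build writes to standard output, a script can also capture it in a variable while sending the progress messages to a log file. The following is a minimal sketch for a POSIX shell. The function name build_and_capture and the log file name build.log are illustrative and not part of trzo, and the sketch assumes that build exits with a non-zero status when it fails.

```shell
# Hypothetical helper: build a dataset and print only its public id.
# Progress messages (standard error) are appended to a log file.
build_and_capture() {
  bc_context="${1:-.}"        # CONTEXT argument passed to trzo build
  bc_log="${2:-build.log}"    # illustrative log file name

  # Command substitution captures standard output only, which is where
  # build writes the public id. Assumes a non-zero exit status on failure.
  bc_id="$(trzo build "$bc_context" 2>>"$bc_log")" || return 1

  # Treat an empty result as a failure rather than passing it along.
  [ -n "$bc_id" ] || return 1
  printf '%s\n' "$bc_id"
}
```

A caller could then store the result, for example with id="$(build_and_capture .)", and pass "$id" to trzo dump or trzo push.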
clean
Clean the environment’s temp folder and work cache
Usage
trzo clean [OPTIONS]
Options
Name, shorthand | Default | Description |
---|---|---|
--terrazzo-home | $HOME/.terrazzo | The Terrazzo environment |
A trzo
environment includes a folder for storing temporary files that are used during the generation of a single stage of the build (such as spilling portions of a large sort operation) as well as a folder for caching the results of build stages so they do not need to be recomputed.
The temporary files are almost always cleared out automatically by trzo
or by the operating system itself, but the work cache is preserved across builds because it often speeds up build operations significantly, especially when you are iterating on a build definition or debugging an error in a build. The cached files can be quite large, so run clean
when you want to free up some space.
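To see how much space a clean actually recovers, you can measure the environment folder before and after. This is a rough sketch only: clean_report is a hypothetical helper, it measures the whole environment folder rather than the cache alone (the internal folder layout is not described here), and it assumes --terrazzo-home accepts the --name=value form used by --format elsewhere on this page.

```shell
# Hypothetical helper: run trzo clean and report the space it freed.
clean_report() {
  cr_home="${1:-$HOME/.terrazzo}"   # default environment location
  [ -d "$cr_home" ] || return 1

  # du -sk reports the total size in 1024-byte units, then a tab, then the path.
  cr_before="$(du -sk "$cr_home" | cut -f1)"
  trzo clean --terrazzo-home="$cr_home" || return 1
  cr_after="$(du -sk "$cr_home" | cut -f1)"

  echo "freed $((cr_before - cr_after)) KiB"
}
```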
describe
Print a human-readable description of the build steps in a definition
Usage
trzo describe [OPTIONS]
Options
Name, shorthand | Default | Description |
---|---|---|
--file , -f | terrazzo.json | The path to the definition file |
Use the describe
command to print a log of the steps a build definition would take if you were to execute it using trzo build
. This can be useful if you are having trouble visualizing the behavior of a definition, or if you would like to validate the structure of a definition you are working on without actually performing the steps. The output of describe
is almost exactly the same as the output produced during a build.
Because it processes a build file entirely “offline” (i.e., without actual data), describe
does not require an environment to be available.
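One way to use describe as a validation step is to run it before build and stop if it fails. The sketch below assumes, without confirmation from this page, that describe exits with a non-zero status when a definition is invalid. The function name check_then_build is hypothetical.

```shell
# Hypothetical helper: dry-run a definition with describe, then build it.
# Assumes describe exits non-zero for an invalid definition.
check_then_build() {
  ctb_context="${1:-.}"
  ctb_file="${2:-terrazzo.json}"

  # Send the description to standard error so that standard output
  # carries only the public id printed by build.
  if ! trzo describe -f "$ctb_file" >&2; then
    echo "definition $ctb_file did not pass describe; skipping build" >&2
    return 1
  fi

  trzo build "$ctb_context" -f "$ctb_file"
}
```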
dump
Dump the contents of a dataset to standard output
Usage
trzo dump [OPTIONS] <PUBLIC_ID>
Options
Name, shorthand | Default | Description |
---|---|---|
--format | | The output format; one of csv, json, parquet |
--terrazzo-home | $HOME/.terrazzo | The Terrazzo environment |
The most straightforward way to export a dataset from your Terrazzo environment is by using the dump
command. dump
writes the contents of the dataset specified via the PUBLIC_ID
argument to standard output, where you can redirect it to a file or use it as part of a “pipeline” (at least, on a Unix-based system).
Note that depending on the types in your dataset’s schema, it may not be possible to render it as CSV or JSON, for example if it includes geospatial data in a binary
column. In this case, you may need to dump it as Parquet and use Parquet-based tools to finish processing it outside of Terrazzo.
Examples
Dumping into a Parquet file
trzo dump --format=parquet example/yellow-taxi-example@1.0.0 > taxi_data.parquet
Dumping to CSV, then piping to csvkit
trzo dump --format=csv example/yellow-taxi-example@1.0.0 | csvgrep -c 1 -m '2022-01-01'
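Since some schemas cannot be rendered as CSV, a script may want to try CSV first and fall back to Parquet. The following is a sketch under one stated assumption: that dump exits with a non-zero status when the requested format cannot represent the dataset. The helper name dump_to_file and the .partial suffix are illustrative.

```shell
# Hypothetical helper: export a dataset to a file, preferring CSV and
# falling back to Parquet. Prints the path of the file it wrote.
dump_to_file() {
  dtf_id="$1"      # public id, e.g. example/yellow-taxi-example@1.0.0
  dtf_base="$2"    # output path without an extension

  for dtf_fmt in csv parquet; do
    # Write to a temporary name so a failed dump never leaves a
    # truncated file under the final name.
    dtf_tmp="$dtf_base.$dtf_fmt.partial"
    if trzo dump --format="$dtf_fmt" "$dtf_id" >"$dtf_tmp"; then
      mv "$dtf_tmp" "$dtf_base.$dtf_fmt"
      printf '%s\n' "$dtf_base.$dtf_fmt"
      return 0
    fi
    rm -f "$dtf_tmp"
  done
  return 1
}
```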
get
Fetch a dataset from a Terrazzo repository
Usage
trzo get [OPTIONS] <PUBLIC_ID>
Options
Name, shorthand | Default | Description |
---|---|---|
--repository-url , -r | https://terrazzo.dev | The target repository URL |
--terrazzo-home | $HOME/.terrazzo | The Terrazzo environment |
Use get
to retrieve the data and definition files for the dataset specified by PUBLIC_ID
from a Terrazzo repository. This can be handy when you want to access the content of a dataset in a way other than using it as an input in a definition, e.g., via trzo dump
. You can also use get
to ensure that a dataset is available in your local environment before commencing a build.
Examples
Fetching a dataset from terrazzo.dev
trzo get example/yellow-taxi-example@1.0.0
Fetching a dataset from a different Terrazzo repository
trzo get -r https://my-own-private-repo.net/ example/yellow-taxi-example@1.0.0
init
Initialize a new Terrazzo environment
Usage
trzo init [TERRAZZO_HOME]
Terrazzo’s “environment” is an area on the filesystem that Terrazzo uses for storing the data behind your datasets, along with dataset manifests and miscellaneous files needed in the course of building a dataset from a definition. An environment is needed by almost all trzo
commands, with a few exceptions.
The easiest way to set up a new environment is to run trzo init
with no additional arguments, which will create the files and folders for the environment in a default location for your user. On most systems, this location will be a subfolder named .terrazzo
within your user’s “home” folder.
In the event that you need to create an environment in a different location, you can pass that location as an argument to init
(see below). Note that if you do this, you will need to tell any subsequent trzo
commands where to find the environment, by using the --terrazzo-home
option. The user who runs trzo
must have “read” and “write” access to the environment folder. They must also have permission to traverse that folder and access the files in it. (On Unix-like systems this requires the “execute” permission.)
init
will fail if the target environment location…
- …does not exist and cannot be created
- …exists but is not a directory
- …exists as a directory but is not empty
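The three failure conditions above can be checked in advance from a script. The following is a minimal sketch, not a description of how init itself is implemented. The function name can_init_at is hypothetical, and the "cannot be created" check assumes that init would create any missing parent folders.

```shell
# Hypothetical pre-check mirroring the failure conditions listed above.
# Prints a reason and returns non-zero if init would be expected to fail.
can_init_at() {
  cia_dir="$1"

  if [ ! -e "$cia_dir" ]; then
    # Target does not exist: find the nearest existing ancestor and
    # check that new entries can be created inside it.
    cia_parent="$(dirname "$cia_dir")"
    while [ ! -e "$cia_parent" ]; do
      cia_parent="$(dirname "$cia_parent")"
    done
    if [ -d "$cia_parent" ] && [ -w "$cia_parent" ] && [ -x "$cia_parent" ]; then
      return 0
    fi
    echo "cannot create $cia_dir"
    return 1
  fi

  if [ ! -d "$cia_dir" ]; then
    echo "$cia_dir is not a directory"
    return 1
  fi

  # ls -A lists every entry except . and .., so any output means non-empty.
  if [ -n "$(ls -A "$cia_dir" 2>/dev/null)" ]; then
    echo "$cia_dir is not empty"
    return 1
  fi
  return 0
}
```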
Examples
Creating a Terrazzo environment in the default location
trzo init
Creating a Terrazzo environment in a different location
trzo init /tmp/environment
login
Log in to a Terrazzo repository server
Usage
trzo login [OPTIONS] --username <USERNAME> [REPOSITORY_URL]
Options
Name, shorthand | Default | Description |
---|---|---|
--password-stdin | false | Whether to read the password piped from stdin |
--terrazzo-home | $HOME/.terrazzo | The Terrazzo environment |
--username , -u | | The username of the user to log in as |
In order to perform certain actions on a Terrazzo repository server, you must authenticate yourself to the server. The login
command will read your username and password and submit them over a secure channel to the remote server, establishing a session that is stored in your Terrazzo environment and can be used for subsequent interactions with that repository. Unless you specify the repository you wish to log in to via the REPOSITORY_URL
argument, login
will connect to the default repository server at https://terrazzo.dev
. You may have distinct, active sessions with multiple repositories; just run the login
command for each one to establish your session.
By default, login
operates interactively, reading your password from the console as you type it, without displaying the characters. In some cases it may be more convenient or secure to pipe your password to login
as part of a command pipeline. If you wish to do this, add the --password-stdin
flag to direct login
to read your password from standard input.
While you may use other trzo
commands to interact with repositories that serve data over an unsecured channel like HTTP (e.g., fetching non-protected datasets), login
will refuse to submit credentials to those servers, since doing so could reveal them to a third party monitoring your network traffic.
Examples
Logging in to terrazzo.dev
trzo login -u me@my-email-server.net
Logging in to a different Terrazzo repository
trzo login -u me@my-email-server.net https://my-own-private-repo.net/
Piping credentials from standard input
trzo login -u me@my-email-server.net --password-stdin < my_password.txt
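In automated settings such as a CI job, the password often lives in an environment variable rather than a file. The sketch below pipes it to login with --password-stdin and refuses non-HTTPS URLs up front, mirroring the check that login itself performs. The function name login_noninteractive and the variable TRZO_PASSWORD are illustrative; trzo does not read that variable itself. Whether login expects or strips a trailing newline is not stated here, so the sketch sends one, as a typical password file would.

```shell
# Hypothetical helper: log in without a prompt, reading the password from
# the TRZO_PASSWORD environment variable (an illustrative name).
login_noninteractive() {
  ln_user="$1"
  ln_url="${2:-https://terrazzo.dev}"   # default repository server

  case "$ln_url" in
    https://*) ;;   # secure channel: proceed
    *)
      echo "refusing to send credentials to $ln_url" >&2
      return 1
      ;;
  esac

  # printf is a builtin in most shells, so the password is not exposed
  # as an argument of a separate process.
  printf '%s\n' "$TRZO_PASSWORD" |
    trzo login -u "$ln_user" --password-stdin "$ln_url"
}
```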
logout
Log out of a Terrazzo repository server
Usage
trzo logout [OPTIONS] [REPOSITORY_URL]
Options
Name, shorthand | Default | Description |
---|---|---|
--terrazzo-home | $HOME/.terrazzo | The Terrazzo environment |
Use logout
to end your session with a particular Terrazzo repository. Your sessions for other repositories will be unaffected. If the REPOSITORY_URL
argument is omitted, logout
defaults to logging you out of the default repository server at terrazzo.dev
.
Examples
Logging out of terrazzo.dev
trzo logout
Logging out of a different Terrazzo repository
trzo logout https://my-own-private-repo.net/
ls
List the datasets available in the environment
Usage
trzo ls [OPTIONS]
Options
Name, shorthand | Default | Description |
---|---|---|
--terrazzo-home | $HOME/.terrazzo | The Terrazzo environment |
Use ls
to see a list of datasets you’ve built or fetched. The output of ls
is a list of datasets by public id and the first eight characters of the build digest, separated by the tab character:
public_id build_id
nyc/census-political-overlay@1.0.0 9f665a93
example/yellow-taxi-example@1.0.0 0a40612e
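Because the listing is tab-separated with a header row, it is straightforward to parse in a script. The following sketch checks whether a dataset is already present and fetches it with get only when it is missing. Both function names are hypothetical, and the sketch assumes the header row is always the first line of output, as in the sample above.

```shell
# Hypothetical helper: succeed if PUBLIC_ID appears in the ls listing.
has_dataset() {
  # Fields are tab-separated; NR > 1 skips the header row.
  trzo ls | awk -F '\t' -v id="$1" '
    NR > 1 && $1 == id { found = 1 }
    END { exit !found }
  '
}

# Hypothetical helper: fetch the dataset only when it is missing locally.
ensure_dataset() {
  has_dataset "$1" || trzo get "$1"
}
```

Running ensure_dataset before a build is one way to make sure an input dataset is available in the local environment.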
purge
Permanently delete a dataset from a Terrazzo repository
Usage
trzo purge [OPTIONS] <PUBLIC_ID>
Options
Name, shorthand | Default | Description |
---|---|---|
--repository-url , -r | https://terrazzo.dev | The target repository URL |
--terrazzo-home | $HOME/.terrazzo | The Terrazzo environment |
The purge
command allows you to permanently remove a dataset from a Terrazzo repository server. Note that a dataset may only be purged by its original uploader or by users with superuser permissions on the target repository server.
Examples
trzo purge example/yellow-taxi-example@1.0.0
push
Publish a dataset to a Terrazzo repository
Usage
trzo push [OPTIONS] <PUBLIC_ID>
Options
Name, shorthand | Default | Description |
---|---|---|
--repository-url , -r | https://terrazzo.dev | The target repository URL |
--terrazzo-home | $HOME/.terrazzo | The Terrazzo environment |
In order to share your dataset with others, you may wish to upload it to a Terrazzo repository server. The push
command uploads the Parquet data that backs the dataset given by PUBLIC_ID
along with its definition file and a manifest describing its contents. The dataset must have been built (via trzo build
) before it can be pushed.
Because Terrazzo datasets may be hundreds of megabytes or even gigabytes in size, push
will display a progress bar showing the status of the upload, and will attempt to resume uploads that have been interrupted by changes in network connectivity.
push
will fail if the repository rejects your dataset, for example because it has already been shared by someone else or because you don’t have permission to add to the collection the dataset belongs to. Note that you will need to authenticate to the repository server via trzo login
before it will accept your upload.
Examples
Pushing a dataset to terrazzo.dev
trzo push example/yellow-taxi-example@1.0.0
Pushing a dataset to a different Terrazzo repository
trzo push -r https://my-own-private-repo.net/ example/yellow-taxi-example@1.0.0
rm
Permanently delete a dataset from the environment
Usage
trzo rm [OPTIONS] <PUBLIC_ID>
Options
Name, shorthand | Default | Description |
---|---|---|
--terrazzo-home | $HOME/.terrazzo | The Terrazzo environment |
Examples
trzo rm example/yellow-taxi-example@1.0.0