When your data is too big for Excel but still small enough to fit on a thumb drive
Terrazzo is desktop ETL for small- to medium-sized datasets. Create a formal description of a dataset and allow your collaborators to reproduce it from its component parts. Or make it available over the web for others to use in their own computations.
This is the documentation site for the Terrazzo data management framework.
Terrazzo is a framework and toolchain for generating and publishing columnar datasets in the Parquet file format. Using Terrazzo, you can quickly transform existing public or proprietary datasets by filtering and combining them using familiar relational operations, or compute additional data points by applying user-defined functions across millions of records. Terrazzo’s tools can operate on files from your local filesystem or pull data from the web or from file hosting services such as Amazon S3 and Google Drive.
Terrazzo supports an easy-to-edit, human-readable format for describing datasets and transformation operations. The keywords and directives in a Terrazzo build definition are close to plain English, so you don’t need to be a programmer to produce complex and useful output.
Terrazzo was designed to support groups of researchers working asynchronously and independently. After you’ve generated the data you need, you can export it (as Parquet or CSV or JSON) for use with your preferred data science toolchain, or you can publish it to a Terrazzo repository so that others can pick up where you left off by deriving their own datasets based on your work.
Getting started
Follow these steps to get started producing
-
Browse to the Terrazzo Releases page and download the zip file for your operating system.
-
Extract the contents to a location of your choosing. The Terrazzo client package with the Terrazzo command-line tool
trzo
and the Terrazzo Geometry plugin for geospatial computations. To take Terrazzo for a test drive you can leave these files where they are and temporarily update the following environment variables. See Installation when you are ready to create a permanent home for Terrazzo.
Before setting these variables, ensure that you have unzipped the files to their own subfolder, or to another folder on your computer that includes only files that you trust.
-
cd [FOLDER WHERE YOU PLACED THE FILES] export PATH="$PATH:$PWD" export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$PWD"
-
cd [FOLDER WHERE YOU PLACED THE FILES] export PATH="$PATH:$PWD" export DYLD_LIBRARY_PATH="$DYLD_LIBRARY_PATH:$PWD"
-
No changes necessary.
- Initialize a new local Terrazzo environment. Run:
trzo init
You’re ready to compute!
Let’s create a toy example dataset. Paste the following block of code into a text editor of your choice:
{
"name": "example/yellow-taxi-example",
"author": "[YOUR_EMAIL_ADDRESS]",
"version": "1.0.0",
"from": {"uri": "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet"},
"transform": [{
"operation": "filter",
"by": {
"left": {"col": "trip_distance"},
"op": ">",
"right": {"lit": 20.0}
}
}]
}
and save it as terrazzo.json
in a location of your choosing. Open a command line in the same folder and run:
trzo build .
(Don’t forget the dot!) You should see some output like this:
(1/2) Fetching uri https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet
(2/2) Filter primary dataset @f90c137b by to produce @93b63d39
example/yellow-taxi-example@1.0.0
What just happened?
You created a new dataset definition file for a new dataset named example/yellow-taxi-example
to be derived from New York City’s TLC Trip Record Data for January 2022. (That dataset already happens to be in Parquet format, so Terrazzo can operate with it directly.) You can see the format of the trip record data here. Your definition then specifies that you want to filter that initial dataset to include only those trips that had a distance greater than 20 miles.
When you invoked trzo
to build your dataset, it read your definition file, then fetched the source data from the specified URI and applied the filter
transformation to it to produce a new dataset named example/yellow-taxi-example
with version 1.0.0
. That new dataset is now part of your local Terrazzo environment. To look at the contents of the dataset, you can export it to CSV like so:
trzo dump example/yellow-taxi-example@1.0.0 --format=csv
…which should produce some output like this:
VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee
1,2022-01-01T22:57:11.000000000,2022-01-01T23:31:55.000000000,2.0,20.1,1.0,N,132,181,1,54.5,1.75,0.5,11.4,0.0,0.3,68.45,0.0,1.25
2,2022-01-01T22:20:45.000000000,2022-01-01T22:56:59.000000000,2.0,22.03,2.0,N,132,87,1,52.0,0.0,0.5,12.37,6.55,0.3,75.47,2.5,1.25
2,2022-01-01T22:54:07.000000000,2022-01-01T23:41:12.000000000,2.0,25.86,1.0,N,132,47,1,70.0,0.5,0.5,10.0,6.55,0.3,89.1,0.0,1.25
2,2022-01-01T22:56:22.000000000,2022-01-01T23:33:02.000000000,1.0,22.23,1.0,N,132,189,1,60.0,0.5,0.5,12.26,0.0,0.3,74.81,0.0,1.25
2,2022-01-01T22:28:32.000000000,2022-01-01T23:08:25.000000000,1.0,20.09,2.0,N,132,231,1,52.0,0.0,0.5,10.0,6.55,0.3,73.1,2.5,1.25
What’s next?
Continue reading this documentation to learn more about generating and sharing datasets using Terrazzo. If you want to build on the taxi example, some possible directions include:
- Creating a new version of
yellow-taxi-example
that transforms the source data in more and different ways - Defining a new dataset that derives from
yellow-taxi-example@1.0.0
as its source dataset - Defining a new dataset that derives from some other dataset and joins with
yellow-taxi-example@1.0.0
to yield a result
Installation
If you think you’ll be using trzo
on a regular basis, consider installing it to a shared location on your computer. The trzo
executable should go in whatever folder is most appropriate for your operating system for storing program files. (On Linux and MacOS this is often /usr/local/bin
.)
The other files are “plugins” that can be optionally loaded as part of the build process for datasets that rely on additional functions, such as those that deal with geospatial data. Plugins are discussed in greater detail elsewhere in this documentation, but for the purposes of installation, they should be installed wherever is most appropriate for your operating system for storing library files. (On Linux and MacOS this is often /usr/local/lib
.)
For computers running Windows, the conventions for installing small applications like trzo
are less well-defined. One possibility is to do as this StackExchange post suggests and place them in a top-level folder such as C:\Tools\Terrazzo
created expressly to be a home for this type of application. See also these instructions for adding folders to the list of folders Windows searches for executables and libraries: https://www.wikihow.com/Change-the-PATH-Environment-Variable-on-Windows
Here are some sample commands for installing Terrazzo to a shared location on your computer:
-
sudo cp trzo /usr/local/bin sudo cp libterrazzo_geometry.so /usr/local/lib sudo cp libterrazzo_nyc_address.so /usr/local/lib
-
sudo cp trzo /usr/local/bin sudo cp libterrazzo_geometry.dylib /usr/local/lib sudo cp libterrazzo_nyc_address.dylib /usr/local/lib
-
IF NOT EXIST "C:\Tools\Terrazzo\" MKDIR C:\Tools\Terrazzo COPY trzo.exe C:\Tools\Terrazzo COPY terrazzo_geometry.dll C:\Tools\Terrazzo COPY terrazzo_nyc_address.dll C:\Tools\Terrazzo