Skip to main content Link Menu Expand (external link) Document Search Copy Copied

When your data is too big for Excel but still small enough to fit on a thumb drive

Terrazzo is desktop ETL for small- to medium-sized datasets. Create a formal description of a dataset and allow your collaborators to reproduce it from its component parts. Or make it available over the web for others to use in their own computations.

Quickstart Join us on GitLab

This is the documentation site for the Terrazzo data management framework.

Terrazzo is a framework and toolchain for generating and publishing columnar datasets in the Parquet file format. Using Terrazzo, you can quickly transform existing public or proprietary datasets by filtering and combining them using familiar relational operations, or compute additional data points by applying user-defined functions across millions of records. Terrazzo’s tools can operate on files from your local filesystem or pull data from the web or from file hosting services such as Amazon S3 and Google Drive.

Terrazzo supports an easy-to-edit, human-readable format for describing datasets and transformation operations. The keywords and directives in a Terrazzo build definition are close to plain English, so you don’t need to be a programmer to produce complex and useful output.

Terrazzo was designed to support groups of researchers working asynchronously and independently. After you’ve generated the data you need, you can export it (as Parquet or CSV or JSON) for use with your preferred data science toolchain, or you can publish it to a Terrazzo repository so that others can pick up where you left off by deriving their own datasets based on your work.

Getting started

Follow these steps to get started producing

  1. Browse to the Terrazzo Releases page and download the zip file for your operating system.

  2. Extract the contents to a location of your choosing. The Terrazzo client package with the Terrazzo command-line tool trzo and the Terrazzo Geometry plugin for geospatial computations. To take Terrazzo for a test drive you can leave these files where they are and temporarily update the following environment variables. See Installation when you are ready to create a permanent home for Terrazzo.

Before setting these variables, ensure that you have unzipped the files to their own subfolder, or to another folder on your computer that includes only files that you trust.

  • cd [FOLDER WHERE YOU PLACED THE FILES]
    export PATH="$PATH:$PWD"
    export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$PWD"
    
  • cd [FOLDER WHERE YOU PLACED THE FILES]
    export PATH="$PATH:$PWD"
    export DYLD_LIBRARY_PATH="$DYLD_LIBRARY_PATH:$PWD"
    
  • No changes necessary.

  1. Initialize a new local Terrazzo environment. Run:
    trzo init
    

You’re ready to compute!

Let’s create a toy example dataset. Paste the following block of code into a text editor of your choice:

{
  "name": "example/yellow-taxi-example",
  "author": "[YOUR_EMAIL_ADDRESS]",
  "version": "1.0.0",
  
  "from": {"uri": "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet"},
  
  "transform": [{
      "operation": "filter",
      "by": {
        "left": {"col": "trip_distance"},
        "op": ">",
        "right": {"lit": 20.0}
      }
  }]
}

and save it as terrazzo.json in a location of your choosing. Open a command line in the same folder and run:

trzo build .

(Don’t forget the dot!) You should see some output like this:

(1/2) Fetching uri https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet
(2/2) Filter primary dataset @f90c137b by  to produce @93b63d39
example/yellow-taxi-example@1.0.0

What just happened?

You created a new dataset definition file for a new dataset named example/yellow-taxi-example to be derived from New York City’s TLC Trip Record Data for January 2022. (That dataset already happens to be in Parquet format, so Terrazzo can operate with it directly.) You can see the format of the trip record data here. Your definition then specifies that you want to filter that initial dataset to include only those trips that had a distance greater than 20 miles.

When you invoked trzo to build your dataset, it read your definition file, then fetched the source data from the specified URI and applied the filter transformation to it to produce a new dataset named example/yellow-taxi-example with version 1.0.0. That new dataset is now part of your local Terrazzo environment. To look at the contents of the dataset, you can export it to CSV like so:

trzo dump example/yellow-taxi-example@1.0.0 --format=csv

…which should produce some output like this:

VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee
1,2022-01-01T22:57:11.000000000,2022-01-01T23:31:55.000000000,2.0,20.1,1.0,N,132,181,1,54.5,1.75,0.5,11.4,0.0,0.3,68.45,0.0,1.25
2,2022-01-01T22:20:45.000000000,2022-01-01T22:56:59.000000000,2.0,22.03,2.0,N,132,87,1,52.0,0.0,0.5,12.37,6.55,0.3,75.47,2.5,1.25
2,2022-01-01T22:54:07.000000000,2022-01-01T23:41:12.000000000,2.0,25.86,1.0,N,132,47,1,70.0,0.5,0.5,10.0,6.55,0.3,89.1,0.0,1.25
2,2022-01-01T22:56:22.000000000,2022-01-01T23:33:02.000000000,1.0,22.23,1.0,N,132,189,1,60.0,0.5,0.5,12.26,0.0,0.3,74.81,0.0,1.25
2,2022-01-01T22:28:32.000000000,2022-01-01T23:08:25.000000000,1.0,20.09,2.0,N,132,231,1,52.0,0.0,0.5,10.0,6.55,0.3,73.1,2.5,1.25

What’s next?

Continue reading this documentation to learn more about generating and sharing datasets using Terrazzo. If you want to build on the taxi example, some possible directions include:

  • Creating a new version of yellow-taxi-example that transforms the source data in more and different ways
  • Defining a new dataset that derives fromyellow-taxi-example@1.0.0 as its source dataset
  • Defining a new dataset that derives from some other dataset and joins with yellow-taxi-example@1.0.0 to yield a result

Installation

If you think you’ll be using trzo on a regular basis, consider installing it to a shared location on your computer. The trzo executable should go in whatever folder is most appropriate for your operating system for storing program files. (On Linux and MacOS this is often /usr/local/bin.)

The other files are “plugins” that can be optionally loaded as part of the build process for datasets that rely on additional functions, such as those that deal with geospatial data. Plugins are discussed in greater detail elsewhere in this documentation, but for the purposes of installation, they should be installed wherever is most appropriate for your operating system for storing library files. (On Linux and MacOS this is often /usr/local/lib.)

For computers running Windows, the conventions for installing small applications like trzo are less well-defined. One possibility is to do as this StackExchange post suggests and place them in a top-level folder such as C:\Tools\Terrazzo created expressly to be a home for this type of application. See also these instructions for adding folders to the list of folders Windows searches for executables and libraries: https://www.wikihow.com/Change-the-PATH-Environment-Variable-on-Windows

Here are some sample commands for installing Terrazzo to a shared location on your computer:

  • sudo cp trzo /usr/local/bin
    sudo cp libterrazzo_geometry.so /usr/local/lib
    sudo cp libterrazzo_nyc_address.so /usr/local/lib
    
  • sudo cp trzo /usr/local/bin
    sudo cp libterrazzo_geometry.dylib /usr/local/lib
    sudo cp libterrazzo_nyc_address.dylib /usr/local/lib
    
  • IF NOT EXIST "C:\Tools\Terrazzo\" MKDIR C:\Tools\Terrazzo
    COPY trzo.exe C:\Tools\Terrazzo
    COPY terrazzo_geometry.dll C:\Tools\Terrazzo
    COPY terrazzo_nyc_address.dll C:\Tools\Terrazzo