Data Mesh showcase with dbt-trino

2021-09-19trinodbt

Perform data transformations over a Data Mesh of several databases with dbt-trino adapter for dbt.

Introduction

One frequently asked question in the context of using dbt tool is:

Can I connect my dbt project to two databases?

(see the answered question on the dbt website).

tldr; dbt stands for transformation as in T within ELT pipelines, it doesn’t move data from source to a warehouse.

The creators of the dbt however have added support for handling such scenarios via dbt-presto adapter.

Trino is a fork of the popular presto high performance, distributed SQL query engine for big data. This SQL query engine offers a helping hand in performing SQL queries on top of a myriad of data sources. Trino supports talking to the common relational databases (Postgres, MySQL) and also to data sources that don’t support SQL (AWS S3, Apache Kafka, Apache Cassandra, etc.). Feel free to check the list of supported Trino connectors for more details.

By using Trino, there can be queried data from fully separated databases. This makes Trino the analytics engine for data mesh (quote from Starburst Data website).

dbt Trino Architecture Image taken from Trino Community Broadcast

Data Mesh

Data Mesh is a paradigm to the data engineering domain which provides an alternative to the common recipe of using a centralized, monolithic data warehouse.

The principles on which this paradigm is being founded are quoted below:

domain-oriented decentralization of data ownership and architecture
domain-oriented data served as a product
self-serve data infrastructure as a platform to enable autonomous, domain-oriented data teams
federated governance to enable ecosystems and interoperability.

`jaffle_shop` Data Mesh

jaffle_shop is a self-contained project showcasing the functionality of dbt data warehouse transformation tool.

In the context of the project jaffle_shop there are being used the domains:

customers
orders (customers make orders)
payments (each completed order has a corresponding payment)

Entity Relationship Diagram image taken from jaffle_shop project

In the tutorial jaffle_shop project, all the domain entities (customers, orders, payments) as well as the results of the dbt transformations are obviously located in tables within the same database.

This blog post provides the answer to the questions answered by the project jaffle_shop in a data mesh decentralized data warehouse context where each domain (customer, order, payment) is being stored in a separate database. The insights gained from the dbt transformations will be also saved in a separate database.

Demo

Check out the dbt_jaffle_shop_data_mesh project to try out the concepts described in this blog post.

The project is using docker to spin up the jaffle_shop Data Mesh environment and for performing the dbt transformations. No other local installation or configuration of a tool is required on the workstation used for testing this showcase.

`dbt-trino`

dbt stands for transformation as in T within ELT pipelines, it doesn’t move data from source to a warehouse. dbt shines at orchestrating the transformations performed inside a data warehouse.

Below is a quote taken from Trino: The Definitive Guide book, Chapter 1, Introducing Trino :

Data warehouse systems have created not only huge benefits for users but also a burden on organizations:

Running and maintaining the data warehouse is a large, expensive project.

Dedicated teams run and manage the data warehouse and the associated ETL processes.

Getting the data into the warehouse requires users to break through red tape and typically takes too much time.

Trino, on the other hand, can be used as a virtual data warehouse. It can be used to define your semantic layer by using one tool and standard ANSI SQL. Once all the databases are configured as data sources in Trino, you can query them. Trino pro‐ vides the necessary compute power to query the storage in the databases. Using SQL and the supported functions and operators, Trino can provide you the desired data straight from the source. There is no need to copy, move, or transform the data before you can use it for your analysis.

Thanks to the standard SQL support against all connected data sources, you can cre‐ ate the desired semantic layer for querying from tools and end users in a simpler fashion. And that layer can encompass all underlying data sources without the need to migrate any data. Trino can query the data at the source and storage level.

dbt-trino is a dbt community adapter which offers a possible solution in dealing with transformations performed over a virtually distributed data warehouse.

By using this adapter, dbt is still interacting with only one SQL query engine, but in case of Trino, this is a fast distributed SQL query engine for big data analytics that helps you explore your distributed data universe.

trino> show tables in insights.default;
Table
-------------------
customer_orders
customer_payments
dim_customers
fct_orders
order_payments
stg_customers
stg_orders
stg_payments
(8 rows)

Query 20210918_214153_00126_zb79t, FINISHED, 1 node
Splits: 19 total, 19 done (100.00%)
0.21 [8 rows, 240B] [37 rows/s, 1.09KB/s]

As showcased in the demo project, in the snippet above, note that the tables:

stg_customers
stg_orders
stg_payments

are actually views, virtual tables, whose contents are defined by queries that are actually performed on external databases via Trino.

In the model dim_customers.sql from the demo project:

with customers as (

    select * from {{ ref('stg_customers') }}

),

customer_orders as (

    select * from {{ ref('customer_orders') }}

),

customer_payments as (

    select * from {{ ref('customer_payments') }}

),

final as (

    select
        customers.customer_id,
        customer_orders.first_order,
        customer_orders.most_recent_order,
        customer_orders.number_of_orders,
        customer_payments.total_amount as customer_lifetime_value

    from customers

    left join customer_orders on customers.customer_id = customer_orders.customer_id

    left join customer_payments on customers.customer_id = customer_payments.customer_id

)

select * from final

the data corresponding to the domain models of the jaffle_shop project:

customer
order
payment

is being joined in order to provide insights across all the aforementioned domain models.

Conclusion

In case that this post made you interested, feel free to check out the dbt-trino dbt adapter.

Try out on your workstation dbt_jaffle_shop_data_mesh to get you a quick look&feel on how dbt and Trino can be used together effectively for performing transformations on top of data distributed across various silos.

Feedback is very much welcome.