The Modern Data Stack

The Modern Data Stack

Brian Kane June 10, 2021

Our Modern Data Stack with DBT, Fivetran, Snowflake, SeekWell, and ThoughtSpot

At SeekWell, we're hard at work integrating with our new parent company, ThoughtSpot, and imagining the future of analytics workflows. Selecting the right tools is critical to making that journey a success, so we wanted to share what we're using. Here's a glimpse of how we're leveraging dbt, Fivetran, Snowflake, ThoughtSpot, and of course, SeekWell, to get the most out of our data.

The Modern Data Stack

You probably have heard a recent buzzword in the data world: the "Modern Data Stack". The Modern Data Stack is built on the new cloud-native technologies that have emerged in the last decade that are fast, reliable, scalable, and most importantly, accessible everywhere.

Some of the technologies that have made this possible are MPP (massively parallel processing) cloud data warehouses like Redshift and BigQuery, ingestion tools like Stitch and Fivetran that have improved reliability and connector coverage, and analytics platforms like ThoughtSpot, that enhance users' experience finding and sharing data insights, and make that data accessible to everyone, everywhere.

DBT

Before you get to those data insights, though, you need a rock-solid foundation of modeled, documented, and version-controlled data. That's where dbt (Data Build Tool) comes in. It's an open-source framework for turning raw production data into clean analytics tables using simple "Select" statements written in SQL. We're already somewhat familiar with this concept at SeekWell, as we help people use "Select" statements to extend data pipelines to other apps. Importantly, this means the only prerequisite to start transforming data is to be able to write SQL (don't worry if you don't know SQL—the BI part of the data stack is still for you).

This is huge for empowering data analysts and engineers, who might otherwise be stuck endlessly writing and re-writing SQL to answer the same questions asked by product managers. No one wants to be stuck in the loop of "Hey, can you pull this data again for me?" or "These numbers don't look right, how are you defining x?".

Instead, data engineers should be creating the foundational definitions and metrics that empower product managers to self-serve answers, and providing rich documentation for others to reference if they do have questions about metric definitions or data integrity.

dbt makes all of this easy by providing a framework for doing data transformations that has documentation and testing built right in. Here is how we are using both the documentation and the tests in dbt to better standardize, clarify and verify our data transformations.

Auto-generated Documentation

If you're not familiar, a dbt project consists of "models" (which can be thought of as "select" statements in SQL that produce either tables or views) and a "schema.yml" file that defines at a high-level each model. In the "schema.yml" file, you can write descriptions for each table and column generated by dbt. For example, we have a "stripe_customers" table where we keep track of all of our customer-level data in Stripe (our payment processor). The "id" field is the primary key, and so should be unique and not null as defined below:

image

Then, when we want to generate docs based off of this model, we simply run the command dbt docs serve and a static site of your dbt model documentation is created. When others want to see the descriptions and/or tests for each column they can look those up in the documentation like below.

image

Even cooler, in the docs you can easily see what other models and/or tables each model both depends on and is referenced by. In the case of our model "stripe_mrr_by_month"—a table that shows monthly recurring revenue by customer by month—you can see below that it depends on the "stripe_customers", "stripe_subscriptions", "user_facts", and "user_team_lookup" models. If you find a bug (which your tests should find for you!) like a customer having two records in a single month, you know exactly where to go to fix the bug.

image

There's even a "Lineage Graph" that illustrates how different models combine to create other models. To the right, you can see visually how other models flow into the "stripe_mrr_by_month" model. If you found, for example, that two customers have separate records but are really on the same team and so should be the same "customer", you know that the broken logic is in the "user_team_lookup" model. Once you fix it there, all the models that depend on that lookup will be immediately updated and fixed. This allows for clear, verifiable debugging when errors crop up, along with fewer questions and fixes after development.

When there's little or even no maintenance to do on existing models, data engineers can focus on building other models and features!

image

Tests

In addition to schema tests defined in the "schema.yml" file, there are "Data Tests" in dbt that are essentially queries that should always return 0 rows, but will return rows if something is wrong. In our case, we have a test to make sure the total payment amount calculated in our "stripe_customers" table generated by dbt is the same as the total payment amount in our pre-transformed payments table. An example test for this could look like below:

with transformed_table as (
select sum(payment_amount) as payments, month from {{ ref('transformed_table' )}}
group by month
),
raw_table as (
select sum(payment_amount) as payments, month from public.raw_data
group by month
)
select t.payments as calculated_payments,
r.payments as original_payments, t.month
from transformed_table as t left join
raw_table as r on t.month = r.month
where a.payments <> b.payments

In this (very simple) test, payments are aggregated by month in both the "transformed" table and the original raw data table that the transformations are built off of. Then they're joined together by month, with the added condition that the payment amounts must be different. Therefore, this query should return 0 rows since the payment amount total calculated by the transformed table should equal the payment amount in the raw data for every month. But if for some reason they are different in any month, this query will return rows and the dbt test will fail. You can configure the test to trigger an email whenever it fails so you're always informed right away.

So instead of waiting for somehow to say, "Hey, the payment amount in this table doesn't match the payment amount in this other table", you can write a test so that you know that they always match, and are alerted when they don't. Tests are also integrated directly into the docs so others can look up the tests that are associated with each model.

Getting Data out of Siloes

Now all this data transformation in dbt is great for having clean, standardized, documented analytics tables, but what if that final data is "trapped" in a database that's not being used for production analytics? It doesn't matter how hard you work to create clean data if it's not actually getting into the hands of business operators and frontline decision makers, as research from Harvard Business Review points out.

In our case, the production database used by ThoughtSpot is a Snowflake database that is separate from our own PostgreSQL analytics database. Snowflake is a great database for analytics, and works great with ThoughtSpot, so it's a no-brainer for us to send our analytics data to Snowflake for company-wide use and analysis.

Instead of tearing down our database and building everything from scratch again in Snowflake (which would entail re-configuring connectors, application code, and a bunch of other complications), we decided to just keep doing our data transformations on our PostgreSQL database, and then only send our final, clean analytics tables to Snowflake.

Fivetran

To do this, we used Fivetran to regularly send the analytics tables generated by dbt to Snowflake. In Fivetran, this was as simple as entering in the connection details for the source and destination, and just choosing the tables to be sent. If, however, your tables generated by dbt are being fully recreated every time a job runs, instead of built incrementally, then Fivetran may not be able to tell which rows are new and so will do a full re-sync. This may not be ideal for some so you should find the solution that works best with your data and job-cadences.

Snowflake

Once the analytics data is in Snowflake, we can take advantage of some of Snowflake's awesome features like the Data Marketplace (which we wrote about here). We leveraged the Crunchbase data in Snowflake, for example, to help with scoring and categorizing leads.

You can also use external libraries in Snowflake, for example the StarSnow HTTP client, which allows you to interact with web APIs directly from SQL statements. We've even leveraged Snowflake external functions ourselves to create a SeekWell function, so you can call SeekWell features (like sending data to Sheets or other apps like Salesforce), right from the Snowflake UI.

Finally, Snowflake works great with ThoughtSpot, so once your data is in Snowflake, it's easy to start leveraging all of ThoughtSpot's features like Search and ThoughtSpot Everywhere so finding data insights can be as easy as a Google Search, and done from anywhere. You can try using ThoughtSpot right away with the free trial.

Using SeekWell to push insights elsewhere

We can't talk about our internal data stack without mentioning SeekWell, which is of course an integral part of our data operations. We don't have enough time to go through all the ways that we use SeekWell internally, but here's one example: sending our modeled Stripe data (as described above in the dbt section) to Intercom, where the data can be used to customize messaging.

Stripe Modeled Data → Intercom

Once our Stripe data was modeled, we wanted it in our messaging provider, Intercom, so that payment and subscription changes could be set to trigger automatic messages (e.g. a bounced payment triggers a custom email asking to re-submit payment information).

To implement this, we leveraged SeekWell to take the final data points from our dbt model and sync them with Intercom. This is really simple to do with SeekWell's mapping tool. You just run the query with the columns you want to send to Intercom (or another destination) and map each column to its corresponding Intercom field.

image

Once you've done that, and checked that the sync is working in Intercom, it's just a few more clicks to make that sync happen on a regular schedule. Just select the time and period (e.g. hourly, daily, weekly) under "Repeat" and your sync will run regularly. Easy, right?

image

These were just a few examples of the ways in which we're leveraging tools like dbt, Snowflake, ThoughtSpot, Fivetran, and SeekWell to make our data operations smooth and performant. It's important for us not only that we use our own tools internally, but to make sure we're building tools that are additive to—and interoperable with—all the other great data tools out there in the "Modern Data Stack".

It's an exciting time to be in the data space, and we know the tools are only going to keep getting better from here!