Simplifying Data Pipeline Management with Shakudo and DBT

February 6, 2023

The Data Pipeline Struggle

The increasing volume and complexity of data have made the management of data pipelines a challenging task for many data professionals. In response, DBT is becoming a leading solution for streamlining the way data is managed and analyzed.

DBT is an open-source command-line tool that helps analysts and data engineers create, test, and maintain SQL code. It's quickly becoming the go-to tool for data transformation. It is specifically designed to simplify the transformation step of ELT (Extract, Load, Transform) processes. You write each transformation as a SQL SELECT statement, and DBT turns the result into a table or view, which makes the transformation process simpler and more effective.

data transformation pipeline diagram visual with dbt

What Problem Does DBT Solve?

Before tools like DBT, data pipeline management posed a significant challenge for data professionals. Many extensive tasks had to be done manually: monitoring dependencies between models, testing data accuracy, and updating data correctly were all complicated and time-consuming.

The complexity of data pipelines, with multiple steps and components, also made it difficult to keep track of everything and detect potential issues. DBT addresses this by automating the management of dependencies between models, documenting the data pipeline and its lineage, and testing data quality.

DBT’s lineage visualization also provides a clear view of the data flow, making it easier to identify potential issues in the data pipeline. You can write your transformations and apply your business logic to take your data from its raw format, then standardize and curate it into the format you need for downstream processing. In parallel, version control is managed through Git, so code changes can be continuously deployed to a test environment for early feedback and validation.

Optimizing Even Further

Shakudo integrates dozens of open-source tools to optimize your work with data, including DBT, Superset, Airflow, Grafana, and many others. By taking advantage of this, data teams can build their data products inside the Shakudo platform from ideation to production, streamlining the process and saving time and resources.

Inside the Shakudo platform, on the Apps tab, you’ll be able to find DBT as one of the dozens of open-source applications that have been integrated.

Apps tab on Shakudo

Inside the Shakudo DBT Dashboard, you’ll find the documentation of your DBT projects, including queries and relations, along with a visual representation of the lineage of each of these items. Here’s a simple example of the documentation for the user table.

(example of the DBT Dashboard)

Getting Started

To create a new DBT project, in the terminal, navigate to the directory where you want to create your DBT project and run ‘dbt init’, which will create a new directory with the basic file structure for a DBT project.

Your queries will be stored in the models folder, along with the schema.yml file, which is a configuration file that holds the descriptions and tests for the tables and views within a DBT project. It is also where you declare relationships between tables and expectations on individual columns, such as a key being unique and never null. Newer DBT versions can additionally declare column data types and constraints here through model contracts.

As a simple example, a schema.yml file can define a single model called users, which is created from a SELECT statement on a table called raw_users. Its columns section describes the structure of the table, with id as the primary key and name and email as varchar(255) columns.

This example also tests that all emails in the table are unique. DBT allows for a wide range of tests, from the built-in column checks (unique, not_null, accepted_values and relationships) to custom SQL tests. A custom test is a SELECT statement that returns the rows violating your rule, and the test passes when it returns no rows.
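A minimal sketch of the two files described above, written from the shell so it can be pasted into a terminal at the root of a DBT project. The SQL in users.sql, the column descriptions and the choice of tests are assumptions made for illustration, not the original example. Column data types are left out, because how they are declared depends on your DBT version.

```shell
# Run from the root of your dbt project (the directory created by 'dbt init').
mkdir -p models

# models/users.sql: the model is a single SELECT over raw_users.
# Assumes raw_users already exists in the schema your profile points to.
cat > models/users.sql <<'EOF'
select
    id,
    name,
    email
from raw_users
EOF

# models/schema.yml: descriptions and tests for the users model.
cat > models/schema.yml <<'EOF'
version: 2

models:
  - name: users
    description: "One row per user, selected from raw_users."
    columns:
      - name: id
        description: "Primary key of the table."
        tests:
          - unique
          - not_null
      - name: name
        description: "Name of the user."
      - name: email
        description: "Email address of the user."
        tests:
          - unique
EOF
```

With these files in place, ‘dbt run’ builds the users model and ‘dbt test’ runs the three tests declared in schema.yml.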

The Deployment 

Let’s see an example of how you can start running your DBT project on Shakudo. 

After creating your Session on the platform, open your IDE of choice and follow the steps listed above if this is your first time using DBT in this environment. If you don’t know how to connect your preferred IDE to Shakudo yet, watch our video tutorial “SSH connect to Shakudo in 3 minutes” here.

Now that you have the skeleton of your DBT project created, let’s configure the additional files you need to deploy your project. The first step is to create a bash file (for this example “run_dbt.sh”) containing the commands that run your DBT project.
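As a rough sketch, the commands below write one possible version of “run_dbt.sh” and make it executable. The contents are an assumption, not the original script: it assumes the script sits next to dbt_project.yml, that a connection profile is already configured in the environment, and that you want to build, test and document the project on every run.

```shell
# Write a hypothetical run_dbt.sh; adjust the steps to your own project.
cat > run_dbt.sh <<'EOF'
#!/usr/bin/env bash
# Stop at the first failing command so a broken step fails the whole job.
set -euo pipefail

# Assumption: this script lives in the same directory as dbt_project.yml.
cd "$(dirname "$0")"

dbt run             # build the models
dbt test            # run the tests declared in schema.yml
dbt docs generate   # regenerate the documentation artifacts
EOF

# Make the script executable so a job can call it directly.
chmod +x run_dbt.sh
```

Failing fast with ‘set -euo pipefail’ means a failed ‘dbt run’ is reported as a failed job instead of being hidden by the steps that follow it.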

The second step is to create a YAML file (for this example “job_pipeline.yaml”) that defines the job pipeline which runs the bash file.

After all your files are created, commit your project to Git.

Finally, if you would like to create a scheduled job to run it, just go to the Shakudo Home Screen > Jobs > Scheduled > Create a new scheduled job.

Adding a scheduled job in Shakudo

On this tab, paste the path to the “job_pipeline.yaml” file and set your preferred schedule for it. You can also set up other configurations. For example, if your application requires any variables or settings to be passed to the job at runtime, you can add them as ‘Key’ and ‘Value’ pairs in the parameters field.

schedule jobs parameters on Shakudo

After filling in all necessary fields for your use case, click the Create button and your job will start running in the background immediately. To access the documentation for it, inside the WebApp home page, go to Apps > All > DBT Dashboard.

To access the details generated from the job, go to Jobs > Scheduled, search for your job’s name and open the View menu.

scheduling jobs on shakudo

There you can view the job details, clone the job, or access its logs on Grafana to track the progress and status of the job and to troubleshoot any issues that may arise.

Versioning with DBT

One of the most helpful aspects of working with DBT is how well it fits with version control. A DBT project is a set of plain-text SQL and YAML files, so you can keep it in a source control system such as Git and track every change made to your models over time, as well as who made it. DBT does not create versions on its own when you run it; a new version exists whenever you commit a change.

Each time you change the SQL code in a model's file, commit the change to your source control system before running ‘dbt run’. This is a great way to help data teams test and validate changes before they are pushed, and to roll a model back to a previous version if a mistake reaches the production environment.

To compare different versions of a model, use your source control system rather than DBT itself. With Git, the command ‘git diff <commit> -- <path to the model's file>’ shows the differences between the specified version and the current version of the model.

To roll back to a previous version of the model, check out the previous version of the model's file from your source control system and then run ‘dbt run’ again.
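The compare and rollback steps can be sketched with plain Git commands. The example below is self-contained: it builds a scratch repository that stands in for a DBT project, so the repository, the file contents and the commit messages are all hypothetical. The final ‘dbt run’ is left as a comment because it needs a configured warehouse connection.

```shell
# Scratch repository standing in for a dbt project; every name here is hypothetical.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

# Version 1 of the model.
mkdir -p models
echo "select id, name, email from raw_users" > models/users.sql
git add models/users.sql
git commit -q -m "Add users model"

# Version 2 changes the SQL.
echo "select id, name, lower(email) as email from raw_users" > models/users.sql
git commit -q -am "Lowercase emails in users model"

# Compare the previous version of the model with the current one.
git --no-pager diff HEAD~1 -- models/users.sql

# Roll back: restore the previous version of the file, then record the rollback.
git checkout HEAD~1 -- models/users.sql
git commit -q -m "Roll back users model"

# In a real project you would now rebuild the model in the warehouse:
# dbt run --select users
```

Recording the rollback as a new commit, instead of rewriting history, keeps a trace of both the mistake and its fix.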

Please note that these commands may vary according to your own implementation of the model, and you must adapt them accordingly.

Still In Doubt?

DBT is becoming a must-have tool for the data science stack because it addresses the common challenges data teams face when working with data pipelines, like reusability, maintainability, data governance, and collaboration. One of the key benefits of using DBT inside the Shakudo platform is the seamless integration of your entire workflow with several other tools.

By having DBT integrated into Shakudo, data professionals no longer have to go through the hassle of setting up, connecting, and maintaining dozens of different integrations. It makes the data pipeline management process more efficient and streamlined, allowing teams to focus on more important tasks such as data analysis and modeling. It also ensures that the data pipeline is reliable, accurate, and easy to maintain, as all the necessary tools and connections are already in place.

Other examples of tools integrated inside Shakudo that help the platform provide a complete end-to-end data pipeline solution are Superset, Airflow, Dask and Grafana. To learn more about these tools and how they work on Shakudo, you can read our blog posts and watch our walkthrough videos.

You can also find all of our current integrations here

Sabrina Aquino

Sabrina is a creative Software Developer who has managed to create a huge community by sharing her personal experiences with technologies and products. A great problem solver, an above average Age of Empires II player and a mediocre Linux user. Sabrina is currently an undergraduate in Computer Engineering at UFRN (Federal University of Rio Grande do Norte).
