How to Deploy and Schedule Data Pipelines in under 3 minutes

Updated on:

November 16, 2022

In this article, we’re going to discuss how to deploy and schedule data pipelines of any size in just a few minutes using the Shakudo platform. This is the most efficient way to get your code deployed, especially if you have large amounts of data flowing between tasks, and it saves you from doing any maintenance or further configuration. No catch.

You can start following these steps once you’ve built the code for your application in the Sessions environment, pushed it to git, and need a way to deploy it so you can start using it.

Although there are many serverless cloud computing solutions for deploying your code in scheduled pipelines, they’re usually very rigid. The approach we’re going to show you lets you develop more dynamically, for example by transferring big data between tasks, which is very hard to do with other scheduling tools alone. Most often you’ll have to integrate other tools to do this for you, which is a common problem Data Engineers face in their day-to-day work.

Because the Shakudo platform is fully integrated behind the scenes, you don’t have to worry about any of that. Let’s look at how you can easily deploy scheduled pipelines.

What is a data pipeline?

A data pipeline is essentially a series of processes that helps you move your data from one place to another. It involves extracting data from a source, like a database, transforming it into a new format, and then loading it into a target system. For example, you might use a data pipeline to move data from a database to a data warehouse for reporting and analysis, stream data from a social media platform and store it in a database, or transform data from a spreadsheet into a format that can be imported into a CRM system. Data pipelines can be either batch-oriented or real-time, depending on how quickly they process data. They're really useful for managing and working with large amounts of data, and are often used in data engineering and data management tasks.
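To make the extract, transform, load idea concrete, here is a minimal sketch in Python using only the standard library. It is not how Shakudo runs pipelines internally; the sample data, the table name "customers" and the function names are all made up for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical source data; in practice this would come from a database or an API.
SOURCE_CSV = """name,signup_date,plan
Ada,2022-11-01,pro
Grace,2022-11-03,free
"""


def extract(raw_csv):
    """Extract: read rows from the source (here, an in-memory CSV string)."""
    return list(csv.DictReader(io.StringIO(raw_csv)))


def transform(rows):
    """Transform: reshape each row into a tuple and upper-case the plan name."""
    return [(r["name"], r["signup_date"], r["plan"].upper()) for r in rows]


def load(records, conn):
    """Load: write the transformed records into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers (name TEXT, signup_date TEXT, plan TEXT)"
    )
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", records)
    conn.commit()


def run_pipeline(raw_csv, conn):
    """Run the three steps in order and return the number of rows in the target."""
    load(transform(extract(raw_csv)), conn)
    return conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")  # throwaway in-memory target
    print(run_pipeline(SOURCE_CSV, connection))
```

Each function maps to one stage of the pipeline, which is the same split the tasks in a scheduled pipeline usually follow.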

Creating the Pipeline

Let’s get started with creating the pipeline you want to trigger. Here you need to decide which files you want to add and the order in which they’ll be processed. Also decide how often you want this pipeline to be triggered. Once you’ve decided, let’s create the .yaml file.

Create a .yaml file

You can think of a YAML file as a recipe you’re sending to the computer to tell it which steps to take, for example where to find the files you want to run and in what exact order to run them. In other words, YAML is a language commonly used to write configuration files. This is what it looks like:

pipeline:
  name: "distributed_lgbm_pipeline"
  tasks:
  - name: "data prep lgbm model"
    type: "jupyter notebook"
    notebook_path: "dask-lightgbm-aws/dask_lgbm_data_prep.ipynb"
    notebook_output_path: "step1_output.ipynb"
  - name: "train lgbm model"
    type: "jupyter notebook"
    notebook_path: "dask-lightgbm-aws/dask_lgbm_train.ipynb"
    notebook_output_path: "step2_output.ipynb"
  - name: "inference lgbm model"
    type: "jupyter notebook"
    notebook_path: "dask-lightgbm-aws/dask_lgbm_inference.ipynb"
    notebook_output_path: "step3_output.ipynb"

Here we’re creating a pipeline called “distributed_lgbm_pipeline” with three tasks, which will be triggered from the top down: first "data prep lgbm model", then "train lgbm model" and finally "inference lgbm model". Each of these tasks points to the file with the code we want to run in “notebook_path” and writes its output to the file given in “notebook_output_path”.
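If you want to sanity-check a pipeline definition before committing it, a small script like the sketch below can help. It mirrors the YAML above as a Python dict so that it runs without extra libraries. Note that the list of required keys is an assumption based on the keys used in this example, not a documented Shakudo rule, and check_pipeline is a hypothetical helper rather than part of the platform.

```python
# The same structure as the YAML file above, written as a Python dict.
PIPELINE = {
    "pipeline": {
        "name": "distributed_lgbm_pipeline",
        "tasks": [
            {
                "name": "data prep lgbm model",
                "type": "jupyter notebook",
                "notebook_path": "dask-lightgbm-aws/dask_lgbm_data_prep.ipynb",
                "notebook_output_path": "step1_output.ipynb",
            },
            {
                "name": "train lgbm model",
                "type": "jupyter notebook",
                "notebook_path": "dask-lightgbm-aws/dask_lgbm_train.ipynb",
                "notebook_output_path": "step2_output.ipynb",
            },
            {
                "name": "inference lgbm model",
                "type": "jupyter notebook",
                "notebook_path": "dask-lightgbm-aws/dask_lgbm_inference.ipynb",
                "notebook_output_path": "step3_output.ipynb",
            },
        ],
    }
}

# Assumption: every task needs the four keys used in the example above.
REQUIRED_KEYS = ("name", "type", "notebook_path", "notebook_output_path")


def check_pipeline(config):
    """Return task names in top-down order; raise ValueError if a key is missing."""
    tasks = config["pipeline"]["tasks"]
    for position, task in enumerate(tasks, start=1):
        missing = [key for key in REQUIRED_KEYS if key not in task]
        if missing:
            raise ValueError(f"task {position} is missing: {', '.join(missing)}")
    return [task["name"] for task in tasks]


if __name__ == "__main__":
    for step, name in enumerate(check_pipeline(PIPELINE), start=1):
        print(step, name)
```

The returned list preserves the order of the "tasks" entries, which is the order the article describes them being triggered in.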

Remember to commit your changes to the git branch you’re using. This way the backend of the platform can recognize any changes you make during development and synchronize them in real time.

Deploying Your Data Pipeline

That is basically all the code you’ll need to deploy your pipeline on the Shakudo environment, so let’s get started. Back on the main page of the Shakudo Platform, click on Jobs > Scheduled > Create a new scheduled job.

In this window, the main things you need to fill in are the job type, which needs to match the session type used to commit the code; the path to the .yaml file you created; and how often you’d like this pipeline to run, written as a cron expression. If you don’t know how to use cron, you can visit https://crontab.guru/ for more information.
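If cron is new to you, a standard cron expression has five space-separated fields: minute, hour, day of month, month and day of week. The short sketch below just labels those fields so the schedule is easier to read. It assumes the standard five-field format; whether the platform accepts extensions such as a seconds field is not covered here, and describe_cron is a hypothetical helper.

```python
# The five fields of a standard cron expression, in order.
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def describe_cron(expression):
    """Split a five-field cron expression into a dict of named fields."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(
            f"expected {len(CRON_FIELDS)} fields, got {len(parts)}: {expression!r}"
        )
    return dict(zip(CRON_FIELDS, parts))


if __name__ == "__main__":
    # "0 6 * * 1" means 06:00 every Monday.
    print(describe_cron("0 6 * * 1"))
    # "*/15 * * * *" means every 15 minutes.
    print(describe_cron("*/15 * * * *"))
```

For example, "0 0 * * *" runs the pipeline once a day at midnight.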

Other fields you can optionally fill in are the git branch or commit ID you want to use, parameters you’d like to change in your code when deploying, and the name of the pipeline. When you’re all set, just click the Create button and that’s it. Your pipeline is deployed and fully integrated with the platform. There is no need to worry about data management between tasks, complicated further configuration, or maintenance. The platform also makes sure that your code runs in a cost-effective way, so you don’t have to worry about spending money on idle infrastructure.

Your Application Is Deployed!

Among the many problems Data Engineers face when they try to put their software into production are data pipeline maintenance and having too much data to handle. Using Shakudo’s pipeline system, you can get everything working faster and more smoothly, with no maintenance needed after it is deployed. If you need to debug your job, just look for it in the jobs table and click on View Menu > Logs. This will take you to Grafana, where you’ll be able to see all the logs from your job.

If you’re working with large amounts of data, you can also benefit from our integrated distributed computing frameworks, which let you use the power of distributed computing by spinning up Dask clusters cost-effectively with just one line of code. That way you can work with large amounts of data simply and quickly.

The Shakudo Factor

It doesn't stop there! Shakudo is an end-to-end platform built for you to create, deploy and integrate your whole application seamlessly and quickly. We understand that things need to be dynamic, and with the right platform this doesn’t have to mean a large team or a long development process. Data science no longer depends on the engineering team for support, and data teams have more autonomy because they can deploy their code directly into production.

How Are Customers Using This?

Quantum Metric is one of Shakudo’s customers with a very interesting use case for the scheduling jobs feature. The pipeline jobs and services convert the data science team's code directly into production jobs so they can see the impact right away and there is no more Data Science dependence on the engineering team for DevOps.

The result is a more productive data science team that is able to turn Proofs of Concept (POCs) into useful products and get real value out of machine learning, with everything related to infrastructure management and integrations abstracted away.

In terms of data visualization, our dashboarding tools also enable the data science team to demo their experiments and serve models to a frontend without the help of the engineering team, just by clicking a few buttons on the Shakudo platform.

Quantum Metric works with terabytes of streaming data by its nature, which is very hard to handle with traditional MLOps or data platforms. Being able to work with and easily manage the big data coming from their product has had a great positive impact on their workflow, and their data team is now able to scale from experimentation to real-size production data with just one line of code using Shakudo’s built-in distributed systems.