Sign up to our upcoming webinar on Feb 16, 2022! We have speakers from NVIDIA, Oracle, and Shakudo doing a technical deep dive into how to scale up machine learning from using CPUs all the way to multi-GPUs.
In our latest webinar, Shakudo and the NVIDIA RAPIDS team got together to showcase how RAPIDS on Shakudo can accelerate and scale the processing of massive volumes of geospatial data. Stella Wu, our Head of Machine Learning, and NVIDIA’s Nathan Stephens, Senior Manager of RAPIDS Developer Relations, introduce how to use RAPIDS on Shakudo. Data scientists across industries face similar challenges, including data preparation, managing different data sources, and collaborating effectively with engineers and other team members. Common bottlenecks stem from the sheer volume of data available today, combined with operational needs and the demands of working in multidisciplinary teams. But even if data is processed more slowly than one hopes, is it such a big deal?
In short, yes. Often, even small delays in data aggregation can compound and lead to costly delays and inefficiencies. Needless to say, there has been room for improvement in data aggregation and efficiently getting solutions to market.
Shakudo is our solution to help data scientists mobilize and scale ideas, at times reaching performance speed-ups of up to 10,000x.
Launched at the beginning of 2021, Shakudo is an end-to-end platform that empowers AI teams by putting data scientists in the driver’s seat. The user has complete control, from pushing the gas on development through to steering production. Shakudo integrates all of the tools data scientists know and love, so data scientists and developers can focus on – what?! – developing! There is no limit to the type of input or volume of data you’re working with: all you need to do is begin coding. With short and sweet one-liners of code, Shakudo gives users access to multiple GPUs to go from small data to petabyte-scale ETL. This means you can easily scale with the most commonly used distributed computing frameworks, like Dask, Ray, or Spark.
Now, processing massive amounts of geospatial data in a short time frame at a rate pushing 10,000x is not only possible, but can be done on a user-friendly platform.
RAPIDS is a suite of open source, GPU-accelerated libraries. These libraries help data scientists with workflows and pipelines, such as data preparation, analytics, and visualization.
Traditional analytics is typically done on a CPU, but this type of processor comes with limitations. Architecturally, a CPU is composed of just a few cores with lots of cache memory and can handle a few software threads at a time. In contrast, a GPU is composed of hundreds of cores that can handle thousands of threads simultaneously. To get past these limitations, we can move the work from the CPU to the GPU. Doing so normally means refactoring your code, but RAPIDS makes that much easier.
One major pull of using RAPIDS on Shakudo is just how easy it is to move between pandas and cuDF.
cuDF is a Python GPU DataFrame library for loading, joining, aggregating, filtering, and manipulating data. For exploring data, pandas and cuDF look much the same: the API is basically identical, so code written for one reads almost like a duplicate of the other and the user won’t experience translation issues.
There are still some notable differences between cuDF and pandas when preparing data, for example in how columns are separated. In the movie dataset, categories had been separated with pipes. To organize the information into different columns with pandas, you can use one command and the “add” prefix option. With cuDF, the prefix option is unavailable, so additional steps are needed to get the same result. On the flip side, pandas needs a concatenate step, whereas cuDF does not: you get the same result with one command instead of two. While not all pain points between the two libraries have been mended, the RAPIDS team has aimed to make refactoring your code a little less difficult.
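As a rough illustration of the pandas side of that comparison, here is a minimal sketch on a toy table. The column names, the sample rows, and the `genre_` prefix are assumptions made for the example, and the calls shown are one plausible reading of the steps described, not necessarily the exact code from the webinar.

```python
import pandas as pd

# Toy stand-in for the movie dataset; names and rows are assumptions.
movies = pd.DataFrame({
    "title": ["Toy Story", "Heat", "Jumanji"],
    "genres": ["Animation|Comedy", "Action|Crime", "Adventure|Fantasy"],
})

# One command splits the pipe-delimited strings into 0/1 indicator columns;
# add_prefix then labels each new column.
genre_cols = movies["genres"].str.get_dummies(sep="|").add_prefix("genre_")

# The concatenate step joins the new columns back onto the original frame.
movies_wide = pd.concat([movies.drop(columns="genres"), genre_cols], axis=1)

print(movies_wide.columns.tolist())
```

The same result under cuDF would need a slightly different sequence of calls, as described above; that variant is not shown here because it requires a GPU environment.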
A common bottleneck for data scientists is the rate at which data can be aggregated. Fortunately, switching from CPU to GPU can result in significant performance gains: a user may see double-digit speed-up factors in data aggregation. In a quick case study, 6 million records were processed on a single GPU node to compare pandas and cuDF. The results speak for themselves. In one exercise, baby name data was aggregated by year, name, and sex. Counting the records with pandas took 1.8 seconds. cuDF beat that by a long shot, completing the aggregation in only 0.08 seconds, roughly 22 times faster. ETL processes on large, complicated datasets can experience remarkable speed-ups, in some cases by a factor of 10 or 20 on a single GPU node, and the scaling factors persist.
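To make the aggregation concrete, here is a small pandas sketch of counting records by year, name, and sex, followed by the speed-up implied by the two timings quoted above. The five sample rows and the column names are invented for illustration. Since the cuDF API is described as basically the same, the groupby would in principle look nearly identical there, but that is not verified in this sketch.

```python
import pandas as pd

# Tiny invented sample; the real exercise used about 6 million records.
names = pd.DataFrame({
    "year": [2000, 2000, 2000, 2001, 2001],
    "name": ["Emma", "Emma", "Liam", "Emma", "Liam"],
    "sex": ["F", "F", "M", "F", "M"],
})

# Count the records in each (year, name, sex) group.
counts = (
    names.groupby(["year", "name", "sex"])
    .size()
    .reset_index(name="records")
)
print(counts)

# Speed-up implied by the timings reported in the text.
pandas_seconds = 1.8
cudf_seconds = 0.08
speedup = pandas_seconds / cudf_seconds
print(f"cuDF was about {speedup:.1f}x faster")
```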
cuSpatial is a RAPIDS library built specifically to manage large GIS data. It accelerates common geospatial computations by 10 to 10,000 times, depending on the operation. cuSpatial supports common data inputs, such as CSV or Parquet. In a multi-GPU example, New York City yellow taxi data is read from a CSV file using dask_cudf. Because the work runs on a remote Dask cluster, it can handle data that is larger than memory. Additionally, cuSpatial has haversine distance functionality, which allows the user to calculate the distance between two points on the surface of a globe.
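For readers unfamiliar with the haversine distance, the following plain-Python sketch shows the underlying formula for a single pair of points. It is not the cuSpatial API and runs on the CPU; the function name and the Earth radius constant are assumptions chosen for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two points given in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = (
        math.sin(dlat / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    )
    # Clamp to 1.0 to guard against tiny floating-point overshoot.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


# Two illustrative points in New York City (coordinates are approximate).
print(haversine_km(-73.9857, 40.7484, -73.7781, 40.6413))
```

A GPU library evaluates the same formula over millions of coordinate pairs at once, which is where the large speed-ups come from.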
With Dask’s map_partitions, it is possible to apply any function to each partition of a Dask DataFrame. This is particularly valuable for geospatial data scientists looking to do point distance calculations, such as haversine, on a large DataFrame: the work is completed much faster and in parallel. Moreover, unlike pandas, Dask can handle data larger than memory. By using one line of code from the Shakudo package, users can start a distributed Dask cluster on GPUs, increasing the speed-up even further compared to local Dask clusters. Just one line of code! Extra GPUs = extra speed.
Like momentum, the speed gain grows with the size of the data set, without any memory issues.
By using sessions on Shakudo, you can automate jobs. Easily schedule your code and let it run when you want it to.
A YAML file can help with this process. YAML is a recipe for your pipeline jobs, so you can list and view all of your tasks under one menu. A YAML file can mix and match any runnable script.
The notebook path is a key consideration when running automated code. The user can copy the path of the Jupyter Notebook and add it to the YAML file. Simply commit the pipeline YAML and the scripts for each step to the GitHub repository. Once everything is committed, it is automatically picked up by the platform and you can spin up jobs or scheduled jobs to automate this task and essentially deploy it in production. If you wanted to make the same calculation on new data on demand, you could wrap a service around it and then create an access point for your users. If a model or its traffic is large, users can gain speed increases by setting up services with the NVIDIA Triton server, which is fully integrated with Shakudo.
By accessing the panel, the user can easily spin up a new job. Simply add your new job by adding a new YAML path, and change the job type. Your production environment will be exactly the same as the development environment, so no need to sweat over any post-development surprises. Other parameters are available to control optimization of resources. You can modify variables using parameters right at the job panel without changing anything in your original script, so there is no need for intricate logic or developer support. As you create your job, a GraphQL API is auto-generated. This API can be shared with engineering teams so they can trigger jobs from other pipelines. For consistent incoming data, the user can create batch jobs using nearly the same process, only with a schedule assignment.
Using Shakudo, data scientists can write code, deploy code, debug in real time, and get results, while seeing how models perform in production, on real production data. You no longer have to guess which model to put into production. Instead, you can select the one with the best performance and alignment with business objectives.
Watch our December 16th webinar
Access our slide deck and demo notebook GitRepo
We have a couple of key objectives when looking at the road ahead:
We are an integrator with a belief in the exponential power of combining the best tools on the market. We want you to save on production, and easily speed up deployment.
Stella Wu - Head of Machine Learning, Shakudo
Stella is a machine learning researcher experienced in developing AI models for real-life applications. Stella has built machine learning models in natural language processing, time-series prediction, self-supervised learning, recommendation systems, and image processing at BMO, Borealis AI, and several startups. Stella has a PhD in geophysical modeling from the University of Münster in Germany.