Large language models (LLMs) aren’t exactly new, but they are newly relevant, thanks primarily to the phenomenal success of OpenAI’s ChatGPT, which debuted less than a year ago.
LLMs are an example of generative AI, which has displayed an ability to mimic human-like creativity, feign human-like empathy, and surface connections across a vast corpus of knowledge.
Businesses are keen to incorporate LLMs and other types of generative AI technologies into their processes and workflows, but they understandably have serious questions — and more than a few reservations — about these technologies. The most serious of these are ethical and legal, having to do with the problem of “aligning” the behavior of AI solutions with human values, laws, and regulations.
But businesses are also grappling with the pragmatic dimension of generative AI. What practical steps must a business take to integrate generative AI into its processes and workflows? What skills, resources, tools, and practices does it need to master to be successful with AI? This blog post explores the pragmatic dimension of successfully using LLMs and other types of generative AI tools.
Getting started with generative AI seems simple enough: OpenAI, Google, Microsoft, Anthropic, and others offer API-based access to their LLMs, ostensibly making it easy for organizations to integrate generative AI into their apps and workflows. One problem with using a commercial AI solution is that a business must share proprietary information with the AI vendor, possibly “leaking” intellectual property (IP) or other types of sensitive information. Another issue is vendor lock-in, which happens as the business tightly integrates AI into the applications, services, and workflows powering its business processes.
This is why a growing number of businesses are enticed by the promise of open source LLMs, like LLaMA or Falcon, as well as by the rich ecosystem of available open source libraries, frameworks, and other resources. Not only is the performance of open source LLMs quickly catching up to that of proprietary models, but the history of open source software suggests that eventual parity between the two is a question of when, not if. The success of open source statistical analysis and machine learning (ML) libraries, frameworks, and tools is a great example of this. In less than 15 years, open source technologies have effectively displaced proprietary tools (like SAS and SPSS) for statistics and ML.
The rise of open source generative AI solutions seems to be happening even more quickly.
Using open source software to build your own LLM has several benefits, in addition to protecting against unauthorized information disclosure, reducing the risk of IP leakage, and acting as a hedge against vendor lock-in. Let’s quickly cover the concrete benefits of an open source LLM strategy before pivoting to the challenge of actually building, scaling, operating, and maintaining this stack.
Deep customization. Training LLMs on internal data allows for a high degree of customization, ensuring more accurate and relevant outcomes tailored to a business’s unique requirements.
Data security. Training your own LLM, using open source models and internal data, also mitigates the risk of a data breach — as when a third-party vendor’s systems are infiltrated by outside attackers — and makes it easier to comply with data protection regulations, minimizing potential legal exposure.
Business-specific insights. A custom-trained and tuned LLM can offer insights that are uniquely aligned with an individual business’s goals and challenges, producing more useful, pertinent results.
Operational independence. You get complete control over the model’s lifecycle, allowing you to decide when and how to update or upgrade your LLM, so that feature or function updates are driven by the business’s needs rather than forced by a third-party vendor’s support or maintenance terms.
These benefits come at a cost, however. The downside to training and operating your own LLMs is that you’re responsible for taking care of all of the things a commercial service handles for you.
For any business whose core competency is not data work, this can be intimidating.
The list of challenges includes:
Provisioning heavy-duty data processing capabilities. Operating a custom LLM requires a robust and resilient data-processing infrastructure that scales dynamically in response to unpredictable data and computational requirements. Most importantly, the software layer running atop this infrastructure must be flexible enough to accommodate the huge variety of tools and practices used in working with data.
Connecting cloud services, data sources, tools, etc. can be daunting, time-consuming, and costly. Any business that wants to build and operate a custom LLM must navigate the intricacies of:
If “garbage in, garbage out” is true in programming, it’s especially true in model training. Using low-quality data will result in LLMs that produce inaccurate, hallucinatory, or biased outputs.
Governance and compliance are key, too. Both are critical pillars of LLM training. Organizations must rigorously vet the provenance and handling of the data they use to train their LLMs, making sure that their use of this data complies with regulations and legal statutes and aligns with ethical values and standards.
Decentralized data access is the new ground-level expectation. Teams expect to be able to easily exchange data with one another. The challenge is to promote data exchange while ensuring the integrity and traceability of data — and at the same time prevent the proliferation of redundant datasets.
Continuous learning is easier said than done. LLMs need to be retrained so they’re in sync with evolving language, concepts, or domain-specific information. A general-purpose LLM might get updated on a yearly basis, but a domain-specific LLM, designed to perform simple tasks, might get refreshed more frequently. Some open source LLMs can be trained on commodity hardware, while LLMs like LLaMA and Falcon have been demonstrated running on extremely lightweight compute resources, like smartphones and single-board computers. Therefore, a business might realistically operate different kinds of task-specific LLMs that do get updated frequently. In any case, automated monitoring and feedback loops are essential for maintaining the efficacy of LLMs in production.
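To make the data-exchange challenge above more concrete, here is a minimal sketch of one way to keep shared datasets traceable and avoid redundant copies: fingerprint each dataset by its content and register it once. The `DatasetRegistry` class, its fields, and the dataset and team names are hypothetical and exist only for illustration; a real deployment would more likely rely on a data catalog or lineage tool.

```python
import hashlib
import json


def fingerprint(rows):
    """Hash a dataset's content so the result does not depend on row order."""
    canonical = sorted(json.dumps(row, sort_keys=True) for row in rows)
    return hashlib.sha256("\n".join(canonical).encode("utf-8")).hexdigest()


class DatasetRegistry:
    """Hypothetical in-memory registry mapping content hashes to metadata."""

    def __init__(self):
        self._by_hash = {}

    def register(self, name, owner, rows):
        """Return (entry, created). Known content yields the existing entry."""
        digest = fingerprint(rows)
        if digest in self._by_hash:
            return self._by_hash[digest], False
        entry = {"name": name, "owner": owner, "sha256": digest}
        self._by_hash[digest] = entry
        return entry, True


registry = DatasetRegistry()
rows = [{"id": 1, "text": "hello"}, {"id": 2, "text": "world"}]

# The first team registers the dataset.
first, created_first = registry.register("support_tickets", "team-a", rows)

# A second team tries to register the same rows in a different order.
second, created_second = registry.register("tickets_copy", "team-b", list(reversed(rows)))
```

Because the second registration resolves to the first entry, both teams end up pointing at one dataset with one recorded owner and hash, rather than two untracked copies.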
At a glance, these challenges might seem daunting and technologically insurmountable. But fear not!
The little-known secret of LLMs is that the way you build, deploy, operate, and maintain them isn’t all that different from the way you build, deploy, operate, and maintain any other component of the modern data stack. In the first place, training an LLM entails cleansing and conditioning a large volume of data that’s aggregated from a wide variety of sources. For large LLMs, businesses usually opt to persist this data in a data lake or in an optimized columnar storage format. For task-specific LLMs, they have more flexibility, with options ranging from cloud relational database services to cloud object storage.
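As a rough illustration of the cleansing and conditioning step, the sketch below drops very short records and exact duplicates (after whitespace and case normalization) from a list of text records. The function names, the `min_words` threshold, and the sample records are assumptions made for this example; production pipelines typically add near-duplicate detection, language filtering, and PII scrubbing on top of this.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivially different copies compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def clean_corpus(records, min_words=5):
    """Return records with too-short entries and normalized duplicates removed.

    min_words is an illustrative threshold, not a recommended value.
    """
    seen = set()
    kept = []
    for text in records:
        norm = normalize(text)
        if len(norm.split()) < min_words:
            continue  # too short to be a useful training example
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # duplicate after normalization
        seen.add(digest)
        kept.append(text)
    return kept


raw = [
    "The quarterly report shows revenue growth in all regions.",
    "the quarterly   report shows revenue growth in all regions.",
    "ok",
    "Customer churn fell after the onboarding flow was redesigned.",
]
cleaned = clean_corpus(raw)
```

Here the second record is dropped as a duplicate of the first and the third is dropped as too short, leaving two records.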
Today, IT experts develop repeatable solutions, called “patterns,” to manage data integration processes exactly like these. Most of these patterns involve using manual scripts or workflow management engines, although several vendors market software and services designed to simplify these tasks.
The step that’s unique to LLMs is that of selecting the large language model that’s most suited to the specific use case or application at hand. A complex, computationally demanding, transformer-based model like LLaMA might be employed for use cases that require natural language processing (NLP), whereas a less sophisticated recurrent neural network (RNN) model could suffice for a task like sequence prediction in time-series data. At a high level, the foundational steps involved in training, validating, and deploying even a sophisticated LLM are comparable to those of a conventional ML model, although certain nuances and complexities specific to LLMs do require specialized expertise. (This is also true with parameter-efficient tuning techniques like low-rank adaptation, or LoRA, which are used to cost-effectively fine-tune a pre-trained LLM.) Once the LLM has been trained and tuned to suit the use case at hand, the business integrates it into production, typically via APIs. To maintain the performance and accuracy of the production LLM, the business must also periodically retrain it on fresh datasets, while making other adjustments based on feedback loops from production use.
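To show why low-rank adaptation is cost-effective, here is a small pure-Python sketch of the core idea: the pre-trained weight matrix W stays frozen, and only two small matrices, B (d_out x r) and A (r x d_in), are trained, with the effective weight computed as W + (alpha / r) * B * A. The helper names and the 4096-wide layer are illustrative assumptions; in practice a library such as Hugging Face PEFT would handle this inside the model.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha, r):
    """Return W + (alpha / r) * (B @ A), the weight used in the forward pass."""
    delta = matmul(b, a)
    scale = alpha / r
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def trainable_params(d_out, d_in, r):
    """Compare full fine-tuning with LoRA for a single d_out x d_in matrix."""
    return {"full": d_out * d_in, "lora": r * (d_in + d_out)}


# Frozen 2x3 weight with rank r = 1.
W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
A = [[0.5, -0.5, 1.0]]  # shape (r, d_in)
B_init = [[0.0], [0.0]]  # shape (d_out, r); zero init makes the adapter a no-op

unchanged = lora_effective_weight(W, A, B_init, alpha=2, r=1)

# After some training, B is no longer zero and the effective weight shifts.
B_trained = [[1.0], [0.0]]
adapted = lora_effective_weight(W, A, B_trained, alpha=2, r=1)

# Illustrative layer size: a 4096 x 4096 projection with rank 8.
counts = trainable_params(d_out=4096, d_in=4096, r=8)
```

For the illustrative 4096 x 4096 layer, the rank-8 adapter trains 65,536 parameters instead of 16,777,216, which is where the savings in memory and compute come from.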
Let’s briefly decompose these steps and look at how an organization might implement them today.
First, there’s automatic provisioning and configuration. For data scientists, ML engineers, or other experts, the manual overhead involved in provisioning and configuring software is a massive resource drain. In most cases, these experts are forced to create and maintain scripts or custom-designed workflows to at least partially automate the provisioning and configuration of the modern data stack. But such “solutions” tend to be labor-intensive, error-prone, and lack the efficiency of a fully automated solution. Ideally, experts would be able to interact with a self-service tool they could use to provision and connect disparate compute engines, cloud storage and ETL services, ML frameworks, etc.
Second, there’s data access and integration. Useful data is distributed across a large variety of sources — not just databases, data lakes, and data warehouses, but cloud object storage and SaaS tools, too. Again, data scientists and other experts can use manually maintained scripts, or build (and maintain) custom workflows, to partially automate the provisioning and configuration of the compute and storage services used to process this data. In a perfect world, however, they would interact with a pre-configured service that streamlines access to these data sources, integrating with data ingestion pipelines and connectors, and reducing the hassle and latency associated with data preparation.
Third, there’s workflow management for LLM training. Training or retraining an LLM involves a specific sequence of operations: data extraction, preprocessing, model training, tuning, validation, and deployment. Data scientists and other experts typically leverage workflow management tools to automate these steps, albeit at the cost of having to design, maintain, and troubleshoot the scripts or artifacts (e.g., DAGs) used by these tools. Ideally, preconfigured software orchestration would automate these workflows, ensuring a deterministic, reproducible, optimized process.
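The sequence of operations described above can be expressed as a small dependency graph. The sketch below uses Python's standard-library `graphlib.TopologicalSorter` (Python 3.9+) to run placeholder steps in dependency order. The step names mirror the list above, while the `PIPELINE` mapping, `run_pipeline` helper, and no-op task bodies are assumptions for illustration. A production setup would normally use a workflow engine rather than this hand-rolled runner.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps that must finish before it starts.
PIPELINE = {
    "extract": set(),
    "preprocess": {"extract"},
    "train": {"preprocess"},
    "tune": {"train"},
    "validate": {"tune"},
    "deploy": {"validate"},
}


def run_pipeline(dag, tasks):
    """Run each task after its dependencies; return the execution order."""
    order = list(TopologicalSorter(dag).static_order())
    for step in order:
        tasks[step]()
    return order


log = []
# Placeholder callables; real steps would invoke the data and training stack.
tasks = {name: (lambda n=name: log.append(n)) for name in PIPELINE}
order = run_pipeline(PIPELINE, tasks)
```

Declaring dependencies rather than hard-coding the order means a new step (for example, a data-quality check between preprocessing and training) only requires one more entry in the mapping.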
Fourth, there are model retraining pipelines. To maintain the efficacy of production LLMs, experts need to design pipelines and automate feedback loops that monitor model performance, identify anomalies or degradation, and, optionally, trigger model retraining. Currently, much of this heavy lifting falls on the shoulders of data scientists and other experts, who manually design and maintain these workflows. Most experts would prefer to use a tool that offers preconfigured software patterns for LLM monitoring and retraining. This solution would minimize manual intervention, ensuring that experts could focus on improving models and solutions, rather than the nitty-gritty of pipeline management.
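A feedback loop of this kind can be sketched as a rolling average over recent evaluation scores that flags retraining when quality drops below a tolerance. The `RetrainingMonitor` class, the baseline, tolerance, and window values, and the sample scores are all illustrative assumptions. What counts as a "score" (accuracy on a labeled sample, user ratings, and so on) depends on the use case.

```python
from collections import deque


class RetrainingMonitor:
    """Track a rolling quality score and flag degradation past a tolerance."""

    def __init__(self, baseline, tolerance=0.05, window=50):
        self.baseline = baseline
        self.tolerance = tolerance
        self.scores = deque(maxlen=window)  # old scores fall out automatically

    def record(self, score):
        """Add one evaluation score (higher is better)."""
        self.scores.append(score)

    def should_retrain(self):
        """True once the window is full and its mean is below baseline - tolerance."""
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough evidence yet
        mean = sum(self.scores) / len(self.scores)
        return mean < self.baseline - self.tolerance


monitor = RetrainingMonitor(baseline=0.90, tolerance=0.05, window=4)

for score in [0.91, 0.89, 0.90, 0.92]:
    monitor.record(score)
healthy = monitor.should_retrain()

# Quality drifts downward; the window now holds only the recent low scores.
for score in [0.80, 0.79, 0.81, 0.78]:
    monitor.record(score)
degraded = monitor.should_retrain()
```

Requiring a full window before triggering is a deliberate choice in this sketch: it avoids kicking off an expensive retraining run because of one or two unlucky evaluations.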
Finally, there’s integration with DevOps, DataOps, and MLOps. Continuous integration and continuous deployment (CI/CD) has become critical for ML models, permitting rapid model iteration and deployment, and ensuring that the most up-to-date versions of ML models are operationalized. CI/CD helps accelerate delivery timelines, maintain consistent quality, and align ML development with business needs. By integrating automated software provisioning into these lifecycle practices, experts can expedite model versioning and testing, while ensuring hassle-free deployment into production.
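One way CI/CD applies to models is an automated promotion gate: a check that runs in the pipeline and blocks deployment unless a candidate model meets agreed criteria. The sketch below is a minimal, hypothetical version. The metric names, the `min_gain` and `max_latency_ms` thresholds, and the sample numbers are assumptions, not values from any particular platform.

```python
def promotion_gate(candidate, production, min_gain=0.0, max_latency_ms=500):
    """Return (ok, reasons) for promoting a candidate model over production.

    Both arguments are dicts with 'accuracy' and 'latency_ms' keys.
    """
    reasons = []
    if candidate["accuracy"] < production["accuracy"] + min_gain:
        reasons.append("accuracy does not beat production")
    if candidate["latency_ms"] > max_latency_ms:
        reasons.append("latency exceeds budget")
    return (not reasons, reasons)


production = {"accuracy": 0.88, "latency_ms": 320}

good_candidate = {"accuracy": 0.91, "latency_ms": 340}
ok, reasons = promotion_gate(good_candidate, production, min_gain=0.01)

slow_candidate = {"accuracy": 0.93, "latency_ms": 900}
blocked_ok, blocked_reasons = promotion_gate(slow_candidate, production, min_gain=0.01)
```

Returning the list of reasons alongside the pass/fail flag lets the CI job log why a candidate was rejected, which helps when the gate is tuned over time.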
In almost all cases, the “automation” experts depend on to provision and configure the components of the modern data stack takes the form of human-created scripts or workflow code artifacts.
However, the “ideal” solution alluded to in the section above already exists: Shakudo, the operating system for data and AI stacks that eliminates the complexity of provisioning, configuring, and connecting the software components of the modern data stack. By seamlessly integrating with operational lifecycle practices like DevOps, DataOps, and MLOps, Shakudo not only streamlines operations, but also improves the quality and efficiency of data workflows, including model (re)training and deployment. By automating and standardizing these and other operations, Shakudo reduces the chance of errors or unexpected outcomes in data workflows and model outputs, aligning ML development with business requirements.
Consider the tedious, time-consuming task of provisioning and configuring the software stack required to train and deploy an LLM. Data scientists might need to provision high-performance GPU clusters, configure a cloud storage service with an optimized columnar format to support very large datasets, and connect a confusing network of data ingestion pipelines — all before model training begins.
Shakudo streamlines this foundational process, automating the setup of essential services, configuring secure connectivity between them, and managing dependencies between the services used to access, transform, and analyze data. Working in Shakudo’s easy-to-use Web UI, a data scientist can select from among different types of predefined configurations designed specifically for LLM training and deployment. Shakudo automatically provisions, configures, and orchestrates the required services, eliminating trial-and-error, and allowing experts to kickstart LLM projects with almost no delay.
By the way, the same features and capabilities that make Shakudo so useful in streamlining the initial setup for LLM projects also apply to data science, data engineering, and analytic development in general. Shakudo provides practitioners in a wide range of roles with a hand-curated selection of tools, all configured to interoperate flawlessly. It helps streamline the design, testing, documentation, and dissemination of ML and predictive models, data pipelines, useful datasets and visualizations, recipes, and artifacts of all kinds.
Intrigued? Level up the potential of your data operations with Shakudo: the modern, cloud-native automation platform ideal for LLM training and deployment. Simplify, standardize, and accelerate your workflows while reducing costs and frustration — partner with Shakudo on your data journey today! Schedule a meeting with a Shakudo expert to learn more.