Retrieval-augmented generation (RAG) with Llama2 and Milvus

By:

Jeremie Zumer

Updated on:

October 6, 2023

Note: This article is a followup to our previous post on using Milvus for knowledge base management. We recommend that you read that post first if you would like to implement the chatbot described here.

Chatbots have improved by leaps and bounds in the past few years thanks to the advent of easily accessible LLMs. General LLM-based chatbots can be quite entertaining in their own right, but with proper tuning, they can also be a way to significantly improve productivity.

Why read web articles all day to parse through endless fluff in order to find the tiny nuggets of information that make all the difference? Instead, let the LLM summarize what’s new and guide you to find what you’re looking for.

Or how about parsing through the last decade’s worth of internal business documents to identify patterns, like which sectors will most be affected by ongoing weather events based on a collection of document sources, or recall details, such as who logged into the admin database at around 3 AM last December, without having to guess at which keywords will turn out the best results, or risk missing critical details?

Perhaps you have a collection of complicated texts from which you need a simple answer: Based on the National Bank of Canada’s last 12 quarters of financial documents, is their revenue increasing or decreasing?

Modern chatbots can do all that when leveraging retrieval-augmented generation. In this blog post, we will build such a system, using Llama2 to generate our answers, Milvus to store our documents and perform quick vector searches to identify relevant documents, and Shakudo to bind it all together without having to endure complicated setup or rely on expert IT skills.

RAG in a Nutshell

Retrieval-augmented generation, or RAG, is the task of generating text (generation) based on a document related to the query or search context (retrieval-augmented). There are many ways to architect a RAG system. Here, we will stick to a straightforward method based on prompt injection, discussed later in this post.

Our system will have the following features:

Semantic similarity search between the user query and our knowledge base
Multi-turn chat: The user can ask increasingly precise questions based on the chatbot’s responses
Context switch detection: The system will know if the user is asking something completely new or is continuing a discussion on the current topic
Realtime chat responses

To achieve semantically relevant search results and a pleasant chat experience with limited response delays, we will leverage Milvus’s highly efficient vector search capabilities. Since running everything we need (Llama2, Milvus, and our frontend) on one computer is rather difficult, and since setting up everything on a few devices could be the subject of its own blog post, we will leverage Shakudo to effortlessly access everything we need instead. As a bonus, our chatbot will scale seamlessly when we are ready to deploy it.

Step 1: Setting Up the Database

We will be using the same database that was setup in our previous post on Milvus for this demo, with the WikiHow dataset. Our chatbot will be able to give us tips on how to run faster or how to handle spicy food better. In addition, we will have access to the source document from which our Llama2 generated its answer, allowing users to double check anything interesting they may discover through our chatbot.

Since we have already covered how to insert our data into Milvus, we’ll assume the database of interest is ready for further operations. We simply have to create the collection and load the collection, as before:

from pymilvus import connections
connection = connections.connect(
  alias="default",
  host=os.environ['MILVUS_HOST'],
 port=os.environ['MILVUS_PORT'],
)

whcollection = Collection("WikiHow")
whcollection.load()

With this, our data is ready to be searched against, no additional setup required. As a reminder, we use the alias of ‘default’ because this saves us some typing, as this is the default connection reference name for other pymilvus functions.

We can test that we get reasonable results by doing a simple search against the database, as we did in our previous post, by defining our embeddings to match those we used when uploading data to Milvus like this:

embeddings = HuggingFaceEmbeddings(model_name='all-MiniLM-L6-v2')

And then performing the search proper:

found = whcollection.search(
            [embeddings.embed_query("How do I run faster?")], # Vector for the query
            anns_field="vector", # Name of the field to search against
            param={'metric_type': 'L2', # Search params...
                        'offset': 0,
                        'params': {'nprobe': 1}
                        },
            limit=1,
            output_fields=['text', 'title']) # Also get the document title.
    match_title = found[0][0].entity.get('title')
    match_text = found[0][0].entity.get('text')

Let’s see the result!

Query: How do I run faster?
Match result:
Title: How to Be a Fast Runner (Always come up last in the big race? Want some tips on how to speed yourself up whether you're in the Olympics or just out on the playground? Here are some ideas to help you out.)

Contents:
How to Be a Fast Runner (Always come up last in the big race? Want some tips on how to speed yourself up whether you're in the Olympics or just out on the playground? Here are some ideas to help you out.):

Adjust your stride according to the distance you're running, if it's a sprint then quickly turn over your legs and keep your knees high.:
...

As expected of Milvus, blazing fast response and pretty great results!

Step 2: Preparing the Generator

Now that we can retrieve useful documents from our knowledge base, let’s hook the retrieved results into our generation system. First, we need a running instance. On Shakudo, this is just a question of launching a service:

Let’s connect to our LLM endpoint and make sure everything is up and running:

 SVC_EP=os.environ['LLM_ENDPOINT']
client = text_generation.Client(SVC_EP)
for tok in client.generate_stream(prompt="USER:\nhow to eat fruit for non fruit eaters?\nASSISTANT:\n", max_new_tokens=512, repetition_penalty=1.2):
    if not tok.token.special:
        print(tok.token.text, end='')

You may notice that we don’t simply use the query as prompt, but have to perform mild prompt engineering. That is because GPTs like Llama2 are trained with text completion objectives, not as dialogue systems. Therefore, it is important to ensure the prompt format matches the dialogue text corpora the model will have seen during training. Variations are acceptable as long as the basic format is followed (so for example, results may be a little worse, but will still resemble dialogue if using ROBOT: instead of ASSISTANT:, and so on).

Trying out this query returns us an interesting response, although probably not what we had in mind when asking the question:

1. Wash your hands thoroughly with soap and water before handling any fruits or vegetables. This is important for maintaining good hygiene and preventing the spread of germs.
2. Choose a variety of colorful, ripe fruits that you enjoy eating. Fresh fruits are rich in vitamins, minerals, fiber, and antioxidants which can help boost your immune system and improve overall health.
3. Cut open the fruit carefully using a sharp knife or fruit cutter. Be sure not to squeeze too hard as this could cause injury to the fruit. [...]

There are several solutions to fix generations like these, including using a larger model, modifying the prompt format used, or performing sampling and optimizing generation parameters to achieve the desired results.

All these methods are compatible with knowledge grounding, as we’ll do in this post, though they fall outside our scope this time. Additionally, knowledge grounding is very effective and far cheaper to implement than finetuning, which requires ready access to GPUs even with modern, parameter efficient techniques like LoRA QLoRA. It also doesn’t require additional effort when modifying knowledge base data.

We will have to prepare a prompt template to insert previous user turns and the document from the knowledge base that is most relevant to the input query:

assistant_string = "ASSISTANT:\n"
user_string = "USER:\n"
document_string="DOCUMENT:\n"

prompt_template = PromptTemplate.from_template(
    f"""\
{{turns}}
{document_string}{{context}}
{user_string}What does the document say about {{prompt}}
Give me a summary. If the information is not there let me know.

{assistant_string}
"""
)

While a discussion of the pros and cons of various models is not in scope for this post, it’s worth experimenting a bit to find what works best for your use case. For this demo, we found that Llama2-7b provided lukewarm results, but at 13 billion parameters, responses became quite natural in a lot of cases. In addition, finetuned models can provide better results, though as a rule of thumb, finetuning reduces general performance while improving task-specific performance.

Without further ado, let’s tie everything up and try our previous query again:

The document provides advice on incorporating more fruits into one's diet as part of maintaining overall health. It suggests several strategies for individuals who may find it challenging to consume fruits directly due to taste preferences or other reasons. These strategies include consuming fruit pies and desserts in moderation while taking a vitamin and mineral supplement; purchasing refrigerated or frozen juices without added sugars such as orange, pineapple, and apple; selecting easy-to-eat fruits like grapes, cherries, small apples, and pears after washing them properly; and replacing high protein dishes with those containing fruits and vegetables. The document emphasizes the importance of including fruits in one's diet over consuming excess amounts of processed foods and sweetened beverages.

Much better! This is the kind of response we would have liked the first time around.

The generation function to obtain this response looks like this:

def generate(what, turns, context, topic):
    found = find(what)
    match_title = found[0][0].entity.get('title')
    match_text = found[0][0].entity.get('text')
    match_dist = found[0][0].distance

    retrieved = ""

    if match_title != topic and match_dist < 0.75:
        retrieved = match_text
        retrieved = retrieved[:1024]
        context = retrieved
        topic = match_title
    preface = ("No information available" if context is None else context)
    turns = "\n".join(turns)[-2048:]
    return { 
        'stream': client.generate_stream(prompt=prompt_template.format(prompt=what,
                                                                       turns=turns,
                                                                       context=preface),
                                            max_new_tokens=512,
                                            repetition_penalty=1.2),

        'topic': topic,
        'context': context
        }

We simply get the document from Milvus based on our query, check that the match is strong enough, indicate a change of topic (otherwise we assume the user is continuing the previous conversation), prepare the prompt using our template, and get our generated response stream like we did before. The magic is all in the additional information provided to the model.

And yes, our chatbot handles multi-turn dialogues!

Q: how can i eat fruits?
A: The document suggests several strategies on how to incorporate more fruits into one's diet if someone doesn't enjoy eating fruits directly. These suggestions include consuming fruit pies and desserts as an occasional treat while taking a vitamin and mineral supplement; buying refrigerated or frozen juices without added sugars such as orange, pineapple, and apple; choosing easy-to-eat fruits like grapes, cherries, small apples, and pears after washing them properly; and adding raisins to various dishes including meat and vegetable ones. The overall message emphasizes the importance of balancing the diet with fruits despite personal preferences due to their numerous health benefits.

Q: health benefits?
A: The document highlights the numerous health advantages associated with consuming fruits regularly. It mentions that fruits provide essential nutrients, fiber, antioxidants, and help maintain a healthy weight, lower blood pressure, reduce risk of heart disease, stroke, type 2 diabetes, certain types of cancer, and promote digestive health. Overall, it encourages people to consume fruits as part of a well-balanced diet to support better health outcomes.

Step 3: Deploy!

This will be a very short step since we’re using Shakudo and Streamlit. All we have to do, in essence, is copy our notebook into a python file and wrap it in Streamlit UI elements. The full code can be found here The configuration and run script are fully reproduced here since they’re so tiny.

To deploy our RAG on Shakudo, we first create the following files:

run.sh:

streamlit run milvus-demo/chatbot/service/app.py \
            --server.port 8787 \
            --browser.serverAddress localhost

pipeline.yaml:

pipeline:
  name: "MultiturnRAG"
  requirements: "milvus-demo/chatbot/service/requirements.txt"
  tasks:
  - name: "Muliturn RAG demo"
    type: "bash script"
    port: 8787
    bash_script_path: "milvus-demo/chatbot/service/run.sh"

Then we simply create a service on Shakudo, pointing to the yaml we use for this task:

When the service is ready, we simply have to click the link that appears in the service section, and land on our user-friendly RAG:

That’s it! If you enjoy a good laugh, you are cordially invited to visit the Streamlit documentation on deploying with Kubernetes. Feel free to compare the configuration above with the proposed configuration in the official document. Shakudo makes a world of difference in this, and in all other configuration and management tasks for LLM use cases.

Conclusion

If you would like to toy with our RAG yourself, you can find the notebook and service files at this link.

Armed with the information in this post, you are now ready to prototype your own LLM-powered question-answering system leveraging your own private business data, without having to worry about your documents reaching third party servers as would be a concern when using third party hosted LLMs.

A lot of work remains to finetune the method and interface to achieve the answers you would like with your own branding at the scale that matters to you, but this is where Shakudo comes in: Shakudo allows you to deploy services that automatically scale up and down with usage, and takes care of all the configuration work required to interoperate the many moving pieces involved in creating robust RAGs. Focus on developing business value, not on tool, environment, and configuration management!

Whitepaper

RAG in a Nutshell

Our system will have the following features:

Semantic similarity search between the user query and our knowledge base
Multi-turn chat: The user can ask increasingly precise questions based on the chatbot’s responses
Context switch detection: The system will know if the user is asking something completely new or is continuing a discussion on the current topic
Realtime chat responses

Step 1: Setting Up the Database

from pymilvus import connections
connection = connections.connect(
  alias="default",
  host=os.environ['MILVUS_HOST'],
 port=os.environ['MILVUS_PORT'],
)

whcollection = Collection("WikiHow")
whcollection.load()

embeddings = HuggingFaceEmbeddings(model_name='all-MiniLM-L6-v2')

And then performing the search proper:

found = whcollection.search(
            [embeddings.embed_query("How do I run faster?")], # Vector for the query
            anns_field="vector", # Name of the field to search against
            param={'metric_type': 'L2', # Search params...
                        'offset': 0,
                        'params': {'nprobe': 1}
                        },
            limit=1,
            output_fields=['text', 'title']) # Also get the document title.
    match_title = found[0][0].entity.get('title')
    match_text = found[0][0].entity.get('text')

Let’s see the result!

Query: How do I run faster?
Match result:
Title: How to Be a Fast Runner (Always come up last in the big race? Want some tips on how to speed yourself up whether you're in the Olympics or just out on the playground? Here are some ideas to help you out.)

Contents:
How to Be a Fast Runner (Always come up last in the big race? Want some tips on how to speed yourself up whether you're in the Olympics or just out on the playground? Here are some ideas to help you out.):

Adjust your stride according to the distance you're running, if it's a sprint then quickly turn over your legs and keep your knees high.:
...

As expected of Milvus, blazing fast response and pretty great results!

Step 2: Preparing the Generator

Let’s connect to our LLM endpoint and make sure everything is up and running:

 SVC_EP=os.environ['LLM_ENDPOINT']
client = text_generation.Client(SVC_EP)
for tok in client.generate_stream(prompt="USER:\nhow to eat fruit for non fruit eaters?\nASSISTANT:\n", max_new_tokens=512, repetition_penalty=1.2):
    if not tok.token.special:
        print(tok.token.text, end='')

Trying out this query returns us an interesting response, although probably not what we had in mind when asking the question:

1. Wash your hands thoroughly with soap and water before handling any fruits or vegetables. This is important for maintaining good hygiene and preventing the spread of germs.
2. Choose a variety of colorful, ripe fruits that you enjoy eating. Fresh fruits are rich in vitamins, minerals, fiber, and antioxidants which can help boost your immune system and improve overall health.
3. Cut open the fruit carefully using a sharp knife or fruit cutter. Be sure not to squeeze too hard as this could cause injury to the fruit. [...]

We will have to prepare a prompt template to insert previous user turns and the document from the knowledge base that is most relevant to the input query:

assistant_string = "ASSISTANT:\n"
user_string = "USER:\n"
document_string="DOCUMENT:\n"

prompt_template = PromptTemplate.from_template(
    f"""\
{{turns}}
{document_string}{{context}}
{user_string}What does the document say about {{prompt}}
Give me a summary. If the information is not there let me know.

{assistant_string}
"""
)

Without further ado, let’s tie everything up and try our previous query again:

The document provides advice on incorporating more fruits into one's diet as part of maintaining overall health. It suggests several strategies for individuals who may find it challenging to consume fruits directly due to taste preferences or other reasons. These strategies include consuming fruit pies and desserts in moderation while taking a vitamin and mineral supplement; purchasing refrigerated or frozen juices without added sugars such as orange, pineapple, and apple; selecting easy-to-eat fruits like grapes, cherries, small apples, and pears after washing them properly; and replacing high protein dishes with those containing fruits and vegetables. The document emphasizes the importance of including fruits in one's diet over consuming excess amounts of processed foods and sweetened beverages.

Much better! This is the kind of response we would have liked the first time around.

The generation function to obtain this response looks like this:

def generate(what, turns, context, topic):
    found = find(what)
    match_title = found[0][0].entity.get('title')
    match_text = found[0][0].entity.get('text')
    match_dist = found[0][0].distance

    retrieved = ""

    if match_title != topic and match_dist < 0.75:
        retrieved = match_text
        retrieved = retrieved[:1024]
        context = retrieved
        topic = match_title
    preface = ("No information available" if context is None else context)
    turns = "\n".join(turns)[-2048:]
    return { 
        'stream': client.generate_stream(prompt=prompt_template.format(prompt=what,
                                                                       turns=turns,
                                                                       context=preface),
                                            max_new_tokens=512,
                                            repetition_penalty=1.2),

        'topic': topic,
        'context': context
        }

And yes, our chatbot handles multi-turn dialogues!

Q: how can i eat fruits?
A: The document suggests several strategies on how to incorporate more fruits into one's diet if someone doesn't enjoy eating fruits directly. These suggestions include consuming fruit pies and desserts as an occasional treat while taking a vitamin and mineral supplement; buying refrigerated or frozen juices without added sugars such as orange, pineapple, and apple; choosing easy-to-eat fruits like grapes, cherries, small apples, and pears after washing them properly; and adding raisins to various dishes including meat and vegetable ones. The overall message emphasizes the importance of balancing the diet with fruits despite personal preferences due to their numerous health benefits.

Q: health benefits?
A: The document highlights the numerous health advantages associated with consuming fruits regularly. It mentions that fruits provide essential nutrients, fiber, antioxidants, and help maintain a healthy weight, lower blood pressure, reduce risk of heart disease, stroke, type 2 diabetes, certain types of cancer, and promote digestive health. Overall, it encourages people to consume fruits as part of a well-balanced diet to support better health outcomes.

Step 3: Deploy!

To deploy our RAG on Shakudo, we first create the following files:

run.sh:

streamlit run milvus-demo/chatbot/service/app.py \
            --server.port 8787 \
            --browser.serverAddress localhost

pipeline.yaml:

pipeline:
  name: "MultiturnRAG"
  requirements: "milvus-demo/chatbot/service/requirements.txt"
  tasks:
  - name: "Muliturn RAG demo"
    type: "bash script"
    port: 8787
    bash_script_path: "milvus-demo/chatbot/service/run.sh"

Then we simply create a service on Shakudo, pointing to the yaml we use for this task:

When the service is ready, we simply have to click the link that appears in the service section, and land on our user-friendly RAG:

Conclusion

If you would like to toy with our RAG yourself, you can find the notebook and service files at this link.

Get the whitepaper

Retrieval-augmented generation (RAG) with Llama2 and Milvus

Thank you for filling out the form. The whitepaper you have requested is available for download below.

Download White Paper

Oops! Something went wrong while submitting the form.

Key results

About

industry

Tech Stack

RAG in a Nutshell

Our system will have the following features:

Semantic similarity search between the user query and our knowledge base
Multi-turn chat: The user can ask increasingly precise questions based on the chatbot’s responses
Context switch detection: The system will know if the user is asking something completely new or is continuing a discussion on the current topic
Realtime chat responses

Step 1: Setting Up the Database

from pymilvus import connections
connection = connections.connect(
  alias="default",
  host=os.environ['MILVUS_HOST'],
 port=os.environ['MILVUS_PORT'],
)

whcollection = Collection("WikiHow")
whcollection.load()

embeddings = HuggingFaceEmbeddings(model_name='all-MiniLM-L6-v2')

And then performing the search proper:

found = whcollection.search(
            [embeddings.embed_query("How do I run faster?")], # Vector for the query
            anns_field="vector", # Name of the field to search against
            param={'metric_type': 'L2', # Search params...
                        'offset': 0,
                        'params': {'nprobe': 1}
                        },
            limit=1,
            output_fields=['text', 'title']) # Also get the document title.
    match_title = found[0][0].entity.get('title')
    match_text = found[0][0].entity.get('text')

Let’s see the result!

Query: How do I run faster?
Match result:
Title: How to Be a Fast Runner (Always come up last in the big race? Want some tips on how to speed yourself up whether you're in the Olympics or just out on the playground? Here are some ideas to help you out.)

Contents:
How to Be a Fast Runner (Always come up last in the big race? Want some tips on how to speed yourself up whether you're in the Olympics or just out on the playground? Here are some ideas to help you out.):

Adjust your stride according to the distance you're running, if it's a sprint then quickly turn over your legs and keep your knees high.:
...

As expected of Milvus, blazing fast response and pretty great results!

Step 2: Preparing the Generator

Let’s connect to our LLM endpoint and make sure everything is up and running:

 SVC_EP=os.environ['LLM_ENDPOINT']
client = text_generation.Client(SVC_EP)
for tok in client.generate_stream(prompt="USER:\nhow to eat fruit for non fruit eaters?\nASSISTANT:\n", max_new_tokens=512, repetition_penalty=1.2):
    if not tok.token.special:
        print(tok.token.text, end='')

Trying out this query returns us an interesting response, although probably not what we had in mind when asking the question:

1. Wash your hands thoroughly with soap and water before handling any fruits or vegetables. This is important for maintaining good hygiene and preventing the spread of germs.
2. Choose a variety of colorful, ripe fruits that you enjoy eating. Fresh fruits are rich in vitamins, minerals, fiber, and antioxidants which can help boost your immune system and improve overall health.
3. Cut open the fruit carefully using a sharp knife or fruit cutter. Be sure not to squeeze too hard as this could cause injury to the fruit. [...]

We will have to prepare a prompt template to insert previous user turns and the document from the knowledge base that is most relevant to the input query:

assistant_string = "ASSISTANT:\n"
user_string = "USER:\n"
document_string="DOCUMENT:\n"

prompt_template = PromptTemplate.from_template(
    f"""\
{{turns}}
{document_string}{{context}}
{user_string}What does the document say about {{prompt}}
Give me a summary. If the information is not there let me know.

{assistant_string}
"""
)

Without further ado, let’s tie everything up and try our previous query again:

The document provides advice on incorporating more fruits into one's diet as part of maintaining overall health. It suggests several strategies for individuals who may find it challenging to consume fruits directly due to taste preferences or other reasons. These strategies include consuming fruit pies and desserts in moderation while taking a vitamin and mineral supplement; purchasing refrigerated or frozen juices without added sugars such as orange, pineapple, and apple; selecting easy-to-eat fruits like grapes, cherries, small apples, and pears after washing them properly; and replacing high protein dishes with those containing fruits and vegetables. The document emphasizes the importance of including fruits in one's diet over consuming excess amounts of processed foods and sweetened beverages.

Much better! This is the kind of response we would have liked the first time around.

The generation function to obtain this response looks like this:

def generate(what, turns, context, topic):
    found = find(what)
    match_title = found[0][0].entity.get('title')
    match_text = found[0][0].entity.get('text')
    match_dist = found[0][0].distance

    retrieved = ""

    if match_title != topic and match_dist < 0.75:
        retrieved = match_text
        retrieved = retrieved[:1024]
        context = retrieved
        topic = match_title
    preface = ("No information available" if context is None else context)
    turns = "\n".join(turns)[-2048:]
    return { 
        'stream': client.generate_stream(prompt=prompt_template.format(prompt=what,
                                                                       turns=turns,
                                                                       context=preface),
                                            max_new_tokens=512,
                                            repetition_penalty=1.2),

        'topic': topic,
        'context': context
        }

And yes, our chatbot handles multi-turn dialogues!

Q: how can i eat fruits?
A: The document suggests several strategies on how to incorporate more fruits into one's diet if someone doesn't enjoy eating fruits directly. These suggestions include consuming fruit pies and desserts as an occasional treat while taking a vitamin and mineral supplement; buying refrigerated or frozen juices without added sugars such as orange, pineapple, and apple; choosing easy-to-eat fruits like grapes, cherries, small apples, and pears after washing them properly; and adding raisins to various dishes including meat and vegetable ones. The overall message emphasizes the importance of balancing the diet with fruits despite personal preferences due to their numerous health benefits.

Q: health benefits?
A: The document highlights the numerous health advantages associated with consuming fruits regularly. It mentions that fruits provide essential nutrients, fiber, antioxidants, and help maintain a healthy weight, lower blood pressure, reduce risk of heart disease, stroke, type 2 diabetes, certain types of cancer, and promote digestive health. Overall, it encourages people to consume fruits as part of a well-balanced diet to support better health outcomes.

Step 3: Deploy!

To deploy our RAG on Shakudo, we first create the following files:

run.sh:

streamlit run milvus-demo/chatbot/service/app.py \
            --server.port 8787 \
            --browser.serverAddress localhost

pipeline.yaml:

pipeline:
  name: "MultiturnRAG"
  requirements: "milvus-demo/chatbot/service/requirements.txt"
  tasks:
  - name: "Muliturn RAG demo"
    type: "bash script"
    port: 8787
    bash_script_path: "milvus-demo/chatbot/service/run.sh"