Large Language Models (LLMs) are a transformative technology that enables developers to build applications that were previously impossible. However, using LLMs in isolation is often insufficient to build a useful application.
Due to their broad training, general-purpose LLMs may produce outputs that are not finely tuned to specific domains or use cases, and the generated text may lack specialized knowledge or context. These models may also generate responses that are factually incorrect or biased, since they learn from unfiltered internet text, which can contain misinformation or subjective viewpoints.
In this blog, we’ll focus on the “knowledge” part: more specifically, supplying the right context so the LLM can produce cleaner, more accurate results.
As projects like LangChain and AutoGPT have shown the world, the real power of this technology comes into play when you connect LLMs to other sources of computation or knowledge.
Context-Aware LLMs
Generally, there are two main options for adding knowledge (or context) to an LLM:
- Fine-tuning the LLM on text data that includes the additional information we would like it to use.
- Using Retrieval Augmented Generation (RAG), a technique that adds an information retrieval component on top of the LLM generation process. This allows us to retrieve relevant information and feed it into the generation model as an additional source of information, which is what we'll actually do in this article (a rough sketch of the flow follows this list).
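Before diving into the bot itself, here is a rough sketch of how the retrieval and generation pieces fit together in a RAG setup. The function name and prompt wording are purely illustrative; it assumes a sentence-transformers style embedder, a Chroma-style vector store, and a Hugging Face text-generation pipeline, which is exactly the stack we set up below.

def answer_with_rag(question, embedder, vector_store, generator):
    # 1. Embed the user question and retrieve the most relevant passage
    question_embedding = embedder.encode(question).tolist()
    results = vector_store.query(query_embeddings=[question_embedding], n_results=1)
    passage = results["documents"][0][0]

    # 2. Ground the prompt in the retrieved passage
    prompt = f"Answer using only the text below.\nText: {passage}\n{question}"

    # 3. Let the generative model produce the final answer
    return generator(prompt, max_length=500)[0]["generated_text"]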
Building a Smarter Science Bot
To demonstrate the technique, we are going to build a Science Q&A bot that will help us answer science-related questions by using the following building blocks:
- **sciq dataset** - The SciQ dataset contains 13,679 crowdsourced science exam questions about Physics, Chemistry and Biology, among others. The questions are in multiple-choice format with four answer options each. For the majority of the questions, an additional paragraph with supporting evidence for the correct answer is provided.
- Chroma as the Vector Store (our “Knowledge base”)
- The Sentence Transformers multi-qa-MiniLM-L6-cos-v1 model (which is trained specifically for semantic search use cases) to generate embeddings for the store.
- **Falcon 7B Instruct** model as our open source generative model. Falcon-7B is a 7B-parameter causal decoder-only model built by TII and trained on 1,500B tokens of RefinedWeb enhanced with curated corpora. Its bigger version, Falcon-40B, is currently the #1 LLM on Hugging Face's Open LLM Leaderboard.
Required installations for running the code in this article are:
!pip install -qU \
  transformers==4.30.2 \
  torch==2.0.1+cu118 \
  einops==0.6.1 \
  accelerate==0.20.3 \
  datasets==2.13.0 \
  chromadb==0.3.26 \
  sentence-transformers==2.2.2
The code in this article was executed on an A100 Colab instance. The exact code needed to run the falcon-7b-instruct model varies based on the specific hardware used.
Building the Knowledge Base
As mentioned, the dataset we will use is sciq, which is available as a Hugging Face dataset.
from datasets import load_dataset
dataset = load_dataset("sciq", split='train')
print(dataset[0])
{'question': 'What type of organism is commonly used in preparation of foods such as cheese and yogurt?',
'distractor3': 'viruses',
'distractor1': 'protozoa',
'distractor2': 'gymnosperms',
'correct_answer': 'mesophilic organisms',
'support': 'Mesophiles grow best in moderate temperature, typically between 25°C and 40°C (77°F and 104°F). Mesophiles are often found living in or on the bodies of humans or other animals. The optimal growth temperature of many pathogenic mesophiles is 37°C (98°F), the normal human body temperature. Mesophilic organisms have important uses in food preparation, including cheese, yogurt, beer and wine.'}
This dataset contains the questions, answers, and support text for 13,679 crowdsourced science exam questions.
Next, we need to build the knowledge base, represented in a vector database. To generate embeddings for the answers, we use multi-qa-MiniLM-L6-cos-v1, which has been trained specifically for semantic search use cases. Given a question or search query, this model is able to find relevant text passages, which is exactly what we are aiming for.
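To get a feel for what this model does, here is a small self-contained example (the query and passages are made up for illustration): it encodes a question and two candidate passages into 384-dimensional vectors and scores them with cosine similarity, ranking the relevant passage first.

from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer('sentence-transformers/multi-qa-MiniLM-L6-cos-v1')

query = "What organisms are used to make cheese and yogurt?"
passages = [
    "Mesophilic organisms have important uses in food preparation, including cheese and yogurt.",
    "Nuclear symbols are used to write nuclear equations for radioactive decay.",
]

# Encode the query and the candidate passages into 384-dimensional vectors
query_embedding = embedder.encode(query)
passage_embeddings = embedder.encode(passages)

# Cosine similarity scores the relevant passage much higher than the unrelated one
print(util.cos_sim(query_embedding, passage_embeddings))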
The embeddings are added to a Chroma vector store, an open-source embedding database. In the following example, the embeddings (stored in collections) are kept in memory.
import chromadb
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('sentence-transformers/multi-qa-MiniLM-L6-cos-v1')
chroma_client = chromadb.Client()
collection = chroma_client.create_collection(name="sciq")
for i in range(len(dataset)):
    collection.add(
        embeddings=[model.encode(f"{dataset[i]['correct_answer']}. {dataset[i]['support']}").tolist()],
        documents=[dataset[i]['support']],
        ids=[f"id_{i}"]
    )
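One note on the loop above: adding one row at a time works, but it is slow for ~13.7K rows because each support paragraph is encoded individually. Both SentenceTransformer.encode and collection.add accept lists, so a batched version along the following lines should produce the same collection (the batch size of 256 is an arbitrary choice):

# Sketch of a batched alternative to the row-by-row loop above
batch_size = 256
for start in range(0, len(dataset), batch_size):
    batch = dataset[start:start + batch_size]  # slicing a Dataset returns a dict of columns
    texts = [f"{answer}. {support}" for answer, support in zip(batch['correct_answer'], batch['support'])]
    collection.add(
        embeddings=model.encode(texts).tolist(),  # encode the whole batch in one call
        documents=batch['support'],
        ids=[f"id_{start + j}" for j in range(len(texts))]
    )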
Generating Answers
For the text generation task, we are going to use the falcon-7b-instruct model, also downloaded from Hugging Face. Falcon is a new family of state-of-the-art language models created by the Technology Innovation Institute in Abu Dhabi, and released under the Apache 2.0 license.
Falcon-40B is the first (and currently only) “truly open” model with capabilities rivaling many current closed-source models.
import transformers
import torch
from transformers import AutoTokenizer

# Keep `model` as the sentence-transformers embedder defined above;
# the generative model id gets its own variable
model_id = "tiiuae/falcon-7b-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
First, let’s see the result from the model without any additional context provided to it:
user_question = "What is a nuclear symbol?"

sequences = pipeline(
    user_question,  # only the question, without any context
    max_length=500,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)

for seq in sequences:
    print(f"Result: {seq['generated_text']}")
A nuclear symbol refers to a graphic or visual representation of nuclear technology or concepts, such as atomic nuclei or radiation. They are often used in scientific or educational contexts.
Without the context, we get a response back, but the answer is very broad and generally not that helpful.
Mixing it Up with Smart Context
Now, let’s provide our generative model with some proper context, fetched from the vector store.
Notice we use the same embedder (multi-qa-MiniLM-L6-cos-v1) to fetch the context as well:
user_question = "What is a nuclear symbol?"

context = collection.query(
    query_embeddings=[model.encode(user_question).tolist()],
    n_results=1
)

print(context)
{'ids': [['id_13']],
'embeddings': None,
'documents': [['Nuclear symbols are used to write nuclear equations for radioactive decay. Let’s consider the example of the beta-minus decay of thorium-234 to protactinium-234. This reaction is represented by the equation:.']],
'metadatas': [[None]],
'distances': [[0.47168487310409546]]}
This is followed by some prompt templating, which includes the context seen above:
prompt = f"""Answer the question as truthfully as possible using the provided text, and if the answer is not contained within the text below, say 'I don't know'.
Text: {"".join(context['documents'][0])}.
{user_question}
"""
sequences = pipeline(
    prompt,
    max_length=500,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)

for seq in sequences:
    print(f"Result: {seq['generated_text']}")
Which results in:
The nuclear symbols used to represent radioactive decay are based on the atomic structure of atoms. The most commonly used nuclear symbols include:
- For atomic number 1, the symbol is usually a single letter representing that element, such as 'A' for helium or 'B' for boron. For example, 'N' represents nitrogen.
- For atomic numbers 2 to 8, the symbols are usually a combination of the element's letter symbol, followed by a subscript representing the number of neutrons in the nucleus, such as 'Na' (Sodium-23) or 'Au' (Gold-79). For example, 'Na' represents the element's 23rd element, Sodium.
- For atomic numbers 9 and above, the symbols represent specific isotopes and their corresponding elements. For example, the symbol 'Xe' is used for the element (Radium- decaying) 'Xenon-135', which is an isotope of the element (Radon-222).
In the case of the beta-minus decay of thorium-234 in protactinium-234, the nuclear symbol representing the beta emission of the decay chain is 'B'.
A much better, more accurate answer to our user's question. We can see that with proper context to draw from, the LLM even expands on the answer very well.
Let’s Recap
In the example above, we demonstrated how to build an LLM-based application powered by custom datasets.
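As a rough sketch, the whole flow from this article can be condensed into a single helper that reuses the embedder (model), the Chroma collection, and the Falcon pipeline defined above. The answer_question name and its defaults are our own choices, not part of any library:

def answer_question(user_question, n_results=1):
    # Retrieve the most relevant support passage(s) for the question
    context = collection.query(
        query_embeddings=[model.encode(user_question).tolist()],
        n_results=n_results,
    )

    # Ground the generative model in the retrieved passage(s)
    prompt = f"""Answer the question as truthfully as possible using the provided text, and if the answer is not contained within the text below, say 'I don't know'.
Text: {" ".join(context['documents'][0])}.
{user_question}
"""
    sequences = pipeline(
        prompt,
        max_length=500,
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
    )
    return sequences[0]['generated_text']

print(answer_question("What is a nuclear symbol?"))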
Managing the model, running different experiments with different datasets, provisioning the required infrastructure, and getting to a working solution isn’t a simple process. Using Qwak, you can manage this process, get the resources you need, and have a working context-aware LLM deployed in your environment in just a few hours.
In the near future, we plan to add a cleaner and tighter integration with a built-in managed vector store as part of Qwak, including ETL and visualization capabilities. Stay tuned! 🙂
Next Steps
For a full example of how to build and deploy the application described in this article, see our example repository.
We hope you found this article interesting. We welcome you to build your own LLM on top of Qwak and start your journey here for free!