Integrating Vector Databases with LLMs: A Hands-On Guide

Discover how to boost Large Language Models (LLMs) using vector databases for precise, context-aware AI solutions. Learn to build smarter bots with Qwak.
Ran Romano
Co-founder & CPO at Qwak
February 20, 2024

Welcome to our hands-on guide where we dive into the world of Large Language Models (LLMs) and their synergy with Vector Databases. LLMs have been a game-changer in the tech world, driving innovation in application development. However, their full potential is often untapped when used in isolation. This is where Vector Databases step in, enhancing LLMs to produce not just any response, but the right one.

Typically, LLMs are trained on a wide array of data, which gives them a broad understanding but can lead to gaps in specific knowledge areas. Sometimes, they might even churn out information that’s off-target or biased - a byproduct of learning from the vast, unfiltered web. To address this, we introduce the concept of Vector Databases. These databases store data in a unique format known as 'vector embeddings,' which enable LLMs to grasp and utilize information more contextually and accurately.

This guide shows how to pair an LLM with a Vector Database and how to get more out of the model through this flow. We'll look at how combining the two can make LLMs more accurate and useful, especially for specific topics.

Next, we offer a brief overview of Vector Databases, explaining the concept of vector embedding and its role in enhancing AI and machine learning applications. We’ll show you how these databases differ from traditional databases and why they are better suited for AI-driven tasks, particularly when working with unstructured data like text, images, and complex patterns.

Further, we'll explore the practical application of this technology in building a Closed-QA bot. This bot, powered by Falcon-7B and ChromaDB, demonstrates the effectiveness of LLMs when coupled with the right tools and techniques.

By the end of this guide, you'll have a clearer understanding of how to harness the power of LLMs and Vector Databases to create applications that are not only innovative but also context-aware and reliable. Whether you're an AI enthusiast or a seasoned developer, this guide is tailored to help you navigate this exciting field with ease and confidence.

A Brief Overview of Vector Databases

Before diving into what a vector database is, it's essential to understand the concept of vector embedding. Vector embeddings are essential in machine learning for transforming raw data into a numerical format that AI systems can understand. This involves converting data, like text or images, into a series of numbers, known as vectors, in a high-dimensional space. High-dimensional data refers to data that has many attributes or features, each representing a different dimension. These dimensions help in capturing the nuanced characteristics of the data.

The process of creating vector embeddings starts with the input data, which could be anything from words in a sentence to pixels in an image. Large Language Models and other AI algorithms analyze this data and identify its key features. For example, in text data, this might involve understanding the meanings of words and their context within a sentence. The embedding model then translates these features into a numerical form, creating a vector for each piece of data. Each number in a vector represents a specific feature of the data, and together, these numbers encapsulate the essence of the original input in a format that the machine can process.

These vectors are high-dimensional because they contain many numbers, each corresponding to a different feature of the data. This high dimensionality allows the vectors to capture complex, detailed information, making them powerful tools for AI models. The models use these embeddings to recognize patterns, relationships, and underlying structures in the data.

Vector databases are engineered to provide optimized storage and querying abilities tailored for the distinct nature of vector embeddings. They excel in offering efficient search capabilities, high performance, scalability, and data retrieval by drawing comparisons and identifying similarities among data points.

These numerical representations of complex, high-dimensional information distinguish vector databases from traditional systems that primarily store data in formats like text and numbers. Their primary strength is in managing and querying data types such as images, videos, and text, particularly useful when these are transformed into vector format for machine learning and AI applications.

In the next illustration, we present the conversion of text into word vectors. This step is fundamental in natural language processing, enabling us to quantify and analyze linguistic relationships. For example, the vector representation of 'puppy' would be positioned closer in vector space to 'dog' than to 'house,' reflecting their semantic proximity. This approach extends to analogical relationships as well. The vector distance and direction between 'man' and 'woman' can be analogous to that between 'king' and 'queen.' This illustrates how word vectors not only represent words but also allow for a meaningful comparison of their semantic relationships in a multidimensional vector space.

Source: Plos
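To make the idea of semantic proximity concrete, here is a minimal sketch using cosine similarity. The five-dimensional vectors are hand-made for illustration (the dimensions loosely stand for animal, building, person, royal, and female); they are assumptions, not the output of any real embedding model, which would produce hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": [animal, building, person, royal, female] (illustrative values)
dog = [1.0, 0.0, 0.0, 0.0, 0.0]
puppy = [0.9, 0.1, 0.0, 0.0, 0.0]
house = [0.0, 1.0, 0.0, 0.0, 0.0]
man = [0.0, 0.0, 1.0, 0.0, 0.0]
woman = [0.0, 0.0, 1.0, 0.0, 1.0]
king = [0.0, 0.0, 1.0, 1.0, 0.0]
queen = [0.0, 0.0, 1.0, 1.0, 1.0]

# 'puppy' should sit closer to 'dog' than to 'house'
print(cosine_similarity(puppy, dog) > cosine_similarity(puppy, house))

# Analogy by vector arithmetic: king - man + woman lands on queen in this toy space
analogy = [k - m + w for k, m, w in zip(king, man, woman)]
print(analogy == queen)
```

With real embeddings the analogy rarely lands exactly on the target vector; in practice you would look for the nearest neighbor of the computed point.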

Vector Databases before the rise of LLMs

Vector databases, designed to handle vector embeddings, have several key use-cases, especially in the field of machine learning and AI:

Similarity Search: This is a core function where vector databases excel. They can quickly find data points that are similar to a given query in a high-dimensional space. This is crucial for applications like image or audio retrieval, where you want to find items similar to a particular input. Here are some industry use-case examples:

  • E-Commerce: Enhancing product discovery by allowing customers to search for products visually similar to a reference image.
  • Music Streaming Services: Finding and recommending songs with audio features similar to a user's favorite tracks.
  • Healthcare Imaging: Assisting radiologists by retrieving medical images (like X-rays or MRIs) that display similar pathologies for comparative analysis.
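At its simplest, similarity search means scoring every stored vector against the query and keeping the best matches. The sketch below does exactly that by brute force. The catalog entries and their three-dimensional vectors are hypothetical; a vector database would replace the linear scan with an index, but the result it approximates is the same.

```python
import heapq
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k_similar(query, items, k=2):
    """Return the k (name, score) pairs most similar to the query, best first."""
    scored = ((name, cosine_similarity(query, vec)) for name, vec in items.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

# Hypothetical product embeddings (illustrative values only)
catalog = {
    "red sneaker": [0.9, 0.1, 0.0],
    "crimson running shoe": [0.8, 0.2, 0.1],
    "blue sandal": [0.1, 0.9, 0.0],
    "green backpack": [0.0, 0.2, 0.9],
}

query = [1.0, 0.0, 0.0]  # stands in for the embedding of a reference image
results = top_k_similar(query, catalog, k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

The scan costs one comparison per stored vector, which is why the indexing strategies discussed later matter once a collection grows large.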

Recommendation Systems: Vector databases support recommendation systems by handling user and item embeddings. They can match users with items (like products, movies, or articles) that are most similar to their interests or past interactions. Here are some industry use-cases:

  • Streaming Platforms: Personalizing viewing experiences by recommending movies and TV shows based on a viewer's watching history.
  • Online Retailers: Suggesting products to shoppers based on their browsing and purchase history, enhancing cross-selling and up-selling opportunities.
  • News Aggregators: Delivering personalized news feeds by matching articles with a reader's past engagement patterns and preferences.

Content-Based Retrieval: Here, vector databases are used to search for content based on its actual substance rather than traditional metadata. This is particularly relevant for unstructured data like text and images, where the content itself needs to be analyzed for retrieval. Here are a few industry use-cases:

  • Digital Asset Management: Enabling companies to manage vast libraries of digital media by facilitating search and retrieval of images or videos based on visual or audio content characteristics.
  • Legal and Compliance: Searching through large volumes of documents to find specific information or documents that are contextually related to legal cases or compliance inquiries.
  • Academic Research: Assisting researchers in finding scholarly articles and research papers that are contextually similar to their work, even if specific keywords are not mentioned.

This last point about content-based retrieval is increasingly significant and facilitates a novel application:

Enhancing LLMs with Contextual Understanding: By storing and processing text embeddings, vector databases enable LLMs to perform more nuanced and context-aware information retrieval. They help in understanding the semantic content of large volumes of text, which is pivotal in tasks like answering complex queries, maintaining conversation context, or generating relevant content. This application is rapidly becoming a prominent use-case for vector databases, showcasing their ability to augment the capabilities of advanced AI systems like LLMs.

Vector vs. Traditional Databases

Traditional SQL databases excel in structured data management, thriving on exact matches and well-defined conditional logic. They maintain data integrity and suit applications needing precise, structured data handling. However, their rigid schema design makes them less adaptable to the semantic and contextual nuances of unstructured data, which is crucial in AI applications like LLMs and Generative AI.

NoSQL databases, on the other hand, offer more flexibility compared to traditional SQL systems. They can handle semi-structured and unstructured data, like JSON documents, which makes them somewhat more adaptable to AI and machine learning use cases. Despite this, even NoSQL databases can fall short in certain aspects of handling the complex, high-dimensional vector data essential for LLMs and Generative AI, which often involves interpreting context, patterns, and semantic content beyond simple data retrieval.

Vector databases fill this gap. Tailored for AI-centric scenarios, they process data as vectors, allowing them to effectively manage the intricacies of unstructured data. When working with LLMs, vector databases support operations like similarity search and contextual understanding, offering capabilities beyond both traditional SQL and flexible NoSQL databases. Their proficiency in working with approximations and pattern recognition makes them particularly suitable for AI applications where nuanced data interpretation is more important than retrieving exact data matches.

Improving Vector Database Performance

Optimizing the performance of vector databases is important for applications that rely on fast and accurate retrieval of high-dimensional data. This involves improving query speed, ensuring high accuracy, and maintaining scalability to handle growing data volumes and user requests efficiently. A significant part of this optimization revolves around indexing strategies, which are techniques used to organize and search through vector data more efficiently. Below, we expand on these indexing strategies and how they contribute to improving vector database performance.

Indexing Strategies

Indexing strategies in vector databases are designed to facilitate quick and accurate retrieval of vectors that are similar to a query vector. These strategies can dramatically affect both the speed and accuracy of search operations.

  • Quantization: Quantization involves mapping vectors to a finite set of reference points in the vector space, effectively compressing the vector data. This strategy reduces the storage requirements and speeds up the search process by limiting the search to a subset of reference points rather than the entire dataset. There are various forms of quantization, including Scalar Quantization and Vector Quantization, each with its trade-offs between search speed and accuracy. 

Quantization is particularly effective for applications managing large-scale datasets where storage and memory efficiency are critical. It excels in environments where a balance between query speed and accuracy is acceptable, making it ideal for speed-sensitive applications that can tolerate some loss of precision. However, it is less recommended for use cases demanding the highest levels of accuracy and minimal information loss, such as precise scientific research, due to the inherent trade-offs between data compression and search precision.
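As a rough illustration of the compression trade-off, here is a minimal sketch of scalar quantization that maps each float to an 8-bit code and back. The helper names and the sample vectors are assumptions made for this example; production systems typically fit ranges per dimension and use more elaborate schemes.

```python
def fit_scalar_quantizer(vectors):
    """Find the global min and max value, which define the 8-bit grid."""
    values = [v for vec in vectors for v in vec]
    return min(values), max(values)

def quantize(vec, lo, hi):
    """Map each float to an integer code in 0..255 (clamped), stored as one byte."""
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 if every value is identical
    return bytes(min(255, max(0, round((v - lo) / scale))) for v in vec)

def dequantize(codes, lo, hi):
    """Recover approximate floats from the 8-bit codes."""
    scale = (hi - lo) / 255 or 1.0
    return [lo + c * scale for c in codes]

# Illustrative vectors (not real embeddings)
vectors = [[-1.0, 0.0, 0.5], [0.25, 1.0, -0.5]]
lo, hi = fit_scalar_quantizer(vectors)

codes = quantize(vectors[0], lo, hi)
restored = dequantize(codes, lo, hi)
print(list(codes))
print(restored)
```

Each value now occupies one byte instead of the four bytes of a 32-bit float, at the price of a rounding error of at most half a grid step per value.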

  • Hierarchical Navigable Small World (HNSW) Graphs: HNSW is an indexing strategy that constructs a layered graph where each layer represents a different granularity of the dataset. Searches start from the top layer, which has fewer, more distant points, and move down to more detailed layers. This approach allows for rapid traversal of the dataset, significantly reducing the search time by quickly narrowing down the candidate set of similar vectors.

HNSW graphs strike an excellent balance between query speed and accuracy, making them well-suited for real-time search applications and recommendation systems that require immediate response times. They perform well with moderate to large datasets, offering scalable search capabilities. However, their memory consumption can become a limitation for extremely large datasets, making them less ideal for scenarios where memory resources are constrained or the dataset size significantly exceeds the practical in-memory capacity.
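The layered descent can be sketched with a heavily simplified two-layer example. This is not a full HNSW implementation: real HNSW assigns points to layers at random, keeps a list of candidates during search, and builds its links incrementally. Here the layers are chosen by hand, the links are built by brute force, and the search follows a single greedy path, all of which are assumptions made to keep the sketch short.

```python
import math

def build_knn_graph(points, ids, k):
    """Link every id to its k nearest other ids (brute force, for illustration)."""
    graph = {}
    for i in ids:
        others = sorted((j for j in ids if j != i),
                        key=lambda j: math.dist(points[i], points[j]))
        graph[i] = others[:k]
    return graph

def greedy_search(points, graph, entry, query):
    """Move to whichever neighbor is closer to the query until none is."""
    current = entry
    while True:
        best = min(graph[current], key=lambda j: math.dist(points[j], query))
        if math.dist(points[best], query) >= math.dist(points[current], query):
            return current
        current = best

# Twelve points on a line; the top layer holds a sparse, hand-picked subset
points = [(float(i), 0.0) for i in range(12)]
layers = [
    build_knn_graph(points, [0, 4, 8], k=2),          # sparse top layer: long hops
    build_knn_graph(points, list(range(12)), k=2),    # dense bottom layer: fine steps
]

def layered_search(query, entry=0):
    """Search each layer in turn, reusing the result as the next entry point."""
    for graph in layers:
        entry = greedy_search(points, graph, entry, query)
    return entry

print(layered_search((9.3, 0.0)))
```

The top layer covers most of the distance in one or two hops, so the dense bottom layer only needs a short local walk to finish the search.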

  • Inverted File Index (IVF): The IVF approach divides the vector space into a predefined number of clusters using algorithms like k-means. Each vector is assigned to the nearest cluster, and during a search, only vectors in the most relevant clusters are considered. This method reduces the search scope, improving query speed. Combining IVF with other techniques, such as Quantization (resulting in IVFADC - Inverted File Index with Asymmetric Distance Computation), can further enhance performance by reducing the computational cost of distance calculations.

The Inverted File Index (IVF) approach is recommended for handling high-dimensional data in scalable search environments, efficiently narrowing down search spaces by clustering similar items. It is particularly beneficial for datasets that are relatively static, where the overhead of occasional re-clustering is manageable. However, IVF may not be the best choice for low-dimensional data due to potential over-segmentation or for applications that demand the lowest possible latency, as the clustering process and the need to search across multiple clusters can introduce additional query time.
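The clustering and probing steps can be sketched in a few lines. This is a minimal illustration under stated assumptions: k-means is seeded with the first k vectors so the run is repeatable, the data is a handful of made-up 2-D points, and the name `nprobe` for the number of clusters scanned is borrowed from common usage rather than from anything in this guide.

```python
import math

def nearest(centroids, vec):
    """Index of the centroid closest to vec."""
    return min(range(len(centroids)), key=lambda c: math.dist(centroids[c], vec))

def kmeans(vectors, k, iterations=10):
    """Plain Lloyd's algorithm, seeded with the first k vectors for repeatability."""
    centroids = [list(v) for v in vectors[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            clusters[nearest(centroids, v)].append(v)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids

def build_ivf(vectors, k):
    """Cluster the vectors and record which vector ids fall in each cluster."""
    centroids = kmeans(vectors, k)
    inverted_lists = [[] for _ in range(k)]
    for idx, v in enumerate(vectors):
        inverted_lists[nearest(centroids, v)].append(idx)
    return centroids, inverted_lists

def ivf_search(query, vectors, centroids, inverted_lists, nprobe=1):
    """Scan only the nprobe clusters whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)),
                   key=lambda c: math.dist(centroids[c], query))[:nprobe]
    candidates = [i for c in probe for i in inverted_lists[c]]
    return min(candidates, key=lambda i: math.dist(vectors[i], query))

# Two well-separated groups of illustrative points
vectors = [(0.0, 0.0), (10.0, 10.0), (0.5, 0.2), (9.5, 10.2), (0.1, 0.9), (10.3, 9.7)]
centroids, inverted_lists = build_ivf(vectors, k=2)
print(inverted_lists)
print(ivf_search((9.0, 9.0), vectors, centroids, inverted_lists))
```

Raising `nprobe` scans more clusters, which trades some speed for a lower chance of missing a neighbor that sits just across a cluster boundary.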

Additional Considerations for Optimization

  • Dimensionality Reduction: Before applying indexing strategies, reducing the dimensionality of vectors can be beneficial. Techniques like PCA or autoencoders help in preserving the essential features of the data while reducing its complexity, which can improve both the efficiency of indexing and the speed of search operations.
  • Parallel Processing: Many indexing strategies can be parallelized, either on CPUs with multiple cores or on GPUs. This parallel processing capability allows for handling multiple queries simultaneously, significantly improving throughput and reducing response times for large-scale applications.
  • Dynamic Indexing: For databases that frequently update their data, dynamic indexing strategies that allow for efficient insertion and deletion of vectors without significant reorganization of the index can be crucial. This ensures that the database remains responsive and up-to-date with minimal performance degradation over time.

Improving vector database performance through these indexing strategies and considerations involves a deep understanding of both the underlying data and the specific requirements of the application. By carefully selecting and tuning these strategies, developers can significantly enhance the responsiveness and scalability of their vector-based applications, ensuring that they meet the demands of real-world use cases.

Enriching Context for LLMs with Vector Databases

Large Language Models (LLMs) like Meta’s Llama 2 or TII’s Falcon have significantly advanced AI capabilities with their human-like text generation. However, they struggle with specialized contexts because they are trained on broad, general datasets.

Addressing the contextual limitations can be approached in two main ways:

  1. Targeted Training: This involves retraining or fine-tuning the LLM on a dataset focused on the specific area of interest. While this method can significantly enhance the model's expertise in particular topics or industries, it's often not feasible for many organizations or individuals. The reasons include the high cost of the computational resources required for training and the expertise needed to do it well: effectively retraining LLMs requires a deep understanding of machine learning, natural language processing, and the specific architecture of the model in question.
  2. Incorporating Context via Vector Databases: Alternatively, the LLM can be augmented by adding context directly into its prompts, using data from a vector database. In this setup, the vector database stores specialized information as vector embeddings, which can be retrieved and used by the LLM to enhance its responses. This approach allows for the inclusion of relevant, specialized knowledge without the need for extensive retraining. It's particularly useful for organizations or individuals lacking the resources for targeted training, as it leverages existing model capabilities while providing focused contextual insights.

The second option is known as Retrieval-Augmented Generation (RAG), and we’ll explore it in more detail in the next sections.

Source: KDnuggets
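The heart of the second approach is simply placing retrieved text in front of the question before it reaches the model. Here is a minimal sketch of that step. The function name and the prompt wording are hypothetical choices for illustration; the right template depends on the model you use.

```python
def build_prompt(question, retrieved_passages):
    """Prepend retrieved passages to the question; fall back to the bare question."""
    if not retrieved_passages:
        return question
    context = "\n\n".join(retrieved_passages)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Passages would normally come from a vector database lookup
passages = ["Komorida was born in Kumamoto Prefecture on July 10, 1981."]
prompt = build_prompt("When was Tomoaki Komorida born?", passages)
print(prompt)
```

Because the model's weights never change, updating the bot's knowledge only requires updating the stored passages.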

Building a Closed-QA Bot with Falcon-7B and ChromaDB

In this section, we outline how to pair an LLM with a vector database by building a Closed Q&A bot. This bot is designed to effectively address science-related queries using a set of integrated technological components:

  1. databricks-dolly-15k HuggingFace Dataset: An open-source dataset of instruction-following records generated by Databricks employees. It's designed for training large language models (LLMs), synthetic data generation, and data augmentation. The dataset includes various types of prompts and responses in categories like brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.
  2. Chroma as the Vector Store (Knowledge Base): We employ Chroma as our primary vector store, acting as the knowledge base for our bot.
  3. Sentence Transformers for Semantic Search: Specifically, we use the 'multi-qa-MiniLM-L6-cos-v1' model from Sentence Transformers, optimized for semantic search applications. This model is responsible for generating embeddings that are stored in Chroma.
  4. Falcon 7B Instruct Model: Serving as our open-source generative model, Falcon 7B is a decoder-only model with 7 billion parameters. Developed by TII, it's trained on 1,500B tokens of RefinedWeb, supplemented with curated corpora. Notably, Falcon 40B, its larger counterpart, ranked at the top of Hugging Face's Open LLM Leaderboard at the time of its release.

Setting up the Environment

For implementing the code discussed in this article, the following installations are necessary:


!pip install -qU \
    --extra-index-url https://download.pytorch.org/whl/cu118 \
    transformers==4.30.2 \
    torch==2.0.1+cu118 \
    einops==0.6.1 \
    accelerate==0.20.3 \
    datasets==2.14.5 \
    chromadb \
    sentence-transformers==2.2.2

The code was initially run on a gpu.a10.2xl instance on Qwak’s Workspaces. It's important to note that the specific code required for running the Falcon-7B-Instruct model might vary depending on the hardware configuration used.

Building the “Knowledge Base”

To begin, we acquire the Databricks-Dolly dataset, focusing specifically on the closed_qa category. These entries, typically characterized by their demand for precise information, pose a challenge for a generally trained Large Language Model (LLM) due to their specificity.


from datasets import load_dataset

# Load only the training split of the dataset
train_dataset = load_dataset("databricks/databricks-dolly-15k", split='train')

# Filter the dataset to only include entries with the 'closed_qa' category
closed_qa_dataset = train_dataset.filter(lambda example: example['category'] == 'closed_qa')

print(closed_qa_dataset[0])

A typical dataset entry appears as follows:


{
  "instruction": "When was Tomoaki Komorida born?",
  "context": "Komorida was born in Kumamoto Prefecture on July 10, 1981. After graduating from high school, he joined the J1 League club Avispa Fukuoka in 2000. His career involved various positions and clubs, from a midfielder at Avispa Fukuoka to a defensive midfielder and center back at clubs such as Oita Trinita, Montedio Yamagata, Vissel Kobe, and Rosso Kumamoto. He also played for Persela Lamongan in Indonesia before returning to Japan and joining Giravanz Kitakyushu, retiring in 2012.",
  "response": "Tomoaki Komorida was born on July 10, 1981.",
  "category": "closed_qa"
}

Next, we focus on generating embeddings for each instruction and its respective context, integrating them into our vector database, ChromaDB.

ChromaDB, an open-source vector storage system, excels in managing vector embeddings. It's tailored for applications like semantic search engines, crucial in natural language processing and machine learning domains. The efficiency of ChromaDB, particularly as an in-memory database, facilitates rapid data access and manipulation, key in high-speed data processing. Its Python-friendly setup enhances its appeal for our project, streamlining integration into our workflow. For details, see the ChromaDB documentation.

Source: https://docs.trychroma.com/

To generate the embeddings, we use multi-qa-MiniLM-L6-cos-v1, which has been specifically trained for semantic search use cases. Given a question or search query, this model is able to find relevant text passages, which is exactly what we are aiming for.

In the example below, we illustrate how embeddings are stored in Chroma's in-memory collections.


import chromadb
from sentence_transformers import SentenceTransformer

class VectorStore:

    def __init__(self, collection_name):
        # Initialize the embedding model
        self.embedding_model = SentenceTransformer('sentence-transformers/multi-qa-MiniLM-L6-cos-v1')
        # In-memory Chroma client; reuse the collection if it already exists
        self.chroma_client = chromadb.Client()
        self.collection = self.chroma_client.get_or_create_collection(name=collection_name)

    # Method to populate the vector store with embeddings from a dataset
    def populate_vectors(self, dataset):
        for i, item in enumerate(dataset):
            # Embed the question together with its context; store only the context as the document
            combined_text = f"{item['instruction']}. {item['context']}"
            embeddings = self.embedding_model.encode(combined_text).tolist()
            self.collection.add(embeddings=[embeddings], documents=[item['context']], ids=[f"id_{i}"])

    # Method to search the ChromaDB collection for relevant context based on a query
    def search_context(self, query, n_results=1):
        query_embeddings = self.embedding_model.encode(query).tolist()
        # Pass a list of embeddings, one per query
        return self.collection.query(query_embeddings=[query_embeddings], n_results=n_results)


# Example usage
if __name__ == "__main__":
    # Initialize the handler with collection name
    vector_store = VectorStore("knowledge-base")

    # Assuming closed_qa_dataset is defined and available
    vector_store.populate_vectors(closed_qa_dataset)

For each dataset entry, we generate and store an embedding of the combined 'instruction' and 'context' fields, with the context acting as the document for retrieval in our LLM prompts.

Next, we will use the Falcon-7B-Instruct LLM to answer closed-information queries, first without additional context to establish a baseline, and then with context from our knowledge base to show how much the enrichment helps.

Generating Basic Answers

For our generative text task, we will harness the capabilities of the falcon-7b-instruct model, sourced from Hugging Face. This model is part of the innovative Falcon series, developed by the Technology Innovation Institute in Abu Dhabi. 

What makes Falcon-7B-Instruct stand out is its efficient balance of advanced capabilities and manageable size. It's designed for complex text understanding and generation tasks, delivering performance that rivals larger, closed-source models, but in a more streamlined package. This makes it an ideal choice for our project, where we need deep language understanding without the overhead of the larger models.

If you're planning to run the Falcon-7B-Instruct model, either on your local machine or a remote server, it's important to keep in mind the hardware requirements. As mentioned on HuggingFace's documentation, the model needs a minimum of 16GB RAM. However, for optimal performance and faster response times, using a GPU is highly recommended.


import transformers
import torch
from transformers import AutoTokenizer

class Falcon7BInstructModel:

    def __init__(self):
        # Model name
        model_name = "tiiuae/falcon-7b-instruct"
        self.pipeline, self.tokenizer = self.initialize_model(model_name)

    def initialize_model(self, model_name):
        # Tokenizer initialization
        tokenizer = AutoTokenizer.from_pretrained(model_name)

        # Pipeline setup for text generation
        pipeline = transformers.pipeline(
            "text-generation",
            model=model_name,
            tokenizer=tokenizer,
            torch_dtype=torch.bfloat16,
            trust_remote_code=True,
            device_map="auto",
        )

        return pipeline, tokenizer

    def generate_answer(self, question, context=None):
        # Preparing the input prompt
        prompt = question if context is None else f"{context}\n\n{question}"

        # Generating responses
        sequences = self.pipeline(
            prompt,
            max_length=500,
            do_sample=True,
            top_k=10,
            num_return_sequences=1,
            eos_token_id=self.tokenizer.eos_token_id,
        )

        # The pipeline returns a list with one dict per generated sequence.
        # By default, the generated text starts with the prompt itself.
        return sequences[0]['generated_text']

The code example, built upon Hugging Face's documentation, is quite clear and easy to follow.

Let's dissect its main components for a better understanding:

  • The tokenizer is a key component in natural language processing (NLP) models like Falcon-7B-Instruct. Its primary role is to convert input text into a format that the model can understand. Essentially, it breaks down the text into smaller units called tokens. These tokens can be words, subwords, or even characters, depending on the tokenizer's design. In the context of the Falcon-7B-Instruct model, the AutoTokenizer.from_pretrained(model_name) call loads a tokenizer that's specifically designed to work with this model, ensuring that the text is tokenized in a way that aligns with how the model was trained.
  • The pipeline in the transformers library is a high-level utility that abstracts away much of the complexity involved in processing data and getting predictions from a model. It handles multiple steps internally, such as tokenizing the input text, feeding the tokens into the model, and then processing the model's output into a human-readable form. In this script, the pipeline is set up for "text-generation", which means it's optimized to take in a prompt (like the user question) and generate a continuation of the text based on that prompt.
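To give a feel for what "breaking text into tokens" means, here is a toy tokenizer that greedily splits words into the longest pieces found in a small vocabulary. It is only an illustration: the vocabulary is made up, and Falcon's actual tokenizer is learned from data and works differently in its details.

```python
def tokenize(text, vocab):
    """Greedy longest-match split of each whitespace-separated word into vocab pieces."""
    tokens = []
    for word in text.lower().split():
        start = 0
        while start < len(word):
            # Try the longest remaining substring first, shrinking until one matches
            for end in range(len(word), start, -1):
                piece = word[start:end]
                if piece in vocab:
                    tokens.append(piece)
                    start = end
                    break
            else:
                tokens.append("<unk>")  # no vocabulary entry covers this character
                start += 1
    return tokens

# Hypothetical vocabulary mapping token strings to integer ids
vocab = {"when": 0, "was": 1, "born": 2, "token": 3, "ization": 4, "?": 5, "<unk>": 6}

tokens = tokenize("When was tokenization born?", vocab)
token_ids = [vocab[t] for t in tokens]
print(tokens)
print(token_ids)
```

The model itself only ever sees the integer ids, which is why the tokenizer must match the one used during training.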

Example usage:


# Initialize the Falcon model class
falcon_model = Falcon7BInstructModel()

user_question = "When was Tomoaki Komorida born?"

# Generate an answer to the user question using the LLM
answer = falcon_model.generate_answer(user_question)

print(f"Result: {answer}")

As you have probably guessed, here’s the model output for the given user question:


{ answer: “I don't have information about Tomoaki Komorida's birthdate.” }

Utilizing Falcon-7B-Instruct without supplementary context yields a negative response as it hasn’t been trained with this “lesser-known” information. This illustrates the need for enriched context in generating more targeted and useful answers for non-general questions.

Generating Context-Aware Answers

Now, let's elevate our generative model's capability by providing it with relevant context, retrieved from our vector store.

Interestingly, we're using the same VectorStore class both for generating embeddings and for fetching context for the user question:


# Assuming vector_store and falcon_model have already been initialized

# Fetch context from VectorStore, assuming it's been populated
context_response = vector_store.search_context(user_question)

# Extract the context text from the response
# Chroma returns one list of documents per query under the 'documents' key
context = "".join(context_response['documents'][0])

# Generate an answer using the Falcon model, incorporating the fetched context
enriched_answer = falcon_model.generate_answer(user_question, context=context)

print(f"Result: {enriched_answer}")

Naturally, the context-enriched answer from our LLM is accurate and swift:


Tomoaki Komorida was born on July 10, 1981.
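When reading results back from Chroma, it helps to remember that a query result is a dictionary of parallel lists, with one inner list per query. The sketch below pulls the document text out of such a result. The `extract_context` helper and the sample values (including the distance) are hypothetical stand-ins written for this example, so that it runs without a database.

```python
def extract_context(query_result, separator="\n\n"):
    """Join the documents returned for the first query into one context string."""
    documents = query_result.get("documents") or [[]]
    return separator.join(documents[0])

# Hand-written stand-in shaped like a Chroma query result for a single query
sample_result = {
    "ids": [["id_0"]],
    "distances": [[0.21]],
    "documents": [["Komorida was born in Kumamoto Prefecture on July 10, 1981."]],
}

context = extract_context(sample_result)
print(context)
```

Requesting more than one result simply lengthens the inner lists, and the helper joins all returned passages into a single block of context.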

Wrapping up

In our detailed exploration, we've shown you the ropes on crafting a Large Language Model (LLM) application, enriched by custom datasets. It's clear that managing such a model, experimenting with diverse datasets, setting up the necessary infrastructure, and achieving a functional solution is far from trivial. However, this is where Qwak shines, simplifying this complex process. With Qwak, you're not just managing models; you're effectively streamlining the journey from concept to deployment, enabling a context-aware LLM to be operational in your environment in just a few hours.

Looking forward, we're thrilled to enhance your experience with Qwak by continuously refining our existing features. Our current focus is on improving the integration with our Vector Store, offering more robust ETL (Extract, Transform, Load) and visualization capabilities. 

Next Steps

To see a comprehensive example of how to build and deploy the application discussed in this article, visit our example repository. We've laid out everything to ensure a smooth and informative journey.

We hope this article has sparked your interest and curiosity. We invite you to embark on your own adventure in building an LLM with vector databases with Qwak. Start your journey today, and explore the possibilities that this cutting-edge platform offers, absolutely free. Happy modeling!
