How to Fine-Tune LLMs on your Data

By 
Alon Lev
June 11, 2023

What are LLMs / Generative models? 

Generative models have revolutionized the field of artificial intelligence by enabling computers to generate realistic and creative outputs. One particularly remarkable class of generative models is language models, which can produce human-like text based on the patterns and structures they learn from vast amounts of training data. Most of us are familiar with products such as ChatGPT by OpenAI and Bard by Google. In this blog post, we will look at generative models, explore the capabilities of large language models (LLMs), and learn how to fine-tune a model on your own data.

*Screenshot from Qwak.com 

LLMs are probably the most exciting technology to come out in the last decade, and almost everyone you know is already using one in some way. Many companies now want to take advantage of this technology and ask their Data Scientists and ML Engineers to apply LLMs to their business in order to improve customer experience and gain a competitive advantage.

General-purpose LLMs, such as OpenAI's GPT-3, are trained on large-scale, diverse datasets comprising a wide range of internet text. These models aim to understand and generate text across various domains and topics. Due to their broad training, general-purpose LLMs may produce outputs that are not finely tuned to specific domains or use cases. The generated text might lack specialized knowledge or context. These models may generate responses that are factually incorrect or biased since they learn from unfiltered internet text, which can contain misinformation or subjective viewpoints.

Why should I fine-tune LLMs? 

Fine-tuned LLMs are general-purpose LLMs that undergo additional training on domain-specific or task-specific datasets. This process allows the models to specialize in particular use cases and improves their performance in specific domains. Their major advantages are:

  1. Domain expertise: Fine-tuned LLMs gain deeper knowledge and context within a specific domain. They can generate more accurate and domain-specific responses due to their training on specialized data.
  2. Improved performance: By fine-tuning on relevant datasets, LLMs can achieve higher accuracy and provide more tailored outputs for specific tasks.
  3. Controlled behavior: Fine-tuning allows users to specify desired behaviors or constraints for the LLMs, enabling more control over the generated outputs.

*Screenshot from Qwak.com 

Here are a few use cases for LLM fine-tuning:

  1. Customer Support Chatbots: By fine-tuning a language model on customer support data, you can create chatbots that can understand and respond to customer queries, provide relevant information, and assist with common issues. This helps automate customer support processes and improves the overall customer experience.
  2. Content Generation: LLM fine-tuning can be used to generate specific types of content, such as product descriptions, news articles, or creative writing. Trained on data from a specific domain, the model can produce high-quality and contextually relevant content tailored to that domain.
  3. Sentiment Analysis and Text Classification: Language models can be fine-tuned to perform sentiment analysis or text classification tasks. For example, you can train a model on a labeled dataset of customer reviews to classify future reviews as positive or negative, helping businesses understand customer sentiment.
  4. Vulnerability Detection: LLM fine-tuning can be used to train a model on vulnerability-related data, such as security advisories, code repositories, or security-related discussions. The fine-tuned model can then analyze code snippets, software repositories, or system configurations to identify potential vulnerabilities.
  5. Fraudulent Transaction Detection: By training a language model on historical transaction data, including known fraudulent transactions, you can fine-tune it to identify patterns and indicators of fraud. The model can learn to detect suspicious transactions, anomalous behavior, or characteristics associated with fraudulent activities.

How to fine-tune an open-source LLM on your data:

We’ll start by choosing a model to fine-tune. Hugging Face is a company and an open-source community focused on natural language processing (NLP) and machine learning models. In the Hugging Face model repository, we can find a public open-source model and fine-tune it. For this example, I chose Microsoft's DialoGPT-large model.

Using this model “as-is” is quite simple. This code loads it:


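from transformers import AutoModelForCausalLM, AutoTokenizer

def initialize_model(self):
    self.tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-large")
    self.model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-large")
    self.user_context_map = {}  # per-user chat history (token ids) used by predict()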

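To run inference, I’ll need to send the data to the model in the right format: each message ends with the tokenizer's end-of-sequence token and is appended to the user's previous conversation, if there is one.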


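import pandas as pd
import torch

def _reset_user_context(self, user_id):
    # Helper not shown in the original post; assumed to drop the stored history
    # for one user. Assumes self.user_context_map is a dict created at initialization.
    self.user_context_map.pop(user_id, None)

def predict(self, df: pd.DataFrame) -> pd.DataFrame:
    # The input frame carries a single request in its first row
    reset_context = df["reset_context"].iloc[0]
    user_id = df["user_id"].iloc[0]
    message = df["message"].iloc[0]

    # DialoGPT expects every turn to end with the end-of-sequence token
    new_user_input_ids = self.tokenizer.encode(
        message + self.tokenizer.eos_token, return_tensors="pt"
    )
    if bool(reset_context):
        self._reset_user_context(user_id)

    # Prepend this user's chat history, if there is one
    if user_id in self.user_context_map:
        bot_input_ids = torch.cat(
            [self.user_context_map[user_id], new_user_input_ids], dim=-1
        )
    else:
        bot_input_ids = new_user_input_ids

    # generate() is a method of the model itself (self.model), not of a Trainer
    chat_history_ids = self.model.generate(
        bot_input_ids, max_length=1000, pad_token_id=self.tokenizer.eos_token_id
    )
    # Decode only the newly generated tokens, i.e. everything after the prompt
    response = self.tokenizer.decode(
        chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True
    )

    self.user_context_map[user_id] = chat_history_ids
    return pd.DataFrame(data={"answer": [response], "model_type": ["dialogpt-large"]})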

Now, an example prediction will look like this:


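# `model` is assumed to be an instance of the class that defines predict(), which takes a DataFrame
response = model.predict(pd.DataFrame({"user_id": ["Yuval"], "message": ["How are you doing today?"], "reset_context": [False]}))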
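
The input format used for inference can also be illustrated without loading the model. The sketch below works on plain strings rather than token tensors and only shows the idea: every turn ends with the end-of-sequence token, and earlier turns are placed before the new message. The function name `build_dialog_input` and the `max_turns` parameter are assumptions made for this illustration and are not part of the original code, which keeps the full history as token ids.

```python
EOS_TOKEN = "<|endoftext|>"  # GPT-2 style end-of-sequence token, which DialoGPT reuses


def build_dialog_input(history, new_message, eos_token=EOS_TOKEN, max_turns=None):
    """Join earlier turns and the new message, ending every turn with EOS.

    max_turns (hypothetical parameter) keeps only the most recent turns so the
    input does not grow without bound over a long conversation.
    """
    turns = list(history) + [new_message]
    if max_turns is not None:
        # Keep the last max_turns turns; the max() guard makes max_turns=0 yield no turns
        turns = turns[max(len(turns) - max_turns, 0):]
    return "".join(turn + eos_token for turn in turns)


history = ["How are you doing today?", "I'm good, thanks."]
prompt = build_dialog_input(history, "What can you help me with?", max_turns=2)
print(prompt)
```

Limiting the number of turns is one possible way to keep a long conversation within the length passed to the model; whether to trim by turns or by tokens is a design choice the original post does not cover.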

To fine-tune this model, we only need to add one more step to this process: training the model.


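from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

def build(self):
    self.tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-large")
    self.model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-large")
    # GPT-2 style tokenizers have no padding token, so reuse the end-of-sequence token
    self.tokenizer.pad_token = self.tokenizer.eos_token

    # Load my custom dataset. The file names and the "text" column are placeholders
    raw_datasets = load_dataset(
        "json", data_files={"train": "train.jsonl", "validation": "validation.jsonl"}
    )

    def tokenize_function(examples):
        return self.tokenizer(examples["text"], truncation=True, max_length=512)

    # Tokenize the dataset with the tokenizer above and drop the raw text columns
    tokenized_datasets = raw_datasets.map(
        tokenize_function, batched=True, remove_columns=raw_datasets["train"].column_names
    )
    # mlm=False pads each batch and builds the labels from input_ids (causal LM training)
    data_collator = DataCollatorForLanguageModeling(tokenizer=self.tokenizer, mlm=False)
    training_args = TrainingArguments("test-trainer")

    trainer = Trainer(
        model=self.model,
        args=training_args,
        train_dataset=tokenized_datasets["train"],
        eval_dataset=tokenized_datasets["validation"],
        data_collator=data_collator,
    )
    trainer.train()

    # Keep the fine-tuned model (the same object as self.model) for inference
    self.trained_model = trainer.model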
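
The training step expects a dataset with `train` and `validation` splits. As a rough sketch of one way to produce them, the code below turns a list of conversations into JSON Lines files using only the standard library. The `text` field, the helper names, the 10% validation share, and the generated sample conversations are all assumptions for illustration; the original post does not describe its dataset format.

```python
import json
import os
import random
import tempfile

EOS_TOKEN = "<|endoftext|>"  # end-of-sequence token used by GPT-2 style tokenizers


def to_records(conversations, eos_token=EOS_TOKEN):
    """Flatten each conversation (a list of turns) into one training string."""
    return [{"text": "".join(turn + eos_token for turn in conv)} for conv in conversations]


def train_validation_split(records, validation_fraction=0.1, seed=42):
    """Shuffle the records reproducibly and hold out a validation share."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * validation_fraction))
    return {"train": shuffled[n_val:], "validation": shuffled[:n_val]}


def write_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Placeholder conversations; replace with your own data
conversations = [[f"question {i}", f"answer {i}"] for i in range(20)]
splits = train_validation_split(to_records(conversations))

output_dir = tempfile.mkdtemp()
paths = {}
for name, records in splits.items():
    paths[name] = os.path.join(output_dir, f"{name}.jsonl")
    write_jsonl(records, paths[name])

print({name: len(records) for name, records in splits.items()})
```

Files written this way can be read with the Hugging Face `datasets` library through `load_dataset("json", data_files=...)`, passing a dictionary that maps each split name to its file path.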

With this simple process, we can fine-tune the model on our custom data.
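
To check whether a fine-tuning run actually helped, one common measure for causal language models is perplexity, the exponential of the mean cross-entropy loss on the validation set. The sketch below assumes you already have an evaluation loss per run, for example the `eval_loss` value that the Hugging Face `Trainer.evaluate()` method reports. The run names and loss values are made up for illustration and are not results from this post.

```python
import math


def perplexity(eval_loss):
    """Perplexity is the exponential of the mean cross-entropy loss (in nats)."""
    return math.exp(eval_loss)


def best_run(eval_losses):
    """Return the name of the run with the lowest evaluation loss."""
    return min(eval_losses, key=eval_losses.get)


# Made-up loss values, for illustration only
eval_losses = {"run_a": 3.2, "run_b": 2.1}
for name, loss in eval_losses.items():
    print(f"{name}: loss={loss:.2f}, perplexity={perplexity(loss):.1f}")
print("best:", best_run(eval_losses))
```

Lower perplexity on held-out data suggests the model predicts your domain's text better, but it does not replace reading sample responses before deploying.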

Let's recap what we did: 

In the example above, we demonstrated how to fine-tune a GPT-based model on your private data. Managing the model, fine-tuning it, running experiments with different datasets, deploying the fine-tuned model, and ending up with a working solution isn’t a simple process. Using Qwak, you can manage this process, get the resources that you need, and have a working fine-tuned model in just a few hours.

What will the future bring? 

We don't think anyone doubts that LLMs are here to stay. The question is how they will evolve, not only in terms of capabilities, but also in terms of use cases, usage patterns, and the problems they may or may not solve.

The job of ML practitioners in an LLM-based world:

It might be a bit of a gamble, but we don't think ML practitioners are going anywhere, especially as more and more use-case-specific LLM applications emerge (like the example we just showed). Even with LLMs, ML practitioners are still responsible for the end-to-end development, deployment, and improvement of language models. They contribute their expertise in data preparation, model selection, fine-tuning, evaluation, ethical considerations, and ongoing optimization to ensure that LLMs are used effectively across applications and domains. All the other types of models, such as churn, LTV, fraud detection, and loan engines, are still very much needed too. As with any new technology, some believe that practitioners will be the first to be replaced. However, as the business benefit rises, the demand for professionals rises with it (where are the people who said there would be no infra people in the cloud world?).

Hope you found this article interesting. We welcome you to build your own LLM on top of Qwak and start your journey here for free!
