How to Fine-Tune LLMs on Custom Datasets

Explore the capabilities of LLMs and learn the process of fine-tuning LLM models for domain expertise, improved performance, and controlled behavior.
Pavel Klushin
Head of Solution Architecture at Qwak
February 15, 2024

An Introduction to Fine-Tuning LLMs

The advent of generative models, notably Large Language Models (LLMs) such as ChatGPT and Bard, represents a significant leap in artificial intelligence. These models excel in generating human-like text, transforming our interaction with language through machines. This exploration delves into fine-tuning LLMs with custom datasets to enhance their performance and precision for specific applications.

Understanding LLMs and Generative Models

LLMs, a cornerstone of generative AI, are trained on extensive text datasets so that they can interpret and generate language. Their adoption across sectors has improved customer experiences and given businesses a competitive edge. From automating intricate customer service dialogs to crafting personalized content, LLMs are redefining technological capabilities.

The Need for Fine-Tuning LLMs

Why fine-tune LLMs? Despite their vast capabilities, general-purpose LLMs sometimes fall short in specialized tasks due to their broad and diverse training. Fine-tuning these models on narrowly focused datasets enables them to acquire deep domain expertise, significantly improving their accuracy and allowing for behavior that is finely tailored to specific applications.

Enhancing Model Performance Through Fine-Tuning

  • Domain-Specific Knowledge: Fine-tuning imbues LLMs with an in-depth understanding of specialized fields, be it legal jargon, medical terminology, or customer service nuances.
  • Improved Accuracy and Relevance: By learning from domain-specific examples, fine-tuned models offer responses that are not only accurate but highly relevant to the task at hand.

Generalization vs. Specialization

The balance between generalization and specialization is pivotal. General-purpose models, while versatile, often lack the depth required for niche applications. Conversely, specialized models thrive in specific contexts, providing insights and solutions that are deeply aligned with particular domain requirements. Fine-tuning is the bridge that allows us to leverage the strengths of both approaches, creating models that are both adaptable and deeply knowledgeable.

Real-World Applications - The Versatility Across Industries

  • Legal Document Analysis: Specialized LLMs can parse and interpret complex legal texts, aiding in research and case preparation.
  • Medical Research: Fine-tuned models can sift through vast databases of research, identifying relevant studies and summarizing findings in a fraction of the time it would take a human.
  • Customer Service: Automating responses to customer inquiries with nuanced understanding and context-awareness, reducing wait times and improving satisfaction.
  • Content Creation: Generating articles, stories, and reports tailored to specific audiences, styles, or formats, enhancing engagement and reach.
  • Healthcare: Summarizing patient records, literature, and research findings to support diagnostic and treatment processes.

The Fine-Tuning Process

Fine-tuning an LLM is a multifaceted process, involving several key stages from dataset preparation to model deployment and ongoing optimization.

Selecting a Model for Fine-Tuning

Choosing the right model is the first step in the fine-tuning journey. The Hugging Face platform, known for its extensive repository of open-source models and vibrant community, is an excellent place to start. For this guide, we will fine-tune an open-source LLM, Microsoft's DialoGPT-large, chosen for its robustness and versatility in text generation.

1. Initializing the Model

Initiating the fine-tuning process involves loading the selected model and its tokenizer, which prepares the text for the model:

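from transformers import AutoModelForCausalLM, AutoTokenizer

def initialize_model(self):
    # Load the tokenizer and the pretrained weights from the Hugging Face Hub
    self.tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-large")
    self.model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-large")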

2. Fine-Tuning the Model

The core of fine-tuning lies in adapting the model to your dataset, enabling it to learn from your specific data and better align with the task at hand:

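from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

def build(self):
    # Initialize tokenizer and model
    self.tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-large")
    self.model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-large")
    # DialoGPT defines no pad token, so reuse the end-of-sequence token
    self.tokenizer.pad_token = self.tokenizer.eos_token
    # Load and prepare your custom dataset
    raw_datasets = load_dataset("your_dataset")  # Placeholder: replace with your dataset
    # Assumes a "train" split with a "text" column; adjust to your data
    tokenized_datasets = raw_datasets.map(
        lambda batch: self.tokenizer(batch["text"], truncation=True),
        batched=True, remove_columns=raw_datasets["train"].column_names)
    # Causal-LM collator (mlm=False) pads each batch and builds labels from input_ids
    data_collator = DataCollatorForLanguageModeling(tokenizer=self.tokenizer, mlm=False)
    training_args = TrainingArguments("test-trainer")
    # Train the model
    trainer = Trainer(model=self.model, args=training_args,
                      train_dataset=tokenized_datasets["train"], data_collator=data_collator)
    trainer.train()
    self.trained_model = trainer.model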

3. Preparing for Inference

To make predictions, ensure the data is appropriately formatted to maintain context and user interaction:

import pandas as pd
import torch

def predict(self, df: pd.DataFrame) -> pd.DataFrame:
    # Extract and process input data
    reset_context = list(df["reset_context"])[0]
    user_id = list(df["user_id"])[0]
    message = list(df["message"])[0]
    new_user_input_ids = self.tokenizer.encode(message + self.tokenizer.eos_token, return_tensors='pt')
    # Discard the stored conversation when a reset is requested
    # (self.user_context_map is assumed to be a dict created when the class is initialized)
    if bool(reset_context):
        self.user_context_map.pop(user_id, None)
    # Append the new message to the user's stored history, if there is one
    bot_input_ids = torch.cat([self.user_context_map[user_id], new_user_input_ids], dim=-1) if user_id in self.user_context_map else new_user_input_ids
    # Generate a reply, then decode only the newly generated tokens
    chat_history_ids = self.model.generate(bot_input_ids, max_length=1000, pad_token_id=self.tokenizer.eos_token_id)
    response = self.tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)
    self.user_context_map[user_id] = chat_history_ids
    return pd.DataFrame(data={"answer": [response], "model_type": ["dialogpt-large"]})
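
The per-user context handling in predict can be illustrated without loading a model. The following is a minimal sketch that uses plain lists of token IDs in place of tensors; the ContextStore class and its method names are hypothetical and exist only for illustration.

```python
from typing import Dict, List


class ContextStore:
    """Keeps a running token history per user (hypothetical helper)."""

    def __init__(self) -> None:
        self.user_context_map: Dict[str, List[int]] = {}

    def build_input(self, user_id: str, new_ids: List[int], reset_context: bool) -> List[int]:
        # Drop the stored history when the caller asks for a fresh conversation
        if reset_context:
            self.user_context_map.pop(user_id, None)
        # Prepend the stored history, if any, to the new message tokens
        return self.user_context_map.get(user_id, []) + new_ids

    def save(self, user_id: str, full_history: List[int]) -> None:
        # Store prompt plus generated reply so the next turn sees both
        self.user_context_map[user_id] = full_history


store = ContextStore()
turn1 = store.build_input("Pavel", [1, 2, 3], reset_context=False)
store.save("Pavel", turn1 + [9])  # pretend the model generated token 9
turn2 = store.build_input("Pavel", [4, 5], reset_context=False)
turn3 = store.build_input("Pavel", [6], reset_context=True)
print(turn2)  # [1, 2, 3, 9, 4, 5]
print(turn3)  # [6]
```

Because the history grows with every turn, a long conversation will eventually approach the generation length limit, which is one reason a reset flag is useful.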

4. Predict

An example invocation of the prediction function might look like this:

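# `model` is assumed to be an instance of the class that defines build() and predict()
response = model.predict(pd.DataFrame([{"user_id": "Pavel", "message": "How are you doing today?", "reset_context": False}]))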

Key Considerations When Fine-Tuning an LLM

Data Preparation and Selection

The cornerstone of effective fine-tuning is the assembly of a high-quality, relevant dataset. This dataset not only guides the model's learning process but also defines the boundaries of its expertise.

Consider fine-tuning an LLM to power a tech support chatbot. Preparing the data involves several streamlined steps:

1. Data Collection: Compile a dataset from customer service interactions, focusing on technical queries and responses, say 50,000 entries.

2. Data Cleaning: Remove irrelevant details like personal information and off-topic discussions to ensure the model learns from valuable content.

3. Annotation: Optionally, categorize the data (e.g., by issue type such as installation or UI problems) to help the model understand different problem areas.

4. Dataset Split: Divide the cleaned dataset into training (80%), validation (10%), and test (10%) sets to support the model's learning and evaluation process.

5. Preprocessing: Use the model's tokenizer to convert the text into a format it can understand, preparing it for the fine-tuning process.

This approach ensures the chatbot model is specifically tuned to address software issues, improving its ability to provide relevant and accurate support.
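
The cleaning and splitting steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the record fields ("query", "response"), the email-masking rule, and the helper names are hypothetical, and real pipelines usually need far more thorough removal of personal information.

```python
import random
import re
from typing import Dict, List, Tuple

# Simple pattern used here as one example of personal information to mask
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def clean_text(text: str) -> str:
    # Replace email addresses with a neutral token and trim whitespace
    return EMAIL_PATTERN.sub("[EMAIL]", text).strip()


def split_dataset(
    records: List[Dict[str, str]], seed: int = 42
) -> Tuple[List[Dict[str, str]], List[Dict[str, str]], List[Dict[str, str]]]:
    # Shuffle a copy so the split is reproducible and the input is left untouched
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10  # 80% for training
    n_val = len(shuffled) // 10        # 10% for validation
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder (about 10%) for testing
    return train, val, test


# Toy records standing in for real customer service interactions
records = [
    {"query": clean_text(f"Install fails, mail me at user{i}@example.com"),
     "response": "Please reinstall the latest version."}
    for i in range(50)
]
train, val, test = split_dataset(records)
print(len(train), len(val), len(test))  # 40 5 5
```

Fixing the shuffle seed keeps the three sets stable between runs, so evaluation numbers from different experiments remain comparable.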

Computational Resources

Fine-tuning Large Language Models (LLMs) requires significant computational power, often necessitating GPUs or cloud computing resources for efficient workload management. To mitigate these demands, leveraging cloud platforms like AWS and Google Cloud offers scalable computational power. Additionally, applying efficiency optimization techniques, such as model pruning and quantization, helps reduce the model's size and computational needs while preserving performance. These strategies ensure the fine-tuning process is both manageable and cost-effective.
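
To give a feel for why quantization reduces memory needs, here is a minimal sketch of symmetric 8-bit quantization applied to a handful of weights. It is a toy illustration, not how any particular library implements it; the function names and example weights are made up. Storing each weight as an 8-bit integer instead of a 32-bit float cuts its storage to a quarter, at the cost of a small rounding error.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    # Symmetric quantization: the largest magnitude maps to 127
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    # Recover approximate float values from the stored integers
    return [q * scale for q in quantized]


weights = [0.42, -1.27, 0.03, 0.9]  # made-up example values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q)
print(f"scale={scale:.4f}, max rounding error={max_error:.6f}")
```

The rounding error per weight is bounded by half the scale, which is why quantization tends to preserve performance when the weight range is not dominated by a few outliers.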

Hyperparameter Tuning and Model Training

  • Learning Rate: Determines the size of steps the model takes during optimization. Too high, and the model might overshoot optimal solutions; too low, and the training could become prohibitively slow. For example, starting with a learning rate of 5e-5 and adjusting based on validation performance is common practice.
  • Batch Size: The number of training examples utilized in one iteration. Smaller batches use less memory but produce noisier gradient updates, while larger batches give smoother gradient estimates and can expedite training at the cost of more memory. A batch size of 16 or 32 is typical for fine-tuning tasks.
  • Number of Epochs: Indicates how many times the training dataset is passed through the model. More epochs can improve learning up to a point, beyond which the model might start overfitting. Experimenting with 3-5 epochs is often a good starting point.

Adjusting these hyperparameters requires a balance between speed and accuracy. Effective tuning often involves running multiple training trials with varied settings, monitoring validation loss to identify the optimal configuration.
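
A quick calculation shows how batch size and epoch count translate into training work. The sketch below assumes one optimizer step per batch (no gradient accumulation) and uses 40,000 training examples, which is 80% of the 50,000-entry example dataset mentioned earlier; the FineTuneConfig class is a hypothetical container, not part of any library.

```python
import math
from dataclasses import dataclass


@dataclass
class FineTuneConfig:
    learning_rate: float = 5e-5
    batch_size: int = 16
    num_epochs: int = 3


def total_optimizer_steps(num_examples: int, config: FineTuneConfig) -> int:
    # One step per batch; a final partial batch still counts as a step
    steps_per_epoch = math.ceil(num_examples / config.batch_size)
    return steps_per_epoch * config.num_epochs


train_examples = 40_000
results = {}
for batch_size in (16, 32):
    config = FineTuneConfig(batch_size=batch_size)
    results[batch_size] = total_optimizer_steps(train_examples, config)
    print(batch_size, results[batch_size])
# 16 7500
# 32 3750
```

Doubling the batch size halves the number of steps, but each step processes twice as many examples, so the real saving depends on how well the hardware handles the larger batch.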

Experimentation and Validation

Experimentation and validation are essential practices in fine-tuning LLMs to ensure models are both effective and generalizable.

  • A/B Testing: This involves conducting experiments where two or more sets of hyperparameters are tested in parallel to compare their performance. For instance, one might run two versions of a model fine-tuning process where version A uses a learning rate of 5e-5 and version B uses 3e-5. By comparing their performance on a validation set, one can determine which learning rate yields better results.
  • Validation Sets: A crucial part of model training, validation sets are used to assess a model's performance on data it hasn't seen during training. This practice is vital for preventing overfitting — where a model performs well on training data but poorly on unseen data. For example, after each epoch of training, the model's accuracy or loss is evaluated on the validation set. This feedback loop allows for adjustments before final evaluation, ensuring the model is robust and performs well on new data.
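
The comparison described above can be sketched as follows. The validation losses are invented purely for illustration and do not come from any real training run; the helper names are hypothetical as well.

```python
from typing import Dict, List


def best_epoch(val_losses: List[float]) -> int:
    # 0-based index of the epoch with the lowest validation loss
    return min(range(len(val_losses)), key=lambda i: val_losses[i])


def pick_best_run(runs: Dict[str, List[float]]) -> str:
    # Compare runs by their best validation loss, not their final one
    return min(runs, key=lambda name: min(runs[name]))


# Invented per-epoch validation losses for two learning rates
runs = {
    "lr=5e-5": [2.10, 1.85, 1.80, 1.92],  # loss rises after epoch 3: a sign of overfitting
    "lr=3e-5": [2.20, 1.95, 1.82, 1.78],
}
winner = pick_best_run(runs)
print(winner, best_epoch(runs[winner]) + 1)  # lr=3e-5 4
```

Tracking the best epoch per run also tells you where to stop training: in the first run, continuing past epoch 3 only makes validation loss worse.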

Deployment and Ongoing Optimization

After the meticulous process of selecting, initializing, and fine-tuning the model, deploying it into a real-world environment marks a significant milestone. This phase is where the rubber meets the road, as the model begins interacting with actual data, whether it's automating customer service responses or generating content. For instance, deploying a fine-tuned model to enhance a customer service chatbot involves not just technical integration but also preparing the infrastructure to support real-time interactions, requiring a blend of skills from across data science and engineering teams.

However, the journey doesn't end with deployment. Ongoing optimization plays a crucial role in maintaining the model's relevance and performance. This continuous cycle of monitoring, evaluating, and updating the model ensures it adapts to new data, trends, and emerging needs. For example, an e-commerce recommendation system fine-tuned on past sales data will need regular updates to incorporate new product lines and changing consumer behavior to stay accurate and effective.

Lifecycle Management

Effective lifecycle management, incorporating monitoring tools and continuous learning, is essential. Utilizing platforms like TensorBoard or Weights & Biases allows teams to track performance metrics and identify when models might be drifting or underperforming. This proactive approach ensures models remain efficient and effective long after their initial deployment.
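
As a rough idea of what drift detection can look like, the sketch below flags a model when the rolling average of a quality metric falls noticeably below its baseline. The DriftMonitor class, the thresholds, and the scores are all hypothetical; dedicated monitoring platforms offer far richer tooling.

```python
from collections import deque
from statistics import mean


class DriftMonitor:
    """Flags drift when a rolling metric average drops below baseline - tolerance."""

    def __init__(self, baseline: float, tolerance: float, window: int = 5) -> None:
        self.baseline = baseline
        self.tolerance = tolerance
        self.scores = deque(maxlen=window)

    def record(self, score: float) -> bool:
        # Only raise an alert once the window is full, to avoid reacting to a single bad score
        self.scores.append(score)
        window_full = len(self.scores) == self.scores.maxlen
        return window_full and mean(self.scores) < self.baseline - self.tolerance


monitor = DriftMonitor(baseline=0.90, tolerance=0.05, window=3)
alerts = [monitor.record(s) for s in [0.91, 0.89, 0.88, 0.80, 0.78]]
print(alerts)  # [False, False, False, False, True]
```

Averaging over a window trades detection speed for fewer false alarms; a shorter window reacts faster but is more sensitive to noise.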

Continuous learning, where new data is periodically folded into the training set, ensures the model evolves alongside its operational domain. This adaptability is key in dynamic sectors like news aggregation or financial analysis, where staying abreast of the latest information is critical for maintaining performance and relevance.

Future of LLMs and the Role of ML Practitioners

The future of Large Language Models (LLMs) promises even more sophisticated and versatile applications across various sectors. As these models evolve, the expertise of machine learning (ML) practitioners becomes indispensable for several reasons:

  • Guiding Development: ML practitioners play a pivotal role in steering the development of LLMs towards more efficient, accurate, and context-aware models. For example, their insights can help refine models like GPT-4 for nuanced tasks such as legal analysis or personalized education, ensuring that the models can understand and generate highly specialized content.
  • Fine-Tuning for Specific Needs: Through fine-tuning, ML practitioners adapt general-purpose models to serve specific industry needs, such as creating a model tailored for medical diagnosis support by training it on vast medical literature and patient data. Their skill in adjusting hyperparameters and selecting relevant datasets ensures models deliver precise and reliable outputs.
  • Ethical Deployment: Ensuring LLMs are used responsibly and ethically is paramount. ML practitioners are at the forefront of addressing ethical considerations, such as bias mitigation and privacy concerns. For instance, they implement and oversee fairness and bias evaluation frameworks for models used in hiring to prevent discriminatory practices.

As LLMs continue to integrate into the fabric of industries, ML practitioners' role in harnessing their potential responsibly and innovatively will only grow, ensuring that these powerful tools benefit society while upholding ethical standards.


The process of fine-tuning LLMs on custom datasets is a nuanced journey that extends from the careful selection and preparation of data through to the ethical deployment and ongoing management of the model. This journey not only enhances the model's performance and precision but also underscores the critical role of machine learning practitioners in navigating the complexities of development, fine-tuning, and ethical considerations. As LLMs continue to evolve, their potential to revolutionize industries grows, promising more sophisticated applications and the need for continued innovation and responsible stewardship in the AI domain.
