
Preparing Data for Fine-Tuning Large Language Models: A Comprehensive Guide


Introduction

Fine-tuning large language models (LLMs) has become an essential practice for tailoring these powerful AI tools to specific domains and tasks. However, the process of preparing data for fine-tuning can be complex and time-consuming. This comprehensive guide walks you through the steps of preparing data for fine-tuning LLMs using two key tools: Marker PDF for text extraction and Unsloth for the actual fine-tuning process.

Understanding the Data Preparation Process

The main concept behind preparing data for fine-tuning LLMs involves several key steps:

  1. Extracting text from source files (e.g., PDF documents)
  2. Formatting the extracted text into a structure suitable for LLMs
  3. Converting the formatted text into a dataset ready for fine-tuning
  4. Using the prepared dataset to fine-tune the chosen LLM

Let's dive deeper into each of these steps and explore how to implement them using Python and specialized libraries.

Text Extraction with Marker PDF

Installing Marker PDF

Marker PDF is a Python tool that extracts text from PDF documents and converts it to Markdown. To install it, use pip:

pip install marker-pdf

If you're using a GPU and want CUDA support, you may need to install PyTorch separately with CUDA enabled. Visit the PyTorch website to get the appropriate installation command for your system.
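For example, the CUDA 12.1 builds (the same index URL used in the environment setup later in this guide) can be installed with:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121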

Using Marker PDF for Single File Conversion

To convert a single PDF file, Marker PDF installs a marker_single command-line entry point. For example (the exact arguments can vary between releases, so check the project's README for your installed version):

marker_single path/to/input.pdf path/to/output_folder

This extracts the text from the input PDF and saves it as a Markdown file in the output folder.

Converting Multiple PDF Files

For batch processing of multiple PDF files, there is a companion marker command:

marker path/to/input_folder path/to/output_folder

This converts all PDF files in the input folder and saves the extracted text as Markdown files in the output folder.

Formatting Extracted Text

After extracting text from PDF files, you'll often need to clean and format the data to make it suitable for LLM fine-tuning. This process can vary depending on the structure of your documents and the specific requirements of your project.

Here's an example of how you might clean and format the extracted text using Python:

import re
import json

def clean_text(text):
    # Remove document-specific boilerplate phrases (adjust these patterns for your own source material)
    text = re.sub(r'not for assessment|not assessment', '', text, flags=re.IGNORECASE)

    # Split the text into sections on blank lines
    sections = re.split(r'\n\s*\n', text)

    return sections

def format_data(sections):
    formatted_data = []
    current_category = ""
    current_competency = ""

    for section in sections:
        if section.startswith("## "):  # Category heading
            current_category = section.strip("# ")
        elif section.startswith("### "):  # Competency heading
            current_competency = section.strip("# ")
        elif section.startswith("#### "):  # Level heading followed by its content
            if "\n" not in section:
                continue  # skip a level heading with no content under it
            level, content = section.split("\n", 1)
            level = level.strip("# ")
            formatted_data.append({
                "category": current_category,
                "competency": current_competency,
                "level": level,
                "content": content.strip()
            })

    return formatted_data

# Read the Markdown file produced by Marker PDF
with open("path/to/extracted_text.md", "r", encoding="utf-8") as f:
    text = f.read()

# Clean and format the text
sections = clean_text(text)
formatted_data = format_data(sections)

# Save the structured entries as JSON
with open("path/to/formatted_data.json", "w", encoding="utf-8") as f:
    json.dump(formatted_data, f, indent=2)

This script reads the extracted text, cleans it by removing unwanted patterns, and then formats it into a structured JSON format based on the document's hierarchy (categories, competencies, and levels).
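For illustration, with placeholder values, a single entry in formatted_data.json would look something like this:

[
  {
    "category": "Example Category",
    "competency": "Example Competency",
    "level": "1",
    "content": "The extracted text describing level 1 of this competency..."
  }
]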

Converting Formatted Data to LLM-Ready Dataset

Once you have your data in a structured format, you'll need to convert it into a format suitable for fine-tuning LLMs. Typically, this involves creating a dataset of input-output pairs or conversations.

Here's an example of how you might convert the formatted data into a conversational dataset:

import json

def create_conversation(entry):
    # Build a multi-turn conversation from one structured entry
    conversation = [
        {"human": f"I want to write about the {entry['competency']} competency."},
        {"assistant": f"Certainly! I can help you with that. Which level are you interested in writing about for the {entry['competency']} competency?"},
        {"human": "How many levels are there for this competency?"},
        {"assistant": f"Based on the information I have, the highest level for the {entry['competency']} competency is {entry['level']}."},
        {"human": f"Can you help me write about level {entry['level']}?"},
        {"assistant": f"Of course! Here's a proposed response for level {entry['level']} of the {entry['competency']} competency:\n\n{entry['content']}"}
    ]
    return conversation

# Load the formatted data
with open("path/to/formatted_data.json", "r", encoding="utf-8") as f:
    formatted_data = json.load(f)

# Create the conversational dataset: one record per conversation
dataset = []
for entry in formatted_data:
    conversation = create_conversation(entry)
    dataset.append({"conversations": conversation})

# Save as JSONL, one conversation per line
with open("path/to/conversation_dataset.jsonl", "w", encoding="utf-8") as f:
    for item in dataset:
        f.write(json.dumps(item) + "\n")

This script converts each entry in the formatted data into a conversation-style example and saves the result in JSONL format (JSON Lines), with one conversation per line, a common format for LLM fine-tuning datasets.
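Each line of the resulting file is one JSON object holding a full conversation. With placeholder values, a (truncated) line looks like:

{"conversations": [{"human": "I want to write about the Example Competency competency."}, {"assistant": "Certainly! I can help you with that. ..."}, {"human": "How many levels are there for this competency?"}, ...]}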

Fine-Tuning with Unsloth

Unsloth is a library for fast, memory-efficient fine-tuning of large language models. It supports various model architectures and offers parameter-efficient fine-tuning techniques such as LoRA and QLoRA.

Setting Up the Environment

Before using Unsloth, you'll need to set up your environment:

  1. Install Windows Subsystem for Linux (WSL) if you're on Windows.
  2. Install Anaconda for managing Python environments.
  3. Create a new conda environment for Unsloth:

conda create -n unsloth python=3.10
conda activate unsloth

  4. Install the necessary packages:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install transformers datasets accelerate bitsandbytes wandb
pip install git+https://github.com/huggingface/peft.git
pip install git+https://github.com/huggingface/accelerate.git
pip install git+https://github.com/huggingface/transformers.git
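Before starting a long fine-tuning run, it's also worth confirming that PyTorch can actually see your GPU:

import torch

# Should print True, followed by your GPU's name, if the CUDA build of PyTorch is installed correctly
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else "No CUDA device found")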

Preparing the Dataset

To use your prepared dataset with Unsloth, you'll need to load it using the Hugging Face datasets library:

from datasets import load_dataset

dataset = load_dataset('json', data_files='path/to/conversation_dataset.jsonl', split='train')
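The training script below tokenizes a text column, so if your JSONL stores whole conversations as produced earlier, you may first want to flatten each conversation into a single training string. Here's a minimal sketch, assuming each record has a conversations list of single-key human/assistant turns (the to_text name and the "role: message" rendering are just one option; chat templates are another):

def to_text(example):
    # Render each turn as "role: message" and join the turns with newlines
    parts = []
    for turn in example["conversations"]:
        for role, message in turn.items():
            parts.append(f"{role}: {message}")
    return {"text": "\n".join(parts)}

dataset = dataset.map(to_text)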

Fine-Tuning Process

Here's a basic script that fine-tunes an Unsloth-hosted Mistral model using the Transformers and PEFT libraries (a sketch of Unsloth's own FastLanguageModel API follows further below):

import torch
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
    TrainingArguments,
    Trainer,
)
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model
from datasets import load_dataset

# Load the model and tokenizer
model_name = "unsloth/mistral-7b-instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Mistral has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

# Prepare the model for k-bit training (most useful when the model is loaded in 4-bit or 8-bit)
model = prepare_model_for_kbit_training(model)

# Configure LoRA
config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)

# Load and preprocess the dataset
# (this assumes a "text" column, e.g. produced by the conversation-flattening
# step shown in the previous section)
dataset = load_dataset('json', data_files='path/to/conversation_dataset.jsonl', split='train')

def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512)

tokenized_dataset = dataset.map(preprocess_function, batched=True)

# Set up training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    save_steps=1000,
    save_total_limit=2,
)

# Create a Trainer with a causal language modeling collator so labels are built automatically
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

# Start training
trainer.train()

# Save the fine-tuned LoRA adapter
model.save_pretrained("./fine_tuned_model")

This script loads the model and tokenizer, prepares the model for k-bit training, configures LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning, tokenizes the dataset, and then trains the model using the Trainer class from Transformers with a causal language modeling data collator.
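For completeness, Unsloth also exposes its own loading and LoRA helpers, which are typically faster and more memory-efficient than the plain Transformers + PEFT route above. The following is a rough sketch of that alternative, assuming the unsloth and trl packages are installed; exact argument names can vary between releases, so check Unsloth's documentation:

from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Load the base model in 4-bit with Unsloth's optimized loader
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b-instruct-v0.1",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters via Unsloth's helper
model = FastLanguageModel.get_peft_model(
    model,
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Load the JSONL dataset (remember to flatten conversations into a "text" column first, as shown earlier)
dataset = load_dataset("json", data_files="path/to/conversation_dataset.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(output_dir="./results", num_train_epochs=3, per_device_train_batch_size=4),
)
trainer.train()

Loading in 4-bit with LoRA adapters keeps memory usage low enough that a 7B model can often be fine-tuned on a single consumer GPU.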

Testing the Fine-Tuned Model

After fine-tuning, it's important to test your model to ensure it has improved on your specific task. Here's a simple way to test the model:

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the fine-tuned model and tokenizer
# (if only the LoRA adapter was saved to this path, recent transformers releases with
# peft installed will load the base model and attach the adapter automatically; make
# sure the tokenizer has also been saved there, as shown in the next section)
model_path = "./fine_tuned_model"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")

def generate_text(prompt, max_new_tokens=100):
    # Tokenize the prompt, move it to the model's device, and generate a continuation
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Test the model with a sample prompt
prompt = "I want to write about the ethics competency. Can you help me?"
response = generate_text(prompt)
print(response)

This script loads the fine-tuned model and tokenizer, defines a function to generate text based on a prompt, and then tests the model with a sample prompt.
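If your installed version of transformers does not attach the saved adapter automatically, you can load the base model and apply the LoRA adapter explicitly with PEFT. A minimal sketch, reusing the base model name from the training script:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

base_model_name = "unsloth/mistral-7b-instruct-v0.1"
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "./fine_tuned_model")  # attach the LoRA adapter
tokenizer = AutoTokenizer.from_pretrained(base_model_name)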

Saving and Sharing the Fine-Tuned Model

Once you're satisfied with your fine-tuned model, you can save it for future use or share it with others. Here's how you can save the model locally and push it to the Hugging Face Hub:

# Save locally
model.save_pretrained("./fine_tuned_model")
tokenizer.save_pretrained("./fine_tuned_model")

# Push to Hugging Face Hub
from huggingface_hub import HfApi

api = HfApi()

# Create the repository if it doesn't exist yet, then upload the saved files
api.create_repo(repo_id="your-username/your-model-name", repo_type="model", exist_ok=True)
api.upload_folder(
    folder_path="./fine_tuned_model",
    repo_id="your-username/your-model-name",
    repo_type="model"
)

Remember to replace "your-username/your-model-name" with your actual Hugging Face username and the desired name for your model repository.
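Uploading to the Hub also requires authentication. One option is to log in from Python before running the upload (or use the huggingface-cli login command in a terminal):

from huggingface_hub import login

login()  # prompts for a Hugging Face access token with write access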

Conclusion

Preparing data for fine-tuning large language models is a crucial step in adapting these powerful AI tools to specific domains and tasks. By following this comprehensive guide, you've learned how to extract text from PDF documents using Marker PDF, format the extracted data, create a suitable dataset for LLM fine-tuning, and use Unsloth to fine-tune a model on your custom dataset.

Remember that the specific steps and code may need to be adjusted based on your particular use case and the structure of your data. Always test your fine-tuned model thoroughly to ensure it meets your requirements and performs well on your specific tasks.

As the field of natural language processing continues to evolve rapidly, stay updated with the latest tools and techniques for data preparation and model fine-tuning. This will help you make the most of large language models in your projects and applications.

Article created from: https://youtu.be/v2GniOB2D_U?si=D0eUB1dwcxppB76s
