
Fine-Tuning Large Language Models on Mac M1 with MLX



Introduction

With the rise of open-source models and efficient fine-tuning methods, building custom machine learning solutions has become more accessible than ever. In this article, we'll explore how to fine-tune a large language model (LLM) locally on a Mac M1 using Apple's MLX library.

What is MLX?

MLX is a Python library developed by Apple's machine learning research team for efficiently running matrix operations on Apple Silicon. It's inspired by frameworks like PyTorch, JAX, and ArrayFire, but with some notable differences:

  • MLX leverages the unified memory model of M1 chips, eliminating the need to manage separate RAM and VRAM.
  • It allows for fine-tuning large language models on machines with limited memory, like a Mac Mini M1 with only 16GB of RAM.

While MLX is a relatively low-level framework without the high-level abstractions for loading and training models that libraries like Hugging Face Transformers provide, it offers example implementations that can be easily adapted to a range of use cases.
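
To get a feel for the API, here is a minimal, self-contained MLX snippet (separate from the fine-tuning workflow) showing the NumPy-style array operations the library is built around:

import mlx.core as mx

# Arrays live in unified memory, so there is no explicit CPU-to-GPU transfer step.
a = mx.random.normal((1024, 1024))
b = mx.random.normal((1024, 1024))

c = a @ b   # operations are recorded lazily...
mx.eval(c)  # ...and computed when evaluated
print(c.shape)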

Setting Up the Environment

To get started with MLX and fine-tuning, follow these steps:

  1. Clone the repository containing the example code:

    git clone https://github.com/yourusername/your-repo.git
    cd your-repo/llms/qlora_mlx
    
  2. Create and activate a virtual environment:

    python -m venv mlx_env
    source mlx_env/bin/activate
    
  3. Install the required libraries:

    pip install -r requirements.txt
    

Important Notes for Installation

  • You need an M-series chip (M1, M2, etc.) to use MLX.
  • Use a native Python version >= 3.8.
  • Ensure you're running macOS 13.5 or later (macOS 14 recommended).
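
If you're not sure whether your interpreter is a native build, a quick check is the following (it prints arm64 for a native Apple Silicon Python and x86_64 if the interpreter is running under Rosetta):

python -c "import platform; print(platform.python_version(), platform.machine())"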

Preparing the Model

MLX provides a convert.py script that can convert models from the Hugging Face Hub into the MLX format and optionally quantize them. For this example, we'll use a pre-converted and quantized version of Mistral 7B Instruct v0.2.

If you need to convert a model yourself, you can call the script like this (shown as a small Python snippet that shells out to convert.py, equivalent to running the same command in a terminal):

import subprocess

convert_command = [
    "python", "scripts/convert.py",
    "--hf-path", "mistralai/Mistral-7B-Instruct-v0.2",  # source model on the Hugging Face Hub
    "--mlx-path", "mistral-7b-instruct-v0.2-mlx",       # output directory for the converted weights
    "--quantize"                                        # quantize the weights during conversion
]
subprocess.run(convert_command, check=True)

Fine-Tuning Process

Data Preparation

Before fine-tuning, you need to prepare your dataset. For this example, we'll use a dataset of YouTube comments and responses. The data should be in JSONL format, with separate files for training, testing, and validation.

Each example in the JSONL file should have this structure:

{"text": "[INST] <<SYS>>\nYou are Sha GPT, an AI assistant created by Sha. Your responses should be brief and to the point, similar to how Sha would respond to YouTube comments. Always end your response with 'Sha GPT'.\n<</SYS>>\n\nPlease respond to the following comment:\n{comment}\n[/INST]\n{response}"}

Running Fine-Tuning

To start the fine-tuning process, use the lora.py script with appropriate parameters:

import subprocess

fine_tune_command = [
    "python", "scripts/lora.py",
    "--model", "mlx-community/Mistral-7B-Instruct-v0.2-4bit-mlx",
    "--train",
    "--iters", "100",
    "--steps-per-eval", "10",
    "--val-batches", "-1",  # -1 evaluates on the entire validation set
    "--lr", "1e-5",
    "--lora-layers", "16",
    "--test"
]
subprocess.run(fine_tune_command, check=True)

This command will:

  • Use the specified quantized model
  • Run 100 training iterations
  • Evaluate every 10 steps
  • Use all validation examples for evaluation
  • Set the learning rate to 1e-5
  • Apply LoRA to 16 layers
  • Compute the test loss at the end of training

Monitoring Training

During training, you'll see output showing the training loss, validation loss, and other metrics. The process may take 15-20 minutes, depending on your machine's specifications.

Running Inference with the Fine-Tuned Model

After training, you'll find an adapters.npz file in your repository. This file contains the LoRA weights learned during training. You can now use these adapters to run inference with your fine-tuned model.

Here's an example of how to run inference:

from mlx_lm import load, generate

# Load the quantized base model together with the LoRA adapters learned during training
model, tokenizer = load(
    "mlx-community/Mistral-7B-Instruct-v0.2-4bit-mlx",
    adapter_file="adapters.npz",
)

# Build the prompt with the same template used in the training data, minus the response
comment = "Great content, thank you!"
prompt = f"[INST] <<SYS>>\nYou are Sha GPT, an AI assistant created by Sha. Your responses should be brief and to the point, similar to how Sha would respond to YouTube comments. Always end your response with 'Sha GPT'.\n<</SYS>>\n\nPlease respond to the following comment:\n{comment}\n[/INST]\n"

response = generate(model, tokenizer, prompt, max_tokens=140, verbose=True)
print(response)

Fine-Tuning Results and Analysis

After fine-tuning, you should notice that the model generates responses more aligned with your target style. In this case, the fine-tuned model produces shorter, more concise responses that better mimic Sha's communication style.

For example:

  • Before fine-tuning: "Thank you for your kind words! I'm glad you found the content helpful and enjoyable. If you have any specific questions or topics you'd like me to cover in more detail, please feel free to ask."
  • After fine-tuning: "Glad you enjoyed it! Sha GPT"

The fine-tuned model demonstrates a better understanding of the desired response style, producing briefer and more casual responses.

Challenges and Considerations

Hyperparameter Tuning

Fine-tuning machine learning models often requires experimenting with different hyperparameters. In this example, adjusting the rank of the LoRA adapters proved crucial for improving training performance.
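
To make the role of the rank concrete: LoRA adds a low-rank update B·A to each adapted weight matrix, so the number of trainable parameters per adapted layer grows linearly with the rank. A toy calculation (the dimensions here are illustrative, not Mistral's actual shapes):

d, k, r = 4096, 4096, 4          # hypothetical weight matrix shape and LoRA rank
lora_params = d * r + r * k      # parameters in B (d x r) plus A (r x k)
full_params = d * k              # parameters in the full weight matrix
print(lora_params, full_params)  # 32768 vs 16777216, i.e. about 0.2% of the full matrix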

The lora.py script doesn't expose the rank as a command-line argument, so you may need to edit the script directly at the point where the LoRA layers are constructed. Lowering the rank from the default of 8 to 4 significantly improved results here. As a rough illustration of the kind of change involved (the exact helper name and signature depend on the version of the MLX example code you're using):

# Inside lora.py, where attention projections are wrapped with LoRA (q_proj shown; v_proj is analogous)
layer.self_attn.q_proj = LoRALinear.from_linear(layer.self_attn.q_proj, rank=4)  # default rank is 8

This aligns with findings from the LoRA paper, which suggests that ranks 4 and 8 often provide a good balance between performance and computational efficiency.

Memory Management

While MLX is designed to work efficiently with Apple Silicon's unified memory, you may still encounter memory constraints when fine-tuning large models. To optimize performance:

  1. Close unnecessary applications during fine-tuning.
  2. Monitor memory usage with Activity Monitor (or programmatically, as sketched after this list).
  3. Experiment with batch sizes and other parameters that affect memory consumption.
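
If you want to check memory from code, MLX exposes counters for its Metal allocator. A minimal sketch, assuming the mx.metal memory helpers are available in your installed MLX version (their names and location have shifted between releases):

import mlx.core as mx

# Report MLX's Metal memory usage in gigabytes; availability of these helpers
# depends on the installed MLX version.
print(f"active: {mx.metal.get_active_memory() / 1e9:.2f} GB")
print(f"peak:   {mx.metal.get_peak_memory() / 1e9:.2f} GB")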

Conclusion

Fine-tuning large language models on Mac M1 hardware is now possible thanks to libraries like MLX. This opens up new possibilities for developers and researchers who want to create custom AI solutions without relying on cloud services or expensive GPU setups.

By following the steps outlined in this guide, you can fine-tune models to better suit your specific use cases, whether it's generating responses in a particular style, adapting to domain-specific tasks, or improving performance on targeted datasets.

As the field of AI and machine learning continues to evolve, tools like MLX will play a crucial role in democratizing access to advanced AI capabilities, allowing more developers to experiment with and deploy custom language models on consumer-grade hardware.

By leveraging these tools and techniques, you can create powerful, customized AI solutions right on your Mac M1 machine, opening up new possibilities for innovation and experimentation in the field of natural language processing.

Article created from: https://www.youtube.com/watch?v=3PIqhdRzhxE&t=1s
