Brief IA

Apple Silicon and MLX: A Revolution in Local Fine-Tuning

🔬 Research · Tom Levy

Key Takeaways
1. Apple Silicon now allows for fine-tuning language models locally, eliminating cloud costs thanks to MLX.
2. MLX, designed for Apple's unified architecture, optimizes memory usage for training on Mac.
3. Fine-tuning with MLX requires models in the safetensors format, excluding other common formats.
💡 Why it matters: This advancement makes model fine-tuning more accessible and cost-effective for Mac users, strengthening the Apple ecosystem.

Full Analysis

A New Era for Fine-Tuning Language Models on Mac

Fine-tuning language models has long been associated with high costs related to renting GPUs in the cloud. However, Mac users equipped with Apple Silicon chips can now customize open models with their own data, directly on their device, at no additional cost. This advancement is made possible by a framework specifically designed to leverage the unique hardware of these laptops.

In 2014, I transitioned from Windows and Dell machines to a Mac, and I have never regretted that choice. What began as a simple curiosity for a cleaner operating system has transformed into a genuine appreciation for Apple’s seamless integration of hardware and software. Today, this synergy manifests in unexpected ways, particularly with the ability to fine-tune language models directly on the device, without resorting to the cloud and without the data ever leaving the machine.

This capability is made possible by MLX, an open-source array library developed by Apple’s machine learning research team, along with its MLX LM package. The latter offers text generation and fine-tuning features for thousands of open models, all through a simplified command set. This tutorial guides you through the complete process: from installing the tools to preparing a dataset, training a LoRA adapter, reducing memory usage through quantization, and finally testing and deploying the fine-tuned model. By the end, you will have a fine-tuned model running on your own machine and a reproducible workflow for any dataset.

Why MLX is Ideal for Apple Silicon

Most local inference tools were initially developed for NVIDIA hardware before being adapted for Mac. MLX, on the other hand, was designed from the ground up for the unified memory architecture of Apple Silicon, where the CPU and GPU share a single memory pool.

This unique design eliminates the need to copy data between system memory and dedicated GPU memory. On a Mac with 16 GB of RAM, the model weights, optimizer state, and training batch coexist in the same space, making on-device fine-tuning not only possible but practical. The MLX API is heavily inspired by NumPy, adds automatic differentiation for training, and uses Metal to accelerate GPU processing while maintaining this shared memory view.

To get started, you will need an Apple Silicon Mac (M1 or newer), macOS Ventura 13.5 or later, and Python 3.10 or higher. Intel Macs are not supported, and any attempt to install on these machines will fail with a "no matching distribution" error.
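The requirements above can be checked before installing anything. The following is a minimal sketch using only the Python standard library; the function names are illustrative, not part of MLX, and the thresholds are simply the ones quoted in this article.

```python
import platform
import sys


def parse_version(text):
    """Turn '14.4.1' into (14, 4, 1); an empty string becomes ()."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def environment_problems(system, machine, macos_version, python_version):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if system != "Darwin":
        problems.append("not running macOS")
    if machine != "arm64":
        # An x86_64 value also appears when Python itself runs under Rosetta.
        problems.append("not an Apple Silicon (arm64) machine")
    if system == "Darwin" and macos_version < (13, 5):
        problems.append("macOS is older than 13.5")
    if python_version < (3, 10):
        problems.append("Python is older than 3.10")
    return problems


if __name__ == "__main__":
    found = environment_problems(
        platform.system(),
        platform.machine(),
        parse_version(platform.mac_ver()[0]),
        sys.version_info[:2],
    )
    print("OK" if not found else "\n".join(found))
```

Keeping the check as a pure function of its four inputs makes it easy to test on any machine, not only on a Mac.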

On a discrete GPU, training data is typically transferred between system memory and dedicated GPU memory. Apple Silicon maintains a shared pool, allowing a 16 GB Mac to fine-tune models locally.

Setting Up Your Environment

With this architecture in mind, the first step is to install the necessary tools. Start with the package and its training extensions, which integrate everything you need for fine-tuning.

pip install "mlx-lm[train]"

Verify that the installation is correct by performing a quick generation test on a small model.

mlx_lm.generate \
--model mlx-community/Mistral-7B-Instruct-v0.3-4bit \
--prompt "Explain LoRA in two sentences." \
--max-tokens 120

On the first run, a Mistral model quantized to 4 bits is downloaded from the MLX Community organization on Hugging Face, cached locally, and then used to generate a response. The mlx-community organization hosts thousands of pre-converted models, so you typically do not need to convert the weights yourself.

It is important to note that fine-tuning with MLX requires models in the safetensors format from Hugging Face. GGUF files, common in other local tools, are compatible for inference but not for training here. Supported architectures include Llama, Mistral, Qwen2, Phi, Gemma, and Mixtral, among others, meaning that most popular open models are available right from the start.

Preparing Your Dataset

Once the environment is set up, the next step is to format your data so that it can be used by the trainer. MLX LM reads training data from a folder containing three files: train.jsonl, valid.jsonl, and an optional test.jsonl. Each line of these files contains a JSON example. The training file is required, the validation file allows the trainer to report validation loss during execution, and the test file evaluates the model after training is complete.

Three data formats are supported: chat, completions, and text. The chat format is the most robust default. Each line stores a list of messages labeled by role, which lets MLX LM apply the model's own chat template, so your data aligns with how the model was trained to handle conversations.

{"messages": [{"role": "user", "content": "What is LoRA?"}, {"role": "assistant", "content": "An efficient way to fine-tune a model."}]}

For simple input-output pairs, the completions format is simpler and works well for instruction-type tasks.

{"prompt": "Summarize: The market rose sharply today.", "completion": "Markets gained."}
{"prompt": "Translate to French: good morning", "completion": "bonjour"}
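Before training, it can help to confirm that every line of a file parses as JSON and follows one of the three formats. The sketch below is a hypothetical pre-flight check, not part of MLX LM. It assumes that text records use a single "text" key and that a file should not mix formats; both are assumptions of this example rather than statements from the article.

```python
import json


def detect_format(record):
    """Classify one decoded JSONL record as 'chat', 'completions', or 'text'."""
    if isinstance(record.get("messages"), list):
        for message in record["messages"]:
            if not isinstance(message, dict) or not {"role", "content"} <= message.keys():
                raise ValueError("each message needs 'role' and 'content'")
        return "chat"
    if "prompt" in record and "completion" in record:
        return "completions"
    if "text" in record:  # assumed key name for the text format
        return "text"
    raise ValueError(f"unrecognized record keys: {sorted(record)}")


def check_jsonl(lines):
    """Return the single format shared by all non-empty lines, or raise ValueError."""
    formats = {detect_format(json.loads(line)) for line in lines if line.strip()}
    if len(formats) != 1:
        raise ValueError(f"expected exactly one format, found {sorted(formats)}")
    return formats.pop()
```

A typical use would be `check_jsonl(open("train.jsonl", encoding="utf-8"))`, run once per file before starting a job.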

By default, the trainer calculates the loss on the entire example, meaning the model expends effort reproducing both the prompt and the response. Passing --mask-prompt tells it to calculate the loss only on the completion, so the training focuses on the response you are actually interested in. This generally produces a model that follows instructions more reliably, and it works with both chat and completions formats. For chat data, the last message in the list is considered the completion.
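The effect of prompt masking can be illustrated with a toy calculation. This is a simplified sketch of the idea, not MLX LM's actual implementation: the per-token loss values are invented, and the prompt is assumed to occupy the first `prompt_length` positions.

```python
def mean_loss(token_losses, prompt_length, mask_prompt):
    """Average per-token losses, optionally skipping the prompt tokens."""
    start = prompt_length if mask_prompt else 0
    counted = token_losses[start:]
    if not counted:
        raise ValueError("no tokens left to score")
    return sum(counted) / len(counted)


# Hypothetical per-token losses: three prompt tokens, then two completion tokens.
losses = [0.2, 0.1, 0.3, 2.0, 1.0]
full = mean_loss(losses, prompt_length=3, mask_prompt=False)
masked = mean_loss(losses, prompt_length=3, mask_prompt=True)
print(f"whole example: {full:.2f}, completion only: {masked:.2f}")
```

In this toy case the easy prompt tokens pull the unmasked average down, hiding how poorly the completion is predicted. With masking, the training signal reflects only the completion.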

Ensure that each example fits on a single line without internal line breaks, as the reader treats each line as a distinct record. Split your data so that about 80% ends up in train.jsonl and 10 to 20% in valid.jsonl. About 200 to 500 examples is a reasonable minimum to modify a model's behavior (much less tends to overfit and memorize rather than generalize).
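One way to produce the two files with a roughly 80/20 split is sketched below. The helper name, the fixed seed, and the default fraction are choices made for this example. It relies on `json.dumps`, which escapes newlines inside strings so that each record stays on one physical line.

```python
import json
import random
from pathlib import Path


def write_split(examples, out_dir, valid_fraction=0.2, seed=0):
    """Shuffle examples and write train.jsonl and valid.jsonl into out_dir."""
    if not 0 < valid_fraction < 1:
        raise ValueError("valid_fraction must be between 0 and 1")
    shuffled = list(examples)
    if len(shuffled) < 2:
        raise ValueError("need at least two examples to split")
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    n_valid = max(1, round(len(shuffled) * valid_fraction))
    splits = {"valid.jsonl": shuffled[:n_valid], "train.jsonl": shuffled[n_valid:]}
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, rows in splits.items():
        with open(out / name, "w", encoding="utf-8") as handle:
            for row in rows:
                # One JSON object per line; embedded newlines become "\n" escapes.
                handle.write(json.dumps(row) + "\n")
    return len(splits["train.jsonl"]), len(splits["valid.jsonl"])
```

Calling `write_split(examples, "./data")` returns the number of training and validation records written, which is a quick sanity check on the split.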

Training Your First LoRA Adapter

With your data ready, the next step is training. Rather than updating every weight in the model, Low-Rank Adaptation (LoRA) freezes the original weights and trains small adapter matrices alongside them. This reduces memory and storage requirements to a fraction of full fine-tuning while retaining most of the quality. The method comes from the LoRA paper by Hu and colleagues.

LoRA keeps the large pre-trained weights frozen and only trains the small matrices A and B. Since only these two adapters receive updates, memory and storage remain low.
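A short calculation shows why training only A and B is so much cheaper. The sketch below follows the standard LoRA formulation, in which a frozen weight matrix of shape d_out x d_in is paired with A of shape rank x d_in and B of shape d_out x rank. The 4096 dimensions and rank 8 are illustrative values, not settings taken from this article.

```python
def lora_parameter_counts(d_out, d_in, rank):
    """Return (full, lora) trainable parameter counts for one weight matrix."""
    full = d_out * d_in           # every entry of the frozen matrix W
    lora = rank * (d_in + d_out)  # A is rank x d_in, B is d_out x rank
    return full, lora


full, lora = lora_parameter_counts(d_out=4096, d_in=4096, rank=8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```

For these example values the adapter holds well under 1% of the parameters of the matrix it adapts, which is where the memory and storage savings come from.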

Start training with a command, pointing to a model and your data folder.

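# ./data is a placeholder for the folder that holds train.jsonl and valid.jsonl
mlx_lm.lora \
--model mlx-community/Mistral-7B-Instruct-v0.3-4bit \
--train \
--data ./data \
--batch-size 1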

As it runs, MLX LM displays training loss, validation loss, tokens processed, and iterations per second. The adapter weights are saved in a default adapter folder. Key flags to know: --fine-tune-type accepts lora (default), dora, or full; --num-layers defines how many transformer layers receive adapters (default: 16); and --iters controls the duration of training.

The example intentionally sets --batch-size 1 to keep memory usage as low as possible. This avoids crashes on 16 GB machines. If you have 64 GB or more, increasing it to 2 or 4 reduces total training time. When memory is limited but you want the effect of a larger batch, --grad-accumulation-steps increases the effective batch size without raising memory usage.
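The idea behind gradient accumulation can be shown on a one-parameter model. This is a conceptual sketch in plain Python, not MLX LM's code, and it assumes the accumulated gradients are averaged over equal-sized micro-batches. Under that assumption, the result matches the gradient of one large batch, while only one micro-batch needs to be held in memory at a time.

```python
def grad_mse(w, batch):
    """Gradient of mean squared error for the one-parameter model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, batch, micro_batch_size):
    """Average the gradients of equal-sized micro-batches."""
    chunks = [batch[i:i + micro_batch_size]
              for i in range(0, len(batch), micro_batch_size)]
    return sum(grad_mse(w, chunk) for chunk in chunks) / len(chunks)


# Toy (x, y) pairs, invented for the example.
data = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.0)]
one_big_batch = grad_mse(0.5, data)
four_small_batches = accumulated_grad(0.5, data, micro_batch_size=1)
print(one_big_batch, four_small_batches)
```

Here a batch size of 1 accumulated over 4 steps gives an effective batch size of 4, the product of the two settings.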

If you prefer live graphs to terminal output, add --report-to wandb to log metrics to Weights & Biases. If you encounter memory issues, reduce --num-layers to 8 or 4, or add --grad-checkpoint to trade extra computation for lower memory. These two flags are usually sufficient to make a job fit that would otherwise run out of memory.

Choosing a Base Model and Adapter Parameters

Building on the training mechanics above, two early decisions shape the rest of your run: which model to choose as a starting point and how much to adapt it. For a first project, an 8B parameter model quantized to 4 bits is a good compromise. Once the workflow feels comfortable, you can move to 13B or 14B models, which require 14 to 18 GB of working memory and run comfortably on a 32 GB machine.

The number of trained layers and the rank of the adapter together control capacity. More layers and a higher rank give the adapter more room to learn, at the cost of memory and time. A common starting point uses 16 layers with a moderate rank, then adjusts based on validation loss. If training loss decreases while validation loss increases, the adapter is memorizing your examples.
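The divergence described above can be detected automatically from the reported losses. The helper below is a hypothetical sketch: the function name and the `patience` parameter are choices made for this example, and the loss lists are assumed to be read off the training log at matching iterations.

```python
def overfitting_started(train_losses, valid_losses, patience=2):
    """Return the index where validation loss starts rising for `patience`
    consecutive reports while training loss keeps falling, else None."""
    if len(train_losses) != len(valid_losses):
        raise ValueError("loss histories must have the same length")
    streak = 0
    for i in range(1, len(valid_losses)):
        diverging = (valid_losses[i] > valid_losses[i - 1]
                     and train_losses[i] < train_losses[i - 1])
        streak = streak + 1 if diverging else 0
        if streak == patience:
            return i - patience + 1
    return None


# Invented loss histories for illustration.
train = [2.0, 1.5, 1.2, 1.0, 0.8]
valid = [2.1, 1.7, 1.6, 1.65, 1.7]
print(overfitting_started(train, valid))
```

A non-None result suggests stopping earlier, using fewer layers or a lower rank, or adding more varied examples.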

The learning rate is also important. Values in the range of 1e-5 to 5e-5 work for most LoRA runs. Too high, and training becomes unstable; too low, and the model barely moves. Change one parameter at a time so you can attribute any improvement to a specific choice.

Reducing Memory Usage Through Quantization

Note that the name of the base model above already ends in 4bit. Training a LoRA adapter on a quantized model is what people call QLoRA, described in the QLoRA paper. Since quantization is built into MLX, the same mlx_lm.lora command trains adapters directly on quantized weights without additional configuration.

The benefit is tangible. A 7B model in 4 bits reduces weight memory by about 3.5 times compared to full precision, bringing a 7B fine-tuning comfortably within 8 GB of working memory. On a 16 GB MacBook, this leaves ample room for the operating system and your training batch.
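The figure of roughly 3.5 times can be reproduced with back-of-the-envelope arithmetic. The sketch below rests on assumptions not stated in this article: a 16-bit baseline, and 4-bit weights stored in groups of 64 that each carry a 16-bit scale and a 16-bit bias. It counts weight storage only, not activations, adapter weights, or optimizer state.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate weight storage in gigabytes (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def quantized_bits_per_weight(bits=4, group_size=64, overhead_bits=32):
    """Effective bits per weight when each group also stores a scale and bias."""
    return bits + overhead_bits / group_size


baseline = weight_memory_gb(7e9, 16)                       # assumed 16-bit baseline
q4 = weight_memory_gb(7e9, quantized_bits_per_weight())    # assumed group size 64
print(f"16-bit: {baseline:.1f} GB, 4-bit: {q4:.1f} GB, ratio: {baseline / q4:.2f}x")
```

Under these assumptions the quantized weights take about 4 GB instead of 14 GB, which is consistent with a 7B fine-tuning job fitting within 8 GB of working memory.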

If you prefer to quantize a full precision model yourself before training, the convert command takes care of that.

mlx_lm.convert \
--hf-path mistralai/Mistral-7B-Instruct-v0.3 \
--mlx-path ./mistral-4bit \
--quantize \
--q-bits 4

This writes a 4-bit version to a local folder that you then pass to --model.

Testing and Generating with Your Adapter

With training complete, it’s time to see how well the adapter has learned. Evaluate it against your held-out test set to get a test loss you can track across experiments.

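# ./data is a placeholder for your data folder, which must contain test.jsonl
mlx_lm.lora \
--model mlx-community/Mistral-7B-Instruct-v0.3-4bit \
--adapter-path ./adapters \
--data ./data \
--test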

To see the model respond, pass the same adapter path to the generation command. MLX LM loads the base model and applies your adapter on top.

mlx_lm.generate \
--model mlx-community/Mistral-7B-Instruct-v0.3-4bit \
--adapter-path ./adapters \
--prompt "Summarize: Our quarterly revenue grew twelve percent."

Run the same prompt without the adapter to compare. If your dataset matches the target task well, the adapted model should give noticeably more relevant responses.
