LoRA: Is This Fine-Tuning Technique Really Unbeatable?

Key Takeaways

1. Parameter-Efficient Fine-Tuning (PEFT) makes it possible to fine-tune models with less memory, including quantized models.

2. LoRA, a PEFT method, overwhelmingly dominates the landscape: 98.4% of the PEFT model cards sampled on the Hugging Face Hub mention it.

3. Despite LoRA's popularity, benchmarks show that other PEFT techniques can outperform it on certain criteria.

💡 Why it matters: An over-reliance on LoRA could stifle innovation by overlooking potentially more effective alternatives.

When Should You Use PEFT?

Developers can choose from a vast array of open machine learning models. However, these models do not always meet expectations for specific use cases. Prompting, that is, giving the model precise instructions, can improve performance, but it is often not enough. Rather than creating an entirely new model, a more efficient approach is to fine-tune an existing one.

Fine-tuning, while powerful, is notoriously memory-intensive. It often requires several times the memory of the model itself, because gradients and optimizer states must be held alongside the weights, which can be prohibitive. Quantization reduces a model's memory footprint, but it has a limitation: quantized models cannot be fine-tuned directly. To circumvent these issues, a family of techniques has been developed, grouped under the term Parameter-Efficient Fine-Tuning (PEFT).
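To make the memory cost of full fine-tuning concrete, here is a rough back-of-the-envelope estimate. It is only a sketch under stated assumptions: fp32 weights, fp32 gradients, an Adam-style optimizer keeping two fp32 states per parameter, and activations ignored entirely. The 7-billion-parameter model size and the function name are illustrative choices, not figures from the benchmarks discussed later.

```python
def full_finetune_memory_gb(num_params: int,
                            bytes_per_weight: int = 4,
                            bytes_per_grad: int = 4,
                            optimizer_states: int = 2,
                            bytes_per_state: int = 4) -> float:
    """Rough lower bound on training memory in GB (10**9 bytes), ignoring activations."""
    bytes_per_param = (bytes_per_weight
                       + bytes_per_grad
                       + optimizer_states * bytes_per_state)
    return num_params * bytes_per_param / 1e9


# Hypothetical model size, chosen only for illustration.
num_params = 7_000_000_000

weights_only_gb = num_params * 4 / 1e9          # fp32 weights alone
training_gb = full_finetune_memory_gb(num_params)

print(f"weights only:     {weights_only_gb:.0f} GB")
print(f"full fine-tuning: {training_gb:.0f} GB "
      f"({training_gb / weights_only_gb:.0f}x the weights)")
```

Under these assumptions, training needs about four times the memory of the weights alone, before activations are even counted. Mixed precision and other optimizers change the exact factor, but not the overall picture.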

PEFT allows for fine-tuning a model using only a fraction of the memory typically required. It also offers the ability to fine-tune quantized models. Among its other advantages are a significant reduction in checkpoint size, improved resistance to catastrophic forgetting, and the ability to serve multiple fine-tunes from a single base model.

At Hugging Face, we have developed the PEFT library, which encompasses many PEFT techniques under a unified API. This library integrates seamlessly with other tools in the ecosystem, such as Transformers and Diffusers. It also supports various quantization methods, making parameter-efficient fine-tuning more accessible. Whether you want to fine-tune on your own data or explore new PEFT methods, PEFT is an excellent starting point.

LoRA: The Queen of Fine-Tuning Techniques 👑

Among the parameter-efficient fine-tuning techniques, one method has particularly stood out: Low-Rank Adaptation, or LoRA. This technique adds a small number of new parameters on top of the base model while keeping the base model's weights frozen. Only the new parameters are trained, which makes LoRA extremely efficient.
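The idea can be sketched in a few lines of plain Python. This is a minimal illustration of the commonly described LoRA formulation for a single linear layer, not the PEFT library's implementation: the frozen weight W is left untouched, and a low-rank product B·A, scaled by alpha / r, is added to its output. The helper names, the toy dimensions, and the 4096 x 4096 layer used for the parameter count are all assumptions made for the example.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, alpha, r):
    """Compute y = W x + (alpha / r) * B (A x). W is frozen; only A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


def lora_trainable_params(d_in, d_out, r):
    """Number of trainable parameters in A (r x d_in) plus B (d_out x r)."""
    return r * (d_in + d_out)


random.seed(0)
d_in, d_out, r, alpha = 8, 6, 2, 4  # toy sizes, for illustration only

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen base weight
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable, small random init
B = [[0.0] * r for _ in range(d_out)]                                  # trainable, zero init
x = [random.gauss(0, 1) for _ in range(d_in)]

# With B initialized to zero, the adapted layer starts out identical to the base layer.
assert lora_forward(x, W, A, B, alpha, r) == matvec(W, x)

# Parameter count for a hypothetical 4096 x 4096 layer at rank 16.
full = 4096 * 4096
lora = lora_trainable_params(4096, 4096, 16)
print(f"full: {full:,}  LoRA: {lora:,}  ratio: {lora / full:.2%}")
```

Because only A and B receive gradients, the optimizer state shrinks along with the trainable parameter count, which is where most of the memory saving comes from.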

LoRA is undoubtedly the most popular PEFT technique. To illustrate this popularity, let's look at some numbers:

Out of a sample of 20,834 model cards on the Hugging Face Hub mentioning a PEFT technique, 20,509 mention LoRA, which is 98.4%.
In an analysis of popular PEFT techniques for image generation on an external site, 7,111 out of 10,000 checkpoints were LoRA. Other identified techniques include LoCon (363) and DoRA (11), itself a variant of LoRA. This means that 95.0% of the checkpoints identified as PEFT are LoRA.
A GitHub search for the code "from peft import <PEFT CONFIG>" shows that 71.3% of the results pertain to LoRA. The next most common techniques are LoHa (3.7%) and AdaLoRA (3.5%).
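The first two percentages can be reproduced from the counts quoted above. One assumption is needed for the image generation figure: the snippet takes the denominator to be the sum of the three named techniques (LoRA, LoCon, DoRA), which is consistent with the 95.0% quoted but is not stated explicitly in the text.

```python
def share(part: int, total: int) -> float:
    """Return part / total as a percentage, rounded to one decimal place."""
    return round(100 * part / total, 1)


# Model cards on the Hub mentioning a PEFT technique.
hub_share = share(20_509, 20_834)

# Image generation checkpoints; the denominator is an assumption (see above).
image_share = share(7_111, 7_111 + 363 + 11)

print(f"Hub model cards: {hub_share}%")
print(f"Image generation checkpoints: {image_share}%")
```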

These numbers, while imperfect, strongly suggest that LoRA is the most common PEFT technique. This could simply mean that LoRA is the most effective choice for the majority of users, which would explain its popularity. However, another explanation is possible: as one of the first PEFT techniques to gain traction, LoRA has benefited from greater visibility, numerous tutorials and examples, and better support in downstream packages. In that case, its popularity is at least partly self-sustaining.

This raises an important question: are we neglecting potential performance by limiting ourselves to LoRA? Many researchers claim that their techniques outperform LoRA. Shouldn't we explore these newer alternatives?

Choosing the Right PEFT Technique Based on Paper Results Is Problematic

A multitude of research papers examine fine-tuning techniques other than LoRA. The PEFT library alone contains over 40 distinct PEFT techniques to date, not counting the numerous variations. For almost every one of these techniques, the authors claim that it outperforms LoRA on their benchmarks.

The problem with these claims lies in the pressure on researchers to produce results that surpass existing baselines. Even without malicious intent, this can bias results, for example when less time is spent tuning the competing techniques than the one being proposed. One study showed, for instance, that LoRA can match supposedly better PEFT techniques simply by adjusting the learning rate.

Another complication is that each paper chooses a different set of PEFT techniques to compare and a different set of benchmarks to run. Even when the same technique is compared on the same benchmark, the code is often unavailable or difficult to run independently, making results hard to reproduce.

Ultimately, it is hard to determine which PEFT technique works best for you from paper results alone. You might therefore be tempted to simply go with the default choice, LoRA.

How We Approach Benchmarking in PEFT

At Hugging Face, we have considered how we can help users make informed decisions about which PEFT technique to use. With the PEFT library, we already provide a package that implements many PEFT techniques and exposes them through the same API. The next step is to provide benchmarks that shed more light on this question.

For some time, we have maintained a benchmark that fine-tunes LLMs on a mathematical dataset. It starts from a base model that has not been instruction-tuned and fine-tunes it to answer math questions using chain-of-thought reasoning. The benchmark thus checks whether the model can learn to perform mathematical reasoning and also adapt its output to the expected format.

To extend our findings to another modality, we have also added an image generation benchmark. This tests whether the model can be fine-tuned to learn a new concept, a cat plushie, and generate it in new contexts without forgetting existing concepts.

All PEFT techniques are evaluated under the exact same conditions: same base model, same dataset, same training and evaluation code, same hardware. As different users have different needs, we track more than just test performance. In addition to VRAM usage, we track metrics such as forgetting/drift, execution time, and checkpoint size. The benchmarks are designed to run on consumer-grade hardware, and adding a new experiment requires only adding a new PEFT configuration and running a script.
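Tracking several metrics per run makes it possible to pick a technique against your own constraints rather than by test accuracy alone. The sketch below shows one way this could look. The record layout, the method names, and every number in it are hypothetical and are not taken from the actual benchmark results or its code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """One benchmark run; all fields here are illustrative, not the benchmark's schema."""
    method: str
    test_accuracy: float   # fraction in [0, 1]
    peak_vram_gb: float
    train_time_s: float
    checkpoint_mb: float


def best_within_budget(results, max_vram_gb, max_checkpoint_mb=float("inf")):
    """Return the most accurate run that fits the given limits, or None if none fits."""
    eligible = [
        r for r in results
        if r.peak_vram_gb <= max_vram_gb and r.checkpoint_mb <= max_checkpoint_mb
    ]
    return max(eligible, key=lambda r: r.test_accuracy, default=None)


# Made-up results for three made-up methods.
results = [
    RunResult("method_a", 0.50, 22.0, 1800.0, 100.0),
    RunResult("method_b", 0.54, 25.0, 2100.0, 150.0),
    RunResult("method_c", 0.35, 20.0, 1700.0, 1.0),
]

for budget in (21, 24, 30):
    winner = best_within_budget(results, max_vram_gb=budget)
    print(f"{budget} GB budget -> {winner.method}")
```

The same data yields a different winner at each memory budget, which is why a single headline accuracy number is rarely enough to choose a technique.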

Since we compare all PEFT techniques on an equal footing and have no stake in the race, we believe these benchmarks can paint an objective picture of the effectiveness of different PEFT techniques. We argue that if you have your own dataset, you can adopt a similar approach and leverage the PEFT library to evaluate multiple PEFT techniques.

Our Findings: LoRA Works Well but Is Not Necessarily the Best Choice

After completing the benchmark runs, we found that while LoRA performs well, other PEFT methods may surpass it on one or more axes and should therefore be considered.

One way to interpret the results is to think in terms of trade-offs, for example: what is the model's performance on the test set relative to the memory required to train it? If a PEFT technique cannot be surpassed on both metrics simultaneously by another technique, it lies on the Pareto frontier. In other words: if you want better test accuracy, you need more memory, and if you want more memory efficiency, you have to sacrifice accuracy.
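A Pareto frontier over accuracy and memory can be computed with a simple pairwise check. The sketch below uses the standard dominance definition, which is slightly stricter than the wording above: a technique is dropped if another one is at least as good on both metrics and strictly better on at least one. The method names and numbers are invented for illustration and are not benchmark results.

```python
def dominates(q, p):
    """True if q is at least as good as p on both metrics and strictly better on one.

    Each point is (name, accuracy, memory_gb): higher accuracy and lower memory are better.
    """
    at_least_as_good = q[1] >= p[1] and q[2] <= p[2]
    strictly_better = q[1] > p[1] or q[2] < p[2]
    return at_least_as_good and strictly_better


def pareto_frontier(points):
    """Return the points that no other point dominates, sorted by memory."""
    frontier = [p for p in points if not any(dominates(q, p) for q in points)]
    return sorted(frontier, key=lambda p: p[2])


# Hypothetical (name, test accuracy, peak memory in GB) triples.
points = [
    ("method_a", 0.50, 22.0),
    ("method_b", 0.54, 25.0),
    ("method_c", 0.35, 20.0),
    ("method_d", 0.45, 23.0),  # dominated by method_a: less accurate and more memory
    ("method_e", 0.53, 26.0),  # dominated by method_b: less accurate and more memory
]

for name, accuracy, memory_gb in pareto_frontier(points):
    print(f"{name}: {accuracy:.0%} at {memory_gb} GB")
```

Moving along the resulting frontier from left to right, each step buys accuracy at the cost of memory, which is exactly the trade-off described above.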

Let's take a moment to examine the benchmark results on the math dataset. In terms of test accuracy versus memory, we find that LoRA does lie on the Pareto frontier. It achieves 53.2% test accuracy with a peak VRAM usage of 22.6 GB. However, there are other PEFT techniques on the Pareto frontier. For example, BEFT achieves 32.9% test accuracy and peaks at only 20.2 GB of memory. At the other end, Lily achieves 54.9% test accuracy but requires 25.6 GB of memory. Depending on what matters most to you, you might conclude that LoRA does not offer the best trade-off.

It is also important to note that even though LoRA performs well on this task, we are not talking about standard LoRA. On the one hand, there is rank-stabilized LoRA (rsLoRA), which scales LoRA's contribution differently from the default and provides very good test accuracy (53.2%). On the other hand, there is LoRA-FA, which uses a specialized optimizer that freezes part of the LoRA weights and is therefore more memory-efficient (20.2 GB). Standard LoRA only achieves an accuracy of 48.1% at 22.5 GB of memory and should therefore be avoided in favor of alternatives.
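As a small illustration of how rank-stabilized LoRA differs from standard LoRA, the sketch below compares the two scaling factors as they are usually formulated: standard LoRA multiplies the low-rank update by alpha / r, while the rank-stabilized variant uses alpha / sqrt(r). The function names and the alpha value of 16 are assumptions for the example and say nothing about the settings used in the benchmark.

```python
import math


def lora_scale(alpha: float, r: int) -> float:
    """Standard LoRA scaling factor: alpha / r."""
    return alpha / r


def rslora_scale(alpha: float, r: int) -> float:
    """Rank-stabilized scaling factor: alpha / sqrt(r)."""
    return alpha / math.sqrt(r)


alpha = 16  # illustrative value
for r in (4, 16, 64):
    print(f"r={r:>2}  standard={lora_scale(alpha, r):.2f}  "
          f"rank-stabilized={rslora_scale(alpha, r):.2f}")
```

With standard scaling, the adapter's contribution shrinks quickly as the rank grows, whereas the rank-stabilized factor decays much more slowly. This is one plausible reason the two variants can behave differently at the same rank.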