Understanding LLM Fine Tuning with LoRA (Low-Rank Adaptation)

What Is LLM Fine Tuning and What Are Its Key Challenges?

Large language models (LLMs) are the most significant innovation in natural language processing, and probably in AI in general, in our generation. LLMs like OpenAI’s GPT-4 and Google’s PaLM 2 and, more recently, Gemini achieve human-like performance on a wide range of cognitive tasks involving text, images, and video.

However, while LLMs have tremendous potential, they require huge computing resources to train, meaning that only a small group of technology giants and research groups are able to build their own LLMs. How can the rest of the world create LLM-based tools tailored to their specific needs? That’s where LLM fine-tuning comes in.

LLM fine-tuning is a specialized process that takes a pre-trained language model and customizes it for specific tasks or domains. It leverages the general language understanding the model acquired during its initial training phase and adapts it to more specialized requirements. The advantage of fine-tuning is that it does not require re-training the entire model, so, at least in principle, it should be much simpler and less computationally intensive than training a new LLM.

This is part of a series of articles about generative AI.


What Is LoRA (Low-Rank Adaptation) and How Is it Used for LLM Fine Tuning?

LoRA (Low-Rank Adaptation) is a highly efficient method of LLM fine tuning that is putting LLM development into the hands of smaller organizations and even individual developers. LoRA makes it possible to fine-tune a specialized LLM on a single machine, opening major opportunities for LLM development in the broader data science community.

LoRA modifies the fine-tuning process by freezing the original model weights and training a separate, much smaller set of weights, which are then added to the original parameters. LoRA expresses these weight updates as the product of two low-rank matrices, drastically reducing the number of parameters that need training, which speeds up the process and lowers costs.
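As a concrete illustration, here is a minimal PyTorch sketch of that idea (illustrative only, not code from the LoRA paper or any library): the pre-trained weight matrix is held in a frozen linear layer, and only the two small matrices that form the low-rank update receive gradients.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (toy sketch)."""
    def __init__(self, d_in: int, d_out: int, r: int = 8):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)       # holds the pre-trained W0
        self.base.weight.requires_grad = False                # W0 stays frozen during fine-tuning
        self.B = nn.Parameter(torch.randn(d_in, r) * 0.01)    # d_in x r, small random init
        self.A = nn.Parameter(torch.zeros(r, d_out))          # r x d_out, zero init so the update starts at zero

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus trainable low-rank correction: h = x W0 + x B A
        return self.base(x) + x @ self.B @ self.A

x = torch.randn(1, 200)
h = LoRALinear(200, 200, r=2)(x)   # output shape: (1, 200)
```

The rank and initialization here are arbitrary choices for the sketch; real implementations also scale the update and inject it into existing transformer layers rather than defining a new layer type.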

This method is particularly useful in scenarios where multiple clients need fine-tuned models for different applications, as it allows a small set of adapter weights to be created for each specific use case without maintaining a separate copy of the full model.
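For instance, assuming the Hugging Face PEFT library, one frozen base model could serve several clients by loading their small adapter weight sets and switching between them per request (the model name and adapter paths below are purely illustrative):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# One shared, frozen base model (example checkpoint name).
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Attach one client's adapter, then register a second adapter on the same base.
model = PeftModel.from_pretrained(base, "adapters/client_legal", adapter_name="legal")
model.load_adapter("adapters/client_support", adapter_name="support")

# Activate whichever adapter matches the incoming request.
model.set_adapter("legal")     # legal-domain behavior
model.set_adapter("support")   # customer-support behavior
```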

Advantages of Using LoRA for Fine-Tuning

Efficiency in Training and Adaptation

LoRA enhances the training and adaptation efficiency of large language models like OpenAI’s GPT-3 and Meta’s LLaMA. Traditional fine-tuning methods require updating all model parameters, which is computationally intensive. LoRA, instead, introduces low-rank matrices that only modify a subset of the original model's weights. These matrices are small compared to the full set of parameters, enabling more efficient updates.

The approach focuses on altering the weight matrices in the transformer layers of the model, specifically targeting the most impactful parameters. This selective updating streamlines the adaptation process, making it significantly quicker and more efficient. It allows the model to adapt to new tasks or datasets without the need to extensively retrain the entire model.
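In practice, this selective targeting is usually expressed as a configuration. The sketch below assumes the Hugging Face PEFT library and restricts LoRA to the query and value projection matrices of a causal language model (the model name and module names are illustrative and vary by architecture):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = LoraConfig(
    r=8,                                   # rank of the low-rank update matrices
    lora_alpha=16,                         # scaling factor applied to the update
    target_modules=["q_proj", "v_proj"],   # adapt only the attention query/value projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()         # typically well under 1% of all parameters
```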

Reduced Computational Resources Requirement

LoRA reduces the computational resources required for fine-tuning large language models. By using low-rank matrices to update specific parameters, the approach drastically cuts down the number of parameters that need to be trained. This reduction is crucial for practical applications, as fully retraining LLM models like GPT-3 is beyond the resource capabilities of most organizations.

LoRA's method requires less memory and processing power, and also allows for quicker iterations and experiments, as each training cycle consumes fewer resources. This efficiency is particularly beneficial for applications that require regular updates or adaptations, such as adapting a model to specialized domains or continuously evolving datasets.

Preservation of Pre-trained Model Weights

LoRA preserves the integrity of pre-trained model weights, which is a significant advantage. In traditional fine-tuning, all weights of the model are subject to change, which can lead to a loss of the general knowledge the model originally possessed. LoRA's approach of selectively updating weights through low-rank matrices ensures that the core structure and knowledge embedded in the pre-trained model are largely maintained.

This preservation is crucial for maintaining the model's broad understanding and capabilities while still allowing it to adapt to specific tasks or datasets. It ensures that the fine-tuned model retains the strengths of the original model, such as its understanding of language and context, while gaining new capabilities or improved performance in targeted areas.
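One way to see this preservation concretely, again assuming the PEFT library, is that after wrapping a model for LoRA only the injected adapter matrices are trainable; every pre-trained parameter stays frozen (the small GPT-2 checkpoint below is just a stand-in):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")                    # small stand-in model
model = get_peft_model(base, LoraConfig(r=8, task_type="CAUSAL_LM"))

# Every trainable parameter belongs to the added LoRA matrices; the original
# pre-trained weights all keep requires_grad=False and are never overwritten.
for name, param in model.named_parameters():
    if param.requires_grad:
        assert "lora_" in name, name
```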

How LoRA Works

To understand how LoRA works, we’ll walk through a simplified example. We’ll refer to W0 as the parameters of the originally trained LLM, represented in matrix form, and to ∆W as the matrix of weight adjustments that will be added during fine tuning.

At the heart of the LoRA technique is breaking down the ∆W matrix into two more manageable matrices called A and B. These matrices are much smaller, which translates into lower computational overhead when tweaking the model's parameters. This is illustrated in the diagram below, from the original LoRA paper.

Source: the original LoRA paper (arXiv)

Consider a weight matrix W0, which measures d by d and is kept unchanged during the training procedure. The update ∆W also measures d by d. The LoRA approach introduces a rank parameter r, much smaller than d. The smaller matrices A and B have sizes r by d and d by r, respectively, so their product has the same d by d shape as ∆W while containing far fewer trainable values.

Upon completing the training, W0 and ∆W (kept in its factored form as A and B) are stored separately. For a new input x with a size of 1 by d, the model multiplies x by both W0 and ∆W, resulting in two d-sized output vectors. These vectors are then added together element-wise to produce the final result, denoted as h.

To better understand the potential of LoRA for reducing the number of parameters, consider an example where ∆W is a 200 by 200 matrix and thus has 40,000 parameters. In the LoRA approach, we replace the large matrix with the smaller A and B matrices, with sizes of 2 by 200 and 200 by 2, respectively, using d=200 and setting the rank to r=2. This means the model needs to adjust only 800 parameters, 50 times fewer than the initial 40,000.
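The arithmetic in this example can be checked directly with a few lines of NumPy (random toy data, purely illustrative):

```python
import numpy as np

d, r = 200, 2
W0 = np.random.randn(d, d)       # frozen pre-trained weights, d x d
B = np.random.randn(d, r)        # d x r
A = np.random.randn(r, d)        # r x d
x = np.random.randn(1, d)        # a single input vector, 1 x d

h = x @ W0 + x @ B @ A           # frozen path plus low-rank path, both of size 1 x d
print(h.shape)                   # (1, 200)
print(d * d, A.size + B.size)    # 40,000 entries in a full ∆W vs. 800 in A and B
```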

The parameter r is crucial in determining the size of A and B. A smaller r value means fewer parameters and faster training times, although this may result in a compromise on model performance if r is set too low.

Note that we used identical input and output sizes for simplicity; LoRA is flexible and does not require them to match.

Further reading: A step-by-step tutorial for fine tuning LLMs with LoRA is outside the scope of this article, but you can read the excellent post by Chris Kuo to learn how to fine-tune GPT-3 with LoRA.

QLoRA: A New Spin on LoRA

QLoRA, which stands for Quantized Low-Rank Adaptation, is an extension of LoRA that makes the method even more efficient. QLoRA applies low-rank adaptation on top of a pre-trained model whose frozen weights have been compressed to 4-bit precision. Some of the improvements introduced by QLoRA include:

  • 4-bit NormalFloat (NF4): Think of this as a compact, optimized format to record your model's data. It strikes a balance that is well-suited for weights that follow a normal distribution, reducing memory usage by narrowing the data down to 4-bit precision.
  • Double quantization: This can be likened to a shorthand notation where both weights and quantization constants are abbreviated, further reducing the memory footprint.
  • Paged optimizers: These optimizers effectively handle sudden memory demands, ensuring that the training process stays smooth and efficient, even for the largest models.

As a result, with QLoRA you can fine-tune a large 65 billion parameter model on a single GPU with just 48GB of memory, while preserving the task performance of full 16-bit fine-tuning. This makes it feasible to explore and adapt very large models on standard academic hardware, paving the way for more practical uses of large language models (LLMs).
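As an illustration, a typical QLoRA setup with the transformers, bitsandbytes, and peft libraries loads the frozen base model in 4-bit NF4 with double quantization and then attaches ordinary LoRA adapters on top (the model name below is illustrative, and exact arguments may differ between library versions):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store the frozen base weights in 4 bits
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat data type
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # run the actual matrix math in 16-bit
)

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

model = get_peft_model(base, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))
# Paged optimizers are chosen at training time, e.g.
# TrainingArguments(optim="paged_adamw_8bit", ...)
```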

Guanaco is a model family fine-tuned with QLoRA that demonstrates the method’s effectiveness. It outperforms all other openly available models on the Vicuna benchmark, reaching 99.3% of ChatGPT’s performance level with only one day’s training on a single GPU.

Further reading: See how to tune Meta’s new LLaMA model with an even more advanced version of QLoRA, known as QA-LoRA, in this post by Benjamin Marie.

Optimizing Your AI Infrastructure with Run:ai

Run:ai automates resource management and orchestration and reduces cost for the infrastructure used to train LLMs and other computationally intensive models. With Run:ai, you can automatically run as many compute intensive experiments as needed.

Here are some of the capabilities you gain when using Run:ai:

  • Advanced visibility—create an efficient pipeline of resource sharing by pooling GPU compute resources.
  • No more bottlenecks—you can set up guaranteed quotas of GPU resources, to avoid bottlenecks and optimize billing.
  • A higher level of control—Run:ai enables you to dynamically change resource allocation, ensuring each job gets the resources it needs at any given time.

Run:ai simplifies machine learning infrastructure pipelines, helping data scientists accelerate their productivity and the quality of their models.

Learn more about the Run:ai GPU virtualization platform.