LLM Training

How It Works and 4 Key Considerations

What Is LLM Training?

Large language model (LLM) training is the process of teaching LLMs to understand and generate human language. This is achieved by feeding the model massive amounts of text data (or text and image data in multi-modal architectures) and then using learning algorithms to capture patterns and predict what comes next in a sentence. The result is an AI system that can generate human-like text, translate between languages, answer questions, and perform many other cognitive tasks.

The term 'large' in LLM refers to the number of parameters in the model. These parameters are variables that the model uses to make predictions. The higher the number of parameters, the more detailed and nuanced the AI's understanding of language can be. However, training such models requires considerable computational resources and specialized expertise.

This is part of a series of articles about machine learning engineering.

How LLM Training Works

Here are the general steps involved in training LLMs.

1. Data Collection (Preprocessing)

This initial step involves seeking out and compiling a training dataset. Data can originate from diverse sources such as books, articles, web content, and open-access datasets. This data needs to be cleaned and prepared for training. For example, the dataset might require conversion to lowercase, removal of stop words, and tokenization into sequences of tokens.
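
To make this concrete, here is a minimal preprocessing sketch. It assumes the Hugging Face transformers library and the GPT-2 tokenizer; the cleaning rules and sample documents are placeholders, and real pipelines apply far more extensive filtering.

```python
# A minimal preprocessing sketch, assuming the Hugging Face transformers library
# and the GPT-2 tokenizer; the cleaning rules and documents are placeholders.
import re
from transformers import AutoTokenizer

def clean_text(text: str) -> str:
    """Lowercase and collapse whitespace; real pipelines do far more filtering."""
    return re.sub(r"\s+", " ", text.lower()).strip()

tokenizer = AutoTokenizer.from_pretrained("gpt2")

raw_documents = [
    "Large language models learn from text.",
    "Training data   often needs   cleaning.",
]

for doc in raw_documents:
    cleaned = clean_text(doc)
    token_ids = tokenizer.encode(cleaned)   # tokenize into a token sequence
    print(cleaned, "->", token_ids)
```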

2. Model Configuration

Transformer-based deep learning architectures are commonly used for Natural Language Processing (NLP) applications. When setting up a transformer neural network, certain parameters must be defined. These include the number of transformer layers, the number of attention heads, the hidden dimension, and training hyperparameters such as the learning rate. Researchers typically experiment with these settings to find a combination that yields optimal performance.
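
As an illustration of what model configuration can look like in code, the sketch below defines a small GPT-2-style model with the Hugging Face transformers library; the specific sizes are illustrative assumptions, not recommendations.

```python
# A minimal model configuration sketch using the Hugging Face transformers
# library; the sizes below are illustrative, not recommendations.
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=50257,   # size of the tokenizer vocabulary
    n_layer=12,         # number of transformer blocks
    n_head=12,          # attention heads per block
    n_embd=768,         # hidden (embedding) dimension
    n_positions=1024,   # maximum sequence length
)

model = GPT2LMHeadModel(config)
print(f"parameters: {sum(p.numel() for p in model.parameters()):,}")
```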

3. Model Training

The cleansed and prepared text data can now be used to train the model. The training process feeds the model sequences of words (tokens), and the model aims to predict the next word in each sequence. The model then adjusts its weights based on how far its predictions were from the actual next words. This cycle is repeated millions or even billions of times, depending on the size of the dataset, until the model achieves acceptable performance.
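
The next-word prediction loop can be sketched in a few lines of PyTorch. The example below assumes the Hugging Face transformers library and uses the small pretrained GPT-2 model as a stand-in; real training runs use batched data loaders, mixed precision, and learning-rate schedules.

```python
# A minimal next-token prediction training loop in PyTorch, using the small
# pretrained GPT-2 model from Hugging Face as a stand-in.
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, GPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = AdamW(model.parameters(), lr=5e-5)

texts = ["The quick brown fox jumps over the lazy dog."]

model.train()
for step, text in enumerate(texts):
    batch = tokenizer(text, return_tensors="pt")
    # With labels equal to the inputs, the model computes the shifted
    # next-token cross-entropy loss internally.
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"step {step}: loss {outputs.loss.item():.3f}")
```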

Given the sheer scale of LLMs and their training data, huge computational power is needed for model training. To reduce training time, it is common to use model parallelism, which distributes parts of the model across multiple Graphics Processing Units (GPUs).
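
Below is a simplified sketch of the idea behind model parallelism: the layers of a small network are placed on different GPUs, and activations move between them during the forward pass. It assumes at least two CUDA devices are available; production LLM training relies on frameworks such as Megatron-LM or DeepSpeed for this partitioning.

```python
# A simplified sketch of model parallelism: the two halves of a small network
# live on different GPUs and activations are moved between them during the
# forward pass. Assumes at least two CUDA devices are available.
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(512, 512), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Linear(512, 512).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        return self.part2(x.to("cuda:1"))  # hand activations to the second GPU

model = TwoGPUModel()
print(model(torch.randn(8, 512)).shape)
```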

4. Fine-Tuning

Once training is done, the model is evaluated using a testing dataset to gauge its performance. Depending on the results of these tests, fine-tuning adjustments may be made to the model. These refinements can take the form of tweaking hyperparameters or modifying the model's structure. In certain situations, further training on additional data may be needed to enhance the model's performance.
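
A minimal sketch of this evaluate-then-adjust loop is shown below. The model, the held-out text, and the loss threshold are illustrative assumptions; in practice, fine-tuning also involves further training passes on new or task-specific data.

```python
# A minimal post-training evaluation sketch: compute the loss on held-out text
# and, if it is unsatisfactory, lower the learning rate before further training.
# The model, held-out text, and threshold are illustrative assumptions.
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, GPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = AdamW(model.parameters(), lr=5e-5)

test_texts = ["Held-out sentences are used only for evaluation."]

model.eval()
losses = []
with torch.no_grad():
    for text in test_texts:
        batch = tokenizer(text, return_tensors="pt")
        losses.append(model(**batch, labels=batch["input_ids"]).loss.item())
avg_loss = sum(losses) / len(losses)
print(f"held-out loss: {avg_loss:.3f}")

# One common fine-tuning adjustment: lower the learning rate before training further.
if avg_loss > 3.0:  # illustrative threshold
    for group in optimizer.param_groups:
        group["lr"] = 1e-5
```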

Evaluating LLMs After Training

Like any other machine learning model, after LLMs are trained, they need to be evaluated to see whether training was successful, and how the model compares to benchmarks, alternative algorithms, or previous versions. The evaluation of LLMs employs both intrinsic and extrinsic methods.

Intrinsic Methods

Intrinsic analysis tracks performance based on objective, quantitative metrics that measure the linguistic precision of the model or how successful it is at predicting the next word. These metrics include:

  • Language fluency: Evaluates the naturalness of language produced by the LLM, checking for grammatical correctness and syntactic variety to ensure sentences generated by the model sound as if they were written by a human.
  • Coherence: Measures the model's ability to maintain topic consistency across sentences and paragraphs, ensuring that successive sentences support and are logically connected to each other.
  • Perplexity: A statistical measure of how well the model predicts a sample. A lower perplexity score indicates the model is better at predicting the next word in a sequence, showing a tighter fit to the observed data (a minimal computation sketch follows this list).
  • BLEU score (Bilingual Evaluation Understudy): Assesses the correspondence between a machine's output and that of a human, focusing on the precision of translated text or generated responses by counting matching n-grams (contiguous sequences of words).
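
As a concrete example of an intrinsic metric, the sketch below computes perplexity as the exponential of the average next-token cross-entropy loss, using the Hugging Face GPT-2 model as an assumed stand-in.

```python
# A minimal perplexity sketch: perplexity is the exponential of the average
# per-token cross-entropy loss. GPT-2 from Hugging Face is used as a stand-in model.
import torch
from transformers import AutoTokenizer, GPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

text = "Perplexity measures how well a model predicts a sample of text."
batch = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # With labels equal to the inputs, the loss is the mean next-token cross-entropy.
    loss = model(**batch, labels=batch["input_ids"]).loss

print(f"perplexity: {torch.exp(loss).item():.2f}")
```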

Extrinsic Methods

With recent advancements in LLMs, extrinsic methods are now favored to assess their performance. This involves examining how well the models perform in real-world tasks like problem-solving, reasoning, mathematics, computer science, and in competitive exams like GRE, LSAT, and the US Uniform Bar Exam.

Here are a few extrinsic methods commonly used for LLM assessment:

  • Questionnaires: Checking how the LLM performs on questions intended for humans and comparing its score to human performance (a simple scoring sketch follows this list).
  • Common-sense inferences: Testing the LLM’s ability to make common-sense inferences which are easy for humans.
  • Multitasking: Testing a model’s multitasking accuracy across different domains like mathematics, law, and history.
  • Factuality: Testing a model’s ability to answer factual questions accurately (and the degree of hallucinations in responses).
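
For illustration, a questionnaire-style evaluation can be as simple as exact-match accuracy over questions with known answers. The questions, the expected answer format, and the ask_model stub below are hypothetical; established benchmark harnesses implement far more robust scoring.

```python
# A minimal sketch of a questionnaire-style (extrinsic) evaluation using
# exact-match accuracy. The questions, answers, and the ask_model stub are
# hypothetical; real benchmarks use far more robust scoring.
from typing import Callable

def accuracy(ask_model: Callable[[str], str], qa_pairs: list[tuple[str, str]]) -> float:
    correct = sum(1 for q, a in qa_pairs if ask_model(q).strip().lower() == a.lower())
    return correct / len(qa_pairs)

qa_pairs = [
    ("What is 2 + 2? Answer with a single number.", "4"),
    ("Which planet is closest to the Sun? Answer with one word.", "Mercury"),
]

# Plug in any function that calls an LLM; this stub always answers "4".
print(accuracy(lambda question: "4", qa_pairs))
```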

Related content: Read our guide to machine learning workflow

4 Key Considerations for Training LLMs

Training LLMs from scratch is a difficult task with high cost and complexity. Here are some of the key challenges.

1. Infrastructure

LLMs are trained on huge text corpora, typically at least 1000 GB in size. Furthermore, the models employed for training on such datasets are enormous, with billions of parameters. An infrastructure with multiple GPUs is essential for training such large models.

To illustrate the computational requirements: training GPT-3, a previous-generation model with 175 billion parameters, would take an estimated 288 years on a single NVIDIA V100 GPU. Typically, LLMs are trained on thousands of GPUs in parallel. For example, Google trained its PaLM model, with 540 billion parameters, by distributing training over 6,144 TPU v4 chips.
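
To see where figures like this come from, here is a rough back-of-envelope estimate using the common approximation of about 6 floating-point operations per parameter per training token. The token count and the sustained single-GPU throughput below are assumptions for illustration, not official figures.

```python
# A rough back-of-envelope sketch using the common ~6 FLOPs per parameter per
# training token approximation. The token count and sustained GPU throughput
# are assumptions for illustration, not official figures.
params = 175e9                      # GPT-3 parameter count
tokens = 300e9                      # approximate training tokens
total_flops = 6 * params * tokens   # ~3.15e23 FLOPs

sustained_flops_per_sec = 35e12     # assumed sustained throughput of one V100
seconds = total_flops / sustained_flops_per_sec
print(f"~{seconds / (3600 * 24 * 365):.0f} years on a single GPU")
```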

2. Cost

However, acquiring and hosting such a large number of GPUs is not feasible for most organizations. Even OpenAI, creator of the GPT series of models and the popular ChatGPT, did not train its models on its own infrastructure, but instead relied on Microsoft's Azure cloud platform. In 2019, Microsoft invested $1 billion in OpenAI, and it is estimated that much of that money was spent training OpenAI's LLMs on Azure cloud resources.

3. Model Distribution Strategies

Beyond the scale and cost, there are also complex considerations in how to run LLM training on the computing resources. In particular:

  • Initial training runs are often performed on a single GPU to get an idea of the model's resource requirements.
  • Model parallelism is an important strategy. This involves distributing the model across numerous GPUs, with partitioning designed to make the best use of memory and I/O bandwidth.
  • With very large models, there is a need for tensor model parallelism, which splits individual layers of the model across multiple GPUs so that each GPU holds a shard of each layer's weights (a simplified sketch follows this list). This requires precise coding, configuration, and careful implementation for accurate and efficient execution.
  • LLM training is iterative in nature. Various parallel computing strategies are often combined, and researchers experiment with different configurations, adjusting training runs to the specific needs of the model and the available hardware.
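
As referenced above, here is a single-process sketch of the idea behind tensor model parallelism: a linear layer's weight matrix is split into shards that could live on separate GPUs, and their partial outputs are concatenated. Real implementations (for example, Megatron-LM) add the required communication and memory layout.

```python
# A single-process sketch of the idea behind tensor model parallelism: a linear
# layer's weight matrix is split into shards that could live on separate GPUs,
# and their partial outputs are concatenated.
import torch
import torch.nn as nn

hidden, out_features, num_shards = 256, 512, 2

full = nn.Linear(hidden, out_features, bias=False)
# Split the weight along the output dimension; each shard produces a slice of the output.
shards = torch.chunk(full.weight, num_shards, dim=0)

x = torch.randn(4, hidden)
sharded_out = torch.cat([x @ w.t() for w in shards], dim=-1)

# The sharded computation reproduces the unsharded layer's output.
print(torch.allclose(sharded_out, full(x), atol=1e-6))
```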

4. Impact of Model Architecture Choices

The chosen LLM architecture has a direct impact on training complexity. Here are a few guidelines for adapting the architecture to the available resources:

  • The model's depth and width (in terms of number of parameters) should be selected to achieve a balance between available computational resources and complexity.
  • It is preferable to use architectures with residual connections, which make deep networks easier to optimize.
  • Determine the need for a Transformer architecture with self-attention, because this imposes specific training requirements.
  • Identify the functional needs of the model, such as generative modeling, bi-directional/masked language modeling, multi-task learning, and multi-modal analysis.
  • Perform training runs with familiar models like GPT, BERT, and XLNet to understand their applicability to your use case.
  • Determine your tokenization technique: word-based, subword, or character-based. This can impact vocabulary size and input length, directly affecting computational requirements (a short comparison follows this list).
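
As a quick illustration of how tokenization granularity changes sequence length, the sketch below compares word, subword, and character tokenization of the same sentence; the subword example assumes the Hugging Face GPT-2 tokenizer.

```python
# A quick comparison of tokenization granularities for the same sentence;
# the subword example assumes the Hugging Face GPT-2 tokenizer.
from transformers import AutoTokenizer

sentence = "Tokenization choices change vocabulary size and sequence length."

word_tokens = sentence.split()                                              # word-based
subword_tokens = AutoTokenizer.from_pretrained("gpt2").tokenize(sentence)  # subword
char_tokens = list(sentence)                                                # character-based

print(len(word_tokens), len(subword_tokens), len(char_tokens))
```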

Optimizing Your AI Infrastructure with Run:ai

Run:ai automates resource management and orchestration and reduces cost for the infrastructure used to train LLMs and other computationally intensive models. With Run:ai, you can automatically run as many compute-intensive experiments as needed.

Here are some of the capabilities you gain when using Run:ai:

  • Advanced visibility—create an efficient pipeline of resource sharing by pooling GPU compute resources.
  • No more bottlenecks—you can set up guaranteed quotas of GPU resources, to avoid bottlenecks and optimize billing.
  • A higher level of control—Run:ai enables you to dynamically change resource allocation, ensuring each job gets the resources it needs at any given time.

Run:ai simplifies machine learning infrastructure pipelines, helping data scientists accelerate their productivity and the quality of their models.

Learn more about the Run:ai GPU virtualization platform.