
How to train NeMo Megatron GPT-3 on Kubernetes

December 20, 2023

NeMo™ Megatron, created by the NVIDIA Applied Deep Learning Research team, is a GPU-accelerated framework for training and deploying transformer-based Large Language Models (LLMs) such as GPT, T5, and BERT. The goal of NeMo is to help researchers in industry and academia reuse prior work (code and pretrained models) and make it easier to create new conversational AI models.

While the NeMo™ Megatron Launcher provides training scripts for the SLURM workload scheduler, there has been a noticeable gap in guidance and scripts written specifically for Kubernetes (K8s) environments. K8s is an open-source container orchestration platform that has emerged as a de facto standard for cloud-native infrastructure management and has become the platform of choice for AI companies like OpenAI and Spotify, and for new AI cloud providers like CoreWeave. Its dynamic nature and cloud-native architecture make it a strong choice for orchestrating distributed machine learning workloads.

In this guide, we explain step by step how to train NVIDIA’s NeMo models on Kubernetes clusters. Our primary objective is to simplify the process and help AI practitioners get started with their LLM experiments on Kubernetes faster. To assist with this, we've made our launching scripts available in our repository, ensuring that anyone in the AI community can easily begin their LLM experiments on Kubernetes.



We have used a cluster with the following specification:

  • A centralized NFS Server
  • 4 x NVIDIA DGX A100 Nodes, with a total of 32 x NVIDIA A100 Tensor Core GPUs with 80 GB of GPU memory each
  • 8 x 200 Gb HDR NVIDIA InfiniBand connectivity per node

This how-to guide should also work on similar GPU clusters, even those without InfiniBand connectivity or a centralized NFS server. See below for details.
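As a quick sanity check on the specification above, the per-node and aggregate figures can be worked out with a few lines of arithmetic. This is only a sketch: the constants are copied from the list above, and the even split of GPUs across nodes is our assumption.

```python
# Cluster figures copied from the specification above.
NODES = 4
TOTAL_GPUS = 32
GPU_MEMORY_GB = 80
IB_LINKS_PER_NODE = 8
IB_LINK_GBPS = 200

# Assumption: the GPUs are spread evenly across the nodes.
gpus_per_node = TOTAL_GPUS // NODES
total_gpu_memory_gb = TOTAL_GPUS * GPU_MEMORY_GB
ib_gbps_per_node = IB_LINKS_PER_NODE * IB_LINK_GBPS

print(f"GPUs per node:             {gpus_per_node}")
print(f"Total GPU memory (GB):     {total_gpu_memory_gb}")
print(f"InfiniBand per node (Gb/s): {ib_gbps_per_node}")
```

If your cluster differs, changing the constants gives the worker count and memory budget to plan around later in the guide.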

NVIDIA GPU Operator + NVIDIA Network Operator

To run the training and make use of NVIDIA's GPUs and network stack, we need to install the corresponding K8s software support: the GPU Operator and the Network Operator.

helm repo add nvidia
helm repo update
helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator
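Once the GPU Operator is running, each GPU node should advertise the `nvidia.com/gpu` resource. The sketch below shows one possible way to confirm that by parsing the output of `kubectl get nodes -o json`. The function names and the sample node names are hypothetical, and `fetch_nodes` assumes `kubectl` is on the PATH and configured for your cluster.

```python
import json
import subprocess


def gpus_per_node(nodes_json: dict) -> dict:
    """Map node name -> allocatable nvidia.com/gpu count (0 if absent)."""
    counts = {}
    for node in nodes_json.get("items", []):
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        # Kubernetes reports resource quantities as strings, e.g. "8".
        counts[name] = int(allocatable.get("nvidia.com/gpu", 0))
    return counts


def fetch_nodes() -> dict:
    """Ask kubectl for all nodes as JSON (requires cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


# Hand-written sample payload so the parsing can be tried without a cluster.
sample = {"items": [
    {"metadata": {"name": "dgx-01"},
     "status": {"allocatable": {"nvidia.com/gpu": "8"}}},
    {"metadata": {"name": "cpu-01"},
     "status": {"allocatable": {"cpu": "64"}}},
]}
print(gpus_per_node(sample))
```

On a live cluster, `gpus_per_node(fetch_nodes())` gives the real counts. A GPU node that reports 0 usually means the operator pods on that node are not ready yet.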

Training Operator

The Training Operator is a set of tools and controllers maintained by Kubeflow.

Installation instructions can be found here.

kubectl apply -k ""

Docker credentials

Before we start to run our training, we need to make sure K8s has the credentials to pull images from NVIDIA’s GPU Cloud (NGC) catalog. It contains the image with which we perform the training.

To get NGC credentials, register for the NeMo Framework Beta through this link, then go to this link.

After logging in, you should see the following:

The ea-bignlp is the organization we just joined that will give us the ability to pull NeMo images.

After that, you can click on settings and then ‘Setup’:

Then “Generate API Key”:

The API key serves as our Docker password, and the Docker username is the literal string “$oauthtoken”.

Run the command:

docker login nvcr.io

And use the username and password we just received from NGC.

Next, let's create a Kubernetes secret based on these credentials:

kubectl create secret generic regcred \
   --from-file=.dockerconfigjson="$HOME/.docker/config.json" \
   --type=kubernetes.io/dockerconfigjson

Note: Point to your relevant config.json file if it is not in the default location.
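For readers who want to see what ends up inside the secret, the sketch below builds the same structure that `docker login` writes to config.json: an `auths` map keyed by registry host, whose `auth` value is the base64 encoding of `username:password`. It assumes NGC's registry host is `nvcr.io`. The function names and the placeholder API key are ours. Note that if Docker is configured with a credential helper (a `credsStore` entry), config.json may not contain the `auth` field at all, in which case a secret created from that file would carry no usable credentials.

```python
import base64
import json


def docker_config(registry: str, username: str, password: str) -> dict:
    """Build a minimal Docker config with inline credentials."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return {"auths": {registry: {"auth": token}}}


def has_inline_credentials(config: dict, registry: str) -> bool:
    """True if the config carries a non-empty auth token for the registry."""
    return bool(config.get("auths", {}).get(registry, {}).get("auth"))


# "<your-api-key>" is a placeholder, not a real key.
cfg = docker_config("nvcr.io", "$oauthtoken", "<your-api-key>")
print(json.dumps(cfg, indent=2))
print("inline credentials:", has_inline_credentials(cfg, "nvcr.io"))
```

Running `has_inline_credentials` on your own parsed config.json before creating the secret is a cheap way to catch the credential-helper case early.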

Data Preprocessing

Now that we have a cluster ready for training, we need to prepare the data with which we will train the model. The dataset we are going to use is called “The Pile”.

"The Pile," created by EleutherAI, is a large, diverse text dataset with over 800 gigabytes of content drawn from many sources, including books, academic papers, code, and websites. It is predominantly English and spans a wide range of subjects, which makes it well suited to training large-scale GPT-style language models. Researchers and developers use it for a wide range of natural language processing tasks. The whole dataset is divided into 30 shards.

The preprocessing includes 3 parts:

  1. Download
  2. Extraction
  3. Pre-process

We are going to download, extract, and pre-process the data directly on the NFS server so that it is available to all the Kubernetes nodes. If you're not working with a centralized file system, you can instead copy the data to local disk storage attached to each node.

Downloading the first shard can be done from this link.

Note: You need to register on Kaggle in order to download the file.

After downloading the dataset we need to extract it:

mkdir /workspace
unzip -d /workspace

Now we will see a file named 00.jsonl which is extracted from the zip file.
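Before spending time on preprocessing, it can be worth checking that the extracted shard is intact. The sketch below parses the first records of a JSON Lines file and verifies that each one has a string `text` field. The function name is hypothetical, and the `text` field is an assumption based on the published Pile format; adjust it if your copy uses a different schema.

```python
import json
from itertools import islice


def check_jsonl(path: str, max_lines: int = 1000) -> int:
    """Parse up to max_lines records and return how many were checked."""
    checked = 0
    with open(path, encoding="utf-8") as handle:
        for line_no, line in enumerate(islice(handle, max_lines), start=1):
            # json.loads raises a ValueError subclass on a corrupt line.
            record = json.loads(line)
            if not isinstance(record, dict) or not isinstance(record.get("text"), str):
                raise ValueError(f"line {line_no}: missing string field 'text'")
            checked += 1
    return checked
```

For the shard extracted above, a call such as `check_jsonl("/workspace/00.jsonl")` would check the first 1000 records and raise on the first malformed one.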

To preprocess the data, we run a Docker image that mounts the extracted data and opens a terminal inside the container:

docker run -it -v /workspace:/host-dir bash

From this terminal, we prepare the environment and launch the preprocessing command, which may take a few minutes:

git clone

mkdir vocab
cd vocab


python3 NeMo-Megatron-Launcher/launcher_scripts/nemo_launcher/collections/dataprep_scripts/pile_dataprep/ +data_dir=/omer/ +vocab_save_dir=/workspace/vocab/ +tokenizer_type=gpt +launcher_scripts_path="" +merges_save_dir=/workspace/vocab/ ++rm_extracted=False

Note: Make sure the data is mounted or copied to exactly the same directory on every node, because the same pod template is used for the pods running on all nodes.
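One way to check the note above is to run a small script on each node (or in a pod scheduled to it) that reports which expected paths are missing. This is a minimal sketch: the function name is ours, and the `EXPECTED` list is a hypothetical layout based on the paths used earlier in this guide, so replace it with the files your preprocessing run actually produced.

```python
from pathlib import Path
from typing import Iterable, List


def missing_paths(root: str, expected: Iterable[str]) -> List[str]:
    """Return the entries of `expected` that do not exist under `root`."""
    base = Path(root)
    return [rel for rel in expected if not (base / rel).exists()]


# Hypothetical layout; adjust to match your own data directory.
EXPECTED = ["00.jsonl", "vocab"]
missing = missing_paths("/workspace", EXPECTED)
print("missing:", missing or "nothing")
```

An empty result on every node suggests the layout is consistent. It does not verify file contents, so comparing checksums is a reasonable next step if nodes hold separate copies.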


Now that the cluster is ready and the data is preprocessed, we can move on to the training step.

We are going to run the training job as a PyTorchJob using the K8s Training Operator, and launch it with the Megatron K8s Launcher, which you can find here.

First, let’s clone the repository:

git clone

Next, we prepare the K8s YAML files by running the command below. Set the number of workers to the number of GPUs available in your cluster. In our case, we had 32 GPUs, which corresponds to 32 workers.

cd k8s-launcher/models/language_processing/gpt3/pretraining

python3 --model 5b --num_workers 32 --results_dir /path/to/NFS --data_dir /path/to/preprocessed/data --image_pull_secret regcred
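To give a feel for what a PyTorchJob manifest looks like, here is a rough sketch that builds one as a Python dict and prints it as JSON (which `kubectl apply -f` also accepts). It is not the output of the launcher: the job name, the image placeholder, the one-GPU-per-pod layout, the split into one Master plus `num_workers - 1` Worker replicas, and the hostPath volumes are all our assumptions, and the training command itself is omitted.

```python
import json


def pytorch_job(name, image, num_workers, image_pull_secret, data_dir, results_dir):
    """Build an illustrative PyTorchJob manifest as a plain dict."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")

    def replica_spec(replicas):
        return {
            "replicas": replicas,
            "restartPolicy": "OnFailure",
            "template": {"spec": {
                "imagePullSecrets": [{"name": image_pull_secret}],
                "containers": [{
                    # The Training Operator expects this container name.
                    "name": "pytorch",
                    "image": image,
                    # Assumption: one GPU per pod.
                    "resources": {"limits": {"nvidia.com/gpu": 1}},
                    "volumeMounts": [
                        {"name": "data", "mountPath": data_dir},
                        {"name": "results", "mountPath": results_dir},
                    ],
                }],
                # Assumption: the directories exist at the same path on every node.
                "volumes": [
                    {"name": "data", "hostPath": {"path": data_dir}},
                    {"name": "results", "hostPath": {"path": results_dir}},
                ],
            }},
        }

    # Assumption: the total process count is one Master plus the Workers.
    replicas = {"Master": replica_spec(1)}
    if num_workers > 1:
        replicas["Worker"] = replica_spec(num_workers - 1)

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {"pytorchReplicaSpecs": replicas},
    }


job = pytorch_job(
    name="gpt3-5b-example",            # hypothetical job name
    image="<training-image>",          # placeholder, not a real image tag
    num_workers=32,
    image_pull_secret="regcred",
    data_dir="/path/to/preprocessed/data",
    results_dir="/path/to/NFS",
)
print(json.dumps(job, indent=2))
```

Comparing this shape with the files the launcher writes to the results directory is a useful way to see how the secret, the data path, and the worker count from the command line end up in the job definition.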

Apply the files to K8s to launch the training run:

kubectl apply -f results

And now wait for the model to be trained! :-)
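While waiting, the job's progress can be read from its status conditions, for example from the output of `kubectl get pytorchjob <name> -o json`. The sketch below reduces those conditions to a single state. The function name and the sample payload are ours, and the priority order is an assumption about how to rank conditions that are simultaneously true.

```python
# Condition types checked from most to least decisive (our assumption).
PRIORITY = ["Failed", "Succeeded", "Restarting", "Running", "Created"]


def job_state(job: dict) -> str:
    """Summarize a PyTorchJob's status conditions as one state string."""
    conditions = job.get("status", {}).get("conditions", [])
    true_types = {c.get("type") for c in conditions if c.get("status") == "True"}
    for state in PRIORITY:
        if state in true_types:
            return state
    return "Unknown"


# Hand-written sample of a job that has been created and is running.
sample_job = {"status": {"conditions": [
    {"type": "Created", "status": "True"},
    {"type": "Running", "status": "True"},
]}}
print(job_state(sample_job))
```

Polling this in a loop with a sleep between calls gives a simple wait-until-done script; the pod logs remain the place to look for training loss and throughput.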


In this guide, we walked through the process of training NeMo LLMs on a Kubernetes GPU cluster. We covered everything from setting up the necessary infrastructure, including GPU and network support, to data preparation and the actual training process. Our goal has been to simplify the complex journey of training large language models and make it more accessible to AI practitioners. By openly sharing our tools and launcher scripts in our repository, we aim to help AI practitioners adopt Kubernetes faster for the development of Generative AI applications.
