IT & Data Science

Multi-node Distributed Training on Kubernetes with Run:ai and Tensorflow

August 20, 2023


When it comes to training big models or handling large datasets, relying on a single node might not be sufficient and can lead to slow training processes. This is where multi-node training comes to the rescue. There are several incentives for teams to transition from single-node to multi-node training. Some common reasons include:

  • Faster Experimentation: Speeding Up Training

In research and development, time is of the essence. Teams often need to accelerate the training process to obtain experimental results quickly. Employing multi-node training techniques, such as data parallelism, helps distribute the workload and leverage the collective processing power of multiple nodes, leading to faster training times.

  • Large Batch Sizes: Data Parallelism

When the batch size required by your model is too large to fit on a single machine, data parallelism becomes crucial. It involves duplicating the model across multiple GPUs, with each GPU processing a subset of the data simultaneously. This technique helps distribute the workload and accelerate the training process.

  • Large Models: Model Parallelism

In scenarios where the model itself is too large to fit on a single machine's memory, model parallelism is utilized. This approach involves splitting the model across multiple GPUs, with each GPU responsible for computing a portion of the model's operations. By dividing the model's parameters and computations, model parallelism enables training on machines with limited memory capacity.

In this tutorial, we will focus on data parallelism using Tensorflow, specifically the tf.distribute.MultiWorkerMirroredStrategy. If you are unsure which parallelism strategy fits your use case, please refer to the ‘Parallelism Strategies for Distributed Training’ blog post.

Data Parallelism in Tensorflow

Data parallelism is a technique in distributed training where the model is replicated across multiple devices or machines, and each replica processes a portion of the training data. This enables efficient utilization of resources and accelerates the training process.

The tf.distribute.MultiWorkerMirroredStrategy is a distribution strategy in Tensorflow that supports synchronous training on multiple workers. It replicates all variables and computations to each local device, utilizing a distributed collective implementation such as all-reduce to enable multiple workers to work together.
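In code, this boils down to creating the strategy and building the model's variables inside its scope. Here is a minimal sketch of the idea (not the full training script used later in this guide):


import tensorflow as tf

# The strategy reads TF_CONFIG to discover its peer workers and sets up
# collective (all-reduce) communication between them.
strategy = tf.distribute.MultiWorkerMirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

# Variables created inside the scope are replicated on every worker and
# kept identical by all-reduce after each gradient update.
with strategy.scope():
    weights = tf.Variable(tf.zeros((10, 1)))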

Step-by-Step Guide: Running Tensorflow Distributed Training on Run:ai

Prerequisites

Before diving into the distributed training setup, ensure that you have the following prerequisites in place:

  1. Run:ai environment (v2.10 and later): Make sure you have access to a Run:ai environment with version 2.10 or a later release. Run:ai provides a comprehensive platform for managing and scaling deep learning workloads on Kubernetes.
  2. Two nodes with one GPU each: For this tutorial, we will use a setup consisting of two nodes, each equipped with one GPU. However, you can scale up by adding more nodes or GPUs to suit your specific requirements.
  3. Image Registry (e.g., Docker Hub Account): Prepare an image registry, such as a Docker Hub account, where you can store your custom Docker images for distributed training.

Modifying the Training Script for Distributed Training

To prepare your code for distributed training, you need to make some modifications to the training script. In this tutorial, we will use a slightly different version of the example script provided in the Tensorflow documentation. Please refer to the official Tensorflow documentation for more details. You can find the modified training script and all other documents presented in this guide in this GitHub repository.
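The actual distributed.py lives in the repository linked above; as a rough sketch of what such a multi-worker script looks like (following the pattern of the official Tensorflow tutorial, with MNIST standing in as an example dataset and illustrative hyperparameters):


import tensorflow as tf

# The strategy must be created at the start of the program; it reads
# TF_CONFIG to discover the other workers in the cluster.
strategy = tf.distribute.MultiWorkerMirroredStrategy()
num_workers = strategy.num_replicas_in_sync

# Scale the global batch size so that each worker still processes
# per_worker_batch_size examples per step.
per_worker_batch_size = 64
global_batch_size = per_worker_batch_size * num_workers

def make_dataset(batch_size):
    (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
    x_train = x_train[..., None].astype("float32") / 255.0
    dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
    return dataset.shuffle(60000).batch(batch_size).repeat()

# The model and its variables must be built inside strategy.scope()
# so that they are mirrored across all workers.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

model.fit(make_dataset(global_batch_size), epochs=3, steps_per_epoch=70)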

After creating the training script, we will create a bash script, launch.sh, which exports the TF_CONFIG environment variable (passed by the system to each pod) and launches the training job. Here is an overview of the launch.sh script:


#!/bin/bash

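# TF_CONFIG is provided by the system to each pod; make sure it is
# exported so that the Python process can read it.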
export TF_CONFIG

python distributed.py

The TF_CONFIG environment variable is essential for configuring distributed training. It provides information about the cluster, including the worker's task type and task ID.
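To make this concrete, here is an illustrative example of the structure TF_CONFIG takes for the first of two workers (the host names are placeholders), together with how distributed.py could inspect the value each pod receives:


import json
import os

# Illustrative TF_CONFIG for the first of two workers (host names are
# placeholders; on Run:ai the variable is injected into each pod):
#
#   {"cluster": {"worker": ["worker-0:2222", "worker-1:2222"]},
#    "task": {"type": "worker", "index": 0}}

# Inside distributed.py, each pod can inspect the value it received:
tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
print("Task type: ", tf_config.get("task", {}).get("type"))
print("Task index:", tf_config.get("task", {}).get("index"))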

Creating the Docker Image

Now, let's create a Docker image to encapsulate our training environment. In the Dockerfile, we will pull the Tensorflow base image, copy all the required files to the container's working directory, make the launch.sh script executable, and run it. Here is an overview of the Dockerfile:


# Base image
FROM tensorflow/tensorflow:2.10.1-gpu

# Set the working directory
WORKDIR /app

# Copy the Python script and other required files
COPY launch.sh utils.py distributed.py /app/

# Set the command to run when the container starts
# Run the bash file
RUN chmod u+x launch.sh
CMD ["./launch.sh"]

Side Note: In future versions of Run:ai, the step of creating and building Docker images will be automated.

Building and Pushing the Docker Image

Next, log in to your Docker Hub account, then build and push the Docker image using the following commands:


$ docker login -u YOUR-USER-NAME 
$ docker build -t YOUR-USER-NAME/distributed_training_tensorflow .
$ docker push YOUR-USER-NAME/distributed_training_tensorflow

Make sure to replace YOUR-USER-NAME with your actual Docker Hub username. These commands will build the image and push it to your Docker Hub repository. You can find the image that we created for this guide here.

Launching the Distributed Training on Run:ai

To start the distributed training on Run:ai, ensure that you have the correct version of the CLI (v2.10 or later). To launch a distributed Tensorflow training job, use the runai submit-tf command (CLI versions below 2.13) or runai submit-dist tf (2.13 and above). Here is the command to submit our job:


# For CLI versions below 2.13:
runai submit-tf --name distributed-training-tf --workers=2 -g 1 \
        -i docker.io/YOUR-USER-NAME/distributed_training_tensorflow --no-master

# For CLI version 2.13 and above:
runai submit-dist tf --name distributed-training-tf --workers=2 -g 1 \
        -i docker.io/YOUR-USER-NAME/distributed_training_tensorflow --no-master

In this example, we requested 2 workers with 1 GPU each, matching our two-node setup. After submitting the job, you can monitor the logs of the worker nodes through the Run:ai user interface or with the CLI command runai describe job distributed-training-tf -p project_name.

Note that we used the --no-master flag, which skips creating a master pod and launches 2 worker pods instead.

Figure 1: Logs of the first worker node on UI
Figure 2: Logs of the second worker node on UI

Conclusion

In this tutorial, we explored the process of running multi-node distributed training on Kubernetes with Run:ai using Tensorflow. Distributed training offers faster experimentation, the ability to handle large batch sizes, and support for training large models. The tf.distribute.MultiWorkerMirroredStrategy replicates all variables and computations to each local device, allowing teams to distribute the workload and leverage the collective processing power of multiple nodes or GPUs. By following this step-by-step guide, you can set up distributed training on Run:ai and take advantage of its platform for managing and scaling deep learning workloads on Kubernetes.