Question 1

The Challenges of Scheduling AI Workloads on Kubernetes

Accepted Answer

This article explains the basics of Kubernetes scheduling. The guide explains how Kubernetes, the de-facto choice for container orchestration, is not suited for scheduling and orchestration of Deep Learning workloads. We will address the specific areas where Kubernetes falls short for AI and how you can address those shortfalls.

Question 2

Kubernetes Scheduling Basics

Accepted Answer

In Kubernetes, scheduling means making sure that pods are attached to worker nodes. The default Kubernetes scheduler is kube-scheduler, which runs in the cluster’s master node and “watches” for newly created pods that have no node assigned. The scheduler first filters the existing cluster nodes according to the container/pod’s resource configurations and identifies “feasible” nodes that meet the scheduling requirements. It then scores the feasible nodes and picks the node with the highest score to run the pod. The scheduler notifies the master node’s API server about the decision in a binding process.

If no suitable node is found, the pod is unscheduled until the scheduler succeeds in finding a match.

What's Missing?

Kubernetes was built for running microservices with scale-out architecture in mind. The default Kubernetes scheduler is therefore not ideal for AI workloads, lacking critical high-performance scheduling components like batch scheduling, preemption, and multiple queues for efficiently orchestrating long running jobs. In addition, K8s is missing gang scheduling for scaling up parallel processing AI workloads to multiple distributed nodes, and topology awareness for optimizing performance.

Question 3

Batch Scheduling Explained

Accepted Answer

Batch workloads are jobs that run to completion unattended (i.e., without user intervention). Batch processing and scheduling is commonly used in High Performance Computing (HPC) but the concept can easily be applied to data science and AI. With batch processing, training models can start, end, and then shut down, all without any manual intervention. Plus, when the container terminates, the resources are released and can be allocated to other workloads.

The scheduler that is native to Kubernetes does not use batch scheduling methods like multi-queue scheduling, fairness, advanced preemption mechanisms, and more, all of which are needed to efficiently manage the lifecycle of batch workloads. With such capabilities jobs can be paused and resumed automatically according to predefined priorities and policies, taking into account the fluctuating demands and the load of the cluster. Batch scheduling also prevents jobs from being starved by heavy users and ensures fairness between multiple users sharing a cluster.

Question 4

What is Topology Awareness?

Accepted Answer

Another challenge of running AI workloads on Kubernetes relates to a concept called ‘topology awareness’. This refers to:

inter-node communication and
how resources within a node inter-connect
These two topological factors that have major impact on the runtime performance of workloads. In clusters managed by a centralized orchestration system, the responsibility of provisioning resources and optimizing allocations according to these topological factors is at the hands of the cluster manager. Kubernetes has not yet addressed topology awareness efficiently, resulting in lower performance when sub-optimal resources are provisioned. Performance inconsistency is another issue -workloads may run at maximum speed, but often poor hardware setup leads to lower performance.

Scheduler awareness to the topology of interconnect links between nodes is important for distributed workloads with parallel workers communicating across machines. In these cases, it is critical that the scheduler binds pods to nodes with fast interconnect communication links. For example, nodes located in the same rack would typically communicate faster and with lower latency than nodes located in different racks. The default K8s scheduler today does not account for inter-node communication.

Another important aspect of topology awareness relates to how different resources within a node are communicating. Typically, multiple CPU sockets, memory units, network interface cards (NICs), and multiple peripheral devices like GPUs, are all set up in a node in a topology that is not always symmetric. For example, different memory units can be connected to different CPU sockets and a workload running on a specific CPU socket would gain the fastest read/write data access when using the memory unit closest to the CPU socket. Another example would be a workload running on multiple GPUs in a node with non-uniform topology of inter-GPU connectors. Provisioning the optimal mix of CPUs, memory units, NICs, GPUs, etc., is often called NUMA (non-uniform memory access) alignment.

Topology awareness relating to NUMA alignment has been addressed by Kubernetes but the current implementation is limited and highly inefficient - the Kubernetes scheduler allocates a node for a workload without knowing if CPU/memory/GPU/NIC alignment can be applied. If such alignment is not feasible on the chosen node, best-effort configuration would run the workload using a sub-optimal alignment while restricted configuration would fail the workload. Importantly, sub-optimal alignment and a failure to run a workload can occur even in cases where other nodes that can satisfy NUMA alignment are available in the cluster.

The limitations of topology-awareness relate to a basic flaw in Kubernetes architecture. The scheduling mechanism of Kubernetes is based on splitting responsibilities between the scheduler which operates at the cluster level and Kubelet which operates at the node level. The scheduler allocates nodes for containers based on information about the number of resources available in each node, without being aware of the topology of the nodes, the topology of the resources within a node, and which exact resources are actually available at a given moment. Kubelet, together with components of Linux OS and device plugins, is responsible for scheduling the containers and for allocating their resources within the node. This architecture is perfect for orchestrating microservices running within a node, but fails to provide high, consistent performance when orchestrating compute-intensive jobs and distributed workloads.

Compare Kubernetes vs Slurm schedulers in another guide from this series.

Question 5

Gang Scheduling

Accepted Answer

The third AI-focused component missing from Kubernetes is gang scheduling. Gang scheduling is used when containers need to be launched together, start together, and end together. For example, this capability is required for distributed workloads to ensure that different containers are launched on different nodes only when enough resources are available, preventing inefficiencies and dead-lock situations where one group of containers are launched while others are waiting for resources to become available. Gang scheduling can also help with recovery when some of the containers fail, without requiring a restart of the entire workload.

Compare Kubernetes vs Slurm schedulers in another guide from this series.

Kubernetes Scheduling for AI

Challenges of Using K8s for AI

Related Articles

The Challenges of Scheduling AI Workloads on Kubernetes

Kubernetes Scheduling Basics

Scale-out vs. Scale-up Architecture

Batch Scheduling Explained

What is Topology Awareness?

Gang Scheduling

Kubernetes Scheduling with Run:AI

See Our Additional Guides on Key Artificial Intelligence Infrastructure Topics

GPUs for Deep Learning

MLOps