Kubernetes Scheduling: Complete Guide & Requirements for AI Workloads

What Is Scheduling in Kubernetes?

In Kubernetes, scheduling means assigning pods to worker nodes. The default Kubernetes scheduler is kube-scheduler, which runs as part of the cluster’s control plane and “watches” for newly created pods that have no node assigned. The scheduler first filters the existing cluster nodes according to the pod’s resource requests and other scheduling constraints, identifying “feasible” nodes that meet the scheduling requirements. It then scores the feasible nodes and picks the node with the highest score to run the pod. Finally, the scheduler notifies the API server of its decision in a process called binding.

If no suitable node is found, the pod remains unscheduled (in the Pending state) until the scheduler succeeds in finding a match.

In this article:

How Kubernetes Scheduling Works
Understanding the K8s Scheduling Algorithm with Examples
Kubernetes Scheduler Limitations and Challenges
Kubernetes Scheduling for AI Workloads
Kubernetes Scheduling with Run:ai

How Kubernetes Scheduling Works

Kubernetes scheduling involves several stages to ensure that pods are optimally placed on nodes. The process begins when the kube-scheduler detects newly created pods without node assignments. It follows a series of steps to find the best possible node for each pod.

Filtering: The scheduler filters out nodes that do not meet the pod’s requirements. These requirements include resource requests (CPU, memory), node selectors, and taints/tolerations.
Scoring: The remaining feasible nodes are scored based on various criteria such as resource availability, affinity/anti-affinity rules, and load distribution.
Selection: The node with the highest score is selected for the pod.
Binding: The scheduler binds the pod to the selected node by updating the API server.
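
As a minimal sketch of the inputs the filtering step reads, the following pod declares a node selector and resource requests. The pod name and the `disktype: ssd` label are illustrative assumptions, not values from a real cluster; a node would need to carry that label for the pod to be feasible there.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: filter-demo            # hypothetical name
spec:
  nodeSelector:
    disktype: ssd              # hypothetical node label; nodes without it are filtered out
  containers:
  - name: app
    image: nginx
    resources:
      requests:                # filtering compares requests with each node's free allocatable capacity
        cpu: "500m"
        memory: "256Mi"
```

If no node carries the label or has enough unreserved CPU and memory, the pod stays in the Pending state until one does.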

Understanding the K8s Scheduling Algorithm with Examples

Node Selection

During the node selection phase, the scheduler filters nodes based on a set of predicates (implemented as filter plugins in the current scheduling framework). These predicates ensure that the node can accommodate the pod's requirements. Key predicates include:

Resource requests: Ensuring the node has sufficient unreserved CPU and memory.
Node affinity/anti-affinity: Matching pods to required or preferred nodes, or steering them away from certain nodes (for example, with the NotIn or DoesNotExist operators).
Taints and tolerations: Nodes may have taints to repel certain pods; tolerations allow pods to tolerate these taints.
Pod affinity/anti-affinity: Ensuring pods are placed in the same topology domain (such as a node or zone) as other pods, or in a different one, based on defined rules.

Here is an example of a pod manifest that defines node affinity:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/e2e-az-name
            operator: In
            values:
            - e2e-az1
            - e2e-az2
  containers:
  - name: my-container
    image: nginx
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"
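
Taints and tolerations from the list above can be sketched in a similar way. This example assumes a node has been tainted with a hypothetical `dedicated=gpu:NoSchedule` taint (for instance with `kubectl taint nodes <node-name> dedicated=gpu:NoSchedule`); the key and value are illustrative.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tolerating-pod         # hypothetical name
spec:
  tolerations:
  - key: "dedicated"           # must match the taint key on the node
    operator: "Equal"
    value: "gpu"               # must match the taint value
    effect: "NoSchedule"       # must match the taint effect
  containers:
  - name: my-container
    image: nginx
```

A toleration only permits the pod to land on the tainted node; it does not attract it there. To actually steer the pod to such nodes, a toleration is typically combined with node affinity or a node selector.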

Scoring and Prioritization

After filtering, the scheduler scores the feasible nodes. Scoring is based on several factors, such as:

Resource utilization: Preference for nodes with balanced resource usage.
Pod spread constraints: Distributing pods across nodes, zones, or other topology domains to avoid overloading a single node and to limit the impact of a failure.
Custom scoring plugins: Custom rules defined in the scheduler configuration.

The node with the highest score is chosen to host the pod. If multiple nodes have the same score, one is selected at random.
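
As an illustration of a spread constraint that influences scoring, the sketch below asks the scheduler to keep pods labeled `app: web` evenly distributed across nodes. The pod name and label are assumptions made for the example.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: spread-pod             # hypothetical name
  labels:
    app: web                   # hypothetical label used by the selector below
spec:
  topologySpreadConstraints:
  - maxSkew: 1                          # allowed difference in matching pod count between domains
    topologyKey: kubernetes.io/hostname # each node is one topology domain
    whenUnsatisfiable: ScheduleAnyway   # soft constraint: affects scoring, not filtering
    labelSelector:
      matchLabels:
        app: web
  containers:
  - name: my-container
    image: nginx
```

Setting `whenUnsatisfiable: DoNotSchedule` instead would turn the constraint into a hard filter, leaving the pod Pending if the skew cannot be respected.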

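Here’s an example of a legacy scheduler Policy for prioritization. Note that the Policy API was removed in Kubernetes v1.23 in favor of KubeSchedulerConfiguration, so this format applies only to older clusters: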

{
  "kind": "Policy",
  "apiVersion": "v1",
  "priorities": [
    {
      "name": "MostRequestedPriority",
      "weight": 1
    },
    {
      "name": "BalancedResourceAllocation",
      "weight": 1
    }
  ]
}
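
On clusters where the legacy Policy format is no longer accepted (it was removed in Kubernetes v1.23), a roughly comparable preference for fuller nodes can be sketched with a KubeSchedulerConfiguration file passed to kube-scheduler through its `--config` flag. This assumes a cluster recent enough to serve the `kubescheduler.config.k8s.io/v1` API; the weights are illustrative, and balanced allocation scoring is, to the best of our knowledge, already enabled by default, so it is not listed explicitly.

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: MostAllocated    # prefer nodes that are already more heavily requested
        resources:
        - name: cpu
          weight: 1            # illustrative weights, mirroring the Policy above
        - name: memory
          weight: 1
```

This is a sketch rather than a drop-in migration: the old priority names do not map one-to-one onto plugins, so the resulting scores may differ from those produced by the Policy.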

Resource Allocation and Bin Packing

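Resource allocation determines how densely pods are packed onto nodes:

Bin packing: A bin-packing strategy places pods on as few nodes as possible to utilize resources efficiently, which reduces the number of nodes needed and therefore cost. Note that the default kube-scheduler scoring favors nodes with more free resources (the LeastAllocated strategy), which tends to spread pods; bin packing has to be enabled through scoring strategies such as MostAllocated or RequestedToCapacityRatio.
Overcommitment: Kubernetes allows a container's limits to exceed its requests. The scheduler only guarantees that the sum of requests fits within a node's allocatable capacity, so the sum of limits on a node can exceed its physical capacity, relying on the assumption that not all pods will use their full limits simultaneously.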

Here’s an example of how to set up a pod with resource requests and limits:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: nginx
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"

Pod Binding and Affinity Rules

The final step in scheduling is binding the pod to the chosen node. The affinity rules themselves are evaluated earlier, during filtering and scoring, and determine which node is ultimately bound:

Binding: The scheduler communicates with the API server to bind the pod to the selected node.
Node affinity: Ensuring that pods are placed on specific nodes based on affinity rules.
Inter-pod affinity/anti-affinity: Placing pods in relation to other pods, either in the same topology domain (such as a node) or in different ones, depending on the rules defined.

Here’s an example pod with inter-pod affinity:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  affinity:
    podAffinity:
      # requiredDuringSchedulingIgnoredDuringExecution takes a list of affinity terms
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: security
            operator: In
            values:
            - S1
        topologyKey: "kubernetes.io/hostname"
  containers:
  - name: my-container
    image: nginx
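
The anti-affinity case can be sketched the same way. The example below expresses a soft preference to keep pods labeled `app: web` on different nodes; the pod name, label, and weight are assumptions chosen for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-replica             # hypothetical name
  labels:
    app: web                   # hypothetical label matched by the selector below
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100            # 1-100; higher values weigh more heavily in scoring
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: web
          topologyKey: "kubernetes.io/hostname"  # avoid nodes already running a matching pod
  containers:
  - name: my-container
    image: nginx
```

Because the rule is "preferred" rather than "required", the scheduler may still co-locate replicas when no other feasible node exists.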

Kubernetes Scheduler Limitations and Challenges

Here are some of the potential drawbacks of the default Kubernetes scheduler.

Resource Requests and Limits (Noisy Neighbors)

Kubernetes allows users to set resource requests and limits for each pod, ensuring that pods get the required resources while also capping their maximum usage. However, this can lead to a problem known as the "noisy neighbor" issue. This occurs when one pod consumes resources excessively, impacting the performance of other pods on the same node.

Even with limits in place, contention for CPU, memory, or disk I/O can degrade overall system performance, especially when resource-intensive pods are colocated with latency-sensitive ones.

Lack of Resources for System Processes

Nodes must reserve resources for system processes in addition to the resources allocated to pods. If a node becomes too full with user pods, it might not have enough resources left to run essential system services, which can lead to node instability and crashes.

Kubernetes attempts to prevent this by reserving a portion of resources for system daemons through configurations like kube-reserved and system-reserved. However, improper configuration or high pod density can still lead to resource starvation, impacting the node's health and performance.
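
A minimal sketch of such a reservation in a kubelet configuration file is shown below. The quantities are illustrative assumptions and should be sized to the node and the daemons it actually runs.

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
kubeReserved:                  # set aside for Kubernetes daemons such as the kubelet and container runtime
  cpu: "500m"
  memory: "1Gi"
systemReserved:                # set aside for OS daemons such as sshd and systemd services
  cpu: "500m"
  memory: "1Gi"
evictionHard:
  memory.available: "500Mi"    # the kubelet starts evicting pods below this threshold
```

The scheduler sees the node's allocatable capacity, which is the node capacity minus these reservations and the hard eviction threshold, so pods cannot request resources that have been set aside.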

Preempted or Unscheduled Pods

Kubernetes may preempt lower-priority pods to make room for higher-priority ones, ensuring that critical workloads are not starved of resources. This can disrupt running applications, leading to downtime and data loss if the applications are not designed to handle such interruptions.

Additionally, pods that remain pending due to insufficient resources can create a backlog, complicating resource management and delaying application deployment.
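
Pod priority is expressed through a PriorityClass. The sketch below defines a hypothetical low-priority class whose pods never preempt others, and a pod that uses it; the names and the priority value are assumptions made for the example.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-low              # hypothetical class name
value: 1000                    # higher values mean higher priority
preemptionPolicy: Never        # pods of this class wait in the queue instead of preempting others
globalDefault: false
description: "Low-priority class for interruptible batch work."
---
apiVersion: v1
kind: Pod
metadata:
  name: low-priority-pod       # hypothetical name
spec:
  priorityClassName: batch-low
  containers:
  - name: my-container
    image: nginx
```

Pods in this class can still be preempted by higher-priority pods, so workloads using it should be able to tolerate interruption, for example by checkpointing their progress.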

Kubernetes Scheduling for AI Workloads

Kubernetes scheduling for AI workloads requires a different approach. Let’s review the key considerations.

Scale-Out vs Scale-Up Architecture

Kubernetes was built as a hyperscale system with a scale-out architecture for running services. AI/ML workloads require a different approach: they should run on high-performance systems that can efficiently scale up workloads.

What Is a Hyperscale System?

Hyperscale systems were designed and built to run microservices that can serve millions of requests. Such services are always up, waiting for triggers to take action and serve incoming calls, needing to support peak demands that can grow notably with respect to average demand.

Hyperscale systems are typically based on cost-efficient hardware that allows each application to support millions of service requests at a sufficiently low price.

Scheduling for Hyperscale Systems

Hyperscale systems require a scheduling approach that spreads a large number of service instances on multiple servers to be resilient to server failures, and even to multiple zones and regions to be resilient to data center outages. They are based on auto-scaling mechanisms that quickly scale out infrastructure, spinning machines up and down to dynamically support demand in a cost-efficient way. Kubernetes was built to satisfy such requirements.

What Is a High-Performance System?

A high-performance system with scale-up architecture is one in which workloads are running across multiple machines, requiring high-speed, low-latency networking and software programs that can run distributed processes for parallel computing.

High-performance systems support workloads for data science, big data analytics, AI, and HPC. In these scenarios, the infrastructure should support tens to thousands of long-running workloads concurrently, not millions of short, concurrent service requests as is the case with microservices. AI workloads run to completion, starting and ending by themselves without user intervention (these are called 'batch jobs', which we will address in more detail later), typically for long durations ranging from hours to days and, in some cases, weeks.

Infrastructure for data science and HPC needs to have the capability to host compute-intensive workloads and process them fast enough. It is therefore based on high-end, expensive hardware, including in some cases specialized accelerators like GPUs, which typically results in a high cost per workload or user.

Scheduling for High-Performance Systems

For high-performance systems to work efficiently, they need to enable large workloads that require considerable resources to coexist efficiently with small workloads requiring fewer resources. These requirements are very different from the spread scheduling and scale-out mechanisms required for microservices:

Bin packing and consolidation: Placing as many workloads as possible on a single machine improves hardware utilization and reduces machine fragmentation.
Reserved instances and backfill scheduling: These prevent cases where large workloads requiring multiple resources have to wait in the queue for a long time.
Batch scheduling and preemption: These mechanisms orchestrate long-running jobs dynamically according to priorities and fairness policies.
Elasticity: A single workload can scale up to use more resources according to availability.

Batch Scheduling Explained

Batch workloads are jobs that run to completion unattended (i.e., without user intervention). Batch processing and scheduling is commonly used in High Performance Computing (HPC) but the concept can easily be applied to data science and AI. With batch processing, training models can start, end, and then shut down, all without any manual intervention. Plus, when the container terminates, the resources are released and can be allocated to other workloads.
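
In Kubernetes, a run-to-completion workload of this kind is typically expressed as a Job. The sketch below is illustrative only: the image, command, and Job name are hypothetical, and the `nvidia.com/gpu` resource assumes that the NVIDIA device plugin is installed on the cluster.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: train-model                      # hypothetical name
spec:
  backoffLimit: 2                        # retry a failed pod at most twice
  ttlSecondsAfterFinished: 3600          # delete the finished Job and its pods after one hour
  template:
    spec:
      restartPolicy: Never               # run to completion; do not restart in place
      containers:
      - name: trainer
        image: registry.example.com/trainer:latest   # hypothetical image
        command: ["python", "train.py"]              # hypothetical entry point
        resources:
          limits:
            nvidia.com/gpu: 1            # assumes the NVIDIA device plugin is installed
```

Once the container exits, the GPU and other requested resources are released and become available to other pending pods.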

The scheduler that is native to Kubernetes does not use batch scheduling methods like multi-queue scheduling, fairness, advanced preemption mechanisms, and more, all of which are needed to efficiently manage the lifecycle of batch workloads. With such capabilities jobs can be paused and resumed automatically according to predefined priorities and policies, taking into account the fluctuating demands and the load of the cluster. Batch scheduling also prevents jobs from being starved by heavy users and ensures fairness between multiple users sharing a cluster.

What Is Topology Awareness?

Another challenge of running AI workloads on Kubernetes relates to a concept called ‘topology awareness’. This refers to:

inter-node communication and
how resources within a node inter-connect

These two topological factors have a major impact on the runtime performance of workloads. In clusters managed by a centralized orchestration system, the responsibility for provisioning resources and optimizing allocations according to these topological factors lies with the cluster manager. Kubernetes has not yet addressed topology awareness efficiently, resulting in lower performance when sub-optimal resources are provisioned. Performance inconsistency is another issue: workloads may sometimes run at maximum speed, but a poorly matched hardware allocation often leads to lower performance.

Scheduler awareness of the topology of interconnect links between nodes is important for distributed workloads with parallel workers communicating across machines. In these cases, it is critical that the scheduler binds pods to nodes with fast interconnect communication links. For example, nodes located in the same rack would typically communicate faster and with lower latency than nodes located in different racks. The default K8s scheduler today does not account for inter-node communication.

Another important aspect of topology awareness relates to how different resources within a node are communicating. Typically, multiple CPU sockets, memory units, network interface cards (NICs), and multiple peripheral devices like GPUs, are all set up in a node in a topology that is not always symmetric. For example, different memory units can be connected to different CPU sockets and a workload running on a specific CPU socket would gain the fastest read/write data access when using the memory unit closest to the CPU socket. Another example would be a workload running on multiple GPUs in a node with non-uniform topology of inter-GPU connectors. Provisioning the optimal mix of CPUs, memory units, NICs, GPUs, etc., is often called NUMA (non-uniform memory access) alignment.

Topology awareness relating to NUMA alignment has been addressed by Kubernetes, but the current implementation is limited and highly inefficient: the Kubernetes scheduler allocates a node for a workload without knowing if CPU/memory/GPU/NIC alignment can be applied. If such alignment is not feasible on the chosen node, the best-effort policy would run the workload using a sub-optimal alignment, while the restricted policy would fail the workload. Importantly, sub-optimal alignment and a failure to run a workload can occur even in cases where other nodes that can satisfy NUMA alignment are available in the cluster.
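
These policies are set per node in the kubelet configuration, not in the scheduler, which is why the scheduler cannot take them into account. A minimal sketch follows; the reserved CPU IDs are an assumption and depend on the node's hardware.

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static           # allows exclusive CPU assignment to Guaranteed pods
reservedSystemCPUs: "0,1"          # illustrative CPU IDs kept for system daemons
topologyManagerPolicy: restricted  # reject pods whose resources cannot be NUMA-aligned
topologyManagerScope: pod          # align all containers of a pod together
```

With `topologyManagerPolicy: best-effort`, the kubelet would instead admit the pod with whatever alignment it can achieve on that node.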

The limitations of topology-awareness relate to a basic flaw in Kubernetes architecture. The scheduling mechanism of Kubernetes is based on splitting responsibilities between the scheduler which operates at the cluster level and Kubelet which operates at the node level. The scheduler allocates nodes for containers based on information about the number of resources available in each node, without being aware of the topology of the nodes, the topology of the resources within a node, and which exact resources are actually available at a given moment. Kubelet, together with components of Linux OS and device plugins, is responsible for scheduling the containers and for allocating their resources within the node. This architecture is perfect for orchestrating microservices running within a node, but fails to provide high, consistent performance when orchestrating compute-intensive jobs and distributed workloads.

Compare Kubernetes vs Slurm schedulers in another guide from this series.

Gang Scheduling

The third AI-focused component missing from Kubernetes is gang scheduling. Gang scheduling is used when containers need to be launched together, start together, and end together. For example, this capability is required for distributed workloads to ensure that their containers are launched on different nodes only when enough resources are available for all of them, preventing inefficiencies and deadlock situations where one group of containers is launched while others are waiting for resources to become available. Gang scheduling can also help with recovery when some of the containers fail, without requiring a restart of the entire workload.
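
To make the problem concrete, the sketch below shows a four-worker distributed Job whose pods are handed to an alternative scheduler through the standard `schedulerName` field. The scheduler name, image, and Job name are hypothetical, and the `nvidia.com/gpu` resource assumes the NVIDIA device plugin is installed. Real gang-aware schedulers usually also need their own grouping metadata (labels, annotations, or custom resources), which is omitted here.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training               # hypothetical name
spec:
  completionMode: Indexed                  # each worker gets a stable index
  completions: 4
  parallelism: 4                           # all four workers are expected to run at the same time
  template:
    spec:
      schedulerName: example-gang-scheduler   # hypothetical scheduler; the default is default-scheduler
      restartPolicy: Never
      containers:
      - name: worker
        image: registry.example.com/trainer:latest   # hypothetical image
        resources:
          limits:
            nvidia.com/gpu: 1              # assumes the NVIDIA device plugin is installed
```

With the default scheduler, these pods would be placed one at a time, so two workers could start and hold GPUs while the other two wait indefinitely. A gang-aware scheduler would place all four or none. If no scheduler with the given name is running, the pods simply remain Pending.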

Kubernetes Scheduling with Run:ai

Run:ai’s Scheduler plugs into Kubernetes clusters to enable optimized orchestration of high-performance AI workloads.

High-performance system: For scale-up infrastructures that pool resources and enable large workloads that require considerable resources to coexist efficiently with small workloads requiring fewer resources.
Batch scheduling: Training jobs can start, pause, restart, end, and then shut down, all without any manual intervention. Plus, when the container terminates, the resources are released and can be allocated to other workloads for greater system efficiency.
Topology awareness: Awareness of inter-resource and inter-node communication enables consistent high performance of AI workloads.
Gang scheduling: Containers can be launched together, start together, and end together for distributed workloads that need considerable resources.

Run:ai simplifies Kubernetes scheduling for AI workloads, helping data scientists improve their productivity and the quality of their models.
