As AI teams increasingly accept Kubernetes as the de-facto container orchestration tool, it’s more important than ever that data scientists sharing a cluster have a fair scheduling solution. Deciding which users should get the next GPU can be critical, especially when deep learning workloads can run for days or even weeks, and a wrong decision can cause starvation.
In this post, I will explain:
- How Kubernetes handles GPU scheduling
- Common pitfalls when using the Kubernetes scheduling framework
- How we created an allocation algorithm for data science workloads
- Adding pod preemption into scheduling decisions
- A real-life example of fair-share allocation of GPU
Let’s start with the way Kubernetes schedules pods.
To put it simply, the scheduler picks a pending pod and then attempts to bind the pod to the most suitable node. But how does the scheduler pick the right pod? The default scheduler stores pending pods in a heap, sorted by priority and then by creation time. So the next pod to be allocated is the one at the top of the heap: the highest-priority pod and, among pods of equal priority, the oldest one. Using the default Kubernetes scheduler on a shared cluster with a limited number of GPUs can cause fairness issues, because it’s easy for data scientists to monopolize the cluster.
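The ordering described above can be modeled in a few lines of Python. This is a simplified sketch, not the actual kube-scheduler code: the pod names, the integer timestamps, and the helper functions `make_queue` and `pop_next` are made up for illustration.

```python
import heapq

def make_queue(pods):
    """Build a min-heap keyed on (-priority, created_at, name).

    heapq is a min-heap, so the priority is negated to pop the
    highest-priority pod first; created_at breaks ties (oldest first).
    """
    heap = [(-p["priority"], p["created_at"], p["name"]) for p in pods]
    heapq.heapify(heap)
    return heap

def pop_next(heap):
    """Return the name of the next pod the queue would hand out."""
    _, _, name = heapq.heappop(heap)
    return name

# Hypothetical pending pods; created_at is an illustrative timestamp.
pods = [
    {"name": "train-a", "priority": 1, "created_at": 10},
    {"name": "train-b", "priority": 5, "created_at": 30},
    {"name": "train-c", "priority": 5, "created_at": 20},
]
queue = make_queue(pods)
order = [pop_next(queue) for _ in range(len(pods))]
print(order)  # ['train-c', 'train-b', 'train-a']
```

Note that nothing in the sort key identifies who owns a pod, which is why a single queue like this cannot take fairness between users into account.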
Priority and Bulk Submission: Easy Ways to Monopolize a GPU Cluster
One way for users to monopolize a GPU cluster is to submit pods set to the highest priority they are allowed. This moves their pods to the top of the heap to be allocated first. As you can see in the below example, Bob submits pods with higher priority than Alice, so his pods will be at the top of the heap and scheduled first. Eventually, Bob can monopolize and use all the GPUs in the cluster.
The other way to monopolize the GPU cluster is to submit as many pods as possible. This fills the heap with that user’s pods, many of them with earlier creation times than the pods other users submit later. In the example below, Bob submits many pods and the heap contains mostly his pods, so again he’ll have more GPUs allocated.
To prevent monopolization of the cluster, we want to build a scheduler that knows to share the GPU resources between users fairly, regardless of creation time and pod priorities. Even if Alice submits fewer pods and pods with lower priority, she should still get the same number of GPUs as Bob.
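To make the bulk-submission problem concrete, here is a small simulation of a single queue ordered by priority and creation time. It is only a sketch under simplifying assumptions: every pod needs exactly one GPU, and the function name `schedule_single_queue` and the tuple layout are invented for this example.

```python
import heapq
from collections import Counter

def schedule_single_queue(pods, free_gpus):
    """Allocate one GPU per pod from a single (priority, creation time) queue.

    pods is a list of (user, priority, created_at) tuples.
    Returns a Counter of GPUs allocated per user.
    """
    heap = [(-priority, created_at, user) for user, priority, created_at in pods]
    heapq.heapify(heap)
    allocated = Counter()
    while heap and free_gpus > 0:
        _, _, user = heapq.heappop(heap)
        allocated[user] += 1
        free_gpus -= 1
    return allocated

# Bob submits six pods first; Alice submits two afterwards at the same priority.
pods = [("bob", 0, t) for t in range(6)] + [("alice", 0, t) for t in range(6, 8)]
result = schedule_single_queue(pods, free_gpus=4)
print(dict(result))  # {'bob': 4}
```

With four GPUs, all four go to Bob and Alice gets none, even though both users are waiting.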
The Kubernetes Scheduling Framework
In v1.15 of Kubernetes, the architecture of the scheduler was changed to use the scheduling framework. The scheduling framework is more pluggable and provides several extension points to change the behavior of the scheduler. It was a big game-changer for those wanting to build alternatives to the default scheduler, because it reduces the need for developers to build their own scheduler from scratch and then try to keep up with all the Kubernetes features.
At Run:AI, we examined the scheduling framework for implementing our scheduler and we decided not to use it. We made this decision because we wanted to use a different data structure for pending pods, rather than just a simple heap of pods (which is the only data structure the Kubernetes scheduling framework allows).
The Run:AI Scheduler
The data structure we use today at Run:AI is a two-dimensional heap designed for multi-GPU deep learning workloads. In the main heap, every node represents a user, and the heap is sorted by the number of GPUs currently allocated to each user, so the user with the fewest allocated GPUs is at the top. Each user node holds a second heap containing that user’s pending pods, sorted by priority and by creation time (just like in Kubernetes). In this manner, we can pick the next pending pod according to its owner’s load on the cluster, rather than just using a simple heap of pods.
The way we allocate pods follows these steps:
- Pop the first user from the users heap
- Pop the first pod from this user
- Allocate this pod
- Update the number of allocated GPUs for this user and reshuffle the heap
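The four steps above can be sketched with Python’s heapq module. This is a minimal illustration and not Run:AI’s implementation: it assumes every pod needs exactly one GPU, breaks ties between equally loaded users alphabetically, and the function name `fair_allocate` and the data layout are hypothetical.

```python
import heapq

def fair_allocate(pending, free_gpus, allocated=None):
    """Two-level heap allocation sketch.

    pending:   {user: [(priority, created_at, pod_name), ...]}
    allocated: {user: GPUs already in use} (defaults to zero for everyone)
    Returns (allocation order as [(user, pod_name)], updated allocated dict).
    """
    allocated = dict(allocated or {})
    pod_heaps = {}
    for user, pods in pending.items():
        # Inner heap: highest priority first, then oldest first.
        heap = [(-prio, created, name) for prio, created, name in pods]
        heapq.heapify(heap)
        pod_heaps[user] = heap
        allocated.setdefault(user, 0)

    # Outer heap: the user with the fewest allocated GPUs is on top.
    users = [(allocated[u], u) for u, heap in pod_heaps.items() if heap]
    heapq.heapify(users)

    order = []
    while users and free_gpus > 0:
        _, user = heapq.heappop(users)               # pop the first user
        _, _, pod = heapq.heappop(pod_heaps[user])   # pop that user's first pod
        order.append((user, pod))                    # allocate the pod
        free_gpus -= 1
        allocated[user] += 1                         # update the GPU count...
        if pod_heaps[user]:                          # ...and re-insert the user
            heapq.heappush(users, (allocated[user], user))
    return order, allocated

pending = {
    "alice": [(0, 5, "a1"), (0, 6, "a2")],
    "bob": [(9, 1, "b1"), (9, 2, "b2"), (9, 3, "b3"), (9, 4, "b4")],
}
order, allocated = fair_allocate(pending, free_gpus=4)
print(order)      # [('alice', 'a1'), ('bob', 'b1'), ('alice', 'a2'), ('bob', 'b2')]
print(allocated)  # {'alice': 2, 'bob': 2}
```

Even though Bob’s pods have higher priority and were created earlier, priority only orders pods within one user’s heap, so the four GPUs end up split evenly.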
GPU Scheduling in Action: Allocation Algorithm
In the following example, Alice is the first user in the users heap. So, we will allocate Alice’s first pod and update the number of allocated GPUs Alice is currently using. Once we do that, the heap will be updated and Bob will now move to the top of the heap. Notice that at the top of the users heap, we always have the most starved user. This helps us in our goal of achieving fairness. To continue allocating, we pop the user now at the top of the heap, Bob this time, and allocate his first pod. This alternating flow will continue until there are no more free GPUs in the cluster. As the animation finishes, we see the GPUs are shared fairly between Alice and Bob.
The above solution provides fairness when deciding which pods should be allocated next, but does it provide fairness between the users? Not quite yet. Imagine another scenario, below: Bob submitted many long-running pods when the GPUs were free. Because he was the only active user, we allowed Bob to use the whole cluster and maximize utilization of the GPUs. But when Alice submits pods, the pods remain in a pending state and Alice will be starved until Bob finishes using the GPUs. Bob could be running the GPUs for weeks, so Alice might be starved for a very long time. Definitely not fair.
GPU Scheduling in Action: Enabling Preemption of Users
In order to prevent long-running pods from blocking new pods, Run:AI has another algorithm to enable preemption of users. We calculate the number of GPUs every active user deserves at a given moment. Then, we run a simulation where we preempt the pods of the user who is monopolizing the GPUs, attempting to allocate pods for the starved user. If the simulation succeeds, we preempt the pods and simply allow our allocation algorithm to apply.
Watch the below example, where we have four GPUs in the cluster. We calculate how many GPUs every user should have: that’s two GPUs each. Bob is using two GPUs more than his share. So we preempt two of Bob’s pods and move those pods back to the pending pods heap. Now we have two free GPUs and we simply apply our allocation algorithm. We allocate Alice’s first pod and update the number of allocated GPUs, but she remains at the top of the heap because she’s still starved. Then we allocate a second pod. As you can see, Bob and Alice are now sharing the cluster fairly.
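The preemption decision in this example can be approximated as a dry run that computes which pods would have to be preempted, without touching the cluster. The sketch below rests on several assumptions that the description above does not spell out: each pod uses one GPU, the deserved share is an equal split among active users, and the most recently started pods of an over-quota user are preempted first. The function name `plan_preemptions` is hypothetical.

```python
def plan_preemptions(total_gpus, running, pending):
    """Return {user: [pods to preempt]} so starved users can reach their share.

    running: {user: [pod_name, ...]} in start order, one GPU per pod
    pending: {user: number of pending pods}
    """
    active = sorted(
        {u for u, pods in running.items() if pods}
        | {u for u, n in pending.items() if n > 0}
    )
    if not active:
        return {}
    share = total_gpus // len(active)  # assumption: equal split, remainder ignored
    free = total_gpus - sum(len(pods) for pods in running.values())

    # GPUs that starved users still need, capped by what they actually requested.
    needed = 0
    for user in active:
        have = len(running.get(user, []))
        want = min(share, have + pending.get(user, 0))
        needed += max(0, want - have)

    to_reclaim = max(0, needed - free)
    plan = {}
    for user in active:
        if to_reclaim == 0:
            break
        pods = running.get(user, [])
        excess = len(pods) - share
        if excess > 0:
            take = min(excess, to_reclaim)
            plan[user] = pods[-take:]  # assumption: newest pods are preempted first
            to_reclaim -= take
    return plan

running = {"bob": ["b1", "b2", "b3", "b4"]}
plan = plan_preemptions(total_gpus=4, running=running, pending={"alice": 2})
print(plan)  # {'bob': ['b3', 'b4']}

no_demand = plan_preemptions(total_gpus=4, running=running, pending={})
print(no_demand)  # {}
```

When nobody else is waiting, the plan is empty and Bob keeps all four GPUs, which matches the goal of maximizing utilization while the cluster is otherwise idle.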
To summarize, with the allocation and preemption algorithms, users are allowed to maximize their use of the GPUs when they’re free, and when other users submit pods, everyone gets the GPUs they deserve. We can see these algorithms at work in the usage stats from our customers.
Run:AI Customer Example: Fair-share Allocation of GPU
When you look at the marked time window below, you can see that the blue user uses about 20 GPUs while the pink user uses about 10 GPUs. We allow the blue user to get 20 GPUs because the GPUs are available, and we want to get the maximum utilization out of the cluster.
After a few hours, in the new marked window, the pink user submits more pods, so we preempt the blue user and we let the pink user have those GPUs. As you can see, the lines are getting closer to each other. This is because we took the GPUs from the blue user and gave them to the pink user. This is what fairness looks like.
Run:AI Offers Plug-and-Play Fair Scheduling of GPUs
With a simple integration via a Kubernetes plug-in, your data science teams can enjoy virtually unlimited compute and fair allocation according to rules you control.
Here are some of the scheduling capabilities you gain when using Run:AI:
- Advanced visibility—create an efficient pipeline of resource sharing by pooling GPU compute resources.
- No more bottlenecks—you can set up guaranteed quotas of GPU resources, to avoid bottlenecks and optimize billing.
- A higher level of control—Run:AI enables you to dynamically change resource allocation, ensuring each job gets the resources it needs at any given time.
By Yodar Shafrir, Software Engineer Team Lead at Run:AI