This article was originally published on May 18, 2022, in The Sequence of AI Knowledge. Run:ai introduces rntop, a new, super useful open-source tool that measures GPU cluster utilization. Learn why that’s a critical measure for data scientists, as well as for IT leaders controlling hardware budgets.
Why measure GPU cluster utilization?
Ask any data scientist what they think their GPU cluster’s utilization is, and they’ll probably say it’s above 60%, at least. When AI teams and models begin to scale, it can certainly feel like compute resources are always in use, which is extremely frustrating for data scientists who are eager to train, validate, and test their models. As one customer recently put it, “We need a scheduler…otherwise, we might have blood on the floor as users will fight for the GPUs.” Yikes. But the truth is, most GPU clusters are at less than 20% utilization.
Why the disconnect between guesstimate and reality? A mismatch between...