Kubernetes Architecture for AI Workloads

AI & Machine Learning Guide

Understanding Kubernetes Architecture for Data Science Workloads

This article explains how Kubernetes came to be used inside many companies as a platform for containerized AI workloads. The guide covers key considerations for implementing a Kubernetes architecture to orchestrate AI workloads.

Kubernetes Overview

Originally developed inside Google, Kubernetes has been an open-source project since June 2014 and has been managed by the Cloud Native Computing Foundation (CNCF) since Google and the Linux Foundation partnered to found the CNCF in July 2015. Kubernetes is an orchestration system that automates the processes involved in running thousands of containers in production. It eliminates the infrastructure complexity associated with deploying, scaling, and managing containerized applications.

There is a strong correlation between the growth in containers and microservice architectures and the adoption of Kubernetes. According to a recent Gartner report, “By 2023, more than 70% of global organizations will be running more than two containerized applications in production, up from less than 20% in 2019.” And Kubernetes usage will continue to grow as companies deepen their commitment to containerization. According to a recent survey of 250 IT professionals conducted by Dimensional Insight, “Well over half (59%) are running Kubernetes in a production environment, with one-third (33%) operating 26 clusters or more and one-fifth (20%) running more than 50 clusters.”

The Kubernetes website is full of case studies of companies from a wide range of verticals that have embraced Kubernetes to address business-critical use cases: from Booking.com, which leveraged Kubernetes to dramatically accelerate the development and deployment of new services; to Capital One, which uses Kubernetes as an “operating system” to multiply productivity while reducing costs; to the New York Times, which maximizes its cloud-native capabilities with Kubernetes-as-a-service on the Google Cloud Platform.

This guide looks specifically at how Kubernetes can be used to support data science workloads in general and machine/deep learning in particular. Because data science workloads require specific tooling, using Kubernetes for deep learning poses some challenges, which we identify in this guide.

This is part of an extensive series of guides about open source.

Kubernetes Architecture

Containers generally require automated orchestration that, for example, starts a particular container on demand, allows containers to talk to each other, dynamically spins up and terminates compute resources, recovers from failures, manages the lifecycle of containers, and generally ensures optimal performance and high availability. In this section, we briefly review how Kubernetes works.

Kubernetes Cluster
Figure 1: Schematic view of a Kubernetes cluster

As shown in Figure 1, each Kubernetes cluster contains at least one master node, which controls and schedules the cluster, and a number of worker nodes, each running one or more pods deployed to the same host (in our example, a Docker engine). A pod represents a unit of work and runs either a single container as an encapsulated service, or several tightly coupled containers that share network and storage resources. Kubernetes takes care of connecting pods to the infrastructure and managing them during runtime (monitoring, scaling, rolling deployments, etc.).

Every pod has its own IP address, which makes it easily discoverable to applications through Kubernetes service discovery. Multiple containers within a pod share the same IP address and network ports, while communicating among themselves using localhost.

Other Kubernetes concepts that are important to understand include:

  • Service: A logical collection of pods presented as a single entity, with a single point of access and easy communication among the pods in the service.
  • Volume: A resource where containers can store and access data, including persistent volumes for stateful applications.
  • Label: A user-defined metadata tag that makes Kubernetes resources easily searchable; see the sketch after this list for an example of selecting pods by label within a namespace.
  • Job: Jobs run containers to completion – that is, the containers start and end automatically. A job creates one or more pods and ensures that a specified number of them successfully run to completion. Jobs are particularly useful for running machine learning workloads, which will be addressed later in this guide.
  • Replica: Pods do not self-heal. If a pod fails or is evicted for some reason, a replication controller immediately uses a template to start a replacement pod so that the desired number of pods is always available.
  • Namespace: A grouping mechanism for Kubernetes resources (pods, services, replication controllers, volumes, etc.) that isolates those resources within the cluster.
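
To make these concepts concrete, here is a minimal sketch that uses the official Kubernetes Python client to select pods by label within a namespace. The namespace name ("ml-team") and the label ("app=training") are hypothetical placeholders, not part of any standard setup.

  # A minimal sketch using the official Kubernetes Python client (pip install kubernetes).
  # The namespace "ml-team" and the label "app=training" are hypothetical examples.
  from kubernetes import client, config

  config.load_kube_config()          # reads ~/.kube/config; use load_incluster_config() inside a pod
  core_v1 = client.CoreV1Api()

  pods = core_v1.list_namespaced_pod(
      namespace="ml-team",           # a namespace isolates these resources within the cluster
      label_selector="app=training", # a label makes the pods easily searchable
  )
  for pod in pods.items:
      print(pod.metadata.name, pod.status.pod_ip, pod.status.phase)

Each returned pod reports its own IP address, which is what Kubernetes service discovery exposes to applications, as described above.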


How Does Kubernetes Address Data Science Challenges?

Containers and the Kubernetes ecosystem have been embraced by developers for their ability to abstract modern distributed applications from the infrastructure layer. Declarative deployments, real-time continuous monitoring, and dynamic service routing deliver repeatability, reproducibility, portability, and flexibility across diverse environments and libraries.

These same Kubernetes features address many of the most fundamental requirements of data science workloads:

  • Reproducibility across a complex pipeline: Machine/deep learning pipelines consist of multiple stages, from data processing through feature extraction to training, testing, and deploying models. With Kubernetes, research and operations teams can confidently share a combined infrastructure-agnostic pipeline.
  • Repeatability: Machine/deep learning is a highly iterative process. With Kubernetes, data scientists can repeat experiments with full control over all environmental variables, including data sets, ML libraries, and infrastructure resources.
  • Portability across development, staging, and production environments: When run with Kubernetes, ML-based containerized applications can be seamlessly and dynamically ported across diverse environments.
  • Flexibility: Kubernetes provides the messaging, deployment, and orchestration fabric that is essential for packaging ML-based applications as highly modular microservices capable of mixing and matching different languages, libraries, databases, and infrastructures.

Considerations for Successful Kubernetes Architecture for AI Workloads

With all of the advantages described above, it is not surprising that Kubernetes has become the de facto container orchestration standard for data science teams. This section provides best practices for optimizing how data science workloads are run on Kubernetes.

Kubernetes Monitoring

Monitoring Kubernetes clusters is essential for right-scaling Kubernetes applications in production and for maintaining system availability and health. However, legacy tools for monitoring monolithic applications cannot provide actionable observability into distributed, event-driven, and dynamic Kubernetes applications. The new monitoring challenges raised by Kubernetes deployments include:

  • Because workloads deploy seamlessly across complex infrastructures, diverse streams of compute, storage, and network data must be normalized, analyzed, and visualized to achieve real-time, actionable insight into environment topology and performance.
  • Highly ephemeral containers make it tricky to capture and track important metrics such as the number of containers currently running, container restart activity, and each container’s CPU, storage, memory usage, and network health.
  • Kubernetes’ rich array of internal logs and metrics, including node and control plane component metrics, must be harnessed effectively for quick detection and remediation of cluster performance issues.

The current gold standard for monitoring Kubernetes ecosystems is Prometheus, an open-source monitoring system with its own declarative query language, PromQL. A Prometheus server deployed in the Kubernetes ecosystem can discover Kubernetes services and pull their metrics into a scalable time-series database. Prometheus’ multidimensional data model based on key-value pairs aligns well with how Kubernetes structures infrastructure metadata using labels.

The Prometheus metrics, which are published using the standard HTTP protocol, are human-readable and easily accessed via API calls by, for example, visualization and dashboard-building tools such as Grafana. Prometheus itself provides basic visualization capabilities by displaying the results of PromQL queries run on the aggregated time-series data as tables or graphs. Prometheus can also issue real-time alerts to the relevant teams when predefined performance thresholds are breached.
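
As an illustration of how those metrics can be consumed programmatically, the sketch below queries a Prometheus server’s HTTP API with a PromQL expression for per-pod container CPU usage. The server URL is a hypothetical in-cluster address, and the sketch assumes that the standard kubelet/cAdvisor container metrics are being scraped.

  # A minimal sketch that queries Prometheus' HTTP API with a PromQL expression.
  # The server URL below is a hypothetical in-cluster address.
  import requests

  PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"  # hypothetical placeholder

  # PromQL: 5-minute average CPU usage per container, summed by pod.
  query = 'sum by (pod) (rate(container_cpu_usage_seconds_total[5m]))'

  resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": query})
  resp.raise_for_status()

  for result in resp.json()["data"]["result"]:
      pod = result["metric"].get("pod", "<unknown>")
      timestamp, value = result["value"]
      print(f"{pod}: {float(value):.3f} CPU cores")

The same PromQL expression can be pasted into a Grafana panel or the Prometheus web UI to produce the tables and graphs described above.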

Run Batch AI Workloads as Jobs and Interactive Sessions as Replicas

Traditionally, when used for applications and services, Kubernetes containers are run as replicas, not as jobs. But for ML and DL workloads, running as jobs is a better fit, because jobs run to completion and can support parallel processing. A job can run multiple pods at the same time, enabling a parallel processing workflow while ensuring that those pods terminate and free their resources when the job runs to completion. Replicas are not designed for this behavior, which is critical for batch experimentation, for increasing resource utilization, and for reducing cloud spending. Replicas are a better fit for interactive sessions in which users build and debug their models or experiment with data.
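
As a hedged sketch of this pattern, the snippet below uses the Kubernetes Python client to submit a batch training Job that runs several pods in parallel and lets them terminate when the work completes. The image name, command, namespace, and parallelism/completions values are hypothetical examples.

  # A minimal sketch of a parallel batch Job via the Kubernetes Python client.
  # The image, command, namespace, and counts are hypothetical examples.
  from kubernetes import client, config

  config.load_kube_config()
  batch_v1 = client.BatchV1Api()

  job = client.V1Job(
      api_version="batch/v1",
      kind="Job",
      metadata=client.V1ObjectMeta(name="train-batch", labels={"app": "training"}),
      spec=client.V1JobSpec(
          parallelism=4,    # run up to 4 pods at the same time
          completions=8,    # the Job is done after 8 successful pod runs
          backoff_limit=2,  # retry a failed pod at most twice
          template=client.V1PodTemplateSpec(
              spec=client.V1PodSpec(
                  restart_policy="Never",
                  containers=[client.V1Container(
                      name="trainer",
                      image="registry.example.com/trainer:latest",  # hypothetical image
                      command=["python", "train.py"],
                  )],
              ),
          ),
      ),
  )

  batch_v1.create_namespaced_job(namespace="ml-team", body=job)

When the Job finishes, its pods stop and release their compute resources, which is exactly the behavior batch experimentation needs.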

Use CronJobs for Better Scheduling

Kubernetes architecture also includes CronJob, the native way to trigger jobs on a schedule. CronJobs are used to create periodic and recurring tasks, and they can also schedule a specific task for a determined time, such as running a Job when your cluster is likely to be idle.
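
Continuing the sketch above, a CronJob wraps the same kind of Job template in a schedule. The cron expression below (nightly at 02:00, when a cluster might be idle) and the other names are hypothetical examples.

  # A minimal sketch of a CronJob (batch/v1, Kubernetes 1.21+) that runs a Job nightly.
  # The schedule, image, namespace, and names are hypothetical examples.
  from kubernetes import client, config

  config.load_kube_config()
  batch_v1 = client.BatchV1Api()

  cron = client.V1CronJob(
      api_version="batch/v1",
      kind="CronJob",
      metadata=client.V1ObjectMeta(name="nightly-train"),
      spec=client.V1CronJobSpec(
          schedule="0 2 * * *",  # every day at 02:00, e.g. when the cluster is likely idle
          job_template=client.V1JobTemplateSpec(
              spec=client.V1JobSpec(
                  template=client.V1PodTemplateSpec(
                      spec=client.V1PodSpec(
                          restart_policy="Never",
                          containers=[client.V1Container(
                              name="trainer",
                              image="registry.example.com/trainer:latest",  # hypothetical image
                              command=["python", "train.py"],
                          )],
                      ),
                  ),
              ),
          ),
      ),
  )

  batch_v1.create_namespaced_cron_job(namespace="ml-team", body=cron)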

Kubernetes Architecture and Run:ai

Run:ai’s platform is built as a plug-in to the Kubernetes architecture to enable automated orchestration of high-performance AI workloads.

Run:ai simplifies and optimizes orchestration of AI workloads on Kubernetes to help data scientists accelerate their productivity and the quality of their models. Learn more about the Run:ai platform.

Learn More About Kubernetes Architecture

The Challenges of Scheduling AI Workloads on Kubernetes

This article explains the basics of Kubernetes scheduling, why Kubernetes alone is not well suited to the scheduling and orchestration of deep learning workloads, and the specific areas where Kubernetes falls short for AI.

Learn how Kubernetes handles AI workloads and how to overcome its limitations using concepts from High Performance Computing (HPC).

Read more: The Challenges of Scheduling AI Workloads on Kubernetes

Kubeflow Pipelines: The Basics and a Quick Tutorial

Kubeflow Pipelines is a platform designed to help you build and deploy container-based machine learning (ML) workflows that are portable and scalable. Each pipeline represents an ML workflow and includes the specifications of all inputs needed to run the pipeline, as well as the outputs of all components.

Learn about Kubeflow Pipelines use cases, architecture, and see how to set up your first pipeline, step by step.

Read more: Kubeflow Pipelines: The Basics and a Quick Tutorial

What is Container Orchestration?

Containers have become widely adopted over the past decade. A lightweight alternative to virtual machines, they make it possible to package software and its dependencies in an isolated unit, which can be easily deployed in any environment. Containers are one of the foundational technologies used to build cloud native applications.

Learn about container orchestration, the problems it solves, container orchestration platforms, and leveraging orchestration for GPU workloads.

Read more: What is Container Orchestration?