Container Orchestration

A Guide

What is Container Orchestration?

Containers have become widely adopted over the past decade. A lightweight alternative to virtual machines, they make it possible to package software and its dependencies in an isolated unit, which can be easily deployed in any environment. Containers are one of the foundational technologies used to build cloud native applications.

Companies that need to deploy and manage hundreds of Linux containers and hosts can benefit from container orchestration. Container orchestration can automatically deploy, manage, scale, and set up networking for large numbers of containers. Popular container orchestrators include Kubernetes, Docker Swarm, and OpenShift.

Container orchestration makes it possible to deploy applications across multiple environments without having to redesign or refactor them. Orchestrators can also be used to deploy applications in a microservices architecture, in which software is broken up into small, self-sufficient services, each developed and delivered through efficient CI/CD pipelines.

In this article, you will learn:

  • What problems does container orchestration solve
  • AWS container orchestration
  • Azure container orchestration
  • OpenShift container orchestration
  • Container orchestration on NVIDIA GPUs
  • Automating Kubernetes for machine learning with Run:AI

What Problems Does Container Orchestration Solve?

Scaling containers across an organization, while ensuring efficient utilization of computing resources, can be very challenging without automation.

Capabilities of container engines

Container engines like Docker provide CLI commands for operations like pulling a container image from a repository, creating a container, and starting or stopping one or more containers. These commands are effective for managing containers on a few hosts, but they do not address the full lifecycle of containerized applications.

Additional requirements at large scale

Here are some of the challenges that need to be addressed in larger-scale containerized applications:

  • Automatically deploying specific quantities of containers to a set of host machines
  • Identifying which hosts are underutilized and can accept more containers, and which are overutilized, meaning existing containers don't have sufficient resources
  • Updating and rolling back applications running in multiple containers across different physical locations
  • Load balancing application traffic between multiple containers or groups of containers
  • Providing a central user interface for managing container workloads
  • Defining networking for containers
  • Ensuring security best practices across large numbers of containers

How container orchestrators help

Container orchestrators automate all of the above activities, using a declarative approach. You define a “desired state” of your containerized application, typically using a configuration file, and the orchestrator constantly works to achieve that desired state, given the available resources.
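
For example, in Kubernetes the desired state is usually written as a YAML manifest. The following is a minimal, hypothetical Deployment (the application name, image, and replica count are placeholders) that asks the orchestrator to keep three copies of a web service running and to roll out updates gradually:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: web-app                        # hypothetical application name
    spec:
      replicas: 3                          # desired state: keep 3 containers running
      selector:
        matchLabels:
          app: web-app
      strategy:
        type: RollingUpdate                # replace containers gradually during updates
      template:
        metadata:
          labels:
            app: web-app
        spec:
          containers:
            - name: web-app
              image: example/web-app:1.0   # placeholder image
              resources:
                requests:
                  cpu: 250m                # helps the scheduler pick a host with spare capacity
                  memory: 256Mi

If a container crashes or a host fails, the orchestrator detects the gap between the actual and desired state and starts replacement containers on the remaining hosts.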

Orchestrators can do the following automatically:

  • Manage the full container lifecycle
  • Scale containers and the underlying infrastructure
  • Manage service discovery and container networking
  • Implement security controls in a consistent way
  • Monitor container health and handle fault tolerance
  • Load balance traffic between containers
  • Manage optimal resource utilization of container hosts

AWS Container Orchestration

AWS built its own container orchestration platform, known as Amazon Elastic Container Service (ECS). ECS integrates containers seamlessly with other AWS services and is compatible with Docker. It allows you to run container-based applications on EC2 instances. ECS is fully managed and easy to use, and you don't need any additional orchestration software if you are already using AWS.

Another option for container orchestration on AWS is the Elastic Kubernetes Service (EKS), which lets you run Kubernetes workloads on managed clusters. AWS fully manages the Kubernetes control plane, and assists with tasks like autoscaling, updating, networking and security.
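
As a rough illustration, EKS clusters are often defined declaratively with a tool such as eksctl. The sketch below is a hypothetical cluster configuration; the cluster name, region, instance type, and node counts are placeholder assumptions:

    apiVersion: eksctl.io/v1alpha5
    kind: ClusterConfig
    metadata:
      name: demo-cluster          # hypothetical cluster name
      region: us-east-1           # placeholder AWS region
    managedNodeGroups:
      - name: workers
        instanceType: m5.large    # placeholder EC2 instance type
        desiredCapacity: 3        # EKS provisions and manages these worker nodes
        minSize: 2
        maxSize: 6                # lets the node group scale with demand

Assuming eksctl is installed, running eksctl create cluster -f cluster.yaml with a file like this would create the cluster and its managed node group.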

Azure Container Orchestration

Azure Kubernetes Service (AKS) is a container orchestration solution available on Microsoft Azure. It is a managed service based on Kubernetes, which you can use to deploy, manage and scale Docker containers and containerized applications across a cluster of hosts on the Azure public cloud.

Microsoft manages the Kubernetes control plane for you, so you don't have to handle version upgrades yourself. You can choose when to upgrade Kubernetes in your AKS cluster to minimize disruption to your workloads.

AKS can automatically add nodes to or remove nodes from a cluster in response to fluctuations in demand. You can also leverage node pools, including nodes with graphics processing units (GPUs) or other specialized hardware, to boost your processing power. This is important for workloads that require extensive computing resources.
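
For example, once a GPU-enabled node pool has been added to an AKS cluster, a workload can be steered onto it with a standard Kubernetes node selector. The sketch below assumes a node pool named gpupool (AKS labels each node with the name of its pool under the agentpool label) and uses a placeholder image:

    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-workload                  # hypothetical workload name
    spec:
      nodeSelector:
        agentpool: gpupool                # assumed name of a GPU-enabled AKS node pool
      containers:
        - name: trainer
          image: example/trainer:latest   # placeholder image

The GPU itself would also be requested through the device plugin resource, which is covered in the NVIDIA section below.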

OpenShift Container Orchestration

OpenShift, created by Red Hat, is a container orchestration platform that can run containers in on-premises or hybrid cloud environments. Internally, OpenShift is based on Kubernetes and shares many of the same components.

However, there are many differences between the two platforms. OpenShift adds many components and capabilities not included in plain Kubernetes, including an Istio-based service mesh, Prometheus-based monitoring, and the Red Hat Quay container registry.

OpenShift uses the concept of build artifacts, and enables these artifacts to run as first-class resources in Kubernetes. It is tightly integrated with Red Hat Enterprise Linux (RHEL), an operating system distribution used by many large enterprise deployments.
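
As a rough sketch of the build-as-a-resource idea, OpenShift lets you define a BuildConfig that turns source code into a container image inside the cluster. The repository URL, builder image, and names below are hypothetical placeholders:

    apiVersion: build.openshift.io/v1
    kind: BuildConfig
    metadata:
      name: my-app                                     # hypothetical build name
    spec:
      source:
        git:
          uri: https://github.com/example/my-app.git   # placeholder source repository
      strategy:
        sourceStrategy:                                # source-to-image (S2I) build
          from:
            kind: ImageStreamTag
            name: python:3.9                           # assumed builder image stream
      output:
        to:
          kind: ImageStreamTag
          name: my-app:latest                          # resulting image artifact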

Container Orchestration on NVIDIA GPUs

Kubernetes can run on NVIDIA GPUs, allowing the container orchestration platform to leverage GPU acceleration. The NVIDIA device plugin enables GPU support in Kubernetes, so developers can schedule GPU resources to build and deploy applications on multi-cloud clusters.
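
Once the device plugin is installed, GPUs show up in Kubernetes as a schedulable resource named nvidia.com/gpu. The sketch below is a minimal, hypothetical pod that requests one GPU and runs nvidia-smi to verify access (the pod name and CUDA image tag are placeholders):

    apiVersion: v1
    kind: Pod
    metadata:
      name: cuda-smoke-test                            # hypothetical pod name
    spec:
      restartPolicy: OnFailure
      containers:
        - name: cuda
          image: nvidia/cuda:12.2.0-base-ubuntu22.04   # placeholder CUDA base image
          command: ["nvidia-smi"]                      # lists the GPUs visible to the container
          resources:
            limits:
              nvidia.com/gpu: 1                        # the scheduler places this pod on a GPU node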

Kubernetes has become increasingly important for developing and scaling machine learning and deep learning algorithms. Even if you are not a trained data scientist, containers can help simplify the management and deployment of models. Containers allow you to package a model, making it easy to transfer between environments, so you don't have to rebuild a model from scratch every time, which can be complex and time-consuming.

GPUs are more difficult to virtualize than CPUs, but they allow developers to process large data sets simultaneously across heterogeneous environments, including cloud deployments and distributed networks. They can accelerate the development of data-heavy systems such as conversational AI.

NVIDIA DGX systems support container orchestration with multiple open-source container runtimes, such as containerd, CRI-O, and Docker. GPU metrics can be monitored via a monitoring stack that integrates NVIDIA DCGM with Prometheus and Grafana. When scheduling workloads, you can specify attributes such as GPU memory requirements and GPU type.
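
For instance, if the NVIDIA GPU Feature Discovery component is deployed alongside the device plugin, nodes are labeled with properties of their GPUs, and those labels can be used to pin a workload to a particular GPU type. The label value and image below are assumptions for illustration:

    apiVersion: v1
    kind: Pod
    metadata:
      name: training-job                                # hypothetical pod name
    spec:
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB   # assumed label set by GPU Feature Discovery
      containers:
        - name: trainer
          image: example/trainer:latest                 # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 2                         # request two GPUs on the selected node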

Container orchestration on NVIDIA GPUs is supported by a number of toolkits, which are continuously being developed. With the NVIDIA Container Toolkit, you can:

  • Build, deploy, orchestrate and monitor GPU-accelerated Docker containers
  • Automate container configuration using the container runtime library
  • Utilize Jetson edge devices running the same CUDA-X stack — for example, images pulled from NVIDIA GPU Cloud can be optimized for JetPack.

NVIDIA also offers a transfer learning toolkit that distributes pre-trained models for AI operations such as conversational AI and computer vision using Docker containers. Transfer learning allows you to transfer an existing neural network capability to a new model. Developers can use the NVIDIA GPU Cloud registry to access existing models packaged in containers.

A key element of managing machine learning workloads on orchestrators is scheduling. Read our guide to Kubernetes scheduling.

Automating Kubernetes for Machine Learning with Run:AI

Run:AI’s Scheduler is a simple plug-in for Kubernetes clusters that adds optimized, high-performance orchestration to your containerized AI workloads. The Run:AI platform includes:

  • High performance for scale-up infrastructures—pool resources and enable large workloads that require considerable resources to coexist efficiently with small workloads requiring fewer resources.
  • Batch scheduling—workloads can start, pause, restart, end, and then shut down, all without any manual intervention. Plus, when the container terminates, the resources are released and can be allocated to other workloads for greater system efficiency.
  • Topology awareness—inter-resource and inter-node communication enable consistent high performance of containerized workloads.
  • Gang scheduling—containers can be launched together, start together, and end together for distributed workloads that need considerable resources.
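
As a rough, hypothetical sketch of how a workload opts into a custom scheduler: Kubernetes lets each pod name the scheduler that should place it through the schedulerName field. Assuming the Run:AI scheduler is installed in the cluster and registered under the name runai-scheduler (an assumption for illustration), a GPU training pod might look like this:

    apiVersion: v1
    kind: Pod
    metadata:
      name: distributed-training          # hypothetical workload name
    spec:
      schedulerName: runai-scheduler      # assumed scheduler name; hands placement to Run:AI instead of the default scheduler
      containers:
        - name: trainer
          image: example/trainer:latest   # placeholder training image
          resources:
            limits:
              nvidia.com/gpu: 4           # GPUs requested for this workload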

Run:AI simplifies Kubernetes scheduling for AI and HPC workloads, helping researchers accelerate their productivity and the quality of their work.

Learn more about the Run:AI Kubernetes Scheduler