What Is NVIDIA Base Command?
NVIDIA Base Command Platform provides centralized, cloud-hosted management of the entire artificial intelligence (AI) development lifecycle. It gives data scientists and IT teams ready-to-use management tools for AI training, including workflows and resource management, and lets multiple teams share AI infrastructure without interfering with one another.
NVIDIA Base Command Platform offers a web user interface (UI) and a command-line interface (CLI) that enable the execution of AI workloads on right-sized resources, from a single GPU to a multi-node cluster. It also provides dataset management, enabling fast delivery of production-ready AI models and applications.
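For example, once the NGC CLI is configured, a training job can be launched on a right-sized instance from the command line. This is a hedged sketch: the job name, ACE name, instance type, image tag, and training command below are illustrative placeholders, and flag names may vary by CLI version.

```shell
# Launch a single-GPU training job on Base Command (all names are illustrative).
# --instance selects the resource size; --ace selects the compute environment.
ngc batch run \
  --name "resnet50-smoke-test" \
  --ace my-ace \
  --instance dgxa100.80g.1.norm \
  --image "nvidia/pytorch:23.10-py3" \
  --result /results \
  --commandline "python train.py --epochs 1"
```

Choosing a larger instance type (for example, a multi-GPU variant) is a matter of changing the `--instance` argument rather than reworking the job.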
The platform includes a set of built-in telemetry features to help evaluate deep learning (DL) techniques, resource allocations, and workload settings. Its reporting and visibility capabilities provide organizations with insights to measure the progress of AI projects based on business goals. Team managers can use this feature to define project priorities and plans by forecasting computing capacity needs.
In this article:
- NVIDIA Base Command: Features and Benefits
  - AI Job Scheduling and Workload Orchestration
  - Holistic Workflow Management
  - Integrated Reporting and Monitoring Dashboards
- NVIDIA Base Command Platform Concepts
- Getting Started with NVIDIA Base Command Platform
  - Dataset Management
  - Workspace Management
- Learn More About NVIDIA A100
NVIDIA Base Command: Features and Benefits
NVIDIA Base Command Platform is an AI training service that helps businesses and data scientists accelerate the AI development process. It centralizes end-to-end AI training processes, including job scheduling, resource sharing, and dataset management, through an intuitive user interface (UI), command-line interface, reporting dashboard, and integrated monitoring.
AI Job Scheduling and Workload Orchestration
Base Command provides Kubernetes, Slurm, and Jupyter Notebook environments for NVIDIA DGX systems (NVIDIA’s multi-GPU AI servers and workstations), offering an easy-to-use scheduling and orchestration solution that meets the requirements of large enterprises. Teams get access to the tools they already use, with unified management and NVIDIA support.
Holistic Workflow Management
Data science and deep learning practitioners require optimized, ready-to-run AI software and end-to-end management of AI experiments and workflow.
NVIDIA Base Command can configure and manage AI workloads, providing unified dataset management. It ensures AI workloads run on resources of the right size, from single GPUs to large multi-node clusters. Cloud hosting management features enable a common user experience and control over NVIDIA DGX SuperPOD (a cluster of DGX systems).
Base Command accommodates existing AI tools and work methods, with consistent functionality across the web UI, API, and command line. A large selection of optimized, pre-built containers with deep learning frameworks, data science tools, and trained models is available through the NVIDIA NGC Catalog, allowing data scientists to build production-ready models faster.
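NGC Catalog containers are distributed through NVIDIA's `nvcr.io` registry and can be pulled like any other container image. The framework tag below is an illustrative example; check the catalog for current versions.

```shell
# Pull a GPU-optimized PyTorch container from the NGC Catalog
# (the tag is an example; newer monthly releases supersede it).
docker pull nvcr.io/nvidia/pytorch:23.10-py3
```

On Base Command itself, the same image name is passed to a job via the `--image` argument instead of being pulled manually.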
Integrated Reporting and Monitoring Dashboards
An AI project is multifaceted and highly iterative, requiring constant fine-tuning. NVIDIA Base Command enables IT teams and AI practitioners to analyze and optimize AI resource usage with built-in telemetry. Its reporting dashboards help management track the progress of AI initiatives and improve AI infrastructure.
NVIDIA Base Command Platform Concepts
Here are some common concepts of the NVIDIA Base Command Platform:
- Accelerated computing environment (ACE)—a cluster or availability zone with dedicated compute, networking, and storage.
- NGC catalog—a set of publicly accessible, GPU-optimized, NVIDIA-maintained software, including Helm charts, containers, and pre-trained models.
- Job—the basic computation unit or container running in the ACE.
- Job definition—a set of attributes defining a job.
- Dataset—a job’s aggregated data inputs containing code or data.
- Container image—a Docker container used to package application components.
- Data result—a read/write mount that a job specifies and the system captures.
- Instance—determines the resources available to a job, including CPU, RAM, and GPUs (up to eight).
- Job command—the action specified by a job to run in a container (could be simple or complex).
- Multi-node job—a job running on more than one node.
- Model—an NGC pre-trained DL model that you can easily fine-tune or re-train.
- Organization (org)—the dedicated registry space for an organization.
- Team—a unit within an org with a dedicated registry space that only team members can access.
- User—an individual with an NVIDIA Base Command Platform account, grouped under an organization and (optionally) team.
- Private registry—a secure space for storing and sharing resources, models, and containers.
- Quota—the default limit on storage and GPU resources for each user.
- Telemetry—metrics and data from different components like CPU, memory, and GPU.
- Workspace—a shareable, read/write-persistent storage space for collaboration on a job. You can mount a workspace to a job in read/write or read-only mode.
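Several of these concepts (job, instance, container image, job command, multi-node job) come together in a single job submission. The sketch below shows a multi-node launch; the replica flags shown here are an assumption based on common NGC CLI usage and may differ by CLI version, and all names are placeholders.

```shell
# Illustrative multi-node job: the job runs on two nodes (replicas),
# each sized by --instance, all executing the same job command.
ngc batch run \
  --name "multinode-train" \
  --ace my-ace \
  --instance dgxa100.80g.8.norm \
  --replicas 2 \
  --image "nvidia/pytorch:23.10-py3" \
  --result /results \
  --commandline "python -m torch.distributed.run train.py"
```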
Getting Started with NVIDIA Base Command Platform
To use Base Command Platform, you need an account. The organization’s administrator creates accounts for employees; an account is activated by mapping the user’s email address to the organization’s single sign-on (SSO).
Dataset Management
Datasets contain read-only data for repeatable, well-documented, and scalable workloads. A dataset may be accessible enterprise-wide or restricted to a specific team. Under the Base Command menu, select Datasets to view the datasets accessible to you, your organization, or your team.
Datasets are critical for deep learning jobs, providing shareable data for training and production workloads. You can mount multiple datasets to one job, while multiple users and jobs can access the same dataset simultaneously.
To mount a dataset:
- Go to Data Input, click the Datasets tab, and find the dataset you want to mount.
- Select the dataset (or datasets).
- Specify the mount point for each chosen dataset.
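The same mounting can be done when launching a job from the CLI. In this hedged sketch, the dataset IDs, mount points, and other names are placeholders; each `--datasetid` argument pairs a dataset with its mount point inside the container.

```shell
# Mount two datasets into one job at chosen mount points (IDs are placeholders).
# Multiple jobs and users can mount the same dataset concurrently.
ngc batch run \
  --name "train-with-data" \
  --ace my-ace \
  --instance dgxa100.80g.1.norm \
  --image "nvidia/pytorch:23.10-py3" \
  --datasetid 12345:/data/train \
  --datasetid 67890:/data/val \
  --result /results \
  --commandline "python train.py --data-dir /data"
```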
Workspace Management
A workspace is a persistent, shareable storage unit that you can mount to a job to enable concurrent access. Each workspace has an ID and, optionally, a name. Workspaces count against your storage quota.
The main purpose of a workspace is to enable data sharing between jobs, such as for re-training and checkpoints. It also facilitates collaboration between multiple users, providing a convenient place to store and sync code. Multiple jobs can write to the same workspace, and workspaces can serve as network home directories or shared storage spaces for specific teams.
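The checkpoint-sharing pattern described above can be sketched with the CLI's workspace mount syntax. This is an assumption-laden example: the workspace name, paths, and training command are placeholders, and flag syntax may vary by CLI version.

```shell
# Mount a workspace read/write (RW) so the job can persist checkpoints.
# A later job could mount the same workspace read-only (RO) to resume from them.
ngc batch run \
  --name "train-checkpointed" \
  --ace my-ace \
  --instance dgxa100.80g.1.norm \
  --image "nvidia/pytorch:23.10-py3" \
  --workspace my-workspace:/checkpoints:RW \
  --result /results \
  --commandline "python train.py --ckpt-dir /checkpoints"
```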
To build an NGC workspace:
- In the left navigation menu, go to Base Command, then Workspaces.
- Click Create Workspace at the top right of the page.
- Under Create a Workspace, enter the workspace’s name and choose the ACE you want to attach to the workspace.
- Select Create to add the workspace to your workspace list.
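The steps above can also be performed from the CLI. The workspace and ACE names below are placeholders, and the mount syntax is a sketch that may differ slightly by CLI version.

```shell
# Create a workspace attached to a specific ACE (names are illustrative).
ngc workspace create --name my-workspace --ace my-ace

# Optionally mount the workspace locally to sync code before launching jobs.
ngc workspace mount my-workspace ./my-workspace --mode RW
```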
Learn More About NVIDIA A100
NVIDIA Deep Learning GPU: Choosing the Right GPU for Your Project
NVIDIA Deep Learning GPUs provide high processing power for training deep learning models. This article provides a review of three top NVIDIA GPUs—NVIDIA Tesla V100, GeForce RTX 2080 Ti, and NVIDIA Titan RTX.
Learn what is the NVIDIA deep learning SDK, what are the top NVIDIA GPUs for deep learning, and what best practices you should adopt when using NVIDIA GPUs.
NVIDIA DGX: Under the Hood of DGX-1, DGX-2 and A100
DGX is a line of servers and workstations built by NVIDIA that can run large, demanding machine learning and deep learning workloads on GPUs. DGX provides a massive amount of computing power—between 1 and 5 petaFLOPS per DGX system. It also provides advanced technology for interlinking GPUs and enabling massive parallelization across thousands of GPU cores.
Get an in-depth look into three generations of the NVIDIA DGX series, including hardware architecture, software architecture, networking and scalability features.
NVIDIA NGC: Features, Popular Containers, and a Quick Tutorial
NVIDIA NGC is a repository of containerized applications you can use in deep learning, machine learning, and high performance computing (HPC) projects. These applications are optimized for running on NVIDIA GPU hardware and include pre-trained models and Helm charts that let you deploy applications seamlessly in Kubernetes clusters.
Learn about NVIDIA NGC, a repository of containerized applications for machine learning. See examples of containers offered on NGC and learn how to get started.