What Is NVIDIA EGX?
NVIDIA EGX is an enterprise edge computing platform. With billions of IoT devices in use today, there is a growing need for managing data streaming from these devices and enabling analytics and machine learning to derive insights from it. EGX provides a software stack, architecture, and supporting hardware for handling data-intensive workloads in a variety of scenarios. It has a special focus on edge locations, but can also be used in the cloud and traditional data centers.
This is part of our series of articles about NVIDIA A100 and other NVIDIA AI offerings.
NVIDIA EGX Use Cases
Here are some notable use cases for NVIDIA EGX.
AI at the Edge
Edge AI allows organizations to respond quickly to complex data, operations, and markets, and AI is playing a growing role in that rapid response. Combining AI capabilities with advanced connectivity and computing power allows businesses to shift operations from the data center to the network edge, reducing the distance between where data is captured and where it is processed. This approach lowers data transit costs and reduces latency.
However, edge AI has special requirements given the lack of a central computing resource. NVIDIA EGX allows organizations to centrally manage their system and software updates across a distributed edge. It also helps address the unique edge computing security requirements with an end-to-end security approach.
The platform supports various accelerated edge AI applications to deliver insights quickly.
AI for Data Centers
AI is transformative for various industries, but many organizations struggle to adapt their existing infrastructure to support AI applications. With AI models constantly growing and requiring more training data, traditional data centers cannot meet their resource needs. The complexity of many AI applications makes them difficult to scale, monitor, and manage, especially if the hardware and infrastructure are not standardized.
NVIDIA EGX helps IT teams deliver complete AI solutions on cost-effective, powerful infrastructure. The platform is based on NVIDIA-Certified Systems, which are servers with high-performance GPUs and secure NVIDIA Mellanox networking. It allows customers to future-proof their organizations by standardizing on a unified architecture to manage, deploy, and monitor projects.
Data-Driven Enterprise Models
Data science dominates modern computing, with data analytics capabilities translating into direct cost savings and revenues. However, data science remains complex and time-consuming, and it relies on extensive infrastructure and computing power. CPU-only computing often struggles to support data-driven applications at scale.
NVIDIA EGX offers parallel GPU computing, eliminating bottlenecks and improving performance. It helps accelerate insights and ROI, allowing enterprises to leverage high-performance GPU computing.
Visualization
Creators often face complex challenges when producing large amounts of data and creating high-quality content, especially when geographically distributed teams work remotely. They require extensive computing and graphics power to support visual computing workloads such as rendering.
Visualization applications require a powerful computing infrastructure that supports sophisticated technologies, while IT teams rely on secure, manageable, and scalable solutions to maintain workstation performance. NVIDIA EGX supports visualization projects with GPU hardware and software, including the NVIDIA A40 GPU and NVIDIA vGPU software.
The cloud-native NVIDIA Omniverse platform uses multiple GPUs to support remote collaboration and visualization capabilities like photorealistic simulation. It streamlines infrastructure management processes and secures assets by eliminating the need to send sensitive data globally.
NVIDIA EGX Platform Overview
The NVIDIA EGX platform provides a cloud-native software stack (the EGX stack), validated servers and hardware appliances, a large ecosystem of partners offering services over EGX, and a library of Helm charts that organizations can use to deploy AI applications.
Key features of the EGX platform include:
- Cloud native—EGX is built on cloud-native technologies such as microservices, containerization, and declarative automation. It runs GPU-optimized NVIDIA NGC containers (the NGC Catalog is a curated set of software for AI and HPC).
- Open source—the EGX stack is based on open source projects, and NVIDIA actively contributes to open source projects used in the stack.
- Performance and scale—the EGX stack uses NVIDIA Ampere A100 Tensor Core GPUs with EGX fusion accelerators that scale workloads across multiple nodes.
- Partner ecosystem—NVIDIA has certified a range of edge hardware systems for use with EGX. These systems have passed extensive testing that validates their ability to deliver high performance running NGC containers, and sufficient remote management and security capabilities.
NVIDIA GPU and Network Operator
At the heart of the EGX stack are two Kubernetes operators that automatically manage NVIDIA infrastructure.
NVIDIA GPU Operator automatically manages all NVIDIA software components required for GPU configuration, including:
- CUDA-enabled NVIDIA drivers
- Kubernetes device plugins for GPUs
- NVIDIA container runtimes
- Automatic labeling for nodes
- The NVIDIA Data Center GPU Manager (DCGM) monitoring agent
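To make the role of these components more concrete, here is a minimal sketch of how a workload might request a GPU once the device plugin is running on a cluster. The Kubernetes device plugin for NVIDIA GPUs advertises GPUs under the extended resource name `nvidia.com/gpu`, and a pod asks for them in its resource limits. The helper name `gpu_pod_manifest`, the image `example.com/my-cuda-app:latest`, and the pod name are hypothetical placeholders. The `nvidia.com/gpu.present` node label in the usage example is an assumption about the labels the operator applies, so check the labels on your own nodes before relying on it.

```python
import json


def gpu_pod_manifest(name, image, gpu_count=1, node_selector=None):
    """Build a Kubernetes Pod manifest (as a dict) that requests NVIDIA GPUs.

    This is an illustrative helper, not part of any NVIDIA tooling.
    """
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")

    spec = {
        "restartPolicy": "Never",
        "containers": [
            {
                "name": name,
                "image": image,
                # GPUs are requested through the extended resource that the
                # NVIDIA device plugin registers with the kubelet.
                "resources": {"limits": {"nvidia.com/gpu": gpu_count}},
            }
        ],
    }
    if node_selector:
        # Optionally pin the pod to nodes carrying specific labels.
        spec["nodeSelector"] = dict(node_selector)

    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": spec,
    }


if __name__ == "__main__":
    manifest = gpu_pod_manifest(
        name="cuda-demo",                        # hypothetical pod name
        image="example.com/my-cuda-app:latest",  # hypothetical image
        gpu_count=1,
        # Assumed label; verify against the labels on your nodes.
        node_selector={"nvidia.com/gpu.present": "true"},
    )
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, since Kubernetes accepts manifests in JSON as well as YAML.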
NVIDIA Network Operator leverages Kubernetes Custom Resource Definitions (CRDs) to enable high-speed networking, Remote Direct Memory Access (RDMA), and GPUDirect on Kubernetes clusters. It provides support for RDMA shared devices, the Mellanox Kubernetes device plugin, and the GPUDirect RDMA peer memory driver.
The NVIDIA EGX stack can be deployed on a variety of hardware combinations, from full racks of NVIDIA T4 servers to pocket-sized NVIDIA Jetson Nano devices. NVIDIA's NGC-Ready program validates edge systems to maximize the potential of NVIDIA’s GPU-optimized software. NGC-Ready validation not only checks for GPU compatibility, but also performs additional security and remote system administration tests to help system administrators remotely manage and secure their edge systems.
NVIDIA EGX Management with Run:ai
Run:ai automates resource management and orchestration for machine learning infrastructure, including on EGX edge devices. With Run:ai, you can automatically run as many compute-intensive experiments as needed.
Our AI Orchestration Platform for GPU-based computers running AI/ML workloads provides:
- Advanced queueing and fair scheduling to allow users to easily and automatically share clusters of GPUs,
- Distributed training on multiple GPU nodes to accelerate model training times,
- Fractional GPUs to seamlessly run multiple workloads on a single GPU of any type,
- Visibility into workloads and resource utilization to improve user productivity.
Run:ai simplifies machine learning infrastructure pipelines, helping data scientists accelerate their productivity and the quality of their models.
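As a rough illustration of what fair scheduling of a shared GPU pool can mean, the sketch below implements textbook max-min fairness. It is not Run:ai's scheduler, whose internals are not described here. The function name `max_min_fair_share` and the team names are hypothetical, and fractional results assume that GPUs can be shared in fractions.

```python
def max_min_fair_share(capacity, demands):
    """Split `capacity` GPUs among users with max-min fairness.

    `demands` maps a user name to the number of GPUs requested.
    No user receives more than requested, and leftover capacity is
    redistributed evenly among users who still want more.
    """
    eps = 1e-9
    allocation = {name: 0.0 for name in demands}
    remaining = float(capacity)
    unsatisfied = {name for name, d in demands.items() if d > 0}

    while unsatisfied and remaining > eps:
        # Offer every unsatisfied user an equal slice of what is left.
        share = remaining / len(unsatisfied)
        satisfied_now = set()
        for name in unsatisfied:
            need = demands[name] - allocation[name]
            grant = min(share, need)
            allocation[name] += grant
            remaining -= grant
            if demands[name] - allocation[name] <= eps:
                satisfied_now.add(name)
        if not satisfied_now:
            # Everyone took a full slice, so the pool is exhausted.
            break
        unsatisfied -= satisfied_now

    return allocation


if __name__ == "__main__":
    # Hypothetical teams competing for an 8-GPU cluster.
    print(max_min_fair_share(8, {"team-a": 2, "team-b": 4, "team-c": 10}))
```

In this example, the small request from `team-a` is met in full, and the two larger requests split the remaining six GPUs evenly, so no single team can starve the others.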
Learn more about the Run:ai GPU virtualization platform.