AWS Deep Learning

Choosing the Best Option for You

What is AWS Deep Learning?

There is increasing demand for deep learning technology, which can discover complex patterns in images, text, speech, and other data, and can power a new generation of applications and data analysis systems.

Many organizations use cloud computing for deep learning. Cloud platforms are well suited to ingesting, storing, and processing the large data volumes deep learning requires, and to performing large-scale training of deep learning models across multiple GPUs. With cloud deep learning, you can request as many GPU machines as you need and scale up and down on demand.

Amazon Web Services (AWS) provides an extensive ecosystem of services to support deep learning applications. This article introduces the value proposition of Amazon Web Services for deep learning, including storage resources, fast compute instances with GPU hardware, and high-performance networking.

AWS also provides end-to-end deep learning solutions, including SageMaker and Deep Learning Containers. Read on to learn more about these solutions and more.

This is part of an extensive series of guides about Cloud Deep Learning.

AWS Technology Building Blocks for Deep Learning

Any deep learning project requires three essential resources—storage, compute, and networking. Here are the Amazon services typically used to power deep learning deployments in each of these three categories.

AWS Storage and Networking Resources for Deep Learning

Amazon Simple Storage Service (S3)

You can use Amazon S3 to store a massive amount of data for your deep learning projects, at a low cost. S3 can be the basis for data science tasks like data ingestion, extract, transform and load (ETL), ad-hoc data querying and data wrangling. You can also connect data analysis and visualization tools to S3 to make sense of your data before using it in a deep learning project.
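For example, here is a minimal sketch of moving a dataset in and out of S3 with the boto3 SDK; the bucket and key names are hypothetical:

    import boto3

    s3 = boto3.client("s3")

    # Upload a local dataset to S3 (bucket and key names are placeholders)
    s3.upload_file("train.csv", "my-dl-bucket", "datasets/train.csv")

    # Later, pull it back down to a training machine
    s3.download_file("my-dl-bucket", "datasets/train.csv", "/data/train.csv")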

Amazon Elastic Block Storage (EBS)

When training is performed, data is typically streamed from S3 to EBS volumes that are attached to the training machines in Amazon EC2. This provides low-latency access to data during model training.
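As a rough sketch, you could provision and attach an EBS volume to a training instance with boto3; the instance ID, size, and device name below are placeholders:

    import boto3

    ec2 = boto3.client("ec2")

    # Create a 500 GiB gp3 volume in the same AZ as the training instance
    vol = ec2.create_volume(AvailabilityZone="us-east-1a", Size=500, VolumeType="gp3")
    ec2.get_waiter("volume_available").wait(VolumeIds=[vol["VolumeId"]])

    # Attach it to the (hypothetical) training instance
    ec2.attach_volume(
        VolumeId=vol["VolumeId"],
        InstanceId="i-0123456789abcdef0",
        Device="/dev/sdf",
    )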

Amazon Elastic File System (EFS)

Amazon EFS is often the best storage option for large-scale batch processing, or when multiple training jobs need access to the same data. It lets developers and data scientists access large amounts of data directly from their workstations or training instances, with elastically growing capacity and no network file shares to manage.
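A minimal sketch of creating a shared EFS file system with boto3, assuming placeholder subnet and security group IDs:

    import boto3

    efs = boto3.client("efs")

    # Create a shared file system for training data
    fs = efs.create_file_system(
        CreationToken="dl-training-data",
        PerformanceMode="generalPurpose",
        Encrypted=True,
    )

    # Expose it in a VPC subnet so training instances can mount it over NFS
    efs.create_mount_target(
        FileSystemId=fs["FileSystemId"],
        SubnetId="subnet-0123456789abcdef0",      # placeholder
        SecurityGroups=["sg-0123456789abcdef0"],  # placeholder
    )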

Amazon FSx for Lustre is another high-performance file system solution suitable for compute-intensive workloads like deep learning.

Elastic Fabric Adapter (EFA)

Amazon EFA is a special network interface designed for high performance computing (HPC). It bypasses the operating system kernel, allowing ultra-fast, low-latency communication between compute instances in large distributed computing jobs.
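As an illustration, here is a sketch of launching EFA-enabled instances with boto3; the AMI, subnet, security group, and placement group names are placeholders:

    import boto3

    ec2 = boto3.client("ec2")

    # Launch two EFA-enabled instances into a cluster placement group
    resp = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # placeholder AMI
        InstanceType="p3dn.24xlarge",     # an EFA-capable instance type
        MinCount=2,
        MaxCount=2,
        NetworkInterfaces=[{
            "DeviceIndex": 0,
            "InterfaceType": "efa",
            "SubnetId": "subnet-0123456789abcdef0",  # placeholder
            "Groups": ["sg-0123456789abcdef0"],      # placeholder
        }],
        Placement={"GroupName": "dl-cluster-pg"},    # placeholder placement group
    )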

AWS Compute Resources for Deep Learning

Neural network models typically require millions of matrix and vector operations. These operations can easily be parallelized, and this is why GPU hardware, which has a large number of cores, can provide a massive performance improvement.

Amazon has released four generations of GPU instances; the latest generation, called P4, was launched in November 2020.

Amazon EC2 P2 Instances

P2 instances provide the following capabilities:

  • Up to 16 NVIDIA K80 GPUs (up to 192 GB of GPU memory)
  • 64 vCPUs
  • 732 GiB of memory
  • GPUDirect communication, allowing up to 16 GPUs to work together

Amazon EC2 P3 Instances

P3 instances provide the following capabilities:

  • Up to 8 NVIDIA V100 GPUs (up to 256 GB of GPU memory)
  • 1.8 TB SSD storage supporting NVMe
  • Elastic Fabric Adapter (EFA) support, accelerating distributed machine learning

Amazon EC2 P4 Instances

P4 instances provide the following capabilities:

  • 8 NVIDIA A100 Tensor Core GPUs, delivering 2.5 petaFLOPS of performance and a total of 320 GB of GPU memory
  • NVLink GPU interconnect, supporting NVIDIA GPUDirect
  • 1.1 TB of system memory
  • 8 TB of SSD storage supporting the fast NVMe protocol, with 16 GB/s read throughput
  • Four network connections of 100 Gbps each
  • Ultra-fast link to Amazon EBS storage with 19 Gbps bandwidth and up to 80K IOPS

Amazon EC2 G4 Instances

G4 instances are a more cost-effective option, offering good performance for deep learning inference applications. They come with:

  • Up to 64 virtual CPUs
  • Up to 4 NVIDIA T4 GPUs
  • Up to 256 GB of RAM
  • Up to 900 GB of NVMe storage
  • Up to 100 Gbps of network throughput
  • Free NVIDIA GRID and gaming drivers
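The specifications above can also be queried programmatically. Here is a small boto3 sketch that pulls GPU details for these instance families from the EC2 API:

    import boto3

    ec2 = boto3.client("ec2")

    # Query GPU details for each instance family discussed above
    resp = ec2.describe_instance_types(
        InstanceTypes=["p2.16xlarge", "p3dn.24xlarge", "p4d.24xlarge", "g4dn.12xlarge"]
    )
    for it in resp["InstanceTypes"]:
        gpu = it["GpuInfo"]["Gpus"][0]
        print(
            it["InstanceType"],
            f'{gpu["Count"]}x {gpu["Manufacturer"]} {gpu["Name"]}',
            f'{it["GpuInfo"]["TotalGpuMemoryInMiB"] // 1024} GiB GPU memory',
        )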

AWS Deep Learning Services

Beyond offering the building blocks for deep learning applications, Amazon also offers end-to-end deep learning solutions. We’ll cover four options.

Amazon SageMaker

Amazon SageMaker is a fully managed machine learning service, which enables data scientists and developers to create and train machine learning models, including deep learning architectures, and deploy them into a hosted production environment.

SageMaker provides an integrated Jupyter notebook, allowing data scientists to easily access data sources without needing to manage server infrastructure. It makes it easy to run common ML and DL algorithms, pre-optimized to run in a distributed environment.
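For example, here is a minimal sketch of launching a distributed training job with the SageMaker Python SDK; the script name, role ARN, and S3 paths are placeholders:

    from sagemaker.pytorch import PyTorch

    # Define a training job that runs a user-provided script on managed GPU instances
    estimator = PyTorch(
        entry_point="train.py",  # your training script (placeholder)
        role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role ARN
        framework_version="1.8.1",
        py_version="py3",
        instance_count=2,               # distributed training across two nodes
        instance_type="ml.p3.8xlarge",
    )

    # Start training against data stored in S3 (placeholder bucket)
    estimator.fit({"training": "s3://my-dl-bucket/datasets/"})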

AWS Deep Learning AMI (DLAMI)

AWS DLAMI is a custom EC2 machine image that can be used with multiple instance types, including simple CPU instances and fast GPU instances like P4. Developers and data scientists can use it to instantly set up a pre-configured DL environment on Amazon, including CUDA, cuDNN, and popular frameworks like PyTorch, TensorFlow, and Horovod.
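A sketch of finding a recent DLAMI and launching a GPU instance from it with boto3; the name filter and key pair below are illustrative placeholders:

    import boto3

    ec2 = boto3.client("ec2")

    # Find a recent Deep Learning AMI published by AWS (name filter is illustrative)
    images = ec2.describe_images(
        Owners=["amazon"],
        Filters=[{"Name": "name", "Values": ["Deep Learning AMI (Ubuntu 18.04)*"]}],
    )["Images"]
    latest = max(images, key=lambda img: img["CreationDate"])

    # Launch a GPU instance from it
    resp = ec2.run_instances(
        ImageId=latest["ImageId"],
        InstanceType="p3.2xlarge",
        MinCount=1,
        MaxCount=1,
        KeyName="my-keypair",  # placeholder key pair
    )
    print(resp["Instances"][0]["InstanceId"])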

AWS Deep Learning Containers

AWS Deep Learning Containers are pre-built Docker images that provide a complete deep learning development environment. They come pre-installed with TensorFlow and PyTorch, and can be deployed on SageMaker or on Amazon container services, including EKS and ECS. Deep Learning Containers are free to use; you pay only for the Amazon resources needed to run them.
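For example, the SageMaker Python SDK can look up the Deep Learning Container image URI for a given framework and instance type; the region and versions below are illustrative:

    import sagemaker

    # Look up the Deep Learning Container image for PyTorch training
    uri = sagemaker.image_uris.retrieve(
        framework="pytorch",
        region="us-east-1",
        version="1.8.1",
        py_version="py3",
        image_scope="training",
        instance_type="ml.p3.2xlarge",
    )
    print(uri)  # ECR URI you can pull or pass to a SageMaker job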

Amazon Elastic Inference

Elastic Inference is a way to attach GPU-powered acceleration to regular Amazon EC2 instances, much as you would add a GPU to a CPU-based machine. It can provide significant cost savings by letting you run deep learning inference, including SageMaker endpoints, on regular compute instances, which are significantly cheaper than GPU instances.
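As a sketch, the SageMaker SDK lets you attach an Elastic Inference accelerator when deploying a model to a CPU endpoint; the model artifact, role ARN, and scripts are placeholders, and note that Elastic Inference supports only specific framework versions:

    from sagemaker.pytorch import PyTorchModel

    # Package a trained model (placeholder S3 artifact and role ARN)
    model = PyTorchModel(
        model_data="s3://my-dl-bucket/model/model.tar.gz",
        role="arn:aws:iam::123456789012:role/SageMakerRole",
        entry_point="inference.py",  # placeholder serving script
        framework_version="1.5.1",   # an EI-compatible PyTorch version
        py_version="py3",
    )

    # Deploy to an inexpensive CPU instance with an attached EI accelerator
    predictor = model.deploy(
        initial_instance_count=1,
        instance_type="ml.c5.large",
        accelerator_type="ml.eia2.medium",
    )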

Deep Learning in the Cloud with Run:AI

Run:AI automates resource management and orchestration for machine learning infrastructure. With Run:AI, you can automatically run as many compute intensive experiments as needed.

Our AI orchestration platform for GPU-based compute clusters running AI/ML workloads provides:

  • Advanced queueing and fair scheduling to allow users to easily and automatically share clusters of GPUs,
  • Distributed training on multiple GPU nodes to accelerate model training times,
  • Fractional GPUs to seamlessly run multiple workloads on a single GPU of any type,
  • Visibility into workloads and resource utilization to improve user productivity.

Run:AI simplifies machine learning infrastructure orchestration, helping data scientists accelerate their productivity and improve the quality of their models.

Learn more about the Run:AI GPU virtualization platform.