Google Cloud Platform (GCP) is the world’s third largest cloud provider. Google offers a number of virtual machines (VMs) that provide graphical processing units (GPUs), including the NVIDIA Tesla K80, P4, T4, P100, and V100.
You can use NVIDIA GPUs on GCP for large scale cloud deep learning projects, analytics, physical object simulation, video transcoding, and molecular modeling. GCP also provides virtual NVIDIA GRID workstations, which can let an organization’s employees run graphics-intensive workloads remotely.
Google Cloud provides several GPU options. You can use them with two types of Google instances: general-purpose N1 VMs, to which you attach GPUs, and accelerator-optimized VMs, which come with GPUs built in.

The available GPUs fall into two groups:

GPUs suitable for model training, inference, and high performance computing: the NVIDIA Tesla V100, P100, and K80.

GPUs suitable for inference, training, remote visualization, and transcoding: the NVIDIA Tesla T4 and P4.
Related content: read our guides to deep learning on other cloud providers.
Google Cloud provides another hardware acceleration option—the Tensor Processing Unit (TPU). While not strictly a GPU, TPUs are a powerful alternative for machine learning workloads, especially deep learning.
A TPU is an application-specific integrated circuit (ASIC) developed by Google specifically to accelerate machine learning. Google provides TPUs on demand as a cloud deep learning service called Cloud TPU.
Cloud TPU is tightly integrated with Google's open source machine learning (ML) framework, TensorFlow, which provides dedicated APIs for TPU hardware. Cloud TPU lets you create TensorFlow compute clusters that include TPUs, GPUs, and regular CPUs.
Cloud TPU is mainly suitable for machine learning models based on matrix calculations, models that require weeks or months to train, models with large datasets or a large number of variables, and those that run a training loop many times (as in neural networks).
Cloud TPU is not suitable for workloads dominated by element-wise algebra or frequent branching, models that access memory sparsely, or those that require high-precision arithmetic operations.
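To illustrate the kind of matrix calculation TPUs are built to accelerate, the core of a neural network layer's forward pass is one large matrix multiplication. Here is a minimal NumPy sketch; the shapes and variable names are illustrative:

```python
import numpy as np

# A dense (fully connected) layer forward pass is a single big matrix
# multiply plus a bias, exactly the operation TPU matrix units accelerate.
rng = np.random.default_rng(0)
batch, in_features, out_features = 128, 784, 256

x = rng.standard_normal((batch, in_features))        # input activations
w = rng.standard_normal((in_features, out_features)) # layer weights
b = np.zeros(out_features)                           # biases

y = x @ w + b       # the matrix calculation a TPU parallelizes in hardware
print(y.shape)      # (128, 256)
```

A training loop repeats this multiplication (and its gradients) many times over large datasets, which is why matrix-heavy models with long training runs benefit most from TPUs.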
Related content: read our complete guide to Google TPU.
Here is how to create a Google Cloud virtual machine (VM) with an attached NVIDIA A100 GPU:
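As a sketch, the process can be reduced to a single gcloud command. The instance name, zone, and image family below are illustrative; note that on Google Cloud, A100 GPUs come built into the accelerator-optimized A2 machine types rather than being attached separately:

```shell
# Create a VM with one NVIDIA A100 GPU (a2-highgpu-1g includes the GPU).
# Instance name, zone, and image are examples; pick a zone that offers A2 VMs.
gcloud compute instances create my-a100-vm \
    --zone=us-central1-a \
    --machine-type=a2-highgpu-1g \
    --image-family=debian-11 \
    --image-project=debian-cloud \
    --maintenance-policy=TERMINATE \
    --restart-on-failure
```

After the VM boots, you still need to install the NVIDIA driver and CUDA toolkit before the GPU is usable, unless you start from an image that bundles them.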
That’s it! This process spins up a Google Cloud VM with an attached NVIDIA GPU.
Here are two tips that can help you improve GPU performance in a Google Cloud VM.
Autoboost is a feature of NVIDIA Tesla K80 GPUs that automatically adjusts the clock frequency to find the best setting for your particular application. However, the constant frequency adjustments can degrade GPU performance when running on Google infrastructure.
If you are running an NVIDIA Tesla K80 GPU on Compute Engine, it is recommended that you disable autoboost, using the following command (on Linux):
sudo nvidia-smi --auto-boost-default=DISABLED
When using a Tesla K80, you should also set the GPU clock speed to the highest supported frequency, using this command:
sudo nvidia-smi --applications-clocks=2505,875
To make distributed workloads run faster with NVIDIA Tesla T4 or V100 GPUs, use a configuration that supports the maximum network bandwidth of 100 Gbps.
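As a hedged sketch, based on Google's published requirements for 100 Gbps: the full bandwidth requires an N1 VM with enough GPUs (for example, eight V100s or four T4s) and the Google Virtual NIC (gVNIC). The instance name, zone, machine type, and image below are illustrative:

```shell
# Illustrative: an N1 VM with 8 x V100 and gVNIC, a combination eligible
# for up to 100 Gbps of network bandwidth. Names and zone are examples;
# the boot image must include the gVNIC (gve) driver.
gcloud compute instances create my-distributed-vm \
    --zone=us-central1-a \
    --machine-type=n1-highmem-96 \
    --accelerator=type=nvidia-tesla-v100,count=8 \
    --network-interface=nic-type=GVNIC \
    --image-family=debian-11 \
    --image-project=debian-cloud \
    --maintenance-policy=TERMINATE
```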
See additional best practices from Google for using the maximum 100 Gbps bandwidth.
Run:AI automates resource management and workload orchestration for machine learning infrastructure. With Run:AI, you can automatically run as many compute-intensive experiments as needed, managing large numbers of GPUs in Google Cloud and other public clouds.
Run:AI simplifies machine learning infrastructure pipelines, helping data scientists accelerate their productivity and the quality of their models.
Learn more about the Run:AI GPU virtualization platform.