A Tensor Processing Unit (TPU) is an application specific integrated circuit (ASIC) developed by Google to accelerate machine learning. Google offers TPUs on demand, as a cloud deep learning service called Cloud TPU.
Cloud TPU is tightly integrated with TensorFlow, Google’s open source machine learning (ML) framework. You can use dedicated TensorFlow APIs to run workloads on TPU hardware. Cloud TPU lets you create clusters of TensorFlow computing units, which can also include CPUs and graphics processing units (GPUs).
In this article, you will learn:
- What is Google Cloud TPU?
- When to Use TPUs
- Cloud TPU Architecture
- TPU Versions
- Cloud TPU Performance Best Practices
- XLA Compiler Performance
- Model Processing Performance
- Consequences of Tiling
Cloud TPU is optimized for the following scenarios:
- Machine learning models primarily driven by matrix computations
- Main training loop does not use custom TensorFlow operations
- Long running models that take weeks or months to train
- Models with very large batch sizes
- Models that run the entire training loop multiple times—this is common for neural networks
Cloud TPUs are not recommended for the following scenarios:
- Models that use vector-wise linear algebra or element-wise algebra (as opposed to matrix calculations)
- Models that access memory sparsely
- Models that use arithmetic operations requiring a high level of precision
- Models using custom TensorFlow operations, especially if they run in the main training loop
Each TPU core has three types of processing units:
- Scalar processor
- Vector processor
- Matrix units (MXU)—provide most of the computing power of the TPU chip. Each MXU can perform 16,000 multiply-accumulate operations per cycle
MXUs use bfloat16, a 16-bit floating point representation that retains float32’s 8-bit exponent, trading mantissa precision for dynamic range. This makes machine learning training more numerically stable than the traditional IEEE half-precision (fp16) representation.
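As a rough illustration (plain Python, no TPU required): a float32 value can be truncated to bfloat16 by zeroing its low 16 bits, which keeps float32’s 8-bit exponent, and therefore its range, while reducing the mantissa to 7 bits. The helper below is a sketch of the format, not a TPU API:

```python
import struct

def to_bfloat16(x: float) -> float:
    # bfloat16 keeps float32's sign bit and 8-bit exponent but only 7
    # mantissa bits; a simple (truncating) conversion just zeroes the
    # low 16 bits of the float32 bit pattern.
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    return struct.unpack('<f', struct.pack('<I', bits & 0xFFFF0000))[0]

print(to_bfloat16(3.14159))       # 3.140625 -- precision is reduced...
print(to_bfloat16(1e38) > 65504)  # True -- ...but the value survives, while
                                  # fp16 overflows past its max of ~65504
```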
Each core in a TPU device can perform calculations (known as XLA operations) individually. High bandwidth interconnects enable the chips to directly communicate with each other.
Cloud TPU offers two deployment options:
- Single TPU devices—not interconnected through a dedicated high-speed network. You cannot combine multiple single TPU devices (see the version descriptions below) to run the same workload.
- TPU Pods—connect multiple TPU devices with a high-speed network interface. This provides ML workloads with a massive pool of TPU cores and memory, and makes it possible to combine TPU versions.
A TPU version specifies the hardware characteristics of the device. The table below provides details for the latest two generations.
| Version | High Bandwidth Memory (HBM) per Core | Total Pod Memory | Cores per Pod |
|---------|--------------------------------------|------------------|---------------|
| TPU v2  | 8 GB                                 | 4 TB             | Up to 512     |
| TPU v3  | 16 GB                                | 32 TB            | Up to 2048    |
Google has announced a fourth-generation TPU ASIC, TPU v4, which provides more than double the matrix multiplication capacity of v3, greatly improved memory bandwidth, and improved interconnect technology. In MLPerf training benchmarks at a similar scale to previous competitions, TPU v4 delivered on average 2.7x better performance than TPU v3. Full details of TPU v4 have not yet been released.
Cloud TPU Performance Best Practices
Here are a few best practices you can use to get the most out of TPU resources on Google Cloud.
XLA Compiler Performance
Accelerated Linear Algebra (XLA) is a machine learning compiler that can generate executable binaries for TPU, CPU, GPU and other hardware platforms. XLA comes with TensorFlow’s standard codebase. Cloud TPU TensorFlow models are converted to XLA graphs, and XLA graphs are compiled into TPU executables.
The hardware used for Cloud TPU is distinctly different from that used for CPUs and GPUs. At a high level, a CPU runs only a few high-performance threads, while a GPU runs many threads, each with relatively low per-thread performance. By contrast, a Cloud TPU with its 128 x 128 matrix unit runs one very powerful thread capable of 16K operations per cycle; conceptually, this one thread is composed of 128 x 128 tiny processing elements connected in a pipeline.
Therefore, when laying out data in TPU memory, prefer dimension sizes that are multiples of 8 (for floating-point values), and when running matrix operations, use multiples of 128.
Model Processing Performance
Here is how to resolve two common problems when training models on a TPU:
Data preprocessing takes too long
The TensorFlow TPU software stack lets CPUs perform complex data preprocessing before sending the data to the TPU. However, TPUs are incredibly fast, and complex input data processing can quickly become a bottleneck.
Google provides a Cloud TPU analysis tool that lets you measure whether input processing is causing a bottleneck. In that case, you can look for optimizations, such as performing specific preprocessing operations offline on a one-time basis, to avoid the slowdown.
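The general idea of decoupling CPU-side preprocessing from accelerator-side consumption can be sketched in plain Python using a background producer thread and a bounded queue. This is analogous in spirit to `tf.data`’s prefetching; the `prefetch` helper below is an illustrative sketch, not a TensorFlow API:

```python
import queue
import threading

def prefetch(generator, buffer_size=4):
    # Run the producer (preprocessing) in a background thread so it
    # overlaps with the consumer (the accelerator step). The bounded
    # queue provides backpressure when the consumer falls behind.
    q = queue.Queue(maxsize=buffer_size)
    sentinel = object()

    def producer():
        for item in generator:
            q.put(item)   # blocks when the buffer is full
        q.put(sentinel)   # signal end of input

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is sentinel:
            return
        yield item

# Batches are produced ahead of time while the consumer works.
print(list(prefetch(iter(range(5)))))  # [0, 1, 2, 3, 4]
```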
Sharding makes batch size too small
Your model’s batch size is automatically sharded, or split, across the 8 cores of a TPU device. For example, if your global batch size is 128, the true batch size running on each TPU core is only 16, which utilizes the TPU at only a fraction of its capacity.
To make optimal use of TPU memory, use the largest batch size which, when divided by 8, still fits in your TPU’s memory. Batch sizes should always be divisible by 128, because the TPU processes data in 128 x 128 tiles.
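The sharding arithmetic above can be made concrete with a small sketch (the helper name and constants are illustrative, based on the figures in the text):

```python
TPU_CORES = 8    # cores in a single TPU device, per the text above
MXU_TILE = 128   # the MXU operates on 128 x 128 tiles

def per_core_batch(global_batch, cores=TPU_CORES):
    # The global batch is sharded evenly across all cores.
    assert global_batch % cores == 0, "batch must divide evenly across cores"
    return global_batch // cores

# A global batch of 128 leaves each core with only 16 examples.
print(per_core_batch(128))   # 16
# A global batch of 1024 gives each core a full 128-wide batch,
# matching the MXU tile size.
print(per_core_batch(1024))  # 128
```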
Consequences of Tiling
Cloud TPU arrays are padded (or “tiled”), filling one dimension to the nearest multiple of 8, and the other dimension to a multiple of 128. The XLA compiler uses heuristics to arrange the data in an efficient manner, but this can sometimes go wrong. Try different model configurations to see which gives you the best performance.
Take into account memory that is wasted on padding. To make the most efficient use of TPUs, structure your model dimension sizes to fit the dimensions expected by the TPU, to minimize tiling and wasted memory overhead.
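A quick way to estimate padding waste is to round each dimension up to the tile boundaries described above. For illustration, the sketch below assumes rows pad to a multiple of 8 and columns to a multiple of 128; the XLA compiler chooses the actual arrangement:

```python
import math

def tpu_padded_shape(rows, cols):
    # Assumed layout: rows pad to a multiple of 8, columns to a
    # multiple of 128 (XLA's heuristics pick the real layout).
    return (math.ceil(rows / 8) * 8, math.ceil(cols / 128) * 128)

def padding_overhead(rows, cols):
    # Fraction of the padded array occupied by wasted padding elements.
    pr, pc = tpu_padded_shape(rows, cols)
    return (pr * pc - rows * cols) / (pr * pc)

print(tpu_padded_shape(50, 200))                  # (56, 256)
print(f"{padding_overhead(50, 200):.0%} wasted")  # 30% wasted
print(f"{padding_overhead(56, 256):.0%} wasted")  # 0% wasted
```

Dimensions that already sit on tile boundaries, like 56 x 256 above, waste no memory on padding, which is why sizing model dimensions to these multiples pays off.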
Google TPU with Run:AI
Run:AI automates resource management and workload orchestration for machine learning infrastructure. With Run:AI, you can automatically run as many compute-intensive experiments as needed on GPU and CPU hardware.
Our AI Orchestration Platform for GPU-based computers running AI/ML workloads provides:
- Advanced queueing and fair scheduling to allow users to easily and automatically share clusters of GPUs,
- Distributed training on multiple GPU nodes to accelerate model training times,
- Fractional GPUs to seamlessly run multiple workloads on a single GPU of any type,
- Visibility into workloads and resource utilization to improve user productivity.
Run:AI simplifies machine learning infrastructure orchestration, helping data scientists accelerate their productivity and the quality of their models.
Learn more about the Run.ai GPU virtualization platform.