Azure Machine Learning

From Basic ML to Distributed Deep Learning Models

What is Azure Machine Learning (ML)?

Data scientists use machine learning (ML) techniques to train algorithms that learn from data to predict future behavior, outcomes, and trends. ML enables computers to learn without being explicitly programmed.

Azure ML is a cloud solution that supports all types of ML, including traditional supervised and unsupervised machine learning models and newer deep learning (DL) techniques. The Azure Machine Learning service provides several ways to work with ML models:

  • Programmatically via the Python SDK or R SDK
  • Using a graphical UI in an Azure ML Workspace
  • Via the low-code or no-code options in Azure ML Studio

This is part of our series of articles about cloud deep learning.

Related content: learn about AWS deep learning options

In this article, you will learn:

  • How Does Azure Machine Learning Work?
  • Distributed Training of Deep Learning Models on Azure
  • Large-Scale Machine Learning and Deep Learning Training with Run:AI

How Does Azure Machine Learning Work?

The top-level entity in Azure Machine Learning is a workspace. It contains everything you need to work with machine learning models in Azure:

  • Cloud resources, including compute instances, used to train the model and run it in production
  • Assets and artifacts created during the machine learning process

The workspace also uses several other Azure resources:

  • Azure Container Registry (ACR)—machine learning models and their associated code are stored in the registry as Docker images.
  • Azure Storage account—this is where your machine learning datasets and Jupyter notebooks are stored.
  • Azure Key Vault—used to manage secrets and other sensitive data needed by resources in your workspace.
  • Azure Application Insights—allows you to monitor the execution and performance of your ML models.
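
As a brief illustration, here is a minimal sketch of creating a workspace with the v1 Azure ML Python SDK (the azureml-core package); the workspace name, resource group, and region below are placeholders:

```python
# Minimal sketch: creating a workspace with the Azure ML Python SDK (v1).
# All names and IDs below are placeholders; substitute your own values.
from azureml.core import Workspace

ws = Workspace.create(
    name="my-ml-workspace",
    subscription_id="<subscription-id>",
    resource_group="my-resource-group",
    location="eastus",
    create_resource_group=True,
)

# Save the workspace configuration locally, so later scripts can reconnect
# with Workspace.from_config() instead of repeating the details above.
ws.write_config()
```

Creating a workspace this way also provisions the associated storage account, key vault, and Application Insights resources; by default, the container registry is created lazily the first time it is needed.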

Virtual Machines

Azure Machine Learning provides two types of fully managed virtual machines (VMs) configured for machine learning jobs.

  • Compute instance—a VM preconfigured with tools and environments for machine learning. A compute instance can serve as a virtual workstation for an ML developer and can run Jupyter notebooks with no further configuration. Alternatively, it can be used as a compute target for training or inference.
  • Compute cluster—a group of VMs with auto-scaling capabilities. Compute clusters are typically used as targets for large-scale ML jobs, training on large datasets, or production workloads, and can add compute nodes as required by the ML job.
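
For example, a compute cluster can be provisioned with a few lines of the v1 Python SDK; the cluster name and VM size below are illustrative choices, not requirements:

```python
# Sketch: provisioning an auto-scaling compute cluster (v1 Python SDK).
# The cluster name and VM size are examples; pick sizes available in your region.
from azureml.core import Workspace
from azureml.core.compute import AmlCompute, ComputeTarget

ws = Workspace.from_config()

compute_config = AmlCompute.provisioning_configuration(
    vm_size="STANDARD_NC6",             # a GPU-enabled VM size (NC series)
    min_nodes=0,                        # scale down to zero when idle
    max_nodes=4,                        # scale out as jobs demand
    idle_seconds_before_scaledown=1800,
)

cluster = ComputeTarget.create(ws, "gpu-cluster", compute_config)
cluster.wait_for_completion(show_output=True)
```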

Datasets and Datastores

Azure Machine Learning provides an additional entity called a dataset, which makes your ML data easy to access and use. When creating a dataset, you provide a reference to your source data along with a copy of its metadata. You do not need to duplicate your ML datasets in Azure Machine Learning; you simply point to them, which saves storage costs and improves security.

Datasets are securely connected to Azure storage through an entity called a datastore. The datastore holds connection information securely, and allows the dataset to connect to your original data, wherever it is located. It retrieves secrets and credentials from the Azure Key Vault instance which is part of the workspace. This fully integrated setup allows you to access storage securely without needing to write scripts, manage complex configuration, or perform any manual action.
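
As a hedged sketch of how these pieces fit together with the v1 Python SDK, the following registers a blob container as a datastore and defines a dataset that merely points at a file in it; the account, container, and file names are hypothetical:

```python
# Sketch: registering a datastore and creating a dataset (v1 Python SDK).
# Account, container, and file names are hypothetical.
from azureml.core import Dataset, Datastore, Workspace

ws = Workspace.from_config()

# The datastore holds the connection details; credentials are kept in the
# workspace's Key Vault rather than in your code or the dataset itself.
datastore = Datastore.register_azure_blob_container(
    workspace=ws,
    datastore_name="training_data",
    container_name="training-data",
    account_name="<storage-account-name>",
    account_key="<storage-account-key>",
)

# The dataset is only a reference plus metadata; no data is copied.
dataset = Dataset.Tabular.from_delimited_files(path=(datastore, "iris/iris.csv"))
dataset = dataset.register(workspace=ws, name="iris-training")
```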

Models

In Azure Machine Learning, a model is simply code that accepts data as input and returns outputs. Models can be added to the system in two ways:

  • Importing a pre-trained model built with other machine learning frameworks, including scikit-learn, PyTorch, TensorFlow, and XGBoost.
  • Providing code for a new model and submitting it for training on compute targets in Azure Machine Learning. Once it is trained, it can be registered in the workspace as a model (see the sketch below).
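
Both paths end the same way: the trained model is registered in the workspace. A minimal sketch with the v1 Python SDK, using placeholder names and paths:

```python
# Sketch: registering a model file in the workspace (v1 Python SDK).
# The file path and model name are placeholders.
from azureml.core import Workspace
from azureml.core.model import Model

ws = Workspace.from_config()
model = Model.register(
    workspace=ws,
    model_path="outputs/model.pkl",   # local file (or directory) to upload
    model_name="my-classifier",
)
```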

Deployment

Any model in the workspace can be deployed for production use as a service endpoint. This requires three components:

  • Environment—specifies dependencies the model needs to run at the inference stage.
  • Scoring code—receives requests, runs the machine learning model on them, and returns the model's predictions.
  • Inference configuration—references the environment, scoring code, and any other resources the model needs to run as a managed service.
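
Putting the three components together, here is a hedged sketch of deploying a registered model to Azure Container Instances with the v1 Python SDK; the scoring script contents, environment file, and names are assumptions for illustration:

```python
# score.py -- a minimal scoring script. Azure ML calls init() once when the
# service starts and run() for each request. The model name is a placeholder,
# and we assume the model was saved with joblib.
import json
import joblib
from azureml.core.model import Model

def init():
    global model
    # Resolve the path of the registered model inside the service container.
    model = joblib.load(Model.get_model_path("my-classifier"))

def run(raw_data):
    data = json.loads(raw_data)["data"]
    return model.predict(data).tolist()
```

The inference configuration then ties the scoring script to an environment, and the model is deployed as a web service:

```python
# Sketch: deploying the registered model as an ACI web service (v1 SDK).
# The endpoint name and conda environment file are placeholders.
from azureml.core import Environment, Workspace
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AciWebservice

ws = Workspace.from_config()
model = Model(ws, name="my-classifier")

env = Environment.from_conda_specification("inference-env", "environment.yml")
inference_config = InferenceConfig(entry_script="score.py", environment=env)
deployment_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)

service = Model.deploy(ws, "my-endpoint", [model], inference_config, deployment_config)
service.wait_for_deployment(show_output=True)
print(service.scoring_uri)
```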

Thus, Azure Machine Learning lets you set up a complete machine learning environment, including compute resources, datasets, models, and endpoints that can help you deploy a model in production for external applications or end users.

Distributed Training of Deep Learning Models on Azure

Azure Machine Learning can also be used to train large-scale deep learning models. Below is a reference architecture provided by Microsoft, which shows how to distribute deep learning jobs across VM clusters with GPU support. The reference architecture refers to an image classification model, but it can be used for many other deep learning use cases.

(Figure: reference architecture for distributed training of deep learning models on Azure. Source: Azure)

The architecture comprises four key components:

  • Azure Machine Learning Compute—provides VMs that run parts of the distributed deep learning job, auto-scaling as necessary. Azure Machine Learning compute clusters can schedule tasks, collect results, adjust resources to actual loads, and manage errors. VMs that participate in the cluster can be GPU-enabled to accelerate deep learning calculations.
  • Standard blob storage—records results and stores execution logs.
  • Premium blob storage—used to store training data and enable high performance access during model training, which is needed for distributed training. The architecture uses Azure Blobfuse to mount blob storage on the compute instances, with local caching. In the first epoch of training, data is pulled from blob storage, and subsequently, data is accessed from local storage on the VM.
  • Azure Container Registry—holds Docker containers, pre-configured with the relevant deep learning framework, which run on the VM instances and perform the actual training.
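
To make this concrete, here is a hedged sketch of submitting a distributed training job to such a cluster with the v1 Python SDK; the cluster name, curated environment name, and training script are assumptions, and curated environment names vary by SDK version:

```python
# Sketch: submitting a distributed training job to a GPU cluster (v1 SDK).
# Cluster name, environment name, and script paths are placeholders.
from azureml.core import Environment, Experiment, ScriptRunConfig, Workspace
from azureml.core.runconfig import MpiConfiguration

ws = Workspace.from_config()
env = Environment.get(ws, name="AzureML-TensorFlow-2.4-GPU")

# One worker process per node, spread across four GPU nodes.
distributed_config = MpiConfiguration(process_count_per_node=1, node_count=4)

src = ScriptRunConfig(
    source_directory="./src",
    script="train.py",                  # your training script
    compute_target="gpu-cluster",
    environment=env,
    distributed_job_config=distributed_config,
)

run = Experiment(ws, "distributed-image-classification").submit(src)
run.wait_for_completion(show_output=True)
```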

Performance Considerations

Azure offers four families of GPU-enabled virtual machines suitable for training DL algorithms: NC, ND, NCv2, and NCv3. They provide successively more powerful NVIDIA GPUs: K80, P40, P100, and V100, respectively; see the official documentation for Azure GPU instances. It is best to start with a single instance and check whether it delivers sufficient training performance for your workload; if not, scale out to a cluster.

Scalability Considerations

Due to network overhead, the efficiency of distributed training is always lower than 100%; the main bottleneck is device-to-device synchronization. Distributed training is therefore best suited to large models and datasets that cannot be trained on a single VM and must be split across multiple devices and trained in parallel.

Storage Considerations

When training a deep learning model, you need to ensure that the model has high-performance access to the dataset. You may be running on a fast GPU instance, but if storage is too slow it becomes the bottleneck, leaving the GPU idle while it waits for data.

This is why the reference architecture recommends using two measures to improve data access performance:

  • Storing datasets on premium blob storage, which provides higher throughput and lower latency
  • Using a local cache mechanism, so that except for the first data load, the model reads data from local storage on the VM (see the sketch below)
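
One way to approximate this behavior from the v1 Python SDK is to hand the training job a file dataset configured for download, so the data is copied once to the node's local disk; the dataset, cluster, and argument names here are hypothetical:

```python
# Sketch: downloading a registered dataset to local VM storage before
# training (v1 Python SDK). Dataset and cluster names are hypothetical.
from azureml.core import Dataset, Experiment, ScriptRunConfig, Workspace

ws = Workspace.from_config()
dataset = Dataset.get_by_name(ws, name="training-images")  # a FileDataset

src = ScriptRunConfig(
    source_directory="./src",
    script="train.py",
    compute_target="gpu-cluster",
    # as_download() copies the files to the node's local disk once, so every
    # epoch after the first reads from fast local storage, not blob storage.
    arguments=["--data-dir", dataset.as_named_input("training").as_download()],
)

run = Experiment(ws, "cached-data-training").submit(src)
```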

To summarize, using the reference architecture in the figure above, you can run large-scale, distributed deep learning jobs on Azure with high performance, on a fully managed infrastructure that takes care of compute, storage, deployment, and monitoring.

Large-Scale Machine Learning and Deep Learning Training with Run:AI

Run:AI automates resource management and orchestration for machine learning infrastructure. With Run:AI, you can automatically run as many compute-intensive experiments as needed.

Our AI Orchestration Platform for GPU-based computers running AI/ML workloads provides:

  • Advanced queueing and fair scheduling to allow users to easily and automatically share clusters of GPUs,
  • Distributed training on multiple GPU nodes to accelerate model training times,
  • Fractional GPUs to seamlessly run multiple workloads on a single GPU of any type,
  • Visibility into workloads and resource utilization to improve user productivity.

Run:AI simplifies machine learning infrastructure orchestration, helping data scientists accelerate their productivity and the quality of their models.

Learn more about the Run:AI GPU virtualization platform.