Deep learning is a machine learning technique used to build artificial intelligence (AI) systems. It is based on the idea of artificial neural networks (ANN), designed to perform complex analysis of large amounts of data by passing it through multiple layers of neurons.
There is a wide variety of deep neural networks (DNN). Deep convolutional neural networks (CNN or DCNN) are the type most commonly used to identify patterns in images and video. DCNNs have evolved from traditional artificial neural networks, using a three-dimensional neural pattern inspired by the visual cortex of animals.
Deep convolutional neural networks are mainly used for applications such as object detection, image classification, and recommendation systems, and are sometimes also used for natural language processing.
The strength of DCNNs is in their layering. A DCNN uses a three-dimensional neural network to process the red, green, and blue channels of the image at the same time. This considerably reduces the number of artificial neurons required to process an image, compared to traditional feedforward neural networks.
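To make the saving concrete, the sketch below compares the number of trainable parameters in one fully connected layer with one convolutional layer applied to the same RGB input. The 224x224 image size, the 64 units or filters, and the 3x3 filter size are illustrative assumptions, not values taken from a specific network.

```python
# Illustrative comparison; image size, unit count and filter size are assumptions.
height, width, channels = 224, 224, 3   # RGB input image

def dense_weights(n_inputs, n_units):
    """Weights plus biases for a fully connected layer."""
    return n_inputs * n_units + n_units

def conv_weights(kernel, in_channels, n_filters):
    """Weights plus biases for a convolutional layer with square filters."""
    return kernel * kernel * in_channels * n_filters + n_filters

dense = dense_weights(height * width * channels, 64)
conv = conv_weights(3, channels, 64)

print(f"fully connected: {dense:,} parameters")
print(f"convolutional:   {conv:,} parameters")
```

The convolutional layer reuses the same small filter at every image position, which is why its parameter count does not grow with the image size.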
Deep convolutional neural networks receive images as an input and use them to train a classifier. In its convolutional layers, the network employs a special mathematical operation called a “convolution” instead of general matrix multiplication.
The architecture of a convolutional network typically consists of four types of layers: convolution, pooling, activation, and fully connected.
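The convolution layer applies a convolution filter to the image to detect features of the image. The filter slides across the image, and at each position the products of the filter weights and the underlying pixel values are summed into a single value of the output feature map.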
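The sliding-window computation of a convolution layer can be sketched in plain Python. This is a minimal illustration that assumes a single-channel image, no padding, and a stride of 1; the `convolve2d` helper and the edge-detecting filter values are hypothetical, chosen only for the example. As in most deep learning libraries, the filter is not flipped before it is applied.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Sum of element-wise products between the filter and the image patch.
            total = 0
            for a in range(kh):
                for b in range(kw):
                    total += image[i + a][j + b] * kernel[a][b]
            row.append(total)
        feature_map.append(row)
    return feature_map

# 4x4 image with a vertical edge between dark (0) and bright (9) columns.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]
# Hypothetical 2x2 filter that responds to left-to-right increases in brightness.
kernel = [
    [-1, 1],
    [-1, 1],
]
feature_map = convolve2d(image, kernel)
print(feature_map)  # [[0, 18, 0], [0, 18, 0], [0, 18, 0]]
```

The feature map is nonzero only in the column where the filter straddles the edge, which is the sense in which a filter "detects" a feature.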
The feature maps produced by the convolution are passed through a nonlinear activation layer, such as a Rectified Linear Unit (ReLU), which replaces negative values in the filtered image with zeros.
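As a small illustration, the rectification step can be written in a few lines of Python. The `relu` helper and the sample values are assumptions made for this example.

```python
def relu(feature_map):
    """Replace every negative value in a 2D feature map with zero."""
    return [[max(0, value) for value in row] for row in feature_map]

filtered = [[-3, 5], [2, -8]]   # hypothetical output of a convolution filter
activated = relu(filtered)
print(activated)  # [[0, 5], [2, 0]]
```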
The pooling layers gradually reduce the size of the image, keeping only the most important information. For example, for each group of 4 pixels, the pixel having the maximum value is retained (this is called max pooling), or only the average is retained (average pooling).
Pooling layers help control overfitting by reducing the number of calculations and parameters in the layers that follow.
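Max pooling and average pooling over groups of 4 pixels can be sketched as follows. This is a minimal example that assumes non-overlapping 2x2 blocks; the `pool2x2` and `average` helpers and the sample feature map are hypothetical.

```python
def pool2x2(feature_map, reduce_fn):
    """Apply `reduce_fn` to each non-overlapping 2x2 block of the feature map."""
    pooled = []
    for i in range(0, len(feature_map) - 1, 2):
        row = []
        for j in range(0, len(feature_map[0]) - 1, 2):
            block = [
                feature_map[i][j], feature_map[i][j + 1],
                feature_map[i + 1][j], feature_map[i + 1][j + 1],
            ]
            row.append(reduce_fn(block))
        pooled.append(row)
    return pooled

def average(values):
    return sum(values) / len(values)

feature_map = [
    [1, 3, 2, 0],
    [5, 7, 4, 4],
    [0, 2, 9, 1],
    [6, 0, 3, 3],
]
max_pooled = pool2x2(feature_map, max)      # keeps the largest value per block
avg_pooled = pool2x2(feature_map, average)  # keeps the mean value per block
print(max_pooled)  # [[7, 4], [6, 9]]
print(avg_pooled)  # [[4.0, 2.5], [2.0, 4.0]]
```

In both cases a 4x4 map shrinks to 2x2, so every later layer has a quarter as many values to process.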
After several iterations of convolution and pooling layers (in some deep convolutional neural network architectures this may happen dozens or even hundreds of times), the network ends with a traditional multilayer perceptron, or “fully connected” neural network.
In many CNN architectures, there are multiple fully connected layers, with activation layers in between them. Fully connected layers receive an input vector containing the flattened pixels of the image, which have been filtered, rectified, and reduced by the convolution, activation, and pooling layers. The softmax function is applied at the end to the outputs of the fully connected layers, giving the probability of each class the image may belong to, for example, whether it is a car, a boat, or an airplane.
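The final softmax step can be illustrated with a short Python sketch. The class names follow the example in the text; the raw scores are made-up values standing in for the outputs of the last fully connected layer.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    # Subtracting the maximum score avoids overflow in exp() without changing the result.
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["car", "boat", "airplane"]
scores = [2.0, 1.0, 0.1]   # hypothetical outputs of the last fully connected layer
probabilities = softmax(scores)
predicted = classes[probabilities.index(max(probabilities))]

for name, p in zip(classes, probabilities):
    print(f"{name}: {p:.3f}")
print("predicted class:", predicted)
```

The class with the highest score always receives the highest probability, so here the image would be labeled as a car.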
Below are five deep convolutional neural network architectures commonly used to perform object detection and image classification.
Region-based Convolutional Neural Network (R-CNN) is a network capable of accurately locating the objects to be identified in an image. However, it is very slow in the scanning phase and in the identification of regions.
The slow speed of this architecture is due to its use of the selective search algorithm, which extracts approximately 2000 regions from the starting image. It then runs a CNN on each region, and the outputs are fed to a support vector machine (SVM) to classify the region.
Fast R-CNN is a simplified R-CNN architecture, which can also identify regions of interest in an image but runs a lot faster. It improves performance by extracting features before it identifies regions of interest. It runs only one CNN over the entire image, instead of running a CNN on each of roughly 2000 overlapping regions. Instead of the computationally intensive SVM, a softmax function returns the identification probability. The downside is that Fast R-CNN can have lower accuracy than R-CNN in terms of recognizing the bounding boxes of objects in the image.
GoogLeNet, also called Inception v1, is a large-scale CNN architecture that won the ImageNet Challenge in 2014. It achieved a top-5 error rate of less than 7%, close to the level of human performance. The architecture consists of a 22-layer deep CNN built from modules of small convolutions, called “inception” modules, and other techniques that decrease the number of parameters from tens of millions in previous architectures to about four million. Batch normalization was added in later versions of Inception.
VGGNet is a deep convolutional neural network architecture with 16 weight layers in its widely used VGG-16 variant (13 convolutional and 3 fully connected). It uses 3x3 convolutions, and was trained on 4 GPUs for more than two weeks to achieve its performance. The downside of VGGNet is that, unlike GoogLeNet, it has 138 million parameters, which makes it expensive to run at the inference stage.
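The 138 million figure can be reproduced with simple arithmetic. The sketch below assumes the commonly cited VGG-16 layout (13 convolutional layers with 3x3 filters, 5 max pooling steps, and 3 fully connected layers), a 224x224 RGB input, and 1000 output classes; these details are assumptions for the example rather than values stated in this article.

```python
# Assumed VGG-16 layout: numbers are output channels of 3x3 convolutions,
# "M" marks a 2x2 max pooling step.
CONV_LAYOUT = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
               512, 512, 512, "M", 512, 512, 512, "M"]
FC_UNITS = [4096, 4096, 1000]   # assumes 1000 output classes

def count_parameters(input_size=224, input_channels=3):
    """Return (convolutional parameters, fully connected parameters)."""
    channels, size = input_channels, input_size
    conv_params = 0
    for layer in CONV_LAYOUT:
        if layer == "M":
            size //= 2   # pooling halves width and height and adds no parameters
        else:
            conv_params += 3 * 3 * channels * layer + layer   # weights + biases
            channels = layer
    fc_params = 0
    inputs = size * size * channels   # flattened feature maps
    for units in FC_UNITS:
        fc_params += inputs * units + units
        inputs = units
    return conv_params, fc_params

conv_params, fc_params = count_parameters()
print(f"convolutional layers:   {conv_params:,}")
print(f"fully connected layers: {fc_params:,}")
print(f"total:                  {conv_params + fc_params:,}")
```

Under these assumptions the total is about 138 million, and most of the parameters sit in the fully connected layers rather than in the convolutions.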
The Residual Neural Network (ResNet) is a CNN with up to 152 layers. ResNet uses shortcut (“skip”) connections that let the input of a block bypass some convolutional layers. Like later versions of Inception, it uses heavy batch normalization. ResNet's design lets it run many more convolutional layers without increasing complexity. It won the ImageNet Challenge in 2015, achieving an impressive top-5 error rate of 3.57% and beating human-level performance on that dataset.
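The idea of skipping layers can be sketched as a shortcut that adds a block's input to the output of its layers. The example below is a simplified illustration on plain vectors, not ResNet's actual implementation; `zero_layers` and `small_update` are hypothetical stand-ins for stacks of convolutional layers.

```python
def relu(vector):
    return [max(0.0, v) for v in vector]

def residual_block(x, layer_fn):
    """Return relu(layer_fn(x) + x): the layers' output plus the unchanged input."""
    fx = layer_fn(x)
    return relu([f + xi for f, xi in zip(fx, x)])

# Hypothetical stand-in for layers that have learned nothing (they output zeros).
def zero_layers(x):
    return [0.0 for _ in x]

# Hypothetical stand-in for layers that adjust the input slightly.
def small_update(x):
    return [0.5 * v for v in x]

x = [1.0, 2.0, 3.0]
passthrough = residual_block(x, zero_layers)
adjusted = residual_block(x, small_update)
print(passthrough)  # [1.0, 2.0, 3.0]
print(adjusted)     # [1.5, 3.0, 4.5]
```

Because the input is always added back, a block whose layers contribute nothing still passes its input through unchanged, which helps very deep stacks remain trainable.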
Deep convolutional neural networks are the state-of-the-art mechanism for classifying images. For example, they are used for medical image analysis and optical character recognition.
CNN classification of medical images can be more accurate than the human eye and can detect abnormalities in X-ray or MRI images. Such systems can analyze sequences of images (for example, tests taken over a long period of time) and identify subtle differences that human analysts might miss. This also makes it possible to perform predictive analysis.
Classification models for medical images are trained on large public health databases. The resulting models can be used on patient test results, to identify medical conditions and automatically generate a prognosis.
Optical character recognition (OCR) is used to identify symbols such as text or numbers in images. Traditionally OCR was performed using statistical or early machine learning techniques, but today many OCR engines use deep convolutional neural networks.
OCR powered by CNNs can be used to improve search within rich media content and to identify text in written documents, even those with poor quality or hard-to-recognize handwriting. This is especially important in the banking and insurance industries. Another application of deep learning OCR is automated signature recognition.
Run:AI automates resource management and orchestration for machine learning infrastructure. With Run:AI, you can automatically run as many compute intensive experiments as needed.
Run:AI simplifies machine learning infrastructure pipelines, helping data scientists improve their productivity and the quality of their models.