A Year in Computer Vision — Part 4 of 4

— Part Four: ConvNet Architectures, Datasets, Ungroupable Extras

MTank
Towards Data Science


The following piece is taken from a recent publication compiled by our research team relating to the field of Computer Vision. All parts are available through our website now, and Parts 1, 2 and 3 are also available on Medium.

The full publication is available for free on our website now via: www.themtank.org

We encourage readers to view the piece through our own website, as we include embedded content and easy navigational functions to make the report as dynamic as possible. Our website generates no revenue for the team and simply aims to make the materials as engaging and intuitive for readers as possible. Any feedback on the presentation there is wholeheartedly welcomed by us!

Please follow, share and support our work through whatever your preferred channels are (and clap to your heart's content!). Feel free to contact the editors with any questions or to see about potentially contributing to future works: info@themtank.com

Part Four: ConvNet Architectures, Datasets, Ungroupable Extras

ConvNet Architectures

ConvNet architectures have recently found many novel applications outside of Computer Vision, some of which will feature in our forthcoming publications. However, they continue to feature prominently in Computer Vision, with architectural advancements providing improvements in speed, accuracy and training for many of the aforementioned applications and tasks in this paper.

For this reason, ConvNet architectures are of fundamental importance to Computer Vision as a whole. The following features some noteworthy ConvNet architectures from 2016, many of which take inspiration from the recent success of ResNets.

  • Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning[131] — presents Inception-v4, a new Inception architecture building on the Inception-v2 and v3 from the end of 2015.[132] The paper also provides an analysis of using residual connections for training Inception networks, along with some Residual-Inception hybrid networks.
  • Densely Connected Convolutional Networks[133] or “DenseNets” take direct inspiration from the identity/skip connections of ResNets. The approach extends this concept to ConvNets by connecting each layer to every subsequent layer in a feed-forward fashion, with the feature maps of all preceding layers used as inputs, thus creating DenseNets.

“DenseNets have several compelling advantages: they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters”.[134]

Figure 16: Example of DenseNet Architecture

Note: A 5-layer dense block with a growth rate of k = 4. Each layer takes all preceding feature-maps as input. Source: Huang et al. (2016)[135]

The model was evaluated on CIFAR-10, CIFAR-100, SVHN and ImageNet, achieving SOTA on a number of them. Impressively, DenseNets achieve these results while using less memory and with reduced computational requirements. There are multiple implementations (Keras, TensorFlow, etc.) available here.[136]
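
To make the connectivity pattern concrete, here is a minimal PyTorch-style sketch of a dense block. It illustrates only the core idea of concatenative connections, using the layer count and growth rate from Figure 16; the published DenseNets also use bottleneck layers, transition layers and compression, so treat this as an illustration rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """One layer of a dense block: BN -> ReLU -> 3x3 conv producing k new feature maps."""
    def __init__(self, in_channels, growth_rate):
        super().__init__()
        self.bn = nn.BatchNorm2d(in_channels)
        self.conv = nn.Conv2d(in_channels, growth_rate, kernel_size=3, padding=1, bias=False)

    def forward(self, x):
        return self.conv(torch.relu(self.bn(x)))

class DenseBlock(nn.Module):
    """Each layer receives the concatenation of all preceding feature maps as input."""
    def __init__(self, in_channels, growth_rate, num_layers):
        super().__init__()
        self.layers = nn.ModuleList([
            DenseLayer(in_channels + i * growth_rate, growth_rate)
            for i in range(num_layers)
        ])

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))
            features.append(out)
        return torch.cat(features, dim=1)

# Example: a 5-layer dense block with growth rate k = 4, as in Figure 16.
block = DenseBlock(in_channels=16, growth_rate=4, num_layers=5)
y = block(torch.randn(1, 16, 32, 32))   # output has 16 + 5*4 = 36 channels
```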

  • FractalNet: Ultra-Deep Neural Networks without Residuals[137] — utilises interacting subpaths of different lengths, without pass-through or residual connections, instead transforming internal signals with filters and nonlinearities.

“FractalNets repeatedly combine several parallel layer sequences with different numbers of convolutional blocks to obtain a large nominal depth, while maintaining many short paths in the network”.[138]

The network achieved SOTA performance on CIFAR and ImageNet, while demonstrating some additional properties. For instance, the authors call into question the role of residuals in the success of extremely deep ConvNets, while also providing insight into the nature of answers attained by various subnetwork depths.

  • Lets keep it simple: using simple architectures to outperform deeper architectures[139] focuses on creating a simplified mother architecture. The architecture achieved SOTA results, or parity with existing approaches, on ‘datasets such as CIFAR10/100, MNIST and SVHN with simple or no data-augmentation’. We feel their exact words provide the best description of the motivation here:

“In this work, we present a very simple fully convolutional network architecture of 13 layers, with minimum reliance on new features which outperforms almost all deeper architectures with 2 to 25 times fewer parameters. Our architecture can be a very good candidate for many scenarios, especially for use in embedded devices.”

“It can be furthermore compressed using methods such as DeepCompression and thus its memory consumption can be decreased drastically. We intentionally tried to create a mother architecture with minimum reliance on new features proposed recently, to show the effectiveness of a well-crafted yet simple convolutional architecture which can then later be enhanced with existing or new methods presented in the literature.”[140]

Here are some additional techniques which complement ConvNet Architectures:

  • Swapout: Learning an ensemble of deep architectures[141] generalises dropout and stochastic depth methods to prevent co-adaptation of units, both within a specific layer and across network layers. The ensemble training method samples from multiple architectures including “dropout, stochastic depth and residual architectures”. Swapout outperforms ResNets of identical network structure on CIFAR-10 and CIFAR-100, and can be classified as a regularisation technique.
  • SqueezeNet[142] posits that smaller DNNs offer various benefits, from less computationally taxing training to easier information transmission to, and operation on, devices with limited storage or processing power. SqueezeNet is a small DNN architecture which achieves ‘AlexNet-level accuracy with significantly reduced parameters and memory requirements using model compression techniques which make it 510x smaller than AlexNet.’
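
For readers curious about how SqueezeNet achieves its parameter savings, below is a minimal PyTorch-style sketch of its “Fire” module, the repeated building block in which a narrow 1x1 “squeeze” convolution feeds parallel 1x1 and 3x3 “expand” convolutions whose outputs are concatenated. The channel counts are illustrative rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """SqueezeNet-style 'Fire' module: a 1x1 'squeeze' convolution feeding
    parallel 1x1 and 3x3 'expand' convolutions whose outputs are concatenated."""
    def __init__(self, in_channels, squeeze, expand1x1, expand3x3):
        super().__init__()
        self.squeeze = nn.Conv2d(in_channels, squeeze, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze, expand1x1, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze, expand3x3, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))
        return torch.cat([
            self.relu(self.expand1x1(x)),
            self.relu(self.expand3x3(x)),
        ], dim=1)

# Example: 96 input channels squeezed to 16, then expanded back to 64 + 64 = 128.
fire = Fire(in_channels=96, squeeze=16, expand1x1=64, expand3x3=64)
out = fire(torch.randn(1, 96, 55, 55))   # -> shape (1, 128, 55, 55)
```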

The Rectified Linear Unit (ReLU) has traditionally been the dominant activation function for Neural Networks. However, here are some recent alternatives (a short sketch of each follows the list):

  • Concatenated Rectified Linear Units (CRelu)[143]
  • Exponential Linear Units (ELUs)[144] from the close of 2015
  • Parametric Exponential Linear Unit (PELU)[145]
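
As a rough illustration of how these alternatives differ from a plain ReLU, here is a minimal PyTorch-style sketch of each. The forms shown are simplified approximations of the cited papers; in particular, PELU's a and b are learned parameters constrained to be positive in the original work, whereas here they are plain arguments.

```python
import torch
import torch.nn.functional as F

def crelu(x, dim=1):
    """Concatenated ReLU: keeps both positive and negative phase information,
    doubling the number of channels (Shang et al., 2016)."""
    return torch.cat([F.relu(x), F.relu(-x)], dim=dim)

def elu(x, alpha=1.0):
    """Exponential Linear Unit (Clevert et al., 2015)."""
    return torch.where(x > 0, x, alpha * (torch.exp(x) - 1))

def pelu(x, a, b):
    """Parametric ELU (Trottier et al., 2016), simplified: a and b control the
    positive slope and the negative saturation, and are learned in the paper."""
    return torch.where(x >= 0, (a / b) * x, a * (torch.exp(x / b) - 1))
```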

Moving towards equivariance in ConvNets

ConvNets are translation invariant, meaning they can identify the same features in multiple parts of an image. However, the typical CNN is not rotation invariant, meaning that if a feature or the whole image is rotated then the network’s performance suffers. Usually ConvNets learn to (sort of) deal with rotation through data augmentation, e.g. purposefully rotating the images by small random amounts during training, as sketched below. The network thereby gains slight rotation-invariant properties without rotation invariance being specifically designed into it, which means that rotation invariance remains fundamentally limited with current techniques. This is an interesting parallel with humans, who also typically fare worse at recognising characters upside down, although there is no reason for machines to suffer this limitation.
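
As a concrete example of this kind of augmentation, here is a minimal torchvision-style sketch that randomly rotates each training image by a small amount; the specific angles and transforms chosen are purely illustrative.

```python
from torchvision import transforms

# Randomly rotate each training image by up to +/-15 degrees, so the network
# sees rotated variants of the same features during training.
train_transform = transforms.Compose([
    transforms.RandomRotation(degrees=15),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
```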

The following papers tackle rotation-invariant ConvNets. While each approach has its novelties, they all improve rotation invariance through more efficient parameter usage, moving toward global rotation equivariance:

  • Harmonic CNNs[146] replace regular CNN filters with ‘circular harmonics’.
  • Group Equivariant Convolutional Networks (G-CNNs)[147] use G-Convolutions, a new type of layer that “enjoys a substantially higher degree of weight sharing than regular convolution layers and increases the expressive capacity of the network without increasing the number of parameters”.
  • Exploiting Cyclic Symmetry in Convolutional Neural Networks[148] presents four operations as layers which augment neural network layers to partially increase rotational equivariance.
  • Steerable CNNs[149]: Cohen and Welling build on their work with G-CNNs, demonstrating that “steerable architectures” outperform residual and dense networks on the CIFARs. They also provide a succinct overview of the invariance problem:

“To improve the statistical efficiency of machine learning methods, many have sought to learn invariant representations. In deep learning, however, intermediate layers should not be fully invariant, because the relative pose of local features must be preserved for further layers. Thus, one is led to the idea of equivariance: a network is equivariant if the representations it produces transform in a predictable linear manner under transformations of the input. In other words, equivariant networks produce representations that are steerable. Steerability makes it possible to apply filters not just in every position (as in a standard convolution layer), but in every pose, thus allowing for increased parameter sharing.”

Residual Networks

Figure 17: Test-Error Rates on CIFAR Datasets

Note: Yellow highlight indicates that these papers feature within this piece. “Pre-ResNet” refers to “Identity Mappings in Deep Residual Networks” (see following section). Furthermore, while not included in the table, we believe that “Learning Identity Mappings with Residual Gates” produced some of the lowest error rates of 2016, with 3.65% and 18.27% on CIFAR-10 and CIFAR-100, respectively.

Source: Abdi and Nahavandi (2016, p. 6)[150]

Residual Networks and their variants became incredibly popular in 2016, following the success of Microsoft’s ResNet,[151] with many open-source versions and pre-trained models now available. In 2015, ResNet won 1st place in ImageNet’s Detection, Localisation and Classification tasks as well as in COCO’s Detection and Segmentation challenges. Although questions still abound about depth, ResNets’ tackling of the vanishing gradient problem provided more impetus for the “increased depth produces superior abstraction” philosophy which underpins much of Deep Learning at present.

ResNets are often conceptualised as an ensemble of shallower networks, which somewhat counteract the hierarchical nature of Deep Neural Networks (DNNs) by running shortcut connections parallel to their convolutional layers. These shortcuts or skip connections mitigate vanishing/exploding gradient problems associated with DNNs, by allowing easier back-propagation of gradients throughout the network layers. For more information there is a Quora thread available here.[152]
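
A minimal PyTorch-style sketch of the basic idea follows: the shortcut simply adds the block’s input back onto its output, so gradients can also flow through the identity path. This covers only the same-dimension case; real ResNets additionally use projection shortcuts and bottleneck blocks when the channel count or resolution changes.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A basic residual block: the input is added back to the output of a small
    stack of convolutions, giving gradients a short path backwards through the net."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)   # identity shortcut: y = F(x) + x
```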

Residual Learning, Theory and Improvements

  • Wide Residual Networks[153] is now an extremely common ResNet approach. The authors conduct an experimental study on the architecture of ResNet blocks, and improve residual network performance by increasing the width and reducing the depth of the networks, which mitigates the diminishing feature reuse problem. This approach produces new SOTA on multiple benchmarks including 3.89% and 18.3% on CIFAR-10 and CIFAR-100 respectively. The authors show that a ‘16-layer-deep wide ResNet performs as well or better in accuracy and efficiency than many other ResNets (including 1000 layer networks)’.
  • Deep Networks with Stochastic Depth[154] essentially applies dropout to whole layers instead of to individual neurons: “We start with very deep networks but during training, for each mini-batch, randomly drop a subset of layers and bypass them with the identity function.” Stochastic depth allows quicker training and better accuracy, even when training networks with more than 1200 layers (see the sketch after this list).
  • Learning Identity Mappings with Residual Gates[155] — “by using a scalar parameter to control each gate, we provide a way to learn identity mappings by optimizing only one parameter.” The authors use these Gated ResNets to improve the optimisation of deep models, while providing ‘high tolerance to full layer removal’ such that 90% of performance remains even after significant random layer removal. Using Wide Gated ResNets the model achieves 3.65% and 18.27% error on CIFAR-10 and CIFAR-100, respectively.
  • Residual Networks Behave Like Ensembles of Relatively Shallow Networks[156] — ResNets can be viewed as collections of many paths which do not strongly depend upon one another, reinforcing the notion of ensemble behaviour. Furthermore, residual pathways vary in length: the short paths contribute most of the gradient during training, while the deeper paths contribute little at this stage.
  • Identity Mappings in Deep Residual Networks[157] comes as an improvement from the original ResNet authors: Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun. Identity mappings are shown to allow ‘forward and backward signals to be propagated between any ResNet block when used as the skip connections and after-addition activation’. The approach improves generalisation, training and results “using a 1001-layer ResNet on CIFAR-10 (4.62% error) and CIFAR-100, and a 200-layer ResNet on ImageNet”.
  • Multi-Residual Networks: Improving the Speed and Accuracy of Residual Networks[158] again advocates for the ensemble behaviour of ResNets and favours a wider-over-deeper approach to ResNet architecture. “The proposed multi-residual network increases the number of residual functions in the residual blocks.” Improved accuracy produces 3.73% and 19.45% error on CIFAR-10 and CIFAR-100, respectively. The table presented in Fig. 17 was taken from this paper, and more up-to-date versions are available which consider the work produced in 2017 thus far.
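
To illustrate the stochastic depth idea referenced above, here is a minimal PyTorch-style sketch in which an entire residual branch is randomly skipped during training and re-scaled by its survival probability at test time. The survival probability used here is a placeholder; in the paper it decays linearly with depth.

```python
import torch
import torch.nn as nn

class StochasticDepthBlock(nn.Module):
    """Wraps a residual branch; during training the whole branch is dropped with
    probability (1 - survival_prob) and replaced by the identity mapping."""
    def __init__(self, branch, survival_prob=0.8):
        super().__init__()
        self.branch = branch
        self.survival_prob = survival_prob

    def forward(self, x):
        if self.training:
            if torch.rand(1).item() < self.survival_prob:
                return x + self.branch(x)
            return x                      # block skipped for this mini-batch
        # At test time every block is kept, scaled by its survival probability.
        return x + self.survival_prob * self.branch(x)
```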

Other residual theory and improvements
Although a relatively recent idea, there is quite a considerable body of work being created around ResNets presently. The following represents some additional theory and improvements which we wished to highlight for interested readers:

  • Highway and Residual Networks learn Unrolled Iterative Estimation[159]
  • Multi-Residual Networks: Improving the Speed and Accuracy of Residual Networks (updated version)[160]
  • Resnet in Resnet: Generalizing Residual Architectures[161]
  • Wider or Deeper: Revisiting the ResNet Model for Visual Recognition[162]
  • Bridging the Gaps Between Residual Learning, Recurrent Neural Networks and Visual Cortex[163]
  • Convolutional Residual Memory Networks[164]
  • Identity Matters in Deep Learning[165]
  • Deep Residual Networks with Exponential Linear Unit[166]
  • Weighted Residuals for Very Deep Networks[167]

Datasets

The significance of rich datasets for all facets of machine learning cannot be overstated, so we feel it is prudent to include some of the largest advancements in this domain. To paraphrase Ben Hamner, CTO and co-founder of Kaggle, ‘a new dataset can make a thousand papers flourish’;[168] that is to say, the availability of data can promote new approaches, as well as breathe new life into previously ineffectual techniques.

In 2016, traditional datasets such as ImageNet[169], Common Objects in Context (COCO)[170], the CIFARs[171] and MNIST[172] were joined by a host of new entries. We also noted the rise of synthetic datasets spurred on by progress in graphics. Synthetic datasets are an interesting work-around of the large data requirements for Artificial Neural Networks (ANNs). In the interest of brevity, we have selected our (subjective) most important new datasets for 2016:

  • Places2[173] is a scene classification dataset, i.e. the task is to label an image with a scene class such as ‘Stadium’, ‘Park’, etc. While prediction models and image understanding will undoubtedly be improved by the Places2 dataset, an interesting finding from networks trained on it is that, in the process of learning to classify scenes, they learn to detect the objects in those scenes without ever being explicitly taught to do so; for example, that bedrooms contain beds and that sinks can appear in both kitchens and bathrooms. This suggests that the objects themselves are lower-level features in the abstraction hierarchy used for classifying scenes.

Figure 18: Examples from SceneNet RGB-D

Note: Examples taken from SceneNet RGB-D, a dataset with 5M photorealistic images of synthetic indoor trajectories with ground truth. The photo (a) is rendered through computer graphics, with ground truth available for the specific tasks in (b) to (e). Creation of synthetic datasets should aid the process of domain adaptation; synthetic datasets are somewhat pointless if the knowledge learned from them cannot be applied to the real world. This is where domain adaptation comes in, referring to the transfer learning process of moving knowledge from one domain to another, e.g. from synthetic to real-world environments. Domain adaptation has recently been improving very rapidly, again highlighting the recent efforts in transfer learning. Columns (c) vs (d) show the difference between instance and semantic/class segmentation.

Source: McCormac et al. (2017)[174]

  • SceneNet RGB-D[175] — This synthetic dataset expands on the original SceneNet dataset and provides pixel-perfect ground truth for scene understanding problems such as semantic segmentation, instance segmentation and object detection, as well as for geometric computer vision problems such as optical flow, depth estimation, camera pose estimation and 3D reconstruction. By providing these pixel-perfect representations, the dataset captures its chosen environments at a very fine granularity.
  • CMPlaces[176] is a cross-modal scene dataset from MIT. The task is to recognise scenes across many different modalities beyond natural images, and in the process hopefully transfer that knowledge across modalities too. Some of the modalities are: Real, Clip Art, Sketches, Spatial Text (words written in the spatial locations of the corresponding objects) and natural language descriptions. The paper also discusses methods for tackling this type of problem with cross-modal convolutional neural networks.

Figure 19: CMPlaces cross-modal scene representations

Note: Taken from the CMPlaces paper showing two examples, bedrooms and kindergarten classrooms, across different modalities. Conventional Neural Network approaches learn representations that don’t transfer well across modalities and this paper attempts to generate a shared representation “agnostic of modality”.

Source: Aytar et al. (2016)[177]

In CMPlaces we see explicit mention of transfer learning, domain-invariant representations, domain adaptation and multi-modal learning, all of which further demonstrate the current undertow of Computer Vision research. The authors focus on trying to find “domain/modality-independent representations”, which could correspond to the higher-level abstractions from which humans draw their unified representations. For instance, take ‘cat’ across its various modalities: humans may see the word ‘cat’ in writing, a picture drawn in a sketchbook, a real-world image, or hear it mentioned in speech, yet we still hold the same unified representation, abstracted at a level above these modalities.

“Humans are able to leverage knowledge and experiences independently of the modality they perceive it in, and a similar capability in machines would enable several important applications in retrieval and recognition”.

  • MS-Celeb-1M[178] contains ten million training images of one million celebrities, intended as a training set for Facial Recognition.
  • Open Images[179] comes courtesy of Google Inc. and comprises ~9 million URLs to images, each with multiple labels, a vast improvement over typical single-label images. Open Images spans 6,000 categories, a large improvement over the 1,000 classes offered previously by ImageNet (with less focus on canines), and should prove indispensable to the Machine Learning community.
  • YouTube-8M[180] also comes courtesy of Google, with 8 million video URLs, 500,000 hours of video, 4,800 classes and an average of 1.8 labels per video. Some examples of the labels are: ‘Arts & Entertainment’, ‘Shopping’ and ‘Pets & Animals’. Video datasets are much more difficult to label and collect, hence the massive value this dataset provides.

That being said, advancements in image understanding, such as segmentation, object classification and detection, have brought video understanding to the fore of research. However, prior to this dataset’s release there was a real lack of variety and scale in the real-world video datasets available. Furthermore, the dataset was recently updated,[181] and this year, in association with Kaggle, Google is organising a video understanding competition as part of CVPR 2017.[182]

General information about YouTube-8M is available here.[183]

Ungroupable extras and interesting trends

As this piece draws to a close, we lament the limitations under which we had to construct it. Indeed, the field of Computer Vision is too expansive to cover in any real, meaningful depth, and as such many omissions were made. One such omission is, unfortunately, almost everything that didn’t use Neural Networks. We know there is great work outside of NNs, and we acknowledge our own biases, but we feel that the impetus lies with these approaches currently, and our subjective selection of material for inclusion was predominantly based on the reception received from the research community at large (and the results speak for themselves).

We would also like to stress that there are hundreds of other papers on the above topics, and that this amalgam of topics is not curated as a definitive collection, but rather hopes to encourage interested parties to read further along the starting points we provide. As such, this final section acts as a catch-all for some of the other applications we loved, trends we wished to highlight and justifications we wanted to make to the reader.

Applications/use cases

  • Applications for the blind from Facebook[184] and hardware from Baidu.[185]
  • Emotion detection combines facial detection and semantic analysis, and is growing rapidly. There are 20+ APIs currently available.[186]
  • Extracting roads from aerial imagery,[187] land use classification from aerial maps and population density maps.[188]
  • Amazon Go further raised the profile of Computer Vision by demonstrating a queue-less shopping experience,[189] although there remain some functional issues at present.[190]
  • There is a huge volume of work being done for Autonomous Vehicles that we largely didn’t touch. However, for those wishing to delve into general market trends, there’s an excellent piece by Moritz Mueller-Freitag of Twenty Billion Neurons about the German auto industry and the impact of autonomous vehicles.[191]
  • Other interesting areas: Image Retrieval/Search,[192] Gesture Recognition, Inpainting and Facial Reconstruction.
  • There is considerable work around Digital Imaging and Communications in Medicine (DICOM) and other medical applications, especially related to imaging. For instance, there have been (and still are) numerous Kaggle detection competitions (lung cancer, cervical cancer), some with large monetary incentives, in which algorithms attempt to outperform specialists at the classification/detection tasks in question.

However, while work continues on improving the error rates of these algorithms, their value as a tool for medical practitioners appears increasingly evident. This is particularly striking when we consider the performance improvements in breast cancer detection achieved by combining AI systems[193] with medical specialists.[194] In this instance, human-machine symbiosis produces accuracy far greater than either achieves alone, at 99.5%.

This is just one example of the torrent of medical applications currently being pursued by the deep learning/machine learning communities. Some cynical members of our team jokingly make light of these attempts as a means to ingratiate society to the idea of AI research as a ubiquitous, benevolent force. But as long as the technology helps the healthcare industry, and it is introduced in a safe and considered manner, we wholeheartedly welcome such advances.

Hardware/markets

  • Growing markets for Robotic Vision/Machine Vision (separate fields) and potential target markets for IoT. A personal favourite of ours is the use of Deep Learning, a Raspberry Pi and TensorFlow by a farmer’s son in Japan to sort cucumbers based on unique producer heuristics for quality, e.g. shape, size and colour.[195] This produced massive decreases in the time his mother spent sorting cucumbers.
  • The trend of shrinking compute requirements and migrating to mobile is evident, complemented by steep hardware acceleration. Soon we’ll see pocket-sized CNNs and Vision Processing Units (VPUs) everywhere. For instance, the Movidius Myriad2 is used in Google’s Project Tango and in drones.[196]

The Movidius Fathom stick,[197] which also uses the Myriad2’s technology, allows users to add SOTA Computer Vision performance to consumer devices. The Fathom stick, which has the physical properties of a USB stick, brings the power of a Neural Network to almost any device: Brains on a stick.

  • Sensors and systems that use something other than visible light. Examples include radar, thermographic cameras, hyperspectral imaging, sonar, magnetic resonance imaging, etc.
  • Reduction in the cost of LIDAR, which uses pulsed laser light to measure distances and offers many advantages over normal RGB cameras. Many LIDAR devices are now available for less than $500.
  • Hololens and the near-countless other Augmented Reality headsets[198] entering the market.
  • Project Tango by Google[199] represents the next big commercialisation of SLAM. Tango is an augmented reality computing platform, comprising both novel software and hardware. Tango allows the detection of mobile device position, relative to the world, without the use of GPS or other external information while simultaneously mapping the area around the device in 3D.

Corporate partner Lenovo brought affordable Tango-enabled phones to market in 2016, allowing hundreds of developers to begin creating applications for the platform. Tango employs the following software technologies: Motion Tracking, Area Learning, and Depth Perception.

Update October 2019: What Happened to Google’s Tango?

Omissions based on forthcoming publications

There is also considerable, and increasing, overlap between Computer Vision techniques and other domains in Machine Learning and Artificial Intelligence. These other domains and hybrid use cases are the subject of The M Tank’s forthcoming publications and, as with the whole of this piece, we partitioned content based on our own heuristics.

For instance, we decided to place two integral Computer Vision tasks, Image Captioning and Visual Question Answering, in our forthcoming NLP piece, along with Visual Speech Recognition, because of the combination of CV and NLP involved. The application of Generative Models to images, meanwhile, is placed in our work on Generative Models. Examples included in these future works are:

  • Lip Reading: In 2016 we saw huge lip reading advancements in programs such as LipNet[200], which combine Computer Vision and NLP into Visual Speech Recognition.
  • Generative models applied to images will feature as part of our depiction of the violent* battle between the Autoregressive Models (PixelRNN, PixelCNN, ByteNet, VPN, WaveNet), Generative Adversarial Networks (GANs), Variational Autoencoders and, as you should expect by this stage, all of their variants, combinations and hybrids.

*Disclaimer: The team wishes to mention that they do not condone Network on Network (NoN) violence in any form and are sympathisers to the movement towards Generative Unadversarial Networks (GUNs).[201]

In the final section, we’ll offer some concluding remarks and a recapitulation of some of the trends we identified. We hope that we were comprehensive enough to provide a bird’s-eye view of where the Computer Vision field is loosely situated and where it is headed in the near term. We would also like to draw particular attention to the fact that our work does not cover January to August 2017. The blistering pace of research output means that much of this work may already be outdated; we encourage readers to go and find out for themselves. But this rapid pace of growth also brings with it lucrative opportunities, as the Computer Vision hardware and software market is expected to reach $48.6 billion by 2022.

Figure 20: Computer Vision Revenue by Application Market[202]

Note: Estimation of Computer Vision revenue by application market spanning the period from 2015–2022. The largest growth is forecasted to come from applications within the automotive, consumer, robotics and machine vision sectors.
Source: Tractica (2016)[203]

Conclusion

In conclusion, we’d like to highlight some of the trends and recurring themes that cropped up repeatedly throughout our research review process. First and foremost, we’d like to draw attention to the Machine Learning research community’s voracious pursuit of optimisation. This is most notable in the year-on-year changes in accuracy rates, but especially in the intra-year changes in accuracy. We’d like to underscore this point and return to it in a moment.

Error rates are not the only fanatically optimised parameter: researchers are also working on improving speed, efficiency and even the ability of algorithms to generalise to other tasks and problems in completely new ways. We are acutely aware of the research coming to the fore with approaches like one-shot learning, generative modelling, transfer learning and, more recently, evolutionary learning, and we feel that these research principles are gradually exerting greater influence on the approaches of the best-performing work.

While this last point is unequivocally meant as commendation for, rather than denigration of, this trend, one can’t help but cast their mind toward the (very) distant spectre of Artificial General Intelligence, whether it merits a thought or not. Far from being alarmist, we simply wish to highlight to both experts and laypersons that this concern arises from the startling progress already evident in Computer Vision and other AI subfields. Properly articulated public concerns can only come through education about these advancements and their impacts, which may in turn quell the power of media sentiment and misinformation in AI.

We chose to focus on a one-year timeline for two reasons. The first relates to the sheer volume of work being produced. Even for people who follow the field very closely, it is becoming increasingly difficult to remain abreast of research as the number of publications grows exponentially. The second brings us back to our point on intra-year changes.

In taking a single-year snapshot of progress, the reader can begin to comprehend the pace of research at present. We see improvement after improvement in such short time spans, but why? Researchers have cultivated a global community where building on previous approaches (architectures, meta-architectures, techniques, ideas, tips, wacky hacks, results, etc.) and infrastructure (libraries like Keras, TensorFlow and PyTorch, GPUs, etc.) is not only encouraged but also celebrated. It is a predominantly open-source community with few parallels, one which continuously attracts new researchers and sees its techniques reappropriated by fields like economics, physics and countless others.

It’s important to understand, for those who have yet to notice, that among the already frantic chorus of divergent voices proclaiming divine insight into the true nature of this technology, there is at least agreement: agreement that this technology will alter the world in new and exciting ways. However, much disagreement remains over the timeline on which these alterations will unfold.

Until such a time as we can accurately model the progress of these developments, we will continue to provide information to the best of our abilities. With this resource we hoped to cater to the full spectrum of AI experience, from researchers playing catch-up to anyone who simply wishes to obtain a grounding in Computer Vision and Artificial Intelligence. In doing so, our project hopes to have added some value to the open-source revolution that quietly hums beneath the technology of a lifetime.

With thanks,

The M Tank

Please feel free to place all feedback and suggestions in the comments section and we’ll revert as soon as we can. Alternatively, you can contact us directly through: info@themtank.com

The full piece is available at: www.themtank.org/a-year-in-computer-vision

References in order of appearance

[131] Szegedy et al. 2016. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. [Online] arXiv: 1602.07261. Available: arXiv:1602.07261v2

[132] Szegedy et al. 2015. Rethinking the Inception Architecture for Computer Vision. [Online] arXiv: 1512.00567. Available: arXiv:1512.00567v3

[133] Huang et al. 2016. Densely Connected Convolutional Networks. [Online] arXiv: 1608.06993. Available: arXiv:1608.06993v3

[134] ibid

[135] ibid

[136] Liuzhuang13. 2017. Code for Densely Connected Convolutional Networks (DenseNets). [Online] github.com. Available: https://github.com/liuzhuang13/DenseNet [Accessed: 03/04/2017].

[137] Larsson et al. 2016. FractalNet: Ultra-Deep Neural Networks without Residuals. [Online] arXiv: 1605.07648. Available: arXiv:1605.07648v2

[138] Huang et al. 2016. Densely Connected Convolutional Networks. [Online] arXiv: 1608.06993. Available: arXiv:1608.06993v3, pg. 1.

[139] Hossein HasanPour et al. 2016. Lets keep it simple: using simple architectures to outperform deeper architectures. [Online] arXiv: 1608.06037. Available: arXiv:1608.06037v3

[140] ibid

[141] Singh et al. 2016. Swapout: Learning an ensemble of deep architectures. [Online] arXiv: 1605.06465. Available: arXiv:1605.06465v1

[142] Iandola et al. 2016. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. [Online] arXiv: 1602.07360. Available: arXiv:1602.07360v4

[143] Shang et al. 2016. Understanding and Improving Convolutional Neural Networks via Concatenated Rectified Linear Units. [Online] arXiv: 1603.05201. Available: arXiv:1603.05201v2

[144] Clevert et al. 2016. Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). [Online] arXiv: 1511.07289. Available: arXiv:1511.07289v5

[145] Trottier et al. 2016. Parametric Exponential Linear Unit for Deep Convolutional Neural Networks. [Online] arXiv: 1605.09332. Available: arXiv:1605.09332v3

[146] Worrall et al. 2016. Harmonic Networks: Deep Translation and Rotation Equivariance. [Online] arXiv: 1612.04642. Available: arXiv:1612.04642v1

[147] Cohen & Welling. 2016. Group Equivariant Convolutional Networks. [Online] arXiv: 1602.07576. Available: arXiv:1602.07576v3

[148] Dieleman et al. 2016. Exploiting Cyclic Symmetry in Convolutional Neural Networks. [Online] arXiv: 1602.02660. Available: arXiv:1602.02660v2

[149] Cohen & Welling. 2016. Steerable CNNs. [Online] arXiv: 1612.08498. Available: arXiv:1612.08498v1

[150] Abdi, M., Nahavandi, S. 2016. Multi-Residual Networks: Improving the Speed and Accuracy of Residual Networks. [Online] arXiv: 1609.05672. Available: arXiv:1609.05672v3

[151] He et al. 2015. Deep Residual Learning for Image Recognition. [Online] arXiv: 1512.03385. Available: arXiv:1512.03385v1

[152] Quora. 2017. What is an intuitive explanation of Deep Residual Networks? [Website] www.quora.com. Available: https://www.quora.com/What-is-an-intuitive-explanation-of-Deep-Residual-Networks [Accessed: 03/04/2017].

[153] Zagoruyko, S. and Komodakis, N. 2017. Wide Residual Networks. [Online] arXiv: 1605.07146. Available: arXiv:1605.07146v3

[154] Huang et al. 2016. Deep Networks with Stochastic Depth. [Online] arXiv: 1603.09382. Available: arXiv:1603.09382v3

[155] Savarese et al. 2016. Learning Identity Mappings with Residual Gates. [Online] arXiv: 1611.01260. Available: arXiv:1611.01260v2

[156] Veit, Wilber and Belongie. 2016. Residual Networks Behave Like Ensembles of Relatively Shallow Networks. [Online] arXiv: 1605.06431. Available: arXiv:1605.06431v2

[157] He et al. 2016. Identity Mappings in Deep Residual Networks. [Online] arXiv: 1603.05027. Available: arXiv:1603.05027v3

[158] Abdi, M., Nahavandi, S. 2016. Multi-Residual Networks: Improving the Speed and Accuracy of Residual Networks. [Online] arXiv: 1609.05672. Available: arXiv:1609.05672v3

[159] Greff et al. 2017. Highway and Residual Networks learn Unrolled Iterative Estimation. [Online] arXiv: 1612.07771. Available: arXiv:1612.07771v3

[160] Abdi and Nahavandi. 2017. Multi-Residual Networks: Improving the Speed and Accuracy of Residual Networks. [Online] arXiv: 1609.05672. Available: arXiv:1609.05672v4

[161] Targ et al. 2016. Resnet in Resnet: Generalizing Residual Architectures. [Online] arXiv: 1603.08029. Available: arXiv:1603.08029v1

[162] Wu et al. 2016. Wider or Deeper: Revisiting the ResNet Model for Visual Recognition. [Online] arXiv: 1611.10080. Available: arXiv:1611.10080v1

[163] Liao and Poggio. 2016. Bridging the Gaps Between Residual Learning, Recurrent Neural Networks and Visual Cortex. [Online] arXiv: 1604.03640. Available: arXiv:1604.03640v1

[164] Moniz and Pal. 2016. Convolutional Residual Memory Networks. [Online] arXiv: 1606.05262. Available: arXiv:1606.05262v3

[165] Hardt and Ma. 2016. Identity Matters in Deep Learning. [Online] arXiv: 1611.04231. Available: arXiv:1611.04231v2

[166] Shah et al. 2016. Deep Residual Networks with Exponential Linear Unit. [Online] arXiv: 1604.04112. Available: arXiv:1604.04112v4

[167] Shen and Zeng. 2016. Weighted Residuals for Very Deep Networks. [Online] arXiv: 1605.08831. Available: arXiv:1605.08831v1

[168] Ben Hamner. 2016. Twitter Status. [Online] Twitter. Available: https://twitter.com/benhamner/status/789909204832227329

[169] ImageNet. 2017. Homepage. [Online] Available: http://image-net.org/index [Accessed: 04/01/2017]

[170] COCO. 2017. Common Objects in Context Homepage. [Online] Available: http://mscoco.org/ [Accessed: 04/01/2017]

[171] CIFARs. 2017. The CIFAR-10 dataset. [Online] Available: https://www.cs.toronto.edu/~kriz/cifar.html [Accessed: 04/01/2017]

[172] MNIST. 2017. THE MNIST DATABASE of handwritten digits. [Online] Available: http://yann.lecun.com/exdb/mnist/ [Accessed: 04/01/2017]

[173] Zhou et al. 2016. Places2. [Online] Available: http://places2.csail.mit.edu/ [Accessed: 06/01/2017]

[174] ibid

[175] McCormac et al. 2017. SceneNet RGB-D: 5M Photorealistic Images of Synthetic Indoor Trajectories with Ground Truth. [Online] arXiv: 1612.05079v3. Available: arXiv:1612.05079v3

[176] Aytar et al. 2016. Cross-Modal Scene Networks. [Online] arXiv: 1610.09003. Available: arXiv:1610.09003v1

[177] ibid

[178] Guo et al. 2016. MS-Celeb-1M: A Dataset and Benchmark for Large-Scale Face Recognition. [Online] arXiv: 1607.08221. Available: arXiv:1607.08221v1

[179] Open Images. 2017. Open Images Dataset. [Online] Github. Available: https://github.com/openimages/dataset [Accessed: 08/01/2017]

[180] Abu-El-Haija et al. 2016. YouTube-8M: A Large-Scale Video Classification Benchmark. [Online] arXiv: 1609.08675. Available: arXiv:1609.08675v1

[181] Natsev, P. 2017. An updated YouTube-8M, a video understanding challenge, and a CVPR workshop. Oh my!. [Online] Google Research Blog. Available: https://research.googleblog.com/2017/02/an-updated-youtube-8m-video.html [Accessed: 26/02/2017].

[182] YouTube-8M. 2017. CVPR’17 Workshop on YouTube-8M Large-Scale Video Understanding. [Online] Google Research. Available: https://research.google.com/youtube8m/workshop.html [Accessed: 26/02/2017].

[183] Google. 2017. YouTube-8M Dataset. [Online] research.google.com. Available: https://research.google.com/youtube8m/ [Accessed: 04/03/2017].

[184] Wu, Pique & Wieland. 2016. Using Artificial Intelligence to Help Blind People ‘See’ Facebook. [Online] Facebook Newsroom. Available: http://newsroom.fb.com/news/2016/04/using-artificial-intelligence-to-help-blind-people-see-facebook/ [Accessed: 02/03/2017].

[185] Metz. 2016. Artificial Intelligence Finally Entered Our Everyday World. [Online] Wired. Available: https://www.wired.com/2016/01/2015-was-the-year-ai-finally-entered-the-everyday-world/ [Accessed: 02/03/2017].

[186] Doerrfeld. 2015. 20+ Emotion Recognition APIs That Will Leave You Impressed, and Concerned. [Online] Nordic Apis. Available: http://nordicapis.com/20-emotion-recognition-apis-that-will-leave-you-impressed-and-concerned/ [Accessed: 02/03/2017].

[187] Johnson, A. 2016. Trailbehind/DeepOSM — Train a deep learning net with OpenStreetMap features and satellite imagery. [Online] Github.com. Available: https://github.com/trailbehind/DeepOSM [Accessed: 29/03/2017].

[188] Gros and Tiecke. 2016. Connecting the world with better maps. [Online] Facebook Code. Available: https://code.facebook.com/posts/1676452492623525/connecting-the-world-with-better-maps/ [Accessed: 02/03/2017].

[189] Amazon. 2017. Frequently Asked Questions — Amazon Go. [Website] Amazon.com. Available: https://www.amazon.com/b?node=16008589011 [Accessed: 29/03/2017].

[190] Reisinger, D. 2017. Amazon’s Cashier-Free Store Might Be Easy to Break. [Online] Fortune Tech. Available: http://fortune.com/2017/03/28/amazon-go-cashier-free-store/ [Accessed: 29/03/2017].

[191] Mueller-Freitag, M. 2017. Germany asleep at the wheel? [Blog] Twenty Billion Neurons — Medium.com. Available: https://medium.com/twentybn/germany-asleep-at-the-wheel-d800445d6da2

[192] Gordo et al. 2016. Deep Image Retrieval: Learning global representations for image search. [Online] arXiv: 1604.01325. Available: arXiv:1604.01325v2

[193] Wang et al. 2016. Deep Learning for Identifying Metastatic Breast Cancer. [Online] arXiv: 1606.05718. Available: arXiv:1606.05718v1

[194] Rosenfeld, J. 2016. AI Achieves Near-Human Detection of Breast Cancer. [Online] Mentalfloss.com. Available: http://mentalfloss.com/article/82415/ai-achieves-near-human-detection-breast-cancer [Accessed: 27/03/2017].

[195] Sato, K. 2016. How a Japanese cucumber farmer is using deep learning and TensorFlow. [Blog] Google Cloud Platform. Available: https://cloud.google.com/blog/big-data/2016/08/how-a-japanese-cucumber-farmer-is-using-deep-learning-and-tensorflow

[196] Banerjee, P. 2016. The Rise of VPUs: Giving eyes to machines. [Online] www.digit.in. Available: http://www.digit.in/general/the-rise-of-vpus-giving-eyes-to-machines-29561.html [Accessed: 22/03/2017].

[197] Movidius. 2017. Embedded Neural Network Compute Framework: Fathom. [Online] Movidius.com. Available: https://www.movidius.com/solutions/machine-vision-algorithms/machine-learning [Accessed: 03/03/2017].

[198] Dzyre, N. 2016. 10 Forthcoming Augmented Reality & Smart Glasses You Can Buy. [Blog] Hongkiat. Available: http://www.hongkiat.com/blog/augmented-reality-smart-glasses/ [Accessed: 03/03/2017].

[199] Google. 2017. Tango. [Website] get.google.com. Available: https://get.google.com/tango/ [Accessed: 23/03/2017].

[200] Assael et al. 2016. LipNet: End-to-End Sentence-level Lipreading. [Online] arXiv: 1611.01599. Available: arXiv:1611.01599v2

[201] Albanie et al. 2017. Stopping GAN Violence: Generative Unadversarial Networks. [Online] arXiv: 1703.02528. Available: arXiv:1703.02528v1

[202] Tractica. 2016. Computer Vision Hardware and Software Market to Reach $48.6 Billion by 2022. [Website] www.tractica.com. Available: https://www.tractica.com/newsroom/press-releases/computer-vision-hardware-and-software-market-to-reach-48-6-billion-by-2022/ [Accessed: 12/03/2017].

[203] ibid


The MTank team creates unique resources on Artificial Intelligence for free and for everyone. Find us @ www.themtank.org