The 14th European Conference on Computer Vision
Deep learning is the dominant platform for computer vision research today and was widely discussed across multiple applications at the 14th European Conference on Computer Vision, held in Amsterdam from October 11-14, 2016. Deep learning is a branch of machine learning based on computational models (often called deep neural networks) that learn progressively more abstract levels of data representation. Applications of deep learning range from advanced fields of science to areas of everyday life, many of which were the focus of this conference, such as:
- Street sign detection and reading
- Word spotting in images and videos
- Handwritten document matching
- Road scene understanding and autonomous driving
- Visual object classification and tracking
- Video segmentation
- Crowd understanding and pedestrian behavior prediction
- Face recognition and facial expression recognition
Researchers and thought leaders from leading universities such as MIT, Stanford, Carnegie Mellon, Oxford and Cambridge shared their studies and results alongside representatives from companies such as Google and Google DeepMind, Facebook, Amazon and Microsoft. Parascript was also in attendance, represented by its Chief Executive Officer, Alexander Filatov.
“Convolutional Neural Networks and Recurrent Neural Networks based on Long Short-Term Memory architecture serve as major building blocks for most systems that use deep learning,” CEO Alexander Filatov explained. “This is an area that we are extremely interested in as a company, and it promises to transform the future of many industries including our own.”
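As a rough illustration of these two building blocks, the sketch below (written with the PyTorch library, and not tied to any particular system presented at the conference) passes a dummy image through a small convolutional block and then feeds the resulting feature map to an LSTM as a left-to-right sequence; all layer sizes are arbitrary choices for the example.

```python
# Minimal sketch of the two building blocks named above, using PyTorch.
import torch
import torch.nn as nn

# Convolutional block: learns local visual features from an image.
conv_block = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),
)

# Recurrent block: an LSTM that models sequences, such as features
# read left to right across an image.
lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)

image = torch.randn(1, 3, 32, 32)                  # one dummy 32x32 RGB image
features = conv_block(image)                       # -> (1, 16, 16, 16) feature map
sequence = features.mean(dim=2).permute(0, 2, 1)   # collapse height -> 16 time steps
outputs, _ = lstm(sequence)                        # per-step hidden states
print(outputs.shape)                               # torch.Size([1, 16, 32])
```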
Trends in Deep Learning and Computer Vision
The advantages of the deep learning approach have been demonstrated by many applications, for example, by the AlphaGo system, which made winning headlines after its October 2015 victory. AlphaGo, developed by Google DeepMind, plays the ancient board game Go. In that match, it made history as the first software program to beat a professional player, three-time European champion Fan Hui, on a full-sized board without handicaps. In March 2016, AlphaGo went on to defeat Lee Sedol, one of the world's top Go players. AlphaGo's deep neural networks were trained on 30 million moves from games of human experts, and the system then learned further from thousands of games played between different instances of itself.
“End-to-end learning systems are gaining popularity,” said Mr. Filatov. “Take the application for reading street signs. These signs appear in photos taken from moving cars. The traditional approach would be to first find candidate locations of a street sign in the image, then split candidate signs into lines, then split each line into words and each word into characters, and finally read all the characters. Development of such a system is a resource-intensive and time-consuming process, and the results achieved by such complicated pipelines are often far from desired. The reason is that each stage of this traditional process can introduce errors, and no wider context is taken into account at later stages.”
End-to-End Learning Systems
“With end-to-end learning there are no sequential steps,” Mr. Filatov said. “Instead, you build a sophisticated deep neural network architecture that takes the whole image as input, including the complex real-life background surrounding the street sign. The learning procedure then automatically trains the system to output all of the street sign’s characters.”
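To make the idea concrete, here is a minimal sketch of such an end-to-end architecture in PyTorch: a small CRNN-style network (convolutional layers followed by a bidirectional LSTM) that takes a whole image and emits per-position character scores. The layer sizes, input resolution and 37-symbol alphabet are illustrative assumptions, not details of the system described above.

```python
# Minimal end-to-end image-to-text sketch in PyTorch (CRNN-style).
# The whole image goes in; a sequence of character scores comes out.
import torch
import torch.nn as nn

NUM_CLASSES = 37  # assumed alphabet: 26 letters + 10 digits + a "blank" symbol

class SignReader(nn.Module):
    def __init__(self):
        super().__init__()
        # Convolutional feature extractor applied to the full input image.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # Bidirectional LSTM reads the feature map left to right.
        self.rnn = nn.LSTM(64, 128, bidirectional=True, batch_first=True)
        # Per-timestep character scores.
        self.classifier = nn.Linear(2 * 128, NUM_CLASSES)

    def forward(self, images):
        features = self.cnn(images)            # (B, 64, H/4, W/4)
        features = features.mean(dim=2)        # collapse height -> (B, 64, W/4)
        sequence = features.permute(0, 2, 1)   # (B, W/4, 64): one step per column
        hidden, _ = self.rnn(sequence)         # (B, W/4, 256)
        return self.classifier(hidden)         # (B, W/4, NUM_CLASSES)

model = SignReader()
scores = model(torch.randn(2, 3, 32, 128))     # two dummy 32x128 street-sign crops
print(scores.shape)                            # torch.Size([2, 32, 37])
```

Trained with a sequence loss such as CTC, a network of this shape learns to map raw pixels directly to character sequences without any explicit segmentation stages.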
Since deep learning requires millions of images to train a system to produce high-quality results, it is difficult and often unrealistic to collect data in such massive quantities. As a result, there is a significant push toward generating synthetic data for training. For example, to read street signs, it is possible to generate the content of signs using multiple computer fonts, take a collection of real-life backgrounds, add models of noise such as character deformations, color changes and occlusions, and combine these components at random to produce the required number of training images.
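A minimal sketch of such a generator is shown below, using the Pillow and NumPy libraries; the font file and the backgrounds folder are hypothetical placeholders, and a production generator would use far richer deformation and occlusion models.

```python
# Minimal sketch of synthetic training-image generation: render random text
# over a random background and add pixel noise. Paths are assumed placeholders.
import random
from pathlib import Path

import numpy as np
from PIL import Image, ImageDraw, ImageFont

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
FONT_PATH = "DejaVuSans.ttf"                            # assumed font file
BACKGROUNDS = list(Path("backgrounds").glob("*.jpg"))   # assumed folder of real photos

def make_sample(width=128, height=32):
    text = "".join(random.choices(ALPHABET, k=random.randint(3, 8)))
    # Start from a random real background crop, or a flat color if none exist.
    if BACKGROUNDS:
        bg = Image.open(random.choice(BACKGROUNDS)).convert("RGB").resize((width, height))
    else:
        bg = Image.new("RGB", (width, height), tuple(random.randint(0, 255) for _ in range(3)))
    draw = ImageDraw.Draw(bg)
    font = ImageFont.truetype(FONT_PATH, size=random.randint(18, 26))
    draw.text((random.randint(0, 10), random.randint(0, 6)), text,
              font=font, fill=(255, 255, 255))
    # Add pixel noise to mimic real-world degradation.
    pixels = np.asarray(bg, dtype=np.int16)
    pixels = pixels + np.random.randint(-20, 21, pixels.shape, dtype=np.int16)
    noisy = Image.fromarray(np.clip(pixels, 0, 255).astype(np.uint8))
    return noisy, text   # the rendered text is the automatically known ground truth

image, label = make_sample()
image.save(f"sample_{label}.png")
```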
The positive consequence of such an approach is that ground truth information for training is generated automatically. “This means that ground truth data, the ‘objective’ and ‘provable’ data used for training and testing the system, no longer needs to be gathered manually. That keeps the system accurate and fast, but most importantly, it is an enormous savings in people’s time,” said Mr. Filatov.