“No Easy Button, But We Are Getting Closer” is Part 1 of 2 on machine learning and deep learning in Intelligent Document Processing (IDP).
In just about every corner of the technology and business media, you find breathless articles extolling the magic of machine learning, especially a type called deep learning. With new machines that can learn and make inferences from vast amounts of data, practically any process can be automated with high rates of accuracy.
For the most part, this is true. It is true that machines can enable processes to be automated, some almost completely, and that automation can achieve levels of accuracy higher than what we humans, who get bored and tired, can sustain. It is also true that machines learn from vast amounts of data. But while expectations within organizations are focused on levels of automation and accuracy, a key ingredient is often overlooked: those vast amounts of data.
Depending on Heuristics
You see, unlike us humans, who can learn from a relatively small set of data and some basic instruction, machine learning, and especially deep learning, requires a large data set to analyze and develop inferences from. Without delving into neuroscience concepts, humans are able to create “heuristics” that machines cannot. Heuristics are essentially shortcuts for cognitive tasks that enable us to be more efficient. A heuristic is like an “intuition” you have about taking an action based on relatively little information. Developing heuristics doesn’t take much data.
You can hand a person five different invoices with instructions on what data you need, and they quickly develop heuristics for locating the same data on each of them, and on the hundreds or thousands of invoices that come after. The flip side is that heuristics aren’t always correct, and they are definitely not comprehensive, since they do not work in all situations.
Crunching Vast Amounts of Data
Conversely, machine learning cannot jump to conclusions about how to act based on intuition. If you give it the same five invoices, it cannot automatically develop a reliable inference about where your required data is located. Rather, the power of machines is that they can crunch significantly larger amounts of data than humans can, and they can detect even seemingly invisible attributes in that data to come to a conclusion.
So what does this mean? Machine learning is not useful in all situations; it works better (and is better than humans) in situations where there is a high degree of variance in a large amount of data that needs to be processed to specific requirements. Think of handwriting recognition. Think of weather forecasting. Think of understanding language.
Each of these represents a problem where there is a significant amount of variance in the data. For handwriting, there is a different “font” for every person. For weather forecasting, there is a seemingly infinite number of variables that impact outcomes. For language understanding, in addition to the different ways in which words are spoken, there are thousands of ways to string words together.
Machine learning can crunch the enormous amount of data involved with each and develop “models” for how to produce output, whether that is a transcribed handwritten letter, a 10-day forecast for Colorado, or a response to a verbal request.
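To make “developing a model” a little more concrete, here is a minimal sketch of the handwriting case. It is purely illustrative, not taken from any particular IDP product: it assumes scikit-learn is installed and uses its small bundled digits dataset as a stand-in for real documents.

# Minimal, illustrative sketch of "learning a model" from labeled handwriting samples.
# Assumption: scikit-learn is available; its bundled digits dataset stands in for real documents.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

digits = load_digits()  # roughly 1,800 labeled 8x8 images of handwritten digits
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

# The "model" is simply a set of parameters fit to the training examples.
model = LogisticRegression(max_iter=2000)
model.fit(X_train, y_train)

# How well it generalizes to handwriting it has never seen.
print(f"Held-out accuracy: {accuracy_score(y_test, model.predict(X_test)):.1%}")

Even this toy example only generalizes because it sees well over a thousand labeled samples, which brings us right back to the input data problem.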
How Much Data Is Enough?
Back to the input data problem. Does every machine learning project require hundreds of thousands of samples to be a success? Machines do generally work better with more data, but luckily there are shortcuts to that problem as well. So the answer to the 100,000-sample question is “no”.
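One rough way to see this for yourself is a simple learning curve: train the same model on progressively larger slices of your labeled data and watch where accuracy stops improving. The sketch below does this with the same illustrative digits data used above; the sample sizes and classifier are assumptions chosen only for demonstration.

# Illustrative learning-curve sketch: how accuracy changes as the training set grows.
# Assumption: scikit-learn's bundled digits dataset stands in for real document samples.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.3, random_state=0
)

# Train on progressively larger slices of the shuffled training data.
for n in (100, 250, 500, 1000, len(X_train)):
    model = LogisticRegression(max_iter=2000)
    model.fit(X_train[:n], y_train[:n])
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{n:>5} training samples -> {acc:.1%} held-out accuracy")

On a curve like this, the gains typically flatten out well before the sample count becomes enormous. Where that plateau falls, and how to reach it with fewer labeled samples, is the subject of Part 2.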
Next week, in Part 2 of this article, we delve into the factors that govern the necessary number of samples as well as strategies and shortcuts to reduce the effort.