Now that Software-defined Intelligent Document Processing (IDP) is here, is gathering input data really that big of a problem?
In the age of machine learning, input data is the new gold. You may have seen statements like this in the news, but it's not always clear why they're true, or more importantly, why they're true for your organization and what you're in for if the necessary input data isn't identified and addressed.
Garbage In, Garbage Out
The explanation starts with a very old phrase: "garbage in, garbage out." It is largely attributed to George Fuechsel, an IBM programmer and instructor who used it to emphasize that computers are pretty "dumb" and cannot produce good output from bad input. The computing world is full of stories and case studies about enterprises that invested heavily in computing infrastructure only to be undone by the poor-quality input data used to design the system.
When it comes to machine learning, we now have software that can analyze data and act on it, all with minimal human oversight. For instance, the car company Tesla has invested in collecting a large amount of driving-related data, including video and the actions of the driver. That data is analyzed so the system can mimic how a driver slows or speeds up a car or stays within a lane. Now imagine the input data were faulty: what if the video covered only one-lane roads and never multi-lane highways, or captured an erratic driver, perhaps a constant lane-changer? The resulting behavior of the system would be just as bad.
So machine learning-based software and the processes around it double down on George's old adage. When machine learning does all of the analysis, there is no one watching to observe and catch these potential bad behaviors, so we must spend far more time ensuring that the input data is pristine if the output is to be what we expect and want.
What Pristine Data Looks Like
So what does "pristine" look like? The answer depends on the scope of what you are trying to accomplish, but one word always comes up: representativeness. It means the input data represents the actual real-world activities within the scope of a project; the data should accurately reflect the range of events and activities the system will encounter once put in use. Input data used for automating driving, for example, should include all the various road signage, different street designs, and scenes that represent typical driving scenarios.
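To make "representative" a little more concrete in the document world, here is a rough sketch of one way to sanity-check a labeled sample against the mix of document types you expect in production. The categories, shares, and tolerance below are illustrative assumptions, not a standard test.

```python
from collections import Counter

# Expected mix of document types in production (illustrative assumption).
expected_share = {"purchase contract": 0.6, "insurance policy": 0.4}

# Labels on the training sample actually collected (hypothetical counts).
sample_labels = ["purchase contract"] * 80 + ["insurance policy"] * 20

counts = Counter(sample_labels)
total = sum(counts.values())
for doc_type, expected in expected_share.items():
    actual = counts[doc_type] / total
    # Flag any document type whose sample share drifts more than 10 points
    # from the expected production share (an arbitrary tolerance).
    if abs(actual - expected) > 0.10:
        print(f"{doc_type}: sample share {actual:.0%} vs. expected {expected:.0%}")
```

A sample skewed like the one above would over-teach the system about purchase contracts and under-teach it about insurance policies, which is exactly the kind of gap a representativeness check is meant to surface.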
Intelligent Document Processing benefits from machine learning's ability to analyze significant amounts of data and identify attributes we might not see. That automates the arduous, complex, expensive, and risk-prone process of configuring software, making document-based information as easy to use as information stored in a database.
IDP and Machine Learning
For intelligent document processing that uses machine learning, the input data are the documents and the associated labels that describe what successful completion means. That can be a set of documents and their types, such as "purchase contract" or "insurance policy." Or it can be a set of documents and the data values for which you want to automate entry into other systems.
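As a hedged illustration, labeled input for those two cases might look something like the following. The file names, document types, and field names are hypothetical, and real formats vary by tool.

```python
# Case 1: documents paired with their types, for document classification.
document_type_labels = [
    {"file": "doc_0001.pdf", "type": "purchase contract"},
    {"file": "doc_0002.pdf", "type": "insurance policy"},
]

# Case 2: documents paired with the values you want entered into other systems.
field_value_labels = [
    {
        "file": "doc_0001.pdf",
        "fields": {
            "buyer_name": "Acme Corp",
            "purchase_amount": "12,500.00",
            "effective_date": "2021-03-15",
        },
    },
]
```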
All the big tech companies make massive investments in creating and gaining access to millions upon millions of data points in order to deliver intelligent services. And let's face it: gathering and curating large sample sets that are representative and free of bias is not a trivial task.
When using machine learning within IDP, is gathering or creating sample data really that hard? The good news is that if you already have a process that uses document-based information, you already have a sample set. The task then becomes adding tags to that data so that machine learning can analyze it, and having staff or a third-party service tag data you already have is a far simpler, less costly, and lower-risk undertaking.
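For illustration only, here is one minimal way such a tagging pass could be recorded, assuming a hypothetical folder of PDFs and a simple CSV manifest; the function name and file layout are made up for this sketch.

```python
import csv
from pathlib import Path

def write_label_manifest(doc_dir: str, labels: dict, out_path: str) -> None:
    """Write a simple file-to-label manifest that a training job could read."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["file", "document_type"])
        for doc in sorted(Path(doc_dir).glob("*.pdf")):
            # Staff or a third-party service supply the label for each document.
            writer.writerow([doc.name, labels.get(doc.name, "unlabeled")])

# Example usage with hypothetical names:
# write_label_manifest("contracts/", {"doc_0001.pdf": "purchase contract"}, "labels.csv")
```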
Instead of spending hundreds of thousands of dollars on training and professional services, you invest a few hundred to a few thousand dollars in creating high-quality input data. That's a trade-off most businesses are able and willing to make.