The Data Science Approach
This week, I participated in AIIM’s Multi-channel Capture virtual event, where I covered the need for a data science approach wherever machine learning is applied to Intelligent Document Processing. The upshot? It all starts with high-quality input data. Anything less than high-quality data will, at best, result in unmet expectations.
So it was interesting when a Wired.com article hit my inbox yesterday focused on recent revelations about the quality of very widely used training sets such as ImageNet (and nine other data sets!). I used to think that input data quality problems were only for the uninformed organization wishing to wade into the machine learning waters. Now we find out that training data used by thousands of people is also suspect.
As the old adage goes, “garbage in, garbage out.” With machine learning, the sensitivity to garbage is turned up much higher.
Fundamentals of Training Data
Training data, regardless of the task, is typically composed of two parts: 1) the data itself and 2) the expected outcome for each data element. For instance, training data for document automation of Explanations of Benefits (EOBs) would consist of each EOB document along with the corresponding required values (also known as outcomes), exactly as they appear on the document.
There is no room for transforming the data, such as normalizing dates so that they all follow the same format. The data needs to be captured exactly as it appears on the document.
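To make this concrete, here is a minimal sketch of what such training data might look like. The file names and field names are hypothetical, chosen only for illustration, not a Parascript format:

```python
# A hypothetical example of training data for EOB field extraction:
# each entry pairs 1) the document with 2) its expected outcomes.
training_data = [
    {
        # 1) the data itself: the source document
        "document": "eob_000145.pdf",
        # 2) the expected outcomes: values exactly as they appear on the page,
        #    with no normalization (the date stays "03/07/21", not "2021-03-07")
        "outcomes": {
            "claim_number": "A-77210934",
            "service_date": "03/07/21",
            "billed_amount": "$1,250.00",
            "patient_responsibility": "$125.00",
        },
    },
    # ... one entry per EOB document in the training set
]
```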
Researchers at MIT found these errors with a novel approach, and it is one that we’ve also used at Parascript to discover problems with data that supposedly represents the ground truth. Basically, you turn a reliable, trained machine learning model back onto the suspect training data set and observe the discrepancies. In the article, they actually use one of my favorite examples: images with cats (yes, I am a cat person). As the article explains, if the model predicts with high probability that the image contains a cat, but the label states it is an image of a spoon, the label is likely incorrect.
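Here is a rough sketch of that idea in Python. It is not the MIT researchers’ code; the model interface (a predict_proba call returning class probabilities) and the confidence threshold are assumptions made purely for illustration:

```python
# Sketch: run a trusted, well-trained classifier over a suspect training set
# and flag examples where the model confidently disagrees with the stored label.
def find_suspect_labels(model, dataset, confidence_threshold=0.95):
    suspects = []
    for example, label in dataset:
        probs = model.predict_proba(example)   # assumed interface: {class_name: probability}
        predicted = max(probs, key=probs.get)  # the model's most likely class
        if predicted != label and probs[predicted] >= confidence_threshold:
            # e.g., the model says "cat" with 98% confidence, but the label says "spoon"
            suspects.append((example, label, predicted, probs[predicted]))
    return suspects  # candidates for human review, not automatic relabeling
```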
An Example: Mortgage Lending
Parascript has conducted similar analyses of client data sets that are used to verify systems relied on to produce high-quality data. As the article explains, bad data can hide problems with bad models and make good models look bad; the result is that poorly performing models end up in use in the real world. One analysis Parascript did was on a large data set of mortgage loan files. The client used this data to test their own system and was under the impression that it was achieving accuracy in the high 90s. We found otherwise. Using our pre-trained document classifiers, we showed that they were actually getting about 10 percentage points lower, which resulted in a significant, but hidden, cost. We then helped them create a better training and test data set.
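To see how label errors in a test set can inflate a reported accuracy number, consider this toy illustration. The document classes, labels, and resulting figures are made up for the example and are not the client’s data:

```python
# Illustrative only: accuracy measured against a test set containing label
# errors can overstate how a model really performs.
def accuracy(predictions, labels):
    return sum(p == l for p, l in zip(predictions, labels)) / len(labels)

predictions      = ["invoice", "eob", "eob", "invoice", "eob"]
noisy_labels     = ["invoice", "eob", "eob", "invoice", "eob"]             # happens to agree with the model
corrected_labels = ["invoice", "eob", "invoice", "invoice", "remittance"]  # after manual review

print(accuracy(predictions, noisy_labels))      # 1.0 -- looks perfect
print(accuracy(predictions, corrected_labels))  # 0.6 -- the real number is much lower
```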
Applied Machine Learning
There is no doubt that applied machine learning has significant benefits, but without understanding how to curate high-quality training data, any project is at high risk of failure. This problem is far easier to solve than wrestling with the significant complexity of alternative, manually configured systems. With machine learning that enables auto-configuration at our disposal, we now have the time to give training data our full attention.