In part 4 of our series on key factors driving Intelligent Document Processing (IDP) adoption, we get to the data science. This bit of data science, known as the confidence score, is probably the most important factor in your decision to adopt IDP (or not). The three articles leading up to this one focus on context, machine learning, and how to work optimally in suboptimal conditions.
When it comes to enjoying AI-based automation that needs little human intervention, we move into the realm of machine learning: predicting outcomes through statistical analysis of representative sample sets. There, I got the jargon out of the way. Now for what all this really means.
AI-based Automation & ML-based Predictions
Let’s say you want to know what the weather will be like tomorrow. Most people would reasonably suspect that a weather prediction hardly starts and ends with knowing what the weather is like today. You would want to compile a large historical series of data going back decades. And then you’d look carefully at the current weather data, not just for your area but for surrounding areas as well.
Another example is trying to predict the outcome of an election. Again, few would expect an accurate prediction from polling one specific town. Rather, you would need to conduct polls across a number of regions representing a wide variety of demographics.
The point here is that any attempt to make a prediction, even predicting the specific document type involved in a mortgage loan file or an insurance claim, requires a lot of data and attention to statistical best practices. So first, you need a good amount of data that provides examples of the full range of documents you regularly encounter in your processes.
IDP Confidence Scores
Next comes the part with confidence scores. But just what is a confidence score anyway? Most people don’t have a solid understanding, and, in my experience, even those within the IDP industry, including the analysts who cover it, have only a modest one. That’s a problem. There’s another problem, too: many systems that claim to provide high levels of accuracy cannot also deliver high levels of unattended automation. So after you read this article, consider yourself among the few who really get it.
A confidence score is a number ranging from 0 to some high point (different systems use different ranges) that is output along with an answer (sometimes also called a prediction). Its purpose is to let a system gauge, with some degree of statistical precision, whether an answer is likely correct or incorrect. The biggest mistake people make is assuming that a confidence score, in and of itself, is a measure of probability; it is not. A score of 10 doesn’t mean a 10% probability of being correct, just as a score of 100 doesn’t mean 100% correct. In fact, there are systems that output scores greater than 1000. Can a score of 1000 really mean the answer has a 1000% probability of being correct? Of course not!
Confidence Scores and Probability
To convert a confidence score into a probability, you need a large number of answers and corresponding scores from real sample data. So there’s that ugly truth of needing large sample sets again. Presuming you have a large, representative sample set and have run it through a configured system, you now have data to work with.
First, you need the real, 100%-accurate value for each answer the system outputs. For instance, if you want to measure how well the system extracts the total payment on a health remittance, you need the actual amounts as they exist on the remittances themselves. This is commonly referred to as ground truth data.
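To make this concrete, here’s a minimal sketch in Python of pairing system output with ground truth. The file name and the columns (doc_id, predicted, actual, confidence) are hypothetical stand-ins; a real IDP system will export results in its own format.

```python
import csv

# Pair each extracted answer with its ground-truth value.
# Assumes a hypothetical CSV with columns: doc_id, predicted, actual, confidence.
def load_results(path):
    with open(path, newline="") as f:
        return [
            {
                "doc_id": row["doc_id"],
                "correct": row["predicted"].strip() == row["actual"].strip(),
                "confidence": float(row["confidence"]),
            }
            for row in csv.DictReader(f)
        ]

results = load_results("remittance_test_run.csv")
```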
System Accuracy
The first measurement you can make is total system accuracy. This is simply the number of accurate answers divided by the total number of answers: 800 accurate answers out of a test run of 1,000 gives you a system accuracy of 80%. The reality is that many IDP implementations, even ones with a lot of effort behind them, rarely go this far. Accuracy might be measured on only 100 samples, which is generally not enough to provide a reliable measurement.
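Continuing the sketch above, that calculation is one line:

```python
# System accuracy: accurate answers divided by total answers.
accuracy = sum(r["correct"] for r in results) / len(results)
print(f"System accuracy: {accuracy:.1%}")  # e.g. 800 of 1,000 -> 80.0%
```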
The second thing you need to do is sort the data by confidence score so that you have an ordered list, from lowest to highest or vice versa.
With the data sorted, even just glancing at it, you should start to see a point in the confidence score column where mostly incorrect answers turn into mostly correct answers. That point might be at a score of 20 or at a score of 120. The key is that, with a large sample set, you should see that switch between bad data and good data.
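Reusing the hypothetical results list from the earlier sketch, that sort-and-scan step might look like this:

```python
# Sort by confidence and print each score next to its correctness; scanning
# the output should reveal where wrong answers give way to right ones.
for r in sorted(results, key=lambda r: r["confidence"]):
    print(f"{r['confidence']:>8.1f}  {'correct' if r['correct'] else 'WRONG'}")
```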
For instance, using that health remittance payment amount example, out of 1,000 total answers you might see that 90% of the answers with confidence scores below 80 are incorrect, while 95% of answers scoring 80 or above are correct. You then use trial and error to find the optimal confidence score threshold, the one that supplies a high percentage of accurate data.
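That trial and error is easy to automate. The sketch below, again assuming the results list from before, tries each observed score as a cutoff and reports the first one that meets an illustrative 99% accuracy target, along with the share of answers that could then flow through unattended:

```python
# Try each observed confidence value as a cutoff; return the lowest cutoff
# whose auto-accepted answers meet the target accuracy.
def find_threshold(results, target=0.99):
    for cutoff in sorted({r["confidence"] for r in results}):
        accepted = [r for r in results if r["confidence"] >= cutoff]
        accuracy = sum(r["correct"] for r in accepted) / len(accepted)
        if accuracy >= target:
            automation = len(accepted) / len(results)  # share needing no review
            return cutoff, accuracy, automation
    return None  # no cutoff reaches the target on this sample set

found = find_threshold(results)
if found:
    cutoff, accuracy, automation = found
    print(f"Threshold {cutoff}: {accuracy:.1%} accurate, "
          f"{automation:.1%} processed unattended")
```

Note the trade-off this exposes: raising the cutoff buys accuracy but shrinks the share of documents processed unattended, which is exactly the balance a useful confidence score lets you strike.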
Separating Good Data from Bad
Here is where another gotcha can appear. Some systems, and the developers building them, might focus only on overall system accuracy, never taking the time to ensure that the confidence scores are useful. The result is that, while you may observe high system accuracy, you will not find an obvious threshold at which good data can be reliably separated from bad. This means you must either live with the system’s errors or manually verify 100% of the output. That is far from what we would consider unattended automation.
Achieving Maximum Unattended Automation
Internal consistency in confidence scores, regardless of the scores themselves, is the critical component that makes maximum unattended automation possible.
The good news is that, increasingly, systems perform these measurements on their own. Using machine learning to adjust configurations, models, and confidence score thresholds, they improve their output and help ensure that the 99% accuracy you need is consistently met.
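Schematically, that self-tuning amounts to re-running the threshold search as verified data accumulates. The sketch below reuses find_threshold from above; the function and its shape are an assumption for illustration, not any vendor’s API.

```python
# Schematic re-calibration loop: as human-verified answers accumulate,
# recompute the cutoff so the accuracy target keeps holding even as
# document mixes and models drift.
def recalibrate(verified_history, new_verified_batch, target=0.99):
    verified_history.extend(new_verified_batch)
    return find_threshold(verified_history, target)
```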