I routinely like to juxtapose IDP software with other enterprise software, mostly because the differences, while often neglected, are vast, and we’re not just talking about applications or features and functionality. With most enterprise software, the emphasis is on user or process efficiency or on introducing new capabilities to the business, all of which can typically be evaluated and compared with a checklist of features and capabilities. With IDP software, the raison d’être is to achieve the highest achievable level of document automation, typically measured by the percentage of tasks that can be automated at accuracy equal to or better than a human being. So how do you evaluate that?
To make things more complicated, the IDP market, which now numbers more than 70 vendors, is rife with claims like “out-of-the-box you can already expect over 80% automation at over 98% accuracy,” leaving you to rely largely on trust in your vendor. So here is a practical cheat sheet of concepts and basic questions you should incorporate into your own document automation journey.
Confidence Scores
It all starts with the confidence score. Confidence scores are not unique to IDP software; any machine learning result (also called an “answer” or “prediction”) comes with one. The purpose of this score is to help users of machine learning output determine whether a result is likely correct. These scores often use a 0-100 scale, but they don’t have to; some use completely different scales, like 0-1500.
Key things to know:
- They are not probabilistic. In other words, a single score means little on its own; a score of 60 does not mean there is a 60% probability that the answer is correct. To apply probability, you’ll need to use a “confidence score threshold”.
- Scores can be assigned at the character level (as in the words C-O-N-F-I-D-E-N-C-E S-C-O-R-E having a confidence score for each letter), the word level (i.e. a single score for the word CONFIDENCE and one for SCORE), or the field level (i.e. a single score for CONFIDENCE SCORE); see the sketch after this list.
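To make the granularity concrete, here is a hypothetical sketch in Python of how one system might report the same extraction at the character, word, and field level. The structure, field names, and scores are all made up for illustration; every vendor reports these differently (and on different scales).

```python
# Hypothetical confidence output for the extracted value "CONFIDENCE SCORE".
# Structure and numbers are illustrative only.

character_level = [
    {"char": c, "confidence": s}
    for c, s in zip("CONFIDENCE", [99, 97, 98, 99, 95, 99, 98, 97, 99, 96])
]

word_level = [
    {"word": "CONFIDENCE", "confidence": 96},  # vendors differ: may be the min or mean of character scores
    {"word": "SCORE", "confidence": 91},
]

field_level = {"field": "document_title", "value": "CONFIDENCE SCORE", "confidence": 91}
```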
Confidence Score Thresholds
To turn individual confidence scores into actionable data, you need to apply a threshold that determines, statistically, whether data is probably correct or incorrect. Thresholds are defined after comparing ML answers to what the actual correct answers should be. For a great primer on how thresholds are established, get this document.
Things to know:
- For IDP, thresholds should be established at the page or document level for document classification output and at the field level for data extraction output.
- Output with scores that fall above a threshold should generally be treated as correct while those below should be reviewed.
- The goal of any threshold is to optimize for accuracy, such that scores above the threshold meet your target accuracy rate, whether that is 90%, 95%, 98%, or 99% (see the sketch after this list).
- There is no single threshold; each task (e.g. classification or field-level extraction) will have its own threshold.
- While every machine learning answer has a confidence score, those scores are not always reliable. For instance, many systems output confidence scores in a way that makes it impossible to use a threshold to determine accuracy, leaving organizations to review ALL output.
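To illustrate how a threshold might be derived, here is a minimal sketch, assuming you already have ML output scored against ground truth as (confidence, correct) pairs. It finds the lowest threshold at which the output above the threshold meets a target accuracy; the function name, data, and target are illustrative, not any vendor’s actual method.

```python
def find_threshold(results, target_accuracy=0.98):
    """Find the lowest confidence threshold such that predictions at or
    above it are correct at least `target_accuracy` of the time.

    `results` is a list of (confidence, is_correct) pairs produced by
    comparing ML output against ground truth. Returns None if no
    threshold reaches the target (the scores may simply be unreliable).
    """
    # Try candidate thresholds from the lowest to the highest observed score.
    for candidate in sorted({conf for conf, _ in results}):
        above = [correct for conf, correct in results if conf >= candidate]
        if above and sum(above) / len(above) >= target_accuracy:
            return candidate
    return None


# Illustrative only: (confidence, was the answer actually correct?)
sample = [(45, False), (62, True), (71, False), (83, True), (90, True), (97, True)]
print(find_threshold(sample, target_accuracy=0.95))  # 83 for this toy sample
```

In practice you would run this per task (one threshold per document class or per field), on a much larger ground truth sample.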
System Accuracy
System accuracy is the proportion of correct output from IDP software relative to all output. Accuracy can also be measured at the task level (e.g. a document classified correctly or a field extracted correctly); a minimal sketch of a field-level calculation follows the list below.
Key things to know:
- You typically need to gauge system accuracy with 1000 samples or more to have a realistic understanding of true performance.
- The more variance you have with your documents, typically measured by the number of different “layouts”, the more samples you will need.
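Here is a minimal sketch of measuring accuracy at the field level, assuming you have ground truth values for a sample of documents. The keying scheme and field names are hypothetical; a real evaluation would also normalize values (dates, currency, whitespace) before comparing.

```python
def field_accuracy(predictions, ground_truth):
    """Share of extracted field values that exactly match the answer key.

    `predictions` and `ground_truth` are dicts keyed by (document_id, field_name).
    """
    total = len(ground_truth)
    correct = sum(
        1 for key, truth in ground_truth.items() if predictions.get(key) == truth
    )
    return correct / total if total else 0.0


# Illustrative sample; in practice you want 1,000+ samples for a stable estimate.
truth = {("doc-001", "invoice_total"): "1,250.00", ("doc-001", "invoice_date"): "2024-03-01"}
preds = {("doc-001", "invoice_total"): "1,250.00", ("doc-001", "invoice_date"): "2024-01-03"}
print(field_accuracy(preds, truth))  # 0.5
```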
System Automation Rate
The automation rate is the proportion of output above a given confidence score threshold relative to all output, above and below the threshold.
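In code this is only a few lines; a sketch, assuming field-level confidence scores and a threshold already derived from ground truth:

```python
def automation_rate(confidences, threshold):
    """Fraction of all output whose confidence is at or above the threshold,
    i.e. the share you would accept without human review."""
    if not confidences:
        return 0.0
    return sum(1 for c in confidences if c >= threshold) / len(confidences)


# Illustrative: with a threshold of 83, 3 of these 6 fields would be automated.
print(automation_rate([45, 62, 71, 83, 90, 97], threshold=83))  # 0.5
```

Together with system accuracy, this is the number vendors are really quoting when they claim “80% automation at 98% accuracy,” which is why both need to be measured on your own documents.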
Ground Truth Data
The only way to create confidence score thresholds and to measure rates of system accuracy and automation is to use what is called ground truth data. Put simply, this data is the “answer key”: the actual correct value as it appears on the page of a document. We take this data and compare it to the ML output.
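A ground truth set can be as simple as one labeled answer key per document. A hypothetical sketch of what a single record might look like (the field names are made up):

```python
# Illustrative ground truth record for one document.
ground_truth_record = {
    "document_id": "doc-001",
    "document_class": "invoice",   # used to evaluate classification
    "fields": {                    # used to evaluate extraction
        "invoice_number": "INV-4821",
        "invoice_date": "2024-03-01",
        "invoice_total": "1,250.00",
    },
}
```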
If all of this has you wondering how you’ll ever select the right solution, we’ve created a guide on conducting a proof-of-concept that focuses on practically applying the key fundamentals covered in this article. Get it here.