Transactional OCR — Recognition and Capture with Confidence
If you have ever worked with transactional OCR (invoice processing, check processing, etc.) at a technology level, or examined OCR output, you’ve likely noticed a number that is supplied with each field-level answer. This number is typically called a “confidence score.” Some scores use a scale of 0-100, some use scales that go to 800, and still others range from 0-1000. Regardless of the scale employed, these scores are there for a reason: they are intended to help you, the person using OCR, determine when to treat a specific answer as potentially incorrect. When an answer is potentially incorrect, the typical process is to route it for human review or to put it through additional automated validation using other means.
Some level of confidence measurement is built into most machine learning and AI software. Sometimes it is visible to the user, and sometimes it is only used for internal tuning.
The Confidence Score
What is a confidence score? Does it really reflect confidence in the answer? Using a scale of 0-100, is a score of 80 always better than a score of 30? Does a score of 80 mean that the answer has a probability of 80 percent of being correct? The answer is both straightforward and somewhat abstract.
Let’s explore this. Confidence scores are numbers internally-generated by the software and are based upon statistical analysis of results of a very large sample set. Ideally, the objective is to establish some sort of probabilistic mechanism to help those using it to determine the overall quality of the output. In the most abstract sense, a high score should reflect a higher probability of accuracy than a low score regardless of the actual data.
That said, there are a lot of ways this originally intended objective is affected, depending upon the technology and the application. When we move away from “generic OCR” based upon text fonts and other standard analysis and towards more focused applications such as invoice processing and check processing, additional technologies and tuning are often applied at a field level, and these affect how the scores are generated. The result is that you can no longer take the score by itself as a way to determine when to treat an answer as potentially incorrect. You cannot just pick a number and then use it. Production data samples become a very important part of your success. Analyzing the scores along with the sample output against ground truth data is essential to understand which confidence scores correspond to high accuracy and which point to lower accuracy. A higher score, at a field level, should still indicate a higher likelihood of accuracy. However, the threshold that separates an accepted answer from a rejected one might be set at a confidence score of 60, 80, or potentially a lower score. It depends on what you find in the results.
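As a rough illustration of setting a threshold from sample data rather than picking a number, the sketch below takes field-level results that have already been compared against ground truth and finds the lowest score at which the accepted answers still meet a target accuracy. The function name `pick_threshold`, the `(score, is_correct)` input shape, and the target value are assumptions made for this example; they do not come from any particular OCR product.

```python
from typing import List, Optional, Tuple


def pick_threshold(
    samples: List[Tuple[float, bool]], target_accuracy: float
) -> Optional[float]:
    """Return the lowest score threshold whose accepted answers meet the target.

    samples: hypothetical (confidence_score, is_correct) pairs for ONE field,
             where is_correct comes from comparing OCR output to ground truth.
    An answer is "accepted" when its score is >= the threshold.
    Returns None if no threshold reaches the target accuracy.
    """
    # Walk from the highest score down, tracking accuracy of everything accepted so far.
    ranked = sorted(samples, key=lambda s: s[0], reverse=True)
    best: Optional[float] = None
    correct = 0
    for i, (score, is_correct) in enumerate(ranked, start=1):
        correct += int(is_correct)
        # Answers sharing a score are accepted or rejected together,
        # so only evaluate once the whole group of equal scores is included.
        if i < len(ranked) and ranked[i][0] == score:
            continue
        if correct / i >= target_accuracy:
            best = score
    return best


if __name__ == "__main__":
    # Made-up sample for illustration only; real analysis needs a large, representative set.
    invoice_total = [(90, True), (85, True), (70, True), (60, False), (55, True), (30, False)]
    print(pick_threshold(invoice_total, target_accuracy=0.9))
```

Running this per field, on production samples, is what lets one field end up with a threshold of 60 and another with 80. A tiny sample like the one shown is only there to make the mechanics visible.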
Confidence Scores and Quality Results
For this reason, a confidence score of 30 for any specific answer versus a confidence score of 90 is meaningless until the average scores and related accuracy are determined and fully understood for a representative sample set. Evaluating answers for a handful of documents is insufficient. When our clients have questions regarding confidence scores during evaluation or testing, we always request that they test using larger samples and then evaluate the overall results. Questions such as “Why is performance not good?” are often based upon misinterpreted analysis that looks solely at answer-by-answer confidence scores, without checking whether those answers and scores reflect reliable thresholds on a larger set of data.
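One way to look at scores across a larger sample, instead of answer by answer, is to group results into score ranges and measure the accuracy observed in each range. The sketch below assumes a 0-100 scale and the same hypothetical `(score, is_correct)` pairs derived from ground truth; the function name and bucket width are illustrative choices, not part of any vendor's tooling.

```python
from typing import Dict, List, Tuple


def accuracy_by_bucket(
    samples: List[Tuple[float, bool]], bucket_width: int = 10, max_score: int = 100
) -> Dict[int, Tuple[int, float]]:
    """Group (score, is_correct) pairs into score ranges.

    Returns {bucket_start: (answer_count, observed_accuracy)}, ordered by bucket.
    Assumes scores run from 0 to max_score; the top score falls into the last bucket.
    """
    totals: Dict[int, Tuple[int, int]] = {}
    for score, is_correct in samples:
        start = int(score) // bucket_width * bucket_width
        start = min(start, max_score - bucket_width)  # keep max_score in the top bucket
        count, correct = totals.get(start, (0, 0))
        totals[start] = (count + 1, correct + int(is_correct))
    return {
        start: (count, correct / count)
        for start, (count, correct) in sorted(totals.items())
    }


if __name__ == "__main__":
    # Made-up results for illustration only.
    results = [(95, True), (92, True), (100, True), (35, True), (31, True), (38, False)]
    for start, (count, accuracy) in accuracy_by_bucket(results).items():
        print(f"scores {start}-{start + 10}: {count} answers, {accuracy:.0%} correct")
```

If the 30-40 range turns out to be correct most of the time on a representative sample, a score of 30 on an accurate answer is no longer surprising; it simply shows that the score is not a literal percentage.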
So when testing software, before reaching for the phone to ask why a certain score is being presented or why a low score accompanies a perfectly accurate answer, examine the overall range of scores on a larger set of results. It may be that the software results and associated confidence scores are accurate, but, like many of us, reflect a little lack of confidence.