Attaining 100% document automation may not happen in our lifetime, but that doesn't keep us from working toward the ideal.
Recently, we were asked: what percentage of invoices does your software get 100% correct?
In this case, the potential client wanted to eliminate the need to send documents to his offshore data entry group. If even one field out of ten was incorrect, the entire document had to be sent, causing additional delays for customers who expected quick turnarounds. If he could get 100% accuracy on a large percentage of those documents, it would be ideal.
Moonshot Goal for Accuracy
Wanting 100% accuracy is always the moonshot objective, but at this stage of technical capability, it just isn't possible. Humans aren't even close to 100% accurate, so why should our machines be?
The problem starts with how accuracy is measured and the confusion around when those measurements make sense and when they do not.
How to Measure Accuracy
At a high level, there are two basic ways to measure accuracy: by page and by field (or data element). Page-level accuracy measures the percentage of characters or words on the page that are correct. Field-level measurement presumes you are interested only in specific data, so you measure the accuracy of that data alone.
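To make the distinction concrete, here is a minimal sketch in Python. The field names and invoice values are hypothetical; a real system would compare OCR output against human-verified ground truth.

```python
# A minimal sketch of both measurements. Field names and invoice values
# are hypothetical illustrations, not output from any particular product.

def word_accuracy(ocr_words, truth_words):
    """Page-level: fraction of words the OCR engine read correctly."""
    correct = sum(o == t for o, t in zip(ocr_words, truth_words))
    return correct / len(truth_words)

def field_accuracy(ocr_fields, truth_fields):
    """Field-level: fraction of target data elements that match ground truth."""
    correct = sum(ocr_fields.get(name) == value
                  for name, value in truth_fields.items())
    return correct / len(truth_fields)

# Three fields of interest on one invoice page (hypothetical values).
truth = {"invoice_no": "INV-1042", "total": "318.40", "date": "2015-06-01"}
ocr   = {"invoice_no": "INV-1042", "total": "378.40", "date": "2015-06-01"}

print(field_accuracy(ocr, truth))  # ~0.667: one misread digit costs a whole field
```

Note how a single misread digit wipes out an entire field, even though the page-level word count barely moves.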
So when does it make sense to select one measurement over another?
The answer is quite simple: if you are processing forms or extracting only specific data, page-level measurements will never suffice, because character- or word-level accuracy percentages tell you nothing meaningful about the data you actually need.
For instance, if you need to locate and extract 10 data elements from a page of 1,000 words, a 99% character- or word-level accuracy measurement will not tell you whether all 10 data elements are correct. At 99% accuracy, roughly 10 words or characters on the page are erroneous, and in the worst case every one of those errors lands in a data element you care about, yielding a 0% field-level measurement, not 99%. Sure, you can attempt to validate the OCR output and correct some of the data, but that will not get you anywhere near 99%.
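The worst-case arithmetic is easy to verify; the figures below mirror the example above:

```python
total_words = 1000
word_level_accuracy = 0.99
fields_needed = 10

erroneous_words = round(total_words * (1 - word_level_accuracy))  # 10 bad words

# Worst case: every erroneous word lands in a different field of interest.
fields_hit = min(erroneous_words, fields_needed)
worst_case = (fields_needed - fields_hit) / fields_needed

print(f"page reads {word_level_accuracy:.0%} accurate, "
      f"but as few as {worst_case:.0%} of the fields may be correct")
# -> page reads 99% accurate, but as few as 0% of the fields may be correct
```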
Field-level Accuracy: Why It Matters
This is why we prefer to measure accuracy at the field level. As the example above shows, once OCR errors are counted against the fields they corrupt, accuracy can be significantly less than 99%. For structured forms, where reliable coordinates tell you where each data element lives, it is possible to extract 95% of the data at a high accuracy rate of, say, 99%.
However, poor image quality can drop that 95% significantly. And if you need to process invoices, where the data is highly variable, template-based approaches are not practical; locating data without a template introduces significantly more error. We have measured client systems running other software with data extraction success rates of only 50% to 60%, and on top of that you still need to measure the error in the data that was extracted.
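The two error sources compound: a field only counts as automated if it is both located and read correctly. Here is a rough sketch using the illustrative rates above; the 99% read accuracy applied to the no-template case is an assumption for illustration only:

```python
def end_to_end_rate(location_rate, read_accuracy):
    """Fraction of target fields that are both found and read correctly."""
    return location_rate * read_accuracy

# Structured form with template coordinates (figures from the text above).
print(end_to_end_rate(0.95, 0.99))  # ~0.94: about 94% of fields fully correct

# Highly variable invoices, no template: location success measured at 50-60%.
# The 0.99 read accuracy here is assumed purely for illustration.
print(end_to_end_rate(0.55, 0.99))  # ~0.54: barely half, even with good OCR
```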
All of this is why grading whole documents on whether every field is correct is of little use. In reality, the percentage of documents that come out 100% accurate will be very low, most likely in the single digits, given all of the factors that cause errors. Rather than focus on what percentage of documents can be 100% automated, it is more useful to look at how much of the overall data entry can be 100% automated.
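To see why single digits is plausible: if each field is correct with probability p and errors are independent (a simplifying assumption; real errors tend to cluster), a document with n fields is fully correct with probability p to the power n. A sketch with an assumed 95% per-field rate:

```python
per_field_accuracy = 0.95  # assumed per-field rate, for illustration only

for n_fields in (10, 25, 50):
    fully_correct = per_field_accuracy ** n_fields
    print(f"{n_fields:>2} fields/document -> {fully_correct:.1%} of documents 100% correct")

# 10 fields/document -> 59.9% of documents 100% correct
# 25 fields/document -> 27.7% of documents 100% correct
# 50 fields/document ->  7.7% of documents 100% correct
```

A complex invoice with line items easily carries dozens of fields, which is how the fully correct rate lands in the single digits.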
Oh, and our answer to the question:
What percentage of invoices does your software get 100% correct? About 5%, measured at the field level. We reduce about 85% of total data entry requirements at above 95% accuracy.
Here is a recent Data Interpretation eBook, and another focused on Document Processing Automation, that you might find interesting. To find out more about our invoice data extraction, check out this video.