I’ve lost count of how many times I’ve been asked the question “how accurate is your recognition?”. My admittedly flippant (but genuine) response is “it depends”.
I’m not debating the need for the question. It is very important to understand the level of accuracy for any solution designed to automate identifying and extracting key content from documents. The ultimate goal is 100% accuracy, but realistically that number is hard to achieve. Again, it depends.
For many applications that only require transforming scanned text into searchable data, accuracy is undoubtedly the best number to understand; after all, most commercial OCR engines can achieve 98%–99% accuracy.
But something got lost along the way in understanding solution accuracy when it comes to more complex data extraction from forms and documents. While it is fairly obvious that accuracy is a very important part of the equation, the error rate of a solution can be just as important, depending on the level of automation your business requires and the importance of the information you are using.
By way of definition, the error rate is the percentage of data that a system believes to be correct, and accepts as such, when it is in fact wrong. That erroneous data then gets used by other systems and processes. Depending on the “velocity” and the importance of this data, errors can mean very bad things.
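To make the distinction concrete, here is a minimal sketch using hypothetical field-level extraction results. Accuracy is measured over everything the system read; the error rate is measured over only what the system accepted as correct:

```python
# Hypothetical data: each tuple is
# (value the system read, true value, did the system accept it?)
results = [
    ("$100.00", "$100.00",  True),   # correct and accepted
    ("$100.00", "$1000.00", True),   # wrong but accepted -> an error
    ("$55.20",  "$55.20",   True),   # correct and accepted
    ("$7.10",   "$71.00",   False),  # wrong but rejected -> sent to review, not an error
]

# Accuracy: share of all fields the system read correctly.
accuracy = sum(1 for read, truth, _ in results if read == truth) / len(results)

# Error rate: share of *accepted* fields that are actually wrong.
accepted = [(read, truth) for read, truth, ok in results if ok]
error_rate = sum(1 for read, truth in accepted if read != truth) / len(accepted)

print(f"accuracy:   {accuracy:.0%}")    # 50%
print(f"error rate: {error_rate:.0%}")  # 33% of accepted data is wrong
```

Note that the rejected field hurts accuracy but not the error rate: because it was flagged rather than accepted, a human sees it before any downstream system does.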
Let’s take, for instance, check processing. It is the goal of any bank to achieve a completely automated straight-through processing system with 100% accuracy. But we know that, with the variability of business check layouts and the inclusion of handwritten values, 100% accuracy is not practical at the start. But neither is a high error rate. A high error rate means that checks get misread and accounts are improperly updated. That check for $1,000 gets recorded as $100, and the account holder is effectively shortchanged. Bad stuff.
High error rates can also erode trust in the system. Take, for instance, a Web-based application for managing receipts. It is certainly possible to perform full-page OCR on each receipt and then populate relevant fields such as vendor, date, tax, and total. In practice, a tuned system can achieve 60% to 70% accuracy for the populated fields. But what about the other 30% to 40% of erroneous data? If you don’t have a means to identify errors, that data gets displayed to the user too. In this scenario, not only is the user forced to scrutinize every field, but the experience also undermines the user’s confidence in the data that is correct. Again, bad stuff.
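One common means to identify likely errors is to threshold on the recognizer’s own confidence scores and route low-confidence fields to human review. The sketch below (entirely hypothetical confidence numbers) shows the trade-off: raising the threshold lowers the error rate, but also lowers the read rate — the share of fields the system handles without a human:

```python
# Hypothetical fields: (confidence reported by the recognizer, was it actually correct?)
fields = [
    (0.99, True), (0.97, True), (0.95, True), (0.92, False),
    (0.90, True), (0.75, False), (0.60, False), (0.55, True),
]

def rates(fields, threshold):
    """Accept fields at or above the confidence threshold; the rest go to review."""
    accepted = [(c, ok) for c, ok in fields if c >= threshold]
    read_rate = len(accepted) / len(fields)
    errors = sum(1 for _, ok in accepted if not ok)
    error_rate = errors / len(accepted) if accepted else 0.0
    return read_rate, error_rate

for t in (0.0, 0.8, 0.95):
    rr, er = rates(fields, t)
    print(f"threshold {t:.2f}: read rate {rr:.0%}, error rate {er:.0%}")
```

With no threshold, every field is shown to the user, errors included; with an aggressive threshold, almost nothing wrong slips through, but more fields need manual keying. Picking the operating point is a business decision, which is why “how accurate is it?” has no single answer.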
With complex data, achieving upper-nineties accuracy is certainly a laudable goal, but in practice it isn’t achievable without the right formula in place. So what is that formula? Learn all about accuracy, read rates, and error rates in our ebook: