Often in conversations with clients and prospective clients, we hear phrases such as “we want to get 90 percent accuracy from our system” (with the actual percentage varying). What does that really mean?
A lot of confusion has been created by the accuracy claims of OCR vendors, mostly those that provide OCR SDKs. These claims are often carefully stated, something like “98 percent page-level accuracy.” That performance is typically measured at the character level: if a page has 1,000 characters, 980 of them will be correct. Unfortunately, this level of accuracy cannot simply be carried over to processes that go far beyond page-level conversion of documents to text. When we add in complexities such as data element accuracy, often referred to as “fields,” we have to measure in a very different manner. We can no longer assume that 98 percent accuracy will be achieved, because those 20 erroneous characters could be spread out over 20 different fields. Measured by the field, if 20 out of 100 fields on a page contain errors, accuracy drops to 80 percent.
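To make that arithmetic concrete, here is a minimal sketch of the worst-case calculation, assuming a hypothetical 1,000-character page divided into 100 fields and that every character error lands in a distinct field. The numbers are the illustrative ones from the example above, not measurements of any particular OCR engine.

```python
# Worst-case arithmetic: every character error lands in a distinct field.
# Page size, field count and character accuracy are illustrative values.

def worst_case_field_accuracy(char_accuracy: float,
                              chars_per_page: int,
                              fields_per_page: int) -> float:
    """Field-level accuracy if each erroneous character hits a different field."""
    char_errors = round((1.0 - char_accuracy) * chars_per_page)
    fields_with_errors = min(char_errors, fields_per_page)
    return (fields_per_page - fields_with_errors) / fields_per_page

# 98% character accuracy on a 1,000-character page with 100 fields:
# 20 bad characters -> up to 20 bad fields -> 80% field-level accuracy.
print(worst_case_field_accuracy(0.98, 1000, 100))  # 0.8
```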
Achieving Field-level Accuracy
What is typically meant by the requirement that “we want to get 90 percent accuracy”? Translated into terms of data element accuracy, it usually means “we want 90 percent of our fields to be 100 percent accurate.” Applying the same full-page OCR measurement to the scenario above, ensuring that 90 of those 100 fields are 100 percent accurate requires the page-level OCR to achieve 99 percent character-level accuracy: at 99 percent, 10 of the 1,000 characters are wrong, and in the worst case each of those errors lands in a different field, leaving only 90 fields untouched. Is this possible? Yes. And yet, it is not the norm. Additional mechanisms beyond OCR must be applied, such as providing what is called “context” during and after the recognition process in order to correct for OCR errors.
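The same worst-case assumption can be inverted to estimate the character-level accuracy needed to hit a given field-level target. Again, the page size and field count are the illustrative figures from the text, not properties of any specific product.

```python
# Inverting the worst-case assumption: how accurate must the OCR be at the
# character level so that a given share of fields is 100% correct?
# Values (1,000 characters, 100 fields) are illustrative only.

def required_char_accuracy(target_field_accuracy: float,
                           chars_per_page: int,
                           fields_per_page: int) -> float:
    """Character accuracy needed if each character error ruins a distinct field."""
    allowed_bad_fields = (1.0 - target_field_accuracy) * fields_per_page
    # At worst, one character error per bad field, so we can afford at most
    # that many character errors on the whole page.
    return 1.0 - allowed_bad_fields / chars_per_page

# To keep 90 of 100 fields perfect, no more than 10 of the 1,000 characters
# may be wrong, i.e. 99% character-level accuracy.
print(required_char_accuracy(0.90, 1000, 100))  # 0.99
```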
As humans, we also typically experience a 2 to 5 percent error rate when reading and entering data, and that rate can be higher depending on factors such as mental and physical condition. So expecting a high percentage of fields to be 100 percent accurate takes a significant amount of work. When the data is highly variable or the images are of poor quality, such ambitious targets may never be reached until context is applied.
Moving Beyond OCR
While Parascript develops its own OCR capabilities, we also use third-party engines in order to produce the best-possible OCR output. If we can do away with OCR entirely by using electronic text, we will do so just to reduce errors. However, we find that we spend a significant amount of time going “beyond OCR” by leveraging more advanced technology that takes into consideration image quality, variance in document layouts and variance in the data within each field. The ultimate result is a higher level of accuracy than you can get through OCR alone.
If you found this article interesting, you may also find this one insightful: Are You Really Getting the OCR Accuracy You Expect?