OCR software often falls short of business expectations, some of which stem from misconceptions. Although advanced document capture software has been around for almost two decades, misconceptions about Optical Character Recognition (OCR) software and advanced capture remain common, which is understandable because the terms are sometimes used interchangeably. OCR software converts images of documents into text, while advanced capture solutions are designed to reliably classify documents and to locate and extract data.
Recently, I worked with a scanned PDF document that had been processed by OCR software to provide access to the text. Here is a paragraph of the original document and the converted text.
The Image
Converted Text Results from OCR
OCR software is used to convert images of documents into text, and results at the character level can be close to 100 percent accurate. That level of accuracy sounds great, but what does it really mean?
As the example above shows, the results at the character level (the “C” in OCR) look 100 percent accurate. At the word level, however, the error rate is approximately 14.5 percent, calculated by dividing the number of word errors by the total number of words.
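To make the arithmetic concrete, here is a minimal sketch of a word-level error calculation in Python. The sample strings are hypothetical and are not the paragraph from the scanned document, and the function names are illustrative. Word errors are counted with a word-level edit distance, which is one common way to do it; the count behind the 14.5 percent figure may have been made differently.

```python
def word_edit_distance(ref_words, hyp_words):
    """Minimum number of word substitutions, deletions, and insertions
    needed to turn the reference words into the OCR words."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # reference word missing from OCR
                            curr[j - 1] + 1,      # extra word in OCR
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(reference, ocr_output):
    """Word errors divided by the total number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference text must contain at least one word")
    return word_edit_distance(ref_words, ocr_output.split()) / len(ref_words)


def same_characters(reference, ocr_output):
    """True when both texts contain the same characters once whitespace is ignored."""
    return "".join(reference.split()) == "".join(ocr_output.split())


# Hypothetical example: every character is right, but one space is missing.
reference = "upon a change of control of the company"
ocr_output = "upon a change ofcontrol of the company"

print(same_characters(reference, ocr_output))   # True
print(word_error_rate(reference, ocr_output))   # 0.25
```

In this sketch a single missing space produces two word errors out of eight words, because "of" and "control" are both lost and replaced by one unrecognized token. That is how text that looks perfect character by character can still carry a noticeable word-level error rate.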
For some practical applications of OCR software, such as using the text to aid search, these errors might not be a problem. After all, a search for “change of control” might still find an instance where the words were not correctly separated and the text reads “ofcontrol”.
Advanced Capture: Expectations & Results
When it comes to advanced capture, however, the correct separation of words is essential to accurately locate and extract specific data. Advanced capture solutions use the text produced by OCR as one input, but they add many more algorithms and other techniques to reliably locate data.
Advanced capture can employ techniques based upon n-grams that more reliably break up words based upon known sequences. Additional methods of validation can be applied during the recognition process itself to improve overall output at the word level. However, getting words is not the end game. The goal is to extract specific data, which typically exists as a series of words. In this regard, advanced capture solutions use a variety of methods to locate required data that extend well beyond dictionary-based approaches, including the relative proximity of one data element to another and expected value patterns, to name just two.
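As a rough illustration of breaking up run-together words using known sequences, the sketch below scores candidate splits with word frequencies, which is the simplest (1-gram) form of the idea. The word counts, the penalty for unknown strings, and the function names are all assumptions made up for this example; production capture software would rely on much larger statistics and is not necessarily built this way.

```python
import math

# Hypothetical word counts; a real system would estimate these from a large corpus.
WORD_COUNTS = {"change": 50, "of": 500, "control": 40, "the": 300, "company": 30}
TOTAL = sum(WORD_COUNTS.values())
MAX_WORD_LEN = 20


def word_log_prob(word):
    """Log probability of a word; unknown strings are penalized by length."""
    count = WORD_COUNTS.get(word)
    if count is None:
        return math.log(1.0 / (TOTAL * 10 ** len(word)))
    return math.log(count / TOTAL)


def segment(text):
    """Split a run-together string into its most probable sequence of words."""
    # best[i] holds (score, words) for the best split of text[:i]
    best = [(0.0, [])]
    for end in range(1, len(text) + 1):
        candidates = []
        for start in range(max(0, end - MAX_WORD_LEN), end):
            score, words = best[start]
            piece = text[start:end]
            candidates.append((score + word_log_prob(piece), words + [piece]))
        best.append(max(candidates, key=lambda c: c[0]))
    return best[-1][1]


def repair(line):
    """Re-segment only the tokens that are not already known words."""
    fixed = []
    for token in line.split():
        fixed.extend([token] if token in WORD_COUNTS else segment(token))
    return " ".join(fixed)


print(repair("change ofcontrol"))  # change of control
```

Using counts of word pairs or longer sequences instead of single words would let the same dynamic-programming search take context into account.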
Structured and Semi-structured Documents
While the above text shows unstructured data, the same challenges of operating on OCR data remain for structured forms and semi-structured documents such as invoices or remittances.
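For a semi-structured document such as an invoice, locating data by the proximity of one element to another and by expected value patterns might look something like the sketch below. The sample text, field names, label patterns, and the 30-character window are all hypothetical choices for illustration, not the behavior of any particular capture product.

```python
import re

# Hypothetical OCR output from an invoice.
SAMPLE_TEXT = (
    "ACME Supplies\n"
    "Invoice No: INV-004217  Date 03/02/2021\n"
    "Subtotal 1,150.00\n"
    "Total Due: 1,234.56\n"
)

# Hypothetical field definitions: (label to anchor on, pattern the value should match).
FIELDS = {
    "invoice_number": (r"\binvoice\s*(?:no|number|#)\.?", r"\b[A-Z]{2,4}-\d{4,8}\b"),
    "total": (r"\btotal\s*(?:due|amount)?", r"\d{1,3}(?:,\d{3})*\.\d{2}"),
}

# How many characters after a label we are willing to look for its value.
MAX_DISTANCE = 30


def extract_field(text, label_pattern, value_pattern):
    """Return the first value that matches the expected pattern close to its label."""
    for label in re.finditer(label_pattern, text, flags=re.IGNORECASE):
        window = text[label.end(): label.end() + MAX_DISTANCE]
        value = re.search(value_pattern, window)
        if value:
            return value.group(0)
    return None


def extract_all(text):
    """Apply every field definition to the text."""
    return {name: extract_field(text, label, value)
            for name, (label, value) in FIELDS.items()}


print(extract_all(SAMPLE_TEXT))
# {'invoice_number': 'INV-004217', 'total': '1,234.56'}
```

Note that the amount on the "Subtotal" line also matches the value pattern; it is the nearby label that identifies the right amount. This also shows why word separation matters: if OCR had produced "TotalDue:1,234.56" or split the amount in two, a rule like this would need the text repaired before it could find the value reliably.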
The reality is that OCR software is not the same as advanced capture. Advanced capture needs OCR as a requisite step, but it applies much more in order to deliver what you really need: efficient, reliable access to your data.