When we discuss data extraction needs with businesses, Optical Character Recognition (OCR) often comes up. This useful tool turns image-based text into computer-readable text. For example, if you want to make a faxed document searchable, full-page OCR supports that. If you want to extract form data from specific locations on a page, OCR can do that too.
Beyond OCR: Advanced Data Extraction
For many, data extraction and OCR appear synonymous. Nothing could be further from the truth. OCR is a valuable tool, and extracting high-quality data from a document often begins with OCR. However, advanced data extraction applies a significant amount of technology after OCR converts a document to text. Locating and extracting data involves next-generation technologies that go well beyond OCR and employ sophisticated algorithms and business logic.
Deceptively Simple: A Business Check
Take, for instance, a business check, which may appear fairly standard. There is an address, a payee line, date, check number, numeric amount, alpha amount, and other standardized data. And yet the location of this data is highly variable: the same field can appear in completely different regions from one check to the next. To make it more difficult, the actual size of the check varies significantly, which affects the spacing between the check data. Can you apply OCR to a business check? Sure. Can you efficiently and reliably locate the required check data with high accuracy and extract it to a system, data type by data type? The answer is again yes, but not with OCR alone.
Context-based Data: High Accuracy
OCR provides the text and position of each character, numeral, or word. It reveals nothing about the type of data that it locates. OCR is simply a tool to turn image-based text into computer-based text. That’s it. The quality of OCR output is essential, and the quality of OCR “engines” varies. Without the additional technology to convert computer text into “context-based data”, OCR simply offers a bunch of numbers and letters.
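To make the gap between raw OCR output and context-based data concrete, here is a minimal Python sketch. It is an illustration only, not Parascript's method: the `Token` structure, the regular expressions, and the field names are all hypothetical, and a production system would combine many more cues (layout, nearby keywords, cross-field checks) than simple patterns.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Token:
    """One OCR result: recognized text plus its position (hypothetical shape)."""
    text: str
    x: float  # left edge, as a fraction of image width
    y: float  # top edge, as a fraction of image height


# Hypothetical patterns for a few check fields; real rules would be far richer.
AMOUNT = re.compile(r"^\$?\d{1,3}(,\d{3})*\.\d{2}$")
DATE = re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}$")
CHECK_NUMBER = re.compile(r"^\d{4,10}$")


def classify(token: Token) -> Optional[str]:
    """Guess the data type of a token from its content, not its position."""
    if AMOUNT.match(token.text):
        return "numeric_amount"
    if DATE.match(token.text):
        return "date"
    if CHECK_NUMBER.match(token.text):
        return "check_number"
    return None  # plain text with no recognized meaning


def extract_fields(tokens: List[Token]) -> Dict[str, str]:
    """Keep the first token found for each recognized data type."""
    fields: Dict[str, str] = {}
    for token in tokens:
        label = classify(token)
        if label is not None and label not in fields:
            fields[label] = token.text
    return fields


# The same fields are found even when they sit in different regions.
check_a = [Token("PAY", 0.05, 0.40), Token("03/14/2024", 0.80, 0.10),
           Token("001234", 0.90, 0.05), Token("$1,250.00", 0.85, 0.40)]
check_b = [Token("001234", 0.10, 0.90), Token("$1,250.00", 0.50, 0.20),
           Token("03/14/2024", 0.20, 0.15), Token("PAY", 0.05, 0.60)]
print(extract_fields(check_a))
print(extract_fields(check_b))
```

The OCR step supplies only the `text`, `x`, and `y` values. Everything that turns those values into a date, an amount, or a check number happens afterward, which is why the sketch returns the same fields for both layouts.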
At Parascript, we have been providing context-based data for decades, deriving meaning from documents with high reliability. For business checks, banks demand straight-through processing. To achieve that, they need extremely low error rates, often 1 percent or less.
Using advanced region-of-interest location, feature extraction, and other location and context algorithms, Parascript supplies check data extraction at rates ranging from the upper 80s to the lower 90s percent, with very low error rates. With data extraction, especially for transactions, the real work comes after getting the text: it’s all about turning letters and numbers into contextual data.
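One common way to reason about such figures is as a trade-off between how many items are accepted automatically and how many of those accepted items are wrong. The Python sketch below assumes that framing: the `rates` function, the confidence scores, and the sample results are hypothetical, and the definitions used here are an assumption rather than Parascript's stated methodology.

```python
from typing import List, Tuple


def rates(results: List[Tuple[float, bool]], threshold: float) -> Tuple[float, float]:
    """Return (read_rate, error_rate) for a given confidence threshold.

    results: (confidence, was_correct) pairs, one per extracted field.
    read_rate: share of all fields accepted without manual review.
    error_rate: share of accepted fields that were wrong.
    """
    if not results:
        return 0.0, 0.0
    accepted = [correct for confidence, correct in results if confidence >= threshold]
    read_rate = len(accepted) / len(results)
    error_rate = accepted.count(False) / len(accepted) if accepted else 0.0
    return read_rate, error_rate


# Made-up sample: ten extracted fields with confidence and correctness.
sample = [(0.99, True), (0.98, True), (0.97, True), (0.95, True), (0.93, True),
          (0.91, True), (0.90, True), (0.88, True), (0.70, False), (0.50, True)]

strict = rates(sample, threshold=0.85)   # accepts fewer fields, fewer errors
lenient = rates(sample, threshold=0.60)  # accepts more fields, more errors
print(strict, lenient)
```

Under these assumptions, raising the threshold lowers the error rate at the cost of sending more items to manual review, which is the balance straight-through processing has to strike.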