Structured and semi-structured document analysis with ICR engines is complex enough; add an unstructured requirement to a data extraction workflow and you have a major challenge. Before committing to an unstructured approach, an analyst needs to determine whether the project is a good candidate.
The following 7 document characteristics present clues as to why your document may not be suited for unstructured Optical Character Recognition (OCR) or Intelligent Character Recognition (ICR).
1: Field Label Color – Unstructured data analysis for OCR/ICR relies heavily on finding keywords associated with the field of interest. When color is introduced, the imaging engine has to work harder to detect the keyword(s). If they become obscured or drop out completely, locating the field becomes impossible. Color, even white lettering on a black background, obscures the word(s) and can lead to either unlocated keywords or false positives.
2: Too many keyword instances – if a keyword is repeated on the page, determining which data to extract becomes sharply more difficult as the keyword count goes up. Even with zone-based location reference points, if the specific keyword is not consistently located in its zone, the recognition engine may return too many candidate matches, increasing false positives and making extraction nearly impossible.
3: Inconsistent distance from keyword – even with advances in determining where a data field sits relative to its keyword label, people fill out forms inconsistently and machine print can be offset. When the distance between the keyword and the data field varies too much, extraction can be nearly impossible.
4: Form density – density refers to how much information is on a page. If a form is printed in 6- to 8-point fonts, has many paragraphs of instructions, includes similar sections, or repeats the same labels, keyword location and extraction become more complex. Often the printed information is too similar or too small, or the hand print is too large for its expected area, obscuring its meaning and/or other form areas. Dense forms should often be broken into two or three pages at design time.
5: Poor scan quality – while a universal issue for any capture solution, it is far more pronounced when an unstructured approach is applied. Poor quality leads to false positives and makes keyword location and data extraction nearly impossible. If the scan resolution is too low, letters, words and shapes are obscured by pixelation, making it nearly impossible to recover any usable data.
6: Poorly printed forms – even in this day and age of high-quality printing, print quality remains inconsistent, especially for forms made publicly available on the web. When a form is printed from an electronically accessible file, the print quality depends on the individual’s print setup and on their ability to print correctly, at a reasonable size, and on a decent printer.
7: Drop out color – some forms are designed so that form design elements drop out during scanning, which helps OCR/ICR engines read the important data. Drop-out also happens unintentionally on highly stylized forms that use various shades of red, blue or green. Scanner optics often will not detect certain shades of red, and anything printed in them will not show up in the scanned image. If the keyword locator is one of these elements, then dynamic unstructured data location is nearly impossible.
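Items 2 and 3 above can be checked on a handful of sample pages before a project is accepted. The sketch below is only an illustration: it assumes a hypothetical OCR output of (text, x, y) word tuples and a list of keyword-to-field distances measured by hand, and the thresholds are made-up values rather than settings from any particular engine.

```python
from statistics import pstdev

def keyword_hits(words, keyword):
    """Return every OCR word whose text matches the keyword (case-insensitive)."""
    return [w for w in words if w[0].lower() == keyword.lower()]

def screen(words, keyword, offsets, max_hits=1, max_spread=15.0):
    """Flag a form as risky when the keyword repeats or offsets vary too much.

    words   -- hypothetical OCR output as (text, x, y) tuples, in pixels
    offsets -- keyword-to-field distances (pixels) measured on sample forms
    max_hits and max_spread are illustrative thresholds, not engine defaults.
    """
    hits = len(keyword_hits(words, keyword))
    spread = pstdev(offsets)  # population standard deviation of the distances
    return {
        "keyword_hits": hits,
        "offset_spread": round(spread, 1),
        "ambiguous_keyword": hits > max_hits,
        "inconsistent_offset": spread > max_spread,
    }

# Made-up sample data: "Name" appears three times on the page.
page = [("Name", 40, 100), ("Date", 40, 160), ("Name", 40, 620), ("Name", 300, 900)]
sample_offsets = [118, 122, 190, 75, 240]  # distances from five sample forms

result = screen(page, "name", sample_offsets)
print(result)
```

A form that trips both flags is not necessarily unusable, but it suggests that zone restrictions or a more specific keyword phrase would be needed before keyword-based location could be trusted.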
While these issues are not impossible to overcome with the right tools, they greatly affect an OCR/ICR engine’s ability to locate information and extract the important data. Analysts who produce workflows or advise business teams responsible for data extraction projects should keep these seven issues in mind. Knowing the methods, tools and techniques that overcome these concerns will lead to the best approach for data extraction.
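Two of the image-related issues, small type scanned at low resolution and drop-out color, can be roughly estimated with simple arithmetic. The sketch below rests on simplifying assumptions: a font's nominal height is its point size at 1/72 inch per point, and a red drop-out scan is modeled as keeping only ink whose red component is dark. Real scanners and engines behave in more complicated ways, and the function names and the threshold of 128 are illustrative.

```python
def glyph_height_px(point_size, dpi):
    """Nominal font height in pixels, using 1 point = 1/72 inch."""
    return point_size / 72 * dpi

def luminance(rgb):
    """Approximate perceived brightness (ITU-R BT.601 weights), 0 to 255."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b

def survives_red_dropout(rgb, threshold=128):
    """True if the ink stays dark when only the red channel is sensed.

    Simplified model of a red drop-out scan: ink is kept only if its
    red component is below the (assumed) threshold.
    """
    return rgb[0] < threshold

# Nominal height of small type at common scan resolutions.
for pt in (6, 8):
    for dpi in (150, 200, 300):
        print(f"{pt} pt at {dpi} dpi: {glyph_height_px(pt, dpi):.1f} px")

# Made-up ink colors for a keyword label.
inks = {"black": (20, 20, 20), "red": (200, 30, 30), "blue": (30, 30, 200)}
for name, rgb in inks.items():
    dark_to_eye = luminance(rgb) < 128
    print(name, "dark to the eye:", dark_to_eye,
          "| kept after red drop-out:", survives_red_dropout(rgb))
```

In this model the red label looks dark to a person but disappears from the scan, which is the situation item 7 warns about when the keyword itself is printed in a drop-out shade.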
Have a complex data extraction requirement? Visit the Parascript FormXtra(R) microsite and learn how to apply images without a predefined template, recognize content in tables, and redact sensitive information on-the-fly.