Many capture systems are marketed as able to learn document formats and layouts automatically, or to be trained on them. In either case, the objective is twofold:
- Minimize the effort required to define document rules specific to each document variant and
- Improve overall recognition rates and lower error rates.
The Problem with Semi-structured Documents
The concepts of training and learning are gaining popularity as document capture solution vendors (and their users) increasingly turn their attention towards automation of data extraction on semi-structured documents, where the number of format variants can be quite large.
In this case, relying on structured templates alone is not viewed as an acceptable solution. Why? Because someone, most likely the end user or a services person, must evaluate each possible form variant and create a specific template to identify and extract the needed data. This effort can be quite extensive.
Functionality introduced in the mid-2000s and refined in the last few years, involving pattern spotting and keyword searching, enables more flexible document templates. These still fall short with document types such as invoices or explanation of benefits (EOB) forms, where the variation of layouts can be quite dramatic, even if the variation amounts to differences of only centimeters or less. The result is recognition rates that don't get much above 50% for a pure out-of-the-box solution where no template is defined by the user.
In a typical implementation, the system is installed and configured for one or more document types. This can be done by creating vocabularies of keywords and data patterns that are used to locate the data needed for extraction to other systems. It can also include specifying fixed coordinates for data that doesn't often change location.
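To make the idea of keyword vocabularies and data patterns concrete, here is a minimal sketch. The field names, keyword lists, regular expressions, and the 40-character search window are all illustrative assumptions, not taken from any particular capture product.

```python
import re

# Hypothetical vocabulary: keywords that tend to label each field on an invoice.
KEYWORDS = {
    "invoice_number": ["invoice no", "invoice number", "invoice #"],
    "total": ["total due", "amount due", "total"],
}

# Hypothetical data patterns describing what each field's value looks like.
PATTERNS = {
    "invoice_number": re.compile(r"[A-Z]{0,3}-?\d{4,10}"),
    "total": re.compile(r"\$?\d{1,3}(?:,\d{3})*\.\d{2}"),
}


def find_field(text, field_name):
    """Return the first pattern match found shortly after one of the field's keywords."""
    lowered = text.lower()
    for keyword in KEYWORDS[field_name]:
        start = lowered.find(keyword)
        if start == -1:
            continue
        # Look only in a short window after the keyword (window size is an assumption).
        end = start + len(keyword)
        match = PATTERNS[field_name].search(text[end:end + 40])
        if match:
            return match.group(0)
    return None


sample = "ACME Corp\nInvoice No: INV-004217\nTotal Due: $1,250.00\n"
print(find_field(sample, "invoice_number"))
print(find_field(sample, "total"))
```

A search like this tolerates layout changes because it follows the label rather than a position, which is also why it is less reliable than a fixed coordinate: any unexpected label wording or value format produces a miss.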
It's understandable that searching for data is a lot less reliable than specifying its exact location. So even with careful analysis of document formats, a system can rarely achieve page-level recognition rates above 50%.
If recognition errors are taken into consideration, the results can be even worse, with some systems producing error rates of 30% or higher. In these cases, the error rate is the most important aspect to examine. Regardless of the acceptance or read rate for semi-structured fields (e.g., whether the system located the correct position of a field), the actual recognition error will determine success, since an error will either be passed on to another system unnoticed or require a trained operator to assess and correct it.
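The distinction between read rate and error rate can be illustrated with a small calculation. The definitions used here are assumptions for the sake of the example: read rate is the share of fields the system accepted, and error rate is the share of accepted fields whose value was wrong. Vendors may define these measures differently.

```python
def capture_metrics(results):
    """Compute (read_rate, error_rate) from per-field (accepted, correct) flags.

    Assumed definitions:
      read_rate  = accepted fields / all fields
      error_rate = accepted-but-wrong fields / accepted fields
    """
    total = len(results)
    accepted = [r for r in results if r[0]]
    wrong = [r for r in accepted if not r[1]]
    read_rate = len(accepted) / total if total else 0.0
    error_rate = len(wrong) / len(accepted) if accepted else 0.0
    return read_rate, error_rate


# Made-up sample: 10 fields, 6 accepted (2 of them wrong), 4 rejected for manual keying.
results = [(True, True)] * 4 + [(True, False)] * 2 + [(False, False)] * 4
read_rate, error_rate = capture_metrics(results)
print(f"read rate: {read_rate:.0%}, error rate: {error_rate:.0%}")
```

In this made-up sample, the rejected fields are at least routed to an operator. The two accepted-but-wrong fields are the costly ones, because nothing flags them before they reach a downstream system.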
Training and Learning Systems
Enter the concepts of training and learning systems.
Training or learning essentially allows for the automation (or in most cases semi-automation) of discrete template building over time for a specific implementation.
How Training Works
By combining the power of searching for data with the development of more specific and structured document templates, recognition rates can be significantly improved while error rates are reduced to below 10%.
These systems are initially set up using pre-built rules for specific document formats, for example an invoice document type or class. This particular class will have pre-built rules, logic, and dictionaries that allow common data elements to be located and recognized without developing or using a more structured, template-based approach.
After an initial run of actual documents is completed to understand where the pre-built class falls short, the training begins. In many cases, training simply allows a user to identify data that was not correctly located or was misread and to provide the system with the actual location or data type. This new information is used to automate the creation of a more structured document template that applies the exact coordinates of the missing data and registers the document format in the system.
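As a rough sketch of this training step, the following models a template library that grows as an operator supplies locations for missed fields. The class names, the layout identifiers, and the (x, y, width, height) box convention are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Template:
    """A structured template for one document layout (hypothetical model)."""
    layout_id: str
    # Field name -> (x, y, width, height) of the region that holds the value.
    zones: dict = field(default_factory=dict)


class TemplateLibrary:
    """Collects templates as an operator corrects missed or misread fields."""

    def __init__(self):
        self.templates = {}

    def train(self, layout_id, field_name, box):
        """Record an operator-supplied location, registering the layout if it is new."""
        template = self.templates.setdefault(layout_id, Template(layout_id))
        template.zones[field_name] = box
        return template


library = TemplateLibrary()
library.train("vendor-a", "total", (400, 700, 120, 20))
library.train("vendor-a", "due_date", (400, 730, 120, 20))
library.train("vendor-b", "total", (60, 540, 100, 18))
print(len(library.templates))
print(library.templates["vendor-a"].zones)
```

The operator supplies only the location. Creating and registering the template is the automated part, which matches the semi-automated nature of training described here.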
The next time the system runs across the same document, it will combine searching for the data (using the routines that were initially successful) with the new template, which simply tells the system where to find the other data. This dual-path approach, which applies both dynamic field location and more structured templates, provides substantial improvements in recognition performance.
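The dual-path idea can be sketched as follows: for each field, use the trained template zone if one exists for the layout, and otherwise fall back to searching. Everything here is an assumption made for illustration, including the page representation, the `read_zone` stand-in for zonal recognition, and the search rules.

```python
import re

# Hypothetical trained templates: layout id -> {field name: bounding box}.
TEMPLATES = {"vendor-a": {"total": (400, 700, 120, 20)}}

# Hypothetical search rules used when no template zone covers a field.
SEARCH_RULES = {
    "invoice_number": re.compile(r"Invoice\s*(?:No|#)[:.]?\s*(\S+)", re.IGNORECASE),
    "total": re.compile(r"Total\s*Due[:.]?\s*(\S+)", re.IGNORECASE),
}


def read_zone(page, box):
    """Stand-in for recognition on a fixed region: look up the text stored for that box."""
    return page["zones"].get(box)


def extract(page, field_names):
    """Return {field: (path, value)}, where path is 'template' or 'search'."""
    template = TEMPLATES.get(page["layout_id"], {})
    extracted = {}
    for name in field_names:
        if name in template:
            # Trained path: the template says exactly where the value is.
            extracted[name] = ("template", read_zone(page, template[name]))
        else:
            # Dynamic path: search the page text for the field.
            match = SEARCH_RULES[name].search(page["text"])
            extracted[name] = ("search", match.group(1) if match else None)
    return extracted


page = {
    "layout_id": "vendor-a",
    "text": "Invoice No: 88213\nBalance 1,250.00",
    "zones": {(400, 700, 120, 20): "1,250.00"},
}
result = extract(page, ["invoice_number", "total"])
print(result)
```

On this sample page the total is labeled "Balance", so the search rule alone would miss it. The trained zone recovers it, while the invoice number still comes from the search path that already worked.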
Over time, and with enough problem documents encountered, the system can build a collection of templates that collectively cover a large percentage of the likely population of invoice formats the company receives.
In essence, the system is trained as templates are added to its library, which reduces the scope of the data that requires searching routines.
The benefit is that overall recognition rates improve significantly without the need to hand-build potentially hundreds of structured document templates, one for each invoice variation.
The drawback is that these systems usually require a good number of invoices that represent the real variation of formats they will encounter in production. A business can rarely supply a realistic and exhaustive sample, so a test system must be run for days, if not weeks, to gather one.
Additionally, these systems are trained with the aid of humans. While the templates themselves can be created and added automatically, the location of the data must be supplied by a person. All of this takes time.
Lastly, if the requirements change after comprehensive training has been conducted, the entire set must be retrained, which can be a significant maintenance cost. For example, for an EOB document, a customer might wish to add the provider number to the set of extracted data. Adding this field requires the entire system to undergo training once again.
Advanced Machine Learning
The primary difference between learning systems and training is that a learning system not only automates template creation but also does not need the aid of humans to locate data that the searching functions miss.
There are no true machine learning systems in use for document classification or document capture solutions, but some vendors do attempt to incorporate elements of automated machine adjustment in order to reduce human intervention for exceptions as much as they can. Parascript does use learning algorithms in its core recognition engines and continues to research advancements in this area to reduce overall costs, including initial set-up and ongoing maintenance.