Within the document capture arena, which includes both the classification of documents and the recognition of data within them, there is a lot of talk about automated or machine “learning” and “training.” As a result, there are a lot of questions about what exactly is meant by learning. Let’s discuss several relatively common ways that “learning” is accomplished and demystify it.
Auto-classification, Learning and Training
A lot of vendors make big claims regarding capabilities such as “auto-classification.” When vendors talk about the auto-classification of documents, the most common approach is either a fairly rigid visual analysis that compares incoming documents at the pixel level or a keyword-based approach. While these methods can provide good performance, they take a significant amount of time and/or have difficulty handling variations in documents or in the data contained within them.
As for learning, many capture vendors that rely on these mechanisms use the word “learning” or “training” to describe adding new document classes within a production workflow. For example, a given set of documents has rules created and associated with it that define how incoming documents get assigned to a particular class. If a new document is input, or if a variant of an existing document is not identified, the workflow routes the document to a person who classifies it manually. The manually classified document is then added to the rules as a specific document type so that the next time the same document is encountered, it is properly assigned.
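The workflow described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the names RuleBasedClassifier, process and ask_operator are hypothetical, and the "rules" are assumed to be plain keyword sets matched against whitespace-separated words.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class RuleBasedClassifier:
    """Hypothetical keyword classifier: one rule per known document variant."""

    # Each rule pairs a document class with keywords that must all be present.
    rules: list[tuple[str, frozenset[str]]] = field(default_factory=list)

    def classify(self, text: str) -> str | None:
        words = set(text.lower().split())
        for doc_class, keywords in self.rules:
            if keywords <= words:  # every keyword appears in the document
                return doc_class
        return None  # no rule matched

    def add_rule(self, doc_class: str, keywords: set[str]) -> None:
        self.rules.append((doc_class, frozenset(k.lower() for k in keywords)))


def process(classifier, text, ask_operator):
    """Classify a document, falling back to a person when no rule matches.

    ask_operator is an assumed callback that returns (doc_class, keywords)
    chosen manually; the answer is stored as a new rule.
    """
    doc_class = classifier.classify(text)
    if doc_class is None:
        doc_class, keywords = ask_operator(text)
        classifier.add_rule(doc_class, keywords)
    return doc_class
```

Note that the only thing that grows here is the list of rules. Nothing is inferred from the documents themselves; a person supplies every new rule.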
Recognition and Machine Learning
On to the recognition process. When it comes to “learning” or “training,” the most common approach is to employ a workflow that allows a gradual accumulation of templates for newly identified document classes or variants. When an unknown document is encountered, or when a classified document cannot be recognized successfully, it goes through an exception workflow. The document is presented to a user who then manually locates each required field. Locating a field uses “rubber band OCR,” which allows the user to “draw” a box around the data or to use a simple point-and-click to locate the needed information using XY coordinates. The resulting templates are then either submitted directly or put through a manual review before being used in production, so that the next time the same document is encountered, fields are properly extracted.
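To make the template idea concrete, here is a small sketch of a template library keyed by document class, with each field stored as a box of XY coordinates. The Box and Word types, the extract_fields function and the coordinate values are all assumptions made for illustration; real capture products store richer templates.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """A field region drawn by the user, in page coordinates."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom


@dataclass(frozen=True)
class Word:
    """One OCR'd word and the centre point where it was found."""
    text: str
    x: float
    y: float


def extract_fields(library, doc_class, words):
    """Read each field by collecting the words that fall inside its box.

    library maps document class -> field name -> Box. Returns None when
    no template exists, which stands in for the exception workflow.
    """
    template = library.get(doc_class)
    if template is None:
        return None
    fields = {}
    for name, box in template.items():
        inside = [w for w in words if box.contains(w.x, w.y)]
        inside.sort(key=lambda w: (w.y, w.x))  # reading order: top to bottom, left to right
        fields[name] = " ".join(w.text for w in inside)
    return fields
```

With a template such as {"invoice_number": Box(400, 50, 600, 80)}, a word at (450, 60) is extracted, but the same word printed 100 units lower falls outside the box and the field comes back empty. That sensitivity to layout is why each variant needs its own template.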
Is This Machine Learning? Really?
Is this really learning or training? In a simplistic sense, a workflow-based method does let the system gradually accumulate information that helps future processing with both classification and data extraction. However, it does not employ true machine learning or automated learning. It is simply an ever-expanding library of structured templates. That’s sufficient in a situation where documents never change.
True Machine Learning
True machine learning automates both the analysis of the document and the creation of the rules. Updating those rules is also an automated process, and the system does not rely on templates, but instead abstracts the rules from the specific document types. In this way, the system is more flexible and easier to set up and maintain.
There are some examples of using machine learning techniques for the creation and maintenance of extraction rules. In this case, the workflow in which a user identifies a new document still remains, but the actual analysis of field location is accomplished either by the user or by the system, using pre-defined algorithms. In both cases, the resulting information is applied in a more abstract sense to allow for a larger population of document variants to be processed.
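One possible reading of "a more abstract sense" is to locate a field relative to something on the page rather than at fixed coordinates. The sketch below assumes the field value sits to the right of a printed label on the same line. The function name, the Word type and the line_tolerance parameter are hypothetical, and this is a hand-written heuristic rather than a learned model, so treat it only as an illustration of why abstracted location covers more variants than a fixed box.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One OCR'd word and the centre point where it was found."""
    text: str
    x: float
    y: float


def value_right_of_label(words, label, line_tolerance=5.0):
    """Return the nearest word to the right of `label` on the same line.

    Returns None if the label is absent or nothing follows it.
    """
    target = label.lower()
    anchors = [w for w in words if w.text.lower().rstrip(":") == target]
    if not anchors:
        return None
    anchor = anchors[0]
    candidates = [
        w for w in words
        if abs(w.y - anchor.y) <= line_tolerance and w.x > anchor.x
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda w: w.x).text
```

Because the lookup is anchored on the label, the same rule still finds the value when the whole line moves up or down the page, which a fixed XY template would miss.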
For classification, user input simply amounts to providing some sample documents as well as telling the system the document type to which the samples belong. The system takes it from there and automatically generates inferences without creating brittle rules. Terms such as “non-deterministic” or “statistically-based” classification point to what techniques are being applied “under the hood” in these software solutions. It is critical to understand and explore the technical jargon used.
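As an illustration of what “statistically-based” classification can mean, the following sketch trains a small multinomial naive Bayes classifier from labeled samples. Naive Bayes is chosen here only because it is compact; the article does not say which technique any product uses, and the class name and sample texts are invented for the example.

```python
from __future__ import annotations

import math
from collections import Counter, defaultdict


class NaiveBayesClassifier:
    """Multinomial naive Bayes over words, with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # class -> word -> count
        self.doc_counts = Counter()              # class -> number of samples
        self.vocabulary = set()

    def train(self, samples):
        """samples is an iterable of (text, document_class) pairs."""
        for text, doc_class in samples:
            words = text.lower().split()
            self.word_counts[doc_class].update(words)
            self.doc_counts[doc_class] += 1
            self.vocabulary.update(words)

    def scores(self, text):
        """Return the log-probability score of each known class for `text`."""
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        result = {}
        for doc_class, n_docs in self.doc_counts.items():
            counts = self.word_counts[doc_class]
            total_words = sum(counts.values())
            log_prob = math.log(n_docs / total_docs)  # class prior
            for word in words:
                # Smoothing keeps unseen words from zeroing out the score.
                log_prob += math.log((counts[word] + 1) / (total_words + vocab_size))
            result[doc_class] = log_prob
        return result

    def classify(self, text):
        scores = self.scores(text)
        if not scores:
            raise ValueError("classifier has not been trained")
        return max(scores, key=scores.get)
```

The user's only input is the list of labeled samples passed to train. No keyword rule is written by hand, and a document with unfamiliar wording still receives a best-guess class with a score, rather than falling straight through to an exception queue.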
Many companies are not interested in spending time building complex and burdensome rules, which makes sense because doing so increases both initial and ongoing project costs. If you fall into this camp, be sure to scrutinize any marketing messages that make claims about “learning systems.” Some indeed incorporate true machine learning, while others are dressed-up workflows. Discerning the difference between the two is a critical factor in determining which technology to choose.