In a previous article, I discussed the approaches that organizations can take to locate and extract data from various types of documents: forms, semi-structured documents such as remittances, and unstructured documents such as health provider contracts. This article dives deeper into the most derided of those approaches: the template.
Let’s start with the statement: Templates are the most precise way to instruct IDP software to locate data.
This is because, rather than using “lossy” techniques such as looking for field labels, applying regular expressions, or relying on abstract machine learning models, you provide very specific information on where to look. If you could create a template for every document variant, it would be the most precise approach. This goes for forms and for complex semi-structured documents such as invoices and health remittances.
Building Flexible Solutions
And yet, building (and maintaining) templates for all sorts of document variants quickly becomes very problematic, if not downright impossible. It is rare for the documents that we work with day in and day out to be standardized. Even forms, when scanned or shared with a smartphone, have small variances that fool templates. And humans cannot always detect and account for all of these variants. The upshot is that, even though templates are the most precise, they are also the most brittle, and they translate to man-years of work to create and maintain.
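To make the precision and the brittleness concrete, here is a minimal sketch of zone-based template extraction. It assumes OCR has already produced words with page coordinates; the `Word` and `Zone` types, the `EOP_TEMPLATE` layout, and all coordinates are hypothetical and do not reflect any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    x: float  # left edge of the word, in page units
    y: float  # top edge of the word, in page units


@dataclass(frozen=True)
class Zone:
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, word: Word) -> bool:
        return self.left <= word.x <= self.right and self.top <= word.y <= self.bottom


def extract_with_template(words, template):
    """Read each field from the words that fall inside that field's zone."""
    result = {}
    for field_name, zone in template.items():
        inside = sorted((w for w in words if zone.contains(w)), key=lambda w: (w.y, w.x))
        result[field_name] = " ".join(w.text for w in inside)
    return result


# Hypothetical template for one document variant (coordinates are invented).
EOP_TEMPLATE = {
    "patient_id": Zone(left=100, top=50, right=200, bottom=70),
    "paid_amount": Zone(left=400, top=50, right=500, bottom=70),
}

page = [
    Word("Patient", 20, 55),
    Word("ID:", 60, 55),
    Word("A12345", 110, 55),
    Word("$87.50", 410, 55),
]

# The same page, scanned so that everything sits 40 units lower.
shifted_page = [Word(w.text, w.x, w.y + 40) for w in page]

print(extract_with_template(page, EOP_TEMPLATE))
print(extract_with_template(shifted_page, EOP_TEMPLATE))
```

On the original page the zones pick up exactly the two values. On the shifted page every zone comes back empty, which is the kind of small variance that fools a template.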
Avoiding Errors in Dynamic Environments
This is where other techniques often come into play and, as I stated earlier, they are lossy, introducing a lot of room for error. Take, for instance, using field labels or keywords. You might be able to review a lot of samples and note that there are a finite number of ways to label the “Patient ID” on a health remittance, otherwise known as an “Explanation of Payment” or EOP. But is it practical, or even possible, to note the universe of labels that could be used? The answer is “not remotely.” When you couple this challenge with the other fields on an EOP, the result is a lot of errors locating data.
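A rough illustration of the label problem, assuming the page text is already available as a string: the label list below is hypothetical and stands in for whatever a sample review turned up, and the third input uses a label that the review never saw.

```python
import re

# Hypothetical labels collected from reviewing sample EOPs.
KNOWN_LABELS = ["Patient ID", "Patient Number", "Member ID"]


def build_pattern(labels):
    """Match any known label, an optional ':' or '#', then an ID-like token."""
    alternatives = "|".join(re.escape(label) for label in labels)
    return re.compile(rf"(?:{alternatives})\s*[:#]?\s*([A-Z0-9-]+)", re.IGNORECASE)


PATTERN = build_pattern(KNOWN_LABELS)


def find_patient_id(text):
    """Return the token after the first known label, or None if no label matches."""
    match = PATTERN.search(text)
    return match.group(1) if match else None


print(find_patient_id("Patient ID: A12345"))     # label is in the list
print(find_patient_id("Member ID # 99-881"))     # label is in the list
print(find_patient_id("Pt. Acct No: A12345"))    # label was never seen, so None
```

Every new label means another entry in the list, and the list is never finished.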
Other techniques are just as lossy and, when measured at the page level, deliver sub-par performance. When measured by the number of data fields successfully extracted over a large volume of EOPs, performance is reasonable. You gain significant efficiency in configuring the software and give some of it back in errors; most organizations make this trade-off for overall net efficiency gains.
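The gap between the two ways of measuring can be sketched as follows. The results below are invented purely for illustration, with each inner list holding one page's per-field outcomes, and a page counts as correct only when every field on it was extracted correctly.

```python
def field_level_accuracy(pages):
    """Share of all fields, across all pages, that were extracted correctly."""
    total_fields = sum(len(page) for page in pages)
    correct_fields = sum(sum(page) for page in pages)
    return correct_fields / total_fields


def page_level_accuracy(pages):
    """Share of pages on which every single field was extracted correctly."""
    return sum(all(page) for page in pages) / len(pages)


# Invented example: four pages, five fields each, True means "extracted correctly".
results = [
    [True, True, True, True, True],
    [True, True, True, True, False],
    [True, True, True, False, True],
    [True, True, True, True, True],
]

print(f"field level: {field_level_accuracy(results):.0%}")
print(f"page level:  {page_level_accuracy(results):.0%}")
```

In this made-up sample, 18 of 20 fields are right (90%), yet only 2 of 4 pages are fully right (50%). A single missed field is enough to fail a whole page.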
Leveraging Machine Learning
What if it were possible to improve performance not by manually creating and maintaining hundreds or thousands of templates, but by using systems that can configure themselves? We know from a lot of practical experience that software can identify small details that we humans often miss, and it can do this work at scale on millions of documents.
Using software to configure itself means that we are not only freed from most of the chore of configuring software, but there is a chance that we can reap the best precision of templates at scale without the traditional drawbacks.
Even better, software can use a “hybrid approach” to creating rules: “mini-templates” on a part of a document that proves to be fairly standardized and static, and more “freeform” logic to locate and extract other, highly variable data.
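As a rough sketch of what such hybrid rules could look like (not a description of how Parascript implements them), the example below reads one field from a fixed zone and finds another by label. The rule format, the field names, the coordinates, and the sample words are all hypothetical, and the rules are written by hand here rather than learned.

```python
import re


def words_in_zone(words, zone):
    """Mini-template: join the words whose position falls inside a fixed zone."""
    left, top, right, bottom = zone
    hits = [(y, x, text) for (text, x, y) in words
            if left <= x <= right and top <= y <= bottom]
    return " ".join(text for _, _, text in sorted(hits))


def value_after_label(words, labels):
    """Freeform: rebuild the text in reading order and take the token after a label."""
    ordered = sorted(words, key=lambda w: (w[2], w[1]))
    text = " ".join(w[0] for w in ordered)
    alternatives = "|".join(re.escape(label) for label in labels)
    match = re.search(rf"(?:{alternatives})\s*[:#]?\s*(\S+)", text, re.IGNORECASE)
    return match.group(1) if match else None


# Hypothetical rules: a static header zone plus a label search for a variable field.
RULES = {
    "payer_name": {"zone": (0, 0, 300, 40)},
    "patient_id": {"labels": ["Patient ID", "Member ID"]},
}


def extract_hybrid(words, rules):
    """Apply the zone rule or the label rule, depending on how each field is defined."""
    result = {}
    for field_name, rule in rules.items():
        if "zone" in rule:
            result[field_name] = words_in_zone(words, rule["zone"]) or None
        else:
            result[field_name] = value_after_label(words, rule["labels"])
    return result


# Words as (text, x, y) tuples; the payer name and ID are made up.
sample_words = [
    ("Acme", 10, 10), ("Health", 60, 10),
    ("Member", 20, 200), ("ID:", 80, 200), ("A12345", 120, 200),
]

print(extract_hybrid(sample_words, RULES))
```

The header keeps the precision of a template where the layout is stable, while the patient ID can move anywhere on the page and still be found as long as its label is recognized.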
I like to say that we at Parascript are working to let organizations all over the world have their cake and eat it, too. The ability to get the precision of templates without incurring the cost or brittleness of them is one lofty goal of ours, and we’re achieving it. Ask us to show you on your data.