How can IDP software push through the complexity barrier and, at the same time, expand to more complex documents? Machine learning, that’s how. Now, I won’t ever claim that machine learning is a magical silver bullet, but when it comes to the chore of crunching large amounts of data to identify patterns and optimal solutions, few tools are better, and the results can seem like magic. When applied to intelligent document processing, machine learning can take a large body of sample data and infer how to reliably automate tasks that an army of talented IDP specialists could only hope to duplicate.
The result is that staff no longer have to spend days, weeks, or months reviewing documents to identify the most common locations of needed data, or developing complex algorithms to reliably pull data from highly variable documents. Armed with what is called “ground truth data” (basically the answer key), machines can pick up on even the slightest variations in the data and produce endless schemes to address even the most complex document classification and data extraction problems.
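To make the “answer key” idea a little more concrete, here is a minimal sketch of how extracted data might be scored against ground truth. It is an illustration only, not how any particular IDP product works: the function name `field_accuracy`, the field names, and the exact-match comparison are all assumptions made for the example.

```python
from typing import Dict, List


def field_accuracy(
    predictions: List[Dict[str, str]],
    ground_truth: List[Dict[str, str]],
) -> Dict[str, float]:
    """Per-field share of documents where the extracted value
    exactly matches the ground truth (the "answer key").

    Hypothetical helper: real systems often use fuzzier matching.
    """
    totals: Dict[str, int] = {}
    correct: Dict[str, int] = {}
    for pred, truth in zip(predictions, ground_truth):
        for field, expected in truth.items():
            totals[field] = totals.get(field, 0) + 1
            # A missing field counts as a miss.
            if pred.get(field, "").strip() == expected.strip():
                correct[field] = correct.get(field, 0) + 1
    return {field: correct.get(field, 0) / totals[field] for field in totals}


# Made-up sample data: two documents, two fields each.
truth = [
    {"payer": "Acme Health", "amount": "120.00"},
    {"payer": "Beta Care", "amount": "75.50"},
]
extracted = [
    {"payer": "Acme Health", "amount": "120.00"},
    {"payer": "Beta Care", "amount": "57.50"},  # digits transposed
]
scores = field_accuracy(extracted, truth)
```

With this made-up data, `scores` shows the payer field matching on both documents and the amount field matching on one of two. The same kind of comparison, run over thousands of documents, is what lets software rather than staff judge which extraction scheme works best.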
Take, for instance, a health remittance. If companies think automating invoices is difficult, these documents are an order of magnitude harder. With complex, multi-dimensional tabular data that varies from insurer to insurer, on top of the other highly variable data, getting data reliably out of health remittances is a task as daunting as climbing Mount Everest without the aid of oxygen.
But with machine learning, the task moves from a complex configuration project to a sample data curation project. An organization with a sizable, representative training data set can reduce effort measured in man-years to a few weeks. And by applying additional grammatical information to the training set, machines can even work with documents like contracts or correspondence, which require a higher level of interpretation in order to reliably find and present needed data.
Yet it is exactly this transition from a complex configuration problem to one of sample data curation that threatens to stall the spread of IDP to all reaches of the organization. Part of the issue is that organizations aren’t prepared to supply the needed data. Regardless of claims by vendors that solutions only require a few training examples, the reality is that in order to produce high performance (measured in both quantity and quality of data), organizations need thousands of samples, and they need those samples to be statistically representative. If you don’t know what that means, well…
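As a rough illustration of what “statistically representative” can mean in practice, the sketch below compares the mix of document sources in a training sample against the mix seen in production and flags the sources that are over- or under-represented. Everything here is an assumption for the sake of the example: the function names, the insurer labels, and the 5-point tolerance are hypothetical, and a real assessment would look at more than one attribute.

```python
from collections import Counter
from typing import Dict, Iterable


def mix(labels: Iterable[str]) -> Dict[str, float]:
    """Share of each label (e.g. insurer) in a set of documents."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


def representation_gaps(
    sample_labels: Iterable[str],
    production_labels: Iterable[str],
    tolerance: float = 0.05,  # hypothetical threshold: 5 percentage points
) -> Dict[str, float]:
    """Labels whose share in the sample differs from production by more
    than `tolerance`. Positive means over-represented in the sample,
    negative means under-represented.
    """
    sample = mix(sample_labels)
    production = mix(production_labels)
    gaps: Dict[str, float] = {}
    for label in set(sample) | set(production):
        diff = sample.get(label, 0.0) - production.get(label, 0.0)
        if abs(diff) > tolerance:
            gaps[label] = round(diff, 3)
    return gaps


# Made-up example: production sees three insurers, the sample only two.
production_docs = ["A"] * 50 + ["B"] * 30 + ["C"] * 20
training_docs = ["A"] * 80 + ["B"] * 20
gaps = representation_gaps(training_docs, production_docs)
```

In this made-up case the training set leans heavily on insurer A and has nothing at all from insurer C, so a model trained on it would likely struggle with a fifth of real-world volume, no matter how many thousands of samples were collected.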
The upshot is pretty clear – organizations stand to gain access to more document-based information than ever before, at costs that easily justify using IDP software in any document-oriented process. But they need to start addressing the need for data now.
If you’re interested in how Parascript can help your organization apply ML-based document automation to any business process, give us a shout.