Last week’s article burst a bubble: automating that simple structured form ain’t so simple. What you read on the Interwebs might have led you to believe that applying intelligent document processing (IDP) to your forms is a piece of cake and that high levels of automation come easily. But most of what you’re reading is optimistic marketing, which is more than willing to show only the happy path of automation within a tightly controlled process that leaves no margin for error.
But in the real world, documents come in all shapes and sizes and are acquired through many different technologies. All of these variations conspire to create a complex IDP project, even for highly standardized forms. The result is often far less “touchless automation” in real-world projects than you expect and need. So you’re probably reading this and asking yourself, “What is he getting at, and why is he trying to make me give up? Isn’t Parascript an IDP vendor?”
OK, it’s time to put a positive (and realistic) spin on all this complexity. The reality is that any project that needs high levels of optimization requires a lot of attention to detail, which means some planning. But the good news is that the level of effort required is decreasing by the year (if not the quarter), thanks to the continued progress of machine learning-based configuration, which includes analysis and correction of image quality. Just a few years ago, trained Parascript staff needed weeks to review sample data and construct rules and algorithms to detect and deal with a variety of image quality issues within a single client project, and often for a single document type.
It all starts with acquiring a statistically representative set of samples. From there, we apply computer vision and clustering techniques to group documents by category of image quality issue. Each group then undergoes a variety of tests using image perfection techniques in order to optimize it for follow-on automation such as document classification and data extraction. All of this is done for each document type, because the nature of a document can significantly affect image quality. For instance, a standard health claim form, like the one we discussed in the previous article, contains a lot of data fields and small pre-printed text. These “dense” forms can introduce a lot of problems when they are converted into images. A less dense form, such as a certificate of property insurance, will have different problems.
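To make the grouping step a little more concrete, here is a minimal sketch in Python. It is not Parascript's implementation. It assumes each sample has already been reduced to three hypothetical quality scores (skew, blur, and scale error, each scaled to the range 0 to 1), and it uses plain k-means as a stand-in for whatever clustering technique a real project would choose. The file names and scores are invented for illustration.

```python
from math import dist
from statistics import mean


def kmeans(points, k, iterations=20):
    """Plain k-means: returns one cluster index per point."""
    # Deterministic farthest-point initialisation keeps the sketch reproducible.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda i: dist(p, centroids[i])) for p in points
        ]
        # Move each centroid to the mean of the points assigned to it.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                centroids[i] = tuple(mean(axis) for axis in zip(*members))
    return labels


# Hypothetical samples: (skew, blur, scale error), each already scaled to 0..1.
samples = {
    "fax_001": (0.9, 0.1, 0.1),    # mostly a skew problem
    "fax_002": (0.8, 0.2, 0.1),
    "scan_003": (0.1, 0.9, 0.1),   # mostly a blur problem
    "scan_004": (0.2, 0.8, 0.2),
    "photo_005": (0.1, 0.1, 0.9),  # mostly a scale problem
    "photo_006": (0.2, 0.2, 0.8),
}

labels = kmeans(list(samples.values()), k=3)

# Collect document names by cluster so each group can be tested separately.
groups = {}
for name, label in zip(samples, labels):
    groups.setdefault(label, []).append(name)

for label, names in sorted(groups.items()):
    print(label, names)
```

Each resulting group can then be run through its own set of correction tests. In a real project the number of groups would not be known up front, which is one reason this should be read as an illustration rather than a recipe.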
The upside is that a lot of this work can now be automated with a high degree of precision, often higher than what humans can achieve given the limits on how much data a person can handle. Machines, on the other hand, can crunch a lot of data and notice even minute changes that could affect performance. Differences in scale, shifting, slight rotation of images, and even the pre-printed form itself all pose serious problems for anyone expecting high performance. Rather than require a person to apply the necessary corrections, we can use machine learning to detect each problem and automatically apply the corrective action that yields the highest level of quality. That’s what we did for those horrible faxed CMS-1500 health claim forms. Instead of manually reviewing and configuring a system, as is still typically the case, we use our trained machine learning algorithms to do all that work for us. Even better, configurations done in a pre-production phase can continue to be refined automatically in production as new documents with previously unencountered problems are identified.
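The idea of trying corrective actions and keeping the one that scores best can be illustrated with a toy example. The sketch below is an assumption-laden stand-in, not the Parascript method: the "image" is a tiny grid of 0s and 1s, the only problem considered is a horizontal shift, and quality is measured as simple pixel agreement with a known form template. The function names `shift`, `agreement`, and `best_correction` are hypothetical.

```python
def shift(image, dx):
    """Shift a binary image (list of rows) horizontally by dx pixels, padding with 0."""
    width = len(image[0])
    shifted = []
    for row in image:
        if dx >= 0:
            shifted.append([0] * dx + row[:width - dx])
        else:
            shifted.append(row[-dx:] + [0] * (-dx))
    return shifted


def agreement(image, template):
    """Fraction of pixels matching the template: a stand-in quality score."""
    total = sum(len(row) for row in template)
    same = sum(
        a == b
        for image_row, template_row in zip(image, template)
        for a, b in zip(image_row, template_row)
    )
    return same / total


def best_correction(image, template, candidates):
    """Try every candidate shift and return (score, dx) for the best one."""
    scored = [(agreement(shift(image, dx), template), dx) for dx in candidates]
    return max(scored, key=lambda pair: pair[0])


# A tiny stand-in for the pre-printed form.
template = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
]

# The same form as it arrived, shifted two pixels to the right.
scanned = [
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 1, 1, 0],
]

score, dx = best_correction(scanned, template, range(-3, 4))
print(f"best shift: {dx}, agreement: {score:.2f}")
```

A production system would search over scale, rotation, and other corrections as well, and would likely use a learned quality measure rather than raw pixel agreement. The selection logic stays the same: score every candidate correction and keep the winner.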
Successful, highly optimized IDP deals with more than just classification and data extraction. Hey you, faxed document with scaling and blurry image problems: I’d like to introduce you to Parascript automated machine learning-based image perfection.