Machine learning lets us train our software to complete the most mundane organizational and data entry tasks. If you have ever had a project where you needed to deal with document capture in your business workflow, it is highly probable that you reviewed documents and devised rules for how they should be processed. Maybe you needed rules to discern between different document types, or how to route documents based upon the type of data, or even what types of data needed to be extracted to make the business process most efficient.
The vast majority of these document-oriented workflows require significant effort to devise, implement, and maintain the rules.
Let’s take a loan documentation inventory example. Regardless of the loan type (an auto loan, a home loan, or a home equity loan), many documents are needed to satisfy credit and compliance requirements. Verifying that all documentation requirements are met, and that individual documents contain the necessary data, is a time-consuming task. Many banks and other lending organizations struggle to make this process reliable and efficient enough to meet customer demand and rising expectations.
Machine Training vs. Traditional Automation
Let’s address this need with two approaches: using machine learning versus traditional user-created rules.
Traditional Automation
Traditional automation starts with manual evaluation. A properly managed project starts with an inventory of all required documentation for a given lender and loan type. Next, staff collect examples of all document types from different sources. They organize hundreds of these documents by type and review them for unique characteristics that can be used to recognize each type automatically during a typical loan approval workflow.
User-created Rules
Once the staff identifies all the documents’ characteristics, an analyst encodes them as rules within a document capture system. Once rules are encoded, they must be tested in order to uncover any misclassifications that require adding new rules or fine-tuning existing ones. After testing and refining are completed, the rules go into the production workflow.
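To make the idea of user-created classification rules concrete, here is a minimal sketch in Python. It assumes the documents have already been converted to text, and the document types and phrases in RULES are hypothetical examples of what an analyst might write down, not rules from any real capture system.

```python
import re

# Hypothetical hand-written rules: each document type is identified by
# phrases an analyst noticed while reviewing sample documents.
RULES = {
    "loan_application": [r"\bloan application\b", r"\bapplicant\b"],
    "pay_stub": [r"\bgross pay\b", r"\bpay period\b"],
    "drivers_license": [r"\bdriver'?s? licen[sc]e\b", r"\bdate of birth\b"],
}


def classify_with_rules(text, min_hits=1):
    """Return the document type whose patterns match most often, or None."""
    lowered = text.lower()
    # Count how many of each type's patterns appear in the document text.
    scores = {
        doc_type: sum(1 for pattern in patterns if re.search(pattern, lowered))
        for doc_type, patterns in RULES.items()
    }
    best_type = max(scores, key=scores.get)
    # Too few matches means the document stays unclassified.
    return best_type if scores[best_type] >= min_hits else None
```

Every new layout or wording variation means another pattern has to be added by hand and the whole set retested, which is where the maintenance effort comes from.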
Next, it’s necessary to create rules that locate needed data within the documents before the data can be extracted and validated. A similar process of analysis, testing and tuning takes place to ensure the maximum amount of data is located and extracted, and also to measure how accurate the extraction is, since that accuracy determines when manual review is needed. Because many of the documents are not standardized, a wide range of rules must be created.
This is a tremendous amount of work to get to production. If any document changes, either from a document type or data layout perspective, the rules have to be re-evaluated and re-tuned. So the work is never finished. It is no wonder that many lenders have not invested in traditional automation.
Training the Machine
Now let’s look at the same set of requirements using machine learning technologies. For initial document discovery, a technique called “clustering” can be used to automate the logical grouping of like documents. Documents can be organized automatically. Applications can be grouped with applications; photos of driver’s licenses can be grouped with identification documents and so on. The result is a set of documents grouped by likeness that can then be further evaluated.
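As a rough illustration of grouping documents by likeness, the sketch below uses a simple greedy grouping based on word overlap (Jaccard similarity). This is a stand-in chosen for brevity, not the clustering algorithm of any particular product, and the function names and the 0.3 threshold are assumptions made for the example.

```python
import re


def tokens(text):
    """Lowercased set of alphabetic words in a document."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Overlap between two token sets: 0.0 is disjoint, 1.0 is identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_documents(docs, threshold=0.3):
    """Greedily group documents by similarity to each cluster's first member.

    Returns a list of clusters, each a list of indexes into docs.
    """
    clusters = []  # indexes of the documents in each cluster
    seeds = []     # token set of the first document in each cluster
    for i, doc in enumerate(docs):
        doc_tokens = tokens(doc)
        for cluster, seed in zip(clusters, seeds):
            if jaccard(doc_tokens, seed) >= threshold:
                cluster.append(i)
                break
        else:
            # No existing cluster is similar enough, so start a new one.
            clusters.append([i])
            seeds.append(doc_tokens)
    return clusters
```

No document types are supplied up front here. The groups emerge from the documents themselves, and staff only need to review and name them afterwards.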
Robust Rule Auto-Generation
Next, each grouping that corresponds to a required document can be given a document type (or class). The samples can then be imported into machine learning software designed to automatically identify key characteristics of each document type (often technically called “feature extraction”). The result is an automated set of rules for each document type. When performance is not good for a specific class, the staff can add the misclassified or unclassified documents to that class’s sample set to “re-train” the software.
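The train and re-train loop described above can be sketched with a deliberately simple classifier that learns word counts per document type from labeled samples. The class name SampleTrainedClassifier and its scoring method are assumptions for illustration only; production systems use far richer features than word counts.

```python
import re
from collections import Counter, defaultdict


def words(text):
    """Lowercased alphabetic words of a document, in order."""
    return re.findall(r"[a-z]+", text.lower())


class SampleTrainedClassifier:
    """Learns word counts per document type from labeled samples."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)

    def add_samples(self, doc_type, texts):
        """Add labeled samples; calling this again re-trains the class."""
        for text in texts:
            self.word_counts[doc_type].update(words(text))

    def classify(self, text):
        """Return the type whose learned vocabulary best covers the text.

        Returns None when no learned word appears in the text at all.
        """
        doc_words = words(text)
        best_type, best_score = None, 0.0
        for doc_type, counts in self.word_counts.items():
            total = sum(counts.values())
            # Sum the relative frequency of each document word in this class.
            score = sum(counts[w] for w in doc_words) / total if total else 0.0
            if score > best_score:
                best_type, best_score = doc_type, score
        return best_type
```

When a document comes back unclassified, adding it to the right class with add_samples is the whole fix. Nobody has to write or tune a rule by hand.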
Learning-on-the-Fly for High Quality Data Results
Data extraction is also simplified by taking sample loan files that have been processed along with the data required for each document. Together these automatically train the software to locate the matching data and derive positional rules for each data field. The software uses the processed data for each page and locates the corresponding data on the document. It will do this for each sample and then automatically create algorithms based upon exact location, changes in placement across each example and relative position to other data, among other things. The staff simply examines the results.
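One way to picture how positional rules could be derived from processed samples is the sketch below. It assumes each sample page is available as a list of (text, x, y) word positions from OCR together with the field values a person already keyed in. The function names, the data layout and the fixed margin are all hypothetical, and the sketch covers only the "exact location" and "changes in placement" signals, not relative position to other data.

```python
def learn_field_region(samples, field, margin=10):
    """Derive a bounding box for a field from already-processed samples.

    Each sample is (words, known_values): words is a list of (text, x, y)
    tuples, and known_values maps field names to the value already keyed
    in for that document.
    """
    xs, ys = [], []
    for words, known_values in samples:
        target = known_values.get(field)
        for text, x, y in words:
            # Locate the known value on the page and record where it sits.
            if text == target:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None  # the value was never located, so no rule can be derived
    # The box spans every observed position, widened by a tolerance margin.
    return (min(xs) - margin, min(ys) - margin,
            max(xs) + margin, max(ys) + margin)


def extract_field(words, region):
    """Return the text of every word that falls inside the learned region."""
    left, top, right, bottom = region
    return [text for text, x, y in words
            if left <= x <= right and top <= y <= bottom]
```

The more samples that are supplied, the better the learned region reflects how far a field actually moves from one document to the next.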
The machine learning technology used to configure the system also makes adjustments. Complicated projects that would typically take weeks, if not months, are significantly shortened, saving both time and money. Machine learning technology streamlines the manual processes used in production and helps eliminate the labor-intensive tasks required for initial configuration.