This is “No Easy Button, But We Are Getting Closer,” Part 2 of our series on machine learning and deep learning in Intelligent Document Processing (IDP). Part 1 discusses why the ultimate value of IDP is the ability to deliver as much structured data from documents as possible with the highest levels of accuracy and reliability. In this article, we discuss strategies and shortcuts you can use to understand what amount of data is satisfactory and how to get there.
Delivering Precision
Whenever precision must be delivered, input (training) data becomes critical. The general rule of thumb is “the more the better.” This is no different with intelligent document processing solutions, where the ultimate value is the ability to deliver as much structured data from documents as possible with the highest levels of accuracy and reliability.
We live at a pretty interesting time. Never before has technology allowed us to focus on what we want from it rather than on the complexity of how to implement it. This is all due to the real-world application of machine learning. Even so, most organizations are still not prepared for practical applications of machine learning, mostly because of a general lack of high-quality training data. Even though the comparative costs of curating high-quality training data sets are a fraction of those needed to staff, configure and maintain traditional systems, there are practical ways of making the “training data dilemma” more manageable.
Rendering the “Training Data Dilemma” Manageable
Certainly, the number of samples required to configure IDP software goes down if we provide the machine with more instructions and guidance. In doing so, you replace some of the need for samples with explicit instructions. But that is not what most organizations want when they plan to use machine learning. They want high automation levels with high accuracy based on minimal work and relatively few samples.
So what to do? The good news is that machine learning applications do not always necessitate mountains of training data. The key is to first have clearly defined requirements and then a solid understanding of the nature of the training data required. It all comes down to one word: representativeness.
Nature of Data or Representativeness
Presuming requirements are solidified in terms of the scope of documents and tasks you wish to automate, we can turn to the nature of data, or representativeness. This term refers to how closely your training data resembles the attributes of real-world production data.
Say, for instance, your project deals with automating health insurance payments. Typically this process involves three or four general types of documentation: the reference claim, an explanation of payment, a check payment and potentially correspondence. So the first order of business is to group data by document type, ensuring there are samples of each.
The next step is determining exactly how many samples are required for each type. This is where the real challenge lies: analysis is required to identify the amount of variety within each document category, and it is not practical to review 100% of a year’s worth of documents. It is usually helpful to assign each document type to a category based on the amount of variance. Such a grouping could look like the following:
- Low or No Variance: Documents in this category include structured forms where each form is highly standardized, such as a health claim. It can also cover some semi-structured documents (e.g. purchase orders or invoices) if the number of layouts is limited. The number of variants/layouts for each type should not exceed 5.
- Moderate Variance: Here we move into “semi-structured” documents, or specific forms with many varieties of the same type, such as business checks. Generally, Moderate Variance documents have between 6 and 30 layouts per document type.
- High Variance: These include semi-structured documents where the number of variants exceeds 30, or unstructured documents (think contracts) with no specific layout, where targeted data can appear anywhere within the document. Explanation of Payment documents are a perfect example of a document type where the number of layout variants can easily exceed 30.
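The three-tier grouping above can be sketched in a few lines of code. This is a minimal illustration, not part of any IDP product: the thresholds (5 and 30 layouts) come from the categories described above, while the document types and layout counts in the example are hypothetical assumptions.

```python
def variance_category(layout_count: int) -> str:
    """Map a document type's known layout count to a variance category.

    Thresholds follow the grouping described in the article:
    up to 5 layouts -> Low or No Variance; 6-30 -> Moderate; over 30 -> High.
    """
    if layout_count <= 5:
        return "Low or No Variance"
    elif layout_count <= 30:
        return "Moderate Variance"
    return "High Variance"


# Illustrative (assumed) layout counts per document type.
doc_types = {
    "health claim form": 1,        # highly standardized structured form
    "business check": 12,          # many varieties of the same type
    "explanation of payment": 45,  # layout variants easily exceed 30
}

for name, layouts in doc_types.items():
    print(f"{name}: {variance_category(layouts)}")
```

In practice the hard part is estimating `layout_count` for each type in the first place, which is exactly the sampling problem the rest of the article turns to.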
Now that we have our category assignments for each document type, we can focus on identifying a reasonable number of training samples for each category.
Assessing the Number of Samples Necessary
For all groups, you can follow a general rule of thumb: for each variant, you should only need about 10 examples. If, for instance, you have 10 forms, each with a single layout, you need 10 examples per form for a total of 100. If you have a single invoice type with 5 different layouts, you need 10 for each, for a total of 50. Curating samples in the “Low or No Variance” and “Moderate Variance” categories is fairly straightforward because you can identify the number of variants.
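The rule of thumb above reduces to simple arithmetic. Here is a minimal sketch; the constant of 10 samples per variant is the article’s heuristic, and the function name is our own invention.

```python
SAMPLES_PER_VARIANT = 10  # rule of thumb from the article


def required_samples(layouts_per_type: list[int]) -> int:
    """Total training samples needed: ~10 per layout, summed over all types."""
    return sum(n * SAMPLES_PER_VARIANT for n in layouts_per_type)


# 10 forms, one layout each -> 100 samples total
print(required_samples([1] * 10))  # 100

# one invoice type with 5 different layouts -> 50 samples
print(required_samples([5]))  # 50
```

The estimate is only as good as the layout counts you feed it, which is why the high-variance case discussed next is the hard one.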
Where It Gets Tricky
Where it gets tricky is when you have a high-variance use case where the total number of variants is hard to identify. Think of the Explanation of Payment (EOP) document. Depending upon the services rendered and the number of patients involved with the payment, the EOP could be one page or 100 pages. Each service identified can produce a different number of rows, with varying amounts of tabular data on each page.
How can you identify the total number of variants and then assemble 10 samples of each? This is where sampling comes into play. This is the topic of Part 3 of our IDP series.