To refresh your memory, here are Part 1 & Part 2.
In this third and final part on the issue of training samples, we come to the most complex part of curating training sets: high-variance document types.
Take the Explanation of Payment (EOP) document in insurance payments as a perfect example. Depending upon the services rendered and the number of patients involved in the payment, an EOP can run from one page to 100 pages, and each page can include an unknown number of patient services, each with a potentially different number of line items. How can you identify the total number of variants and then assemble 10 samples of each?
Sampling is the answer. Sampling is the practice of selecting items from real production data to create a subset that roughly matches the characteristics of the full production stream. We sample largely because it is generally difficult, expensive, or impossible to collect and work with all of the data an organization processes. The key is to take samples in a way that maximizes how representative they are of your overall production data. Here are some general rules of thumb, followed by a short sketch that puts them into code:
- The more data the better. The larger your sample set, the higher the probability that the samples collectively capture the characteristics of all of your data. As with anything, though, there are diminishing returns with extremely large sample sets.
- Take samples from a longer time period. Sampling across an entire year's worth of data is generally better than sampling only last week's, because data fluctuates over time due to factors such as seasonality, new clients, and revised form layouts.
- Take samples randomly. Instead of taking samples from only one client or one type of batch, take a certain number from every source.
- Evenly distribute your samples. If your goal is a sample set of 10,000, it is better to take around 200 samples per week, and better still to take around 40 samples per business day. This smooths out the risk that one day, week, or month contributes too many of the same type of layout.
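To make these rules concrete, here is a minimal Python sketch of evenly distributed random sampling. The `date` and `id` fields are hypothetical metadata names; substitute whatever your capture system records for each document.

```python
import random
from collections import defaultdict

def sample_evenly(documents, target_size, seed=42):
    """Spread the sample evenly across days, picking randomly within each.

    `documents` is a list of dicts assumed to carry a 'date' key
    (a datetime.date) identifying when each document arrived in
    production. Returns roughly `target_size` documents.
    """
    rng = random.Random(seed)

    # Group production documents by arrival day.
    by_day = defaultdict(list)
    for doc in documents:
        by_day[doc["date"]].append(doc)

    # Give every day an equal quota and choose randomly within it, so no
    # single day, week, or month can dominate the sample with one layout.
    per_day = max(1, target_size // len(by_day))
    sample = []
    for day in sorted(by_day):
        docs = by_day[day]
        rng.shuffle(docs)
        sample.extend(docs[:per_day])

    # Trim any overshoot at random rather than cutting off late-year days.
    rng.shuffle(sample)
    return sample[:target_size]
```

With a target of 10,000 over roughly 250 business days, the quota works out to about 40 documents per day, which matches the rule of thumb above.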
Presuming that you have collected your samples diligently, you're probably good to go. But there is an additional step you can take to get a rough idea of whether or not your sample data truly reflects the attributes of all of your data. Using a technology called clustering, you can have the software automatically sort and group documents by likeness, running it on both your samples and on a large amount of production data. Likeness can be based upon visual attributes, text attributes, or both (we like both). The output of clustering is a set of groups (or clusters) of documents, and the number of groups found in your sample set should resemble the number found in a full year's worth of data. If the two counts agree to within about 10% to 15%, you've got a good sample set. You're now ready to use the samples to train the system.
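If you want to try this comparison yourself, here is a rough sketch using scikit-learn. It clusters on text attributes only, with TF-IDF features and DBSCAN (chosen here because it does not require guessing the number of clusters in advance); treat the `eps` and `min_samples` values as starting points to tune, not as settings any particular product uses.

```python
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

def count_clusters(texts, eps=0.4, min_samples=5):
    """Group documents by text likeness and return the number of groups."""
    features = TfidfVectorizer(max_features=5000).fit_transform(texts)
    labels = DBSCAN(eps=eps, min_samples=min_samples,
                    metric="cosine").fit(features).labels_
    # DBSCAN labels outliers as -1; exclude them from the group count.
    return len(set(labels)) - (1 if -1 in labels else 0)

# Usage, assuming one extracted-text string (e.g. OCR output) per document:
#   sample_groups = count_clusters(sample_texts)
#   production_groups = count_clusters(production_texts)
#   drift = abs(sample_groups - production_groups) / production_groups
#   A drift within roughly 0.10 to 0.15 suggests a representative sample set.
```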
PS. After reading all of this, if you're wondering "there has to be a better way", good for you! It's always worthwhile to examine alternatives to sample curation, especially if getting access to sample data is difficult. One of the most interesting and promising alternatives to using genuine samples is to create the samples yourself, an approach referred to as "synthetic data". This is very different from just mocking up a few fake examples: synthetic data needs to exist in quantities similar to a typical sample set and needs to exhibit the same amount of variance. Machine learning algorithms are typically employed to first analyze a smaller set of samples and then, using the results of that analysis, create a large number of samples that closely resemble a random sampling of actual data. The reality is that few organizations have the technical wherewithal to create their own synthetic data, but vendors of ML-based software might be willing and able to do it for you; it never hurts to ask.
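As a toy illustration of the principle (not a production-grade synthesizer), the sketch below measures the variance in a small set of real samples and then generates records at scale that follow the same empirical distributions. The `pages` and `services_per_page` fields are hypothetical stand-ins for whatever attributes drive variance in your documents.

```python
import random

def empirical(real_samples, key):
    """Collect the observed values of `key` across the real samples."""
    return [doc[key] for doc in real_samples]

def generate_synthetic(real_samples, count, seed=7):
    """Generate `count` synthetic records that mirror the variance of a
    small set of real samples by resampling from their observed values."""
    rng = random.Random(seed)
    pages = empirical(real_samples, "pages")
    services = empirical(real_samples, "services_per_page")
    synthetic = []
    for i in range(count):
        # Resampling from the empirical distributions preserves the real
        # spread (one page to a hundred) without copying real records.
        synthetic.append({
            "id": f"synthetic-{i}",
            "pages": rng.choice(pages),
            "services_per_page": rng.choice(services),
        })
    return synthetic
```

A real implementation would also have to synthesize realistic field values and page images, which is exactly the technical wherewithal most organizations lack and a vendor may be able to supply.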