Sample Data for Intelligent Capture
How Many Samples Are Necessary for Intelligent Capture to Be Successful?
One of the most common questions organizations implementing intelligent document processing ask is, “How many samples do we need?” There are two drivers for how many samples should be used, and two areas where samples are necessary.
First of all, sample data is necessary when the intelligent capture system is being configured; if you’re using machine learning, this set is typically called the training set or learning set. The second area where samples are used is in measuring system precision. In both configuration and measurement, the real focus of intelligent capture is on optimizing two things: the comprehensiveness of the data you can extract from unstructured documents, and the precision, or accuracy, of the data output.
Unlike other systems, the real focus is on data results, which involves a lot of measurement. The number and type of samples you use to configure a system help determine the outcomes you can expect, so the sample count is really driven by the number of document types within the scope of your project. These could be contracts, purchase orders, bills of lading, or medical charts. Just as important, the variance within each document type is also critical. For example, when automating invoice data extraction, if you only receive one vendor’s invoice layout, chances are you can get by with a few samples. However, if you’re dealing with invoices from many different vendors, then samples from each of the 2,500+ different vendors may be more appropriate.
Even though you have one document type called “invoice,” you’ve potentially got 25 to 2,500 or even 5,000 different variations within it, so a representative set of samples and data output is what trains the system for very accurate results. If your scope involves 10 document types with three variations of each type, the objective would be to collect 30 samples, one per variation.
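To make that arithmetic concrete, here is a minimal sketch in Python. The function name and the one-sample-per-variation default are illustrative assumptions, not part of any product; in practice you would raise the per-variation count to match your accuracy targets.

```python
# Back-of-the-envelope sample-count estimate: collect samples for every
# variation of every document type in scope. The figures below (10 types,
# 3 variations, 1 sample each) mirror the example above; samples_per_variation
# is an assumed knob you would tune for your own accuracy goals.

def estimate_sample_count(variations_per_type: dict[str, int],
                          samples_per_variation: int = 1) -> int:
    """Total samples = sum over document types of (variations x samples each)."""
    return sum(v * samples_per_variation for v in variations_per_type.values())

# 10 document types, each with 3 known variations -> 30 samples.
scope = {f"doc_type_{i}": 3 for i in range(1, 11)}
print(estimate_sample_count(scope))  # 30
```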
Quite simply, if you’re dealing with a form with only one layout, then despite the issues you might have with scanning quality and the like, you’re working with a very standardized document, and fewer samples are necessary. Whether you’re manually configuring a system or using machine learning, you always need to focus on the input data set. With machine learning it is even more so, because you’re handing the task of configuring the system, which a rules-based approach handles explicitly, over to learning algorithms seeded by your data. To understand why machine learning algorithms behave the way they do, examine the input data sets. When you’re working with a system that genuinely uses machine learning algorithms instead of rules-based approaches, the emphasis on having a representative data set becomes that much more important.
Deep learning algorithms, like the ones used in our software’s smart learning, perform very well on smaller sample sets. Another way to get around huge sample data sets is pre-trained models. We collect large sample sets to pre-train intelligent capture models so that they are available out-of-the-box for immediate use. Pre-trained models find specific data on specific document types from day one, and then gradually adapt to your specific data set. So it’s kind of like having your cake and eating it too.
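As a hedged illustration of this pre-train-then-adapt pattern (not our actual capture models), here is how incremental adaptation looks with scikit-learn’s SGDClassifier. The synthetic feature vectors stand in for extracted document fields, and the two fitting stages mirror the pre-training corpus and the customer’s own documents:

```python
# Illustrative only: a generic incremental-learning pattern. A linear model
# is first fitted on a broad "pre-training" corpus, then incrementally
# adapted to one customer's documents with partial_fit, with no full retrain.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

# Stand-in features: pretend each row encodes one candidate document field.
X_pretrain = rng.normal(size=(1000, 20))    # broad, vendor-agnostic corpus
y_pretrain = rng.integers(0, 2, size=1000)  # e.g. "is this the invoice total?"

model = SGDClassifier(random_state=0)
model.fit(X_pretrain, y_pretrain)           # the "out-of-the-box" starting point

# Later, adapt gradually to a specific customer's document variations.
X_customer = rng.normal(loc=0.5, size=(50, 20))
y_customer = rng.integers(0, 2, size=50)
model.partial_fit(X_customer, y_customer)   # incremental update on new samples
```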
Even if you’re not starting with a pre-trained model, you can curate data automatically using intelligent capture. Our smart learning system has a data curation function that analyzes production data and outputs, gradually collects samples that fit your production information, and then uses them to learn. Over time, these models become a lot more efficient.
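One plausible way to picture that curation loop, sketched below under stated assumptions: keep only high-confidence production extractions as future training examples. The record layout, class names, and the 0.98 threshold are all hypothetical stand-ins, not documented behavior of the product.

```python
# Hypothetical sketch of confidence-based data curation: production
# extraction results above an assumed confidence threshold are collected
# as new training samples, so the model gradually fits the customer's data.
from dataclasses import dataclass, field

@dataclass
class Extraction:
    document_id: str
    field_name: str
    value: str
    confidence: float  # model's confidence in this extracted value, 0..1

@dataclass
class CurationBuffer:
    threshold: float = 0.98  # assumed cutoff for "trustworthy enough to learn from"
    curated: list[Extraction] = field(default_factory=list)

    def observe(self, result: Extraction) -> None:
        """Keep only high-confidence production results as future training data."""
        if result.confidence >= self.threshold:
            self.curated.append(result)

buffer = CurationBuffer()
buffer.observe(Extraction("inv-001", "total_amount", "1,204.50", 0.995))
buffer.observe(Extraction("inv-002", "total_amount", "88.1O", 0.41))  # likely OCR error, skipped
print(len(buffer.curated))  # 1
```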