What Do I Need to Know About Evaluating Intelligent Capture?
Software Evaluations
Intelligent capture software is not like other business software, where options can be evaluated through simple verification of features.
Unlike a CRM system, where suitability can be established by confirming that the supporting capabilities exist, the objective for intelligent capture software is always comprehensive, accurate data output.
Evaluations of intelligent capture solutions should be treated like the evaluation of a precision instrument. You need to examine the data output the system produces and measure how much data can be presented, and at what level of accuracy.
The level of product-specific knowledge required to produce real-world results is typically very high. This often makes it impractical for an organization to evaluate one solution, let alone several, on its own.
Fortunately, you can request that the vendor perform the configuration necessary to review results. This may or may not require a service fee. Evaluations always require sample data to configure a system and test data to measure the performance of the system.
Providing Samples
Measuring the precision and reliability of any system requires more than a few samples. In real-world situations, documents can vary significantly in quality, layout, and data type. Configuring a system from only a few samples allows an organization to understand the user experience of configuration, not the performance of the system.
For manual configurations, an example of each variant of a single document is important. It may be impractical to visually inspect a large number of production documents to identify each variant, especially for scanned documents that may have many different types of quality issues. Generally, it is good to have 50 to 100 samples of each document type if the documents are fairly standardized (such as forms) and 500 to 1,000 of each document type if the layout and data types are more varied (e.g., invoices).
For configurations that use machine learning, high-quality input data becomes an imperative, since machine learning can be sensitive to bias in the data. The general rule is the more data the better, since there are no additional costs, apart from compute time, to process 500, 1,000, or 10,000 samples; the software does the hard work. For structured data, such as scanned forms, 250 samples per document type is ideal. For variable data, 1,000 to 1,500 samples per document type is better.
Test data should match the characteristics of the configuration data in both quality and quantity. The larger the sample set, the more precise the statistical measurement.
How Does Image Quality Affect Performance?
For evaluations and Proofs of Concept (PoCs) that involve scanned documents, it is a fact of life that images of documents are never as good as their physical or born-digital versions. With the introduction of document capture via smartphones, the problems can be much worse.
For instance, fax machines can introduce a number of quality problems: noise in the form of dark spots or stray pixels, scaling problems such as vertical or horizontal stretching, or images shifted up, down, or to the side. Smartphones can introduce shading, lighting, or focus problems that directly affect field-level recognition.
Each of these problems must be identified, taken into consideration, and managed, as they can reduce the number of documents successfully identified or interfere with field-level recognition due to distortions. Generally, a scanned document performs 15 to 30 percentage points lower than a pristine color scan or digital document.
How to Measure and Evaluate
What Should I Measure and How Should I Evaluate Results?
Apart from understanding the user experience with regard to configuration and review of output, the main focus should be on measuring the output of the system in a manner that allows an apples-to-apples comparison.
Once candidate systems are configured to the satisfaction of the evaluation requirements, sample data should be run through these systems in a configuration that allows all data to be output.
Output from one system should be compared to other systems at both a global level and at a task level. For instance, if the task is document classification, the results should be compared at the document level. This is both the global level and the task level. If the task includes data extraction of 5 fields, the measurement should be for all fields of the sample set and then for each of the 5 fields.
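To make that comparison concrete, here is a minimal sketch of how the global and field-level measurements might be computed. The data layout (dictionaries keyed by document ID) and the field names are illustrative assumptions, not taken from any particular product.

```python
# Sketch: overall and per-field accuracy for one system's output.
# Assumes `system_output` and `answer_key` are dicts keyed by document ID,
# each mapping field names to extracted values. Field names are illustrative.

FIELDS = ["invoice_number", "invoice_date", "vendor_name", "total_amount", "po_number"]

def accuracy_report(system_output, answer_key, fields=FIELDS):
    """Return overall accuracy plus accuracy for each individual field."""
    field_correct = {f: 0 for f in fields}
    total_correct = 0
    total_values = 0

    for doc_id, truth in answer_key.items():
        predicted = system_output.get(doc_id, {})
        for field in fields:
            total_values += 1
            if predicted.get(field) == truth.get(field):
                field_correct[field] += 1
                total_correct += 1

    overall = total_correct / total_values if total_values else 0.0
    per_field = {f: field_correct[f] / max(len(answer_key), 1) for f in fields}
    return overall, per_field
```

Running the same report for every candidate system against the same answer key yields the apples-to-apples comparison described above.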
Some systems emphasize the use of what are called confidence scores. When confidence scores are available, an additional measurement should be used to evaluate not only the precision of the system, but also the percentage of tasks that can be completely automated with no need for manual review. We call this straight-through processing (STP).
Confidence Scores and How to Use Scores
Defining and Using Confidence Scores
A confidence score is a number assigned to each document or field-level output from an intelligent capture system. For instance, a document classification answer and corresponding confidence score might look like “document_class=Invoice; confidence_score=55”. Confidence scores are generated by the intelligent capture software as a means of gauging whether its output is correct. Reliable confidence scores, along with good data science principles, are the key to straight-through processing.
Probably the most-common misconception of confidence scores is that they are a measure of probability. Using the example above, many misinterpret a confidence score of “55” to mean that the system is 55% sure that the document is correctly classified as an invoice. But this is not correct.
The reality is that a confidence score, all by itself, means nothing. To understand the probability of any specific answer, you must find the optimal “confidence score threshold”. For instance, continuing the example of a document class assignment, you must examine both the answer and its confidence score over a large sample set, say 100 different class assignments and their confidence scores, to understand the meaning of a score of “55” relative to other scores.
To derive meaning from confidence scores and identify a confidence score threshold, sort all 100 of your answers from the lowest confidence score to the highest. Doing this, you should quickly see that the majority of answers at the lower end of the range are incorrect, while at the opposite end the majority are correct. Ideally, there is a point on the list that separates the mostly accurate answers from the mostly inaccurate ones (no system is perfect!). This point on your list is your confidence score threshold. The threshold could be a low number, say 35, or a high number, such as 85. The purpose is always the same: to separate data that is mostly correct from data that is mostly incorrect.
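As a concrete illustration, the sketch below shows one way to locate such a threshold by sorting scored answers. It assumes each answer has already been marked correct or incorrect against an answer key, and the target accuracy of 95% is an arbitrary example, not a recommendation.

```python
# Sketch: finding a confidence score threshold.
# `results` is a list of (confidence_score, is_correct) pairs, e.g. 100
# document-classification answers already checked against an answer key.

def find_threshold(results, target_accuracy=0.95):
    """Return the lowest confidence score at which all answers scoring
    at or above it reach the target accuracy, or None if no such point exists."""
    ranked = sorted(results, key=lambda r: r[0])  # lowest confidence first
    for i, (score, _) in enumerate(ranked):
        above = ranked[i:]  # this answer and everything scored at or above it
        accuracy = sum(correct for _, correct in above) / len(above)
        if accuracy >= target_accuracy:
            return score
    return None
```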
Comparing Systems Using Confidence Scores
Once you have the output from each system (using the same test data), you can calculate each system's accuracy, defined as the share of output data that matches your “answer key”, both overall and for each field. You can also identify whether the system can support STP, and at what level.
For each set of system output, sort the output by the corresponding confidence score. This means that for document classification, you sort by confidence score for all classification output. For data extraction, you sort for each field.
Once sorted, you can observe whether there is a natural grouping of correct answers and incorrect answers separated by a specific confidence score; this is your confidence score threshold. The number of answers above this threshold divided by the total number of answers is the percentage of straight-through processing (STP) you can achieve for each field. The number of correct answers above the threshold divided by all answers above the threshold is your STP accuracy.
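Those two calculations can be sketched as follows, reusing the (confidence_score, is_correct) pairs and the threshold from the previous example; the data structure is an assumption for illustration only.

```python
# Sketch: straight-through processing (STP) rate and STP accuracy for one field.
# `results` is a list of (confidence_score, is_correct) pairs and `threshold`
# is the confidence score threshold identified for that field.

def stp_metrics(results, threshold):
    above = [correct for score, correct in results if score >= threshold]
    stp_rate = len(above) / len(results)                        # answers above threshold / all answers
    stp_accuracy = (sum(above) / len(above)) if above else 0.0  # correct above threshold / all above threshold
    return stp_rate, stp_accuracy
```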
Achieving STP means that the workflow process is automated and completed successfully without human intervention. STP is an automated electronic process used by enterprises large and small. Tasks can move “straight through” in a fully automated manner. This capability relies on a very important factor: the ability to determine with high precision that a task was not just executed, but that it was executed correctly. This appears deceptively simple.
For very basic, routine tasks, straight through processing is regularly achieved by organizations today. For more complex processes, STP has proved more elusive. Gaps exist between expectations of organizations and the realities that many enterprises face today in truly achieving STP. Bridging this gap through modern technology solutions is now possible.