Document Capture and Recognition
Document capture is a well-known operation within organizations. Scan. OCR. Validate. Large document services providers process millions of documents daily and operate large-scale workflows that focus on high efficiency and high accuracy. Part of this operation involves data quality control to ensure that service level agreements are met. It is not uncommon for data quality requirements to approach 99.5 percent accuracy, which is very difficult to attain without significant effort.
How you actually attain your data quality can mean the difference between saving or spending literally millions of dollars each year.
Manual Data Entry Processes
The most obvious (and low-tech) means to ensure high quality is to use data entry operators to validate the output of other operators. In this mode, services providers implement “two-pass” systems in which two operators key the same data and, if there are any discrepancies, a third operator evaluates the answers and selects the correct one. Since research shows that human error ranges between 0.5 percent and 5 percent, depending upon the data and application, using two operators independently with a third for auditing can theoretically support very accurate data. This method is the most expensive way to attain high data quality, since having two operators key the same data doubles the data entry expense.
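The two-pass comparison can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the record layout (a dictionary of field name to keyed value) and the `adjudicate` callback standing in for the third operator are all hypothetical. The second function assumes the two operators make mistakes independently, in which case the chance that both are wrong on the same field is the product of their error rates; at 5 percent each, that is at most 0.25 percent.

```python
from typing import Callable, Dict

Record = Dict[str, str]


def double_key_verify(
    first: Record,
    second: Record,
    adjudicate: Callable[[str, str, str], str],
) -> Record:
    """Merge two independent keyings of the same record, field by field.

    `adjudicate(field, a, b)` is a hypothetical stand-in for the third
    operator who looks at the page and picks the correct value.
    """
    verified: Record = {}
    for field, a in first.items():
        b = second.get(field, "")
        if a == b:
            # Both operators agree, so the value is accepted as keyed.
            verified[field] = a
        else:
            # Any discrepancy goes to the third operator.
            verified[field] = adjudicate(field, a, b)
    return verified


def undetected_error_upper_bound(p_first: float, p_second: float) -> float:
    """Upper bound on the chance both operators are wrong on one field.

    Assumes the operators err independently. Matching wrong values are a
    subset of "both wrong", so the real undetected rate can only be lower.
    """
    return p_first * p_second
```

In practice the independence assumption is optimistic, since a smudged or ambiguous character tends to mislead both operators in the same way.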
The Two-pass System
Another, more efficient strategy implements an “artificial” two-pass system. In this system, OCR is applied to the stream of documents as a first pass with the second pass fulfilled by having an operator key data, which is then compared to the output of OCR.
If the two match, the data is considered accurate. Any variance is handled by a second operator, just as a third operator resolves discrepancies in a manual two-pass system. In this way, the services provider significantly lowers the amount of manual work by having OCR automation act as the first operator. The savings can be significant if the OCR is accurate enough.
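A rough way to see where the savings come from is to count how many times an operator has to touch a field under each scheme. The sketch below is a simplified model, not a costing method from this article: the function names, the mismatch rates and the `review_cost` weight (how expensive resolving a mismatch is relative to keying a field) are all assumptions chosen for illustration.

```python
def manual_two_pass_touches(
    fields: int, mismatch_rate: float, review_cost: float = 1.0
) -> float:
    """Operator effort when two operators key every field and a third
    resolves the fields on which they disagree."""
    return 2 * fields + fields * mismatch_rate * review_cost


def ocr_two_pass_touches(
    fields: int, mismatch_rate: float, review_cost: float = 1.0
) -> float:
    """Operator effort when OCR replaces the first operator, one operator
    keys every field, and a second operator resolves OCR/operator mismatches."""
    return fields + fields * mismatch_rate * review_cost


if __name__ == "__main__":
    fields = 100_000
    # Both mismatch rates are hypothetical; OCR is assumed to disagree
    # with the operator more often than a second human would.
    manual = manual_two_pass_touches(fields, mismatch_rate=0.05)
    hybrid = ocr_two_pass_touches(fields, mismatch_rate=0.15)
    print(f"manual two-pass: {manual:,.0f} operator touches")
    print(f"OCR + operator:  {hybrid:,.0f} operator touches")
```

The model makes the dependence on OCR accuracy visible: the more often OCR disagrees with the operator, and the more costly each review is, the smaller the saving becomes.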
Multiple OCR Systems
A third way to gain further efficiency and reduce cost is to implement more than one OCR system. In this variation, the outputs of two (or more) different OCR systems are compared against each other. Any discrepancies are handled by an operator.
While OCR typically cannot match the accuracy of an operator, it can achieve quite high levels of data extraction, and two different OCR systems are far more likely to agree on a correct answer than to produce the same erroneous answer. So a match can be considered accurate. Note that this method is not the same as OCR “voting,” where different algorithms select the best answer.
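As a minimal sketch of this comparison, assume each OCR engine returns a dictionary of field name to extracted text (a hypothetical interface, since no specific product is named here). Fields on which both engines agree are accepted, and everything else, including fields that either engine left blank, is queued for an operator.

```python
from typing import Dict, List, Tuple


def route_by_agreement(
    engine_a: Dict[str, str],
    engine_b: Dict[str, str],
) -> Tuple[Dict[str, str], List[str]]:
    """Accept fields where two OCR engines agree; list the rest for review.

    Returns (accepted_values, fields_needing_operator).
    """
    accepted: Dict[str, str] = {}
    needs_operator: List[str] = []
    # Sort the field names so the review queue has a stable order.
    for field in sorted(set(engine_a) | set(engine_b)):
        a = engine_a.get(field, "").strip()
        b = engine_b.get(field, "").strip()
        if a and a == b:
            # Identical, non-empty results are treated as accurate.
            accepted[field] = a
        else:
            # Disagreement or a missing value goes to an operator.
            needs_operator.append(field)
    return accepted, needs_operator


if __name__ == "__main__":
    a = {"invoice_no": "INV-1001", "total": "58.20", "date": ""}
    b = {"invoice_no": "INV-1001", "total": "53.20", "date": ""}
    print(route_by_agreement(a, b))
```

Unlike voting, nothing here tries to decide which engine is right when they disagree; the disagreement itself is the signal that a person should look.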
It is likely that your organization uses one or more of these three methods, but there is yet another method that can result in significant cost reductions, even beyond the above. We’re talking about reducing costs 50 percent or more over what your organization currently spends.
Superior Quality, More Reliable, Low-Cost Data
This method should be seriously considered if you have a significant data entry operation with 50 or more staff. Many services providers that use OCR send the extracted data, along with the fields that were not extracted, to operators as a safety measure.
If the data is correct, there’s no real penalty, is there? In fact there absolutely is, since the operators still have to take time to scan both the OCR value and the page in order to determine accuracy. In many cases, this can take LONGER than just typing a blank field. So the reality is that this safety measure does more harm than good. It can be done away with, but only if you implement this method.
In this approach, OCR output is tuned at the individual field level to achieve a specific error rate, say 1 percent. The result is that accepted OCR output can ALWAYS meet your required accuracy without any need for further evaluation. Only data that cannot be extracted or that falls below a specified threshold goes to operators, saving significant time and ultimately significant money. If you have 400 data entry operators, savings can amount to millions of dollars each year, often with an overall improvement in data quality!
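One common way to tune a field to a target error rate is to pick a confidence threshold from a labeled validation set. The sketch below assumes the OCR engine reports a confidence score for each field value and that you have samples whose correctness has already been checked; the function name and data layout are hypothetical, and this is not a description of any particular vendor's method.

```python
from typing import List, Optional, Tuple


def pick_threshold(
    samples: List[Tuple[float, bool]],
    target_error: float,
) -> Optional[float]:
    """Find the lowest confidence threshold that keeps errors within target.

    `samples` holds (confidence, is_correct) pairs for ONE field type.
    A value is accepted when its confidence is >= the returned threshold.
    Returns None if no threshold meets the target on this sample.
    """
    ranked = sorted(samples, key=lambda s: s[0], reverse=True)
    best: Optional[float] = None
    accepted = 0
    errors = 0
    i = 0
    while i < len(ranked):
        confidence = ranked[i][0]
        # Samples that share a confidence value are accepted together.
        while i < len(ranked) and ranked[i][0] == confidence:
            accepted += 1
            if not ranked[i][1]:
                errors += 1
            i += 1
        # Keep lowering the threshold while the accepted set stays on target.
        if errors / accepted <= target_error:
            best = confidence
    return best


if __name__ == "__main__":
    # Hypothetical validation results for a single field.
    validation = [(0.99, True), (0.95, True), (0.90, True),
                  (0.80, False), (0.70, True)]
    print(pick_threshold(validation, target_error=0.10))
```

Each field type would get its own threshold, and values scoring below it would be routed to operators. A real validation set needs to be large enough for the measured error rate to be statistically meaningful, which is part of the statistical work this approach requires.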
Not all software can be tuned to this level and not all organizations have the technical or statistical prowess to tackle such a project. That’s where Parascript comes in. To find out more, check out the Parascript Accelerator Program for BPOs.