Parascript is working with more and more service providers that need to perform data extraction, either as a value-add service to end users or as a critical part of their primary managed business service offering. In either case, it is more than likely that this data extraction is accomplished using a combination of computers and humans. The workflow goes something like this:
- Documents are submitted.
- Computers attempt to locate and extract required data.
- Any required data that cannot be located, or any extracted data whose confidence falls below a certain threshold, is routed to a human for data entry/correction.
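To make the routing step concrete, here is a minimal sketch in Python. The field names, confidence values, and the 0.85 threshold are purely illustrative assumptions, not the interface of any particular extraction product:

```python
# Minimal sketch of the routing step described above.
# Field names, confidence scores, and the 0.85 threshold are hypothetical;
# real extraction software exposes its own field/confidence structures.

CONFIDENCE_THRESHOLD = 0.85

extracted_fields = [
    {"name": "invoice_number", "value": "INV-10382", "confidence": 0.97},
    {"name": "invoice_date",   "value": "2023-04-17", "confidence": 0.91},
    {"name": "total_amount",   "value": None,         "confidence": 0.0},   # not located
    {"name": "po_number",      "value": "PO-77812",   "confidence": 0.62},  # low confidence
]

auto_accepted, needs_review = [], []
for field in extracted_fields:
    # Route anything that was not located, or that falls below the threshold,
    # to a human for data entry/correction.
    if field["value"] is None or field["confidence"] < CONFIDENCE_THRESHOLD:
        needs_review.append(field)
    else:
        auto_accepted.append(field)

print(f"Auto-accepted: {[f['name'] for f in auto_accepted]}")
print(f"Sent to human review: {[f['name'] for f in needs_review]}")
```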
While this workflow seems basic, two of its critical components are often not well understood, let alone well managed, by many businesses.
The first has to do with the threshold. I won’t go into specifics, as there is a great blog that covers the use of confidence scores, thresholds, and operating points that you can find here. The key point about the threshold is that, unless a business actually goes to the trouble of analyzing the accuracy of its data extraction software (not just the OCR), the human data entry/correction function is either doing too much or not enough. In the former case, per-page costs rise because staff review data that doesn’t need to be reviewed. In the latter, erroneous data slips through unnoticed.
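As a rough sketch of what that analysis can look like, the snippet below sweeps candidate thresholds over a small labeled sample and picks the lowest threshold whose auto-accepted data meets a target accuracy, which also maximizes the automation rate. The sample values and the 99 percent target are hypothetical assumptions, not measured results:

```python
# Sketch of how an operating point might be chosen from a labeled sample.
# The sample data and the 99% target are assumptions; the point is that the
# threshold should come from measured accuracy, not guesswork.

# (predicted_value, confidence, true_value) for a small validation sample
validation = [
    ("1250.00", 0.99, "1250.00"),
    ("1250.00", 0.95, "1250.00"),
    ("2023-04-17", 0.90, "2023-04-17"),
    ("INV-10382", 0.82, "INV-10832"),  # an error with mid confidence
    ("PO-77812", 0.60, "PO-77812"),
    ("45.10", 0.55, "451.0"),          # an error with low confidence
]

TARGET_ACCURACY = 0.99

def operating_point(samples, target):
    """Return the lowest threshold whose auto-accepted data meets the target."""
    for threshold in sorted({conf for _, conf, _ in samples}):
        accepted = [(p, t) for p, conf, t in samples if conf >= threshold]
        if not accepted:
            continue
        accuracy = sum(p == t for p, t in accepted) / len(accepted)
        automation_rate = len(accepted) / len(samples)
        if accuracy >= target:
            # Lowest qualifying threshold keeps the automation rate as high as possible.
            return threshold, accuracy, automation_rate
    return None

result = operating_point(validation, TARGET_ACCURACY)
if result:
    threshold, accuracy, automation = result
    print(f"Threshold {threshold}: accuracy {accuracy:.1%}, automation rate {automation:.0%}")
```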
Considering the human side of data extraction, it is generally accepted that people make errors, on average, around 3 percent of the time when performing data entry or validation. Suddenly, a goal of 99 percent data accuracy seems very difficult to achieve. Where do businesses even start in order to measure accuracy?
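A bit of back-of-the-envelope arithmetic shows why. In the hypothetical blend below, automation handles 70 percent of fields at a 1 percent error rate and humans key the rest at 3 percent; both rates are assumptions for illustration:

```python
# Back-of-the-envelope arithmetic behind the "99 percent is hard" point.
# The automation rate and automated error rate are illustrative assumptions.

automation_rate = 0.70        # share of fields accepted automatically (assumed)
automated_error_rate = 0.01   # error rate of the auto-accepted data (assumed)
human_error_rate = 0.03       # widely cited average for manual keying

overall_error = (automation_rate * automated_error_rate
                 + (1 - automation_rate) * human_error_rate)

print(f"Overall error rate: {overall_error:.1%}")      # 1.6%
print(f"Overall accuracy:   {1 - overall_error:.1%}")  # 98.4%
```

Even in this fairly optimistic scenario, the blended accuracy lands around 98.4 percent, still short of the 99 percent goal.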
In another article, the concept of ground truth data is introduced; it is the single most important aspect of understanding the accuracy of data extraction software. The problem arises when businesses use the output of their current data extraction operations as ground truth data for analyzing data extraction software accuracy. Why is that a problem when it seems like a logical place to start? The answer lies in that 3 percent human error rate coupled with misaligned data extraction thresholds.
For illustration, consider a business that performs data entry manually. If that business does not audit its data entry and/or use double-blind data entry (where two data entry staff key the same field and only exact matches are accepted), then it is highly likely that its output contains errors at least 3 percent of the time. If this business decides to automate through data extraction, using this data as ground truth means the measured accuracy of the data extraction software can be no higher than 97 percent, and will most likely be reported as considerably lower. This, in turn, leads the business to set the threshold more conservatively than necessary, sending accurate data for correction when it shouldn’t be, which means higher workloads for data entry staff and reduced efficiency.
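A quick simulation makes that ceiling concrete. Here, hypothetically, a perfect extractor and a 99-percent-accurate extractor are both scored against reference data that is itself only 97 percent accurate:

```python
import random

# Simulation of the effect described above: scoring an extractor against
# reference data that is itself only ~97 percent accurate caps the measured
# accuracy at ~97 percent. All rates here are illustrative assumptions.

random.seed(0)
NUM_FIELDS = 100_000
GROUND_TRUTH_ERROR_RATE = 0.03  # the ~3 percent unaudited human keying error

true_values = [f"value_{i}" for i in range(NUM_FIELDS)]

# "Ground truth" produced by unaudited manual keying: wrong ~3% of the time.
reference = [v if random.random() > GROUND_TRUTH_ERROR_RATE else v + "_typo"
             for v in true_values]

for extractor_error_rate in (0.00, 0.01):  # a perfect and a 99%-accurate extractor
    extracted = [v if random.random() > extractor_error_rate else v + "_miss"
                 for v in true_values]
    measured = sum(e == r for e, r in zip(extracted, reference)) / NUM_FIELDS
    print(f"True accuracy {1 - extractor_error_rate:.0%} -> measured {measured:.1%}")

# Prints roughly: True accuracy 100% -> measured ~97%
#                 True accuracy  99% -> measured ~96%
```

Neither extractor can measure above roughly 97 percent; both are understated by roughly the reference data’s own 3 percent error rate.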
In another example, a service provider already uses automated data extraction combined with manual data correction. If that service provider does not have a good understanding of data extraction accuracy and sets the threshold too low, erroneous data is treated as accurate and allowed to be used without further review. This results in less data entry workload, but increased errors.
The answer is that, while the output of current data extraction and entry operations is a good place to start, the data treated as ground truth must itself undergo quality review to verify that it is 99.999 percent accurate; only then can measurements be trusted to be as accurate as possible. This is what we do on behalf of our clients to ensure that initial and ongoing operations are successful and dependable.