In this industry, we all hear claims about the accuracy of data, whether it is a technology vendor stating over 99 percent OCR accuracy or a services provider stating that its data quality is 100 percent guaranteed.
How do we really verify that performance? Do you have data quality procedures in place to actually measure the quality of the data that you receive? Take, for instance, the popular 99 percent accuracy claim. If you regularly process 100,000 documents, then to test that accuracy percentage with any reliable measure, you would have to check the results on 1,000 documents at a minimum. Even that sample would leave a margin of error too large to tell you whether you are getting exactly 99 percent accuracy. To reduce the margin of error to 0.1 percent (at 95 percent confidence, using the conservative worst-case estimate), you would need to measure over 90,000 documents from that pool of 100,000.
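The sampling arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not a definitive method: it assumes simple random sampling, a 95 percent confidence level (z = 1.96), the standard normal-approximation formula for a proportion, and a finite population correction. The function names and parameters are hypothetical. With the conservative worst-case setting p = 0.5, a sample of 1,000 from 100,000 gives a margin of roughly three percentage points, and a margin of 0.1 percentage points requires a sample of about 90,570.

```python
import math


def margin_of_error(sample_size, population, p=0.5, z=1.96):
    """Half-width of the confidence interval for a proportion.

    p is the assumed true proportion (0.5 is the conservative worst case),
    z is the normal critical value (1.96 for 95 percent confidence).
    The finite population correction shrinks the margin as the sample
    approaches the size of the whole population.
    """
    standard_error = math.sqrt(p * (1 - p) / sample_size)
    correction = math.sqrt((population - sample_size) / (population - 1))
    return z * standard_error * correction


def required_sample_size(margin, population, p=0.5, z=1.96):
    """Smallest sample whose margin of error is at most `margin`."""
    # Sample size for an effectively infinite population.
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Adjust downward for the finite population.
    return math.ceil(n0 / (1 + (n0 - 1) / population))


population = 100_000

# Margin for a 1,000-document sample, worst case and at an assumed 99 percent.
print(f"{margin_of_error(1_000, population):.4f}")
print(f"{margin_of_error(1_000, population, p=0.99):.4f}")

# Documents needed for a margin of 0.1 percentage points.
print(required_sample_size(0.001, population))
```

If the true accuracy really is close to 99 percent, passing p=0.99 gives a tighter margin for the same sample (about 0.6 percentage points for 1,000 documents), but that is still too wide to separate 99 percent from, say, 98.5 percent.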
In addition, to measure data extraction accuracy, you need something against which to compare actual results. In the sciences, this is referred to as “ground truth.” We use it all the time to ensure that our measurements of data extraction accuracy are as precise as possible. Understanding stated accuracy compared to actual accuracy is important because the “downstream” effects of erroneous data can be dramatic. Also, the accuracy of systems designed for data extraction has a way of “drifting” due to changes in the data, whether in the data source, the format, or some other unnoticed variation. Variations occur all the time.
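As a rough sketch of what comparing against ground truth can look like in practice, the Python below scores extracted fields against a manually verified set and flags batches whose accuracy has drifted below a baseline. Everything here is an assumption made for illustration: the dictionary layout, the field names, the exact-match comparison, the sample values, and the one-point tolerance are hypothetical and would need to fit your own data.

```python
def field_accuracy(extracted, ground_truth):
    """Fraction of ground-truth fields whose extracted value matches exactly.

    Both arguments map a document ID to a dict of field name -> value.
    A field missing from the extracted data counts as an error.
    """
    total = 0
    correct = 0
    for doc_id, truth_fields in ground_truth.items():
        extracted_fields = extracted.get(doc_id, {})
        for field, truth_value in truth_fields.items():
            total += 1
            if extracted_fields.get(field) == truth_value:
                correct += 1
    return correct / total if total else 0.0


def drifted_batches(batch_accuracies, baseline, tolerance=0.01):
    """Names of batches whose accuracy fell more than `tolerance` below baseline."""
    return [name for name, accuracy in batch_accuracies.items()
            if accuracy < baseline - tolerance]


# Hypothetical, manually verified values for two documents.
ground_truth = {
    "doc1": {"invoice_no": "1001", "total": "250.00"},
    "doc2": {"invoice_no": "1002", "total": "75.10"},
}
# Hypothetical system output; one field was misread.
extracted = {
    "doc1": {"invoice_no": "1001", "total": "250.00"},
    "doc2": {"invoice_no": "1002", "total": "75.70"},
}

print(field_accuracy(extracted, ground_truth))

# Hypothetical accuracy per batch, measured the same way over time.
history = {"jan": 0.99, "feb": 0.985, "mar": 0.96}
print(drifted_batches(history, baseline=0.99))
```

Running the same measurement on every batch, rather than once at acceptance, is what makes drift visible.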
Measuring accuracy is a major effort. It is not surprising that many organizations have little, if any, insight into the quality of the data that they use and rely upon every day. Interestingly, organizations that have a high degree of confidence in their data quality measurements often have a false sense of security. In one case, for a major business, we found that the actual error rate of their data perfection staff was 10 TIMES HIGHER than their own measurements indicated. In another case, an organization assured us that its vendor was providing a given rate of accuracy, only for us to discover that actual accuracy rates were over 20 percent WORSE than reported.
Don’t believe the hype. Getting high-quality data is a major challenge, and even the best-run organizations struggle with it. If a technology or service provider claims that it can get you a certain level of accuracy, ask for the proof. Then design a way to consistently measure and validate what the provider delivers. Better yet, make the provider consistently supply the proof itself.