Document classification usually relies on one generally accepted approach: the use of text content. A variety of techniques can use that content to assign a document type, or class, to any given document.
Yet we made an interesting finding when we examined how classifiers perform with alternative methods of classification. For instance, while text is most often used as the primary input, visual analysis of a document can also be used to determine its class. And when more than one classification technique is used, the results can be striking.
In the examples provided, the measurements are based on OCR-derived text, which can include errors, but the positive effects of combining classifiers still hold. In all measurements, Parascript document classifiers were used.
We Cannot Live on Text Alone
When measuring the results of content alone, we found that using text derived from a popular OCR product yielded reasonable results.
For instance, a company that requires 99 percent accuracy for assigned document classes could automate about 70 percent of its document classification workload, letting those documents flow straight through with no review at all. The remaining 30 percent of documents would require manual review.
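One way to picture how an automation rate such as this is measured at a target accuracy: rank a labeled test set by classifier confidence and find the largest share of documents that can be accepted without review while still meeting the target. The sketch below is only an illustration under that assumption. The function name and the toy scores are hypothetical, and it does not describe how Parascript classifiers compute confidence.

```python
def straight_through_rate(scored, target_accuracy):
    """Return the largest fraction of documents that can skip review.

    `scored` is a list of (confidence, is_correct) pairs from a labeled
    test set. Documents are accepted from the most confident downward,
    and the cut is only allowed between distinct confidence values.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    best = 0
    correct = 0
    for count, (confidence, is_correct) in enumerate(ranked, start=1):
        correct += is_correct
        # A threshold cannot separate documents with identical confidence.
        at_boundary = count == len(ranked) or ranked[count][0] < confidence
        if at_boundary and correct / count >= target_accuracy:
            best = count
    return best / len(ranked) if ranked else 0.0


# Toy, made-up scores purely for illustration (not measured data).
toy_scores = [
    (0.99, True), (0.97, True), (0.95, True),
    (0.90, False), (0.85, True), (0.60, False),
]
print(straight_through_rate(toy_scores, 1.0))
print(straight_through_rate(toy_scores, 0.75))
```

Everything below the chosen cut goes to manual review, which is the remaining share described above.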
A Picture is Worth an Extra 30%
When visual classification is used in combination with text classification, the amount of manual review required to achieve the same 99 percent accuracy is more than halved, with just over 10 percent of documents requiring verification. If a company processes 12 million documents per year, this simple combination means about 2.4 million fewer documents require review, which can translate to over 2000 hours of savings while reaching a level of accuracy that a person could never achieve.
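The arithmetic behind these figures can be checked with a few lines. This is a rough sketch: the review rate for the combined approach is rounded down to 10 percent, and the three seconds of review time per document is an assumption chosen to match the 2000-hour figure, not a number stated in the measurements.

```python
def fewer_reviews(docs_per_year, review_pct_before, review_pct_after):
    """Documents per year that no longer need manual review."""
    return docs_per_year * (review_pct_before - review_pct_after) // 100


def hours_saved(docs, seconds_per_review):
    """Reviewer hours saved, given an assumed review time per document."""
    return docs * seconds_per_review / 3600


# 30 percent reviewed with text alone, about 10 percent with text plus visual.
saved_docs = fewer_reviews(12_000_000, 30, 10)
print(saved_docs)                  # 2400000
print(hours_saved(saved_docs, 3))  # 2000.0 (3 seconds per review is assumed)
```

A longer assumed review time per document raises the hours saved proportionally.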
Now that is what I call a beautiful picture.
_________
If you found this article interesting, you might read “KMWorld: Leveraging Automated Classification”.