We are often asked about our classification capabilities, which complement our intelligent image and document capture. Since many solution providers in the document capture arena talk about providing “automatic document classification,” it’s worth exploring what this means.
In most cases, it means grouping the pages of a batch in order to identify where one document ends and the next begins. Traditionally, this has been accomplished with what are called “separator pages.” These are either blank pages inserted between documents or pages of the documents themselves, and they carry barcodes that establish the document type. Barcodes are either pre-printed or affixed to the document during batch preparation. This is commonly referred to as “document separation.”
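As a rough illustration of barcode-driven document separation, the sketch below splits a scanned batch wherever a page carries a barcode. The page representation, the `split_on_separators` function and the `drop_separator` flag are all hypothetical names chosen for this example; they do not describe any particular capture product.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical page model: (page_id, barcode_value). barcode_value is None
# when no barcode was read on that page.
Page = Tuple[str, Optional[str]]


def split_on_separators(pages: List[Page], drop_separator: bool = True) -> List[Dict]:
    """Group pages into documents, starting a new document at each barcoded page.

    drop_separator=True models blank separator sheets that are discarded.
    drop_separator=False models barcodes placed on the document's own first page.
    """
    documents: List[Dict] = []
    for page_id, barcode in pages:
        if barcode is not None:
            # A barcode marks a boundary and names the document type.
            documents.append({"type": barcode, "pages": []})
            if not drop_separator:
                documents[-1]["pages"].append(page_id)
        else:
            if not documents:
                # Pages that arrive before any barcode have no known type.
                documents.append({"type": "UNKNOWN", "pages": []})
            documents[-1]["pages"].append(page_id)
    return documents


batch = [
    ("sep1", "INVOICE"), ("p1", None), ("p2", None),
    ("sep2", "CLAIM"), ("p3", None),
]
for doc in split_on_separators(batch):
    print(doc["type"], doc["pages"])
```

The logic itself is trivial; the cost lies in preparing the batch so that every document really does start with a correctly barcoded page.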
The major benefit of automatic classification in this setting is that it removes the need for batch preparation and the manual work that goes with it. It also (theoretically) removes the potential for error.
Common ways to classify documents for this purpose involve creating rules that locate keywords or specific anchor zones, which serve to differentiate one document type from another or to establish document boundaries. Setting this up does involve work, but the idea is to configure it once and reuse it for every batch.
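A keyword rule of this kind can be sketched in a few lines. The example below assumes each page has already been converted to text (for instance by OCR) and that a first page can be recognized by a fixed set of phrases. The `RULES` table, the phrases in it and both function names are hypothetical and would need to be replaced with rules suited to the actual documents.

```python
import re
from typing import Dict, List, Optional

# Hypothetical rule table: a page starts a document of the given type when
# every listed phrase appears in its text.
RULES: Dict[str, List[str]] = {
    "invoice": ["invoice number", "amount due"],
    "claim_form": ["claim number", "policy holder"],
}


def match_first_page(text: str, rules: Dict[str, List[str]] = RULES) -> Optional[str]:
    """Return the document type whose keywords all occur in the page, else None."""
    normalized = re.sub(r"\s+", " ", text.lower())
    for doc_type, keywords in rules.items():
        if all(keyword in normalized for keyword in keywords):
            return doc_type
    return None


def classify_batch(page_texts: List[str]) -> List[Dict]:
    """Start a new document at every page that matches a rule.

    Pages that match nothing are treated as continuation pages of the
    document currently being built.
    """
    documents: List[Dict] = []
    for index, text in enumerate(page_texts):
        doc_type = match_first_page(text)
        if doc_type is not None or not documents:
            documents.append({"type": doc_type or "unknown", "pages": [index]})
        else:
            documents[-1]["pages"].append(index)
    return documents


pages = [
    "INVOICE NUMBER: 17\nAmount   due: $40",
    "Terms and conditions of sale",
    "Claim Number 9, Policy Holder: A. Smith",
]
for doc in classify_batch(pages):
    print(doc)
```

An anchor-zone rule would work the same way, except that the match is restricted to text found in a specific region of the page rather than anywhere on it.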
Next Generation Classification
How is this traditional classification approach different from other types of classification, such as classification used for information discovery?
First, classification used within a batch capture process typically has to deal with only a limited set of documents, so its scope is much smaller.
Second, since the scope of documents is smaller, it is easier to use keywords or known document attributes to identify and differentiate one document type from another and to provide explicit rules to establish document boundaries.
In many cases for small data sets, traditional classification solutions simply provide support for zones and keyword searches in order to accomplish this type of batch-oriented document classification.
For information discovery and governance, classification needs are much more complex. First, there is often no concept of a batch. In fact, many times the need to classify documents starts with a completely unknown volume of documents. There is no ability to define batches because they simply do not exist. Second, due to the potentially large variance of document types, it is impossible to pre-define common document characteristics that can be used to distinguish one document from another, let alone establish document boundaries. Third, since we are dealing with a potentially large variety of documents even within a specific document class, it is impractical to apply any one type of classification rule or method.
Discovery-based classification poses a much deeper, more complex challenge than the traditional document-separation problem associated with batch capture. The result is that most document capture solutions provide only the most basic classification via document separation and are unable to meet actual discovery and data governance needs.
Enter comprehensive automated classification. This is classification that requires little user input to define classes and then uses those classes to automate a wide range of document classification needs. Next generation classification applications can be trained and fine-tuned for any document type. Visual elements are combined with content-based classification, and then user-specific rules, if necessary, are applied to establish high-quality document classification results. Can this type of document classification be used for “batch classification”? Absolutely. In addition, the classification is easier to use, more flexible and part of our new FormXtra 6.0 release.
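To make the idea of combining visual and content-based classification with user rules more concrete, here is a minimal sketch. It is not a description of how FormXtra works internally. The weighted average, the `threshold` value, the `needs_review` label and all function names are assumptions made purely for illustration, and the per-class scores are taken as given rather than computed.

```python
from typing import Callable, Dict, Optional, Sequence

Scores = Dict[str, float]
# Hypothetical rule signature: receives the page text and the predicted label,
# and returns a replacement label or None to leave the prediction unchanged.
Rule = Callable[[str, str], Optional[str]]


def combine_scores(visual: Scores, content: Scores, visual_weight: float = 0.5) -> Scores:
    """Blend per-class scores from a visual and a content classifier."""
    classes = set(visual) | set(content)
    return {
        name: visual_weight * visual.get(name, 0.0)
        + (1.0 - visual_weight) * content.get(name, 0.0)
        for name in classes
    }


def classify(
    visual: Scores,
    content: Scores,
    text: str,
    rules: Sequence[Rule] = (),
    threshold: float = 0.6,
) -> str:
    """Pick the best blended class, then let user rules override it."""
    combined = combine_scores(visual, content)
    if combined:
        best = max(combined, key=combined.get)
        label = best if combined[best] >= threshold else "needs_review"
    else:
        label = "needs_review"
    for rule in rules:
        override = rule(text, label)
        if override is not None:
            label = override
    return label


def credit_note_rule(text: str, label: str) -> Optional[str]:
    # Example user rule: invoices mentioning "credit note" are reclassified.
    if label == "invoice" and "credit note" in text.lower():
        return "credit_note"
    return None


print(classify(
    {"invoice": 0.9, "contract": 0.1},
    {"invoice": 0.7, "contract": 0.3},
    "Credit Note for order 1001",
    rules=[credit_note_rule],
))
```

Applying the rules last mirrors the order described above: the trained classifiers handle the general case, and explicit rules are reserved for the exceptions a particular user cares about.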