Imagine being assigned a task to group apples by type. Easy enough, you can organize into colors (green, red, orange-red, yellow), size (large and small) and taste/texture (sweet, tart, juicy, crisp, etc.).
Now imagine being tasked with grouping apples that you are only allowed to see in black-and-white photos. All of a sudden, your choices of groups are limited. If you need to group by taste or color, you are likely to make a lot of mistakes. Doesn’t it make sense to consider all the attributes of an apple when assigning it into a group? Of course, it does.
With document classification, most attempts try to get it right with just one or two types of attributes, either the text content, the file metadata, or visual elements such as presence of table data. In fact, most classification tools use text content. This makes about as much sense as grouping apples based on black-and-white photos. To achieve accurate classification while also producing the lowest rates of errors requires another approach.
Take, for instance, a class of documents such as “benefits application”. It’s conceivable that a single text attribute approach might be able to assign documents into this class by looking at elements like the presence of certain words such as “benefits”, “vision”, “HRA”, and some other types of context-specific language. It might also be conceivable that you could use visual analysis to assign documents into a class based upon a logo of a benefits provider.
A glance at the forms at the top of this article shows that while some forms have distinctive visual clues such as logos, others do not. Additionally, if you could review the text, you would find that there is no consistent set of words or phrases that exist across even these four examples. Additionally, evaluating structural characteristics such as presence of tables or blocks of data might be able to assign some, but you would probably get a lot of false positives with other document types such as claims forms and invoices. And if you’re trying to classify by the written information within them to further classify them based upon the covered person, then results will be even worse due to ability to use handwritten information.
Now let’s consider an alternative approach; one that allows a single classification workflow based upon the synthesis of text content, layout, visual elements, and other content such as either typed or handwritten information. Using this approach and having a relevant set of samples, machine learning can use any and all of the available information on a document when creating the classification algorithms as well as when analyzing a candidate document. When you consider a classification project that can include multiple document types, having the ability to use all of the information becomes even more important as both the complexity and the chance of errors increase with each new document type and variant.
Using the above same four variants, classification can analyze the fields, the entered information, the structure, and visual elements to develop hypotheses for each candidate document. As with the classification of apples, it’s much better to use all of the available information in order to tell one apple from another and an apple from an orange.
If you found this article interesting, you may find this video useful or another on document classification and information governance: