With the age of “big data” and “information chaos” upon us, a lot of attention has turned to how to “tame” the large volumes of information crossing the physical and virtual boundaries of organizations. The concept of information governance, while not entirely new, has gained prominence in the media and within industry.
Within information governance, classification has become a major buzzword, often positioned as a silver bullet for the governance challenge. With the ability to take a set of unknown documents and automatically sort, organize, and describe them, document classification would seem to be the ideal solution.
Classification, however, can mean many things, and depending on the business problem, one document classification solution can be very different from another.
In this article, we’ll touch upon two major types of classification: text and visual.
Text Classification
Classification of text is all about deriving relevance to a given subject and essentially falls into two major camps: statistical and semantic. The two can be “blended,” but most solutions still fall into one camp or the other. Semantic classification is all about deriving relationships using the grammatical meaning of words. Most often, semantic analysis extends the analysis of words by adding synonyms and nearby words in an attempt to derive the specific topic. Natural language processing and latent semantic analysis are improvements within this area but are still essentially grammatically based techniques. Depending upon the language, slang or abbreviations can cause “hiccups” during classification when a word is assigned the incorrect meaning. As a result, semantic-based classification requires a lot of work and upkeep.
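To make the synonym idea concrete, here is a minimal sketch of semantic-style classification. The synonym table, the topic definitions, and the function names are all hypothetical and invented for illustration; a real product would rely on a far larger thesaurus and proper language analysis.

```python
# Hypothetical, hand-maintained synonym table (an assumption for this sketch).
# Each canonical term maps to words treated as meaning the same thing.
SYNONYMS = {
    "invoice": {"bill", "statement"},
    "payment": {"remittance", "settlement"},
    "contract": {"agreement", "deal"},
}

# Hypothetical topics, each described by a few canonical terms.
TOPICS = {
    "billing": {"invoice", "payment"},
    "legal": {"contract", "clause"},
}


def expand(words):
    """Replace every known synonym with its canonical term."""
    canonical = {syn: term for term, syns in SYNONYMS.items() for syn in syns}
    return {canonical.get(word, word) for word in words}


def classify_semantic(text):
    """Return the topic sharing the most canonical terms with the text, or None."""
    words = expand(text.lower().split())
    scores = {topic: len(words & terms) for topic, terms in TOPICS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(classify_semantic("please send the bill and remittance"))
print(classify_semantic("what a deal on shoes"))
```

The second call shows the kind of “hiccup” described above: the informal use of “deal” is read as a synonym for “contract,” so a sentence about shoes lands in the legal topic. Avoiding that means maintaining the tables by hand, which is where the upkeep comes from.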
Statistical classification, on the other hand, dispenses with deriving meaning and focuses instead on the statistical relationships between words within a given volume of text. Using various statistical models, it is possible to determine relevance without relying on grammatical elements that require a lot of upkeep.
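As one illustration of the statistical approach, the sketch below uses a small naive Bayes model with add-one smoothing. This is only one of many possible statistical models, chosen here because it fits in a few lines; the training examples, labels, and function names are assumptions made up for the example.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labeled examples (invented for this sketch).
TRAINING = [
    ("invoice total amount due payment", "billing"),
    ("payment received invoice number", "billing"),
    ("agreement between parties clause term", "legal"),
    ("clause liability agreement signed", "legal"),
]


def train(labeled_docs):
    """Count words per label; no grammar or synonym tables are involved."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, label in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, doc_counts, vocab


def classify_statistical(text, model):
    """Return the label with the highest naive Bayes log-probability."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                count = word_counts[label][word] + 1  # add-one smoothing
                score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAINING)
print(classify_statistical("amount due on invoice", model))
```

Nothing in the model knows what “invoice” means; the label follows purely from how often each word appeared under each label in the training examples.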
As mentioned earlier, the two methods are increasingly used to support one another, but most classification solutions still have a “core” of one or the other.
Visual Classification
While most of the attention is put on text-based capabilities, image-based classification using computer vision and pattern recognition offers very capable classification in its own right. This is because the visual components of a document (the presence of logos, tables, and layout) carry a lot of information that can be used to group documents without requiring text-based approaches at all. It’s the same as sitting a person down at a desk and asking them to group the documents in a folder. The person will often start by organizing like documents based upon visual cues alone, especially if they are not a subject matter expert. Solutions using image-based classification vary in the granularity of the visual elements they can use: some take every visual component into consideration, while others group by overall layout only, ignoring logos and other picture-like elements. A benefit of visual classification is that it can be much quicker than text-based techniques, especially if text first has to be extracted with optical character recognition.
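The grouping-by-layout idea can be sketched with a toy example. Here each page is a tiny grid of characters in which “#” marks ink, standing in for a scanned image; real solutions work on actual images with computer vision techniques. The grid format, the distance measure, the threshold, and the function names are all assumptions for illustration only.

```python
def layout_signature(page, rows=2, cols=2):
    """Return the ink density of each grid cell of a page.

    The page is a list of equal-length strings where '#' marks ink.
    """
    height, width = len(page), len(page[0])
    signature = []
    for r in range(rows):
        for c in range(cols):
            cell = [
                page[y][x]
                for y in range(r * height // rows, (r + 1) * height // rows)
                for x in range(c * width // cols, (c + 1) * width // cols)
            ]
            signature.append(cell.count("#") / len(cell))
    return signature


def distance(a, b):
    """Sum of absolute differences between two signatures."""
    return sum(abs(x - y) for x, y in zip(a, b))


def group_by_layout(pages, threshold=0.5):
    """Greedily put each page in the first group whose layout is close enough."""
    groups = []  # each entry: (representative signature, [page indices])
    for index, page in enumerate(pages):
        signature = layout_signature(page)
        for representative, members in groups:
            if distance(representative, signature) <= threshold:
                members.append(index)
                break
        else:
            groups.append((signature, [index]))
    return [members for _, members in groups]


# Two "letters" with a logo in the top-left corner and one "form"
# with a table across the bottom half.
letter_a = ["##..", "##..", "....", "...."]
letter_b = ["##..", "#...", "....", "...."]
form = ["....", "....", "####", "####"]
print(group_by_layout([letter_a, form, letter_b]))
```

No text is read at any point: the two letters end up together and the form stands alone purely because of where the ink sits on the page, which is also why this kind of pass can run before, or instead of, optical character recognition.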
Summing It All Up
Just as the two major text-based techniques are often blended, it can also be of great benefit to combine visual and text classification. Visual classification can act as an efficient “first pass,” with text classification called into action for documents that cannot be grouped visually or when more detail about a document is required.
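The “first pass” arrangement might look something like the following sketch. The two classifier functions are deliberately trivial stand-ins, and the confidence threshold, the document fields, and every name here are hypothetical; the point is only the order in which the two passes are consulted.

```python
def classify_document(doc, visual_pass, text_pass,
                      min_confidence=0.8, need_detail=False):
    """Try the visual pass first; fall back to text when it is not enough.

    visual_pass(doc) must return (label_or_None, confidence).
    text_pass(doc) must return a label.
    Returns (label, name of the pass that produced it).
    """
    label, confidence = visual_pass(doc)
    if label is not None and confidence >= min_confidence and not need_detail:
        return label, "visual"
    return text_pass(doc), "text"


# Stand-in classifiers, assumptions for illustration only.
def visual_stub(doc):
    known_layouts = {"invoice_template": ("invoice", 0.95)}
    return known_layouts.get(doc["layout"], (None, 0.0))


def text_stub(doc):
    return "contract" if "agreement" in doc["text"].lower() else "unknown"


scanned_invoice = {"layout": "invoice_template", "text": ""}
plain_letter = {"layout": "plain", "text": "This Agreement is made today"}

print(classify_document(scanned_invoice, visual_stub, text_stub))
print(classify_document(plain_letter, visual_stub, text_stub))
```

Because the text pass runs only on documents the visual pass could not place, or when `need_detail` asks for more, the slower step is reserved for the documents that actually need it.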
For big data applications, organizations will get the most value from capabilities that span all types of unstructured information and use all of the data on a document, both visual and textual, for classification purposes.