When someone says “document classification,” what immediately comes to mind? Chances are, the answer depends heavily on who you are, what role you play within an organization, and how familiar you are with document capture or management software.
The key to a good definition is a good understanding of how document classification is accomplished. The reality is that there is no single clear answer to how document classification is applied. At its most basic, classification is simply the grouping of documents based upon similarity. The objective is to understand what a document is so that it can be used and managed in a manner specific to that type of document.
Where it gets complicated is that there are many definitions of similarity, many reasons to group documents, and just as many techniques and technologies for doing so.
Take, for instance, basic document imaging. Here, classification is simply the structured metadata that a person adds to a document by looking at it and typing in the data. Often there is pre-defined metadata from which to select, but sometimes there is not.
Classification could also mean simply matching a document against a pre-defined template. I can provide a sample image of a government healthcare claim form and easily match other documents to it, either manually or with software.
Or classification can be based upon the presence of specific information on the document. For example, if I want to classify a volume of documents by account, I would look for a specific account number, the name of the account holder, or some other set of unique identifiers.
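To make the account example concrete, here is a minimal sketch of identifier-based classification in Python. It assumes the document text has already been extracted (for example by OCR), and the account numbers, holder names, and the function name `classify_by_account` are all hypothetical, made up purely for illustration.

```python
# Hypothetical account registry: account number -> account holder name.
# In a real system this would come from a line-of-business database.
ACCOUNTS = {
    "4471-0092": "Maria Lopez",
    "8830-1175": "James Chen",
}

def classify_by_account(text):
    """Return the account number whose identifiers appear in the text, or None."""
    # Collapse whitespace and lowercase so line breaks and casing do not matter.
    normalized = " ".join(text.split()).lower()
    for number, holder in ACCOUNTS.items():
        if number in normalized or holder.lower() in normalized:
            return number
    return None

print(classify_by_account("Statement for account 4471-0092, issued March 3."))  # 4471-0092
print(classify_by_account("Dear James Chen, your claim was received."))         # 8830-1175
print(classify_by_account("General notice to all customers."))                  # None
```

Even in a sketch this small, someone still has to supply the list of identifiers and decide what counts as a match, which is exactly the rule-writing effort described here.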
All of these are common “classification” approaches in the document capture software and document management industry.
The one thing common to all of these is that someone has to either classify the document or create the rules that perform classification. Either one is significant up-front work, and it requires subject matter experts.
To deal with this expense, there have been quests for a silver bullet. In the early 2000s, enterprise search was thought to be the answer. Why bother with tagging and classifying documents in the first place when you can extract the full text of documents, add it to a search index, and then give staff a search tool?
As a result of skipping classification, organizations of all sizes experienced all sorts of interesting consequences, from people accessing documents that contained sensitive information to loss of data when IT followed normal data expunging policies. So search has become a great way to find documents, but it doesn’t solve the classification problem. Search has been added to the arsenal of classification techniques as another way to tag a document, based upon the relative weighting and frequency of specific terms. But by itself, it remains a fairly clumsy way to classify documents.
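One common way to express “relative weighting and frequency of specific terms” is TF-IDF, which favors terms that are frequent in one document but rare across the collection. The sketch below is an illustrative assumption, not a description of how any particular search product tags documents; the function names and sample texts are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def top_terms(docs, k=2):
    """Return the k highest-scoring TF-IDF terms for each document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    results = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Term frequency within the document times inverse document frequency.
        scores = {t: (c / len(tokens)) * math.log(n / df[t]) for t, c in tf.items()}
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:k])
    return results

docs = [
    "invoice total due invoice payment",
    "claim form patient claim diagnosis",
    "payment received for patient account",
]
tags = top_terms(docs, k=1)
print(tags)  # [['invoice'], ['claim'], ['account']]
```

The tags fall out of the word statistics alone, which is why this works as one more tagging signal but remains clumsy on its own: nothing in the scoring knows what an invoice or a claim actually is.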
Semantic analysis is the latest starlet to enter the limelight in the effort to reduce the expense of classification. Semantic analysis evaluates the content of a document to identify features such as nouns, verbs, and adjectives. From there, you can use a lexicon to identify locations, people, or actions, and this information can be used to classify a document. But the technology and its tools are still difficult to use, and results are highly dependent upon the quality of the up-front work to define the specific vocabularies that matter for each class of documents.
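The dependence on vocabularies can be illustrated with a deliberately simplified sketch. It does no part-of-speech tagging or real semantic analysis; it only counts how many words from each hand-built vocabulary appear in a document. The class names and word lists are hypothetical stand-ins for what subject matter experts would have to define.

```python
import re

# Hypothetical vocabularies; in practice subject matter experts curate these.
CLASS_VOCABULARY = {
    "medical_claim": {"patient", "diagnosis", "physician", "treatment"},
    "invoice": {"invoice", "payment", "due", "remit"},
}

def classify_by_vocabulary(text):
    """Return the class whose vocabulary matches the most words, or None."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # Count vocabulary hits per class.
    scores = {
        label: sum(1 for t in tokens if t in vocab)
        for label, vocab in CLASS_VOCABULARY.items()
    }
    label = max(scores, key=scores.get)
    # A document with no vocabulary hits stays unclassified.
    return label if scores[label] > 0 else None

print(classify_by_vocabulary("The physician recorded the patient diagnosis."))  # medical_claim
print(classify_by_vocabulary("Please remit payment by the due date."))          # invoice
print(classify_by_vocabulary("Hello world"))                                    # None
```

The quality of the result is bounded by the word lists: a class with a thin or poorly chosen vocabulary will be missed no matter how good the surrounding tooling is.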
The reality is that until software can do what a human can, which is to look at, read, and interpret documents according to a strict rubric, there is no silver bullet that will classify documents completely automatically. All methods and techniques could, and probably should, be used to ensure classification is relevant to a specific organization’s needs.