What Is Document Classification and How Does It Work?
Document classification organizes documents by type, assigning each document to a group. To classify documents, automated classification uses either or both of two major types of data stored within documents: text-based information and visual information. Text-based information is self-explanatory: the words and phrases a document contains. Visual information can be pictures, logos, or other visually distinct elements such as data laid out in a table.
The most common approach to document classification is rules-based: a subject matter expert identifies words that are unique to each document type, and rules are then encoded that assign a document class based on the presence of one or more of these words. The benefit is that rules-based approaches are fairly straightforward to understand and create. The drawbacks are the time required to analyze documents and construct the rules, and the risk that words identified as belonging to one document class also appear in another, which creates the potential for errors.
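A rules-based classifier of the kind described above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the class names, the keyword sets in `RULES`, and the `min_hits` threshold are all hypothetical choices made for the example.

```python
import re

# Hypothetical rules: each document class maps to words that a subject
# matter expert has judged distinctive for that class.
RULES = {
    "invoice": {"invoice", "remit", "due"},
    "resume": {"experience", "education", "skills"},
    "contract": {"agreement", "party", "hereby"},
}

def classify_by_rules(text, rules=RULES, min_hits=1):
    """Return (label, hits), where hits counts matched keywords per class.

    The label is "unknown" if no class reaches min_hits, and "ambiguous"
    if two or more classes tie for the highest count.
    """
    words = set(re.findall(r"[a-z]+", text.lower()))
    hits = {label: len(words & keywords) for label, keywords in rules.items()}
    best = max(hits.values())
    if best < min_hits:
        return "unknown", hits
    winners = [label for label, count in hits.items() if count == best]
    if len(winners) > 1:
        return "ambiguous", hits
    return winners[0], hits

label, hits = classify_by_rules("Invoice number 42, amount due on receipt")
print(label, hits)
```

The "ambiguous" outcome makes the drawback concrete: a text such as "Agreement on education" matches one keyword from the contract rules and one from the resume rules, so keyword presence alone cannot decide between them.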
A more modern approach to text-based classification applies a machine learning algorithm to the text, automating the process of identifying words or phrases that are distinctive enough to determine the proper document class. These algorithms identify textual “features” beyond what most humans can identify, and they can operate on a much larger data set to provide more comprehensive coverage. The specific type of machine learning algorithm matters less than whether its performance suits the need. In some cases, multiple algorithms or techniques are used, depending on the nature of the information.
The benefit is that the time and effort invested in constructing rules is replaced by “compute time,” and significantly more data goes into the analysis, so the automation can be more comprehensive. The drawback is that machine learning can be a black box, with little visibility into the process that yields the results.
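To make the machine learning approach concrete, here is a minimal sketch that learns which words are distinctive from labeled examples instead of from hand-written rules. It uses a multinomial Naive Bayes model with add-one smoothing, chosen only because it is small enough to write with the standard library; the text above does not prescribe any particular algorithm. The class name `NaiveBayesTextClassifier` and the four training sentences are invented for illustration.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)        # documents per class
        self.word_counts = defaultdict(Counter)    # word frequencies per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def log_scores(self, text):
        """Return an unnormalized log-probability for each class."""
        total_docs = sum(self.class_counts.values())
        words = [w for w in tokenize(text) if w in self.vocab]  # skip unseen words
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for word in words:
                score += math.log((counts[word] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)

# Toy training set (hypothetical); a real project would use far more data.
TRAIN = [
    ("invoice number total amount due payment", "invoice"),
    ("please remit payment for invoice balance due", "invoice"),
    ("work experience education skills references", "resume"),
    ("skills summary and professional experience", "resume"),
]
texts, labels = zip(*TRAIN)
model = NaiveBayesTextClassifier().fit(texts, labels)
print(model.predict("payment due for outstanding balance"))
```

No one told the model that "payment" or "balance" signals an invoice; it inferred that from word frequencies in the examples. A model this simple is still inspectable through `word_counts`, but larger models lose that visibility, which is the black-box concern noted above.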
Another way to classify documents is to use the visual information available. This type of classification relies on computer vision algorithms that identify key visual features of a document in order to distinguish one document type from another. Often, especially when document types are visually distinct, visual classification does not require OCR, which can be a time-consuming process.
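The idea of classifying on layout alone, without reading any text, can be illustrated with a deliberately simple sketch. It is a toy stand-in for real computer vision models: each page is assumed to be a small binary image (a list of rows of 0/1 pixels), the visual feature is the fraction of dark pixels in each cell of a coarse grid, and a page is assigned to the class whose reference page has the closest feature vector. The function names, the 16x16 synthetic pages, and the two class labels are all hypothetical.

```python
def ink_density_grid(page, rows=4, cols=4):
    """Split a binary page image into rows x cols cells and return the
    fraction of dark pixels in each cell as a flat feature list."""
    height, width = len(page), len(page[0])
    features = []
    for r in range(rows):
        for c in range(cols):
            y0, y1 = r * height // rows, (r + 1) * height // rows
            x0, x1 = c * width // cols, (c + 1) * width // cols
            cell = [page[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            features.append(sum(cell) / len(cell))
    return features

def nearest_class(features, references):
    """Return the label whose reference features are closest
    (squared Euclidean distance) to the given features."""
    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(references, key=lambda label: distance(features, references[label]))

def make_page(dark, height=16, width=16):
    """Build a synthetic binary page; dark(y, x) says whether a pixel is inked."""
    return [[1 if dark(y, x) else 0 for x in range(width)] for y in range(height)]

# Synthetic reference pages (assumptions for the example):
letterhead = make_page(lambda y, x: y < 4)                     # dark header band
table_form = make_page(lambda y, x: y % 4 == 0 or x % 4 == 0)  # ruled grid lines

references = {
    "letterhead": ink_density_grid(letterhead),
    "table_form": ink_density_grid(table_form),
}

# A letterhead page with a couple of stray marks lower on the page.
noisy = make_page(lambda y, x: y < 4 or (y == 10 and x in (3, 7)))
print(nearest_class(ink_density_grid(noisy), references))
```

Nothing in this sketch performs OCR: the decision rests entirely on where ink appears on the page, which is why visually distinct document types can often be sorted without a text-recognition step.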