In the previous post, we discussed challenges with describing documents and three areas that impact scope. This time, we’ll discuss technologies that can be used to make the process more efficient. There is a lot of talk about classification, machine learning, and artificial intelligence. But when it comes to practical information on which technology is best suited to a particular problem, the talk often becomes vague. Let’s take a closer look at classification techniques.
Semantic or Natural Language-based Classification
A lot of attention is given to semantic, or natural language-based, classification, which derives meaning from a document in order to classify it. These technologies range from less complex, synonym-based approaches all the way to natural language processing, which examines sentence structure. Semantic processing uses ontological databases to distinguish a sentence about a fire from a sentence about someone being fired.
Classification technologies that focus on language are ideal when few representative samples are available or when the needs are highly region-specific, since meanings can vary by location even within the same language. These solutions do require constant upkeep as language and meaning change.
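To make the fire versus fired example concrete, here is a minimal sketch of the simplest end of that range: a cue-word lookup. The `SENSE_CUES` table is a hypothetical, hand-built stand-in for an ontological database, and the function names are illustrative; a real semantic engine would also examine sentence structure.

```python
import re

# Hypothetical cue lists standing in for an ontological database.
# Maintaining lists like these is the "constant upkeep" mentioned above.
SENSE_CUES = {
    "combustion": {"smoke", "flames", "burned", "firefighters", "alarm", "blaze"},
    "dismissal": {"employee", "manager", "job", "hr", "terminated", "resigned"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def disambiguate_fire(sentence):
    """Pick the sense whose cue words overlap most with the sentence."""
    tokens = set(tokenize(sentence))
    scores = {sense: len(tokens & cues) for sense, cues in SENSE_CUES.items()}
    best = max(scores, key=scores.get)
    # With no cue words at all, refuse to guess.
    return best if scores[best] > 0 else "unknown"


print(disambiguate_fire("The fire filled the warehouse with smoke and flames."))
print(disambiguate_fire("The manager fired the employee on Friday."))
```

Note that nothing here was learned from samples, which is why this style of approach can work when few representative documents exist.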
Statistical Classification
Statistical classification, on the other hand, is language-independent, so it can be applied more generally. Rather than focus on the meaning of the text within a document, this type of classification relies on algorithms that examine word frequencies and the positions of words relative to one another. When a good representative sample set is available, statistical classification is an ideal option. With some technologies, you can even add linguistic capabilities such as synonym databases, which allow different words to be included in the same analysis even though they might not appear in the sample set.
When choosing between the two approaches, it’s important to understand the range of text within your documents, the availability of sample sets for training the classification software, and the frequency with which the underlying words may change.
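As a rough illustration of the statistical idea, the sketch below counts single words (frequencies) and adjacent word pairs (a crude proxy for relative position), then assigns a document to the label whose sample profile it most resembles. Cosine similarity and word pairs are assumptions chosen for brevity, not the method of any particular product, and the sample texts are invented.

```python
import math
import re
from collections import Counter


def features(text):
    """Count single words plus adjacent word pairs."""
    words = re.findall(r"\w+", text.lower())
    pairs = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return Counter(words + pairs)


def cosine(a, b):
    """Cosine similarity between two feature-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(samples):
    """Build one aggregate feature profile per label from (label, text) pairs."""
    profiles = {}
    for label, text in samples:
        profiles.setdefault(label, Counter()).update(features(text))
    return profiles


def classify(profiles, text):
    """Return the label whose profile is most similar to the document."""
    doc = features(text)
    return max(profiles, key=lambda label: cosine(doc, profiles[label]))


# Invented sample set for illustration only.
SAMPLES = [
    ("invoice", "Invoice number 1001 total amount due 450.00 payment terms net 30"),
    ("invoice", "Invoice number 1002 amount due 99.50 please remit payment"),
    ("contract", "This agreement is made between the parties and governed by the laws"),
    ("contract", "The parties agree to the terms of this agreement"),
]

profiles = train(SAMPLES)
print(classify(profiles, "Invoice 2044 amount due 12.00"))
print(classify(profiles, "agreement between the parties"))
```

Nothing in the code knows what any word means, so the same functions would run unchanged on samples in another language.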
Supervised versus Unsupervised Machine Learning
Associated with both types of classification are the concepts of supervised vs. unsupervised learning. Supervised learning means that there is an input that tells the technology what to look for. For instance, I can take a sample of documents from my collection of invoices, and input it into the system along with the description: “invoices.” From here, the system will discover features common among the samples and use these common features to create a rule set to classify other documents as invoices.
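The invoice example can be sketched in a few lines. This is a simplified assumption about how a rule set might be derived, not a description of any specific product: the "rules" are just the words shared by every sample of a label that never appear under another label. All names and sample texts are hypothetical.

```python
import re


def words(text):
    """Return the set of lowercase word tokens in the text."""
    return set(re.findall(r"\w+", text.lower()))


def learn_rules(labeled_samples):
    """Map each label to words found in all of its samples and in no other label's.

    labeled_samples: dict of label -> non-empty list of sample texts.
    """
    seen = {label: [words(t) for t in texts] for label, texts in labeled_samples.items()}
    rules = {}
    for label, word_sets in seen.items():
        shared = set.intersection(*word_sets)
        elsewhere = set()
        for other, other_sets in seen.items():
            if other != label:
                elsewhere.update(*other_sets)
        rules[label] = shared - elsewhere
    return rules


def classify(rules, text, min_score=0.5, default="unclassified"):
    """Pick the label whose rule words are best covered by the document."""
    doc = words(text)
    best_label, best_score = default, 0.0
    for label, rule_words in rules.items():
        if not rule_words:
            continue
        score = len(doc & rule_words) / len(rule_words)
        if score >= min_score and score > best_score:
            best_label, best_score = label, score
    return best_label


# The labels are the supervision: they tell the system what to look for.
TRAINING = {
    "invoices": [
        "Invoice 1001 amount due 450 payment terms net 30",
        "Invoice 1002 amount due 99 please remit payment",
    ],
    "receipts": [
        "Receipt 77 amount paid 450 thank you",
        "Receipt 78 amount paid 20 thank you for your payment",
    ],
}

rules = learn_rules(TRAINING)
print(rules["invoices"])
print(classify(rules, "Invoice 3000 amount due 10"))
```

Words such as "amount" and "payment" are dropped from the invoice rules because they also show up on receipts, which is the kind of distinction the labeled input makes possible.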
Unsupervised learning doesn’t have any input defining for the system what an invoice “looks like.” Rather, the system develops its own inferences. An unsupervised system might discover that all invoices have the word “invoice” on them and infer that a set of documents are invoices. But to establish accuracy more definitively, input is still required to verify that the unsupervised inference is indeed correct.
Clustering
The most common type of unsupervised learning is called “clustering.” This type of learning takes input, such as documents, and groups them based upon inferred similarities or “features.” The accuracy of clustering depends upon the nature of the input, but it is generally much lower than that of true, supervised learning-based classification.
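A minimal sketch of clustering, assuming a simple greedy, single-pass approach with Jaccard word overlap as the similarity measure (real products typically use more sophisticated algorithms). The `threshold` value and the sample documents are arbitrary choices for illustration.

```python
import re


def words(text):
    """Return the set of lowercase word tokens in the text."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Share of words the two sets have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(documents, threshold=0.3):
    """Greedily group documents; returns lists of document indices.

    Each document joins the existing cluster containing its most similar
    member, or starts a new cluster if nothing reaches the threshold.
    """
    clusters = []
    for i, doc in enumerate(documents):
        doc_words = words(doc)
        best, best_sim = None, threshold
        for members in clusters:
            sim = max(jaccard(doc_words, words(documents[j])) for j in members)
            if sim >= best_sim:
                best, best_sim = members, sim
        if best is None:
            clusters.append([i])
        else:
            best.append(i)
    return clusters


DOCS = [
    "Invoice 1001 amount due 450",
    "Agreement between the parties",
    "Invoice 1002 amount due 99",
    "The parties sign this agreement",
]

groups = cluster(DOCS)
print(groups)
```

The output is only groups of document indices. Nothing tells the system that one group is "invoices" and the other "contracts"; a person still has to inspect and name each group, and a different threshold can change the grouping.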
Some solution vendors might claim that their solution is completely unsupervised. In reality, all classification solutions with a reasonable level of accuracy must be somewhat supervised. IBM’s Watson is not unsupervised. It learns through input and then interaction.
When a solution touts being completely unsupervised, there is typically some form of hidden feedback. For example, a classification solution might present a set of documents without any training and then take a user’s selection as input. Again, it is rare for any accurate solution to lack some sort of input or feedback.
Stay tuned: next, we’ll delve into accuracy and error rates. If you found this piece interesting, you might also read “KMWorld: Leveraging Automated Classification.”