Is NLP Really Needed to Classify Documents?


October 26, 2020

Do You Really Need NLP to Classify Documents?

by The Parascript Team


As if it weren’t already challenging enough to translate marketing messages, with their bevy of three-letter abbreviations and technology claims, into a real understanding of how intelligent document processing (IDP) can actually solve problems, let me introduce another three-letter abbreviation to the IDP fray: NLP.

NLP stands for Natural Language Processing. You will likely find all sorts of references to this technology domain with increasing frequency within the IDP software market. But what does it mean, and when do you really need NLP to automate document-oriented processes?

I’ll start by focusing on what you DON’T need it for: document classification.

The Art of Document Classification

Let’s take a step back and look at what document classification does, from a technical standpoint, and why we use it. Strictly speaking, document classification is the function of IDP software that determines what kind of document has been submitted into a process workflow. Whether that workflow involves underwriting based on data in a loan file or auditing quality of care using medical charts, document classification is charged with figuring out, from all of the possible document types out there, what a specific document is. Is it a progress note? Proof of insurance? An ID?

Now let’s turn to the art of document classification. Historically, and even to this day, the most common way to automate document classification is to use a rules-based approach. This means that a person, usually (or hopefully) a subject matter expert, reviews documents and identifies which text within a document reliably indicates its type. There are easy examples and more difficult ones. For instance, you can expect with a high degree of reliability that a document containing the word “invoice” is actually an invoice. With an even higher degree of reliability, a document containing both “invoice” and “total amount” is likely to be an invoice.
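The rules-based approach described above can be sketched in a few lines. This is a minimal illustration, not any product's implementation: the `RULES` table, the document type names and the `classify_by_rules` function are all hypothetical, and the more specific rule (more keywords) is simply assumed to win.

```python
from typing import Optional

# Hypothetical rules written by a subject matter expert. Each document type
# lists keyword sets; a set matches only when every keyword appears in the text.
RULES = {
    "invoice": [{"invoice", "total amount"}, {"invoice"}],
    "proof_of_insurance": [{"proof of insurance"}],
}


def classify_by_rules(text: str) -> Optional[str]:
    """Return the type whose matching keyword set is largest, or None."""
    lowered = text.lower()
    best_type, best_size = None, 0
    for doc_type, keyword_sets in RULES.items():
        for keywords in keyword_sets:
            matched = all(keyword in lowered for keyword in keywords)
            if matched and len(keywords) > best_size:
                best_type, best_size = doc_type, len(keywords)
    return best_type


print(classify_by_rules("INVOICE #123\nTotal Amount: $40"))
print(classify_by_rules("Progress note for the patient"))
```

A document that matches no rule returns `None`, which in practice would mean routing it to a person for review.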

Document Classification for Complex Cases

Other cases are more complex. For instance, you might have a requirement to identify two types of text-heavy agreements: a homeowner’s insurance policy and a homeowner’s flood insurance policy. Here it is typically harder to find words that reliably distinguish one from the other. The word “insurance” is not enough, nor is “policy,” “flood” or “homeowner.” Both document types are likely to share these words, so a rules-based approach does not work as easily.

This is where machine learning approaches are a better fit. Machine learning can evaluate all the text of any number of examples of each policy type and identify the range of words and phrases that provide reliable clues as to the actual document type. These approaches use statistical analysis to gauge the frequency of identified words, the proximity of one relevant word to another, and so on. Analyzing the entire text statistically in this way produces a model that can identify each type of insurance policy with a high level of precision.
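To make the statistical idea concrete, here is a minimal sketch of a word-frequency classifier (multinomial Naive Bayes with add-one smoothing) using only the Python standard library. It is one common technique, chosen for illustration, and not necessarily what any IDP product uses. The four training snippets are invented, and the sketch looks only at word frequency, not at the proximity of words to each other.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())


def train(examples):
    """Count words per label and documents per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, label in examples:
        word_counts[label].update(tokenize(text))
        doc_counts[label] += 1
    return word_counts, doc_counts


def predict(model, text):
    """Return the label with the highest smoothed log-probability."""
    word_counts, doc_counts = model
    vocab = {word for counts in word_counts.values() for word in counts}
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(doc_counts[label] / total_docs)
        for word in tokenize(text):
            # Add-one smoothing so unseen words do not zero out a label.
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Invented training snippets; a real model would use many full documents.
EXAMPLES = [
    ("this homeowner insurance policy covers the dwelling, personal property and liability", "homeowners"),
    ("homeowner policy coverage for fire, theft and liability at the insured dwelling", "homeowners"),
    ("this homeowner flood insurance policy covers direct physical loss by flood", "flood"),
    ("flood policy for the homeowner: coverage applies to flood water and mudflow damage", "flood"),
]

model = train(EXAMPLES)
print(predict(model, "insurance policy for a homeowner covering loss caused by flood water"))
```

Note that “homeowner,” “insurance” and “policy” appear in both classes here, so none of them alone separates the two types; the model relies on the combined weight of all the words.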

NLP and Document Identification

So where would NLP come into play with the task of identifying document types? It’s hard to find a good use case where it is needed. One use case might be to use NLP to identify a “sentiment” of a document, such as figuring out if correspondence indicates a generally positive state or negative state. But when it comes to classifying documents by type, NLP is pretty useless. Why?

NLP is not a single technology. Rather, it is the practice of taking text and deconstructing it into grammatical structures: identifying nouns, verbs, adjectives, etc. On this basis, sentences can be analyzed according not only to the words they contain, but also to how those words are used within a given segment of text. So a big part of NLP is simply the process of adding grammatical context to a specific document’s text. From there, we still need to use machine learning (or even more basic, rules-based approaches) to convert text into something more structured and actionable.
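As a toy illustration of what “adding grammatical context” means, the sketch below tags each word with a part of speech using a hand-written lookup table and then uses those tags to spot a negated verb. The `LEXICON`, the tag names and both functions are hypothetical; real NLP toolkits use trained statistical taggers rather than a lookup like this.

```python
import re

# Hypothetical mini-lexicon for illustration only; real taggers are trained
# on annotated text and use the surrounding context to resolve ambiguity.
LEXICON = {
    "we": "PRON", "do": "AUX", "not": "PART", "cover": "VERB",
    "land": "NOUN", "the": "DET", "dwelling": "NOUN",
}


def tag(sentence):
    """Pair each lowercased word with its part of speech ('X' if unknown)."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return [(word, LEXICON.get(word, "X")) for word in words]


def negated_verbs(tagged):
    """Return verbs that directly follow the particle 'not'."""
    return [
        word
        for (previous, _), (word, pos) in zip(tagged, tagged[1:])
        if previous == "not" and pos == "VERB"
    ]


tagged = tag("We do not cover land")
print(tagged)
print(negated_verbs(tagged))
```

The tags by themselves do not classify or extract anything; they are extra context that a later rules-based or machine learning step can use.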

Where NLP Is Useful

Where we generally do need NLP within the confines of IDP is in the area of data location and extraction: identifying specific information within very prose-like, unstructured text and converting it to normalized, structured data. The standard techniques, such as looking for fixed locations or data labels as we do with structured forms or invoices, do not work here because there are no specific locations or labels available. In these cases, we need additional information to aid with location.

Insurance Policy Use Case

For instance, in the case of a homeowner’s insurance policy, where the needed information is likely buried in paragraphs of text, we can use NLP grammatical context not only to identify the amount of insurance coverage within the policy, but also to associate that amount with the type and conditions of coverage, such as liability, flood, basic dwelling or outbuildings.

A sentence with no standard structure or labels such as “we do not cover land, including land on which the dwelling is located” can be deconstructed to define what is not included in dwelling coverage. Conversely, we can identify what is included within the unstructured sentence “we cover: the dwelling on the residence premises shown in the Declarations, including structures attached to the dwelling and materials and supplies located on or next to the residence premises used to construct, alter or repair the dwelling or other structures on the residence premises.”

Even more importantly, use of NLP can convert this text into standardized output. Discovery of dwelling coverage can result in outputting structured data such as:

<Dwelling Coverage1>Residence</Dwelling Coverage1>
<Dwelling Coverage2>Attached Structures</Dwelling Coverage2>
<Dwelling Coverage3>On-Premise Construction Supplies</Dwelling Coverage3>

The ability to “understand” the meaning of unstructured text and to convert it to normalized, structured data is supported by the power of NLP context.
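The shape of that conversion can be sketched as follows. This is a deliberately crude stand-in: it uses a pattern for the “we cover” / “we do not cover” clause plus a hypothetical table of cue phrases (`CUES`) in place of real grammatical analysis, so it shows only how unstructured sentences might map to normalized labels, not how an NLP engine actually gets there. The “Land” label and the function name are assumptions added for the example.

```python
import re

# Hypothetical cue phrases mapped to normalized labels. A real system would
# derive these spans from grammatical analysis rather than a fixed table.
CUES = [
    (r"\bthe dwelling on the residence premises\b", "Residence"),
    (r"\bstructures attached to the dwelling\b", "Attached Structures"),
    (r"\bmaterials and supplies\b", "On-Premise Construction Supplies"),
    (r"\bland\b", "Land"),
]

# Matches "we cover ..." or "we do not cover ..." at the start of a sentence.
CLAUSE = re.compile(r"\s*we\s+(do\s+not\s+)?cover\b:?\s*(.+)", re.IGNORECASE | re.DOTALL)


def extract_dwelling_coverage(sentence):
    """Return the clause status and normalized items, or None if no match."""
    match = CLAUSE.match(sentence)
    if match is None:
        return None
    status = "excluded" if match.group(1) else "included"
    body = match.group(2).lower()
    items = [label for pattern, label in CUES if re.search(pattern, body)]
    return {"status": status, "items": items}


INCLUDED = (
    "we cover: the dwelling on the residence premises shown in the "
    "Declarations, including structures attached to the dwelling and "
    "materials and supplies located on or next to the residence premises "
    "used to construct, alter or repair the dwelling or other structures "
    "on the residence premises."
)
EXCLUDED = "we do not cover land, including land on which the dwelling is located"

print(extract_dwelling_coverage(INCLUDED))
print(extract_dwelling_coverage(EXCLUDED))
```

A fixed phrase table like this breaks as soon as the wording changes, which is the gap that grammatical context is meant to close.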

Leveraging the Right Tools

So back to document classification. If statistical methods produce very good results, why would we consider using NLP to classify documents within IDP? Apart from the fact that it makes for great marketing claims, we wouldn’t. As with any problem, the key is matching the right tool to the problem at hand.
