There are three primary areas businesses should examine when approaching a document capture project. Without a solid grounding in them, a business can face significant pain in its ability to collect data effectively and ensure its quality, all without significantly increasing exception handling.
Many companies don't take advantage of data recognition and extraction capabilities, even for the simplest, most straightforward documents and processes. Paper still rules, and few businesses process that paper automatically.
There is a common set of themes that businesses can use to guide a project from early scoping all the way through to production monitoring. These are:
- A review and analysis of all the data types, structures, and formats involved in a process, along with data quality objectives and key performance indicators (KPIs);
- Implementation of a solid test plan that exercises data extraction rules against real-world samples; and
- Establishment of a program for reviewing the objectives and KPIs set in the initial analysis.
The flood of information
Businesses today can organize incoming information along four dimensions: input channel (MFDs, mobile, email), format (TIFF, PDF, XML, SharePoint), data type (ASCII, machine print, handprint, cursive), and structure (structured, semi-structured, unstructured). This broad array of information, arriving from multiple sources, in a variety of formats and data types, and organized in many different ways, can create more than 500 different combinations of data.
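To make those dimensions concrete, here is a minimal sketch that enumerates the combinations of the example labels above. Only the categories named in this article are listed; a real inventory typically has more entries in each dimension, which is how the count climbs past 500.

```python
from itertools import product

# Example labels for the four dimensions; a real inventory will have more.
channels = ["MFD", "mobile", "email"]
formats = ["TIFF", "PDF", "XML", "SharePoint"]
data_types = ["ASCII", "machine print", "handprint", "cursive"]
structures = ["structured", "semi-structured", "unstructured"]

# Every (channel, format, data type, structure) tuple is a distinct
# combination your capture solution may need to handle.
combinations = list(product(channels, formats, data_types, structures))
print(len(combinations))  # 3 * 4 * 4 * 3 = 144 from these labels alone
```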
Let's look at some very important questions that need to be answered before you even consider a solution.
What is the scope of your project's data requirements?
This isn't just about identifying the types of documents you need to process, but also the variety of those documents in terms of structure, format, and the data itself. Don't go into this step with a narrow focus only to discover, after the solution is in place, that a significant amount of data must be handled manually because it's handwritten or trapped in an email message.
What sources of data are used in your business, formally and informally?
The second question aims at a comprehensive understanding of the existing channels through which your scoped data arrives. Is it in email? Fax? What about FTP or a web portal? And don't forget online services such as Google Apps and Dropbox. It's also important to address both formal and informal channels. While your company may not have a formal mobile capture strategy, you can be assured that someone is using a smartphone to capture documents. Having a solid inventory of ALL the input channels means the solution you build will meet all needs.
What levels of quality are required by your downstream systems and what costs are encountered as a result of bad data?
The third question is probably the most important, because a thorough understanding of how data is used, and at what cost, is the whole reason any business would consider an automated document capture project. That said, most companies approach this as an automation problem, focusing on the time and inefficiencies associated with manual processes. That's unfortunate, because many projects can be considered failures not because the completed solution fails to automate manual processes, but because the quality of the data is not as good as it could be and the impact is hidden. Add to your inventory a list of all affected systems and how each uses the data. Different systems have different tolerances for data quality. Know yours.
Know (and improve) your results
Based on the knowledge gained from the first topic, it's critical that a company set quality objectives before selecting a solution and before the project starts.
But it's well known that most companies do not have a good handle on the actual accuracy of their document capture processes, because they never established objectives early and built the project around meeting them. Along with these objectives, companies should assemble a sample set of documents with corresponding truth data. Being armed with both the actual data and the system's results not only supports tuning your implementation but enables ongoing measurement.
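Comparing system output against truth data can be as simple as a field-by-field match rate. The sketch below is illustrative only; the document IDs and field names are made up, and a real harness would also track per-field and per-document-type breakdowns.

```python
# Hypothetical truth data and system output for two sample documents.
truth = {"invoice_001": {"vendor": "Acme", "total": "1250.00"},
         "invoice_002": {"vendor": "Globex", "total": "87.50"}}
extracted = {"invoice_001": {"vendor": "Acme", "total": "1250.00"},
             "invoice_002": {"vendor": "Globex", "total": "8750"}}

# Count exact matches between extracted values and truth values.
correct = total = 0
for doc_id, fields in truth.items():
    for name, value in fields.items():
        total += 1
        if extracted.get(doc_id, {}).get(name) == value:
            correct += 1

accuracy = correct / total
print(f"field-level accuracy: {accuracy:.0%}")  # 3 of 4 fields -> 75%
```

Running the same harness after every tuning pass, and again in production, is what turns a one-time benchmark into an ongoing measurement.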
One major factor in meeting objectives is the ability to test during implementation: it's not enough to set objectives; you've got to put good technology to use and make sure you can reach them. Even with OCR and its incredible character-level accuracy, field-level data is harder, so bring as much relevant data to the table as possible to validate extracted fields. Automated database look-ups, pre-defined field-level vocabularies, cross-field validation, and other sorts of thresholding and noise lists should be used wherever possible. Then test and tune; the process should always be iterative. Doing these things ultimately reduces the amount of exception processing required.
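The validation layers named above can be sketched as a single routing function. Everything here is an assumption for illustration: the vendor set stands in for a database look-up or vocabulary, and the confidence threshold is an arbitrary example value.

```python
# Stand-in for a database look-up or pre-defined field vocabulary.
KNOWN_VENDORS = {"Acme", "Globex", "Initech"}
CONFIDENCE_THRESHOLD = 0.90  # example thresholding value, not a recommendation

def validate(values, confidences):
    """Return the list of fields that must go to exception handling."""
    exceptions = []
    # Vocabulary / look-up check: unknown vendors need a human.
    if values["vendor"] not in KNOWN_VENDORS:
        exceptions.append("vendor")
    # Cross-field validation: line items should sum to the stated total.
    if abs(sum(values["line_items"]) - values["total"]) > 0.01:
        exceptions.append("total")
    # Thresholding: low-confidence fields are routed to a human regardless.
    exceptions += [f for f, c in confidences.items()
                   if c < CONFIDENCE_THRESHOLD and f not in exceptions]
    return exceptions

print(validate({"vendor": "Acme", "line_items": [100.0, 25.5], "total": 125.5},
               {"vendor": 0.98, "total": 0.95}))  # [] -> no exceptions
```

Each layer catches errors the others miss, which is why stacking them, then testing and tuning iteratively, drives down the exception rate.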
Keep it up
The third major theme is production-level monitoring. Most companies that implement a document automation solution spend a great deal of time monitoring application workflow, throughput, and uptime, but not the actual data quality that should have been a major reason for the solution in the first place. The reasons are numerous, including that it wasn't part of the project scope or that staff time isn't available to perform reviews.
But the inability to actively monitor and maintain data quality places the implementation and downstream systems in jeopardy. Even if you are diligent and perform rigorous pre-production data quality testing, systems change: new data types, formats, or channels are introduced, and the people managing the system change too. It's not enough to test once; you have to build the capability for ongoing testing into your implementation.
The good news is that it's not hard. Modern solutions have invested heavily in business analytics, and capture solutions are no different. They should support handling data exceptions that are automatically identified by established quality thresholds, and they should also support both automation of human audits and reporting of key data quality statistics. Once these systems and workflows are in place, the ongoing effort to monitor and manage data quality is quite low.
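One common pattern behind such workflows is to route low-confidence documents to an exception queue while randomly sampling a small fraction of auto-accepted documents for human audit. The sketch below is a simplified illustration; the threshold and audit rate are arbitrary example values, not recommendations.

```python
import random

QUALITY_THRESHOLD = 0.92  # example quality threshold
AUDIT_RATE = 0.02         # audit 2% of auto-accepted documents (example)

def route(doc_id, field_confidence, rng=random.random):
    """Decide where a processed document goes next."""
    if field_confidence < QUALITY_THRESHOLD:
        return "exception_queue"   # flagged by the quality threshold
    if rng() < AUDIT_RATE:
        return "human_audit"       # random sample for ongoing QA
    return "auto_accept"

print(route("doc-17", 0.85))  # -> exception_queue
```

The audited sample feeds the same accuracy measurements used during implementation, so quality statistics keep flowing after go-live with very little manual effort.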
Ready to learn more? Make sure to check out Part I and Part II of our data quality series.