Data extraction projects face a number of challenges that can turn a once-successful, high-performing project into one that “drifts” into sub-optimal waters and, ultimately, failure.
Challenges Leading to Failure
New or Changed Document Formats
The first and probably easiest challenge to identify is a change in document format. Successful (i.e., accurate and efficient) data extraction requires a solid understanding of where the data is located. For most form-based data, it is best to treat extraction as a “structured document” problem, where extracted data is located by its physical coordinates on the page. These are often called “zoned fields” or “structured fields.” It is straightforward to design structured form extraction rules, but even the smallest variance in field locations makes it difficult to achieve high performance. The more “dense” the form is with data, the more sensitive the extraction rules are. Because most BPOs deal with other organizations’ data, they cannot control the format. If there are no procedures in place to consistently examine the input, the result is higher costs through increased manual data entry. We recently experienced just this problem with a client when their customer made the slightest change to a very dense form. The result was that for several months, fields were not located satisfactorily, increasing the number of fields that required manual entry or verification. By the time the client recognized the problem, it had cost them tens of thousands of dollars in unnecessary data entry costs.
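To make the idea concrete, here is a minimal sketch of zoned-field extraction. The OCR word tuples, the zone name, and the coordinates are all invented for illustration; real engines expose richer output, but the sensitivity is the same.

```python
# Hypothetical sketch of zoned-field extraction: each field is located by
# fixed pixel coordinates on the page, so even a small layout shift can
# drop a word out of its zone entirely.
from dataclasses import dataclass

@dataclass
class Zone:
    name: str
    x: int   # left edge of the field zone, in pixels
    y: int   # top edge of the field zone, in pixels
    w: int   # zone width
    h: int   # zone height

def extract_zones(ocr_words, zones):
    """ocr_words: list of (text, x, y) tuples from an OCR engine.
    Returns {field_name: text} for words falling inside each zone."""
    result = {z.name: "" for z in zones}
    for text, x, y in ocr_words:
        for z in zones:
            if z.x <= x < z.x + z.w and z.y <= y < z.y + z.h:
                result[z.name] = (result[z.name] + " " + text).strip()
    return result
```

With a zone of (100, 50, 200, 30) for a hypothetical “vendor” field, a word at (110, 55) is captured, while the same word pushed further down the page is missed entirely, which is exactly the sensitivity described above.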
How to Address Changed Document Formats
There are two primary ways to deal with this challenge. The first is to include a provision in your customer agreement that forms will not change without prior consultation with your team. This provision will not guarantee that forms won’t change, but it protects you from liability for missed SLAs should a change occur. Further, the agreement could allow you to recover the unnecessary costs incurred to maintain the SLA, placing more onus on the customer to prevent the problem.
The second, more customer-friendly way to address the problem is to measure accuracy on a sample set of images and truth data, recording your accept and reject rates on a field-by-field basis for the form in question. Then create summary reports of your system’s output and monitor whether the rates fall outside a certain tolerance, say, 2 percent from your measured results. If they do, you have an early warning sign to investigate the root cause. Don’t forget that when you do change your field extraction rules, you need to refresh your sample images and truth data.
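The measure-and-monitor step can be sketched in a few lines. The field names, rates, and 2 percent tolerance below are illustrative assumptions, not a prescribed implementation.

```python
# Illustrative drift check: compare observed field-level reject rates
# against the baseline measured on your sample set and truth data.
def flag_drift(baseline, observed, tolerance=0.02):
    """baseline, observed: {field_name: reject_rate} dicts.
    Returns the fields whose observed reject rate exceeds the
    baseline by more than the tolerance (e.g., 2 percent)."""
    return {field: rate
            for field, rate in observed.items()
            if rate - baseline.get(field, 0.0) > tolerance}
```

A field whose reject rate climbs from a measured 3 percent to 8 percent would be flagged for root-cause investigation, while day-to-day noise within the tolerance would not.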
Changes in Image Quality
The second cause, which often occurs without anyone’s knowledge, is a change in image quality that affects extraction. This is often due to hardware degradation, new hardware implementation or changes to a scanner configuration. Probably the best example is when a client starts to receive forms generated from a fax machine. Fax machines can do some interesting things to a document, such as compressing it, which can have the unintended effect of re-scaling the image. Or a fax can add header data to the form, which pushes the rest of the fields down the page. With scanners, a change in output resolution from 200 DPI to 100 DPI can have disastrous effects on extraction quality, since most image processing and OCR engines are “tuned” at a certain DPI; anything outside that range can appear out-of-focus to the software. Lastly, simple wear-and-tear on hardware can shift the document slightly to the left or right, changing the location of fields.
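One way to catch such problems before they reach extraction is a simple input gate. The 200 DPI target and the tolerance below are assumptions for illustration; the correct values depend on where your own image processing and OCR engines were tuned.

```python
# Hypothetical pre-extraction gate: pages whose resolution falls outside
# the range the extraction rules were tuned for are routed to an
# exception queue (e.g., manual data entry) instead of automation.
TUNED_DPI = 200       # assumed tuning point of the OCR/image pipeline
DPI_TOLERANCE = 50    # assumed acceptable deviation from that point

def route_page(dpi):
    """Return 'extract' for in-range pages, 'exception' otherwise."""
    if abs(dpi - TUNED_DPI) > DPI_TOLERANCE:
        return "exception"   # e.g., 100 DPI fax output
    return "extract"
```

The same gating idea extends to other measurable input properties, such as page dimensions or skew, whenever a cheap check can keep a degraded image from silently inflating reject rates.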
How to Address Changed Image Quality
BPOs are at the mercy of the input quality from their customers, so many of the remedies involve the same measure-and-monitor activities identified in the previous paragraphs. With measurements taken during your design stage, you can pick tolerances that point to potential issues worth investigating. If you discover the root cause is image quality, you can work with the client to identify the problem sources and either accommodate them in another way, such as exception queues that use manual data entry, or develop a new set of extraction rules that better handle these problem images.
Changes in Data Formats or Types
The third area of potential problems relates to changes in data formats or types. This is different from a change in the overall form format (i.e., the location of data). Sometimes a client changes the way their data is entered. For instance, a date field might change from MMDDYY to MMDDYYYY because your customer expanded their business abroad. Or an industry regulatory body revises the requirements for recording certain data, such as payment codes. The result is that your data validation rules must also change in order to properly extract and verify the data. If these changes are not tracked, the data output may not meet the client’s requirements.
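As a sketch of why validation rules must track the data, consider a date rule moving from MMDDYY to MMDDYYYY. The function name, format strings, and length check are illustrative assumptions, not a specific product’s API.

```python
from datetime import datetime

# Illustrative field validation rule. When the client's date format
# changes from MMDDYY to MMDDYYYY, the expected format and width (and
# the truth data used for benchmarking) must change with it.
def valid_date(value, fmt="%m%d%Y", length=8):   # previously "%m%d%y", 6
    # strptime alone accepts short years, so the explicit length check
    # is what actually enforces the full MMDDYYYY width.
    if len(value) != length:
        return False
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False
```

Under the old rule, “070424” validated; under the new one, a full four-digit year such as “07042024” is required, and an impossible month like “13012024” is still rejected.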
How to Address Data Format Changes
Most likely a change in data requirements will trigger a change to the agreement and, ultimately, to the operational manuals describing how data is to be validated and output. That said, if the changes are due to client preference, you need to be ready to react. The good news is that if you’ve read this far, you probably have the answer: benchmark testing and tolerance tracking. If you see an increase in field-level reject rates and can trace the root cause to changes in how data is entered, you have identified the culprit and can arrange with the customer to change the specifications.
There is a lot that BPOs cannot control, but with some planning and the effort to measure performance initially and continuously, issues resulting from changes in formats, image quality, and field types can be identified and resolved with a common process.