Dark data is simultaneously everywhere and challenging to access. For the past several years, Parascript has been doing research with AIIM and ARMA on dark data such as hard-to-process documents with handwriting. According to IDC, most of “big data” is actually dark data, which is commonly defined as “the information organizations collect, process and store during regular business activities, but fail to use for other purposes” due to the limitations of available technology to access it.
More recently, we’ve come across examples of the dark data problem, most notably in discovering and protecting sensitive information. Imagine a Health Services Provider (HSP) that has a significant number of patient records imaged or imported and stored within its document system. The HSP has a reasonable amount of metadata for its managed documents. However, the metadata does not adequately identify where the sensitive data is located. If the HSP assumes that all of the data is sensitive, then access to every document is unnecessarily limited to certain staff. What the HSP wants is the ability to separate billing-related documents from care-specific documents and then to verify the presence of certain sensitive data. Social Security numbers, credit card numbers and other identity-related information need to be protected.
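To make the verification step concrete, here is a minimal sketch in Python of checking a document's text for Social Security and credit card numbers. It assumes the text has already been extracted from the image (by OCR or handwriting recognition, which is the hard part and is not shown). The patterns and the function names `luhn_ok` and `find_sensitive` are illustrative assumptions, not part of any particular product.

```python
import re

# Illustrative patterns only; real documents need broader, tuned rules.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_ok(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_sensitive(text: str) -> dict:
    """Find SSN-like strings and Luhn-valid card-like numbers in text."""
    ssns = SSN_RE.findall(text)
    cards = []
    for match in CARD_RE.finditer(text):
        digits = re.sub(r"\D", "", match.group())
        # The checksum filters out most digit runs that are not card numbers.
        if 13 <= len(digits) <= 19 and luhn_ok(digits):
            cards.append(digits)
    return {"ssn": ssns, "card": cards}


if __name__ == "__main__":
    sample = "SSN 123-45-6789, card 4111 1111 1111 1111 on file"
    print(find_sensitive(sample))
```

A document that returns empty lists for both keys is a candidate for wider access, while any hit flags it for protection. In practice, the results would likely be combined with the document type, since a billing form and a care record call for different handling.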
While handwriting recognition has been around for years and is very reliable for certain needs, many HSPs and other organizations have not taken advantage of it, primarily because they have not been aware of its practical uses or even of its existence. For needs like those of the HSP in the example above, the technology simply hasn’t been readily available until recently. A combination of classification, image analysis, and feature extraction was required to enable organizations to group documents into their respective types, locate specific elements of the documents regardless of content type, and then extract these data to enable further evaluation and, ultimately, protection. Software developers with real domain expertise could stitch together a system from various technologies, but it was difficult, time-consuming, and expensive.
Granted, these are highly complex tasks, but capable technology is now more accessible to more organizations than ever before. If your organization has a document store, whether it is an ECM system, a medical record system, a file share, or a warehouse, sifting through volumes of documents to find and protect the sensitive data is a necessity, and it is a challenge that has just gotten easier to tackle. We will be presenting business cases that deal with this specific challenge at the AIIM March 4 Webinar, and we invite you to come and see what we’re up to.
For more on dark data, check out key findings in a recent AIIM study and a recent post on capturing dark data in the information governance process.