If you have an Amazon Echo or have used Siri, then you have experienced advanced artificial intelligence. These products are powered by a combination of deep neural networks that translate your speech to text, interpret its meaning, and then construct an appropriate answer.
Unfortunately, you don’t always get an appropriate answer, and the Web is rife with examples of how these technologies fail. Is this a problem with the technology? Are these products being overhyped?
Getting the Training
The problem lies not with the fundamental technologies, but with the amount of data required to train a machine to do something that a human can do easily. We have developed certain expectations for our interactions with other people, and therefore we expect anything designed to replace that interaction to have the same fidelity.
Humans must be trained to perform a given function. This training starts at birth for social interactions and language development. It continues with job training to carry out tasks. Machines need a lot of training, too. The data involved is called ground truth data, and it is the hidden gold behind any advanced AI application.
The Hidden Gold Behind Every AI
Simply put, ground truth data represents the actual outcomes of a specific task. Take, for instance, the ability to identify a storefront in an image. The ground truth would be a collection of images, some that contain storefronts and some that do not. (Or, in the case of the squirrel pictured above, images where it is in the photo and images where it is absent.) Each image also comes with the answer, the “truth”, regarding whether the image contains a storefront. This data is fed into a machine learning system, which gradually develops inferences about what a storefront “looks like”. We humans can perform simple tasks like this with a high degree of accuracy. For a machine to reach the same level of accuracy, you need millions of example images WITH storefronts as well as examples WITHOUT storefronts. And you need the ground truth.
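As a minimal sketch of the idea, ground truth can be pictured as a list of examples that each carry their correct answer, against which a system's guesses are scored. The image IDs, the `LabeledImage` type, and the predictions below are hypothetical, invented only for illustration; a real dataset would hold millions of examples rather than four.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledImage:
    """One ground truth example: an image plus its known answer."""
    image_id: str          # hypothetical identifier for the image
    has_storefront: bool   # the "truth" supplied with the image


def accuracy(predictions: dict[str, bool], ground_truth: list[LabeledImage]) -> float:
    """Fraction of ground truth examples the system answered correctly.

    A missing prediction counts as a wrong answer.
    """
    if not ground_truth:
        raise ValueError("ground truth must contain at least one example")
    correct = sum(
        1 for example in ground_truth
        if predictions.get(example.image_id) == example.has_storefront
    )
    return correct / len(ground_truth)


# Hypothetical ground truth: examples WITH and WITHOUT storefronts.
ground_truth = [
    LabeledImage("img_001", True),
    LabeledImage("img_002", False),
    LabeledImage("img_003", True),
    LabeledImage("img_004", False),
]

# Hypothetical system output; it misses the storefront in img_003.
predictions = {
    "img_001": True,
    "img_002": False,
    "img_003": False,
    "img_004": False,
}

score = accuracy(predictions, ground_truth)
print(f"accuracy: {score:.2f}")
```

Without the `has_storefront` answers there would be nothing to score against, which is why the labels, and not only the images, are the valuable part.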
Developing ground truth takes enormous effort and cost, and the cost varies with the complexity of the task. Google has spent hundreds of millions of dollars gathering driving data for its autonomous car project. IBM has spent similar amounts collecting, curating and processing data to improve Watson’s capabilities. These are examples of complex tasks. Regardless of the complexity, any company that wishes to develop an applied AI product must first identify and develop the massive amounts of data required to train the system before it can start the actual work. The way companies go about it can be as interesting as the results.
Collecting the AI Gold
Data to feed a system that identifies storefronts can be collected via crowdsourced efforts such as reCAPTCHA. In the case of Alexa, you and I provide the data through our daily interactions and, sometimes, our corrections to commands. Any place where a human interaction can be recorded is a place where this new AI gold can be collected and used to improve AI systems.
Because of the need for ground truth data, the companies that can tap into the most user interactions, whether through product use, crowdsourced contribution, or other methods, have a significant advantage over competitors that lack either the data or the ability to generate it. This creates an environment of haves and have-nots in which access to this data grows ever more valuable. We can see this battle for data going on now with efforts to expand the use of these AI services to hotels, cars, and any place where interaction will advance the knowledge base of the AI.
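To make the idea of corrections as data concrete, here is a rough sketch of how logged interactions might be turned into ground truth pairs. It is not how Alexa or any real assistant works; the `Interaction` record, its field names, and the sample log are assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Interaction:
    """One recorded exchange with a voice assistant (hypothetical schema)."""
    utterance: str                         # what the user said
    system_interpretation: str             # what the system thought was meant
    user_correction: Optional[str] = None  # what the user said was meant, if corrected


def to_ground_truth(log: list[Interaction]) -> list[tuple[str, str]]:
    """Keep only corrected interactions as (utterance, true meaning) pairs.

    Uncorrected interactions are skipped in this sketch, since silence
    does not prove the system understood the user.
    """
    return [
        (item.utterance, item.user_correction)
        for item in log
        if item.user_correction is not None
    ]


# Hypothetical interaction log.
log = [
    Interaction("play thriller", "play the movie Thriller", "play the song Thriller"),
    Interaction("what time is it", "report the current time"),
]

pairs = to_ground_truth(log)
print(pairs)
```

Each correction supplies both the input and the right answer, which is exactly the shape of a ground truth example.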
Ground Truth
So the next time you ask Alexa to play a song, use Google Translate, or give Siri a command, think about all the effort taken to “prime” the AI so that it can deliver a relevant, even if basic, response. And if the outcome isn’t what you expected, maybe contribute to the cause and issue a complaint, in the form of ground truth.