These days it’s hard to find a public company that isn’t talking up how artificial intelligence is transforming its business. From the obvious (Tesla using AI to improve Autopilot performance) to the less obvious (Levi’s using AI to drive better product decisions), everyone wants in on AI.
To get there, however, organizations are going to need to get a lot smarter about data. To get anywhere near serious AI you need supervised learning, which in turn depends on labeled data: raw data must be painstakingly labeled before it can power supervised learning models. This budget line item is big enough for C-suite attention. Executives who have spent the last 10 years stockpiling data and now need to turn that data into revenue face three choices:
1. DIY and build your own bespoke data labeling system. Be ready and budget for major investments in people, technology, and time to create a robust, production-grade system at scale that you will maintain in perpetuity. Sound straightforward? After all, that’s what Google and Facebook did. The same holds true for Pinterest, Uber, and other unicorns. But those aren’t good comps for you. Unlike you, they had battalions of PhDs and IT budgets the size of a small country’s GDP to build and maintain these complex labeling systems. Can your organization afford this ongoing investment, even if you have the talent and time to build a from-scratch production system at scale in the first place? If you’re the CIO, that’s sure to be a top MBO.
2. Outsource. There is nothing wrong with professional services partners, but you will still have to develop your own internal tooling. This choice also takes your business into risky territory. Many providers mingle third-party data with your own proprietary data to increase sample sizes, theoretically resulting in better models. Do you have confidence in the audit trail of your own data to keep it proprietary throughout the entire lifecycle of your persistent data labeling requirements? Are the processes you develop as competitive differentiators in your AI journey repeatable and reliable, even if your provider goes out of business? Your decade of hoarded IP, your data, could end up enriching a competitor that is building its systems with the same partners. Scale.ai is the largest of these service companies, serving primarily the autonomous vehicle industry.
3. Use a training data platform (TDP). Relatively new to the market, these are solutions that provide a unified platform to aggregate all of the work of collecting, labeling, and feeding data into supervised learning models, or that help build the models themselves. This approach can help organizations of any size standardize workflows in the same way that Salesforce and HubSpot have for managing customer relationships. Some of these platforms automate complex tasks using integrated machine learning algorithms, making the work easier still. Best of all, a TDP solution frees up expensive headcount, like data scientists, to spend time building the actual structures they were hired to create — not to build and maintain complex and brittle bespoke systems. The purer TDP players include Labelbox, Alegion, and Superb.ai.
Why you need a training data platform
The first thing any organization on an AI journey needs to understand is that data labeling is one of the most expensive and time-consuming parts of developing a supervised machine learning system. Data labeling does not stop when a machine learning system has matured to production use. It persists and usually grows. Regardless of whether organizations outsource their labeling or do it all in-house, they need a TDP to manage the work.
A TDP is designed to facilitate the entire data labeling process. The idea is to produce better data, faster, thereby enabling organizations to create performant AI models and applications as quickly as possible. A number of companies in the space use the term today, but few offer a true TDP.
Two things ought to be table stakes: enterprise-readiness and an intuitive interface. If it’s not enterprise-ready, IT departments will reject it. If it’s not intuitive, users will route around IT and find something that’s easier to use. Any system that handles sensitive, business-critical information needs enterprise-grade security and scalability or it will be a non-starter. But anything that feels like an old-school enterprise product is a non-starter, too. We’re at least a decade into the consumerization of IT. Anything that isn’t as simple to use as Instagram just won’t get used. Remember Siebel’s famous sales force automation shelfware? Salesforce stole that business out from under Siebel’s nose with an easy user experience and cloud delivery.
Beyond those basics, there are three big requirements: annotate, manage, and iterate. If a system you are considering does not satisfy all three of these requirements, then you’re not choosing a true TDP. Here are the must-haves on your list of considerations:
Annotate. A TDP must provide tools for intelligently automating annotation. As much labeling as possible should be done automatically, and a good TDP should be able to bootstrap from a limited amount of professionally labeled data. For example, it would start with tumors circled by radiologists in X-rays and then pre-label tumors in new images itself. The job of humans is then to correct anything that was mislabeled. The machine assigns each prediction a confidence score; it might, say, be 80% confident that a given label is correct. The highest priority for human reviewers should be checking and correcting the labels in which the machine has the least confidence. As such, organizations should automate annotation wherever possible and focus human effort, whether in-house experts or professional labelers, on verifying the accuracy and integrity of the labeled data. Much of the remaining annotation work can be done without human help.
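To make that triage concrete, here is a minimal Python sketch. The file names, labels, and the 90% threshold are invented for illustration and not tied to any particular TDP; the point is simply that human review is routed to the predictions the model is least sure about.

```python
# Hypothetical pre-labels produced by a model-assisted annotation step.
# Each entry pairs an item with the model's suggested label and a confidence score.
pre_labels = [
    {"item": "xray_001.png", "label": "tumor", "confidence": 0.97},
    {"item": "xray_002.png", "label": "no_tumor", "confidence": 0.62},
    {"item": "xray_003.png", "label": "tumor", "confidence": 0.81},
]

REVIEW_THRESHOLD = 0.90  # assumed cutoff; tuned per project in practice

# Low-confidence predictions go to humans first, least confident at the top.
needs_review = sorted(
    (p for p in pre_labels if p["confidence"] < REVIEW_THRESHOLD),
    key=lambda p: p["confidence"],
)
auto_accepted = [p for p in pre_labels if p["confidence"] >= REVIEW_THRESHOLD]

for p in needs_review:
    print(f"review first: {p['item']} ({p['label']}, {p['confidence']:.0%} confident)")
```

The threshold and queue ordering would be tuned per project, but the principle holds: spend human attention where the machine is least sure.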
Manage. A TDP should serve as the central system of record for training data projects. It’s where data scientists and other team members collaborate. Workflows can be created and tasks assigned either through integrations with traditional project management tools or within the platform itself.
It’s also where datasets can be surfaced again for later projects. For example, each year in the United States, roughly 30% of all homes are quoted for home insurance. In order to predict and price risk, insurers depend on data, such as the age of the home’s roof, the presence of a pool or trampoline, or the distance of a tree to the home. To assist this process, companies now leverage computer vision to provide insurance companies with continual analysis via satellite imagery. A company should be able to use a TDP to reuse existing datasets when classifying homes in a new market. For example, if a company enters the UK market, it should be able to re-use existing training data from the US and simply update it to adjust for local differences such as building materials. These iteration cycles allow companies to provide highly accurate data while adapting quickly to keep up with the continuous changes being made to homes across the US and beyond.
That means your TDP needs to provide APIs for integrating with other software, whether project management applications or tools for harvesting and processing data, as well as SDKs that let organizations customize and extend the platform to meet their needs.
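What that integration looks like varies by vendor, but the shape is usually similar. The sketch below assumes a hypothetical REST API; the base URL, endpoints, payload fields, and token are illustrative placeholders, not any specific product’s interface.

```python
import requests

API_BASE = "https://tdp.example.com/api/v1"  # hypothetical TDP endpoint
API_TOKEN = "YOUR_API_TOKEN"                 # placeholder credential

headers = {"Authorization": f"Bearer {API_TOKEN}"}

# Create a labeling project (illustrative payload and response schema).
project = requests.post(
    f"{API_BASE}/projects",
    json={"name": "roof-condition-q3", "task_type": "image_classification"},
    headers=headers,
).json()

# Attach newly harvested imagery to the project so it can be annotated.
rows = [{"uri": f"s3://imagery/roofs/{i}.jpg"} for i in range(100)]
requests.post(
    f"{API_BASE}/projects/{project['id']}/datarows",
    json={"rows": rows},
    headers=headers,
).raise_for_status()
```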
Iterate. A true TDP recognizes that annotated data is never static. It is constantly changing, iterating as more data joins the dataset and the models provide feedback on the efficacy of that data. Indeed, the key to accurate data is iteration. Test the model. Improve the model. Test again. And again and again. A tractor’s smart sprayer might apply herbicide to one kind of weed 50% of the time, but as more images of that weed are added to the training data, future iterations of the sprayer’s computer vision model may boost that to 90% or higher. As other weeds are added to the training data, the sprayer learns to recognize those unwanted plants, too. This can be a time-consuming process, and it generally requires humans in the loop, even if much of the process is automated. Iteration is unavoidable; the goal is to get your models as good as they can be as quickly as possible. The purpose of a TDP is to accelerate those iterations and to make each iteration better than the last, saving time and money.
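As a rough, toy illustration of that loop, the sketch below uses scikit-learn and synthetic data (not a real sprayer model) to show the basic rhythm: fold more labeled examples into the training set, retrain, and re-check held-out accuracy until the model is good enough or the pool is exhausted.

```python
# Toy iterate loop: accuracy generally climbs as more labeled data is added.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

TARGET_ACCURACY = 0.90  # illustrative target, not a benchmark
BATCH = 200             # newly labeled examples added per iteration (assumed)
n_labeled = 200         # size of the starting labeled set

while n_labeled <= len(X_pool):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_pool[:n_labeled], y_pool[:n_labeled])
    score = model.score(X_test, y_test)
    print(f"{n_labeled} labeled examples -> accuracy {score:.0%}")
    if score >= TARGET_ACCURACY:
        break
    n_labeled += BATCH
```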
The future
Just as the shift in the 18th century to standardization and interchangeable parts ignited the Industrial Revolution, so, too, will a standard framework for defining TDPs begin to take AI to new levels. It is still early days, but it’s clear that labeled data — managed through a true TDP — can reliably turn raw data (your company’s precious IP) into a competitive advantage in almost any industry.
But C-suite executives need to understand that tapping the potential riches of AI requires investment. They have three choices today: build, outsource, or buy, and whichever they pick will be expensive. As is often the case with key business infrastructure, there can be enormous hidden costs to building or outsourcing, especially when entering a new way of doing business. A true TDP “de-risks” that expensive decision while maintaining your company’s competitive moat: your IP.
(Disclosure: I work for AWS, but the views expressed here are mine.)
Matt Asay is a Principal at Amazon Web Services. He was formerly Head of Developer Ecosystem for Adobe and held roles at MongoDB, Nodeable (acquired by Appcelerator), mobile HTML5 start-up Strobe (acquired by Facebook), and Canonical. He is an emeritus board member of the Open Source Initiative (OSI).