Free Data!

“What good are wings without the courage to fly?” These words of wisdom come to mind as I consider the open-source craze among leading artificial-intelligence technology providers.

Top firms, including IBM, Google and Facebook, have opened the source code of their artificial-intelligence software tools, making them available for developers to use in their own devices and applications. This is most certainly a good thing, both for the companies themselves and for the AI business generally.

However, open source is only part of the equation. Unlike previous generations of software, AI algorithms are worthless without a dataset to work on. And in contrast to their open-source code policies, these companies maintain a closed-data stance, hoarding their vast information repositories as a competitive advantage for developing better AI technology.

Essentially, these companies have given us wings but denied us the sky. What the top tech firms need is the courage to stop hoarding information and embrace open data, giving the rest of the world access to the information that AI cognitive engines require to attain their full potential.

The data-rich get richer.
In the age of AI, a new 1 percent is arising. This upper, upper crust consists of companies blessed both with machine-learning technology and with large quantities of information.

Companies such as Google, Facebook, Amazon and Microsoft have been dubbed “the Superrich” of the AI business. Although few in number, they reportedly hold a massive advantage over everyone else in machine learning because they have access to vast amounts of clean, structured data.

Such data is needed to train machine-learning algorithms, giving them the basic information they need to function on their own in the real world. For example, an object-recognition algorithm designed to recognize cats in photos is trained on massive numbers of images depicting felines. These images need some structure, i.e., they must be tagged with keywords that correctly indicate they depict cats.

Generally, the more training data there is, the better the algorithm will perform, because more information provides more examples from which to find patterns. Conversely, too little training data can produce algorithms that deliver substandard results, sometimes to the extreme embarrassment of their creators.

Because of this, the usefulness of an AI algorithm is intrinsically tied to the availability of high-quality data. In this regard, AI algorithms are fundamentally different from other types of software, whose code is valuable on its own without any additional data.

Thus, when a company open-sources an AI cognitive engine such as a translation tool, it’s not the same as open-sourcing a piece of traditional software, like a spreadsheet. Without also providing access to the data, open isn’t really open.

Close-minded.
Such data denial is no accident. Rather, it’s part of a deliberate strategy to maintain a competitive advantage. With AI models well known and widely distributed, the dataset is the one commodity that can be locked away and kept from rivals.

That’s why top technology players are hoarding data. For example, IBM didn’t buy The Weather Channel’s data operations because it wanted to know if it’s going to rain in Tallahassee tomorrow.

Weather is one of the biggest factors driving global GDP. By combining The Weather Channel’s vast repository of weather-related information with its Watson AI, IBM can take the lead in forecasting the weather for private businesses, allowing it to do everything from predicting winter energy demand to forecasting crop yields.