Data Assets: Castles in the Air

When somebody begins talking about crypto, my brain shuts down. I’m not blind to the potential benefits of the technology. I support decentralization, transparency, and innovation. My problem with crypto isn’t the underlying technology. It’s the people caught up in the land grab. Most don’t give a shit about using crypto to solve actual problems. They’re in it for a quick buck. Most can’t tell you what a “block” is in a “blockchain.” Crypto today isn’t a technology — it’s a speculative asset.

Fortunately, I don’t hear much about crypto these days. I do, however, hear plenty about another speculative asset — data. With the rise of AI, firms are convinced that their proprietary data is the new gold, oil, or whatever other high-value commodity resonates with investors. Executives pour billions of dollars into platforms like Snowflake to store and organize their precious data assets. If you close your eyes, you can almost hear a focus-grouped version of the crypto story. The future is data; you better buy while you still can.

Data takes many forms. However, most executives are fixated on tabular data. They want neatly organized rows and columns of integers, floats, and strings. They’re obsessed with converting “messy” data to structured data assets that can feed applications and models. They’re told, “If you build it, they will come” — the “they” being AI.

Tabular data has its place, but it’s not the future. In Artificially Human, I talk about large companies going bankrupt before our eyes. Investments in tabular data are an example of what I mean. Firms can’t hype their way to success. At some point, technology fundamentals matter.

Data Refineries

There’s a bustling market for unrefined data. Plenty of companies are eager to purchase your email address and browsing history. Regulators pretend to stand in the way, but the market is winning. We value privacy, but we respond to personalization.

Most firms can’t monetize unrefined data. Nobody cares about the reason code attached to your customer service case. They may care that you called about your type 2 diabetes, but regulators won’t let the company sell that data. For most firms, the value is not in the raw data. It’s in the patterns hiding within the raw data.

Clive Humby understood this point when he coined the phrase “data is the new oil” in 2006. Most executives take his words at face value — data, like oil, is a valuable commodity. What Humby meant is that data isn’t useful in its raw form. It must be refined.

This brings us to today, where organizations invest billions of dollars in data refineries. The “if you build it, they will come” mantra has executives stressing over missing their data cloud commitments. Last week, I met a founder who built an entire business around helping companies stuff more data into Snowflake. Refining data has become the objective rather than a means to an end.

[Chart omitted. Source: IDC]

Here’s the problem. Clive Humby stressed the importance of refining data back in 2006. That was before large language models and other AI tools for processing unstructured data became viable. Heck, the mobile trend was only beginning. Apple wouldn’t announce the iPhone for another year.

Time has moved on, and so has AI. Computers used to require rules codified by humans. Machine learning ushered in an era where computers could find the rules themselves, provided humans first organized and labeled the data. Today, we’re entering an era where machines can find patterns in unstructured data. Data refineries are going the way of oil refineries. They’re still a critical piece of infrastructure, but I wouldn’t be betting the future on them.

Missing Pieces

What’s the harm in accumulating zettabytes of tabular data? Cloud storage is cheap, and you never know where insights might be hiding.

The issue is that tabular data is lossy. You throw away important context when you convert unstructured data to structured data. That was fine when machines were incapable of learning from the additional context. It’s not okay in a world where the most potent algorithms thrive on contextual understanding.

Here’s an example. When you call customer service, the representative on the other end of the line converts your conversation into tabular data. They record a reason code and update the company’s systems accordingly (e.g., changing your mailing address). The full audio of the call may be captured, but it’s often refined into tabular data using techniques like sentiment analysis and quality assurance scoring. Your conversation is stored as a patchwork of tabular data.

Could you reconstruct your original call using only the data the company stores? No. Tabular data is lossy; you can’t recreate the full audio from tables of integers, floats, and strings. Unless you retain the full audio recording, some context is lost forever.
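To make the loss concrete, here is a hedged sketch of the same call stored two ways. The field names and values are invented for illustration; real service-system schemas vary.

```python
# Illustration only: one customer call, stored two ways.
# The structured record is what most service systems keep; the raw
# transcript is what typically gets discarded after refinement.

structured_record = {
    "case_id": 48213,              # hypothetical identifiers and fields
    "reason_code": "ADDR_CHANGE",  # one code summarizing a ten-minute call
    "sentiment_score": 0.62,       # a single float standing in for the caller's mood
    "qa_score": 4,                 # agent quality rating, 1-5
    "resolved": True,
}

raw_transcript = (
    "Caller: Hi, I just moved and need to update my address. Also, last "
    "month's statement confused me -- the co-pay on my supplies doubled "
    "and nobody explained why..."
)

# You can aggregate reason codes and average sentiment across millions of
# rows, but nothing in structured_record recovers the caller's wording, tone,
# or the second issue they raised. That context is gone.
```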

In the past, lossy data collection was fine. Storing raw audio was expensive and, in some cases, prohibited. Converting unstructured audio to structured data allowed you to extract patterns. You could analyze why people were calling, how well agents handled the interactions, and even combine datasets to produce new insights (e.g., drivers of customer churn).

What you can’t do with tabular data is build a complete picture of me as a customer. You can’t find patterns in my language that allow you to communicate more effectively. You can’t look at how my mood changes over multiple calls to understand my state of mind. In short, converting unstructured data to structured data saves you money on storage at the expense of context.

Patterns in Context

I’ve already written about why context matters. I won’t repeat those points in this article. The critical insight is that large AI models with billions of parameters (e.g., GPT-4) can store increasingly complex patterns.

The question is whether the additional context has value. To answer that question, let’s try a thought experiment. Pretend you want to teach a person how to sell used cars. How would you do it?

One approach would be to create a massive spreadsheet that captures data from past sales. It could include basic fields like the make, model, mileage, and other attributes of the vehicle. It could even have data about the market (e.g., comparable sales), past transactions (e.g., day of the week), and buyers (e.g., sentiment score). The new salesperson could have as much time as they like to study the spreadsheet.
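For the sake of the thought experiment, here is a minimal sketch of what a couple of rows in that spreadsheet might contain. The columns and values are invented.

```python
# Hypothetical rows from the used-car sales spreadsheet described above.
# Every column is something that fits in a cell; the values are made up.
past_sales = [
    {"make": "Honda", "model": "Civic", "mileage": 82000,
     "comparable_sale": 11500, "day_of_week": "Saturday",
     "buyer_sentiment": 0.71, "sale_price": 11900},
    {"make": "Ford", "model": "F-150", "mileage": 120000,
     "comparable_sale": 17000, "day_of_week": "Tuesday",
     "buyer_sentiment": 0.44, "sale_price": 16200},
]

# A new salesperson could study these numbers for hours. None of the cells
# capture how the car looked on the lot or how the negotiation actually felt.
```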

Would the approach I described produce an effective salesperson? Perhaps, but it would also require a significant investment. Who is going to gather and clean all that data? How long will it take to find the insights hiding in the spreadsheet?

Is there an easier way? Of course. I could tell the new salesperson my “tricks of the trade” and set them loose. After each interaction, I could coach them on what to do better the next time. All the while, they’re doing productive work instead of mulling over a spreadsheet.

Why is the second approach more efficient (and likely more effective) than the first? I believe it’s because the data in the spreadsheet is missing important context. The spreadsheet doesn’t tell you which cars smell like smoke. It doesn’t show you the micro-expressions on buyers’ faces. In short, the spreadsheet is missing essential data.

AI stands for “artificial intelligence.” Machines think more like humans with each passing day. If you’re building data assets a human wouldn’t find valuable, perhaps it’s time to reconsider your approach.

Pipelines > Refineries

A few months ago, Andrew Ng gave a talk at Stanford University about opportunities in AI. He presented a chart that estimated the value of AI technologies today and in three years. The largest market, by a long shot, was “Supervised learning (Labeling things).” The generative AI market was tiny by comparison.

[Chart omitted: Andrew Ng’s market-size estimates for AI technologies today and in three years. Source: Stanford Online]

I know better than to debate Andrew Ng when it comes to AI. His estimates for today and three years from now seem reasonable. That said, how might the picture change in five years? What about ten years?

The whole point of building data assets is to unlock future value. Three years is a short time horizon. I doubt companies can recoup the billions spent on structured data assets before the “generative AI” market overtakes the “supervised learning” market.

Supervised learning with structured data is not going away. We still code rules for machines by hand, even though machine learning has proven more effective in many situations. That said, the future is machines learning from unstructured data — like humans.

If you buy my arguments, the most valuable data in the future will be lossless. It’ll include sensory data similar to what we use to train humans (e.g., video, audio) and extrasensory data beyond our reach (e.g., LiDAR).

Storing lossless data at scale is still cost-prohibitive today. Besides, if machines learn like humans, a little lossless data can go a long way. There’s little point in hoarding it now.

I would slow down data refining investments and ramp up data pipeline investments. Given the arc of AI progress, allocating more capital toward Internet of Things (IoT) devices, knowledge codification, and high-context storage (e.g., vector databases) makes sense.
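As a hedged sketch of what “high-context storage” means in practice, the example below keeps raw call transcripts and indexes them by vector similarity. The embed() function is a placeholder; a real pipeline would call an actual embedding model and a vector database, but the shape is the same: retain the unstructured record and make it searchable by meaning.

```python
# Minimal sketch of high-context storage: keep full transcripts (the context)
# plus vectors for similarity search, instead of reducing calls to reason codes.
# embed() is a placeholder -- swap in a real embedding model (hosted API or
# local encoder) for meaningful semantic similarity.

import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Placeholder embedding: a pseudo-random unit vector derived from the text."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    vec = rng.standard_normal(dim)
    return vec / np.linalg.norm(vec)

class CallStore:
    """An in-memory stand-in for a vector database."""

    def __init__(self) -> None:
        self.transcripts: list[str] = []
        self.vectors: list[np.ndarray] = []

    def add(self, transcript: str) -> None:
        # Store the raw transcript itself, not just a summary of it.
        self.transcripts.append(transcript)
        self.vectors.append(embed(transcript))

    def search(self, query: str, k: int = 3) -> list[str]:
        query_vec = embed(query)
        # Cosine similarity; all vectors are unit-length, so a dot product suffices.
        scores = np.array([query_vec @ vec for vec in self.vectors])
        top = np.argsort(scores)[::-1][:k]
        return [self.transcripts[i] for i in top]

store = CallStore()
store.add("Caller upset about a billing error on the March invoice...")
store.add("Caller asked how to update a mailing address after moving...")

# With a real embedding model, the billing-related transcript would rank first.
print(store.search("customers frustrated by incorrect charges", k=1))
```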

If you still believe your data assets will fuel future growth, let me leave you with one last point to consider. Crypto evangelists are insufferable because there’s little substance behind their stories. They’re investing in narratives, not technologies. If you’re dumping millions into data assets, perhaps it’s worth asking if you’re doing the same.
