Time: The Final Frontier?

I had no intention of writing an article tonight. I was sitting on my back patio, watching Succession on my iPad as I nursed myself back to health from whatever variant of COVID is currently making the rounds. Then, a close friend sent me a chart from McKinsey & Company and asked, “I feel like the mark on the wall for general AI has been 2027 for a while, but this chart suggests the consensus is growing. Are we in our final days?!?”

If you’ve read Artificially Human or any other content I’ve written, you’ll know I’m more bullish than most on the pace of AI advancement. I try to ignore the debate around artificial general intelligence (AGI) since I believe machines will collectively achieve humanity's capabilities before a single machine achieves the capabilities of a single human.

Unfortunately, after another 28 text messages with my friend, I can’t get his question out of my head. Are we only a few short years from achieving AGI? If not, why? It certainly seems like we’re headed in that direction. What could stall the relentless march of AI?

Logan, Shiv, Roman, Kendall, and Connor can wait. I’ve got ideas to write down before the NyQuil kicks in. Let’s get started…

Meandering up to the ledge

I find the case for machines achieving AGI within a few years easier to make than you might expect. I see little evidence that human intelligence is anything more than pattern recognition. It’s incredibly complex pattern recognition, but it’s pattern recognition nonetheless. Our brains find patterns in sensory data and use those patterns to make predictions about the world. Mechanical minds do the same, albeit with plenty of help from their biological mentors.

If you want evidence for what I’m saying, look at the evolution of large language models (LLMs). Progress was slow for a long time. We increased the size of LLMs with little reward. Then, around 10²³ training FLOPs, things got interesting.
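Where does a number like 10²³ come from? A rough rule of thumb is that training compute scales with roughly six times the product of parameter count and training tokens. The sketch below is back-of-the-envelope only; the figures are approximate and not an accounting of any specific model.

```python
# Rough rule of thumb: training compute ~ 6 * parameters * tokens
# (forward + backward pass). Figures are approximate, for illustration only.

def training_flops(params: float, tokens: float) -> float:
    """Estimate total training FLOPs for a dense transformer."""
    return 6 * params * tokens

# A GPT-3-scale run: ~175B parameters trained on ~300B tokens.
flops = training_flops(175e9, 300e9)
print(f"{flops:.2e} FLOPs")  # ~3.2e+23 -- right around the threshold where things got interesting
```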

A couple of generations later, we’re looking at LLMs that can outcompete humans on several language tasks. We didn’t suddenly discover a magic formula. We added more and more parameters until the models could store as many complex language patterns as a human.

A similar trend is underway for other modalities, including images and speech. If you think machines became super-smart, super-quick within individual modalities, wait until you see what they do with data that spans modalities. GPT-4 isn’t simply a larger language model than GPT-3. It’s a multimodal system that brings together image and text data. Context matters, and combining multimodal training data could lead us to the precipice of AGI.

What if it doesn’t?

Suppose 2027 comes and goes with no signs that we’re remotely close to AGI. What happened? Did Moore’s Law run out of steam? Is human intelligence more than recognizing patterns in data? Maybe, but I think there could be another barrier that doesn’t get as much attention.

Think about how you learned to be a human. Did you read a bunch of books? Did you look at a bunch of images? Sure, but you did those things over time. Experience is not the same as information. Experience is information gathered through time.

In AI research, I think we underestimate the importance of temporal data. Within each modality, temporal data is already embedded. The order of words in a sentence is temporal data. The sequence of images in a video is also temporal data. The problem is that the datasets we’re using to train our models typically don’t include temporal data across modalities.

Here’s a simple example. Let’s say you’re trying to create a multimodal representation of an apple. We have plenty of training data that links visual patterns (e.g., round, red) to other modalities like text (e.g., the word “apple”), sounds (e.g., the “crunch” an apple makes), and tastes (e.g., sweet). However, even if you encode all those patterns, have you given the machine everything it needs to understand an apple?

I would argue “no,” there’s still a missing piece of data — time. The auditory “crunch” precedes the sweet taste when you bite an apple. Where is that data encoded in our zettabytes of training data? An apple is a silly example, but hopefully, this illustrates my point. What if achieving AGI requires us to encode patterns across modalities AND through time?
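To make the missing piece concrete, here’s a minimal sketch of what a cross-modal training record with timestamps might look like. The schema and field names are hypothetical; nothing like this is standard in today’s image-caption or audio-transcript datasets.

```python
from dataclasses import dataclass

@dataclass
class SensoryEvent:
    """One observation in one modality, stamped with when it occurred."""
    t: float        # seconds since the start of the episode
    modality: str   # "vision", "audio", "taste", "text", ...
    payload: str    # placeholder for the actual embedding or raw data

# A hypothetical "biting an apple" episode. Today's multimodal datasets pair
# these observations (image + caption, audio + transcript) but rarely preserve
# the ordering across modalities -- that the crunch comes before the sweetness.
episode = [
    SensoryEvent(t=0.0, modality="vision", payload="round, red object"),
    SensoryEvent(t=1.2, modality="audio",  payload="crunch"),
    SensoryEvent(t=1.5, modality="taste",  payload="sweet"),
    SensoryEvent(t=2.0, modality="text",   payload="apple"),
]

for event in sorted(episode, key=lambda e: e.t):
    print(event.t, event.modality, event.payload)
```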

This is where we could hit a wall. We know it’s more computationally intensive to process data with temporal dimensions. Training an algorithm to recognize images of stop signs is simple. Training an algorithm to do the same in data that includes a temporal dimension (video) is decidedly more difficult. Ask the companies struggling to train self-driving cars.
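Some back-of-the-envelope arithmetic shows why. The resolution and frame rate below are arbitrary but typical choices, and this only counts raw input values, before any modeling of motion across frames.

```python
# How much more raw input a model has to digest once you add a temporal dimension.

image_values = 224 * 224 * 3            # one RGB image at 224x224
clip_values  = image_values * 30 * 10   # a 10-second clip at 30 frames per second

print(f"single image:   {image_values:,} values")          # 150,528
print(f"10s of video:   {clip_values:,} values")           # 45,158,400
print(f"blow-up factor: {clip_values // image_values}x")   # 300x
```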

What’s my verdict?

I think there’s a non-zero chance we achieve AGI by 2027. Perhaps the temporal data in each modality’s training sets is sufficient for machines to learn about the world. I may not be able to see it with my primitive mind, but that doesn’t mean it’s not there.

That said, I think we’re going to need a bigger boat. Temporal data across modalities matters more than we currently give it credit for. If that’s the case, it could be decades before we have the computing power and data required to train AGI-level machines. It may even require multimodal representations of our world in virtual environments where machines can learn like humans. If you think that all sounds fanciful, check out Apple’s announcement of the Vision Pro. It’s going to happen one way or another. It’s a matter of time.

I’m heading back to my NyQuil-fueled Succession marathon. Before I go, let me leave you with one piece of advice. Don’t try to predict when AGI will arrive. Humans suck at predicting exponentials. Remember COVID? I sure do. Life is more enjoyable when you aren’t concerned about things you can’t predict or control.
