Look Who’s Talking: Speech as a Modality

Well, this is annoying. I spent the past week working on an outline for my monthly article. It was coming along nicely.

This is not that article.

On Monday, OpenAI released GPT-4o during its Spring Update. I watched the demo like millions of other people. Now, I can’t stop thinking about the implications of speech as a language model modality.

OpenAI's Spring Update

I hoped other authors would capture what was on my mind so I could move on with life (and my monthly article). Unfortunately, most of the press and social media coverage misses the point. GPT-4o isn’t a flirty companion or the AI in Her. It’s a shift in how we interact with AI.

Including speech as a modality may not seem like a big deal. We’ve had speech-to-text models like Whisper for years. We’ve also had text-to-speech models like those from ElevenLabs. I use these models daily to produce the audio for Paper-to-Podcast.

However, there’s a difference between augmenting generative AI models with speech capabilities and building speech directly into the models. I’ve been thinking about the implications of this advancement for two years.

My other article will have to wait.

Look Ma, No Hands

Humans aren’t designed to use webpages and mobile apps. Our ancestors didn’t invent the QWERTY keyboard to communicate with each other. They invented it to communicate with machines.

Computers require digital, structured data, while we live in a world of analog, unstructured data. Webpages and mobile apps are the portals through which we access machine intelligence.

Don’t believe me? Try booking a flight over the phone using your favorite airline’s interactive voice response (IVR) system. Once you’re done screaming at the machine, book the same flight using the company’s website.

Both channels use the same backend systems. The reason you prefer the website isn’t because you enjoy typing. It’s because the website allows you to access the airline’s machine intelligence in a predictable way.

Siri doesn’t suck because the concept of voice assistants is flawed. It sucks because the underlying model is flawed. The fact we use voice assistants at all is a testament to how much we prefer speaking to typing.

ChatGPT exists because most people can’t access the GPT-4 model using the application programming interface (API). OpenAI had to build a website on top of its model to reach 100 million monthly active users. The website doesn’t make the model more robust; it makes the model more accessible.

I’m sure GPT-4o is better, faster, and cheaper than GPT-4. But that’s not what interests me. I’m interested in accessing machine intelligence directly through speech. That’s an interface accessible to nearly every human on earth, regardless of technical proficiency.

Humane and Rabbit released speech-based AI hardware in recent months to scathing reviews. The underlying AI models weren’t ready for prime time, and the companies made promises they couldn’t keep.

Humane AI pin — form without function

That said, Humane and Rabbit have the right idea. Websites and mobile apps aren’t the best interfaces for advanced AI systems. They’re designed for a world where AI is dumb and behaves unpredictably. The optimal interface for AI is speech, and GPT-4o is a step in that direction.

Information Superhighway

Let’s say I have data in my head that I want to transfer to you. For example, I recently learned about how different return-to-work policies affected shifts in the U.S. labor market. How can you benefit from my recently acquired knowledge?

One option would be to connect our brains directly using an interface like Neuralink. That would give you access to everything I know, including my insights on return-to-work policies.

Too bad there’s no way I’m implanting a quarter-size device in my skull anytime soon. We need to find another way.

How about I take the ideas in my head, convert them into words, and email those words to you? You could then read the email to obtain the knowledge in my head. Given how much time we spend writing and reading emails and text messages, isn’t that the best method?

Not really. The average typing speed is around 40 words per minute. If you’re fast, you may get up to 100 words per minute. That’s still a pretty low transfer speed. Email and text messages make sense when the sender’s time is less valuable than the recipient’s or when synchronous communication is difficult (e.g., time zone differences).

Speed isn’t the only issue with email and text communications. These mediums also lack context and feedback. You can’t hear the inflections in my voice or interrupt me with questions. I word-vomit what’s in my head and hope you interpret it the way I intend.

Source: ChatGPT (GPT-4o struggled as much as I did to illustrate speech as a modality)

The middle ground between brain-computer interfaces and text-based methods is conversation. Humans speak at about 150 words per minute. Not only is speech faster than typing, but it also includes context that’s missing in written communications.

Suppose you respond “thanks” to my return-to-work policy email. Are you grateful, or are you typing the fewest characters possible before hitting delete? There’s no way for me to tell.

If we’re having a conversation, I can tell if you’re interested. I can adjust based on your questions. I can stop talking once you have the information you need.

We’ve learned to accept the bandwidth limitations of digital channels. If you’re using a website or mobile app like ChatGPT, you’re sending and receiving information at less than half the rate at which you exchange it with other humans.
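To put rough numbers on that gap, here’s a quick back-of-the-envelope calculation in Python. The words-per-minute figures are the averages cited above; the 500-word message length is an arbitrary example, not data from any study.

```python
# Back-of-the-envelope comparison of information transfer rates.
# The words-per-minute figures are the rough averages cited in the text;
# the 500-word message is an arbitrary example.
TYPING_WPM = 40        # average typing speed
FAST_TYPING_WPM = 100  # a fast typist
SPEAKING_WPM = 150     # average conversational speech

def transfer_minutes(words: int, wpm: float) -> float:
    """Minutes needed to move `words` at a rate of `wpm` words per minute."""
    return words / wpm

MESSAGE_WORDS = 500  # roughly a detailed email's worth of explanation

for label, wpm in [("typing", TYPING_WPM),
                   ("fast typing", FAST_TYPING_WPM),
                   ("speaking", SPEAKING_WPM)]:
    print(f"{label:>12}: {transfer_minutes(MESSAGE_WORDS, wpm):.1f} minutes")

# Output:
#       typing: 12.5 minutes
#  fast typing: 5.0 minutes
#     speaking: 3.3 minutes
```

Typing at 40 words per minute moves information at barely a quarter of conversational speed, so “less than half the rate” is, if anything, a conservative claim.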

Adding a speech modality to language models isn’t only about convenience and accessibility. It’s also about bandwidth. Accessing machine intelligence at a rate of 40 words per minute is frustrating. It’s like having a punchbowl of human knowledge at your fingertips and trying to access it through a soggy paper straw. That won’t be the case for long.

Stream of Consciousness

If speech is so great, why don’t more people use voice assistants? Why is my Amazon Alexa gathering dust and Siri so desperate for attention that she eavesdrops on random conversations?

It’s because speech has, until this point, been an interface layer atop large language models. Your utterances are converted to text devoid of emotional context before being passed to the AI model. The model then responds with text, which is converted back to speech.
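For concreteness, here’s a minimal sketch of that cascaded setup, assuming the OpenAI Python SDK’s transcription, chat, and text-to-speech endpoints. The model names and file paths are illustrative, not a description of how any particular assistant is actually built.

```python
# A sketch of speech as an *interface layer*: three separate models chained
# together, with plain text as the only thing passed between them.
# Assumes the OpenAI Python SDK; model names and file paths are illustrative.
from openai import OpenAI

client = OpenAI()

# 1. Speech-to-text: the transcript keeps the words but drops tone,
#    inflection, pacing, and background sound.
with open("user_question.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# 2. Language model: reasons over text only.
completion = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)
reply_text = completion.choices[0].message.content

# 3. Text-to-speech: the voice is synthesized after the thinking is done,
#    so it can't reflect anything the model never "heard" in step 1.
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=reply_text,
)
with open("assistant_reply.mp3", "wb") as out:
    out.write(speech.content)
```

Every hand-off in that chain is plain text, so everything beyond the words themselves is gone before the model starts reasoning.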

The problem with speech as an interface layer is that it fails to take full advantage of AI capabilities. Speech processing occurs independently of thought.

Let’s return to the earlier example of booking a flight using your airline’s IVR system. If you say, “I want to book a flight from Chicago to Boston,” while your dog barks at the mailman, you rely on the speech-to-text algorithm to silence your canine companion. When speech is processed directly by the language model, your dog’s barking might alert the AI to ask if you’re traveling with a pet.

Next time you’re conversing, pay attention to the order in which you think and speak. You may form high-level ideas before opening your mouth, but you don’t craft every sentence before saying it. Thinking and speaking are one process. That’s what a speech modality does for language models.

In the coming months, we will hear plenty of stories about how GPT-4o is laggy and unpredictable. We heard similar rants about early versions of the text-only models (e.g., GPT-3). AI firms will fix the bugs. What will remain are AI models that think and speak seamlessly, like humans.

Like Us

In Artificially Human, I dedicated a chapter to describing how machines are becoming more like us at the same time as we’re becoming more like them. Adding a speech modality to large language models is the latest step in the march toward convergence.

ChatGPT has more than 100 million monthly active users. That’s a lot of people, but it’s only 1.2 percent of the global population. AI still has an adoption problem.

GPT-4o was rolled out with minimal hype. The headline of “Spring Update” didn’t exactly signal the beginning of a new generation of AI models. Sam Altman wasn’t even on stage for the event.

Perhaps OpenAI is still trying to figure out how to talk about its latest innovation. Maybe the company wants to keep things low-key because they’re still testing the model. I think there’s another reason.

Most people in the AI community expect OpenAI to release GPT-5 later this year. In an interview with Lex Fridman, Sam Altman said the company planned to release other important things first. GPT-4o is one of those “other important things.”

I think OpenAI has learned from experience. The company released GPT-3 to little fanfare. It wasn’t until the interface, ChatGPT, was released that adoption skyrocketed. OpenAI is reversing its deployment strategy. The company is releasing the interface first, adding speech as a modality to GPT-4.

We’ve been meeting machines where they are, using websites and mobile apps to access artificial intelligence. GPT-5 is poised to meet humans where we are. That’s how we’ll eventually solve the adoption problem.
