Voice interfaces have been around for years, but let’s face it: Thus far, they’ve been pretty dumb. We need not dwell on the indignities of automated phone trees (“If you’re calling to make a payment, say ‘payment’”). Even our more sophisticated voice interfaces have relied on speech but somehow missed the power of language. Ask Google Now for the population of New York City and it obliges. Ask for the location of the Empire State Building: good to go. But go one logical step further and ask for the population of the city that contains the Empire State Building and it falters. Push Siri too hard and the assistant just refers you to a Google search. Anyone reared on scenes of Captain Kirk talking to the Enterprise’s computer or of Tony Stark bantering with Jarvis can’t help but be perpetually disappointed.
Ask around Silicon Valley these days, though, and you hear the same refrain over and over: It’s different now.
One hot day in early June, Keyvan Mohajer, CEO of SoundHound, shows me a prototype of a new app that his company has been working on in secret for almost 10 years. You may recognize SoundHound as the name of a popular music-recognition app—the one that can identify a tune for you if you hum it into your phone. It turns out that app was largely just a way of fueling Mohajer’s real dream: to create the best voice-based artificial-intelligence assistant in the world.
The prototype is called Hound, and it’s pretty incredible. Holding a black Nexus 5 smartphone, Mohajer taps a blue and white microphone icon and begins asking questions. He starts simply, asking for the time in Berlin and the population of Japan. Basic search-result stuff—followed by a twist: “What is the distance between them?” The app understands the context and fires back, “About 5,536 miles.”
Then Mohajer gets rolling, smiling as he rattles off a barrage of questions that keep escalating in complexity. He asks Hound to calculate the monthly mortgage payments on a million-dollar home, and the app immediately asks him for the interest rate and the term of the loan before dishing out its answer: $4,270.84.
“What is the population of the capital of the country in which the Space Needle is located?” he asks. Hound figures out that Mohajer is fishing for the population of Washington, DC, faster than I do and spits out the correct answer in its rapid-fire robotic voice. “What is the population and capital for Japan and China, and their areas in square miles and square kilometers? And also tell me how many people live in India, and what is the area code for Germany, France, and Italy?” Mohajer would keep on adding questions, but he runs out of breath. I’ll spare you the minute-long response, but Hound answers every question. Correctly.
Hound, which is now in beta, is probably the fastest and most versatile voice recognition system unveiled thus far. It has an edge for now because it can do speech recognition and natural language processing simultaneously. But really, it’s only a matter of time before other systems catch up.
After all, the underlying ingredients—what Kaplan calls the “gating technologies” necessary for a strong conversational interface—are all pretty much available now to whoever’s buying. It’s a classic story of technological convergence: Advances in processing power, speech recognition, mobile connectivity, cloud computing, and neural networks have all surged to a critical mass at roughly the same time. These tools are finally good enough, cheap enough, and accessible enough to make the conversational interface real—and ubiquitous.
But it’s not just that conversational technology is finally possible to build. There’s also a growing need for it. As more devices come online, particularly those without screens—your light fixtures, your smoke alarm—we need a way to interact with them that doesn’t require buttons, menus, and icons.
At the same time, the world that Jobs built with the GUI is reaching its natural limits. Our immensely powerful onscreen interfaces require every imaginable feature to be hand-coded, to have an icon or menu option. Think about Photoshop or Excel: Both are so massively capable that using them properly requires bushwhacking through a dense jungle of keyboard shortcuts, menu trees, and impossible-to-find toolbars. Good luck just sitting down and cropping a photo. “The GUI has topped out,” Kaplan says. “It’s so overloaded now.”
That’s where the booming market in virtual assistants comes in: to come to your rescue when you’re lost amid the seven windows, five toolbars, and 30 tabs open on your screen, and to act as a liaison between apps and devices that don’t usually talk to each other.
You may not engage heavily with virtual assistants right now, but you probably will soon. This fall a major leap forward for the conversational interface will be announced by the ding of a push notification on your smartphone. Once you’ve upgraded to iOS 9, Android 6, or Windows 10, you will, by design, find yourself spending less time inside apps and more chatting with Siri, Google Now, or Cortana. And soon, a billion-plus Facebook users will be able to open a chat window and ask M, a new smart assistant, for almost anything (using text—for now). These are no longer just supplementary ways to do things. They’re the best way, and in some cases the only way. (In Apple’s HomeKit system for the connected house, you make sure everything’s off and locked by saying, “Hey Siri, good night.”)
At least in the beginning, the idea behind these newly enhanced virtual assistants is that they will simplify the complex, multistep things we’re all tired of doing via drop-down menus, complicated workflows, and hopscotching from app to app. Your assistant will know every corner of every app on your phone and will glide between them at your spoken command. And with time, they will also get to know something else: you.
by David Pierce, Wired
Image: Francesco Muzzi