Speech Recognition and Synthesis in the Browser

15 Jun, 2018

With the recent rise of Siri, Google Assistant and Amazon’s Alexa, speech recognition and synthesis have become increasingly important tools in the developer’s toolbox. Working with speech data can not only improve the accessibility of your application, it can also increase conversion in your webshop, especially when customers shop on their mobile phones. Native apps have a large advantage in this space, as Apple’s SiriKit and Google’s Assistant SDK can get you up and running in anywhere from a few minutes to a few hours.

For browsers, the story is different. If we disregard native apps for a second, plain old websites (or progressive web apps) seem to lag severely behind. How many websites can you name off the top of your head that let you search with your voice? I can only think of a handful. Is it because people are scared to talk to their laptops, but are comfortable telling stories to their smartphones? Probably not. The biggest problem is that the easy solutions provided by libraries like SiriKit are not there for the web. Or are they?

To do speech recognition and synthesis you need a massive amount of training data for all kinds of languages in all kinds of settings. As a lone wolf or small startup, accruing that amount of data is near-impossible. So, you go to the companies that have already managed to train their neural networks: the Googles and Amazons. For a small amount of money per transcribed audio second, you can send your audio files to Google’s Cloud Speech-to-Text API or Amazon Transcribe. They will transcribe the audio with sub-second response times. Although the technology is cool, setting it up can be a hassle, especially if you want to use the audio recorder provided in the browser. The recording format of the browser does not always match the requirements of the cloud transcribers. In that case you have to hit another backend to transcode the files into something that works. Yes, I know, ugh.
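To make the format mismatch concrete, here is a minimal sketch of how you could check up front whether the browser can record in a format your transcription service accepts. The function name `pickRecordingFormat` and the list of accepted formats are assumptions for illustration; check your provider's documentation for the real list. The only browser API used is `MediaRecorder.isTypeSupported`.

```javascript
// Sketch only: `pickRecordingFormat` and the candidate list are hypothetical.
// `Recorder` is expected to be the browser's MediaRecorder constructor.
function pickRecordingFormat(Recorder, candidates) {
  // MediaRecorder.isTypeSupported reports whether this browser can record
  // the given MIME type; return the first candidate it accepts.
  const supported = candidates.find((type) => Recorder.isTypeSupported(type));
  return supported || null;
}

// Formats a transcription service is ASSUMED to accept, in order of preference.
const ACCEPTED_BY_SERVICE = [
  "audio/ogg;codecs=opus",
  "audio/webm;codecs=opus",
  "audio/wav",
];

// Only runs where MediaRecorder exists (i.e. in a browser).
if (typeof MediaRecorder !== "undefined") {
  const format = pickRecordingFormat(MediaRecorder, ACCEPTED_BY_SERVICE);
  console.log(
    format === null
      ? "No shared format: audio would need transcoding on a backend first."
      : `Record with ${format} and upload as-is.`
  );
}
```

If the function returns `null`, you are in the situation described above and need the extra transcoding step.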

When I recently had to implement a proof-of-concept voice search with some of my colleagues, we came across a 2012 (!) specification for speech recognition called the Web Speech API Specification. In it we found several interfaces, like SpeechRecognition and SpeechSynthesis, which seemed to do exactly what we were looking for. Another round of Googling brought us to the MDN page on the Web Speech API. To our big surprise, the browser support table showed the following:

Desktop support for Web Speech API

You don’t say! Chrome has had support since around 2013, when Chrome 25 was released. Come to think of it, it is even weirder now that so few websites seem to make use of the native APIs that are supported in Chrome, Firefox and Edge. Disregarding the part of the user base that still uses Internet Explorer, we continued on our journey.

We based our implementation on a repository that accompanies the MDN page (available here). After downsizing it a little bit, our implementation looked something like the following:
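The listing below is a minimal sketch reconstructed from the description in this post, so treat it as an approximation and not as the exact code we ran. The function name `startListening` and its `scope` parameter are assumptions added so the code can be exercised outside a browser; in a page you would pass `window`. The language code sits on line 6 of the sketch.

```javascript
// Sketch only: a reconstruction from the description, not the original listing.
// `startListening` and `scope` are hypothetical names; pass `window` in a browser.
function startListening(scope, log = console.log) {
  const Recognition = scope.SpeechRecognition || scope.webkitSpeechRecognition;
  const recognition = new Recognition();
  recognition.lang = "id-ID"; // BCP 47 tag for Bahasa Indonesia

  // Each result holds one or more alternatives; log the best transcript.
  recognition.onresult = (event) => {
    const best = event.results[event.results.length - 1][0];
    log(best.transcript, best.confidence);
  };
  recognition.onerror = (event) => log("Recognition error:", event.error);

  // Starting triggers the microphone permission prompt; wait 3 seconds first.
  scope.setTimeout(() => recognition.start(), 3000);
  return recognition;
}

// Only runs in a browser; elsewhere the function is merely defined.
if (typeof window !== "undefined") {
  startListening(window);
}
```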

And it worked! This piece of code will ask for your permission to use the microphone after 3 seconds. If you click ‘allow’, it will then listen to how well you pronounce Bahasa Indonesia. Finally, it logs the transcribed results in your console. If your Bahasa is rusty, you can of course change the language code on line 6 to any of the supported languages shown in this demo. Note that these are the languages available in Chrome; other browsers might have a different list.

The Web Speech API seems to do a pretty good job of transcribing user commands in various languages. By default, the recording will stop as soon as the user stops speaking. If you want to transcribe longer pieces of audio, you can also set recognition.continuous to true. In production environments, you would first want to check support for the SpeechRecognition interface and handle its absence. Also, it might be wise to only start recording after the user clicks a button instead of triggering it with a window.setTimeout(). But I’ll leave that as an exercise for you.
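If you want a head start on that exercise, the following is one possible sketch, not a definitive implementation. It checks for the constructor (Chrome exposes it as `webkitSpeechRecognition`), optionally sets `continuous`, and starts recording on a button click. The names `getRecognitionConstructor` and `attachVoiceButton`, the `scope` parameter and the element id `voice-search` are all assumptions made for illustration.

```javascript
// Sketch only: `getRecognitionConstructor`, `attachVoiceButton` and the
// element id "voice-search" are hypothetical names for illustration.

// Returns the SpeechRecognition constructor, or null when unsupported.
function getRecognitionConstructor(scope) {
  return scope.SpeechRecognition || scope.webkitSpeechRecognition || null;
}

// Wires a button to start recognition on click instead of using a timer.
// Returns the recognition instance, or null when the API is missing.
function attachVoiceButton(scope, button, onTranscript, { continuous = false } = {}) {
  const Recognition = getRecognitionConstructor(scope);
  if (Recognition === null) {
    button.disabled = true; // absence handled: no voice input is offered
    return null;
  }
  const recognition = new Recognition();
  recognition.continuous = continuous; // true keeps listening across pauses
  recognition.onresult = (event) => {
    // In continuous mode results accumulate; report only the new ones.
    for (let i = event.resultIndex; i < event.results.length; i += 1) {
      onTranscript(event.results[i][0].transcript);
    }
  };
  button.addEventListener("click", () => recognition.start());
  return recognition;
}

// Only runs in a browser that has a matching button on the page.
if (typeof document !== "undefined") {
  const button = document.getElementById("voice-search");
  if (button) attachVoiceButton(window, button, (text) => console.log(text));
}
```

Disabling the button is just one way to handle a missing API; falling back to a plain text input would work equally well. A real implementation would probably also guard against clicks while recognition is already running.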

Speech recognition is a breeze to implement using the Web Speech API. Speech synthesis is even easier to implement; you can see the proof here. Adding speech recognition to your website can really improve the user experience when it comes to searching or shopping. Inexperienced customers might like to express their intent in natural language, describing what they want to do or find. In traditional UIs, this is often not possible. Most search inputs force you to type in a condensed kind of language that does not even remotely resemble a grammatically correct sentence or action, something which speech recognition might solve. Speech synthesis can in turn help in the realm of chatbots, notifications or messaging, amongst others. In short, smart assistant technologies can not only help us in our homes or on our phones, we can also use them anywhere on the web where natural language is a good way to communicate with your users. So, how are you going to use it in your app?
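As one possible starting point for the synthesis side, here is a minimal sketch built on `SpeechSynthesisUtterance` and `speechSynthesis.speak`. The function name `speak`, its `scope` parameter and the example sentence are assumptions for illustration. Browsers may refuse to speak before the user has interacted with the page, so the sketch waits for a first click.

```javascript
// Sketch only: `speak` and `scope` are hypothetical names; pass `window`
// in a browser.
function speak(scope, text, lang = "en-US") {
  if (!scope.speechSynthesis || !scope.SpeechSynthesisUtterance) {
    return null; // speech synthesis is not available here
  }
  const utterance = new scope.SpeechSynthesisUtterance(text);
  utterance.lang = lang; // the browser picks a matching voice if it has one
  scope.speechSynthesis.speak(utterance); // adds the utterance to the queue
  return utterance;
}

// Only runs in a browser: speak once, after the first click on the page.
if (typeof document !== "undefined") {
  document.addEventListener(
    "click",
    () => speak(window, "Your order has been shipped."),
    { once: true }
  );
}
```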

Léon Rodenburg
Léon Rodenburg is a full stack development consultant at Xebia. He has a background in Computer Science and Sinology and is always on the lookout for the crossroads between the two. Having lived and studied in China for quite some time, he has put his knowledge of the Chinese language into practice by experiencing on- and offline daily life in Beijing like a local.
