It wasn’t long ago that being put through to a robot on a company’s phone line was, to say the least, a painful experience. I’m sure, like me, you’ve had that moment where you wanted to hang up out of pure frustration at not being understood. Fortunately, speech recognition has vastly improved in recent years, with high performance computing (HPC), and forward-thinking companies like Speechmatics, at the heart of these improvements.
The power of HPC has opened doors to many technological advances in business and commerce, and none more so than in speech recognition and natural language translation. Translating the human voice has become a big, multi-million-pound business.
Back in October of last year, LivePerson acquired VoiceBase and Tenfold, both of which use artificial intelligence (AI) and complex natural language processing to turn audio into structured, rich data. This was closely followed by Microsoft closing its approximately $16 billion acquisition of speech recognition company Nuance, an early pioneer of speech recognition AI, best known as the source of the speech recognition engine for Siri, Apple’s ever-popular personal assistant.
And, even more recently, Speechmatics, the global experts in deep learning and speech recognition whose latest HPC cluster is based within our Harlow data centre, announced it had raised an impressive $62m in Series B funding.
Of course, digital personal assistants like Siri and Alexa are where most people will have encountered this technology, but it is also being used to make driving safer through voice-activated navigation, save time for medical staff by automating dictation, and increase security with voice-based authentication.
As the LivePerson acquisition suggests, customer service is another area where voice recognition is growing rapidly in significance. Most people prefer voice requests to typing or clicking through menus, but few companies can afford to have a person available to handle every discussion, and customers are increasingly complaining about ‘call centre fatigue’.
Technology now makes it easier to handle some of those interactions efficiently, and it allows organisations to collect voice data. The data can be used to train AI-powered neural networks to understand speech better, but organisations naturally want to analyse it further, and AI can help here too.
As a result, the global voice recognition market is booming, and was valued at $10.7 billion in 2020. This is expected to be worth $27.2 billion by 2026 and will include several overlapping technology types and use cases.
Voice recognition could be used by dictation software to convert the content of a meeting into a text document, for example. To do that, the computer needs to identify the words spoken but it doesn’t have to be very concerned about their meaning.
That changes somewhat when a voice assistant needs to tell you about the weather by ‘understanding’ a range of commands. In most cases, though, the device will recognise a variety of ways of asking the question but not every single one. These tools can also be used to analyse large quantities of audio and identify key qualities, such as sentiment.
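To make that distinction concrete, here is a minimal, purely illustrative Python sketch using the open-source Hugging Face transformers library (not a tool mentioned in this article): one pipeline transcribes audio into text, and a second one scores the sentiment of the transcript. The model name and the audio file path are assumptions made for the sake of the example.

```python
# Illustrative sketch only: off-the-shelf speech-to-text followed by sentiment analysis.
# Assumes the `transformers` package is installed and ffmpeg is available to decode the audio.
from transformers import pipeline

# 1. Speech-to-text: identify the words spoken, with little concern for their meaning.
transcriber = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")  # model choice is an assumption
transcript = transcriber("meeting_audio.wav")["text"]  # hypothetical audio file

# 2. Analyse the resulting text for a key quality, such as sentiment.
analyser = pipeline("sentiment-analysis")
print(transcript)
print(analyser(transcript))  # e.g. [{'label': 'POSITIVE', 'score': 0.98}]
```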
These solutions are constantly evolving. For example, Speechly, a Finnish start-up, recently patented a new technological approach that combines speech-to-text with natural language understanding in a novel way. The company claims that this enables faster and more complex voice interactions than current solutions.
All this is done with a variety of algorithms, which use variations of grammar rules, probability, and speaker recognition to identify and classify spoken phrases. The difference between a good speech recognition algorithm and a poor one is accuracy rate and speed. Companies want the system to work as quickly as possible, with minimal errors.
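The usual yardstick for that accuracy is word error rate (WER): the number of word substitutions, insertions and deletions needed to turn the system’s transcript into a human reference transcript, divided by the number of words in the reference. Here is a rough Python sketch, with a made-up example phrase, simply to show the arithmetic:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table of edit distances between prefixes of the two word lists.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One misrecognised word out of five gives a 20% word error rate.
print(word_error_rate("please check my account balance", "please check my count balance"))  # 0.2
```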
Key to all of this, of course, is data: voice data that is tagged and categorised so AI can understand what it is and how it relates to everything else in the sphere of the spoken word. As we’ve already seen with visual data (behind, say, autonomous driving), labelling data is a labour-intensive process and the end product is only as good as the initial input. It’s accurate, but limited, not only in terms of speed and volume but also in capturing the infinitely varied intricacies of global speech: tone, delivery, accents, slang and so on.
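To show what ‘tagged and categorised’ can look like in practice, here is one hypothetical labelled record; every field name and value below is invented purely for illustration, not a real dataset schema:

```python
# Hypothetical example of a single labelled voice-data record (field names are illustrative only).
labelled_sample = {
    "audio_file": "call_0001.wav",                              # the raw audio clip
    "transcript": "I'd like to check my balance, please",       # human-verified text
    "speaker": {"accent": "Scottish", "recording": "mobile"},   # context that affects recognition
    "labels": {
        "intent": "account_balance_enquiry",  # what the caller wants
        "sentiment": "neutral",               # tone of the request
        "contains_slang": False,
    },
}
```

Producing thousands upon thousands of records like this by hand is exactly the slow, labour-intensive step described above.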
How can we label data faster? Or perhaps we need to change how we look at this problem and refine our machine and deep learning models. The sheer amount of data is both a challenge and a blessing, and if machines can be trained to pull more accurate insights from raw data, then we’re really moving into the world of seamless voice translation at mass volume. It’s a game-changing moment.
As I mentioned at the start of this blog, much of this work relies on HPC in both the cloud and colocation data centres to work successfully. Voice recognition is a very compute-intensive operation that requires classic parallel processing, low-latency, high-bandwidth interconnectivity, and lots and lots of GPUs. The cloud is a perfect receptacle for collecting, inputting and distributing data, and industrial-scale data centres like Kao Data are ideal for the data-intensive, fine-touch processing tasks, enabling an HPC cluster, supported by GPUs, to be tailored within a close, connected environment to crunch data.
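For a flavour of what ‘supported by GPUs’ means in code, here is a much-simplified Python/PyTorch sketch of data parallelism: a toy model is replicated across whatever GPUs are available and each replica processes a slice of the batch. It is an assumption-laden illustration of the general technique, not a description of Speechmatics’ actual training setup, which spans many nodes in the cluster.

```python
# Much-simplified illustration of single-node GPU data parallelism (not a real ASR training job).
import torch
import torch.nn as nn

# A toy stand-in for an acoustic model: maps 80-dimensional audio features to 1,000 output tokens.
model = nn.Sequential(nn.Linear(80, 512), nn.ReLU(), nn.Linear(512, 1000))

if torch.cuda.device_count() > 1:
    # Replicate the model on every visible GPU; each replica handles a slice of the batch in parallel.
    model = nn.DataParallel(model)
model = model.to("cuda" if torch.cuda.is_available() else "cpu")

# One training step on a dummy batch of 256 feature frames with random targets.
device = next(model.parameters()).device
features = torch.randn(256, 80, device=device)
targets = torch.randint(0, 1000, (256,), device=device)
loss = nn.functional.cross_entropy(model(features), targets)
loss.backward()  # gradients from all replicas are gathered back onto one device
```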
It’s possible that within a few years, voice will be our main method of interacting with computers, and they will ‘speak’ their information back to us through smart earphones or project data onto augmented reality glasses. The cloud, high performance computing and 5G connectivity will ensure all of this happens at lightning speed. We are still in the early stages of voice recognition but, as the investment in the technology shows, there is plenty more to come.