Interview: VocalZoom talks voice isolation
14 June 2016 11:23 GMT

The versatility, scalability and convenience of voice biometric solutions have helped the segment grow rapidly in recent years, with major banks and governments deploying them in contact centres and mobile versions being built into fintech apps.

Case studies of bank implementations by major US biometrics firms have shown that the technology can save financial institutions significant sums, while cutting authentication times down to seconds.

The future could be even brighter for voice, with the contactless interface and flexible access potential offered by the modality making voice biometrics seemingly a no-brainer for Internet of Things applications.

But as with other biometric modalities, there are challenges. In many scenarios, background noise can confuse voice algorithms, threatening usability. And is “spoofing” the tech as easy as playing back a recorded, or even morphed, sound?

This is an area being targeted by the Israeli startup VocalZoom, which has developed a technology that it says will perfectly isolate a voice for both recognition and authentication.

Planet Biometrics caught up with the company’s CEO, Tal Bakish, to discuss his plans for future development.

Can you explain how each part of the solution works?

The VocalZoom sensor is very simple. It operates in a way that is not unlike simple acoustic microphones, whose electronics measure the movement of a membrane when sound (pressure) waves in the air hit it. With the VocalZoom sensor, facial skin is the membrane, and the sensor measures this membrane’s movement during speech with a laser aimed at key areas of the face including the mouth, lips, cheek, throat and behind the ear. The skin membrane is moving because of sound waves inside the mouth which, because they are confined inside the mouth, are not affected by ambient noise. The VocalZoom sensor essentially “hears” only the speaker’s voice. Nanometer-resolution interferometry techniques are used to measure the vibrations; the data is converted into intensity variations; algorithms filter out any vibrations not associated with the user’s speech; the remaining intensity variations are converted to signals; and these signals are then converted back to sound. The sensor supports a range of up to 1 meter when pointed at one of these fixed locations on a user’s face. Simple and efficient.
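The chain Bakish describes (vibration, intensity variation, filtering, signal, sound) can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not VocalZoom's method: the laser wavelength, sampling rate, vibration amplitudes, the two-beam interferometer model held at quadrature, and the simple first-order high-pass filter standing in for the company's unpublished filtering algorithms are all hypothetical choices made for illustration.

```python
import math

WAVELENGTH_M = 850e-9    # assumed laser wavelength (hypothetical value)
SAMPLE_RATE_HZ = 8000    # assumed sampling rate (hypothetical value)


def intensity_from_displacement(d_m):
    """Idealised two-beam interferometer held at quadrature.

    A surface displacement d changes the round-trip optical phase by
    4*pi*d/wavelength; the detector sees a normalised intensity in [0, 1].
    """
    phase = 4 * math.pi * d_m / WAVELENGTH_M
    return 0.5 * (1 + math.sin(phase))


def high_pass(samples, cutoff_hz, sample_rate_hz):
    """First-order RC high-pass filter, a stand-in for the real filtering step."""
    rc = 1 / (2 * math.pi * cutoff_hz)
    dt = 1 / sample_rate_hz
    alpha = rc / (rc + dt)
    out = [0.0]
    for i in range(1, len(samples)):
        out.append(alpha * (out[-1] + samples[i] - samples[i - 1]))
    return out


def to_audio(samples):
    """Scale the filtered signal so its peak magnitude is 1.0."""
    peak = max(abs(s) for s in samples) or 1.0
    return [s / peak for s in samples]


def simulate(duration_s=0.5):
    n = int(duration_s * SAMPLE_RATE_HZ)
    t = [i / SAMPLE_RATE_HZ for i in range(n)]
    # Assumed skin motion: a 5 nm speech vibration at 200 Hz riding on a
    # 20 nm, 2 Hz drift that represents non-speech movement.
    speech = [5e-9 * math.sin(2 * math.pi * 200 * ti) for ti in t]
    drift = [20e-9 * math.sin(2 * math.pi * 2 * ti) for ti in t]
    intensity = [intensity_from_displacement(s + d) for s, d in zip(speech, drift)]
    filtered = high_pass(intensity, cutoff_hz=80, sample_rate_hz=SAMPLE_RATE_HZ)
    return to_audio(filtered)


audio = simulate()
```

In this toy model the high-pass stage removes the constant intensity offset and attenuates the slow drift, so the normalised output is dominated by the 200 Hz component. A real sensor would need far more than this, including handling of larger motions where the intensity response is no longer approximately linear.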

Can you tell us about the company history behind VocalZoom?

In 2006, I joined some of my colleagues to work on a few ideas. Back then, there was a lot of buzz around speech recognition technology, and the challenges facing this new technology were huge. We sat down and tried to figure out if there was a completely different way to solve the problems that stood in the way of successful speech recognition solutions. My colleagues and I had all studied physics, so we started looking at technologies that were beyond the scope of acoustic microphones, and we came up with an alternative: optical technology that senses vibrations, married with interferometer technology, which measures the distance and velocity of window vibrations when people inside a building speak. We thought if windows vibrate when you speak, then surely everything must vibrate when you speak.

We started looking into some academic research that had been done on this subject and began doing some experiments. We found that everything around us vibrates when we speak but, more importantly, that facial skin vibrates too. From that point on, we started looking for a way to create a product that was small enough, and could be manufactured at a low enough cost, to be used to measure facial vibrations for two primary consumer applications: voice control for headsets, wearables and the connected car, and voice authentication applications for access control and the smart home. This challenge took us around three years to overcome, but in 2010, we found a way to do it. From there, we founded VocalZoom and started moving forward.

What biometrics issues has your technology been developed to address? Why is it important to isolate voices?

There are two key issues that need to be addressed if voice is to be used for biometric authentication:

Accuracy: Voice biometrics is the holy grail of the industry. It is seamless, robust, natural, frictionless, hands-free, etc. But voice recognition accuracy has been very low with existing solutions, which have generally been designed for human listeners rather than for machines, which cannot infer meaning as humans do when background noise periodically drowns out the speaker. There are two accuracy challenges to overcome: 1) noise (street noise outdoors, people speaking in the background, wind and other noise inside a car, etc.), and 2) sensitivity to the structure of the acoustic environment (i.e., whether the environment is a room or outside, a lobby or staircase, and whether the speaker is facing the microphone or talking from off to the side of it, etc.). VocalZoom solved these two challenges with a new category of Human to Machine Communication (HMC) sensor solution that enables superior noise reduction and delivers a near-perfect, repeatable reference voice signal for automatic speech recognition (ASR) engines. The VocalZoom sensor is unique in its ability to measure vibrations inside the mouth cavity and not outside the mouth, which is important because the speaker’s mouth does not change frequently and, therefore, the voice always sounds the same. The sensor can also measure skin vibrations for authentication purposes.

Simplifying strong authentication using multiple factors: For optimum security and reliability, biometrics is used in a multi-factor authentication model, and solutions typically consist of two or more technologies, e.g., voice and fingerprint, or fingerprint with password. This can be expensive for the developer (marrying two separate technologies into a single multi-factor authentication platform or system) and inconvenient for the user (having to carry a card or remember a password in addition to presenting a biometric). With a single VocalZoom sensor, there is the opportunity to deliver an all-biometric multi-factor authentication solution (voice and skin) that is easier to develop and more convenient to use.

Do you expect voice biometrics take-up to accelerate in banking?

Voice is already gaining momentum in banking. The problem is that it is limited to very specific use cases. As the world becomes more connected at home, in our cars and at work, users want the benefits of this connectivity for on-line banking and purchases as well, without having to use passwords or go through cumbersome authentication processes. Voice is seamless and hassle-free, and there is the opportunity for voice authentication solutions to simplify interactions and transactions, and deliver a much better user experience. When the speech recognition of voice biometrics solutions reaches 99% accuracy, it will be a game changer, and the VocalZoom sensor offers that opportunity.

What other segments are right for this solution?

In addition to voice authentication, voice control is also an ideal application, and the main segment we have initially identified here is the connected car. Developers have started generating applications and services for the connected car, much like what happened in the mobile phone industry. The connected car is an ideal target for these applications and services because the driver and passengers are, in a sense, a captive audience - they are often inside the car and on the road for up to three hours a day. Auto makers as well as online sellers want to monetize this captive audience through a variety of products and services. But to establish a continuous connection to - and engagement with - the people in the car who will be consuming these products and services, there must be a voice interface, to ensure the best (and safest) possible user experience. If people can talk to their car in the same way they talk to their friends, it will become very easy to sell connected-car services and products.