Guest Post: Are voice biometrics trustworthy?
24 November 2016 16:37 GMT

By Terry Nelms, Director of Pindrop Labs

Voice biometrics has the potential to improve the security and user experience of systems and services that require authentication.  Voice is a natural interface for many systems and a primary method of communication.  By leveraging voice for authentication, a separate security interface is not required, resulting in a simpler system and an improved user experience.  However, for voice biometrics to be trusted and ubiquitous, it must be unobtrusive, robust and secure.


Most security systems tend to be obtrusive; consider, for example, the airport screening lines that passengers must navigate in order to board a flight.  This is unfortunate because the vast majority of users (passengers) are benign, yet all of them pay the security toll so that the few malicious ones can be detected.  Perhaps the biggest promise of biometrics is unobtrusive security.  This is especially true of voice biometrics, since speaking is often already part of the user experience.

Voice biometric systems require users to go through an enrollment process so the system can learn their voice.  Enrollment can be active or passive.  Active enrollment must be used with systems that are text-dependent.  This is a manual process in which the user is required to speak an agreed-upon phrase.  Beyond enrollment, this phrase must be spoken every time the user is authenticated.  Text-dependent systems therefore fail to make security unobtrusive.

On the other hand, passive enrollment occurs during the user’s normal interaction with the system.  Passive enrollment is used by voice biometric systems that are text-independent.  They are able to learn the user’s voice during normal speech and do not require a specific phrase.  As a result, text-independent systems are truly unobtrusive to the user because they are almost completely invisible to them.
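A text-independent system of this kind can be pictured as comparing a stored voiceprint against an embedding extracted from whatever the caller happens to say.  The sketch below is a minimal toy, assuming the system has already reduced each utterance to a fixed-length vector of attributes (the feature-extraction step itself is out of scope); the function names, the averaging scheme and the 0.8 threshold are all hypothetical, not a description of any real product.

```python
import math

def cosine_similarity(a, b):
    """Similarity between two attribute vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def enroll(utterance_embeddings):
    """Passive enrollment: average the embeddings gathered from normal calls."""
    dim = len(utterance_embeddings[0])
    n = len(utterance_embeddings)
    return [sum(e[i] for e in utterance_embeddings) / n for i in range(dim)]

def authenticate(voiceprint, embedding, threshold=0.8):
    """Accept the caller if their embedding is close enough to the voiceprint."""
    return cosine_similarity(voiceprint, embedding) >= threshold

# Toy 2-dimensional embeddings; real systems use hundreds of dimensions.
voiceprint = enroll([[1.0, 0.1], [0.9, 0.2]])
print(authenticate(voiceprint, [0.95, 0.15]))  # genuine caller: True
print(authenticate(voiceprint, [0.1, 1.0]))    # different speaker: False
```

Because enrollment here is just an average over ordinary calls, the user never performs a dedicated security step, which is the sense in which the process is invisible.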


Voice biometrics must be robust to noise, channel effects and voice changes over time.  For instance, background noise is common on many phone calls.  It consists of the non-speech sounds picked up by the microphone, such as a microwave, an air conditioner, a radio or children talking.  If a voice biometric system is not robust to noise, authentication can fail when the background noise differs from what was present during enrollment.

Voice biometric systems should also be robust to the communication channel employed by the user.  For example, it should be possible to enroll a user calling from their landline and authenticate them later when they call from their mobile phone.  Likewise, someone other than the user calling from the user's device should fail authentication, because their voice is different regardless of the communication channel.  Thus, voice biometric systems need to minimize the learning of channel artifacts.

The human voice changes considerably as we age.  Consequently, authentication against the initial enrollment will eventually begin to fail.  For voice biometric systems to be robust to aging, they must continually learn and adapt to the user's ever-changing voice.  However, they must be careful not to adapt too easily and learn the voice of an impostor.
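One common way to hedge against that risk is to update the stored voiceprint only when a match is already highly confident, and even then only by a small step, so a single impostor call cannot drag the model toward a different voice.  The sketch below is illustrative only; the 0.9 confidence threshold and 0.1 learning rate are assumed values, not parameters of any real system.

```python
def adapt(voiceprint, new_embedding, match_score,
          adapt_threshold=0.9, rate=0.1):
    """Blend a fresh embedding into the voiceprint, but only on confident matches.

    match_score is the similarity the system computed during authentication;
    the threshold and learning rate here are hypothetical.
    """
    if match_score < adapt_threshold:
        return voiceprint  # not confident enough; refuse to learn this voice
    return [(1 - rate) * v + rate * n for v, n in zip(voiceprint, new_embedding)]

print(adapt([1.0, 0.0], [0.9, 0.1], match_score=0.95))  # slow drift toward new sample
print(adapt([1.0, 0.0], [0.1, 0.9], match_score=0.40))  # low score: voiceprint unchanged
```

A small learning rate means the voiceprint tracks gradual aging while remaining anchored to many past confident matches rather than any single call.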


A primary security concern with voice biometrics (and biometrics in general) is what to do if it is compromised, i.e., if a voice is stolen.  Unlike passwords, your voice cannot be changed.  One way an attacker can steal a user's voice is with a recording.  While an attacker could record the user directly, websites like YouTube host recorded audio that could be used for this purpose.  In the case of a bank call center, the attacker could simply replay part of the audio at the beginning of the call during authentication (e.g., during IVR navigation) and then speak with their normal voice when connected to an agent.

In addition to simply playing back a recording of the user, an attacker could use speech synthesis to impersonate the user.  Given enough audio, modern systems can build a voice that sounds very similar to the person being modeled.  This cloned voice could be used not only during authentication, but also to carry on a conversation with the agent or system.

Lastly, the attributes extracted from the user's voice for enrollment and authentication can themselves be stolen.  They typically consist of a list of floating-point numbers calculated when the user's speech is analyzed.  For instance, if the attributes are extracted on the user's device, an attacker could steal them (e.g., through a compromised mobile phone) and then inject them into a separate session.  The stolen attributes would be used for authentication instead of those derived from the attacker's voice.
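The danger is easy to see in a toy sketch: if the server's check is purely a numeric comparison of attribute vectors, an exfiltrated vector replays perfectly.  The vectors and names below are hypothetical.

```python
import math

def cosine(a, b):
    """Similarity between two attribute vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

# Hypothetical attribute vector computed on the user's device.
device_attributes = [0.12, -0.40, 0.88, 0.05]

# An attacker who exfiltrates these floats can inject them verbatim into a
# new session; a bare similarity check cannot tell a replayed vector from a
# freshly computed one.
stolen_attributes = list(device_attributes)
print(cosine(device_attributes, stolen_attributes))  # essentially a perfect 1.0
```

This is why attribute vectors need the same handling as other credentials in transit and at rest, and why the comparison step benefits from being bound to the live session rather than trusting whatever floats arrive.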

For voice biometrics to be trusted and ubiquitous, it must be unobtrusive, robust and secure.  To be unobtrusive the enrollment process must be passive and text-independent, i.e., invisible to the user.  To be robust, it should limit the effects of noise, channel and voice aging on the authentication accuracy of the system.  Lastly, to be secure, voice biometric systems must protect the user’s voice from impersonation.