
Archived content

NOTE: this is an archived page and the content is likely to be out of date.

Deep Learning-based Voiceprint Authentication from Very Short Speeches

Fujitsu Research & Development Center Co. Ltd.

Beijing, China, March 09, 2017

Fujitsu R&D Center Co., Ltd. (FRDC) announced the development of a high-precision voiceprint authentication technology that uses deep learning to identify a speaker from a very short speech segment. The technology combines two deep learning engines: one extracts features related to the speech content, and the other extracts features related to the speaker. Together they realize a “voice password” identity authentication function, in which the speaker's identity is accepted only when the registered speaker correctly speaks the preset content. With this technology, the identity authentication error rate is about 2.2% for a speech segment shorter than 3 seconds.

This technology can be widely used in call centers, IoT interaction, and other applications where quick and secure user identity verification is required, improving both operational security and convenience.

Development Background

Voiceprint recognition is an important branch of biometric authentication. Because it can be performed remotely, voiceprint-based identity authentication has gradually gained acceptance in fields such as telephone banking in the financial sector, smart homes, criminal investigation, and security, and has become an important measure against fraud.

In call center operations, a common problem is that customers must enter a password or answer a series of tedious questions to verify their identity before receiving service. This process takes a long time, more than 60 seconds on average, which reduces the efficiency of customer service and frustrates customers. A secure and effective remote identity authentication solution would therefore considerably improve call center operational efficiency and reduce operating costs.


Traditional voiceprint recognition technology relies on statistical and signal processing methods to extract speaker-related characteristics from speech for identity authentication. However, it often requires a relatively long speech sample, for example 30 seconds, to identify the speaker. Many applications, such as call centers and IoT interaction, need to verify the user’s identity quickly, so traditional voiceprint recognition cannot meet these requirements. Moreover, traditional authentication methods cannot prevent fraud that uses a recording of another person's voice.

About the Technology

  1. Reduce the speech length for stable voiceprint recognition

    Traditional voiceprint recognition technologies divide speech into very short clips called frames (a frame is usually about 20 ms). The pronunciation characteristics of each speaker are then identified by comparing the frames against thousands of average Gaussian models (a Gaussian model is a statistical model that represents average pronunciation characteristics). Because this large set of Gaussian models is highly complex, such statistical methods need a large amount of speech data to obtain discriminative speaker features.

    Our voiceprint recognition technology uses deep learning engines that process multiple consecutive speech frames simultaneously to learn the speaker’s characteristics, as shown in Figure 1. Because more speech frames are processed at one time, the input contains more information about the manner of pronunciation, such as tone changes and pause lengths. As a result, the total length of speech required for stable authentication is reduced.
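The idea of feeding several consecutive frames to a model at once can be illustrated with a minimal sketch. It is not FRDC's implementation: the function names, the non-overlapping 20 ms framing, the context sizes, and the edge padding are all assumptions made for illustration, and a real system would usually compute acoustic features per frame before stacking them.

```python
def split_into_frames(samples, sample_rate, frame_ms=20):
    """Split a sequence of audio samples into non-overlapping frames.

    Assumption: fixed-length frames of frame_ms milliseconds with no overlap;
    trailing samples that do not fill a whole frame are dropped.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


def stack_context(frames, left=5, right=5):
    """Join each frame with its neighbours into one longer input vector.

    Assumption: frames near the start or end are padded by repeating the
    first or last frame, so every output vector has the same length.
    """
    n = len(frames)
    stacked = []
    for t in range(n):
        window = []
        for k in range(t - left, t + right + 1):
            idx = min(max(k, 0), n - 1)  # clamp the index to the valid range
            window.extend(frames[idx])
        stacked.append(window)
    return stacked


# Three seconds of (silent) audio sampled at 16 kHz.
frames = split_into_frames([0.0] * 48000, sample_rate=16000)
inputs = stack_context(frames, left=5, right=5)
print(len(frames), len(frames[0]), len(inputs[0]))
```

With these hypothetical settings, each model input covers 11 frames (about 220 ms) instead of a single 20 ms frame, which is the kind of wider window in which tone changes and pauses become visible.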

  2. Combination of speaker feature and speech content

    This technology uses two deep learning engines to extract speaker-related characteristics and speech content, respectively, for identity authentication. This realizes a “voice password” function: the speaker's identity is accepted only when the registered speaker correctly speaks the preset content, as shown in Figure 2.

    On the one hand, the voice password can prevent fraud that uses a recording of another person's voice; on the other hand, fixing the speech content allows more discriminative speaker features to be identified. If a speaker pronounces a certain phoneme, say /aa/, very differently from the general population, the deep learning engine may learn this specific pattern when /aa/ is included in the speaker's password, and the pattern becomes a key factor in authentication and fraud prevention.
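The "voice password" rule, accepting a claim only when both the speaker and the spoken content match, can be sketched as a simple two-part decision. This is an illustrative assumption, not the method described in the release: the release does not say how the two engines' outputs are scored or combined, so the cosine-similarity comparison, the `content_score` input, and both thresholds below are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def verify_voice_password(speaker_embedding, enrolled_embedding, content_score,
                          speaker_threshold=0.7, content_threshold=0.5):
    """Accept only if BOTH the speaker check and the content check pass.

    Assumptions (hypothetical): the speaker engine outputs an embedding that is
    compared with the enrolled one by cosine similarity, and the content engine
    outputs a score in [0, 1] for how well the utterance matches the preset phrase.
    """
    speaker_ok = cosine_similarity(speaker_embedding, enrolled_embedding) >= speaker_threshold
    content_ok = content_score >= content_threshold
    return speaker_ok and content_ok


enrolled = [0.2, 0.9, 0.4]
print(verify_voice_password([0.2, 0.9, 0.4], enrolled, content_score=0.9))   # right speaker, right phrase
print(verify_voice_password([0.2, 0.9, 0.4], enrolled, content_score=0.1))   # right speaker, wrong phrase
print(verify_voice_password([0.9, -0.2, 0.0], enrolled, content_score=0.9))  # different speaker, right phrase
```

Requiring both checks to pass means that a matching voice alone is rejected when the phrase is wrong, and a correct phrase alone is rejected when the voice does not match.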



Because the deep learning engines process multiple speech frames at one time, our technology requires only a short speech segment of less than 3 seconds for identity authentication, and it maintains high authentication accuracy even for such short speech. On a set of 200 people, the authentication error rate is only about 2.2%.
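The release does not state how the 2.2% error rate was measured. As a hedged illustration of how authentication errors are commonly counted, the sketch below computes a false acceptance rate and a false rejection rate at a fixed threshold. The function name, the threshold, and the scores are made up for illustration and are not FRDC's data or evaluation protocol.

```python
def error_rates(genuine_scores, impostor_scores, threshold):
    """Return (false_acceptance_rate, false_rejection_rate) at a given threshold.

    A genuine trial scoring below the threshold is a false rejection;
    an impostor trial scoring at or above it is a false acceptance.
    """
    false_rejects = sum(1 for s in genuine_scores if s < threshold)
    false_accepts = sum(1 for s in impostor_scores if s >= threshold)
    far = false_accepts / len(impostor_scores)
    frr = false_rejects / len(genuine_scores)
    return far, frr


# Toy scores, invented purely to exercise the function.
genuine = [0.9, 0.8, 0.4, 0.95]
impostor = [0.1, 0.6, 0.2, 0.3]
far, frr = error_rates(genuine, impostor, threshold=0.5)
print(far, frr)
```

Raising the threshold lowers false acceptances at the cost of more false rejections, so a single reported error rate usually corresponds to one chosen operating point.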

In the Future

In the future, FRDC will apply this technology to call centers in the finance and insurance sectors as well as other industries, providing customers with efficient and secure identity authentication solutions. FRDC will also continue to promote and expand the use of voiceprint authentication in the management of family calls at prisons.

About Fujitsu

Fujitsu is the leading Japanese information and communication technology (ICT) company offering a full range of technology products, solutions and services. Approximately 156,000 Fujitsu people support customers in more than 100 countries. We use our experience and the power of ICT to shape the future of society with our customers. Fujitsu Limited (TSE: 6702) reported consolidated revenues of 4.7 trillion yen (US$41 billion) for the fiscal year ended March 31, 2016. For more information, please see

About Fujitsu R&D Center Co., Ltd.

Established in 1998, Fujitsu R&D Center Co., Ltd. is a wholly owned R&D center of Fujitsu Limited, located in Beijing. The center's research areas cover the major business fields of the Fujitsu Group, including information processing, telecommunications, semiconductors, and software and services. For more information, please see:


Company: Fujitsu R&D Center Co., Ltd.
Information Technology Lab.

Press Release ID: 2017-03-09
Date: 09 March, 2017
City: Beijing, China
Company: Fujitsu Research and Development Center Co., Ltd.