Research on Voiceprint Recognition Zhang Wanli, Li Guoxin - IERI

2012 International Conference on Electrical and Computer Engineering

Advances in Biomedical Engineering, Vol.11

Research on Voiceprint Recognition

Zhang Wanli, Li Guoxin

Electronic Information and Engineering, Changchun University, Changchun, China

E-mail: ,

Keywords: Voiceprint; Feature extraction; Recognition

Abstract. The concept and applications of voiceprint recognition are described in this paper, and the methods and results of voiceprint recognition at home and abroad are explained. For the problems existing in voiceprint recognition, several methods of resolution are presented.

1. Introduction

With the rapid worldwide emergence of e-commerce, people can carry out electronic transactions through the open network. At the same time, there is a great deal of sensitive personal, military, and government information that only authorized people may access [1]. Network security has therefore become a key issue for network development; authentication is an important aspect of network security, and it is receiving more and more attention [2].

A voiceprint is the spectrum of a speech waveform, displayed by electro-acoustic instruments, that carries speech information. Voiceprint recognition is the technology of automatic speaker recognition, which determines the speaker's identity [3]. Its basic principle is that a unique mathematical model is constructed for each person by analyzing that person's voice and hearing characteristics; the input voice is matched against the models, and the matching result, identifying the speaker, is given by the system.

2. The Process and Methods of Voiceprint Recognition

The process of voiceprint recognition is shown in Figure 1. It is composed of preprocessing, feature extraction, model formation and storage, model training, and pattern matching and judgment, which will now be described in more detail.

Figure 1. Voiceprint recognition diagram

2.1 Speech Pre-processing

Pre-processing includes sampling, quantization, pre-emphasis, and windowing. Sampling and quantization transform the original analog speech signal into a digital speech signal that is discrete in both time and magnitude [4]. The purpose of pre-emphasis is to enhance high-frequency information and flatten the spectrum for analysis of the spectrum and vocal-tract parameters. Windowed signals can be regarded as a stationary process within a short period. Speech pre-processed by the above methods is shown in Figure 2.
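As an illustration, the pre-emphasis and windowing steps can be sketched in Python; the filter coefficient 0.97 and the 25 ms / 10 ms frame sizes are common illustrative choices, not values specified in this paper:

```python
import numpy as np

def preprocess(signal, alpha=0.97, frame_len=400, hop=160):
    """Pre-emphasize a speech signal and slice it into Hamming-windowed frames.

    alpha, frame_len (25 ms at 16 kHz), and hop (10 ms) are illustrative values.
    """
    # Pre-emphasis: boost high frequencies, y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Frame the signal and apply a Hamming window to each frame
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([
        emphasized[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    return frames

# Example: 1 second of a synthetic 440 Hz tone sampled at 16 kHz
fs = 16000
t = np.arange(fs) / fs
frames = preprocess(np.sin(2 * np.pi * 440 * t))
print(frames.shape)  # each row is one windowed frame
```

Each frame can then be treated as approximately stationary, as the section above notes.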

978-1-61275-029-3 /10/$25.00 ©2012 IERI ICECE 2012


Figure 2. Speech signal

2.2 Feature Extraction

Feature extraction is the process of obtaining speech signal characteristics parameters from the

speech signal waveform. Currently features are linear prediction coefficients (LPC), Mel cepstrum

coefficients (MFCC) and or their mixture parameters.

1) Linear prediction coefficients

Linear prediction is one of the most powerful speech analysis techniques and one of the most useful methods for encoding good-quality speech at a low bit rate. It provides extremely accurate estimates of speech parameters and is relatively efficient to compute [5].

The coefficients of a forward linear predictor are determined by minimizing the prediction error in the least-squares sense. The predicted value is obtained from the following equation.


$\hat{s}(n) = \sum_{i=1}^{p} a_i\, s(n-i)$

where $\{a_i\}$ are the linear prediction coefficients and $p$ is the order of the prediction filter. The prediction error is

$e(n) = s(n) - \hat{s}(n) = s(n) - \sum_{i=1}^{p} a_i\, s(n-i)$
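The least-squares minimization can be sketched with a plain NumPy solve; the AR(2) test signal and its coefficients below are illustrative, not taken from the paper:

```python
import numpy as np

def lpc_coefficients(s, p):
    """Solve for {a_i} minimizing sum_n (s[n] - sum_{i=1}^{p} a_i * s[n-i])^2."""
    N = len(s)
    # Each row of X holds the p previous samples [s[n-1], ..., s[n-p]]
    X = np.array([s[n - p : n][::-1] for n in range(p, N)])
    y = s[p:]
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    return a

# Synthetic AR(2) signal with known coefficients a1 = 1.6, a2 = -0.81
s = np.zeros(200)
s[0], s[1] = 1.0, 1.0
for n in range(2, 200):
    s[n] = 1.6 * s[n - 1] - 0.81 * s[n - 2]
a = lpc_coefficients(s, 2)
print(np.round(a, 4))  # recovers [1.6, -0.81], since the signal is exactly AR(2)
```

In practice the normal equations are usually solved with the Levinson-Durbin recursion on the frame autocorrelation, but the least-squares form above matches the error criterion stated in the text.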

2) Mel cepstrum coefficients

Acoustic analyses based on MFCCs, which model the human ear, have shown good results in speaker recognition, especially when a high number of coefficients is used [6].

The MFCC algorithm consists of framing, windowing, FFT, Mel filtering, logarithm, and DCT. The first two stages have already been described. The FFT is applied to obtain the magnitude spectrum of the windowed speech data [7]. Mel filtering models hearing with a bank of triangular filters uniformly spaced on the Mel scale. The Mel scale is given by

$f_{mel} = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$

where $f$ denotes the frequency in Hz. The Mel filtering is shown in Figure 3.

Figure 3. Mel filtering
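The Mel-scale mapping can be computed directly. The inverse mapping and the filter count of 10 below are illustrative additions used only to show how triangular filter centers are placed:

```python
import math

def hz_to_mel(f):
    """Mel scale: f_mel = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse mapping, used to place triangular filter edges in Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Center frequencies of, say, 10 filters uniformly spaced on the Mel scale
low, high = hz_to_mel(0.0), hz_to_mel(8000.0)
centers_hz = [mel_to_hz(low + k * (high - low) / 11) for k in range(1, 11)]
print([round(c) for c in centers_hz])
```

Note that uniform spacing on the Mel scale yields centers that grow denser at low frequencies in Hz, matching the ear model described above.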


A discrete cosine transform (DCT) expresses a sequence of finitely many data points as a sum of cosine functions oscillating at different frequencies. The last stage of the algorithm, a DCT, encodes the Mel logarithmic magnitude spectrum into the Mel-frequency cepstral coefficients (MFCC). The Mel-frequency cepstral coefficients are

$c_i = \sqrt{\frac{2}{p}} \sum_{j=1}^{p} m_j \cos\left[\frac{\pi i}{p}(j - 0.5)\right]$

where $p$ is the number of filters and $m_j$ is the logarithmic output of the $j$-th filter.
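The DCT stage can be sketched as follows; the toy filter-bank energies are illustrative:

```python
import numpy as np

def mfcc_from_log_energies(m):
    """DCT stage of MFCC: c_i = sqrt(2/p) * sum_j m_j * cos(pi*i/p * (j - 0.5)),
    where m holds the p log Mel filter-bank energies (j = 1..p)."""
    p = len(m)
    j = np.arange(1, p + 1)
    return np.array([
        np.sqrt(2.0 / p) * np.sum(m * np.cos(np.pi * i / p * (j - 0.5)))
        for i in range(1, p + 1)
    ])

# Toy log filter-bank outputs for a single frame
log_energies = np.log(np.array([2.0, 1.5, 1.2, 1.0, 0.8, 0.6]))
c = mfcc_from_log_energies(log_energies)
print(np.round(c, 3))
```

In a full pipeline only the first dozen or so coefficients are typically kept as the frame's feature vector.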

2.3 Recognition

The methods of voiceprint recognition include template matching, vector quantization (VQ), the Gaussian mixture model (GMM), the hidden Markov model (HMM), and combinations of the above.

3) Template matching

Template-based voiceprint recognition is a conventional approach: recognition is achieved by matching test speech sequences against template speech sequences. In a template-based system, a template of feature sequences is established for each word uttered by each speaker. During recognition, the test feature sequences are compared and matched against the template feature sequences for each word. Because the same word pronounced by the same speaker at different times differs in speed and length from the template speech, the test feature sequences are stretched or compressed so that they can be aligned with the template feature sequences. The method minimizes the matching distance, and the minimum distance determines the recognized speaker.
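The stretching and alignment described above is conventionally implemented with dynamic time warping (DTW); a minimal sketch with one-dimensional toy features:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences
    (rows are frames), aligning sequences of different lengths."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            # Extend the cheapest of: stretch template, stretch test, or advance both
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

template = np.array([[0.0], [1.0], [2.0], [1.0], [0.0]])
test_fast = np.array([[0.0], [2.0], [0.0]])    # same shape, spoken faster
test_other = np.array([[5.0], [5.0], [5.0]])   # a different utterance
assert dtw_distance(template, test_fast) < dtw_distance(template, test_other)
```

The recognized word (or speaker) is the template with the smallest warped distance to the test sequence.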

4) Vector quantization

Vector quantization is an efficient data compression technology, widely used in speech recognition and synthesis and in image data compression. In a VQ-based voiceprint recognition system, a codebook is derived from the feature sequences of each speaker. VQ determines the optimum codebook by minimizing the squared-error distortion. Recognition is performed by calculating the average distance between the new vectors and the centroids of each speaker and choosing the speaker with the minimum distance [8].
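A codebook search of this kind can be sketched with plain k-means; the two-dimensional toy "speakers" are illustrative:

```python
import numpy as np

def train_codebook(features, k, iters=20, seed=0):
    """A plain k-means codebook: k centroids minimizing squared-error distortion."""
    rng = np.random.default_rng(seed)
    centroids = features[rng.choice(len(features), k, replace=False)]
    for _ in range(iters):
        # Assign each vector to the nearest centroid
        d = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each centroid to the mean of its assigned vectors
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = features[labels == c].mean(axis=0)
    return centroids

def avg_distortion(features, codebook):
    """Average distance from each test vector to its nearest codeword."""
    d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).mean()

# Toy "speakers": feature clusters around different points in feature space
rng = np.random.default_rng(1)
spk_a = rng.normal([0, 0], 0.1, (100, 2))
spk_b = rng.normal([3, 3], 0.1, (100, 2))
books = {"A": train_codebook(spk_a, 4), "B": train_codebook(spk_b, 4)}
test = rng.normal([0, 0], 0.1, (20, 2))  # unseen utterance from speaker A
best = min(books, key=lambda s: avg_distortion(test, books[s]))
print(best)
```

The classical LBG algorithm refines a codebook in essentially this way, splitting centroids as it grows the codebook size.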

5) Gaussian mixture model

Gaussian mixture model techniques are increasingly used for voiceprint recognition. In a vector quantization system, the codebook represents the centroids of the speaker's frame feature vectors in the feature space, and is an incomplete description of the features [9]. In a Gaussian mixture model system, the GMM represents the statistical distribution of each speaker's characteristics with a weighted sum of Gaussian functions; a large number of mixture components is usually needed to obtain a good approximation [10].
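Scoring frames against competing speaker GMMs can be sketched as below. The mixture parameters here are hand-set for illustration; in practice they are estimated from training speech (e.g. by expectation-maximization):

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of frames x under a diagonal-covariance GMM:
    p(x) = sum_k w_k * N(x; mu_k, sigma_k^2)."""
    x = np.atleast_2d(x)  # (n_frames, dim)
    ll = 0.0
    for frame in x:
        component_logs = []
        for w, mu, var in zip(weights, means, variances):
            log_n = -0.5 * np.sum(np.log(2 * np.pi * var) + (frame - mu) ** 2 / var)
            component_logs.append(np.log(w) + log_n)
        ll += np.logaddexp.reduce(component_logs)  # log of the weighted sum
    return ll

# Two illustrative speaker models, each a 2-component mixture in 2-D
model_a = (np.array([0.5, 0.5]), np.array([[0.0, 0.0], [1.0, 1.0]]),
           np.ones((2, 2)))
model_b = (np.array([0.5, 0.5]), np.array([[5.0, 5.0], [6.0, 6.0]]),
           np.ones((2, 2)))
frames = np.array([[0.2, 0.1], [0.9, 1.1], [0.5, 0.4]])  # near model A
score_a = gmm_log_likelihood(frames, *model_a)
score_b = gmm_log_likelihood(frames, *model_b)
assert score_a > score_b  # identification picks the higher-likelihood model
```

Unlike a VQ codebook, each component contributes a soft, weighted density, which is why the GMM gives a more complete description of the feature distribution.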


6) Hidden Markov Model

In recent years, the hidden Markov model, a statistical model, has become a powerful tool for voiceprint recognition. A hidden Markov model can represent both the dynamic changes of speech signal characteristics and the statistical distribution of speech features [11], and the speech features of each speaker vary in the time domain. A hidden Markov model is built for each reference speaker, and its parameters are estimated using the Baum-Welch algorithm. The voiceprint recognition system performs speaker identification by comparing the likelihoods of the reference speakers' hidden Markov models for the input speech frames of the test samples [12][13].
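The likelihood comparison can be sketched with the forward algorithm for a discrete-observation HMM; the toy model and symbol sequences are illustrative, and Baum-Welch training is omitted:

```python
import numpy as np

def forward_log_likelihood(obs, pi, A, B):
    """Likelihood P(obs | model) of a discrete-observation HMM via the
    forward algorithm; pi: initial probs, A: transitions, B: emissions."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # propagate, then weight by emission
    return np.log(alpha.sum())

# A toy 2-state model whose emissions favor the symbol sequence 0,0,1,1
pi = np.array([1.0, 0.0])
A = np.array([[0.7, 0.3],
              [0.0, 1.0]])
B = np.array([[0.9, 0.1],    # state 0 mostly emits symbol 0
              [0.1, 0.9]])   # state 1 mostly emits symbol 1
matching = forward_log_likelihood([0, 0, 1, 1], pi, A, B)
mismatched = forward_log_likelihood([1, 1, 0, 0], pi, A, B)
assert matching > mismatched  # identification picks the likelier model
```

A real system would use continuous emission densities (often GMMs per state) and scale or work in the log domain to avoid underflow on long utterances.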

3. Conclusion

Voiceprint recognition is superior to conventional passwords for authentication, and it may be more convenient for the user, since a password must be remembered while the voice itself serves as the identification. Voiceprint recognition is one of the most natural and economical methods of authentication. The technology makes it possible to use a speaker's voice to verify identity and to control access to services such as voice dialing, banking, online shopping, database access, information services, voice mail, security control for confidential information areas, and remote access to computers.

4. Acknowledgment

This work is supported by jilin provincial education foundation (20080360) and jilin provincial

technology foundation (20090356).


References

[1] Herbert Gish and Michael Schmidt, “Text-independent speaker identification”, IEEE Signal Processing Magazine, 1994

[2] Soo-young Lee and Xin Yao, “Voice articulator for thai speaker recognition system”, Proceedings of the 9th International Conference on Neural Information Processing, 2002

[3] Waleed Fakhr, Ahmed Abdelsalam, and Nadder Hamdy,“Enhancement of mismatched

conditions in speaker recognition for multimedia applications”, ICASSP, 2004

[4] Zied Sakka, “A new method for speech denoising and speaker verification using subband architecture”, IEEE, 2004

[5] Shi-Han Chen and Hsiao-Chuan Wang, “Improvement of speaker recognition by combining

residual and prosodic features with acoustic features”, ICASSP, 2004

[6] Hassen Seddik, “Text independent speaker recognition using the mel frequency cepstral coefficients and a neural network classifier”, IEEE, 2004

[7] Zbyněk Tychtl, “Speech Production Based on the Mel-Frequency Cepstral Coefficients”,

[8] Thomas E. Filgueiras Filho, “Learning vector quantization in text independent automatic speaker recognition”, IEEE, 2005

[9] Li Liu and Jialong He, “On the use of orthogonal GMM in speaker recognition”, IEEE,1999

[10] C.C.T. Chen, “Hybrid KLT GMM approach for robust speaker identification”, Electronics Letters, Vol. 39, No. 21, 16 October 2003


[11] Naftali Z. Tishby, “On the Application of Mixture AR Hidden Markov Models to Text

Independent Speaker Recognition”, IEEE Transactions on Signal Processing, Vol. 39, NO. 3,

March 1991

[12] Seiichi Nakagawa, “Text-independent speaker recognition by combining speaker specific GMM with speaker adapted syllable-based HMM”, ICASSP, 2004

[13] Tomoko Matsui and Sadaoki Furui, “Comparison of Text-Independent Speaker Recognition Methods Using VQ-Distortion and Discrete/Continuous HMMs”, IEEE Transactions on Speech and Audio Processing, Vol. 2, No. 3, July 1994

