CN102543073A

CN102543073A - Shanghai dialect phonetic recognition information processing method

Info

Publication number: CN102543073A
Application number: CN2010105833672A
Authority: CN
Inventors: 陈开�; 许华虎; 阳诚海; 施建刚; 孙弘刚
Original assignee: SHANGHAI SHANGDA HAIRUN INFORMATION SYSTEM CO Ltd
Current assignee: SHANGHAI SHANGDA HAIRUN INFORMATION SYSTEM CO Ltd
Priority date: 2010-12-10
Filing date: 2010-12-10
Publication date: 2012-07-04
Anticipated expiration: 2030-12-10
Also published as: CN102543073B

Abstract

The invention relates to a Shanghai dialect speech recognition information processing method comprising the following steps: 1) a voice input device inputs a Shanghai dialect speech signal; 2) a preprocessing module preprocesses the input Shanghai dialect speech signal; 3) a feature extraction module extracts feature parameters that reflect the characteristics of the signal; 4) a training module takes training speech signals entered by the user several times and applies preprocessing and feature parameter extraction to obtain feature vector parameters, after which a feature modeling module builds a reference model base for the training speech; 5) a recognition module compares the feature vector parameters of the input speech for similarity against the models in the reference model base and outputs the model with the highest similarity as the recognition candidate result; 6) a postprocessing module applies phonetic knowledge processing to the recognition candidate result of step 5) to obtain the final recognition result; and 7) the final recognition result is output through a voice output device. Compared with the prior art, the method has advantages such as high recognition speed.

Description

A Shanghai dialect speech recognition information processing method

Technical field

The present invention relates to a speech recognition method, and in particular to a Shanghai dialect speech recognition information processing method.

Background technology

The earliest work in speech recognition was speaker identification, which relied mainly on simple listening and discrimination by the human ear. Speech recognition proper began with research using linear predictive coding of the speech signal and dynamic time warping; it targeted mainly isolated words and used template matching. China began research on Mandarin speech recognition in 1987, whereas recognition of dialects and dialectal accents has developed comparatively slowly. The Shanghai dialect differs from Mandarin in its phonetic system, its prosodic features, and its grammar, so methods designed for recognizing Mandarin cannot simply be applied to it. Moreover, Mandarin recognition models use the classical HMM, which leads to high time and space complexity.

Summary of the invention

The object of the invention is to overcome the defects of the prior art described above by providing a Shanghai dialect speech recognition information processing method with high recognition speed.

The object of the invention can be achieved through the following technical scheme:

A Shanghai dialect speech recognition information processing method, characterized in that it comprises the following steps:

1) an audio input device inputs a Shanghai dialect speech signal;

2) a preprocessing module preprocesses the input Shanghai dialect speech signal;

3) a feature extraction module extracts feature parameters that reflect the characteristics of the signal;

4) a training module takes training speech signals entered by the user several times and obtains feature vectors through preprocessing and feature parameter extraction; a feature modeling module then builds the reference model base for the training speech, or applies an adaptive correction to the reference models already in the model base;

5) a recognition module compares the feature vectors of the input speech for similarity against the models in the reference model base and outputs the model with the highest similarity as the recognition candidate result;

6) a postprocessing module applies phonetic knowledge processing to the recognition candidate result of step 5) to obtain the final recognition result;

7) the final recognition result is output through an audio output device.

The preprocessing in step 2) comprises endpoint detection on the noisy speech signal, speech framing, and pre-emphasis.

The feature parameters reflecting the characteristics of the signal are extracted in step 3) as follows:

1) the pitch period, the formants, and the Mel frequency cepstral coefficients based on auditory properties are selected as feature parameters;

2) the speech signal is low-pass filtered and then sampled at a set sampling frequency; the short-time correlation coefficients are computed frame by frame with a set delay time, and the pitch period is finally obtained;

3) the discrete Fourier transform (DFT) of the speech signal is computed directly, and the formant parameters of the speech signal are extracted from the DFT spectrum;

4) the signal is filtered with M Mel band-pass filters; the logarithm of each filter's output is taken to obtain the log power spectrum of each band, and an inverse discrete cosine transform is applied to obtain L-dimensional Mel frequency cepstral coefficients, of which the first 12 dimensions are kept.
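The three preprocessing operations named above can be illustrated with a minimal sketch. The patent gives no parameters for this stage, so the pre-emphasis coefficient (0.97), the frame length and hop, and the energy-ratio rule used for endpoint detection are all assumptions chosen for illustration, as are the function names.

```python
import math

def pre_emphasis(signal, alpha=0.97):
    """First-order high-pass filter y[n] = x[n] - alpha * x[n-1] (alpha is assumed)."""
    if not signal:
        return []
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]

def split_frames(signal, frame_len, hop):
    """Cut the signal into overlapping frames; a trailing partial frame is dropped."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def detect_endpoints(frames, ratio=0.1):
    """Return (first, last) index of frames whose energy exceeds ratio * peak energy.

    This simple energy threshold is an assumed stand-in for the patent's
    unspecified endpoint detector. Returns None if every frame is silent.
    """
    energies = [sum(x * x for x in frame) for frame in frames]
    peak = max(energies, default=0.0)
    if peak == 0.0:
        return None
    active = [i for i, e in enumerate(energies) if e > ratio * peak]
    return active[0], active[-1]

# Usage: 0.1 s of silence, 0.1 s of a 200 Hz tone, 0.1 s of silence at 8 kHz.
fs = 8000
tone = [math.sin(2 * math.pi * 200 * n / fs) for n in range(800)]
signal = [0.0] * 800 + tone + [0.0] * 800
frames = split_frames(signal, frame_len=160, hop=80)   # 20 ms frames, 10 ms hop
start, end = detect_endpoints(frames)
speech_frames = [pre_emphasis(f) for f in frames[start:end + 1]]
```

A real detector for noisy speech would typically add a noise-floor estimate or zero-crossing rate; the threshold here only shows where such logic would sit.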
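Sub-step 4) can be sketched as follows. The patent does not give the Mel-scale formula, the filter shape, or the exact cosine transform, so the common 2595·log10(1 + f/700) mapping, triangular filters, and the cosine-transform form usually used for MFCC are assumptions here; whether the "first 12 dimensions" include the zeroth coefficient is also not stated, and this sketch returns coefficients 1 to 12.

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """M triangular band-pass filters with centres equally spaced on the Mel scale.

    Each row holds one weight per power-spectrum bin 0..n_fft//2.
    """
    top = hz_to_mel(fs / 2.0)
    edges = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bank = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for k in range(n_fft // 2 + 1):
            f = k * fs / n_fft
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))   # rising edge
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))   # falling edge
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def cepstrum_from_log_energies(log_e, n_coeffs):
    """Cosine transform of the log band energies, coefficients 1..n_coeffs."""
    m_count = len(log_e)
    return [sum(le * math.cos(math.pi * l * (m + 0.5) / m_count)
                for m, le in enumerate(log_e))
            for l in range(1, n_coeffs + 1)]

def mfcc(power_spectrum, bank, n_coeffs=12):
    # Sum the energy inside each filter band, then take the logarithm.
    energies = [sum(w * p for w, p in zip(row, power_spectrum)) for row in bank]
    log_e = [math.log(e + 1e-12) for e in energies]   # small floor avoids log(0)
    return cepstrum_from_log_energies(log_e, n_coeffs)

# Usage with a made-up, smoothly decaying power spectrum (illustrative data only).
bank = mel_filterbank(n_filters=24, n_fft=256, fs=8000)
power_spectrum = [1.0 / (1 + k) for k in range(129)]
coeffs = mfcc(power_spectrum, bank)
```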

The reference model in step 4) is a GMM and semi-continuous HMM model. The model comprises a training database of Shanghai dialect speech and a codebook generated from that database; the mixture weights of the acoustic model are computed from the codebook and the training database, and the GMM and semi-continuous HMM model is finally generated.

The phonetic knowledge processing in step 6) comprises language model, lexical, and syntactic processing.

Compared with the prior art, the invention provides a Shanghai dialect speech model based on multi-channel GMM and semi-continuous HMM. This model alleviates, to some extent, the high time and space complexity of the HMM model; the multiple channels allow each additional weight to be estimated more accurately, and recognition speed is improved.
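In a semi-continuous HMM, every state shares one codebook of Gaussian densities and differs only in its mixture weights, which is consistent with computing "mixture weights" from a codebook and a training database. The patent does not say how the weights are estimated; the sketch below assumes a single EM-style re-estimation step for one state with diagonal-covariance Gaussians, and all names and data in it are hypothetical.

```python
import math

def gaussian(x, mean, var):
    """Density of a diagonal-covariance Gaussian at vector x."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)

def update_weights(vectors, codebook, weights):
    """One EM step for the mixture weights of a single state.

    codebook: list of (mean, var) pairs shared by all states.
    weights:  current mixture weights of this state (sum to 1).
    """
    totals = [0.0] * len(codebook)
    for x in vectors:
        joint = [w * gaussian(x, m, v) for w, (m, v) in zip(weights, codebook)]
        norm = sum(joint)
        if norm == 0.0:
            continue                      # vector too far from every codeword
        for k, j in enumerate(joint):
            totals[k] += j / norm         # posterior of codeword k given x
    s = sum(totals)
    return [t / s for t in totals] if s > 0 else list(weights)

def state_likelihood(x, codebook, weights):
    """Semi-continuous output probability: weighted sum over the shared codebook."""
    return sum(w * gaussian(x, m, v) for w, (m, v) in zip(weights, codebook))

# Usage: a two-codeword codebook and four one-dimensional training vectors.
codebook = [([0.0], [1.0]), ([10.0], [1.0])]
training = [[0.0], [0.1], [-0.1], [10.0]]
weights = update_weights(training, codebook, [0.5, 0.5])
```

Because three of the four vectors sit next to the first codeword, the re-estimated weights come out close to 0.75 and 0.25.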

Description of drawings

Fig. 1 is a flow chart of the present invention;

Fig. 2 is a schematic diagram of the hardware structure of the present invention.

Detailed description of the embodiments

The present invention is described in detail below with reference to the accompanying drawings and a specific embodiment.

Embodiment

As shown in Fig. 1, a Shanghai dialect speech recognition information processing method comprises the following steps:

Step 101: the audio input device 1 inputs a Shanghai dialect speech signal;

Step 102: the preprocessing module 21 preprocesses the input Shanghai dialect speech signal, mainly performing endpoint detection on the noisy speech signal, speech framing, and pre-emphasis;

Step 103: the feature extraction module 22 selects the pitch period, the formants, and the Mel frequency cepstral coefficients (MFCC) based on auditory properties as feature parameters. The pitch period carries rich tonal information; the formants and the auditory-based MFCC essentially reflect the timbre of the speech and are the most important feature parameters;

Step 104: the fundamental frequency of a speech signal is generally below 500 Hz, and even a soprano's high C does not exceed 1 kHz. The feature extraction module 22 therefore filters the speech signal with a low-pass filter of 1 kHz bandwidth, samples the result at a sampling frequency of 2 kHz, and computes the short-time correlation coefficients frame by frame with a delay time of 10 ms and a frame length of 20 ms, finally obtaining the pitch period;

Step 105: the feature extraction module 22 computes the discrete Fourier transform (DFT) of the speech signal directly and extracts the formant parameters of the speech signal from the DFT spectrum. The raw DFT spectrum, however, is affected by the harmonics of the fundamental frequency: maxima can appear only at harmonic frequencies, so the formant measurement error is large. To eliminate the influence of the harmonics, homomorphic deconvolution can be used: homomorphic filtering yields a smoothed spectrum, from which the formant parameters can be extracted directly by simple peak detection;

Step 106: the feature extraction module 22 filters the signal with M Mel band-pass filters. Because the components within each frequency band act additively in the human ear, the energy within each filter band is summed, which gives the output power spectrum of the k-th filter. The logarithm of each filter's output is taken to obtain the log power spectrum of the band, and an inverse discrete cosine transform is applied to obtain L-dimensional MFCC. However, because the first several and the last several MFCC dimensions contribute most to discriminating speech, the first 12 dimensions of the MFCC are usually taken;

Step 107: the training module 23 takes training speech signals entered by the user several times and obtains feature vectors through preprocessing and feature parameter extraction; the feature modeling module then builds the reference model base for the training speech, or applies an adaptive correction to the reference models already in the model base. The reference model is a GMM and semi-continuous HMM model. The model comprises a training database of Shanghai dialect speech and a codebook generated from that database; the mixture weights of the acoustic model are computed from the codebook and the training database, and the GMM and semi-continuous HMM model is finally generated;

Step 108: the recognition module 24 compares the feature vectors of the input speech for similarity against the models in the reference model base and outputs the model with the highest similarity as the recognition candidate result;

Step 109: the postprocessing module 25 applies phonetic knowledge processing to the recognition candidate result of step 108 to obtain the final recognition result;

Step 110: the final recognition result is output through the audio output device 3.
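The pitch estimation of step 104 can be illustrated with a short-time autocorrelation sketch. It assumes the signal has already been low-pass filtered and resampled to 2 kHz, so a 20 ms frame holds 40 samples. The text does not say exactly how the 10 ms delay time is used; this sketch reads it as the largest lag searched (20 samples), which is an assumption, as is the smallest lag of 4 samples (corresponding to 500 Hz).

```python
import math

def pitch_period(frame, fs, min_lag, max_lag):
    """Return the pitch period in seconds, or None for a silent or unvoiced frame.

    The period is the lag that maximises the short-time autocorrelation,
    normalised by the frame energy.
    """
    energy = sum(x * x for x in frame)
    if energy == 0.0:
        return None
    best_lag, best_r = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        r = sum(frame[n] * frame[n + lag]
                for n in range(len(frame) - lag)) / energy
        if r > best_r:
            best_lag, best_r = lag, r
    return None if best_lag is None else best_lag / fs

# Usage: one 20 ms frame of a 200 Hz tone sampled at 2 kHz.
fs = 2000
frame = [math.sin(2 * math.pi * 200 * n / fs) for n in range(40)]
period = pitch_period(frame, fs, min_lag=4, max_lag=20)   # 0.005 s, i.e. 200 Hz
```

At a 2 kHz sampling rate the lag resolution is 0.5 ms, so the estimate is coarse; interpolating around the correlation peak is a common refinement that the patent does not discuss.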
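The homomorphic smoothing mentioned in step 105 is commonly realised with the real cepstrum: take the log-magnitude spectrum, transform it to the quefrency domain, keep only the low-quefrency part (the vocal-tract envelope), and transform back. The sketch below follows that reading; the lifter length `n_keep`, the synthetic test frame, and the plain local-maximum peak picker are assumptions, and the direct O(N²) DFT is used only to keep the example dependency-free.

```python
import cmath
import math

def dft(x):
    """Direct O(N^2) discrete Fourier transform; adequate for one short frame."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def log_magnitude(frame):
    return [math.log(abs(v) + 1e-12) for v in dft(frame)]

def smoothed_log_spectrum(frame, n_keep):
    """Cepstrally smoothed log-magnitude spectrum of a real frame."""
    n = len(frame)
    log_mag = log_magnitude(frame)
    # The log magnitude is real and even, so its inverse DFT equals dft(...) / n.
    cepstrum = [v.real / n for v in dft(log_mag)]
    # Low-time lifter: keep quefrencies 0..n_keep-1 and their mirror images.
    liftered = [c if (q < n_keep or q > n - n_keep) else 0.0
                for q, c in enumerate(cepstrum)]
    return [v.real for v in dft(liftered)]

def spectral_peaks(smoothed, fs):
    """Frequencies (Hz) of local maxima in the lower half of the spectrum."""
    n = len(smoothed)
    return [k * fs / n for k in range(1, n // 2)
            if smoothed[k - 1] < smoothed[k] >= smoothed[k + 1]]

# Usage: a 1 kHz damped resonance excited by two pitch pulses 32 samples apart.
fs, n = 8000, 64
def resonance(t):
    return (0.9 ** t) * math.sin(2 * math.pi * 1000 * t / fs) if t >= 0 else 0.0
frame = [resonance(t) + resonance(t - 32) for t in range(n)]
candidates = spectral_peaks(smoothed_log_spectrum(frame, n_keep=12), fs)
```

Choosing `n_keep` below the pitch period in samples is what removes the harmonic ripple; too small a value also blurs neighbouring formants together.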
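Step 108 amounts to scoring the input feature vectors against every reference model and returning the best match. The sketch below assumes log-likelihood as the similarity measure and, to stay short, represents each reference model as a single diagonal-covariance Gaussian rather than the GMM and semi-continuous HMM of step 107; the labels `word_a` and `word_b` and all numbers are hypothetical.

```python
import math

def log_likelihood(vectors, mean, var):
    """Total log-likelihood of a sequence of feature vectors under one Gaussian."""
    total = 0.0
    for x in vectors:
        for xi, mi, vi in zip(x, mean, var):
            total += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return total

def recognize(vectors, models):
    """Return (label, score) of the reference model with the highest similarity."""
    scores = {label: log_likelihood(vectors, mean, var)
              for label, (mean, var) in models.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Usage: two hypothetical reference models and a two-frame input.
models = {
    "word_a": ([0.0, 0.0], [1.0, 1.0]),
    "word_b": ([1.0, 1.0], [1.0, 1.0]),
}
label, score = recognize([[0.9, 1.1], [1.0, 1.0]], models)   # label == "word_b"
```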

As shown in Fig. 2, the hardware of the invention comprises an audio input device 1, a processor 2, and an audio output device 3. The processor 2 comprises a preprocessing module 21, a feature extraction module 22, a training module 23, a recognition module 24, and a postprocessing module 25. The audio input device 1 is connected to the preprocessing module 21; the feature extraction module 22 is connected to the training module 23 and to the recognition module 24; the training module 23 is connected to the recognition module 24; and the recognition module 24, the postprocessing module 25, and the audio output device 3 are connected in sequence.
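The module connections described above can be mirrored in software as a small pipeline object. This is only a structural sketch: the class name `Recognizer`, its fields, and the toy stand-in functions in the usage example are hypothetical, and the real modules 21 to 25 would replace them.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

@dataclass
class Recognizer:
    preprocess: Callable[[List[float]], Any]     # role of module 21
    extract: Callable[[Any], Any]                # role of module 22
    score: Callable[[Any, Any], float]           # similarity used by module 24
    postprocess: Callable[[str], str]            # role of module 25
    models: Dict[str, Any] = field(default_factory=dict)   # reference model base

    def train(self, label, recordings, build_model):
        """Role of module 23: build or replace the reference model for one label."""
        features = [self.extract(self.preprocess(r)) for r in recordings]
        self.models[label] = build_model(features)

    def recognize(self, signal):
        """Modules 21 -> 22 -> 24 -> 25 in sequence."""
        features = self.extract(self.preprocess(signal))
        candidate = max(self.models,
                        key=lambda lab: self.score(features, self.models[lab]))
        return self.postprocess(candidate)

# Usage with toy stand-ins: the "feature" is the signal mean, a "model" is the
# mean of the training features, and similarity is negative absolute distance.
rec = Recognizer(
    preprocess=lambda s: s,
    extract=lambda s: sum(s) / len(s),
    score=lambda f, m: -abs(f - m),
    postprocess=str.upper,
)
rec.train("low", [[0.0, 0.0], [0.2, 0.2]], lambda fs: sum(fs) / len(fs))
rec.train("high", [[1.0, 1.0]], lambda fs: sum(fs) / len(fs))
result = rec.recognize([0.9, 1.1])   # "HIGH"
```

Keeping the model base as a plain dictionary makes the adaptive correction of step 107 a matter of calling `train` again for an existing label.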

Claims

1. A Shanghai dialect speech recognition information processing method, characterized in that it comprises the following steps:

1) an audio input device inputs a Shanghai dialect speech signal;

7) the final recognition result is output through an audio output device.

2. The Shanghai dialect speech recognition information processing method according to claim 1, characterized in that the preprocessing in step 2) comprises endpoint detection on the noisy speech signal, speech framing, and pre-emphasis.

3. The Shanghai dialect speech recognition information processing method according to claim 1, characterized in that the feature parameters reflecting the characteristics of the signal are extracted in step 3) as follows:

4. The Shanghai dialect speech recognition information processing method according to claim 1, characterized in that the reference model in step 4) is a GMM and semi-continuous HMM model; the model comprises a training database of Shanghai dialect speech and a codebook generated from that database; the mixture weights of the acoustic model are computed from the codebook and the training database, and the GMM and semi-continuous HMM model is finally generated.

5. The Shanghai dialect speech recognition information processing method according to claim 1, characterized in that the phonetic knowledge processing in step 6) comprises language model, lexical, and syntactic processing.