WO2005038774A1 - System and method for adaptive sound and image learning
- Publication number
- WO2005038774A1 (PCT/NZ2004/000264)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- adaptive
- input
- recognition system
- layer
- base layer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
- G10L15/18—Speech classification or search using natural language modelling
Definitions
- the invention relates to a methodology and system for adaptive sound and image learning and recognition.
- the invention is intended to be applied to the recognition and learning of sounds.
- "sounds" is intended to include both speech and voice in an input signal. References are made to "speech" by way of example.
- the invention comprises an adaptive sound recognition system comprising an input layer comprising one or more input nodes configured to receive input data; a rule base layer comprising one or more rule nodes associated with respective functions, the rule base layer configured to receive input from the input layer; and an output layer comprising one or more output nodes associated with respective functions, the output layer configured to receive input from the rule base layer.
- the invention comprises an adaptive image recognition system comprising an input layer comprising one or more input nodes configured to receive input data; a rule base layer comprising one or more rule nodes associated with respective functions, the rule base layer configured to receive input from the input layer; and an output layer comprising one or more output nodes associated with respective functions, the output layer configured to receive input from the rule base layer.
- the invention comprises an adaptive combination image and sound recognition system comprising an input layer comprising one or more input nodes configured to receive input data; a rule base layer comprising one or more rule nodes associated with respective functions, the rule base layer configured to receive input from the input layer; and an output layer comprising one or more output nodes associated with respective functions, the output layer configured to receive input from the rule base layer.
- the invention further provides related methods of recognising sound and/or image input signals.
- Figure 1 shows a methodological framework on which the invention may be implemented
- Figure 2 is a simplified graphical representation of a neural network in accordance with the invention
- Figure 3 shows a further methodological framework in which the invention may be implemented
- Figure 4 shows a preferred form end point detector in accordance with the invention
- Figure 5 shows a preferred form feature extraction and noise suppression process
- Figure 6 shows an implementation of an image recognition feature of the invention
- Figure 7 shows a speaker identification component of the invention
- Figure 8 shows a person identifier feature of the invention
- Figure 9 shows a voice activated home automation system implemented using the invention
- Figure 10 shows an example of adaptive person verification
- Figure 11 shows performance of the invention on a data set
- Figure 12 shows performance of the invention on a further data set
- Figure 13 shows performance of the invention on a further data set
- Figure 14 shows a graphical user interface of one embodiment of the invention
- Figure 15 shows a schematic block diagram of one form of the recognition algorithm
- Figure 16 shows an example of fingerprint recognition in accordance with one aspect of the invention.
- Figure 1 illustrates an overview of a framework 100 for adaptive speech and image recognition systems. Inputs to the system come from two primary sources, a continuous speech signal 102 and a continuous image signal 104.
- Two separate specialised adaptive modules process the incoming inputs, a speech processing module 106 and an image processing module 108.
- the outputs from each module are further processed in another module 110 (integration/language module) to generate final outputs 112.
- the adaptive structure of the framework allows the system to correct itself at any level of its operation through an adaptation process, by checking the result as indicated at 114.
- a preferred technique is a simplified version of the Evolving Connectionist System (ECoS) paradigm: a specific neural network for fast adaptive learning called the Adaptive Connectionist Classifier (ACC).
- ACC is used as the specific model for the implementation of all the methods and systems described in this invention, thus making the framework integrated and uniformly implemented.
- Figure 2 is a simplified graphical representation of the ACC architecture.
- An ACC 200 consists of three layers of neurons: the input layer 202, with linear transfer functions; an evolving layer 204 or rule base layer, based upon the rule layer of the EFuNN model; and an output layer 206 with a simple saturated linear activation function.
- Figure 3 shows an alternative overview of a framework 300 for adaptive sound recognition systems.
- Input to the system comes from a primary source of sound data, a continuous sound signal 305.
- a pre-processing module 310 processes the incoming sound data.
- the processed and segmented sound data is then passed to module 315 for adaptive endpoint detection.
- Control is then optionally passed to an adaptive background noise suppression component 320 resulting in a denoised word signal 325.
- Adaptive modelling and recognition is then performed with module 330 and the correct identification of a word is tested as indicated at 335. If the correct identification of the word has been performed, then control is passed to retrieve further sound data 305. If the correct identification has not been made, adaptation of components of the system is carried out as indicated at 340 followed by adaptive noise suppression and/or adaptive modelling.
- the evolving layer is the layer that will grow and adapt itself to the incoming data, and is the layer with which the learning algorithm is most concerned.
- the meaning of the incoming connections, and the activation and forward propagation algorithms, of the evolving layer all differ from those of classical connectionist systems. If a linear activation function is used, the activation of an evolving (rule base) layer node n is determined by Equation (1):

  A_n = 1 − D_n    (1)

  where D_n is the normalised distance between the input vector and the incoming weight vector for that node.
- Rule nodes in the rule base layer are associated with respective activation functions, preferably linear activation functions. Other activation functions, such as radial basis functions, could be used.
- the preferred form of learning algorithm or function is based on accommodating, within the evolving layer, new training examples by either modifying the connection weights of the evolving layer nodes, or by adding a new node.
- the algorithm employed is described below (a code sketch follows the distance measure defined after Equations (2) and (3)):
  • propagate the input vector I through the network
  • IF the maximum activation a_max is less than a coefficient called the sensitivity threshold S_thr, add a node
  • ELSE evaluate the error between the calculated output vector O_c and the desired output vector O_d:
    o IF the error is greater than an error threshold E_thr, OR the desired output node is not the most highly activated, add a node
    o ELSE update the connections to the winning node in the evolving layer
  • repeat the above procedure for each training vector
- the incoming weights to the winning node j are modified according to Equation (2), while the outgoing weights from node j are modified according to Equation (3):

  W_ij(t + 1) = W_ij(t) + η1 (x_i − W_ij(t))    (2)

  where W_ij(t) is the connection weight from input i to j at time t, W_ij(t + 1) is that weight at time t + 1, η1 is the learning rate and x_i is element i of the input vector.

  W_jp(t + 1) = W_jp(t) + η2 A_j E_p    (3)

  where W_jp(t) is the connection weight from j to output p at time t, W_jp(t + 1) is that weight at time t + 1, η2 is the learning rate, A_j is the activation of node j, and E_p = O_d(p) − O_c(p) is the error at p, O_d(p) being the desired output and O_c(p) the calculated output at p.
- the distance measure D_n in Equation (1) above is preferably calculated as the normalised Manhattan distance between the input vector and the incoming weight vector of node n:

  D_n = (1 / I) Σ_{i=1..I} |E_i − W_in|

  where I is the number of input nodes in the ACC, E is the input vector and W is the input-to-evolving-layer weight matrix.
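For illustration, a minimal Python/NumPy sketch of this one-pass learning procedure is given below. It is a sketch under assumptions, not the patent's implementation: the class name SimpleACC, the parameter names (s_thr, e_thr, lr1, lr2) and the winner-take-all output propagation are illustrative choices.

```python
import numpy as np

class SimpleACC:
    """Minimal sketch of an evolving classifier in the spirit of the ACC.

    Rule nodes are stored as rows of W_in (incoming weights) and W_out
    (outgoing weights). Inputs are assumed normalised to [0, 1]."""

    def __init__(self, n_inputs, n_outputs, s_thr=0.5, e_thr=0.1,
                 lr1=0.05, lr2=0.05):
        self.n_inputs, self.n_outputs = n_inputs, n_outputs
        self.s_thr, self.e_thr = s_thr, e_thr   # sensitivity / error thresholds
        self.lr1, self.lr2 = lr1, lr2           # learning rates (eta1, eta2)
        self.W_in = np.empty((0, n_inputs))     # input -> evolving layer
        self.W_out = np.empty((0, n_outputs))   # evolving -> output layer

    def _activations(self, x):
        # Equation (1): A_n = 1 - D_n, with D_n a normalised distance
        d = np.abs(self.W_in - x).sum(axis=1) / self.n_inputs
        return 1.0 - d

    def train_one(self, x, desired):
        if len(self.W_in) == 0:
            self._add_node(x, desired)
            return
        a = self._activations(x)
        j = int(np.argmax(a))                       # winning rule node
        out = np.clip(self.W_out[j] * a[j], 0, 1)   # saturated linear output
        err = desired - out
        if a[j] < self.s_thr:                       # input too far from any node
            self._add_node(x, desired)
        elif np.abs(err).max() > self.e_thr or np.argmax(out) != np.argmax(desired):
            self._add_node(x, desired)
        else:
            # Equation (2): move incoming weights of winner towards x
            self.W_in[j] += self.lr1 * (x - self.W_in[j])
            # Equation (3): correct outgoing weights of winner
            self.W_out[j] += self.lr2 * a[j] * err

    def _add_node(self, x, desired):
        self.W_in = np.vstack([self.W_in, x])
        self.W_out = np.vstack([self.W_out, desired])
```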
- the ACC architecture is similar to the Zero Instruction Set Computer (ZISC) architecture.
- ZISC is based on RBF ANN and requires several training iterations over input data.
- Aggregation of nodes in the evolving layer can be employed to control the size of the evolving layer during the learning process.
- the principle of aggregation is to merge those nodes which are spatially close to each other. Aggregation can be applied after every training example (or after every n training examples). It will generally improve the generalisation capability of the ACC.
- Node aggregation is an important regularisation mechanism that is not present in ZISC. It is highly desirable in some application areas, such as speech or image recognition systems (a code sketch follows).
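A minimal sketch of one plausible aggregation step, assuming the simple strategy of averaging together nodes of the same winning class whose incoming weight vectors lie within an aggregation distance threshold (the name and value of agg_thr are illustrative):

```python
import numpy as np

def aggregate_nodes(W_in, W_out, agg_thr=0.1):
    """Merge evolving-layer nodes that are spatially close and map to the
    same output class; merged groups are replaced by their mean. Sketch only."""
    keep_in, keep_out = [], []
    used = np.zeros(len(W_in), dtype=bool)
    for i in range(len(W_in)):
        if used[i]:
            continue
        # normalised Manhattan distance between node i and all nodes
        d = np.abs(W_in - W_in[i]).sum(axis=1) / W_in.shape[1]
        same_class = W_out.argmax(axis=1) == W_out[i].argmax()
        group = (~used) & (d <= agg_thr) & same_class
        used |= group
        keep_in.append(W_in[group].mean(axis=0))
        keep_out.append(W_out[group].mean(axis=0))
    return np.array(keep_in), np.array(keep_out)
```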
- the vocabulary of speech recognition systems often needs to be customised to meet individual needs. This can be achieved by adding words to, or removing words from, the existing vocabulary.
- ACC is suitable for on-line output expansion because it uses local learning which tunes only the connection weights of the local node, so all the knowledge that has been captured in the nodes in the evolving layer will be local and only covering a "patch" of the input-output space.
- adding new classes or new inputs does not require re-training of the whole system on both the new and old data, as is required with traditional neural networks.
- the task is to introduce an algorithm for on-line expansion and reduction of the output space in ACC.
- the ACC is a three layer network with two layers of connections. Each node in the output layer represents a particular class in the problem domain. This local representation of nodes in the evolving layer enables ACC to accommodate new classes or remove an already existing class from its output space.
- the structure of the existing ACC first needs to be modified to encompass the new output node. This modification affects only the output layer and the connections between the output layer and the evolving layer.
- the graphical representation of this process is shown in Figure 2.
- the connection weights between the new output 208 in the output layer 206 and the evolving layer 204 are initialised to zero (the dotted line in Figure 2).
- the new output node 208 is set by default to classify all previously seen classes as negative.
- the ACC is further trained on the new data.
- new nodes are created in the evolving layer 204 to represent the new class.
- the process of adding new output nodes to ACC is carried out in a supervised manner.
- a new output node will be added only if it is indicated that the given input vector is a new class.
- the output expansion algorithm proceeds as outlined above: a new output node is created with zero-initialised connections to the evolving layer, and the ACC is then further trained on data of the new class (a code sketch follows).
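Continuing the SimpleACC sketch above, a minimal illustration of output expansion might look as follows (the function name expand_outputs is hypothetical):

```python
import numpy as np

def expand_outputs(acc, n_new=1):
    """Add n_new output nodes to a SimpleACC (see sketch above).

    The new output's connections to the evolving layer are initialised
    to zero, so it classifies all previously seen data as negative."""
    acc.W_out = np.hstack([acc.W_out, np.zeros((len(acc.W_out), n_new))])
    acc.n_outputs += n_new
```

After expansion, calling acc.train_one(x, desired) with desired vectors of the enlarged dimensionality creates new evolving-layer nodes that represent the new class.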
- The ACC output deletion algorithm. This is the inverse of the output expansion algorithm described above. Removing an output class from the ACC is performed in a supervised manner and affects only the output layer and the evolving layer of the ACC architecture.
- the above algorithm is equivalent to de-allocating the part of the problem space which had been allocated to the removed output class, so that no space remains allocated for the deleted class. In other words, the network is unlearning a particular output class.
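A corresponding sketch of output deletion, under the assumption that an evolving-layer node is owned by the class its strongest outgoing connection points to:

```python
import numpy as np

def delete_output(acc, class_idx):
    """Remove output class class_idx from a SimpleACC (sketch).

    Evolving-layer nodes whose strongest outgoing connection points to
    the deleted class are removed, de-allocating that part of the
    problem space; the output column itself is then deleted."""
    owners = acc.W_out.argmax(axis=1)
    keep = owners != class_idx
    acc.W_in = acc.W_in[keep]
    acc.W_out = np.delete(acc.W_out[keep], class_idx, axis=1)
    acc.n_outputs -= 1
```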
- the main principle of speaker adaptation is to modify the recognition system to accommodate the acoustic variation of a new speaker.
- the first step in any speech recognition system is to prepare the captured speech in a suitable format so that the necessary feature vectors can be extracted.
- the raw speech signal is often captured with noise and non-speech (silence), which is not always desirable.
- Endpoint detection, i.e. accurately determining the exact moments at which speech begins and ends, is one of the key problems in the pre-processing stage of speech recognition systems.
- the simplest method of making a speech/non-speech decision is to use a combination of zero-crossing rate and RMS energy. Together these two features provide a reasonable separation of speech and silence, since low-energy speech (fricatives) tends to have a high zero-crossing rate while low zero-crossing speech (vowels) tends to have high energy.
- this method, however, is not reliable or accurate enough, especially when the signal-to-background-noise ratio is low (a baseline code sketch follows).
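For reference, a minimal sketch of this baseline decision rule; the frame length, thresholds and the way energy and zero-crossing evidence are combined are illustrative assumptions, not taken from the patent:

```python
import numpy as np

def zcr_energy_vad(signal, frame_len=441, e_thr=0.02, z_thr=0.25):
    """Naive per-frame speech/non-speech decision from RMS energy and
    zero-crossing rate. Thresholds are illustrative and data-dependent."""
    decisions = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        rms = np.sqrt(np.mean(frame ** 2))
        zcr = np.mean(np.abs(np.diff(np.sign(frame))) > 0)
        # speech if energetic (vowels) or moderately energetic with many
        # zero crossings (fricatives)
        is_speech = rms > e_thr or (zcr > z_thr and rms > e_thr / 4)
        decisions.append(int(is_speech))
    return np.array(decisions)
```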
- a method based on a neural network is proposed to determine the speech and non-speech segments of speech signals.
- Speech is a non-stationary (time-varying) signal; silence (background noise) is also typically non-stationary. Background noise may consist of mechanical noises such as fans or conversations, movements, and door slams that are difficult to characterize.
- the invention involves classifying these two non-stationary signals. Since the speech and silence patterns are highly variable, it is desirable to use an adaptive solution to classify these two signals.
- the task is to detect the silence/background-noise between words (inter- word) or within a word (intra-word).
- Figure 4 illustrates an implementation of a neural network model for speech/non-speech detection.
- Speech signals uttered by different speakers were captured using various microphones in different environments. Training and testing sets were prepared from manually segmented and labelled speech signals. The acoustic features were calculated every 10 ms, with a frame length of 20 ms. For each frame, 12 Mel-Frequency Cepstrum Coefficients (MFCC), log energy, and their corresponding first and second order derivatives were computed. In addition, the zero-crossing rate for each frame was included as one of the elements in the input feature vector (a code sketch follows).
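A sketch of comparable frame-level features using the librosa library is shown below; the parameter values mirror the 10 ms / 20 ms framing described above, but the exact coefficient counts (and hence the total feature dimensionality) are assumptions and need not match the feature count reported later:

```python
import librosa
import numpy as np

def frame_features(wav_path):
    """Per-frame features: 13 MFCCs (incl. a 0th, energy-like coefficient),
    first and second derivatives, and zero-crossing rate.
    10 ms hop, 20 ms frames, as in the text."""
    y, sr = librosa.load(wav_path, sr=22050)
    hop, win = int(0.010 * sr), int(0.020 * sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=win, hop_length=hop)
    d1 = librosa.feature.delta(mfcc)
    d2 = librosa.feature.delta(mfcc, order=2)
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=win,
                                             hop_length=hop)
    n = min(mfcc.shape[1], zcr.shape[1])     # align frame counts
    feats = np.vstack([mfcc[:, :n], d1[:, :n], d2[:, :n], zcr[:, :n]])
    return feats.T                            # one row per 10 ms frame
```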
- the task is to train an ACC to classify the input vectors into two classes: speech and non-speech (silence).
- the outputs of the ACC give the speech/non-speech classification of the given input frame.
- the output of the neural network is a sequence of binary numbers, 0 representing non-speech and 1 representing speech, for each frame of the signal.
- a rule-based system is then applied to the sequence of speech and non-speech frame decisions to distinguish frames representing background noise from frames representing data of interest. The following are examples of these decision-making rules (a code sketch follows the list):
- IF frame n is classified as speech AND the preceding c or more frames are silence frames, THEN frame n marks a candidate beginning of speech;
- ELSE IF the subsequent c or more frames are also non-speech, THEN frame n is treated as non-speech.
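A minimal sketch of this kind of rule-based smoothing; since the rules above are only partially recoverable, the sketch implements one common reading (discarding short speech spikes and bridging short intra-word silences), with c illustrative:

```python
import numpy as np

def smooth_labels(labels, c=5):
    """Post-process per-frame 0/1 speech labels: speech runs shorter than
    c frames become non-speech, and silence gaps shorter than c frames
    inside speech become speech. Sketch of one plausible rule set."""
    labels = np.asarray(labels).copy()
    boundaries = np.flatnonzero(np.diff(labels)) + 1
    for run in np.split(np.arange(len(labels)), boundaries):
        if labels[run[0]] == 1 and len(run) < c:
            labels[run] = 0          # isolated speech spike -> non-speech
    boundaries = np.flatnonzero(np.diff(labels)) + 1
    for run in np.split(np.arange(len(labels)), boundaries):
        if (labels[run[0]] == 0 and len(run) < c
                and run[0] > 0 and run[-1] < len(labels) - 1):
            labels[run] = 1          # short intra-word silence -> speech
    return labels
```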
- Figure 5 illustrates the feature extraction procedure used to obtain input feature vectors.
- the speech signal obtained from the pre-processing is divided into windows of 20 ms duration. Each window overlaps 50% with the previous window. Spectral analysis of the speech signal is performed for every window to calculate Mel-Frequency Cepstrum Coefficients (MFCC). A Discrete Cosine Transformation (DCT) is then applied to the MFCC of the whole word in the following manner: for an m-frame segment, the DCT will result in a set of m DCT coefficients. This sequence is truncated to achieve a fixed-size input vector of 20 × d components, where d is the dimensionality of the feature space (a code sketch follows).
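A sketch of the whole-word DCT truncation using SciPy; keeping n_coeff = 20 coefficients per feature dimension matches the 20 × d vector described above (with d = 5 this would give the 100-node input layer used later), though the zero-padding of short words is an assumption:

```python
import numpy as np
from scipy.fft import dct

def fixed_length_word_vector(frames, n_coeff=20):
    """frames: (m, d) array of per-frame features for one word.
    Apply a DCT along time to each of the d feature trajectories and keep
    the first n_coeff coefficients, giving a fixed n_coeff * d vector."""
    m, d = frames.shape
    coeffs = dct(frames, axis=0, norm='ortho')   # (m, d) DCT over time
    if m < n_coeff:                              # pad short words
        coeffs = np.vstack([coeffs, np.zeros((n_coeff - m, d))])
    return coeffs[:n_coeff].ravel()              # length n_coeff * d
```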
- Speaker adaptation is needed when words within the existing vocabulary of the ACC are not recognised correctly. Speaker adaptation is achieved through a fast and straightforward procedure. Once a given word pronounced by a new speaker is not recognised correctly, the new speaker can adapt the ACC by additionally training the network on his/her speech. This is achieved through a supervised adaptation procedure: new speakers are required to pronounce the misrecognised word, and the recognition system adapts by structurally evolving the ACC to accommodate them. Experimental results show that only a few instances of data from new speakers are adequate for the ACC to learn their acoustic variation. In addition, unlike traditional neural networks, further training does not degrade the performance of the ACC on its previous knowledge.
- one of the desirable features is to enable adding new words into existing vocabulary of the system.
- the online expansion algorithm described above allows new words to be added to the existing vocabulary of the system.
- the vocabulary expansion takes place in a supervised manner. Once a new word is presented to the system, the user is required to provide a label for the new word.
- the expansion algorithm creates an additional output node and adapts the ACC to the new word.
- the speech data was recorded in a quiet room environment to obtain clean speech signals.
- 23 native New Zealand English speakers participated in the recording sessions using a close-mouth microphone.
- the speech was sampled at 22.05 kHz and quantised to a 16 bit signed number. Each word was uttered five times with distinct pauses in between. Each of these words was then carefully manually segmented and labelled.
- speech data was prepared for the English digits "zero" to "nine".
- the speech data used in these experiments was obtained from two groups of speakers: group A (8 male and 8 female) and group B (3 male and 4 female). The data from each group was divided into training and testing sets, giving training set A, testing set A, training set B and testing set B.
- Training set A with 480 examples was obtained from 3 utterances of each word in group A. The remaining 2 utterances were used in the testing set A for a total of 320 examples. Training set B with 140 examples was obtained from 2 utterances of each word in group B. Testing set B with 210 examples was obtained from the remaining 3 utterances of each word in group B.
- Spectral analysis of the speech signal was performed according to feature extraction procedure described above to obtain input feature vectors.
- An ACC was initialised with 100 nodes in the input layer and 10 nodes in the output layer (one for each word). In phase one of the experiments the ACC was trained on the training set A with the parameters shown in Table 1. Aggregation was used to control the size of the evolving layer during the learning process. Table 1. ACC's Training and Aggregation Parameters
- To adapt to the new speakers of testing set B, the trained ACC was additionally trained on training set B. 43 additional nodes were added to the ACC to accommodate the variations in the new speakers. The performance of the adapted ACC on its training set B was 100%. The adapted ACC was then tested on the original training set A and on testing sets A and B. Performance of the adapted ACC on training set A remained unchanged. Table 3 illustrates the performance of the ACC on testing sets A and B after adaptation to the new speakers; the ACC retained its recognition ability on old data while achieving excellent adaptation to new data. Table 3. Performance of ACC on NZ English Digits on "Testing Set A" and "Testing Set B" after adaptation on "Training Set B".
- the second phase of the experiment was designed to expand the current ACC to recognise seven new words.
- the same procedure was applied to prepare training and testing sets.
- For the 7 additional words there were a total of 336 examples in training set A, 224 examples in testing set A and 147 examples in testing set B.
- 130 new nodes were added to the evolving layer of the expanded ACC during the training process.
- Table 4 illustrates the performance of the new ACC on the original and additional testing set A and testing set B.
- Addition of new classes (output nodes) to the network does not cause any disturbance to the recognition rate of the existing classes.
- the worst recognition rate for any of the existing words was 93.75%, for the word 'five', which is identical to its recognition rate before the addition of the new classes.
- of the added words, only the word 'divides' had less than perfect positive accuracy, with a performance of 93.75%.
- the addition of new outputs does not cause any forgetting of existing classes, nor does it disturb the memorisation ability of the network: the classes that were added are recognised as well as the existing classes.
- while the mild aggregation has improved the generalisation capability of the network, it highlights a problem with aggregation during training: as the distribution of the data for each class is different, it is dangerous to apply the same aggregation threshold to nodes that represent different classes. Ideally, the aggregation threshold should be set for each class, making it possible to tune the threshold for optimal performance per class. The manner in which these thresholds are optimised is a matter of further research.
- Training and testing data were prepared from speech and non-speech segments of speech signals. A total of 42 input features were computed for every input frame. The task is to classify these input frames into speech or non-speech classes.
- An ACC was trained on 1900 training samples; 293 nodes were created during the learning process. Table 5 shows the results of the ACC on the training and testing samples. Table 5: Performance of ACC on speech and non-speech training and testing data
- the ACC is suitable for the task of classifying speech and silence because it can be adapted to new background noises through supervised adaptation.
- Figure 4 illustrates the performance of connectionist-based endpoint detection on a spoken word. The system also allows on-line adaptation on both silence and speech segments of a speech signal: the user selects the silence or speech segment and retrains the system to learn it as new data. This is an essential feature of the adaptive endpoint detection system, since the nature of background noise varies depending on the recording environment.
- the aim of the integration module 110 of Figure 1 is to combine the speaker identification and image recognition modules to recognise the identity of a person. This study attempts to increase the robustness of the person identification system by applying the integration module presented in this section.
- Figure 1 illustrates the overall proposed integration module for speech and image.
- This module receives inputs from image and speaker recognition modules.
- the pre-processing of the contributions from the image and speaker recognition modules is performed by assigning appropriate weights to their recognition rates.
- the weights have values that are proportional to the probability of each module producing correct classification results. These weights are determined experimentally based on the performances of each module over some testing data.
- the weights for the contributions from the image and speaker recognition modules are computed in the following manner. Assume that the image and speaker recognition modules are trained to recognise n persons. The performance of each module is tested on a number of new samples. For instance, if 85 percent of images of output category j are correctly recognised, then the reliability of the image network for output j is 0.85; thus 0.85 is assigned as the weight for the contribution of the image recognition module for output j, denoted WI_j. A similar process is performed to assign the weights for the contributions of the speaker recognition module, denoted WS_j.
- the final result of the integration module is obtained by statistically combining the contributions from the image and speaker recognition modules.
- the contribution from the image (or speaker) recognition module is an array of output activations, one per class, where aI_j is the activation of the image recognition module for output j and aS_j is the activation of the speaker recognition module for output j. The combined score for output j weights each activation by the corresponding module reliability, WI_j · aI_j + WS_j · aS_j.
- the output class with the highest combined score is the winner, i.e. the final result of classification.
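A minimal sketch of this kind of weighted combination; the weighted-sum form and all numbers below are illustrative assumptions:

```python
import numpy as np

def integrate(a_img, a_spk, w_img, w_spk):
    """Combine per-class activations from the image and speaker modules,
    weighted by each module's per-class reliability (WI_j, WS_j above).
    Returns the winning class index. Sketch of one plausible combination."""
    score = w_img * a_img + w_spk * a_spk
    return int(np.argmax(score))

# illustrative values: module activations and reliabilities for 3 persons
a_img = np.array([0.70, 0.20, 0.10])
a_spk = np.array([0.40, 0.45, 0.15])
w_img = np.array([0.85, 0.90, 0.80])   # e.g. 85% of class-0 images correct
w_spk = np.array([0.75, 0.70, 0.80])
print(integrate(a_img, a_spk, w_img, w_spk))   # -> 0
```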
- the invention in one form is configured to identify a person based on two separate sources of information.
- One module is designed to identify a person according to the characteristics of his/her voice and the other module recognises a person based on image of his/her face.
- Image recognition consists of locating and identifying known patterns in an image. Depending on the application, the patterns can be expected to appear within pre-defined zones in the field of view or anywhere.
- the object or pattern to identify is defined by a region including the edges of the objects along with a layer of background pixels, so that the recognition engine can learn variations of these edges and be able to differentiate anomalies from similarities.
- the same object can be represented with different feature vectors.
- Several feature vectors can be combined to build a robust engine.
- the image recognition module uses evolving neural networks to recognise a given image.
- Figure 6 illustrates an implementation of image recognition module.
- Feature extraction is the process of extracting a defined area of an image, and transforming the information into a "feature" vector. There are a number of different feature extraction methods to choose from, all of which will generate different values for the vector. The generated vector can then be sent to the neural network engine for either learning or recognition (classification).
- the available feature extraction options are: pixel values, gradient values, standard histogram, cumulative histogram, gradient histogram, vertical profile, horizontal profile, composite profile, vertical gradient profile, horizontal gradient profile and composite gradient profile. The most frequently used is the composite profile.
- if the image of interest has a size of m × n, denoted I_{m×n}, the composite profile vector V is an array of m + n components, denoted V_{m+n}.
- the m components represent the vertical profile (the average pixel value of each column), calculated according to Equation (9); the remaining n components represent the horizontal profile (the average pixel value of each row). A code sketch follows.
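A minimal NumPy sketch of the composite profile feature; the implementation simply concatenates column and row averages, giving m + n components for an m × n image:

```python
import numpy as np

def composite_profile(image):
    """Composite profile of a greyscale image: column averages (vertical
    profile) concatenated with row averages (horizontal profile)."""
    img = np.asarray(image, dtype=float)
    vertical = img.mean(axis=0)     # average pixel value of each column
    horizontal = img.mean(axis=1)   # average pixel value of each row
    return np.concatenate([vertical, horizontal])
```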
- the image recognition module allows various evolving connectionist systems to be used. These include Zero Instruction Set Computing (ZISC), the Evolving Classifier Function (ECF) and the Adaptive Connectionist Speech and Image Classifier (ACC). They all allow structural adaptation by modifying their internal structure during the learning process, enabling output expansion and reduction, which is well suited to image recognition applications.
- a neural network is trained to recognise images of faces.
- the number of outputs corresponds to the number of faces to be identified.
- an input feature vector representing an image is presented to the network, which recognises the image. If the image is not recognised correctly, the network can be adapted by retraining it on that image. A person's image can be added (learned) or removed (unlearned) from the recognition system without affecting its existing knowledge.
- Speech signals contain two types of information, individual and phonetic. They have mutual effects and are not easy to separate, which represents one of the main problems in the area of speaker and speech recognition systems.
- speaker recognition systems fall into two categories: text-dependent and text-independent. Text-dependent means that the text used in training the system is the same as that used in testing it, whereas in text-independent recognition the text used in testing is not limited to the text used to train the system.
- the speaker identification system proposed here is text dependent.
- a neural network is trained on a key word or phrase from various speakers.
- the number of output nodes in the output layer of neural network corresponds to the number of known speakers.
- the network can be adapted to identify a new speaker using the output expansion algorithm and then further adapted to that speaker.
- the output deletion algorithm can be used to remove an output (speaker) from the output layer of the neural network.
- a database of 6 speakers (3 males, 3 females) was recorded in a laboratory environment. Each speaker was asked to utter the same word, "Security", six times. Two instances of the word from each speaker were used for training and the remaining four were used for testing the network. The process of feature extraction from the speech signal is described above.
- An ACC was initialised with 100 input and 6 output nodes (one for each speaker). The network was then trained on the training data; 9 nodes were created in the evolving layer during the learning process. Table 6 illustrates the performance of the ACC on both training and testing data.
- Figure 7 illustrates a Graphical User Interface (GUI) developed for connectionist speaker identification systems.
- the GUI is developed in MATLAB and allows online speaker identification, registration and adaptation.
- Figure 8 illustrates a GUI for person identification.
- Home automation systems refer to systems that permit remote control of electronic devices in the immediate surroundings.
- a person for example, can turn lights, fan, and television on and off through a voice command.
- any aspect of the environment can be controlled depending upon the system's complexity.
- the aim of this case study is to develop a small speech recognition system to control certain devices at various locations of a private home.
- the recognition system used in this study is based on the adaptive speech recognition systems described above.
- a simple language module is proposed for the task of operating certain devices through voice commands.
- the language module requires a command sequence of "Location” + “Device” + “Function” to execute a given command.
- Figure 9 illustrates the layout of the commands to operate certain home appliances. For instance, the command sequence "kitchen fan on" turns on the fan in the kitchen. However, it is possible to use the general command sequence partially: if there is only one TV in the house, a command sequence of "Device" + "Function" will be adequate (i.e. "television on"). A code sketch of such a language module follows.
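A minimal sketch of such a language module; the vocabulary, the device table and the rule for accepting a partial sequence when a device is unique in the house are illustrative assumptions:

```python
# Minimal sketch of the "Location + Device + Function" language module.
LOCATIONS = {"kitchen", "lounge", "bedroom"}
DEVICES = {"fan", "light", "television"}
FUNCTIONS = {"on", "off"}
UNIQUE_DEVICES = {"television": "lounge"}   # only one TV in the house

def parse_command(words):
    """words: recognised word sequence, e.g. ["kitchen", "fan", "on"].
    Returns (location, device, function) or None if the sequence is
    incomplete or invalid."""
    if (len(words) == 3 and words[0] in LOCATIONS
            and words[1] in DEVICES and words[2] in FUNCTIONS):
        return words[0], words[1], words[2]
    # partial sequence "Device + Function" allowed for unique devices
    if (len(words) == 2 and words[0] in UNIQUE_DEVICES
            and words[1] in FUNCTIONS):
        return UNIQUE_DEVICES[words[0]], words[0], words[1]
    return None

print(parse_command(["kitchen", "fan", "on"]))  # ('kitchen', 'fan', 'on')
print(parse_command(["television", "on"]))      # ('lounge', 'television', 'on')
```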
- An ACC is employed to create an isolated word recognition system. As described above, adaptation of the system at every level is its main feature. The system enables accurate word spotting, speaker adaptation and vocabulary expansion. The following section describes the experiments performed to evaluate the performance of the ACC on recognition, adaptation and vocabulary expansion.
- the experiments were carried out in two distinct phases. In the first phase, an ACC was trained to recognise 10 voice commands. In the second phase, the output space of the ACC was expanded to recognise 5 additional words as alternative voice commands.
- the speech data was recorded in a quiet room environment to obtain clean speech signals. For this case study, speech data was collected from 6 native New Zealand English speakers. The speech was sampled at 22.05 kHz and quantised to a 16 bit signed number. Each word was uttered 6 times with distinct pauses in between. Each of these words was then carefully manually segmented and labelled.
- speech data was prepared for the 10 basic voice commands.
- the speech data used in these experiments was obtained from two groups of speakers: group A1 (2 males and 2 females) and group B1 (1 male and 1 female).
- the data from each group was divided into two sets: training set A1, testing set A1, training set B1 and testing set B1.
- Training set A1 with 160 examples was obtained from 4 utterances of each word in group A1.
- the remaining 2 utterances were used in testing set A1, for a total of 80 examples.
- Training set B1 with 40 examples was obtained from 2 utterances of each word in group B1.
- Testing set B1 with 80 examples was obtained from the remaining 4 utterances of each word in group B1.
- An ACC was initialised with 100 nodes in the input layer and 10 nodes in the output layer (one for each word). In phase one of the experiments the ACC was trained on training set A1; 33 nodes were created during training. The trained ACC was then tested on its training set A1 and on both testing sets A1 and B1. The ACC performance on training set A1 was 100%. Table 7 shows the percentage of positive (true positive) accuracy and negative (true negative) accuracy of the ACC on testing sets A1 and B1.
- the second phase of the experiment was designed to expand the current ACC to recognise 5 new commands (words).
- the same procedure was applied to prepare training and testing sets.
- For the 5 additional words there were a total of 80 examples in training set A2, 40 examples in testing set A2, 20 examples in training set B2 and 40 examples in testing set B2.
- the structure of the existing ACC was expanded to accommodate 5 additional outputs.
- the expanded ACC was further trained on training set A2 of the additional words; 38 new nodes were added to the evolving layer of the expanded ACC during the training process.
- Table 8 illustrates the performance of the expanded ACC on the original and additional testing sets A1,2 and B1,2.
- the trained ACC was additionally trained on training set B1,2 (i.e. the new group of speakers); 52 additional nodes were added to the ACC to accommodate the variations in the new speakers. The performance of the adapted ACC on its training set B1,2 was 100%. The adapted ACC was then tested on the original training set A1 and on testing sets A1,2 and B1,2. Performance of the adapted ACC on training set A1 remained unchanged. Table 9 illustrates the performance of the ACC on testing sets A1,2 and B1,2 after adaptation to the new speakers; the ACC retained its recognition ability on old data while achieving excellent adaptation to new data. Table 9: Performance of the expanded ACC on both initial and additional commands after adaptation
- the recognition algorithm of the Evolving Classifier Function (ECF) was modified for the task of person verification; accordingly it is referred to as a verification algorithm.
- Figure 10 illustrates the overall process of the adaptive person verification system.
- Speech and face image information were used for the person verification task. Individual ECF modules were built for both the speech and the face image sub-networks. In addition, features obtained from speech and face images of clients were merged to form integrated input features. There are various strategies for combining multimodal sources of information; in this approach, speech and face image information were integrated at the feature level. There are 100 input features in a speech sample and 64 input features in a face image sample; these two sets of features were concatenated to form the integrated input features (a code sketch follows).
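A minimal sketch of this feature-level fusion:

```python
import numpy as np

def fuse_features(speech_vec, face_vec):
    """Feature-level fusion: concatenate a 100-element speech feature
    vector and a 64-element face feature vector into one 164-element
    integrated vector, as described above."""
    assert len(speech_vec) == 100 and len(face_vec) == 64
    return np.concatenate([speech_vec, face_vec])

integrated = fuse_features(np.zeros(100), np.zeros(64))
print(integrated.shape)   # (164,)
```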
- A person verification system is essentially a two-class decision task in which the system can make two types of error.
- the first error is a false acceptance, where an impostor is accepted.
- the second error is false rejection, where a true claimant is rejected.
- the false acceptance rate (FAR) and false rejection rate (FRR) are defined as

  FAR = I_A / I_T    (11)

  FRR = C_R / C_T    (12)

  where I_A is the number of impostors classified as true claimants, I_T is the total number of impostors presented, C_R is the number of true claimants classified as impostors and C_T is the total number of true claimants presented. The trade-off between these errors is adjusted using the acceptance threshold θ.
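A minimal sketch of computing FAR and FRR at a given acceptance threshold; the convention "accept if score ≥ θ" is an assumption:

```python
import numpy as np

def far_frr(scores, is_client, theta):
    """Compute FAR and FRR (Equations (11) and (12)) at acceptance
    threshold theta. scores: verification scores; is_client: True for
    true claimants, False for impostors."""
    scores = np.asarray(scores)
    is_client = np.asarray(is_client, dtype=bool)
    accepted = scores >= theta
    i_t = np.sum(~is_client)                 # impostors presented
    c_t = np.sum(is_client)                  # true claimants presented
    i_a = np.sum(accepted & ~is_client)      # impostors accepted
    c_r = np.sum(~accepted & is_client)      # claimants rejected
    return i_a / i_t, c_r / c_t              # FAR, FRR
```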
- An adaptive speaker verification module. An ECF neural network engine was built based on the speech training dataset. Each speaker was modelled by allocating rule nodes during the training session; the number of rule nodes assigned for each speaker is determined by the maximum influence field. Figure 11 illustrates the performance of the ECF on the testing dataset. As shown in Figure 11, the smaller the maximum influence field, the more rule nodes are allocated for each client. This leads to a high correct acceptance rate of 92% and small FRR and FAR errors of 1%.
- Figure 11 shows ECF performance on a speaker verification task.
- Graph (A) shows the number of rule nodes created versus the various influence field values.
- Graph (B) shows the correct acceptance rate versus acceptance thresholds.
- Graph (C) shows FAR versus acceptance thresholds, and graph (D) shows FRR versus acceptance thresholds.
- Figure 12 shows ECF performance on a face image verification task.
- Graph (A) shows the number of rule nodes created versus the various influence field values.
- Graph (B) shows the correct acceptance rate versus acceptance thresholds.
- Graph (C) shows FAR versus acceptance thresholds and graph (D) shows FRR versus acceptance thresholds.
- A person verification module was also built based on integrated voice and face features.
- the training dataset for this experiment was obtained by concatenating the speech training dataset and the face image training dataset. Each integrated sample has 164 input features.
- An ECF model was built using the integrated training dataset and tested on the integrated testing datasets A and B. The test results are shown in Figure 13.
- Graph (A) shows the number of rule nodes created versus various influence field values.
- Graph (B) shows the correct acceptance rate versus acceptance thresholds.
- Graph (C) shows FAR versus acceptance thresholds and graph (D) shows FRR versus acceptance thresholds.
- biometrics (e.g., fingerprints, face images, voice, etc.) can be used for person verification.
- by using multiple biometric modalities, not only can improved security be achieved, but the verification system can also be used by more people.
- the system is composed using multiple biometric modalities. Two biometric modalities, speech and face image, are used for the person verification task. The methodologies for feature extraction, feature integration and modelling are described in the sections above.
- Figure 14 shows the graphical user interface of the multiple-biometric-modality person verification system implemented on the iPAC.
- the Evolving Classifier Function used as the classifier facilitates:
- Adaptation: this feature allows the system to adapt to new samples (face images or speech) online.
- Expansion: new users may be added to the system through expansion of the number of output classes.
- Deletion: existing users may be removed from the system through deletion of output classes.
- An evolving connectionist system can be used to detect faces by sub-sampling different regions of the image to a standard-sized sub-image and then passing it through a neural network.
- Biometrics deals with identifying individuals with the help of their biological data. Fingerprint scanning is the most common of the biometric methods available today and can be used as another biological trait in biometric systems based on multiple modalities.
- fingerprints can be divided into three major pattern types, arches, loops and whorls, depicted in Figure 16. Loops are the most common fingerprint pattern. These major pattern types can appear in different variations: for example, plain or tented (narrow) arches, right or left loops, and spiral or concentric circles as whorls. Different pattern types can also be combined in one fingerprint, e.g. a double loop, or an arch with a loop.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| NZ52908103 | 2003-10-22 | ||
| NZ529081 | 2003-10-22 |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| WO2005038774A1 | 2005-04-28 |