
CN111599376A - Sound event detection method based on cavity convolution cyclic neural network - Google Patents

Sound event detection method based on cavity convolution cyclic neural network

Info

Publication number
CN111599376A
CN111599376A (application CN202010483079.3A; granted publication CN111599376B)
Authority
CN
China
Prior art keywords
convolution
neural network
void
audio
cyclic neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010483079.3A
Other languages
Chinese (zh)
Other versions
CN111599376B (en)
Inventor
李艳雄
刘名乐
王武城
江钟杰
陈昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT
Priority to CN202010483079.3A
Publication of CN111599376A
Application granted
Publication of CN111599376B
Legal status: Expired - Fee Related

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/27: Techniques characterised by the analysis technique
    • G10L 25/30: Techniques characterised by the analysis technique using neural networks
    • G10L 15/00: Speech recognition
    • G10L 15/06: Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063: Training
    • G10L 15/08: Speech classification or search
    • G10L 15/16: Speech classification or search using artificial neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a sound event detection method based on a dilated convolution recurrent neural network (dilated convolution is rendered elsewhere in this translation as "cavity" or "hole" convolution). The method comprises the following steps: extracting the logarithmic Mel spectrum features of each sample; building a dilated convolution recurrent neural network comprising a convolutional neural network, a bidirectional long short-term memory (BiLSTM) network and a Sigmoid output layer; training the network with the logarithmic Mel spectrum features extracted from the training samples as input; and recognizing the sound events in the test samples with the trained network to obtain the sound event detection result. The method introduces dilated convolution into the convolutional neural network and combines the convolutional and recurrent networks into a dilated convolution recurrent neural network. Compared with a conventional convolutional neural network of the same parameter scale, the dilated convolution recurrent neural network has a larger receptive field, exploits the context of an audio sample more effectively, and yields better sound event detection results.

Description

Sound event detection method based on dilated convolution recurrent neural network
Technical Field
The invention relates to the technical fields of audio signal processing and pattern recognition, and in particular to a sound event detection method based on a dilated convolution recurrent neural network.
Background
The goal of Sound Event Detection (SED) is to accurately recognize the various target sound events present in an audio recording. Sound event detection applies to many monitoring-related areas, such as traffic monitoring, intelligent conference rooms, automated assisted driving, and multimedia analysis. Classifiers for sound event detection divide into deep and shallow models. Deep models mainly include convolutional recurrent neural networks, recurrent neural networks and convolutional neural networks. Shallow models mainly include random regression forests, support vector machines, hidden Markov models and Gaussian mixture models.
The existing mainstream sound event detection methods based on convolutional neural networks have the following defect: to enlarge the receptive field and capture longer context in the input audio features, the number of convolutional layers must grow, which makes the network parameter scale very large and easily causes overfitting (reduced generalization ability).
In making the present invention, the inventors found at least the following: at the same parameter scale, a convolutional recurrent neural network with dilated convolution has a larger receptive field and can capture longer context in the input audio features. Conversely, to obtain a receptive field of the same size, a dilated convolutional recurrent neural network needs far fewer layers than one using conventional convolution, effectively avoiding the overfitting caused by large-scale network parameters, as the sketch below illustrates. A sound event detection method based on a dilated convolution recurrent neural network is therefore urgently needed to improve detection performance.
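For intuition, the receptive-field gain can be worked out directly: with stride 1, each convolutional layer of kernel size k and dilation rate d widens the receptive field by (k-1)·d. A minimal sketch in Python (the 3×3 kernels and stride 1 are illustrative assumptions, not fixed by the patent):

```python
def receptive_field(kernel_sizes, dilations):
    """Receptive field (in frames) of a stack of stride-1 convolutions:
    each layer of kernel size k and dilation d adds (k - 1) * d."""
    rf = 1
    for k, d in zip(kernel_sizes, dilations):
        rf += (k - 1) * d
    return rf

# Three ordinary 3x3 layers (dilation 1) vs. the same stack dilated 2-4-8:
print(receptive_field([3, 3, 3], [1, 1, 1]))  # 7 frames of context
print(receptive_field([3, 3, 3], [2, 4, 8]))  # 29 frames, same parameter count
```

The dilated stack sees roughly four times as much temporal context from exactly the same number of weights, which is the effect the method exploits.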
Disclosure of Invention
The invention aims to overcome the above defect in the prior art by providing a sound event detection method based on a dilated convolution recurrent neural network, comprising the following steps. First, logarithmic Mel spectrum feature extraction: the audio samples are pre-emphasized, framed and windowed, and the logarithmic Mel spectrum features of each audio frame are extracted. Second, a dilated convolution recurrent neural network is built, comprising a convolutional neural network, a bidirectional long short-term memory (BiLSTM) network and a Sigmoid output layer. Third, the dilated convolution recurrent neural network is trained with the logarithmic Mel spectrum features extracted from the training samples as input. Fourth, sound event detection: the trained network recognizes the sound events in the test samples to obtain the sound event detection result.
The purpose of the invention can be achieved by adopting the following technical scheme:
A sound event detection method based on a dilated convolution recurrent neural network comprises the following steps:
S1, extracting logarithmic Mel spectrum features: pre-emphasize, frame and window the audio samples, then extract the logarithmic Mel spectrum of each audio frame;
S2, building a dilated convolution recurrent neural network comprising a convolutional neural network, a bidirectional long short-term memory (BiLSTM) network and a Sigmoid output layer;
S3, training the dilated convolution recurrent neural network with the logarithmic Mel spectrum features extracted from the training samples as input;
S4, sound event detection: recognizing the sound events in the test samples with the trained dilated convolution recurrent neural network to obtain the sound event detection result.
Further, the process of extracting the logarithmic Mel spectrum features in step S1 is as follows:
S1.1, pre-emphasis: read in the audio samples and pre-emphasize them with a digital filter whose transfer function is H(z) = 1 - αz^(-1), where α is the filter coefficient with 0.9 ≤ α ≤ 1;
S1.2, framing and windowing: divide the read-in audio samples into frames of length 0.02 s with a frame shift of 0.01 s, giving the frame signals x'_t(n); the window function is a Hamming window ω(n), and multiplying each frame signal x'_t(n) by the Hamming window ω(n) gives the windowed t-th frame audio signal x_t(n);
S1.3, extracting the logarithmic spectrum: apply a discrete Fourier transform to the t-th frame audio signal x_t(n) to obtain the linear spectrum X_t(k), pass the linear spectrum X_t(k) through a Mel-frequency filter bank to obtain the Mel spectrum, and finally take the logarithm to obtain the logarithmic spectrum S_t(m);
S1.4, apply step S1.3 to every audio frame to obtain the logarithmic spectra S_t(m) of all audio frames, then arrange the logarithmic spectra of all audio frames into a feature matrix in frame order, where the rows of the feature matrix follow the frame order and the columns are the feature dimensions.
Furthermore, the convolutional neural network consists of one dilated convolution module or two or more cascaded dilated convolution modules, where each dilated convolution module comprises a dilated convolution unit, a pooling unit, an excitation unit and a batch normalization unit.
The dilated convolution unit is expressed as:

x_i^(l) = f(k_i * x_i^(l-1) + b_i)

where x_i^(l) denotes the feature vector of the i-th audio sample at layer l, f(·) denotes the activation function, * denotes dilated convolution, and k_i and b_i respectively denote the convolution kernel parameters and the bias term convolved with the feature vector of the i-th audio sample.
The pooling unit uses max pooling; the excitation function in the excitation unit is the rectified linear unit (ReLU), used to add non-linearity between the layers of the network.
The batch normalization unit counters exploding gradients and accelerates the convergence of the network. Its computation comprises:

approximate whitening: x̂^(i) = (x^(i) - E(x^(i))) / sqrt(Var(x^(i)))

scale-and-shift reconstruction: y^(i) = γ^(i) x̂^(i) + β^(i)

where E(x^(i)) denotes the mean of the feature vector x^(i) of the i-th audio sample, sqrt(Var(x^(i))) denotes its standard deviation, x̂^(i) is the result of the approximate whitening of x^(i), y^(i) denotes the reconstructed feature vector, and γ^(i) and β^(i) denote the adjustable reconstruction parameters.
Further, the bidirectional long short-term memory network makes full use of context information and maps the feature representation learned by the convolutional neural network to the sample label space.
Further, the Sigmoid output layer uses a cross-entropy loss function:

E = -(1/N) Σ_{i=1}^{N} [ l^(i) log l̂^(i) + (1 - l^(i)) log(1 - l̂^(i)) ]

where N denotes the number of samples, l^(i) denotes the true label of the i-th audio sample, and l̂^(i) denotes the predicted label of the i-th audio sample.
Further, the specific process of training the dilated convolution recurrent neural network in step S3 is as follows:
input the logarithmic Mel spectrum features extracted from the training samples of different audio databases into the dilated convolution recurrent neural network, and adjust the number of dilated convolution modules and the dilation rates respectively;
when the number of dilated convolution modules is 1, set two groups of dilation rates: one group with value 1, i.e. all convolutional layers in the module set to dilation rate 1, and the other with value 2, i.e. all convolutional layers set to 2;
when the number of modules is 2, set two groups: 1-1, i.e. all convolutional layers set to 1, and 2-4, i.e. the convolutional layers in the first and second modules set to 2 and 4 respectively;
when the number of modules is 3, set two groups: 1-1-1, i.e. all convolutional layers set to 1, and 2-4-8, i.e. the convolutional layers in the first, second and third modules set to 2, 4 and 8 respectively;
when the number of modules is 4, set two groups: 1-1-1-1, i.e. all convolutional layers set to 1, and 2-4-8-16, i.e. the convolutional layers in the first, second, third and fourth modules set to 2, 4, 8 and 16 respectively;
when the number of modules is 5, set two groups: 1-1-1-1-1, i.e. all convolutional layers set to 1, and 2-4-8-16-32, i.e. the convolutional layers in the first, second, third, fourth and fifth modules set to 2, 4, 8, 16 and 32 respectively.
Further, the sound event detection process in step S4 is as follows:
S4.1, extract the audio features of each test data set and recognize each audio frame with the trained dilated convolution recurrent neural network;
S4.2, splice the recognition results of the audio frames in temporal order to obtain the recognition results of the audio segments, then compute the sound event detection accuracy at both the audio-frame level and the audio-segment level.
Compared with the prior art, the invention has the following advantages and effects:
while capturing context of the same length in the input audio features, the disclosed sound event detection method based on a dilated convolution recurrent neural network achieves higher detection accuracy, reduces the parameter scale of the neural network, avoids overfitting, and improves the generalization ability of the network.
Drawings
FIG. 1 is a flowchart of the sound event detection method based on a dilated convolution recurrent neural network disclosed in an embodiment of the present invention;
FIG. 2 is a structural diagram of the dilated convolution recurrent neural network disclosed in an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Examples
FIG. 1 is a flow diagram of one embodiment of the sound event detection method based on a dilated convolution recurrent neural network. The method comprises the following steps:
S1, extracting logarithmic Mel spectrum features: pre-emphasize, frame and window the audio samples, then extract the logarithmic Mel spectrum of each audio frame.
In this embodiment, extracting the logarithmic Mel spectrum features in step S1 specifically comprises the following steps:
S1.1, pre-emphasis: read in the audio samples and pre-emphasize them with a digital filter whose transfer function is H(z) = 1 - αz^(-1), where α is the filter coefficient with 0.9 ≤ α ≤ 1;
S1.2, framing and windowing: divide the read-in audio samples into frames of length 0.02 s with a frame shift of 0.01 s, giving the frame signals x'_t(n); the window function is a Hamming window ω(n), and multiplying each frame signal x'_t(n) by the Hamming window ω(n) gives the windowed t-th frame audio signal x_t(n);
S1.3, extracting the logarithmic spectrum: apply a discrete Fourier transform to the t-th frame audio signal x_t(n) to obtain the linear spectrum X_t(k), pass the linear spectrum X_t(k) through a Mel-frequency filter bank to obtain the Mel spectrum, and finally take the logarithm to obtain the logarithmic spectrum S_t(m);
S1.4, apply step S1.3 to every audio frame to obtain the logarithmic spectra S_t(m) of all audio frames, then arrange them into a feature matrix in frame order, where the rows of the feature matrix follow the frame order and the columns are the feature dimensions. A sketch of this pipeline follows.
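A minimal sketch of the feature-extraction pipeline in Python, assuming the librosa library is available; the pre-emphasis coefficient α = 0.97 (within the stated 0.9 to 1 range) and the 40 Mel bands are illustrative assumptions not fixed by the text:

```python
import numpy as np
import librosa

def log_mel_features(path, alpha=0.97, n_mels=40):
    """Return a feature matrix with rows = frames, columns = feature dims."""
    y, sr = librosa.load(path, sr=None)
    # S1.1 Pre-emphasis with H(z) = 1 - alpha * z^(-1)
    y = np.append(y[0], y[1:] - alpha * y[:-1])
    # S1.2 Framing and windowing: 0.02 s frames, 0.01 s shift, Hamming window
    n_fft = int(0.02 * sr)
    hop = int(0.01 * sr)
    # S1.3 DFT -> Mel filter bank -> logarithm
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop,
        window="hamming", n_mels=n_mels)
    log_mel = np.log(mel + 1e-8)  # small offset avoids log(0)
    # S1.4 Arrange frames as rows and feature dimensions as columns
    return log_mel.T
```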
S2, building the dilated convolution recurrent neural network, which comprises a convolutional neural network, a bidirectional long short-term memory (BiLSTM) network and a Sigmoid output layer.
The dilated convolution recurrent neural network cascades the convolutional neural network, the bidirectional long short-term memory network and the Sigmoid output layer, as shown in FIG. 2.
The convolutional neural network consists of one dilated convolution module or two or more cascaded dilated convolution modules, where each dilated convolution module comprises a dilated convolution unit, a pooling unit, an excitation unit and a batch normalization unit;
(1) The dilated convolution unit is expressed as:

x_i^(l) = f(k_i * x_i^(l-1) + b_i)

where x_i^(l) denotes the feature vector of the i-th audio sample at layer l, f(·) denotes the activation function, * denotes dilated convolution, and k_i and b_i respectively denote the convolution kernel parameters and the bias term convolved with the feature vector of the i-th audio sample.
(2) Pooling unit and excitation unit:
the pooling unit uses max pooling, and the excitation function in the excitation unit is the Rectified Linear Unit (ReLU), used to add non-linearity between the layers of the network.
(3) Batch normalization unit:
the batch normalization unit mainly counters exploding gradients and accelerates the convergence of the network; its main computation comprises:

approximate whitening: x̂^(i) = (x^(i) - E(x^(i))) / sqrt(Var(x^(i)))

scale-and-shift reconstruction: y^(i) = γ^(i) x̂^(i) + β^(i)

where E(x^(i)) denotes the mean of the feature vector x^(i) of the i-th audio sample, sqrt(Var(x^(i))) denotes its standard deviation, x̂^(i) is the result of the approximate whitening of x^(i), y^(i) denotes the reconstructed feature vector, and γ^(i) and β^(i) denote the adjustable reconstruction parameters. A PyTorch sketch of one module follows.
The bidirectional long short-term memory network makes full use of context information and maps the feature representation learned by the convolutional neural network to the sample label space.
The Sigmoid output layer uses a cross-entropy loss function:

E = -(1/N) Σ_{i=1}^{N} [ l^(i) log l̂^(i) + (1 - l^(i)) log(1 - l̂^(i)) ]

where N denotes the number of samples, l^(i) denotes the true label of the i-th audio sample, and l̂^(i) denotes the predicted label of the i-th audio sample. A sketch of the assembled network follows.
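Assembled end to end, the network might look as follows. This sketch reuses the DilatedConvModule above; the channel widths, the BiLSTM hidden size of 64 and the number of event classes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DCRNN(nn.Module):
    """Dilated convolution recurrent network: CNN -> BiLSTM -> Sigmoid."""
    def __init__(self, n_mels=40, n_classes=10, dilations=(2, 4, 8)):
        super().__init__()
        chans = [1, 32, 64, 128, 128, 128][: len(dilations) + 1]
        self.cnn = nn.Sequential(*[
            DilatedConvModule(chans[i], chans[i + 1], d)
            for i, d in enumerate(dilations)])
        freq_out = n_mels // 2 ** len(dilations)  # frequency halves per module
        self.rnn = nn.LSTM(chans[-1] * freq_out, 64,
                           batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * 64, n_classes)   # Sigmoid output layer

    def forward(self, x):                  # x: (batch, 1, frames, n_mels)
        h = self.cnn(x)                    # (batch, C, frames, freq_out)
        h = h.permute(0, 2, 1, 3).flatten(2)  # (batch, frames, C * freq_out)
        h, _ = self.rnn(h)                 # BiLSTM exploits context both ways
        return torch.sigmoid(self.out(h))  # frame-wise event probabilities

# The cross-entropy loss above is binary cross-entropy over the Sigmoid outputs:
criterion = nn.BCELoss()  # loss = criterion(model(x), frame_labels)
```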
S3, training the dilated convolution recurrent neural network: train the network with the logarithmic Mel spectrum features extracted from the training samples as input.
In this embodiment, the specific training process is as follows (the configurations are sketched in code after this list):
input the logarithmic Mel spectrum features extracted from the training samples of different audio databases into the dilated convolution recurrent neural network, and adjust the number of dilated convolution modules and the dilation rates respectively.
When the number of dilated convolution modules is 1, two groups of dilation rates are set: one group with value 1, i.e. all convolutional layers in the module set to dilation rate 1; the other with value 2, i.e. all convolutional layers set to 2.
When the number of modules is 2, two groups are set: 1-1, i.e. all convolutional layers set to 1; and 2-4, i.e. the convolutional layers in the first and second modules set to 2 and 4 respectively.
When the number of modules is 3, two groups are set: 1-1-1, i.e. all convolutional layers set to 1; and 2-4-8, i.e. the convolutional layers in the first, second and third modules set to 2, 4 and 8 respectively.
When the number of modules is 4, two groups are set: 1-1-1-1, i.e. all convolutional layers set to 1; and 2-4-8-16, i.e. the convolutional layers in the first, second, third and fourth modules set to 2, 4, 8 and 16 respectively.
When the number of modules is 5, two groups are set: 1-1-1-1-1, i.e. all convolutional layers set to 1; and 2-4-8-16-32, i.e. the convolutional layers in the first, second, third, fourth and fifth modules set to 2, 4, 8, 16 and 32 respectively.
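The ten configurations just listed can be written down compactly; a sketch reusing the DCRNN class from the previous snippet:

```python
# Dilation-rate schedules compared at each depth: a baseline of all 1s
# (ordinary convolution) versus exponentially growing dilation rates.
DILATION_CONFIGS = {
    1: [(1,), (2,)],
    2: [(1, 1), (2, 4)],
    3: [(1, 1, 1), (2, 4, 8)],
    4: [(1, 1, 1, 1), (2, 4, 8, 16)],
    5: [(1, 1, 1, 1, 1), (2, 4, 8, 16, 32)],
}

for n_modules, schedules in DILATION_CONFIGS.items():
    for dilations in schedules:
        model = DCRNN(dilations=dilations)
        # ...train on the log-Mel features of each database and compare
        # frame- and segment-level detection accuracy across schedules
```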
S4, sound event detection: recognize the sound events in the test samples with the trained dilated convolution recurrent neural network to obtain the sound event detection result.
In this embodiment, sound event detection specifically comprises the following steps, sketched in code below:
S4.1, extract the audio features of each test data set and recognize each audio frame with the trained dilated convolution recurrent neural network;
S4.2, splice the recognition results of the audio frames in temporal order to obtain the recognition results of the audio segments, then compute the sound event detection accuracy at both the audio-frame level and the audio-segment level.
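A sketch of this frame-to-segment procedure; the 0.5 decision threshold and the 1 s segment length (100 frames at the 0.01 s frame shift) are illustrative assumptions:

```python
import numpy as np
import torch

def detect_events(model, log_mel, threshold=0.5, frames_per_segment=100):
    """S4.1: frame-wise recognition; S4.2: stitch frames into segments."""
    x = torch.from_numpy(log_mel).float()[None, None]  # (1, 1, T, n_mels)
    model.eval()
    with torch.no_grad():
        frame_probs = model(x)[0].numpy()              # (T, n_classes)
    frame_preds = (frame_probs >= threshold).astype(int)
    # A segment carries an event label if any frame inside it does.
    n_segments = len(frame_preds) // frames_per_segment
    segment_preds = np.stack([
        frame_preds[s * frames_per_segment:(s + 1) * frames_per_segment].max(0)
        for s in range(n_segments)])
    return frame_preds, segment_preds
```

Frame-level and segment-level accuracy then follow by comparing these predictions against the ground-truth annotations at the corresponding resolution.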
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (7)

1. A sound event detection method based on a dilated convolution recurrent neural network, characterized by comprising the following steps:
S1, extracting logarithmic Mel spectrum features: pre-emphasizing, framing and windowing the audio samples, then extracting the logarithmic Mel spectrum of each audio frame;
S2, building a dilated convolution recurrent neural network comprising a convolutional neural network, a bidirectional long short-term memory network and a Sigmoid output layer;
S3, training the dilated convolution recurrent neural network with the logarithmic Mel spectrum features extracted from the training samples as input;
S4, sound event detection: recognizing the sound events in the test samples with the trained dilated convolution recurrent neural network to obtain the sound event detection result.
2. The sound event detection method based on a dilated convolution recurrent neural network according to claim 1, characterized in that the process of extracting the logarithmic Mel spectrum features in step S1 is as follows:
S1.1, pre-emphasis: reading in the audio samples and pre-emphasizing them with a digital filter whose transfer function is H(z) = 1 - αz^(-1), where α is the filter coefficient with 0.9 ≤ α ≤ 1;
S1.2, framing and windowing: dividing the read-in audio samples into frames of length 0.02 s with a frame shift of 0.01 s to obtain the frame signals x'_t(n), the window function being a Hamming window ω(n), and multiplying each frame signal x'_t(n) by the Hamming window ω(n) to obtain the windowed t-th frame audio signal x_t(n);
S1.3, extracting the logarithmic spectrum: applying a discrete Fourier transform to the t-th frame audio signal x_t(n) to obtain the linear spectrum X_t(k), passing the linear spectrum X_t(k) through a Mel-frequency filter bank to obtain the Mel spectrum, and finally taking the logarithm to obtain the logarithmic spectrum S_t(m);
S1.4, applying step S1.3 to every audio frame to obtain the logarithmic spectra S_t(m) of all audio frames, and finally arranging the logarithmic spectra S_t(m) of all audio frames into a feature matrix in frame order, the rows of the feature matrix following the frame order and the columns being the feature dimensions.
3. The sound event detection method based on a dilated convolution recurrent neural network according to claim 1, characterized in that the convolutional neural network consists of one dilated convolution module or two or more cascaded dilated convolution modules, each dilated convolution module comprising a dilated convolution unit, a pooling unit, an excitation unit and a batch normalization unit,
the dilated convolution unit being expressed as:

x_i^(l) = f(k_i * x_i^(l-1) + b_i)

where x_i^(l) denotes the feature vector of the i-th audio sample at layer l, f(·) denotes the activation function, * denotes dilated convolution, and k_i and b_i respectively denote the convolution kernel parameters and the bias term convolved with the feature vector of the i-th audio sample;
the pooling unit using max pooling; the excitation function in the excitation unit being the rectified linear unit, used to add non-linearity between the layers of the network;
the batch normalization unit countering exploding gradients and accelerating the convergence of the network, its computation comprising:

approximate whitening: x̂^(i) = (x^(i) - E(x^(i))) / sqrt(Var(x^(i)))

scale-and-shift reconstruction: y^(i) = γ^(i) x̂^(i) + β^(i)

where E(x^(i)) denotes the mean of the feature vector x^(i) of the i-th audio sample, sqrt(Var(x^(i))) denotes its standard deviation, x̂^(i) is the result of the approximate whitening of x^(i), y^(i) denotes the reconstructed feature vector, and γ^(i) and β^(i) denote the adjustable reconstruction parameters.
4. The sound event detection method based on a dilated convolution recurrent neural network according to claim 1, characterized in that the bidirectional long short-term memory network makes full use of context information and maps the feature representation learned by the convolutional neural network to the sample label space.
5. The sound event detection method based on a dilated convolution recurrent neural network according to claim 1, characterized in that the Sigmoid output layer uses a cross-entropy loss function:

E = -(1/N) Σ_{i=1}^{N} [ l^(i) log l̂^(i) + (1 - l^(i)) log(1 - l̂^(i)) ]

where N denotes the number of samples, l^(i) denotes the true label of the i-th audio sample, and l̂^(i) denotes the predicted label of the i-th audio sample.
6. The sound event detection method based on a dilated convolution recurrent neural network according to claim 1, characterized in that the specific process of training the dilated convolution recurrent neural network in step S3 is as follows:
inputting the logarithmic Mel spectrum features extracted from the training samples of different audio databases into the dilated convolution recurrent neural network, and adjusting the number of dilated convolution modules and the dilation rates respectively;
when the number of dilated convolution modules is 1, setting two groups of dilation rates: one group with value 1, i.e. all convolutional layers in the module set to dilation rate 1, and the other with value 2, i.e. all convolutional layers set to 2;
when the number of modules is 2, setting two groups: 1-1, i.e. all convolutional layers set to 1, and 2-4, i.e. the convolutional layers in the first and second modules set to 2 and 4 respectively;
when the number of modules is 3, setting two groups: 1-1-1, i.e. all convolutional layers set to 1, and 2-4-8, i.e. the convolutional layers in the first, second and third modules set to 2, 4 and 8 respectively;
when the number of modules is 4, setting two groups: 1-1-1-1, i.e. all convolutional layers set to 1, and 2-4-8-16, i.e. the convolutional layers in the first, second, third and fourth modules set to 2, 4, 8 and 16 respectively;
when the number of modules is 5, setting two groups: 1-1-1-1-1, i.e. all convolutional layers set to 1, and 2-4-8-16-32, i.e. the convolutional layers in the first, second, third, fourth and fifth modules set to 2, 4, 8, 16 and 32 respectively.
7. The sound event detection method based on a dilated convolution recurrent neural network according to claim 1, characterized in that the sound event detection in step S4 is as follows:
S4.1, extracting the audio features of each test data set and recognizing each audio frame with the trained dilated convolution recurrent neural network;
S4.2, splicing the recognition results of the audio frames in temporal order to obtain the recognition results of the audio segments, then computing the sound event detection accuracy at both the audio-frame level and the audio-segment level.
CN202010483079.3A · Priority date 2020-06-01 · Filing date 2020-06-01 · Sound event detection method based on cavity convolution cyclic neural network · Expired - Fee Related · Granted as CN111599376B

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010483079.3A CN111599376B (en) 2020-06-01 2020-06-01 Sound event detection method based on cavity convolution cyclic neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010483079.3A CN111599376B (en) 2020-06-01 2020-06-01 Sound event detection method based on cavity convolution cyclic neural network

Publications (2)

Publication Number · Publication Date
CN111599376A · 2020-08-28
CN111599376B · 2023-02-14

Family

ID=72192486

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010483079.3A Expired - Fee Related CN111599376B (en) 2020-06-01 2020-06-01 Sound event detection method based on cavity convolution cyclic neural network

Country Status (1)

Country Link
CN (1) CN111599376B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112133326A (en) * 2020-09-08 2020-12-25 东南大学 Gunshot data amplification and detection method based on antagonistic neural network
CN112529152A (en) * 2020-12-03 2021-03-19 开放智能机器(上海)有限公司 System and method for detecting watermelon maturity based on artificial intelligence
CN112951242A (en) * 2021-02-02 2021-06-11 华南理工大学 Phrase voice speaker matching method based on twin neural network
CN113658607A (en) * 2021-07-23 2021-11-16 南京理工大学 Ambient sound classification method based on data augmentation and convolutional recurrent neural network
CN113990303A (en) * 2021-10-08 2022-01-28 华南理工大学 Environmental sound identification method based on multi-resolution cavity depth separable convolution network
CN116519049A (en) * 2023-04-12 2023-08-01 青岛派科森光电技术股份有限公司 Distributed optical cable detection device and method for tunnel
CN119335478A (en) * 2024-09-29 2025-01-21 武汉大学 A convolutional recurrent neural network multi-sound source detection and localization method and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160367179A1 (en) * 2015-06-16 2016-12-22 Upchurch & Associates Inc. Sound association test
CN109065030A (en) * 2018-08-01 2018-12-21 上海大学 Ambient sound recognition methods and system based on convolutional neural networks
CN109767785A (en) * 2019-03-06 2019-05-17 河北工业大学 Environmental noise recognition and classification method based on convolutional neural network
CN110223715A (en) * 2019-05-07 2019-09-10 华南理工大学 It is a kind of based on sound event detection old solitary people man in activity estimation method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160367179A1 (en) * 2015-06-16 2016-12-22 Upchurch & Associates Inc. Sound association test
CN109065030A (en) * 2018-08-01 2018-12-21 上海大学 Ambient sound recognition methods and system based on convolutional neural networks
CN109767785A (en) * 2019-03-06 2019-05-17 河北工业大学 Environmental noise recognition and classification method based on convolutional neural network
CN110223715A (en) * 2019-05-07 2019-09-10 华南理工大学 It is a kind of based on sound event detection old solitary people man in activity estimation method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WEIRAN YAN: "AUDIO-BASED AUTOMATIC MATING SUCCESS PREDICTION OF GIANT PANDAS", 《ARXIV》 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112133326A (en) * 2020-09-08 2020-12-25 东南大学 Gunshot data amplification and detection method based on antagonistic neural network
CN112529152A (en) * 2020-12-03 2021-03-19 开放智能机器(上海)有限公司 System and method for detecting watermelon maturity based on artificial intelligence
CN112951242A (en) * 2021-02-02 2021-06-11 华南理工大学 Phrase voice speaker matching method based on twin neural network
CN113658607A (en) * 2021-07-23 2021-11-16 南京理工大学 Ambient sound classification method based on data augmentation and convolutional recurrent neural network
CN113990303A (en) * 2021-10-08 2022-01-28 华南理工大学 Environmental sound identification method based on multi-resolution cavity depth separable convolution network
CN113990303B (en) * 2021-10-08 2024-04-12 华南理工大学 Environmental sound identification method based on multi-resolution cavity depth separable convolution network
CN116519049A (en) * 2023-04-12 2023-08-01 青岛派科森光电技术股份有限公司 Distributed optical cable detection device and method for tunnel
CN119335478A (en) * 2024-09-29 2025-01-21 武汉大学 A convolutional recurrent neural network multi-sound source detection and localization method and system

Also Published As

Publication number Publication date
CN111599376B (en) 2023-02-14

Similar Documents

Publication Publication Date Title
CN111599376A (en) Sound event detection method based on cavity convolution cyclic neural network
CN111986699B (en) Sound event detection method based on full convolution network
Huang et al. Development and validation of a deep learning algorithm for the recognition of plant disease
CN111027452B (en) Microseismic signal arrival time and phase identification method and system based on deep neural network
CN110033756B (en) Language identification method and device, electronic equipment and storage medium
CN114863937B (en) Hybrid bird song recognition method based on deep transfer learning and XGBoost
CN101493890B (en) Dynamic vision caution region extracting method based on characteristic
CN115147641B (en) Video classification method based on knowledge distillation and multi-mode fusion
CN109859771B (en) Sound scene clustering method for jointly optimizing deep layer transformation characteristics and clustering process
CN117877516A (en) Sound event detection method based on cross-model two-stage training
CN116861303A (en) Digital twin multisource information fusion diagnosis method for transformer substation
CN118506846B (en) Hard disk testing device, system and method
CN115761888A (en) Tower crane operator abnormal behavior detection method based on NL-C3D model
CN109002529A (en) Audio search method and device
CN113990303B (en) Environmental sound identification method based on multi-resolution cavity depth separable convolution network
CN116548979A (en) Physiological signal segment analysis method based on time-frequency information fusion attention
CN115497507A (en) Cross-library speech emotion recognition method and device based on progressive migration neural network
CN112951242B (en) Phrase voice speaker matching method based on twin neural network
Chatterjee et al. South Asian Sounds: Audio Classification
CN115376555B (en) Method and device for rapid identification of explosion source information based on acoustic characteristics
CN119513717A (en) Drone recognition method based on ResNet deep learning network
CN113345427A (en) Residual error network-based environmental sound identification system and method
CN119360891A (en) An abnormal sound detection method for belt conveyor fault diagnosis
CN119068912A (en) Industrial diagnosis method and related equipment based on voiceprint recognition
CN119541545B (en) Voice identification method based on distinguishing characterization loss and attention convolution network

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant
CF01: Termination of patent right due to non-payment of annual fee
Granted publication date: 20230214