CN117765936A - Method for generating an underwater sound target recognition model and underwater sound target recognition method
- Publication number: CN117765936A
- Application number: CN202311817696.2A
- Authority: CN (China)
- Prior art keywords: layer, data, deep learning, binarized, learning model
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Abstract
The present disclosure provides a method of generating an underwater sound target recognition model. The method comprises the following steps: training a first deep learning model based on source domain training data, the source domain training data comprising sound data, and the first deep learning model comprising a first binarized deep convolutional neural network and a first classifier, the first classifier comprising a fully connected layer and a classification layer; migrating model parameters of the first binarized deep convolutional neural network of the trained first deep learning model to a second binarized deep convolutional neural network of a second deep learning model, the first binarized deep convolutional neural network and the second binarized deep convolutional neural network having the same structure, the second deep learning model comprising a second classifier, and the second classifier comprising a fully connected layer, a recurrent neural network, and a classification layer; and training the second deep learning model based on target domain training data, the target domain training data comprising underwater target sound data, wherein in the process of training the second deep learning model, model parameters of the second binarized deep convolutional neural network are fixed and model parameters of the second classifier are updated. Embodiments of the present disclosure also provide an underwater sound target recognition method.
Description
Technical Field
The present disclosure relates to a method of generating an underwater sound target recognition model and to an underwater sound target recognition method.
Background
Underwater sound target recognition technology is widely applicable in fields such as underwater defense, resource survey, ship recognition, and fish activity detection, and plays an important role in marine defense and economic construction. Artificial intelligence algorithms such as deep learning are widely used in image recognition and voiceprint recognition, offering strong adaptability and high accuracy. However, the computing chips of surface and underwater platforms (e.g., smart buoys) typically have limited computing power and memory, and underwater platforms performing underwater sound target recognition tasks usually require edge computing; they therefore need an underwater sound target recognition model that is lightweight and suited to low-power devices. In addition, underwater sound target recognition faces a "small sample" problem: the sample size of underwater sound target data is small, so an underwater sound target recognition model trained with a deep learning algorithm is prone to overfitting, which also poses challenges for training and generating such a model.
To address these challenges, there is a need for an improved underwater sound target recognition technique based on deep learning.
Disclosure of Invention
One aspect of the present disclosure relates to a method of generating an underwater sound target recognition model, comprising: training a first deep learning model based on source domain training data, the source domain training data comprising sound data, and the first deep learning model comprising a first binarized deep convolutional neural network and a first classifier, the first classifier comprising a fully connected layer and a classification layer; migrating model parameters of the first binarized deep convolutional neural network of the trained first deep learning model to a second binarized deep convolutional neural network of a second deep learning model, the first binarized deep convolutional neural network and the second binarized deep convolutional neural network having the same structure, the second deep learning model including a second classifier, and the second classifier including a fully connected layer, a recurrent neural network, and a classification layer; and training the second deep learning model based on target domain training data, wherein the target domain training data comprises underwater target sound data, and in the process of training the second deep learning model, model parameters of the second binarized deep convolutional neural network are fixed and model parameters of the second classifier are updated.
Another aspect of the present disclosure relates to an underwater sound target recognition method, comprising: identifying underwater sound data to be identified using an underwater sound target recognition model generated according to the method of generating an underwater sound target recognition model of the present disclosure, so as to determine the category of the underwater sound data to be identified.
Another aspect of the present disclosure relates to an electronic device. The electronic device includes a processor and a memory. The memory is communicatively coupled to the processor and stores computer readable instructions. The computer readable instructions, when executed by a processor, cause an electronic device to perform the method as described previously.
Another aspect of the disclosure relates to a computer-readable storage medium storing computer-readable instructions that, when executed by a processor of an electronic device, cause the electronic device to perform the method as previously described.
Another aspect of the disclosure relates to a computer program product comprising computer readable instructions which, when executed by a processor of an electronic device, implement a method as described above.
Drawings
The foregoing and other objects and advantages of the disclosure are further described below in connection with the following detailed description of the embodiments, with reference to the accompanying drawings. In the drawings, the same or corresponding technical features or components will be denoted by the same or corresponding reference numerals.
FIG. 1 is a flow chart of a method of generating an underwater sound target recognition model according to an embodiment of the present disclosure.
Fig. 2 is a block diagram of a first deep learning model according to an embodiment of the present disclosure.
Fig. 3 is a block diagram of a second deep learning model according to an embodiment of the present disclosure.
FIG. 4 is a schematic diagram of migrating model parameters of a trained first deep learning model to a second deep learning model according to an embodiment of the present disclosure.
Fig. 5 is a schematic diagram of the structure of a dense block according to an embodiment of the present disclosure.
Fig. 6 is a schematic diagram of the structure of another dense block according to an embodiment of the present disclosure.
Fig. 7 is a schematic diagram of a structure of a transition block according to an embodiment of the present disclosure.
Fig. 8 is a schematic diagram of the structure of a BiGRU network according to an embodiment of the disclosure.
Fig. 9 is a schematic diagram of training a second deep learning model according to an embodiment of the present disclosure.
Fig. 10 is a flowchart of MFCC feature extraction processing according to an embodiment of the present disclosure.
Fig. 11 is a flowchart of a wavelet analysis process according to an embodiment of the present disclosure.
Fig. 12 is a flowchart of a method of underwater sound target identification according to an embodiment of the present disclosure.
Fig. 13 shows a block diagram of a configuration of an electronic device according to an embodiment of the present disclosure.
Detailed Description
The following detailed description is made with reference to the accompanying drawings and is provided to assist in a comprehensive understanding of various example embodiments of the disclosure. The following description includes various details to aid in understanding, but these are to be considered merely examples and are not intended to limit the disclosure, which is defined by the appended claims and their equivalents. The words and phrases used in the following description are only intended to provide a clear and consistent understanding of the present disclosure. In addition, descriptions of well-known structures, functions and configurations may be omitted for clarity and conciseness. Those of ordinary skill in the art will recognize that various changes and modifications of the examples described herein can be made without departing from the scope of the disclosure.
FIG. 1 is a flow chart of a method 100 of generating an underwater sound target recognition model according to an embodiment of the present disclosure.
As shown in FIG. 1, at step 110, a first deep learning model is trained based on source domain training data.
The source domain training data includes various sound data. The source domain training data may be various existing or otherwise available sound data. In one embodiment of the present disclosure, the source domain training data may include existing underwater target sound data. The existing underwater target sound data may be labeled data collected by the user in the relevant waters/sea areas, or a published dataset such as the ShipsEar dataset. The source domain training data may also include any other type of voiceprint data. In addition, the source domain training data may include non-underwater target sound data. The non-underwater target sound data may be, for example, voice data, music data, living-environment sound data, and the like. For example, the source domain training data may include published datasets such as an audio speech dataset, the CASIA emotional speech dataset, and the like. Various other voiceprint datasets are also contemplated by those skilled in the art and are applicable under the teachings of the present disclosure. Additionally, the source domain training data may include data generated using a simulator. Those skilled in the art will appreciate that since simulation data may differ from actual underwater sound data, in some cases the simulation data alone may not be suitable as source domain training data for training the first deep learning model. Thus, the simulation data may be used in combination with, and may be based on, existing underwater target sound data; in this case, the simulation data may supplement the underwater target sound data with scenes that it does not cover. In addition, the data generated by the simulator may also include interference data as an interference class.
In an embodiment of the present disclosure, the source domain training data includes at least one of: sound data and interference data simulated by a simulator, existing non-underwater target sound data, and existing underwater target sound data.
Thus, the sound data used as the source domain training data may include various sound data sets existing in the art, which have a larger data amount than the underwater target sound data, thereby being able to provide a large number of sound samples for training of the first deep learning model.
Those skilled in the art will appreciate that the non-underwater target sound data affects the accuracy of the final underwater sound target recognition model to some extent according to the degree of similarity between the non-underwater target sound data and the sound data of the actual underwater sound target, and thus those skilled in the art can select more suitable non-underwater target sound data according to actual needs.
Fig. 2 is a block diagram of a first deep learning model according to an embodiment of the present disclosure. As shown in fig. 2, the first deep learning model 200 includes a first binarized deep convolutional neural network 210 and a first classifier 220. The first binarized deep convolutional neural network 210 is used for feature extraction of data. The first classifier 220 is used to receive the extracted features and determine the probability that the data belongs to each category.
In the field of machine learning, the convolutional neural network (Convolutional Neural Network, CNN) is one of the most commonly used deep learning models. Deep convolutional neural networks typically have a multi-layer structure and millions of parameters, so they have strong learning capability and are able to extract deep features from real, complex underwater sound data. However, a conventional deep convolutional neural network requires a large amount of computation and a large amount of memory to store intermediate results, and is therefore ill-suited to edge computing on an underwater platform performing an underwater sound target recognition task.
For low-power algorithm design, applying low-bit quantization to weights and activations is the primary way to mitigate power-consumption limits. The most extreme implementation is to apply binary quantization to both, i.e., the 1-bit binary neural network (Binary Neural Network, BNN). Compared with a common 32-bit floating-point neural network model, after binarization the weights and activations in the network can each be represented by 1 bit (+1 or -1, or 0 or 1), and the memory footprint can theoretically be reduced to about 1/32. Furthermore, through binarization the neural network can replace heavy floating-point multiply-add operations with lightweight XNOR (exclusive-NOR) bit operations, and in some cases CPU operation speed can be increased by up to about 58 times. Thus, a binary neural network has many hardware-friendly characteristics, including memory savings, improved energy efficiency, and significant acceleration, and can therefore be applied to low-power edge-computing scenarios.
In view of the above, embodiments of the present disclosure adopt a binarized deep convolutional neural network (BCNN). Since both the weights and the activations are binarized, the computation and storage burdens are greatly reduced, so the generated underwater sound target recognition model meets the requirements of lightweight, edge computing on low-power devices and is suitable for low-power artificial intelligence computing chips.
In a binarized deep convolutional neural network, weights are represented by a weight matrix of the respective convolutional layers, and an activation function adopts a binarized activation function, for example, a sign function or a random binarization function.
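By way of illustration only, the following is a minimal sketch, in PyTorch (an assumed framework; the disclosure does not specify one), of a convolution layer with binarized weights and a sign-based binarized activation. The straight-through estimator used in the backward pass is a common training technique for binary networks and is an assumption here, not a detail taken from this disclosure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinarizeSTE(torch.autograd.Function):
    """Sign binarization with a straight-through estimator (assumed training trick)."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)          # values in {-1, +1} (sign(0) -> 0 is tolerated here)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Pass gradients through only where |x| <= 1 (straight-through estimator).
        return grad_output * (x.abs() <= 1).float()

def binarize(x):
    return BinarizeSTE.apply(x)

class BinarizedConv2d(nn.Conv2d):
    """Convolution whose weights are binarized to +/-1 in the forward pass."""
    def forward(self, x):
        w_bin = binarize(self.weight)
        return F.conv2d(x, w_bin, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)
```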
In embodiments of the present disclosure, a binarized deep convolutional neural network similar to a dense network (DenseNet) structure is employed. A convolutional neural network with a dense network structure comprises one or more dense blocks, each containing a plurality of dense layer units, where each dense layer unit is densely connected to the following dense layer units, i.e., each dense layer unit receives the outputs of all previous dense layer units as additional inputs. The advantages of such dense connections are that the vanishing-gradient problem is alleviated, feature propagation is enhanced, feature reuse is encouraged, and the number of parameters is greatly reduced.
Since the dense layer units of the dense network structure directly concatenate the inputs of earlier layers onto their outputs, binarization is easy to implement. In addition, the dense network structure has a moderate number of weights, so it can meet the classification requirement while keeping the weight count appropriate, making it suitable for low-power artificial intelligence computing chips.
Those skilled in the art will appreciate that binarized deep convolutional neural networks of configurations other than DenseNet may also be employed in the disclosed solution. The neural network architecture adopted should meet the classification requirement, be easy to implement with a binarization quantization method, and avoid an excessive number of weights.
As shown in fig. 2, the first binarized deep convolutional neural network 210 includes one input convolutional layer 211 and n+1 dense blocks 212 and n transition blocks 213 alternately arranged. As shown in fig. 2, the dense blocks 212 and the transition blocks 213 are alternately arranged in the order of the dense blocks and the transition blocks, and the number of dense blocks is one more than the number of transition blocks.
In an embodiment of the present disclosure, the first binarized deep convolutional neural network 210 may include five dense blocks and four transition blocks.
The specific structure of the dense block and the transition block of the binarized deep convolutional neural network according to the embodiment of the present disclosure will be described hereinafter.
The first classifier 220 includes a full connection layer 221 and a classification layer 222. The fully connected layer 221 is used to flatten the received feature map, i.e. to convert the feature map into a one-dimensional vector. The fully connected layer may integrate the extracted features together for processing by the classification layer 222. Classification layer 222 may apply logistic regression, softmax regression, etc. to process the one-dimensional vectors output by full connectivity layer 221 to output the probability that each sample belongs to each classification. The content of logistic regression, softmax regression is known to those skilled in the art and is therefore not described in detail herein.
In the training process of the first deep learning model, source domain training data are provided to the first deep learning model, the first binarized deep convolutional neural network extracts features from the data, and the first classifier determines the predicted category of the data from the extracted features. A prediction error is then obtained by comparing the known class of the source domain training data with the predicted class, the prediction error being represented by a loss function. If the prediction error is greater than the set minimum error and the number of training iterations has not reached the set maximum number of iterations, the model parameters of the first binarized deep convolutional neural network and the first classifier are updated in the direction in which the loss function decreases, by applying a gradient descent algorithm through back propagation. The first deep learning model then repeats feature extraction and classification on the source domain training data until the prediction error is smaller than the set minimum error or the number of training iterations reaches the set maximum. Thereafter, the above training process is repeated for the remaining source domain training data until training has been completed on the desired amount of source domain training data, thereby obtaining a trained first deep learning model.
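A minimal, illustrative sketch of such a training loop is given below, again assuming PyTorch; the loss function, optimizer, names, and the epoch-based stopping test are assumptions standing in for the iteration/error criteria described above.

```python
import torch
import torch.nn as nn

def train_first_model(model, loader, max_iters=100, min_error=1e-3, lr=1e-3):
    """Sketch of the source-domain training loop (all names are illustrative)."""
    criterion = nn.CrossEntropyLoss()                        # loss function for the prediction error
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)   # gradient-descent update
    for epoch in range(max_iters):                           # stop at the maximum number of iterations
        total_loss = 0.0
        for features, labels in loader:                      # source-domain sound data with known classes
            optimizer.zero_grad()
            loss = criterion(model(features), labels)
            loss.backward()                                  # back-propagate the prediction error
            optimizer.step()                                 # update BCNN and classifier parameters
            total_loss += loss.item()
        if total_loss / len(loader) < min_error:             # stop once the error is small enough
            break
    return model
```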
Next, returning to FIG. 1, at step 120, model parameters of a first binarized deep convolutional neural network of the trained first deep learning model are migrated to a second binarized deep convolutional neural network of the second deep learning model.
Fig. 3 is a block diagram of a second deep learning model according to an embodiment of the present disclosure. As shown in fig. 3, the second deep learning model 300 includes a second binarized deep convolutional neural network 310 and a second classifier 320. The second binarized deep convolutional neural network 310 is used for feature extraction of data, and the second classifier 320 is used for receiving the extracted features and judging the probability that the data belongs to each category.
In the embodiment of the present disclosure, the second binarized deep convolutional neural network 310 has the same structure as the first binarized deep convolutional neural network 210, i.e., includes one input convolutional layer 311 and n+1 dense blocks 312 and n transition blocks 313 alternately arranged.
The second classifier 320 of the second deep learning model 300 includes a full connection layer 321, a recurrent neural network 322, and a classification layer 323.
In the second classifier 320 of the second deep learning model 300, the fully connected layer 321 is used to flatten the received feature map, i.e. to convert the feature map into a one-dimensional vector. The structure and parameters of the fully connected layer 321 in the second classifier 320 of the second deep learning model 300 may be different from the fully connected layer 221 in the first classifier 220 of the first deep learning model 200.
The recurrent neural network (Recurrent Neural Networks, RNN) adds a memory function relative to conventional neural networks so that neurons can hold historical computation information. In other words, each neuron has a memory function, and the calculation of each time of the neuron depends not only on the current input but also on the history information stored by the neuron, so that the data and the calculation result can be transmitted in the time dimension. Convolutional neural networks may extract features of the spatial dimension, while recurrent neural networks may pass and extract information in the temporal dimension.
In a fully connected or convolutional neural network, data flows from the input layer through the hidden layers to the output layer; adjacent layers are fully or partially connected, but the nodes within each layer are unconnected. The purpose of the recurrent neural network is to model the dependence of the current output of a sequence on previously received inputs. Structurally, the RNN memorizes previous information and uses it to influence subsequent outputs. That is, the nodes between hidden layers of the RNN are connected, and the input of a hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous time step.
Examples of recurrent neural networks include the Long Short-Term Memory (LSTM) network, the gated recurrent unit (Gated Recurrent Unit, GRU) network, the bidirectional LSTM (BiLSTM) network, the bidirectional GRU (BiGRU) network, and the like.
Hereinafter, a description will be given taking a BiGRU network as an example of the recurrent neural network.
The classification layer 323 may apply logistic regression, softmax regression, etc. to process the feature map output by the recurrent neural network 322, thereby outputting the probability that each sample belongs to each classification. FIG. 4 is a schematic diagram of migrating model parameters of a trained first deep learning model to a second deep learning model according to an embodiment of the present disclosure. The left side of fig. 4 shows the second deep learning model 300 as shown in fig. 3, and the right side of fig. 4 shows the first deep learning model 200 as shown in fig. 2.
Since the second binarized deep convolutional neural network 310 is identical in structure to the first binarized deep convolutional neural network 210, model parameters of the first binarized deep convolutional neural network of the trained first deep learning model may be directly migrated to the second binarized deep convolutional neural network of the second deep learning model without any adjustments. The migrated model parameters include at least one of: the weight matrix of each convolution layer, the bias matrix of each convolution layer and the parameters of the batch normalization layers.
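As a sketch of this migration step, assuming a PyTorch implementation in which each model exposes its backbone under a hypothetical `bcnn` attribute, the parameter copy could be expressed as follows.

```python
import torch

def migrate_backbone(first_model, second_model):
    """Copy the trained first BCNN's parameters (weights, biases, batch-norm parameters)
    into the structurally identical second BCNN. Attribute names are illustrative."""
    state = first_model.bcnn.state_dict()       # first binarized deep convolutional neural network
    second_model.bcnn.load_state_dict(state)    # identical structure, so no adjustment is needed
    return second_model
```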
Furthermore, because of the difference in the structures of the first classifier 220 and the second classifier 320, the model parameters of the first classifier 220 may not be migrated to the second classifier 320. In another embodiment of the present disclosure, a portion of the model parameters of the first classifier 220 may be adaptively adjusted before migrating to the second classifier 320 as initial parameters of the second classifier 320.
Although the sound data used to train the first deep learning model may not be underwater target sound data, the trained first deep learning model has acquired basic knowledge of sound data; its ability to extract features from sound information is common to different classification tasks, and its parameters can therefore be migrated to the second deep learning model for underwater sound target recognition.
By training the first deep learning model with sound data of large sample size and, through transfer learning, migrating the model parameters of the first binarized deep convolutional neural network of the trained first deep learning model to the second binarized deep convolutional neural network of the second deep learning model, the problem that training cannot be carried out because the sample size of underwater sound target data is small can be solved, overfitting of the trained second deep learning model can be avoided, and the accuracy of the second deep learning model can be improved.
Thereafter, returning to fig. 1, at step 130, the second deep learning model is trained based on the target domain training data, and the trained second deep learning model is obtained as the underwater sound target recognition model.
Although the model parameters of the second binarized deep convolutional neural network 310 of the second deep learning model also obtain basic knowledge of sound data through model parameter migration, further training of the second deep learning model with the target domain training data is required due to the difference between the source domain training data for training the first deep learning model and the target domain training data for underwater sound target recognition.
In an embodiment of the present disclosure, the target domain training data comprises underwater target sound data. The underwater target sound data are the corresponding sound data collected for each category of underwater sound target, and may be existing data for the relevant water area/sea area. Furthermore, the underwater target sound data may also be target sound data collected underwater by a sensor (e.g., a microphone, sonar, etc.) and labeled. These data may be collected in the area where the generated underwater sound target recognition model is to be applied, which can improve the pertinence and accuracy of the generated model.
The process of training the second deep learning model is similar to the process of training the first deep learning model, and differs therefrom in that the model parameters of the second binarized deep convolutional neural network 310 are fixed and the model parameters of the second classifier 320 are updated during the process of training the second deep learning model. In other words, during the training of the second deep learning model, the model parameters of the second binarized deep convolutional neural network 310 are not updated and remain as model parameters migrated from the trained first deep learning model, and the model parameters of the fully connected layer, the recurrent neural network, and the classification layer of the second classifier 320 are continuously updated until the training is completed.
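A hedged sketch of this fine-tuning stage is shown below, assuming PyTorch and hypothetical `bcnn` and `classifier` attributes: the migrated backbone parameters are frozen and only the second classifier's parameters are passed to the optimizer.

```python
import torch
import torch.nn as nn

def train_second_model(model, target_loader, epochs=50, lr=1e-3):
    """Target-domain fine-tuning sketch: the migrated BCNN is frozen and only the
    second classifier (fully connected layer, BiGRU, classification layer) is updated."""
    for p in model.bcnn.parameters():                         # fix the migrated backbone parameters
        p.requires_grad = False
    optimizer = torch.optim.Adam(model.classifier.parameters(), lr=lr)  # classifier only
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for features, labels in target_loader:                # labeled underwater target sound data
            optimizer.zero_grad()
            loss = criterion(model(features), labels)
            loss.backward()
            optimizer.step()
    return model
```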
As shown in fig. 4, the second binarized convolutional neural network 310 in the second deep learning model 300 is a migrated and fixed part, and the second classifier 320 is a trained part.
The underwater sound target recognition model generated according to embodiments of the present disclosure adopts a binarized deep convolutional neural network, so it can meet application requirements such as low power consumption, while the recurrent-neural-network classifier can better classify time-series data, mitigating the accuracy loss caused by binarization. In addition, migrating the model parameters of a model trained on the source domain to the convolutional neural network and performing secondary training alleviates the small-sample problem of limited underwater data, thereby improving the generalization ability of the model. Moreover, by combining a binarized convolutional network with a recurrent neural network, the scheme provides a deep learning model that processes data from both the spatial-domain (convolutional neural network) and time-domain (recurrent neural network) perspectives. Thus, the underwater sound target recognition model generated according to embodiments of the present disclosure can be applied to low-power artificial intelligence computing chips.
Next, the structure of the dense blocks 212 and 312 in the first and second binarized deep convolutional neural networks 210 and 310 of the embodiment of the present disclosure will be described. Since the dense blocks 212 and 312 have the same structure, they are not distinguished hereinafter.
Fig. 5 is a schematic diagram of the structure of a dense block according to an embodiment of the present disclosure. Fig. 5 (A) shows the structure within one dense block. As shown, each dense block 500 includes a plurality of interconnected dense layer units; eight dense layer units are shown in the figure. The dense layer units are connected in sequence, and the outputs of the preceding dense layer units are connected to the inputs of the following dense layer units.
Fig. 5 (B) shows the structure within one dense layer unit. As shown, each dense layer unit includes a scaling factor 510, a binarized convolution layer 520, a bulk normalization layer 530, and a binarized activation layer 540.
As shown in fig. 5 (B), in each dense layer unit, the input feature map of the dense layer unit is multiplied by a scaling factor 510 to obtain a scaled input feature map, which is concatenated onto the output feature map of the dense layer unit and serves as the input feature map of the next dense layer unit. Thus, the output of a preceding dense layer unit is connected to the input of the following dense layer unit. Because the computing modules of each dense layer unit can be reused, this structure is particularly suitable for customized, ultra-low-power intelligent applications.
In the binarized convolution layer 520, a convolution operation (weighted sum) is performed on the input feature map using a convolution kernel (weight matrix) smaller than the input feature map, producing one feature value; as the convolution kernel slides over the input feature map, a plurality of feature values are obtained, forming a new feature map. In an embodiment of the present disclosure, the size of the convolution kernel is 3×3. In embodiments of the present disclosure, the weight matrix (convolution kernel) of the binarized convolution layer 520 is binarized, i.e., each weight in the weight matrix is only +1 or -1 (or only 0 or 1).
The function of the batch normalization layer 530 is to normalize the received data. The convergence rate of the model can be increased by the batch normalization layer processing.
The function of the binarized activation layer 540 is to activate the features extracted by the convolution layer. Since the convolution operation is a linear transformation between the input matrix and the convolution kernel matrix, the activation layer is needed to apply a nonlinear mapping to its output. The activation layer mainly consists of an activation function, i.e., a nonlinear function applied to the output of the convolution layer so that the output feature map has a nonlinear relationship to the input. In embodiments of the present disclosure, the binarized activation layer 540 employs a binarized activation function, such as a sign function or a random binarization function.
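The following sketch, reusing the `BinarizedConv2d` and `binarize` helpers from the earlier sketch and assuming PyTorch, illustrates one way a dense layer unit of Fig. 5 (scaling factor, binarized 3×3 convolution, batch normalization, binarized activation, concatenation) might be expressed; the channel sizes and the learnable scaling factor are assumptions.

```python
import torch
import torch.nn as nn

class DenseLayerUnit(nn.Module):
    """One dense layer unit: scale the input, binarized 3x3 convolution, batch
    normalization, binarized activation, then concatenate the scaled input.
    BinarizedConv2d / binarize are the helpers sketched earlier."""
    def __init__(self, in_channels, growth, scale_init=1.0):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(scale_init))   # scaling factor
        self.conv = BinarizedConv2d(in_channels, growth, kernel_size=3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(growth)

    def forward(self, x):
        scaled = self.scale * x                  # scaled input feature map
        out = binarize(self.bn(self.conv(x)))    # binarized activation (sign function)
        return torch.cat([scaled, out], dim=1)   # concatenated input for the next unit
```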
Fig. 6 is a schematic diagram of the structure of another dense block according to an embodiment of the present disclosure. Fig. 6 (A) shows the structure within one dense block. As shown, each dense block 600 includes an even number of interconnected dense layer units; eight dense layer units are shown in the figure. The difference from the dense block 500 shown in fig. 5 is that in the dense block 600 shown in fig. 6 only the inputs of odd-numbered dense layer units are scaled and concatenated. Specifically, as shown in (A) of fig. 6, only the inputs of the odd-numbered dense layer units (units 1, 3, 5, and 7) are scaled and concatenated, while the inputs of the even-numbered dense layer units (units 2, 4, 6, and 8) are not scaled and are not concatenated to later dense layer units, other than being processed by the even-numbered dense layer units themselves.
Fig. 6 (B) shows the structure of a pair of dense layer units of odd-numbered and then adjacent even-numbered. As shown, the odd numbered dense layer units include a binarized convolution layer 621, a bulk normalization layer 631, and a binarized activation layer 641. The even numbered dense layer cells include a binarized convolutional layer 622, a bulk normalization layer 632, and a binarized activation layer 642. The function of the individual parts in each dense layer unit is similar to that shown in fig. 5 (B) and will not be repeated here.
As shown in fig. 6 (B), each pair consisting of an odd-numbered dense layer unit and the adjacent following even-numbered dense layer unit has one scaling factor 611. The input feature map of the odd-numbered dense layer unit of the pair is multiplied by the scaling factor, yielding a scaled input feature map. The scaled input feature map is then concatenated onto the output feature map of the even-numbered dense layer unit of the pair and serves as the input feature map of the odd-numbered dense layer unit of the next pair. The next pair of dense layer units proceeds in the same way: the input feature map of its odd-numbered unit is scaled by that pair's scaling factor and concatenated onto the output feature map of its even-numbered unit. In this way, only the input feature maps of the odd-numbered dense layer units are scaled and concatenated, while the inputs of the even-numbered dense layer units are not scaled and are not concatenated to later dense layer units, other than being processed by the even-numbered units themselves.
In the embodiment of the present disclosure, by scaling and concatenating only the inputs of the odd-numbered dense layer units, and not those of the even-numbered dense layer units, information can still be retained to a sufficient extent while some buffering and computation are saved compared with the dense block structure shown in fig. 5, making this structure better suited to sound signals.
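For comparison, a sketch of a pair of dense layer units from the Fig. 6 variant is given below, under the same assumptions and reusing the same helpers: only the odd-numbered unit's input is scaled and concatenated onto the even-numbered unit's output.

```python
import torch
import torch.nn as nn

class PairedDenseLayerUnits(nn.Module):
    """Fig. 6 variant: only the odd-numbered unit's input is scaled; the scaled input is
    concatenated onto the even-numbered unit's output for the next odd-numbered unit."""
    def __init__(self, in_channels, growth):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(1.0))
        self.conv_odd = BinarizedConv2d(in_channels, growth, kernel_size=3, padding=1, bias=False)
        self.bn_odd = nn.BatchNorm2d(growth)
        self.conv_even = BinarizedConv2d(growth, growth, kernel_size=3, padding=1, bias=False)
        self.bn_even = nn.BatchNorm2d(growth)

    def forward(self, x):
        scaled = self.scale * x                              # scale only the odd unit's input
        h = binarize(self.bn_odd(self.conv_odd(x)))          # odd-numbered dense layer unit
        out = binarize(self.bn_even(self.conv_even(h)))      # even-numbered unit (not scaled/concatenated)
        return torch.cat([scaled, out], dim=1)               # input to the next odd-numbered unit
```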
The structure of the transition blocks 213 and 313 in the first and second binarized deep convolutional neural networks 210 and 310 of the embodiment of the present disclosure is explained next. Since the transition blocks 213 and 313 have the same structure, they are not distinguished hereinafter.
Within a dense block, as the number of dense layer units increases, the number of input channels also keeps growing. To maintain accuracy, the output data therefore need to be downsampled after passing through several convolutional layers.
One downsampling scheme is to use pooling (pooling) for processing, such as maximum pooling or average pooling. However, the pooling layer process has two major problems for low power consumption binary network computing. The first problem is that the pooling layer can cause loss of information. This loss of information is acceptable for convolutional networks based on floating point number calculations. But for binarized networks this loss can have a large impact on the accuracy of the identification. The second problem is that the pooling layer needs to use additional hardware resources to perform the computation as opposed to convolution computation. Additional hardware resources may result in more power consumption requirements.
Thus, in order not to affect the accuracy of recognition and reuse hardware resources to the maximum extent, in embodiments of the present disclosure, downsampling units (i.e., transition blocks) of a low-power binarized convolutional neural network are employed that are suitable for maintaining accuracy.
Fig. 7 is a schematic diagram of a structure of a transition block according to an embodiment of the present disclosure. As shown in fig. 7, each transition block includes one transition layer unit, each transition layer unit including a binarized convolution layer, a batch normalization layer, and a binarized activation layer. In contrast to dense blocks, transition blocks do not employ scaling factors. As shown in fig. 5, 6, and 7, the transition block of embodiments of the present disclosure employs a similar structure to the dense layer elements of a binarized deep convolutional neural network, thereby enabling reuse of the same computation module, avoiding the use of additional hardware and avoiding increasing power consumption requirements.
In an embodiment of the present disclosure, the convolution stride m of the binarized convolution layer is an integer greater than 1. The convolution stride m is the number of rows and columns by which the convolution kernel slides at each step of the convolution operation. Setting the stride to a value greater than 1 reduces the size of the output feature map, achieving downsampling. After the downsampling unit, the number of channels of the input data is readjusted so that the next group of dense layer units can perform the subsequent computation.
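A sketch of such a transition block, under the same assumptions and helpers as the earlier sketches, is shown below; the stride value of 2 is merely illustrative of m > 1.

```python
import torch
import torch.nn as nn

class TransitionBlock(nn.Module):
    """Transition (downsampling) block: a binarized convolution with stride m > 1,
    batch normalization, and a binarized activation; no scaling factor and no pooling."""
    def __init__(self, in_channels, out_channels, stride=2):
        super().__init__()
        self.conv = BinarizedConv2d(in_channels, out_channels,
                                    kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)

    def forward(self, x):
        return binarize(self.bn(self.conv(x)))   # spatial size reduced by the stride
```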
In an embodiment of the present disclosure, the recurrent neural network of the second classifier is a BiGRU network.
In classical recurrent neural network RNNs, the transmission of hidden states is unidirectional from front to back. However, in the Bidirectional RNN (BiRNN), the output at the present time is related not only to the past state but also to the future state, that is, the output may be commonly determined in consideration of the context of the input at the present time. BiRNNs are composed of two unidirectional RNNs in opposite directions superimposed together. At each time t, the inputs are simultaneously provided to the two opposite RNNs, while the outputs are determined by both unidirectional RNNs. The BiLSTM and BiGRU are formed by replacing RNN in BiRNN with LSTM or GRU structure.
Fig. 8 is a schematic diagram of the structure of a BiGRU network according to an embodiment of the disclosure. As shown in fig. 8, the BiGRU network consists of two GRUs: one is a forward GRU2 model, which takes the hidden state computed at the current time step as input to the next time step; the other is a backward GRU1 model, which takes the hidden state computed at the next time step as input to the current time step. The BiGRU network includes an input layer, a forward hidden layer, a backward hidden layer, and an output layer. The forward hidden layer corresponds to the forward GRU2 model, and the backward hidden layer corresponds to the backward GRU1 model. The input layer feeds data to the forward hidden layer and the backward hidden layer, respectively. In fig. 8, x(t) is the input at the current time step, x(t-1) is the input at the previous time step, and x(t+1) is the input at the next time step. The output layer receives the outputs of the forward and backward hidden layers and splices them together as the output at the current time step. In fig. 8, y(t) is the output at the current time step, y(t-1) is the output at the previous time step, and y(t+1) is the output at the next time step. The details of the BiGRU network are known to those skilled in the art and are therefore not described further here.
In the embodiment of the present disclosure, the input size of the BiGRU network is adjustable; reducing the input size reduces the amount of computation and the computational burden, but correspondingly lowers the recognition accuracy of the model. For example, reducing the number of inputs to the BiGRU network from 1024 to 256 cuts the parameters by a factor of four and reduces the workload accordingly. In practice, the input size of the BiGRU network can be adjusted according to actual requirements to strike a suitable compromise between computation and accuracy.
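The following sketch, assuming PyTorch's `nn.GRU` with `bidirectional=True`, illustrates the BiGRU plus classification-layer portion of the second classifier; the dimensions, the mean aggregation over per-slice outputs, and returning raw class scores (with softmax applied at inference) are illustrative simplifications, not details stated in the disclosure.

```python
import torch
import torch.nn as nn

class BiGRUClassifier(nn.Module):
    """Sketch of the BiGRU + classification-layer part of the second classifier.
    feat_dim = 256 matches the adjustable-input example mentioned above."""
    def __init__(self, feat_dim=256, hidden=128, num_classes=5):
        super().__init__()
        self.bigru = nn.GRU(input_size=feat_dim, hidden_size=hidden,
                            batch_first=True, bidirectional=True)
        self.fc_out = nn.Linear(2 * hidden, num_classes)   # spliced forward + backward outputs

    def forward(self, slice_features):
        # slice_features: (batch, num_slices, feat_dim), one flattened vector per time slice
        outputs, _ = self.bigru(slice_features)            # (batch, num_slices, 2 * hidden)
        pooled = outputs.mean(dim=1)                       # aggregate the per-slice outputs y(1)..y(t)
        return self.fc_out(pooled)                         # class scores; softmax applied at inference
```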
The retrained portion of the transfer-learning scheme, which includes the BiGRU, can capture the characteristics of the underwater acoustic signal well.
According to one or more embodiments of the present disclosure, before the target domain training data are provided to the second deep learning model, they may be sliced into segments of a predetermined time length; the time-series relationship of the input data is still maintained between the slices. For a long, high-sampling-rate underwater acoustic signal, the learning model may be overburdened if the raw data are taken directly as input. By slicing the target domain training data and providing the slices as input to the second deep learning model, the learning speed can be increased and the power consumption reduced. Furthermore, in embodiments of the present disclosure, the BiGRU network operates on input data that have a time-series relationship; slicing the target domain training data and providing the slices to the second deep learning model supplies the BiGRU network with inputs having such a time-series relationship for it to learn from.
In an embodiment of the present disclosure, to keep the size of the input data consistent, the time length of each slice may be made the same, for example, 2 seconds for every slice. Also, the time length of each slice should not be too short, so that each slice contains enough features; for example, each slice may be no shorter than 1 second.
FIG. 9 is a schematic diagram of training a second deep learning model, showing slicing of input data, according to an embodiment of the present disclosure. As shown in fig. 9, the target domain data is first sliced. In this embodiment, the target domain data as an input signal is divided into five slices, that is, slice 1 to slice 5, in time series through slice processing. The time series relationship is still maintained between slice 1 through slice 5.
Then, the obtained slices of the target domain training data are input, in sequence, into the second binarized deep convolutional neural network 310 for feature extraction, and the extracted features are provided to the second classifier 320 for classification. Specifically, slice 1 is first provided to the second binarized deep convolutional neural network 310 and processed by it, and the resulting feature map is provided to the fully connected layer 321 of the second classifier 320. The fully connected layer 321 flattens the received feature map, i.e., converts it into a one-dimensional vector x(1). Then x(1) is provided to the BiGRU network as the input at the first time step. Similarly to slice 1, slice 2 is then input into the second binarized deep convolutional neural network 310 for feature extraction, and the extracted features are provided to the fully connected layer 321 of the second classifier 320 and converted into a one-dimensional vector x(2), which is provided to the BiGRU network as the input at the second time step, immediately after the first. In the same way, by processing slices 1 to 5 in order, the one-dimensional vectors x(1), x(2), x(3), x(4), x(5) are obtained and provided in turn to the BiGRU network as the inputs from the first to the fifth time step. The BiGRU network processes the vectors x(1), x(2), x(3), x(4), x(5) through the forward hidden layer and the backward hidden layer, and splices the outputs of the two hidden layers as the output at each time step. As shown in fig. 9, processing the vectors x(1), x(2), x(3), x(4), x(5) from the first to the fifth time step yields the outputs y(1), y(2), y(3), y(4), y(5) in turn. The outputs y(1), y(2), y(3), y(4), y(5) are then provided to the classification layer and classified by it to obtain the classification result of the input data. Thereafter, as discussed above, the prediction error is determined and the model parameters of the fully connected layer, the recurrent neural network, and the classification layer in the second classifier 320 are continuously updated in the direction of decreasing loss, by applying a gradient descent algorithm through back propagation, until training is completed.
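A compact sketch of this slice-by-slice pipeline is given below, under the same PyTorch assumption; `bcnn`, `fc` (standing in for the fully connected layer 321), and `classifier` (e.g., the BiGRU classifier sketched earlier) are hypothetical names, and equal-length slices are assumed.

```python
import torch

def classify_sliced_input(features, bcnn, fc, classifier, num_slices=5):
    """Slice -> BCNN -> flatten/project -> BiGRU -> classification, as described above.
    features: tensor of shape (batch, channels, freq, time); all names are illustrative."""
    slices = torch.chunk(features, num_slices, dim=-1)     # split along the time dimension
    vectors = []
    for s in slices:                                       # slices keep their time-series order
        fmap = bcnn(s)                                     # frozen binarized deep CNN feature extraction
        vectors.append(fc(torch.flatten(fmap, 1)))         # fully connected layer 321: flatten + project
    sequence = torch.stack(vectors, dim=1)                 # (batch, num_slices, feat_dim) for the BiGRU
    return classifier(sequence)                            # e.g. the BiGRUClassifier sketched above
```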
The process of underwater target recognition based on the trained second deep learning model is similar to the process shown in fig. 9, but ends when the classification result is output by the classification layer of the second classifier 320, with no back propagation to update the model parameters.
By slicing the data, feeding the slices in sequence into the second binarized deep convolutional neural network to extract features, and training the BiGRU model on the features of the slices, which retain their time-series relationship, the temporal characteristics of the underwater sound signal can be captured better. The underwater sound target recognition model obtained in this way can then recognize underwater sound targets more accurately.
According to embodiments of the present disclosure, the original input data is a sound signal, and in some cases such a sound signal cannot be fed directly into the deep learning model as input data. In that case, time-frequency analysis may be performed on the input data to obtain input data suitable for the deep learning model. According to an embodiment of the present disclosure, Mel-frequency cepstral coefficient (Mel-scale Frequency Cepstral Coefficients, MFCC) feature extraction may be performed on the input data to obtain its MFCC feature data as the input data. The mel frequency scale is derived from the auditory characteristics of the human ear and has a nonlinear correspondence with the frequency of the original signal. MFCC features are the coefficients obtained by nonlinearly converting the audio spectrum to the mel spectrum, modeled on the human ear's nonlinear auditory system, and then converting from the mel spectrum to the cepstrum.
Fig. 10 is a flowchart of MFCC feature extraction processing according to an embodiment of the present disclosure. As shown in fig. 10, the MFCC feature extraction process includes pre-emphasis, framing, windowing, fast fourier transform, mel frequency analysis, logarithmic operation, discrete cosine transform, and the like.
Pre-emphasis boosts the high-frequency part of the signal's spectrum so that the spectrum becomes flatter and is kept across the whole band from low to high frequencies, allowing the spectrum to be computed with the same signal-to-noise ratio. Framing divides the signal into several short segments, within which the signal can be regarded as a stationary process. The rationale is that the frequency content of the signal varies over time, so in most cases it is not meaningful to Fourier-transform the whole signal, since the frequency profile would be smeared over time. To avoid this, it can safely be assumed that the frequency content is stationary over a very short interval; by performing a Fourier transform within this short time window and connecting adjacent frames, a good approximation of the frequency profile of the signal is obtained. The framing process uses overlapping segmentation so that the transition between adjacent frames is smooth.
Windowing applies a window function (e.g., a Hamming window) to each frame after the signal is split into frames, reducing the truncation effect, improving continuity at the left and right ends of the frame, and reducing spectral leakage. Since the characteristics of a signal are usually hard to observe from its time-domain waveform, the signal is generally converted into an energy distribution in the frequency domain, where different energy distributions represent the characteristics of different sounds. After multiplication by the Hamming window, each frame undergoes a fast Fourier transform to obtain its energy distribution over the spectrum; that is, the framed and windowed frame signals are fast-Fourier-transformed to obtain the spectrum of each frame. The power spectrum of the sound signal is then obtained by taking the squared magnitude of its spectrum.
Mel-frequency analysis passes the energy spectrum through a mel filter bank, a triangular filter bank with M filters, where M is typically 22-40. The triangular band-pass filters are densely distributed in the low-frequency part and sparsely distributed in the high-frequency part; this distribution better matches the auditory characteristics of the human ear and highlights the formants of the original sound.
The logarithmic operation computes the logarithmic energy of each filter-bank output. The logarithm is used because the human ear's perception of sound is not linear and is better described logarithmically, and cepstral analysis can be performed after the logarithm is taken. Finally, applying the discrete cosine transform to the logarithmic energies yields the MFCC feature data.
The MFCC underwater acoustic signal characteristic data can be obtained by converting an input signal into mel frequency and performing cepstrum analysis on the signal at the mel frequency.
In addition, because the sound signal is continuous in the time domain, the features extracted from a single frame only reflect the characteristics of the sound in that frame. To make the features better reflect continuity in the time domain, dimensions describing the preceding and following frames can be appended to the feature. It is common to compute the first- and second-order differences (deltas) of the MFCC coefficients, which may be input together with the static coefficients as a feature vector into the subsequent neural network.
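For example, using librosa (one possible library choice, not mandated by the disclosure), the first- and second-order deltas can be stacked with the static coefficients:

```python
import numpy as np
import librosa

# mfcc: (n_mfcc, n_frames) matrix, e.g. from librosa.feature.mfcc or the steps sketched above
mfcc = librosa.feature.mfcc(y=librosa.tone(440, sr=16000, duration=2.0), sr=16000, n_mfcc=13)
delta1 = librosa.feature.delta(mfcc, order=1)               # first-order difference
delta2 = librosa.feature.delta(mfcc, order=2)               # second-order difference
features = np.concatenate([mfcc, delta1, delta2], axis=0)   # static + dynamic features per frame
```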
Thus, those skilled in the art will appreciate that, in addition to performing MFCC feature extraction to obtain MFCC feature data, other processing may be performed on the input sound signal to obtain other feature data, such as mel-frequency features, chromagrams, spectral contrast, tonnetz (tonal network) features, Gammatone frequency cepstral coefficient (Gammatone Frequency Cepstral Coefficients, GFCC) features, and the like. One or more of these feature data may be provided together to the deep learning model for training and recognition.
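As a hedged illustration with librosa (the input file name is hypothetical; GFCC is not provided by librosa and would require a separate gammatone filter-bank implementation):

```python
import librosa

y, sr = librosa.load("underwater_clip.wav", sr=None)          # hypothetical input file
mel      = librosa.feature.melspectrogram(y=y, sr=sr)         # mel-frequency features
chroma   = librosa.feature.chroma_stft(y=y, sr=sr)            # chromagram
contrast = librosa.feature.spectral_contrast(y=y, sr=sr)      # spectral contrast
tonnetz  = librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr)  # tonal network features
```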
In the case of performing MFCC feature extraction processing on the data, the slicing processing may include slicing the MFCC feature data into a plurality of slices along its time dimension. For example, assuming that the MFCC feature obtained through the MFCC extraction process is a 32×160×3 three-dimensional tensor and 160 is the time dimension, the MFCC feature may be sliced along the time dimension into, for example, five 32×32×3 tensors. Feature extraction is then performed by the second binarized deep convolutional neural network 310 to obtain, for example, five 256-dimensional feature vectors (a 5×256 feature matrix), which are provided to the classifier 320 for classification processing.
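A minimal sketch of the time-dimension slicing described above (the array layout and slice length follow the 32×160×3 example and are illustrative):

```python
import numpy as np

def slice_time_axis(feat, slice_len):
    # feat: (H, T, C) feature tensor; split along the time axis T into equal-length slices
    n = feat.shape[1] // slice_len
    return [feat[:, i * slice_len:(i + 1) * slice_len, :] for i in range(n)]

# e.g. a 32x160x3 MFCC tensor -> five 32x32x3 slices fed sequentially to the network
slices = slice_time_axis(np.zeros((32, 160, 3)), 32)
```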
Returning to fig. 1, in an embodiment of the present disclosure, the data may be pre-processed 105 before being provided to the first and second deep learning models.
In accordance with an embodiment of the present disclosure, preprocessing 105 may include wavelet analysis processing of the data to handle baseline distortion and random noise. The underwater acoustic signal occupies the band from 10 Hz to 10 kHz, which overlaps heavily with low-frequency electronic noise. With simple filtering, it is difficult to eliminate the electronic interference without damaging the original underwater acoustic signal. Thus, to address baseline distortion and random noise, wavelet analysis processing is employed in embodiments of the present disclosure.
Fig. 11 is a flowchart of a wavelet analysis process according to an embodiment of the present disclosure. The wavelet transform is known to separate the high-frequency and low-frequency parts of a signal. In the underwater acoustic signal, the baseline occupies a different frequency band from the useful signal: in the wavelet domain of the signal to be processed, the baseline lies in the low-frequency band, while the underwater acoustic signal lies in the wavelet coefficients of the higher-frequency bands. Based on these characteristics, the principle of the wavelet transform can be applied: the low-frequency wavelet coefficients are set to zero to remove the baseline, while thresholding is applied to the high-frequency wavelet coefficients to remove random noise. Thereafter, a signal with reduced baseline distortion and random noise is obtained by wavelet reconstruction from the low-frequency and high-frequency wavelet coefficients.
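A minimal sketch of this baseline removal and denoising using PyWavelets; the wavelet family, decomposition level, and threshold rule below are illustrative assumptions, not values fixed by this disclosure:

```python
import numpy as np
import pywt

def wavelet_denoise(x, wavelet="db4", level=6):
    coeffs = pywt.wavedec(x, wavelet, level=level)
    coeffs[0] = np.zeros_like(coeffs[0])                  # zero the low-frequency band -> remove baseline
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745        # noise estimate from the finest detail band
    thr = sigma * np.sqrt(2.0 * np.log(len(x)))           # universal threshold
    coeffs[1:] = [pywt.threshold(c, thr, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[:len(x)]         # reconstruct and trim to original length
```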
Preprocessing 105 may also include other processing of the data, according to embodiments of the present disclosure. For example, those skilled in the art will recognize that various processes for preprocessing a signal may be included in embodiments of the present disclosure, such as FIR low-pass filtering processes, flipping processes, clipping processes, interpolation processes, signal enhancement processes, and the like.
According to an embodiment of the present disclosure, there is provided an underwater sound target recognition method. Fig. 12 is a flow chart of a method 1200 of underwater sound target identification according to an embodiment of the present disclosure.
As shown in fig. 12, in step 1210, underwater sound data to be recognized is recognized using an underwater sound target recognition model to determine a category of the underwater sound data to be recognized. The underwater sound target recognition model is generated according to the method of generating an underwater sound target recognition model (e.g., method 100) of the above embodiments of the present disclosure. This step is done on the current computing device (e.g., FPGA, AI computing chip, etc.) that performs the underwater sound target recognition. Current computing devices that perform underwater sound target recognition, such as FPGAs, AI computing chips, etc., can implement the underwater sound target recognition model of embodiments of the present disclosure with limited computational power, small memory space, and low energy consumption. The obtained underwater sound target recognition model may be generated on other computing devices (e.g., cloud computing devices) and then sent to the current computing device performing underwater sound target recognition. In embodiments of the present disclosure, steps 110, 120, and 130 in the method of generating an underwater sound target recognition model according to embodiments of the present disclosure, i.e., the step of training a first deep learning model, the step of model parameter migration, and the step of training a second deep learning model, may be performed on other computing devices. Other computing devices may have powerful computing power and large memory space and may be provided with sufficient energy and thus have the ability to perform complex training and parameter migration operations.
Furthermore, in the underwater sound target recognition method according to the embodiment of the present disclosure, other processing may be performed on the input data, including but not limited to: wavelet analysis processing, MFCC feature extraction processing, and slicing processing. The steps of the various other processes performed on the input data in the present method are similar to those correspondingly described above and are not repeated here.
Through testing, the accuracy of the second deep learning model trained with transfer learning in the present disclosure is about 98.97%, while the accuracy of a deep learning model trained directly without transfer learning is 93.85%. The transfer learning scheme therefore effectively improves accuracy. In addition, under the same transfer learning, the accuracy of a deep learning model without binarization is 99.94%, which is only slightly higher than that of the second deep learning model employing both transfer learning and binarization in the present disclosure. The accuracy reduction caused by binarization in the embodiments of the present disclosure is therefore very limited, while the computational and storage burdens are greatly reduced. In addition, for the scheme of the present disclosure, reducing the input of the BiGRU network from 1024 to 256 (a 4-fold reduction in parameters) lowers the accuracy from the above 98.97% to only 97.44%. Therefore, in actual operation, the input parameters of the BiGRU network can be adjusted according to practical requirements to determine the optimal parameters.
Methods of generating a hydroacoustic target recognition model and hydroacoustic target recognition methods according to embodiments of the present disclosure may be implemented in the form of computer readable instructions in an electronic device.
Fig. 13 shows a block diagram of a configuration of an electronic device 1300 according to an embodiment of the disclosure. The electronic device 1300 may be used to perform a method of generating a hydroacoustic target recognition model and a hydroacoustic target recognition method, such as methods 100 and 1200, according to embodiments of the present disclosure. The electronic device 1300 may be any type of general-purpose or special-purpose computing device, such as a desktop computer, a laptop computer, a server, a mainframe computer, a cloud-based computer, a tablet computer, a wearable device, the electronics of a watercraft, the electronics of a smart buoy, and the like. The electronic device 1300 may also be incorporated into a surface or underwater detector, a surface or underwater vehicle, and the like.
Those skilled in the art will appreciate that an electronic device having a powerful computing capability and a large memory space and that can be provided with sufficient energy may be used to perform the method 100 according to an embodiment of the present disclosure, i.e., the method of generating a hydroacoustic target recognition model. These electronic devices include desktop computers, laptop computers, servers, mainframe computers, cloud-based computers, tablet computers, and the like.
In addition, an electronic device with limited computing power, small memory space, and low energy consumption may be used to perform the method 1200 according to embodiments of the present disclosure, i.e., the method of identifying underwater sound data to be identified using the generated underwater sound target identification model. These devices include wearable devices, the electronics of ships and smart buoys, and electronic devices incorporated into surface or underwater detectors, surface or underwater vehicles, and the like.
In other words, the method 100 of the embodiments of the present disclosure is computationally intensive and is typically carried out on powerful computing devices (e.g., computer CPUs/GPUs, servers, etc.). The method 1200 of the embodiments of the present disclosure may be implemented at the edge by an ASIC, DSP, FPGA, dedicated chip, or the like with very limited performance and very low power consumption. Those skilled in the art will appreciate that the method 1200 of embodiments of the present disclosure may of course also be implemented on more powerful computing devices, for example to perform testing.
As shown in fig. 13, an electronic device 1300 may include an Input/Output (I/O) interface 1301, a network interface 1302, a memory 1304, and a processor 1303.
I/O interface 1301 is a collection of components that may receive input from a user and/or provide output to a user. The I/O interface 1301 may include, but is not limited to, buttons, a keyboard, a keypad, an LCD display, an LED display, or other similar display devices, including display devices having touch screen capabilities enabling interaction between a user and an electronic device.
The network interface 1302 may include various adapters and circuitry implemented in software and/or hardware to enable communication with external devices using wired or wireless protocols. The wired protocol is, for example, any one or more of a serial port protocol, a parallel port protocol, an Ethernet protocol, a USB protocol, or other wired communication protocol. The wireless protocol is, for example, any of the IEEE 802.11 Wi-Fi protocols, a cellular network communication protocol, or the like.
Memory 1304 includes a single memory or one or more memories or storage locations, including but not limited to Random Access Memory (RAM), Dynamic Random Access Memory (DRAM), Static Random Access Memory (SRAM), Read-Only Memory (ROM), EPROM, EEPROM, flash memory, logic blocks of an FPGA, a hard disk, or any other layer of a memory hierarchy. The memory 1304 may be used to store any type of instructions, software, or algorithms, including instructions 1305 for controlling the general functions and operations of the electronic device 1300.
The processor 1303 controls the general operation of the electronic device 1300. The processor 1303 may include, but is not limited to, a CPU, a hardware microprocessor, a hardware processor, a multi-core processor, a single-core processor, a microcontroller, an Application Specific Integrated Circuit (ASIC), a DSP, or other similar processing device capable of executing any type of instructions, algorithms, or software for controlling the operation and functions of the electronic device 1300 according to embodiments described in this disclosure. The processor 1303 may be various implementations of digital circuitry, analog circuitry, or mixed-signal (a combination of analog and digital) circuitry that performs functions in a computing system. The processor 1303 may include, for example, a portion or circuit such as an Integrated Circuit (IC), an individual processor core, an entire processor core, an individual processor, a programmable hardware device such as a Field Programmable Gate Array (FPGA), and/or a system including multiple processors.
Internal bus 1306 may be used to establish communications between components of electronic device 1300.
Although electronic device 1300 is described using particular components, in alternative embodiments, different components may be present in electronic device 1300. For example, electronic device 1300 may include one or more additional processors, memory, network interfaces, and/or I/O interfaces. In addition, one or more of the components may not be present in the electronic device 1300. Additionally, although separate components are shown in fig. 13, in some embodiments, some or all of a given component may be integrated into one or more of the other components in electronic device 1300.
The underwater sound target recognition method and the underwater sound target recognition model according to the embodiments of the present disclosure may be applied to or implemented by a processor. The processor may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or by instructions in the form of software. The processor may be a general-purpose processor, a digital signal processor (DSP), an ASIC, a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The various methods, steps, and logic blocks disclosed in the embodiments of the present disclosure may be implemented or performed thereby. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with the embodiments of the present disclosure may be embodied directly in hardware, in a decoding processor, or in a combination of hardware and software modules in a decoding processor. The software modules may be located in a random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, or other storage media well known in the art. The storage medium is located in a memory, and the processor reads the information in the memory and, in combination with its hardware, performs the steps of the above method.
As described above, the underwater sound target recognition model and the underwater sound target recognition method generated according to the embodiments of the present disclosure can be applied to an artificial intelligence computing chip with low power consumption.
Embodiments of the present disclosure may be implemented as any combination of computer programs or program products on devices, systems, integrated circuits, and non-transitory computer readable media.
It should be understood that computer-executable instructions in a computer-readable storage medium or program product according to embodiments of the present disclosure may be configured to perform operations corresponding to the above-described apparatus and method embodiments. Embodiments of a computer readable storage medium or program product will be apparent to those skilled in the art when referring to the above-described apparatus and method embodiments, and thus the description will not be repeated. Computer readable storage media and program products for carrying or comprising the above-described computer-executable instructions are also within the scope of the present disclosure. Such a storage medium may include, but is not limited to, floppy disks, optical disks, magneto-optical disks, memory cards, memory sticks, and the like.
In addition, it should be understood that the series of processes and devices described above may also be implemented in software and/or firmware. In the case of implementation by software and/or firmware, a corresponding program constituting the corresponding software is stored in a storage medium of the relevant device, and when the program is executed, various functions can be performed.
For example, a plurality of functions included in one unit in the above embodiments may be implemented by separate devices. Alternatively, the functions realized by the plurality of units in the above embodiments may be realized by separate devices, respectively. In addition, one of the above functions may be implemented by a plurality of units. Such configurations are included within the technical scope of the present disclosure.
In the present disclosure, the steps described in the flowcharts include not only processes performed in time series in the order described, but also processes performed in parallel or individually, not necessarily in time series. Further, even in the steps of time-series processing, the order may be appropriately changed.
The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a/an …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The term "or" in this disclosure means an inclusive "or" rather than an exclusive "or". References to a "first" component do not necessarily require the provision of a "second" component. Furthermore, unless explicitly indicated otherwise, reference to "a first" or "a second" component does not mean that the referenced component is limited to a particular order. The term "based on" means "based at least in part on.
Claims (18)
1. A method of generating an underwater sound target recognition model, comprising:
training a first deep learning model based on source domain training data, the source domain training data comprising sound data, and the first deep learning model comprising a first binarized deep convolutional neural network and a first classifier, the first classifier comprising a fully connected layer and a classification layer;
migrating model parameters of a first binarized deep convolutional neural network of the trained first deep learning model to a second binarized deep convolutional neural network of a second deep learning model, the first binarized deep convolutional neural network and the second binarized deep convolutional neural network having the same structure, the second deep learning model including a second classifier, and the second classifier including a fully connected layer, a recurrent neural network, and a classification layer; and
Training the second deep learning model based on target domain training data, wherein the target domain training data comprises underwater target sound data, and in the process of training the second deep learning model, model parameters of a second binarization deep convolutional neural network are fixed and model parameters of a second classifier are updated.
2. The method of claim 1, wherein the first and second binarized deep convolutional neural networks comprise one input convolutional layer and n+1 dense blocks and n transition blocks, respectively, alternately arranged.
3. The method of claim 2, wherein,
each dense block includes a plurality of interconnected dense layer units, each dense layer unit including a scaling factor, a binarized convolution layer, a batch normalization layer, and a binarized activation layer, and
in each dense block, the input feature map of each dense layer unit is multiplied by the scaling factor to obtain a scaled input feature map, which is concatenated with the output feature map of the dense layer unit to serve as the input feature map of the next dense layer unit.
4. The method of claim 2, wherein,
each dense block comprises an even number of interconnected dense layer units, each dense layer unit comprises a binarization convolution layer, a batch normalization layer and a binarization activation layer,
in each pair of odd-numbered and subsequently adjacent even-numbered dense layer cells, there is a scaling factor by which the input feature map of the odd-numbered dense layer cell of the pair is multiplied to obtain a scaled input feature map which is concatenated to the output feature map of the even-numbered dense layer cell of the pair as the input feature map of the odd-numbered dense layer cell of the next pair.
5. The method of claim 2, wherein each transition block comprises one transition layer unit, each transition layer unit comprising a binarized convolution layer, a batch normalization layer, and a binarized activation layer, the convolution step m of the binarized convolution layer being an integer greater than 1.
6. The method of claim 2, wherein the first and second binarized deep convolutional neural networks comprise five dense blocks and four transition blocks, respectively, that are alternately arranged.
7. The method of any of claims 3 to 6, wherein the migrated model parameters include at least one of: the weight matrix of the convolution layer, the bias matrix of the convolution layer and the parameters of the batch normalization layer.
8. The method of any one of claims 1 to 6, wherein the recurrent neural network of the second classifier is a BiGRU network.
9. The method of any of claims 1 to 6, further comprising: slicing the target domain training data for a predetermined length of time to obtain slices of the target domain training data before providing the target domain training data to the second deep learning model, and
training the second deep learning model based on the target domain training data comprises sequentially and respectively inputting the obtained slices of the target domain training data into a second binarization deep convolutional neural network of the second deep learning model for feature extraction, and providing the extracted features for a second classifier for classification processing.
10. The method of any of claims 1 to 6, further comprising:
before the data is supplied to the first deep learning model and the second deep learning model, the data is subjected to mel-frequency cepstral coefficient MFCC feature extraction processing to obtain MFCC feature data of the data.
11. The method of claim 10, further comprising:
the MFCC feature data is sliced in the direction of the time dimension of the MFCC feature data.
12. The method of any of claims 1 to 6, further comprising:
the data is subjected to a wavelet analysis process to process baseline distortion and random noise before being provided to the first and second deep learning models.
13. The method of any of claims 1 to 6, wherein the source domain training data comprises at least one of: sound data and interference data simulated by a simulator, existing non-underwater-target sound data, and existing underwater target sound data.
14. The method of any of claims 1 to 6, wherein the target domain training data comprises at least one of: corresponding underwater target sound data respectively acquired for various types of underwater sound targets.
15. A method of underwater acoustic target identification comprising:
identifying the underwater sound data to be identified using an underwater sound target identification model generated according to the method of any of claims 1 to 14 to determine the category of the underwater sound data to be identified.
16. An electronic device, comprising:
a processor; and
a memory communicatively coupled to the processor and storing computer readable instructions that, when executed by the processor, cause the electronic device to perform the method of any of claims 1-15.
17. A computer readable storage medium storing computer readable instructions which, when executed by a processor of an electronic device, cause the electronic device to perform the method of any one of claims 1-15.
18. A computer program product comprising computer readable instructions which, when executed by a processor of an electronic device, cause the electronic device to perform the method of any of claims 1-15.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202311817696.2A CN117765936A (en) | 2023-12-26 | 2023-12-26 | Method for generating underwater sound target recognition model and underwater sound target recognition method |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202311817696.2A CN117765936A (en) | 2023-12-26 | 2023-12-26 | Method for generating underwater sound target recognition model and underwater sound target recognition method |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN117765936A true CN117765936A (en) | 2024-03-26 |
Family
ID=90321864
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202311817696.2A Pending CN117765936A (en) | 2023-12-26 | 2023-12-26 | Method for generating underwater sound target recognition model and underwater sound target recognition method |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN117765936A (en) |
Cited By (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119313811A (en) * | 2024-09-12 | 2025-01-14 | 中国科学院自动化研究所 | Underwater three-dimensional reconstruction method, device, electronic device and storage medium for single sonar image |
| CN119993123A (en) * | 2025-01-23 | 2025-05-13 | 江西职业技术大学 | A signal feature extraction method, device, terminal equipment and storage medium |
| CN119889498A (en) * | 2025-03-25 | 2025-04-25 | 成都大数据产业技术研究院有限公司 | Water quality parameter on-line monitoring method based on deep learning and continuous spectrum |
| CN119889498B (en) * | 2025-03-25 | 2025-06-20 | 成都大数据产业技术研究院有限公司 | Water quality parameter on-line monitoring method based on deep learning and continuous spectrum |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| JP7337953B2 (en) | Speech recognition method and device, neural network training method and device, and computer program | |
| CN110245608B (en) | An underwater target recognition method based on semi-tensor product neural network | |
| KR102235568B1 (en) | Environment sound recognition method based on convolutional neural networks, and system thereof | |
| Xie et al. | Adaptive ship-radiated noise recognition with learnable fine-grained wavelet transform | |
| CN117765936A (en) | Method for generating underwater sound target recognition model and underwater sound target recognition method | |
| CN113488060B (en) | Voiceprint recognition method and system based on variation information bottleneck | |
| CN110459225B (en) | Speaker recognition system based on CNN fusion characteristics | |
| CN108922513B (en) | Voice distinguishing method and device, computer equipment and storage medium | |
| Guiming et al. | Speech recognition based on convolutional neural networks | |
| CN111899757B (en) | Single-channel voice separation method and system for target speaker extraction | |
| CN112183582A (en) | A multi-feature fusion method for underwater target recognition | |
| Wei et al. | A method of underwater acoustic signal classification based on deep neural network | |
| CN115273814A (en) | Pseudo-voice detection method, device, computer equipment and storage medium | |
| CN109741733B (en) | Speech Phoneme Recognition Method Based on Consistent Routing Network | |
| CN117974736B (en) | Underwater sensor output signal noise reduction method and system based on machine learning | |
| CN118471262A (en) | Underwater sound target identification method and system based on sparse channel attention mechanism | |
| Wang et al. | Adaptive underwater acoustic target recognition based on multi-scale residual and attention mechanism | |
| Ying et al. | Design of speech emotion recognition algorithm based on deep learning | |
| CN114818789B (en) | A method for identifying ship radiated noise based on data enhancement | |
| CN115295007A (en) | Underwater sound target identification method, device and equipment | |
| Lü et al. | Dual-feature fusion learning: An acoustic signal recognition method for marine mammals | |
| Wang et al. | Underwater acoustic target recognition combining multi-scale features and attention mechanism | |
| Liu et al. | Underwater acoustic classification using wavelet scattering transform and convolutional neural network with limited dataset | |
| CN117312946A (en) | Underwater sound signal identification method based on multi-branch trunk external attention network | |
| Hu et al. | A deep learning method for ship-radiated noise recognition based on mfcc feature |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination |