CN119919932B

CN119919932B - Agricultural product classification method integrating dual-stream attention integration and cross-modal fusion

Info

Publication number: CN119919932B
Application number: CN202510415356.XA
Authority: CN
Inventors: 任晓鹏; 卞小艺; 王永梅; 方国武; 后睿晗; 侯智超; 吴红松
Original assignee: Anhui Agricultural University AHAU
Current assignee: Anhui Agricultural University AHAU
Priority date: 2025-04-03
Filing date: 2025-04-03
Publication date: 2025-08-22
Anticipated expiration: 2045-04-03
Also published as: CN119919932A

Abstract

The invention belongs to the technical field of artificial intelligence and provides an agricultural product classification method integrating dual-stream attention integration and cross-modal fusion. The method comprises: integrating a plurality of text feature vectors through adaptive layer weights, integrating a plurality of image feature vectors through a self-attention mechanism, and obtaining the integrated text feature and image feature by weighted summation; applying a high-power transformation to the integrated text feature and image feature respectively, and concatenating the transformed features with the original features to obtain an enhanced text feature and an enhanced image feature; constructing a relation matrix between the text and image features by calculating Pearson correlation coefficients, and weighting and fusing the text and image features according to the relation matrix to obtain the final fused feature; and inputting the fused feature vector into an MLP classifier. The method improves the model's ability to model and learn complex patterns, and the cross-modal fusion further improves the accuracy and generalization ability of agricultural product classification.

Description

Agricultural product classification method integrating dual-stream attention integration and cross-modal fusion

Technical Field

The invention belongs to the technical field of artificial intelligence, and particularly relates to an agricultural product classification method integrating dual-stream attention integration and cross-modal fusion.

Background

Currently, in agricultural product classification research, existing methods generally use computer vision to extract features of agricultural products and classify them from image information. However, images of agricultural products are strongly affected by illumination and the shooting environment, so image data alone often fails to reflect the product information accurately, which makes classification more difficult.

Therefore, analysis based on the image modality alone lacks robustness.

The related art of image-based agricultural product classification has the following defects. First, the products are classified through a single visual modality only; factors such as illumination, shadow and shooting angle strongly affect the visual features, and these external factors introduce errors into the classification results, so the requirements of modern agricultural product classification are difficult to meet. Second, features extracted by a single model tend to be homogeneous and lack diversity, so they cannot fully reflect all relevant information about the product, and information is easily lost. Third, the features extracted in the prior art are raw features that struggle to capture higher-level semantic information about the product; the interactions among different features are not exploited, and the correlations in multi-modal information cannot be modeled effectively.

Disclosure of Invention

The embodiment of the invention aims to provide an agricultural product classification method integrating dual-stream attention integration and cross-modal fusion, so as to solve the technical problems described in the background.

In order to achieve the above purpose, the present invention provides the following technical solutions.

An agricultural product classification method integrating dual-stream attention integration and cross-modal fusion comprises the following steps:

extracting features of the preprocessed text data by using a text feature model to obtain a plurality of text feature vectors;

extracting features of the preprocessed image data by using an image feature model to obtain a plurality of image feature vectors;

Integrating a plurality of text feature vectors through self-adaptive layer weights, integrating a plurality of image feature vectors through a self-attention mechanism, and obtaining integrated text features and image features through weighted summation;

Respectively carrying out high-power transformation on the integrated text features and the image features, and splicing the transformed features with the original features to obtain enhanced text features and enhanced image features;

Constructing a relation matrix between the text and the image features by calculating the Pearson correlation coefficient, and weighting and fusing the text and the image features according to the relation matrix to obtain final fused features;

Inputting the fused feature vector into an MLP classifier, and mapping the fused feature to a corresponding agricultural product classification category by the MLP classifier to obtain an agricultural product classification result.

Further, the step of extracting features from the preprocessed text data by using the text feature model includes:

Using a TF-IDF model to extract the text feature vector T_TF-IDF at the statistical information level;

Converting words into high-dimensional vectors using a Word2Vec model, wherein the continuous bag-of-words (CBOW) model predicts the center word from its context in the agricultural product data set; after CBOW training, each word is mapped to a fixed-dimension vector, yielding the text feature vector T_Word2Vec;

Using a BERT model to extract the deep-semantic contextual feature vector T_BERT.

Further, the step of extracting features from the preprocessed image data by using the image feature model includes:

Using a SIFT model, images at different scales are constructed through a Gaussian pyramid; SIFT finds local extrema of the image through the difference-of-Gaussians (DoG) pyramid, assigns an orientation to each keypoint by calculating the gradient directions around it, and uses the gradient magnitudes and gradient directions to compute gradient orientation histograms over the 16 sub-regions of the keypoint, finally forming the 128-dimensional feature vector V_SIFT;

Using a HOG model to capture edge and shape features, wherein HOG calculates the gradients of the image in the X and Y directions, calculates the gradient magnitude and gradient direction, divides the image into 8×8 regions, and accumulates a 9-direction gradient histogram in each region to form the 3780-dimensional HOG feature vector V_HOG;

Using an LBP model to extract local texture features as a low-dimensional feature vector, wherein LBP calculates the LBP value by comparing each pixel with its neighboring pixels and computes the LBP histogram to form the (P+2)-dimensional feature vector V_LBP.

Further, the step of integrating the plurality of text feature vectors by the adaptive layer weights includes:

the text feature vectors T_TF-IDF, T_Word2Vec and T_BERT are mapped to the same dimension d, denoted as:

$$T'_i = W_i T_i,\quad i \in \{\text{TF-IDF}, \text{Word2Vec}, \text{BERT}\};$$

wherein W_i represents a learnable parameter for mapping the different features to the same dimension, T_i represents a text feature vector, and T'_i represents the text feature vector after dimension mapping;

the weights are calculated, the weight w being expressed as:

$$w = \mathrm{softmax}\left(W_{attn}\,[T'_{TF\text{-}IDF};\, T'_{Word2Vec};\, T'_{BERT}]\right);$$

wherein w is a three-dimensional vector holding one normalized weight per feature, and W_attn represents a learnable attention weight matrix;

after the normalized weight of each feature is obtained, the fused feature vector T_text is calculated by weighted fusion and is expressed as:

$$T_{text} = w_1 T'_{TF\text{-}IDF} + w_2 T'_{Word2Vec} + w_3 T'_{BERT};$$

wherein w₁, w₂ and w₃ represent the normalized weights, and T'_TF-IDF, T'_Word2Vec and T'_BERT represent the dimension-mapped feature vectors.
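The adaptive layer-weight integration of the three text features can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the implementation of the invention: the projection matrices and W_attn are filled with random numbers in place of learned parameters, the toy dimensions are arbitrary, and the function name `fuse_text_features` is hypothetical.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def fuse_text_features(feats, proj, w_attn):
    """feats: three text feature vectors; proj: three (d x d_i) matrices W_i;
    w_attn: (3 x 3d) attention weight matrix. Returns (T_text, w)."""
    mapped = [W @ t for W, t in zip(proj, feats)]        # T'_i = W_i T_i
    w = softmax(w_attn @ np.concatenate(mapped))         # three normalized weights
    t_text = sum(wi * ti for wi, ti in zip(w, mapped))   # weighted fusion
    return t_text, w

rng = np.random.default_rng(0)
dims, d = [6, 4, 8], 5        # assumed sizes of T_TF-IDF, T_Word2Vec, T_BERT and of d
feats = [rng.normal(size=n) for n in dims]
proj = [rng.normal(size=(d, n)) for n in dims]           # random stand-ins for W_i
w_attn = rng.normal(size=(3, 3 * d))                     # random stand-in for W_attn
t_text, w = fuse_text_features(feats, proj, w_attn)
```

In a trained model the parameters W_i and W_attn would be learned jointly with the classifier; the sketch only shows the data flow from three vectors of different sizes to one d-dimensional T_text.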

Further, the step of integrating the plurality of image feature vectors by a self-attention mechanism includes:

The Query, Key and Value in the self-attention mechanism are each derived from V_SIFT, V_HOG and V_LBP; the resulting Query vectors are denoted Q_SIFT, Q_HOG and Q_LBP, the Key vectors K_SIFT, K_HOG and K_LBP, and the Value vectors V'_SIFT, V'_HOG and V'_LBP;

the similarity score between each pair of features is obtained by calculating the dot product of the Query and the Key, expressed as:

$$\text{Attention Score}_{ij} = \frac{Q_i K_j^{T}}{\sqrt{d_k}};$$

wherein i and j are the indexes of the features, d_k is the feature dimension, Q_i is the Query vector of the i-th feature, K_j is the Key vector of the j-th feature, and T denotes the transpose operation;

a softmax operation is applied to the scores to obtain the normalized weights a_ij, expressed as a_ij = softmax(Attention Score_ij), wherein the weight a_ij represents the degree of attention of the i-th feature to the j-th feature;

the features are weighted and summed using the calculated attention weights, expressed as:

$$V_{image} = \alpha_{SIFT} V'_{SIFT} + \alpha_{HOG} V'_{HOG} + \alpha_{LBP} V'_{LBP};$$

wherein α_SIFT, α_HOG and α_LBP represent the contributions of the three image features SIFT, HOG and LBP to the final fused feature, and the Value vectors of the three features are weighted and summed to obtain the weighted fused image feature vector V_image.
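The self-attention integration of the three image features can be sketched as follows. Two points are assumptions of this sketch rather than statements of the text: the Query, Key and Value vectors are produced by linear projections to a common dimension d_k (random matrices stand in for learned ones), and each contribution α_j is taken as the average attention that feature j receives, i.e. the column mean of the a_ij matrix. The function name `fuse_image_features` is hypothetical.

```python
import numpy as np

def softmax_rows(s):
    # Row-wise softmax; subtracting the row maximum avoids overflow.
    e = np.exp(s - s.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def fuse_image_features(feats, wq, wk, wv):
    """feats: [V_SIFT, V_HOG, V_LBP]; wq, wk, wv: per-feature projection matrices."""
    q = np.stack([W @ f for W, f in zip(wq, feats)])   # Query vectors, shape (3, d_k)
    k = np.stack([W @ f for W, f in zip(wk, feats)])   # Key vectors
    v = np.stack([W @ f for W, f in zip(wv, feats)])   # Value vectors V'
    d_k = q.shape[1]
    scores = q @ k.T / np.sqrt(d_k)                    # Attention Score_ij
    a = softmax_rows(scores)                           # a_ij, each row sums to 1
    alpha = a.mean(axis=0)                             # assumed definition of alpha_j
    v_image = alpha @ v                                # weighted sum of Value vectors
    return v_image, a, alpha

rng = np.random.default_rng(0)
dims, d_k = [128, 3780, 10], 16                        # SIFT, HOG, LBP sizes; d_k assumed
feats = [rng.normal(size=n) for n in dims]
make = lambda: [rng.normal(scale=0.01, size=(d_k, n)) for n in dims]
v_image, a, alpha = fuse_image_features(feats, make(), make(), make())
```

Because every row of a_ij sums to 1, the column means also sum to 1, so V_image is a convex combination of the three Value vectors.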

Further, the step of performing high-power transformation on the integrated text feature and image feature and concatenating the transformed features with the original features includes performing an element-wise power operation on the text feature vector T_text and the image feature vector V_image respectively, expressed as:

$$T'_{power} = T_{text}^{\odot k} \quad\text{and}\quad V'_{power} = V_{image}^{\odot k};$$

wherein ⊙ denotes the element-wise power operation, which raises each element of T_text and V_image to the k-th power, and T'_power and V'_power are the features after high-power transformation;

for a text feature vector T_text = [t₁, t₂, t₃], where t₁ is the word weight extracted by TF-IDF, t₂ is the word vector component calculated by Word2Vec and t₃ is the contextual semantic feature extracted by BERT, taking k = 2 gives:

$$T_{text}^{\odot 2} = [t_1^2,\ t_2^2,\ t_3^2];$$

thus the element-wise power operation yields the high-power feature expansion:

$$T'_{power} = T^{(k)} = T_{text}^{\odot k},\quad V'_{power} = V^{(k)} = V_{image}^{\odot k};$$

wherein T^(k) represents the k-th power text feature vector and V^(k) represents the k-th power image feature vector;

the high-order features obtained after transformation are concatenated with the original features, the concatenation being expressed as:

$$T'_{text} = \mathrm{Concat}(T_{text}, T'_{power}),\quad V'_{image} = \mathrm{Concat}(V_{image}, V'_{power});$$

where T_text and V_image are the original features and T'_power and V'_power are the power-transformed features;

the enhanced features T'_text and V'_image are obtained after concatenation.
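The power transformation and concatenation amount to a few array operations. The following Python sketch uses k = 2 as in the example above; the function name `enhance` and the sample values are illustrative only.

```python
import numpy as np

def enhance(x, k=2):
    """Raise every element to the k-th power and append the result to the original."""
    x = np.asarray(x, dtype=float)
    powered = x ** k                       # element-wise power, e.g. [t1^2, t2^2, t3^2]
    return np.concatenate([x, powered])    # Concat(original, power-transformed)

t_text = np.array([0.5, -1.0, 2.0])        # toy T_text = [t1, t2, t3]
t_enh = enhance(t_text, k=2)               # enhanced feature with twice the length
```

Note that the enhanced vector has twice the dimension of the input, and that an even k discards the sign of each element in the power-transformed half while the original half preserves it.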

Further, the step of constructing a relation matrix between the text and the image features by calculating pearson correlation coefficients, and weighting and fusing the text and the image features according to the relation matrix to obtain final fused features comprises the following steps:

The cross-modal relation matrix R is constructed by calculating the Pearson correlation coefficient ρ(e_i, e_j), expressed as:

$$\rho(e_i, e_j) = \frac{\mathrm{cov}(e_i, e_j)}{\sigma_{e_i}\,\sigma_{e_j}} = \frac{E\left[(e_i - \mu_{e_i})(e_j - \mu_{e_j})\right]}{\sigma_{e_i}\,\sigma_{e_j}};$$

wherein cov(e_i, e_j) is the covariance of e_i and e_j, σ_ei and σ_ej are the standard deviations of e_i and e_j, and μ_ei and μ_ej are the means of e_i and e_j, respectively;

T_text and V_image are substituted for e_i and e_j respectively, and ρ(e_i, e_j) is used to construct the relation fusion matrix;

the Pearson correlation coefficients thus yield the text-to-image relation fusion matrix R_TI; V_image and T_text are then substituted for e_i and e_j and the above steps are repeated to obtain the image-to-text relation matrix R_IT, wherein m_L is the number of text and image features, the dimension of the matrix R is m_L × m_L, and R_TI reflects the correlation between the text features and the image features;

after the relation matrices R_TI and R_IT are obtained, the image features and the text features are weighted respectively, wherein the image features are weighted by the text-to-image relation matrix R_TI, expressed as:

$$I_{fused} = R_{TI} \cdot V'_{image},$$

where I_fused represents the image features weighted by the text-to-image relation matrix, and the text features are weighted by the image-to-text relation matrix R_IT, expressed as:

$$T_{fused} = R_{IT} \cdot T'_{text},$$

where T_fused represents the text features weighted by the image-to-text relation matrix;

T'_text and V'_image are the enhanced text features and image features, respectively;

finally, the text features and the image features are fused by weighted averaging, expressed as F_fused = αT_fused + βI_fused, wherein F_fused is the final fused feature vector, and α and β are hyperparameters that control the contributions of the text features and the image features to the final fused feature and can be adjusted dynamically according to the importance of each modality.
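The relation-matrix fusion can be sketched in Python as follows. The text does not state over which observations the correlation is taken, so this sketch makes two explicit assumptions: the enhanced features are available for a batch of n samples (one row per sample), and both modalities share the same feature dimension m_L. Under these assumptions R_TI[i, j] is the Pearson correlation between text dimension i and image dimension j across the batch. The function names are hypothetical.

```python
import numpy as np

def pearson_matrix(a, b, eps=1e-12):
    """a, b: arrays of shape (n_samples, m). Returns R with R[i, j] = rho(a[:, i], b[:, j])."""
    a_c = a - a.mean(axis=0)                         # subtract the means mu
    b_c = b - b.mean(axis=0)
    cov = a_c.T @ b_c / a.shape[0]                   # covariance matrix, shape (m, m)
    std = np.outer(a.std(axis=0), b.std(axis=0))     # sigma_i * sigma_j
    return cov / (std + eps)                         # eps guards against zero variance

def cross_modal_fuse(t_enh, v_enh, alpha=0.5, beta=0.5):
    r_ti = pearson_matrix(t_enh, v_enh)              # text-to-image relation matrix
    r_it = pearson_matrix(v_enh, t_enh)              # image-to-text relation matrix
    i_fused = v_enh @ r_ti.T                         # per sample: R_TI · V'_image
    t_fused = t_enh @ r_it.T                         # per sample: R_IT · T'_text
    return alpha * t_fused + beta * i_fused, r_ti, r_it

rng = np.random.default_rng(0)
t_enh = rng.normal(size=(50, 4))                     # toy enhanced text features
v_enh = 0.5 * t_enh + rng.normal(size=(50, 4))       # toy image features, partly correlated
f_fused, r_ti, r_it = cross_modal_fuse(t_enh, v_enh)
```

With this definition R_IT is simply the transpose of R_TI, and α = β = 0.5 is an arbitrary default rather than a value taken from the text.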

Further, in the step of inputting the fused feature vector into an MLP classifier, the MLP classifier maps the fused feature to a corresponding agricultural product classification category to obtain an agricultural product classification result, the MLP classifier is based on a neural network design and comprises three layers, namely an input layer, a hidden layer and an output layer, wherein different layers of the neural network of the MLP are fully connected, and the MLP is characterized in that:

The input layer receives the fused feature vector F_fused and passes it to the first fully connected layer;

the first fully connected layer maps the input feature vector F_fused to the hidden layer space, denoted as h₁ = W₁F_fused + b₁, where W₁ is the weight matrix of the first layer, b₁ is a bias term, and h₁ represents the feature vector F_fused after linear transformation;

non-linearity is then introduced using the ReLU activation function:

$$\mathrm{ReLU}(h_1) = \max(0, h_1),$$

which sets negative values to 0 and keeps positive values;

the output of the first layer is passed to the second fully connected layer for processing, the hidden-layer neurons being denoted h₂, calculated as:

$$h_2 = W_2\,\mathrm{ReLU}(h_1) + b_2;$$

wherein W₂ is the weight matrix of the second layer and b₂ is a bias term; the output of the second layer is transformed non-linearly by the ReLU activation function: ReLU(h₂) = max(0, h₂);

the output layer outputs the result of these successive transformations, giving the raw score of the output layer: z = W₃ReLU(h₂) + b₃, wherein W₃ represents the weight matrix of the output layer, b₃ represents the output layer bias term, and z represents the raw score of the output layer;

the Softmax activation function converts the output layer score z into class probabilities, expressed as:

$$\hat{y}_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}};$$

wherein e^{z_i} represents the exponentially transformed score, e is the base of the natural logarithm, and z_i is the raw score computed by the output layer of the neural network; Σ_{j=1}^{C} e^{z_j} is the normalization factor that sums the exponentially transformed scores of all classes so that the class probabilities sum to 1; C represents the number of agricultural product classes, and ŷ_i represents the predicted probability of the i-th class;

after the probability of each class is obtained, the cross-entropy loss is calculated between the class probabilities predicted by the model and the one-hot encoding of the true class, measuring the difference between the model output probabilities and the true label, the cross-entropy loss function being:

$$L = -\sum_{i=1}^{C} y_i \log(\hat{y}_i);$$

wherein y_i is the one-hot encoding of the true label and ŷ_i is the model's predicted probability for class i;

In training the MLP model, the cross-entropy loss function and the back-propagation algorithm are used to obtain the gradient of the loss with respect to each weight, and the weights are then updated dynamically: in each training epoch, the prediction is computed by forward propagation, the model error is evaluated with the loss function, the gradients are computed by back-propagation, and the model parameters are updated by an optimization algorithm, so that the weights are optimized continuously until the loss function converges.
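The forward pass, Softmax, cross-entropy loss and one form of weight update can be sketched in Python with NumPy. Several choices here are assumptions of the sketch: samples are stored as rows (so the layers compute xW + b rather than Wx + b), plain full-batch gradient descent stands in for the unspecified optimization algorithm, and the data, labels, layer sizes and learning rate are synthetic.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(z):
    # Row-wise softmax; subtracting the max keeps exp() from overflowing.
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def forward(params, x):
    w1, b1, w2, b2, w3, b3 = params
    h1 = x @ w1 + b1               # first fully connected layer
    h2 = relu(h1) @ w2 + b2        # second fully connected layer
    z = relu(h2) @ w3 + b3         # raw output scores
    return h1, h2, softmax(z)

def cross_entropy(y_onehot, y_hat):
    # Batch mean of -sum_i y_i log(y_hat_i).
    return float(-np.mean(np.sum(y_onehot * np.log(y_hat + 1e-12), axis=1)))

def train_step(params, x, y_onehot, lr=0.1):
    w1, b1, w2, b2, w3, b3 = params
    h1, h2, y_hat = forward(params, x)
    dz = (y_hat - y_onehot) / x.shape[0]       # gradient of the loss w.r.t. z
    dh2 = (dz @ w3.T) * (h2 > 0)               # back-propagate through ReLU
    dh1 = (dh2 @ w2.T) * (h1 > 0)
    grads = (x.T @ dh1, dh1.sum(axis=0),
             relu(h1).T @ dh2, dh2.sum(axis=0),
             relu(h2).T @ dz, dz.sum(axis=0))
    return tuple(p - lr * g for p, g in zip(params, grads))

rng = np.random.default_rng(0)
n, d_in, d_h, n_classes = 64, 6, 16, 3         # toy sizes (assumed)
x = rng.normal(size=(n, d_in))                 # stand-in for F_fused
labels = x[:, :n_classes].argmax(axis=1)       # synthetic class labels
y_onehot = np.eye(n_classes)[labels]
shapes = [(d_in, d_h), (d_h,), (d_h, d_h), (d_h,), (d_h, n_classes), (n_classes,)]
params = tuple(rng.normal(scale=0.3, size=s) for s in shapes)

loss_before = cross_entropy(y_onehot, forward(params, x)[2])
for _ in range(200):
    params = train_step(params, x, y_onehot)
loss_after = cross_entropy(y_onehot, forward(params, x)[2])
```

The gradient of the combined Softmax and cross-entropy with respect to z reduces to ŷ − y, which is why the backward pass starts from that difference.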

Compared with the prior art, the agricultural product classification method integrating dual-stream attention integration and cross-modal fusion has the following beneficial effects:

First, the invention uses dual-stream attention integration to extract multi-level features from the text modality and the image modality separately: TF-IDF, Word2Vec and BERT models extract three different levels of features from the text data, and SIFT, HOG and LBP models extract three different levels of features from the image data; the text features are then integrated through adaptive layer weights and the image features through a self-attention mechanism, giving one fused feature vector per modality;

Second, the features of the two modalities are combined by constructing a relation fusion matrix, which lets the method adapt to complex environments with class imbalance and high data noise and improves the generalization ability of the model; when processing large-scale complex data sets, the method effectively estimates the importance of the features, suppresses interference from redundant information, improves the discriminative power of the feature representation, and improves the accuracy and stability of the classification decision.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the following description will briefly introduce the drawings that are needed in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are only some embodiments of the present invention.

FIG. 1 is a flow diagram of a method of agricultural product classification incorporating dual-flow attention integration and cross-modality fusion of the present invention;

FIG. 2 is a logical framework diagram of the agricultural product classification method of the present invention incorporating dual-flow attention integration and cross-modality fusion;

FIG. 3 is a sub-flowchart of the agricultural product classification method of the present invention incorporating dual-flow attention integration and cross-modality fusion;

FIG. 4 is another sub-flowchart of the agricultural product classification method of the present invention incorporating dual-flow attention integration with cross-modality fusion.

Detailed Description

The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

Specific implementations of the invention are described in detail below in connection with specific embodiments.

In one embodiment of the invention, a method for classifying agricultural products is provided that combines dual-flow attention integration with cross-modal fusion, which first requires pre-processing text data and image data of the agricultural products;

during data preprocessing, stop words are removed from the agricultural product text data, missing values are handled and letter case is unified, after which the text is segmented into words;

In one implementation of text data processing, the agricultural product text data D_t is preprocessed, typically by removing stop words, handling missing values, converting formats and the like. Stop words can be removed with the sklearn toolkit, which provides CountVectorizer and TfidfVectorizer; both can remove stop words directly during preprocessing, reducing noise in the text. To keep data formats consistent, numeric fields can be converted to floating-point numbers or integers, date fields can be converted to a time format, and whitespace can be stripped;
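The text clean-up steps can be illustrated with a dependency-free Python sketch. It is only an illustration: the stop-word list is a tiny hypothetical English list, whitespace and punctuation splitting stands in for word segmentation (Chinese text would need a dedicated segmenter), and the function names `preprocess_text` and `parse_number` are invented for this example.

```python
import re

# Hypothetical stop-word list; a real pipeline would use a much fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "are", "in", "with"}

def preprocess_text(text):
    """Lower-case the text, split it into tokens and drop stop words."""
    if text is None:                       # treat a missing value as an empty document
        return []
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def parse_number(value, default=0.0):
    """Convert a numeric field to float, falling back to a default when it is malformed."""
    try:
        return float(str(value).strip())
    except ValueError:
        return default

tokens = preprocess_text("The Red Fuji apple is sweet and crisp")
weight = parse_number(" 3.5 ")
missing = parse_number("n/a")
```

Here `tokens` is `['red', 'fuji', 'apple', 'sweet', 'crisp']`, `weight` is `3.5` and `missing` falls back to `0.0`.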

In one implementation of image data processing, the resolution, size and other properties of the agricultural product image data D_v affect feature extraction, so the images are typically resized to a uniform size, normalized and augmented. Images are resized to 224×224 or 512×512 to ensure consistent input; because raw pixel values lie in [0, 255], a range large enough to affect gradient computation, they are scaled to [0, 1] or [-1, 1] to reduce numerical differences; and the images are randomly flipped, rotated and otherwise transformed to improve model robustness.
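A minimal NumPy sketch of these image preprocessing steps is given below. It assumes nearest-neighbor resampling for the resize (the text does not name an interpolation method), shows only horizontal flipping as augmentation, and uses a random array in place of a real photograph; all function names are illustrative.

```python
import numpy as np

def resize_nearest(img, size=(224, 224)):
    """Resize an (H, W, C) image by nearest-neighbor sampling."""
    h, w = img.shape[:2]
    rows = np.arange(size[0]) * h // size[0]     # source row for each output row
    cols = np.arange(size[1]) * w // size[1]     # source column for each output column
    return img[rows][:, cols]

def normalize(img, signed=False):
    """Scale pixel values from [0, 255] to [0, 1], or to [-1, 1] when signed=True."""
    x = img.astype(np.float32) / 255.0
    return x * 2.0 - 1.0 if signed else x

def random_flip(img, rng):
    """Flip the image left-to-right with probability 0.5."""
    return img[:, ::-1] if rng.random() < 0.5 else img

rng = np.random.default_rng(0)
raw = rng.integers(0, 256, size=(300, 400, 3), dtype=np.uint8)   # stand-in image
small = resize_nearest(raw)
unit = normalize(random_flip(small, rng))
signed = normalize(small, signed=True)
```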

Referring to fig. 1 and 2, the agricultural product classification method integrating dual-flow attention integration and cross-modal integration provided by the embodiment of the invention comprises the following steps:

s1, extracting features of the preprocessed text data by using a text feature model to obtain a plurality of text feature vectors;

s2, carrying out feature extraction on the preprocessed image data by utilizing an image feature model to obtain a plurality of image feature vectors;

In step S1, the text feature models used for feature extraction from the text data D_t include a TF-IDF model, a Word2Vec model and a BERT model, from which the feature vectors T_TF-IDF, T_Word2Vec and T_BERT are obtained;

Specifically, referring to fig. 3, the step of extracting features from the preprocessed text data by using the text feature model includes the following steps:

S11, extracting the text feature vector T_TF-IDF at the statistical information level by using a TF-IDF model;

the TF-IDF model provided by the embodiment of the invention is used to extract text features at the statistical information level, calculated as:

$$\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t),$$

wherein tf is calculated as tf(t, d) = i/n, and idf is calculated as:

$$\mathrm{idf}(t) = \log\frac{N}{\mathrm{df}(t)},$$

where tf represents the term frequency, idf represents the inverse document frequency, i represents the number of times word t occurs in document d, n represents the total number of words in document d, N represents the total number of documents, and df(t) represents the number of documents containing word t; the resulting feature vector is expressed as:

$$T_{TF\text{-}IDF} = [\text{tf-idf}(t_1, d),\ \text{tf-idf}(t_2, d),\ \ldots].$$
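The TF-IDF calculation can be worked through on a toy corpus in plain Python. The sketch uses the unsmoothed form idf(t) = log(N/df(t)); library implementations such as sklearn's TfidfVectorizer apply smoothing and normalization by default, so their numbers will differ. The three sample documents are invented.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of non-empty token lists. Returns (vocabulary, one TF-IDF vector per document)."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))   # documents containing each term
    vocab = sorted(df)
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([
            (counts[t] / len(doc)) * math.log(n_docs / df[t])  # tf(t, d) * idf(t)
            for t in vocab
        ])
    return vocab, vectors

docs = [["apple", "red", "apple"], ["pear", "green"], ["apple", "green"]]
vocab, vectors = tf_idf(docs)
```

For the first document, "apple" occurs 2 times out of 3 words and appears in 2 of 3 documents, so its weight is (2/3)·log(3/2); a word that is absent from a document gets weight 0.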

S12, converting words into high-dimensional vectors by using a Word2Vec model, wherein the continuous bag-of-words (CBOW) model predicts the center word from its context in the agricultural product data set; after CBOW training, each word is mapped to a fixed-dimension vector, yielding the text feature vector T_Word2Vec;

Specifically, the Word2Vec model is a neural-network-based distributed representation method that converts words into high-dimensional vectors so that semantically similar words lie closer together in the vector space; the CBOW model predicts the center word from its context in the agricultural product data set, and aims to calculate the probability P(w_t | context), given by:

$$P(w_t \mid \text{context}) = \frac{\exp\left({v'_{w_t}}^{T}\,\bar{v}\right)}{\sum_{w=1}^{V}\exp\left({v'_{w}}^{T}\,\bar{v}\right)};$$

wherein v'_{w_t} is the output word vector of the target word w_t, v̄ is the average of the context word vectors, and V is the vocabulary size;

after CBOW training, each word is mapped to a fixed-dimension vector, yielding the feature vector T_Word2Vec;
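The CBOW prediction step can be sketched in Python as a softmax over the vocabulary. The embedding matrices below are random placeholders for trained parameters, the vocabulary size and dimension are arbitrary, and the function name `cbow_probs` is hypothetical; training (updating the matrices to maximize the probability of the true center word) is not shown.

```python
import numpy as np

def cbow_probs(context_ids, w_in, w_out):
    """Return P(w | context) for every word w in the vocabulary.

    w_in:  (V, d) input embeddings, averaged over the context words.
    w_out: (V, d) output embeddings v'_w.
    """
    v_bar = w_in[context_ids].mean(axis=0)     # average of the context word vectors
    scores = w_out @ v_bar                     # v'_w^T v_bar for every word w
    e = np.exp(scores - scores.max())          # stable softmax over the vocabulary
    return e / e.sum()

rng = np.random.default_rng(0)
vocab_size, dim = 10, 4                        # toy sizes (assumed)
w_in = rng.normal(size=(vocab_size, dim))
w_out = rng.normal(size=(vocab_size, dim))
probs = cbow_probs([1, 2, 4, 5], w_in, w_out)  # context around a center word at index 3
```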

S13, extracting a context information feature vector T _BERT of the depth semantics by using the BERT model;

By using the BERT model, the invention obtains a high-dimensional vector representation of each word in the data set, which embodies text features with deep contextual semantics;

for a text data set, BERT first splits the input text into word pieces (tokens) using the WordPiece tokenizer and adds the special tokens [CLS] and [SEP] at the head and tail, i.e. [CLS], token₁, token₂, ..., token_n, [SEP];

the tokens are then input to the embedding layer, which produces the final input vectors;

context features are then extracted by a multi-layer Transformer encoder, which mainly uses self-attention, multi-head attention and a feed-forward neural network (FFN) to obtain a context-aware representation of the text;

finally, after the Transformer layers, BERT produces the vector representation T_BERT of each word;
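The tokenization step can be illustrated with a small Python sketch of greedy longest-match WordPiece splitting followed by the addition of [CLS] and [SEP]. The vocabulary is a toy set invented for this example, continuation pieces are marked with the conventional "##" prefix, and the embedding and Transformer encoder stages are not reproduced here.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into the longest matching vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand          # mark pieces that continue a word
            if cand in vocab:
                piece = cand
                break
            end -= 1                        # try a shorter candidate
        if piece is None:
            return [unk]                    # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

def encode(text, vocab):
    """Lower-case, split on whitespace, apply WordPiece and add [CLS] / [SEP]."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens

VOCAB = {"fresh", "straw", "##berry", "##berries", "apple", "##s"}   # toy vocabulary
tokens = encode("Fresh strawberries apples", VOCAB)
```

Here `tokens` is `['[CLS]', 'fresh', 'straw', '##berries', 'apple', '##s', '[SEP]']`.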

In step S2, the feature vectors V_SIFT, V_HOG and V_LBP are extracted from the image data D_v using the scale-invariant feature transform (SIFT), histogram of oriented gradients (HOG) and local binary pattern (LBP) models;

specifically, referring to fig. 4, the step of extracting features from the preprocessed image data by using the image feature model includes the following steps:

S21, using a SIFT model, images at different scales are constructed through a Gaussian pyramid; SIFT finds local extrema of the image through the difference-of-Gaussians (DoG) pyramid, assigns an orientation to each keypoint by calculating the gradient directions around it, and uses the gradient magnitudes and gradient directions to compute gradient orientation histograms over the 16 sub-regions of the keypoint, finally forming the 128-dimensional feature vector V_SIFT;

Illustratively, in step S21 of the present invention, SIFT constructs images at different scales through a Gaussian pyramid:

$$L(x, y, \sigma) = G(x, y, \sigma) * I(x, y);$$

wherein I(x, y) is an input image in the image data D_v, G(x, y, σ) is a Gaussian function, * denotes convolution, and L(x, y, σ) is the smoothed version of the input image at scale σ, i.e. the result of convolving the input image I(x, y) with the Gaussian function G(x, y, σ);

subsequently, SIFT finds the local extrema of the image through the difference-of-Gaussians (DoG) pyramid, D(x, y, σ) = L(x, y, kσ) − L(x, y, σ), where k is a scale-change factor;

subsequently, SIFT assigns the keypoint orientations by calculating the gradient around each keypoint as follows:

$$m(x, y) = \sqrt{\left(L(x+1, y) - L(x-1, y)\right)^2 + \left(L(x, y+1) - L(x, y-1)\right)^2},\quad \theta(x, y) = \arctan\frac{L(x, y+1) - L(x, y-1)}{L(x+1, y) - L(x-1, y)};$$

wherein m(x, y) is the gradient magnitude and θ(x, y) is the gradient direction;

finally, the gradient orientation histograms of the 16 sub-regions of the keypoint are calculated using the gradient magnitude m(x, y) and the gradient direction θ(x, y):

$$h_i = \sum_{(x, y)} m(x, y)\,\delta\left(\theta(x, y) - \theta_i\right);$$

wherein i is the direction index of the histogram, h_i is the accumulated gradient magnitude of the i-th direction in the orientation histogram, θ_i is the i-th angle in the orientation histogram, and δ is an indicator function, namely the Kronecker delta, used to determine which direction the gradient of a given pixel falls into;

each sub-region has 8 direction components, so the 16 sub-regions finally form the 128-dimensional feature vector V_SIFT = [h₁, h₂, ..., h₁₂₈];
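The way 16 sub-regions with 8 directions each produce a 128-dimensional vector can be made concrete with a simplified Python sketch. It is not a full SIFT implementation: keypoint detection, rotation to the dominant orientation, Gaussian weighting and interpolation between bins are all omitted, the 16×16 patch size is an assumption, and `arctan2` is used so that directions cover the full 0 to 2π range.

```python
import numpy as np

def sift_like_descriptor(patch):
    """patch: (16, 16) grayscale neighborhood of a keypoint, taken from the smoothed image L."""
    p = np.asarray(patch, dtype=float)
    dx = np.zeros_like(p)
    dy = np.zeros_like(p)
    dx[:, 1:-1] = p[:, 2:] - p[:, :-2]            # L(x+1, y) - L(x-1, y)
    dy[1:-1, :] = p[2:, :] - p[:-2, :]            # L(x, y+1) - L(x, y-1)
    mag = np.hypot(dx, dy)                        # gradient magnitude m(x, y)
    theta = np.arctan2(dy, dx) % (2 * np.pi)      # gradient direction in [0, 2*pi]
    bins = np.minimum((theta / (2 * np.pi) * 8).astype(int), 7)   # 8 direction bins
    desc = np.zeros((4, 4, 8))                    # 4 x 4 sub-regions, 8 directions each
    for r in range(16):
        for c in range(16):
            desc[r // 4, c // 4, bins[r, c]] += mag[r, c]          # accumulate magnitude
    v = desc.ravel()                              # 16 * 8 = 128 values
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v            # unit-length descriptor

rng = np.random.default_rng(0)
v_sift = sift_like_descriptor(rng.normal(size=(16, 16)))
```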

S22, using a HOG model to capture edge and shape features, wherein HOG calculates the gradients of the image in the X and Y directions, calculates the gradient magnitude and gradient direction, divides the image into regions, and accumulates a 9-direction gradient histogram in each region to form the 3780-dimensional HOG feature vector V_HOG;

in one implementation, HOG is used primarily to capture edge and shape features for agricultural product target detection and classification;

HOG calculates the gradients of the image in the X direction and the Y direction as follows:

$$G_x = I(x+1, y) - I(x-1, y),\quad G_y = I(x, y+1) - I(x, y-1);$$

wherein G_x is the rate of change calculated in the X direction, i.e. the horizontal direction, reflecting the edge information of the image in the left-right direction, and G_y is the rate of change calculated in the Y direction, i.e. the vertical direction, reflecting the edge information of the image in the up-down direction;

similarly, the gradient magnitude m(x, y) and gradient direction θ(x, y) are calculated, the image is divided into 8×8 regions, and a 9-direction gradient histogram is accumulated for each region, expressed as follows:

$$m(x, y) = \sqrt{G_x^2 + G_y^2},\quad \theta(x, y) = \arctan\frac{G_y}{G_x};$$

finally, the 3780-dimensional HOG feature vector V_HOG = [h₁, h₂, ..., h₃₇₈₀] is formed;
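The figure of 3780 dimensions can be checked with a short Python calculation. The text gives only the 8×8 regions and the 9 directions; the sketch additionally assumes the widely used configuration of a 64×128 detection window, blocks of 2×2 cells and a block stride of 8 pixels, which is one configuration that yields exactly 3780. The per-cell histogram uses unsigned directions in [0°, 180°), also an assumption.

```python
import numpy as np

def hog_length(width=64, height=128, cell=8, block=2, stride=8, bins=9):
    """Descriptor length for overlapping blocks of block x block cells."""
    blocks_x = (width - block * cell) // stride + 1    # 7 block positions horizontally
    blocks_y = (height - block * cell) // stride + 1   # 15 block positions vertically
    return blocks_x * blocks_y * block * block * bins  # 7 * 15 * 4 * 9

def cell_histogram(cell_pixels, bins=9):
    """9-direction gradient histogram of a single cell, weighted by gradient magnitude."""
    g = np.asarray(cell_pixels, dtype=float)
    gx = np.zeros_like(g)
    gy = np.zeros_like(g)
    gx[:, 1:-1] = g[:, 2:] - g[:, :-2]                 # G_x = I(x+1, y) - I(x-1, y)
    gy[1:-1, :] = g[2:, :] - g[:-2, :]                 # G_y = I(x, y+1) - I(x, y-1)
    mag = np.hypot(gx, gy)
    ang = np.degrees(np.arctan2(gy, gx)) % 180.0       # unsigned gradient direction
    idx = np.minimum((ang / (180.0 / bins)).astype(int), bins - 1)
    return np.bincount(idx.ravel(), weights=mag.ravel(), minlength=bins)

n_dims = hog_length()
hist = cell_histogram(np.arange(64).reshape(8, 8))
```

With these defaults there are 7 × 15 = 105 block positions, each contributing 4 cells × 9 bins = 36 values, so `n_dims` is 3780.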

S23, using an LBP model to extract local texture features as a low-dimensional feature vector, wherein LBP calculates the LBP value by comparing each pixel with its neighboring pixels and computes the LBP histogram to form the (P+2)-dimensional feature vector V_LBP.

Specifically, in one implementation of the present invention, the LBP model is used to extract local texture features to form a low-dimensional feature vector;

LBP calculates the LBP value by comparing each pixel with its neighboring pixels:

$$LBP(x_c, y_c) = \sum_{p=0}^{P-1} s(I_p - I_c)\,2^p,\quad s(x) = \begin{cases}1, & x \ge 0\\ 0, & x < 0\end{cases};$$

wherein LBP(x_c, y_c) represents the LBP value calculated with (x_c, y_c) as the central pixel, P is the number of neighboring pixels, I_p represents the gray value of the p-th neighboring pixel, and I_c represents the gray value of the central pixel;

the LBP histogram is calculated using the following formula:

$$h_k = \sum_{(x_c, y_c)} \delta\left(LBP(x_c, y_c) - k\right);$$

the (P+2)-dimensional feature vector (typically a low dimension of 10) is ultimately formed: V_LBP = [h₁, h₂, ..., h₁₀].
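A histogram with P + 2 bins is what the rotation-invariant "uniform" variant of LBP produces, so the following Python sketch assumes that variant with P = 8 neighbors at radius 1: patterns with at most two 0/1 transitions around the circle are labeled by their number of ones (0 to 8), and all other patterns share the label 9. The text does not say which variant is used, so this choice is an assumption.

```python
import numpy as np

# The 8 neighbors of a pixel in circular order, as (row offset, column offset).
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_histogram(img):
    """Normalized 10-bin uniform LBP histogram of a 2-D grayscale image."""
    g = np.asarray(img, dtype=float)
    h, w = g.shape
    center = g[1:-1, 1:-1]                                     # interior pixels I_c
    bits = np.stack([
        (g[1 + dr:h - 1 + dr, 1 + dc:w - 1 + dc] >= center).astype(int)   # s(I_p - I_c)
        for dr, dc in OFFSETS
    ])                                                         # shape (8, H-2, W-2)
    transitions = np.abs(bits - np.roll(bits, 1, axis=0)).sum(axis=0)     # circular 0/1 changes
    ones = bits.sum(axis=0)
    labels = np.where(transitions <= 2, ones, 9)               # uniform: 0..8, otherwise 9
    hist = np.bincount(labels.ravel(), minlength=10).astype(float)
    return hist / hist.sum()

rng = np.random.default_rng(0)
v_lbp = lbp_histogram(rng.integers(0, 256, size=(32, 32)))
flat = lbp_histogram(np.ones((5, 5)))
```

On a constant image every neighbor is greater than or equal to the center, so all pixels receive label 8 and `flat[8]` equals 1.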

With continued reference to fig. 1 and fig. 2, the agricultural product classification method integrating dual-flow attention integration and cross-mode integration provided by the invention further comprises the following steps:

S3, integrating a plurality of text feature vectors through the self-adaptive layer weight, integrating a plurality of image feature vectors through a self-attention mechanism, and obtaining integrated text features and image features through weighted summation;

s4, respectively performing high-power transformation on the integrated text features and the integrated image features, and splicing the transformed features with original features to obtain enhanced text features and enhanced image features;

s5, constructing a relation matrix between the text and the image features by calculating the Pearson correlation coefficient, and weighting and fusing the text and the image features according to the relation matrix to obtain final fused features;

s6, inputting the fused feature vector into an MLP classifier, and mapping the fused feature to a corresponding agricultural product classification category by the MLP classifier to obtain an agricultural product classification result.

In the step S3, the specific process is divided into two steps of text feature integration and image feature integration;

Firstly, for complex multi-layer feature fusion, the invention uses adaptive layer weights (Layer Attention) to integrate the features, dynamically assigning weights according to the importance of the different layers and finally performing weighted fusion to obtain a more representative feature T_text;

then, the plurality of image feature vectors extracted in step S2 are concatenated, the correlations among the different features are captured through a self-attention mechanism (Self-Attention), and the weight of each feature is obtained;

finally, the fused feature vector V_image is obtained by weighted summation;

specifically, the step of integrating a plurality of text feature vectors by using the adaptive layer weights includes:

;

The weight is calculated, and the weight w is expressed as:

;

where w₁, w₂ and w₃ are the normalized weights, and T'_TF-IDF, T'_Word2Vec and T'_BERT are the feature vectors after dimension mapping;

The fused feature T_text of the embodiment of the invention retains the advantages of the three models TF-IDF, Word2Vec and BERT:

Term frequency-inverse document frequency (TF-IDF) reflects the sparse statistical characteristics of the text and captures keyword information;

The word-vector-based feature extraction model (Word2Vec) provides a dense, distributed word vector representation;

The BERT model provides contextual understanding;

Therefore, T_text combines the stability of the traditional methods with deep semantic modeling capability, and can achieve higher classification accuracy and generalization in the agricultural product classification task.
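The adaptive layer weighting described above (map each text feature to a common dimension d, compute normalized weights, take a weighted sum) can be sketched as follows. This is a minimal illustration, not the patented implementation: the scoring form (a dot product between each mapped feature and a learnable vector `w_a`), the random projection matrices, and the input sizes are all assumptions, since the exact weight formula is not recoverable from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return e / e.sum()

def fuse_text_features(features, projections, w_a):
    """Map each text feature to dimension d, score it, and return the weighted sum."""
    mapped = [W @ t for W, t in zip(projections, features)]  # T'_k, each of shape (d,)
    stacked = np.stack(mapped)                               # shape (3, d)
    scores = stacked @ w_a                                   # one score per feature (assumed form)
    w = softmax(scores)                                      # normalized weights w1, w2, w3
    t_text = w @ stacked                                     # weighted fusion, shape (d,)
    return t_text, w

# Hypothetical sizes for the TF-IDF, Word2Vec and BERT vectors.
sizes = [50, 30, 76]
d = 16
features = [rng.normal(size=n) for n in sizes]
projections = [rng.normal(size=(d, n)) * 0.1 for n in sizes]  # stand-ins for learned parameters
w_a = rng.normal(size=d)                                      # stand-in for learned attention weights

t_text, w = fuse_text_features(features, projections, w_a)
```

In a trained model the projection matrices and `w_a` would be learned jointly with the classifier; random values are used here only so the sketch runs on its own.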

;

A softmax operation is applied to each score to obtain the normalized weight a_ij, expressed as a_ij = softmax(Attention Score_ij), where a_ij represents the degree of attention of the i-th feature to the j-th feature and reflects the importance of the different features;

;
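The image-side integration (Query, Key and Value derived from V_SIFT, V_HOG and V_LBP, dot-product scores, softmax, weighted sum of the Value vectors) can be sketched as below. Several details are assumptions because the formulas are not recoverable here: the scores are scaled by the square root of the projection dimension, the per-feature contribution is taken as the mean attention each feature receives, and the projection matrices and vector lengths are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax_rows(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def fuse_image_features(feats, Wq, Wk, Wv):
    """Self-attention over three image descriptors, returning one fused vector."""
    Q = np.stack([W @ f for W, f in zip(Wq, feats)])   # Q_SIFT, Q_HOG, Q_LBP -> (3, d)
    K = np.stack([W @ f for W, f in zip(Wk, feats)])   # K_SIFT, K_HOG, K_LBP -> (3, d)
    V = np.stack([W @ f for W, f in zip(Wv, feats)])   # V'_SIFT, V'_HOG, V'_LBP -> (3, d)
    scores = Q @ K.T / np.sqrt(Q.shape[1])             # pairwise scores (scaling is an assumption)
    a = softmax_rows(scores)                           # a[i, j]: attention of feature i to feature j
    contrib = a.mean(axis=0)                           # per-feature contribution (assumed reduction)
    v_image = contrib @ V                              # weighted sum of the Value vectors
    return v_image, a, contrib

sizes = [128, 36, 10]   # hypothetical lengths of V_SIFT, V_HOG, V_LBP
d = 16
feats = [rng.normal(size=n) for n in sizes]
Wq = [rng.normal(size=(d, n)) * 0.1 for n in sizes]
Wk = [rng.normal(size=(d, n)) * 0.1 for n in sizes]
Wv = [rng.normal(size=(d, n)) * 0.1 for n in sizes]

v_image, a, contrib = fuse_image_features(feats, Wq, Wk, Wv)
```

Because every row of `a` sums to 1, the averaged contributions also sum to 1, so the fused vector is a convex combination of the three Value vectors.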

In step S4 of the present invention, to enhance the feature expression capability, the fused features T_text and V_image undergo a higher-power transformation and are then spliced with the original features to obtain the enhanced feature vectors T'_text and V'_image.

Specifically, in step S4 of the present invention, the steps of performing high-power transformation on the integrated text features and image features, and splicing the transformed features with the original features include:

The text feature vector T_text and the image feature vector V_image are subjected to an element-wise power operation, expressed as follows:

And ;

;

After splicing, the enhanced features T'_text and V'_image are obtained for subsequent multi-modal tasks.
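A minimal sketch of the power transformation and splicing step, assuming k = 2 (the value given in the claims) and assuming the original features come first in the concatenation; the helper name `enhance` is hypothetical.

```python
import numpy as np

def enhance(x, k=2):
    """Concatenate a feature vector with its element-wise k-th power."""
    x = np.asarray(x, dtype=float)
    return np.concatenate([x, x ** k])

t_text = np.array([0.5, -1.0, 2.0])
t_text_enhanced = enhance(t_text)
# t_text_enhanced is [0.5, -1.0, 2.0, 0.25, 1.0, 4.0]
```

The enhanced vector has twice the length of the input, so any downstream relation matrix or classifier must be sized for the doubled dimension.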

In step S5 of the invention, a cross-modal relation matrix of the text features and the image features is constructed to analyze the correlation between the two modalities and to quantify the weight distribution of the features of each modality. Based on this weight information, the enhanced text features and the enhanced image features are weighted and fused to obtain the final fused feature vector, improving the expressive capacity and decision performance of the multi-modal information;

Specifically, in step S5 of the present invention, a relationship matrix between text and image features is constructed by calculating pearson correlation coefficients, and the text and image features are weighted and fused according to the relationship matrix, so as to obtain a final fusion feature, which includes:

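$\rho(e_i, e_j) = \frac{\mathrm{cov}(e_i, e_j)}{\sigma_{e_i}\,\sigma_{e_j}} = \frac{E\left[(e_i - \mu_{e_i})(e_j - \mu_{e_j})\right]}{\sigma_{e_i}\,\sigma_{e_j}}$;

where cov(e_i, e_j) is the covariance of e_i and e_j, σ_ei and σ_ej are the standard deviations of e_i and e_j, and μ_ei and μ_ej are the means of e_i and e_j respectively; T_text and V_image are substituted for e_i and e_j, and ρ is used to construct the relation fusion matrix;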

Further, in the embodiment of the invention, the Pearson correlation coefficient is used to obtain the text-feature-to-image-feature relation fusion matrix R_TI. The dimension of the matrix is m_L × m_L, and R_TI reflects the interrelation between the text features and the image features;

Further, in the embodiment of the invention, V_image and T_text are substituted for e_i and e_j and the above steps are repeated to obtain the image-feature-to-text-feature relation matrix R_IT;
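One way to read the relation-matrix construction is sketched below. It assumes that the correlation is computed across a batch of N samples, so that entry (i, j) is the Pearson coefficient between text dimension i and image dimension j, and that the weighting step multiplies each feature vector by the corresponding relation matrix before an α/β weighted sum. These are interpretations, not statements of the patented procedure; the names `pearson_matrix` and `cross_modal_fuse` and the default α = β = 0.5 are hypothetical.

```python
import numpy as np

def pearson_matrix(A, B):
    """R[i, j] = Pearson correlation between column i of A and column j of B."""
    A_c = A - A.mean(axis=0)
    B_c = B - B.mean(axis=0)
    cov = A_c.T @ B_c / A.shape[0]                          # population covariance
    return cov / np.outer(A.std(axis=0), B.std(axis=0))     # divide by sigma_i * sigma_j

def cross_modal_fuse(T, V, alpha=0.5, beta=0.5):
    """Weight each modality by its relation matrix, then combine (assumed form)."""
    R_TI = pearson_matrix(T, V)      # text-to-image relation matrix
    R_IT = pearson_matrix(V, T)      # image-to-text relation matrix
    V_w = V @ R_TI.T                 # image features weighted by R_TI, per sample
    T_w = T @ R_IT.T                 # text features weighted by R_IT, per sample
    return alpha * T_w + beta * V_w, R_TI, R_IT

rng = np.random.default_rng(3)
N, m = 200, 4                                    # hypothetical batch size and feature dimension
T = rng.normal(size=(N, m))
V = 0.6 * T + 0.4 * rng.normal(size=(N, m))      # image features partly correlated with text
F, R_TI, R_IT = cross_modal_fuse(T, V)
```

Under this reading R_IT is simply the transpose of R_TI, because the Pearson coefficient is symmetric in its two arguments.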

In step S6 of the invention, classification of the agricultural product data set is achieved using a multi-layer perceptron (MLP);

In the invention, the agricultural product classification system is divided into four categories: livestock, poultry and aquatic products; fruits, vegetables, grains and oils; special economic crops; and others;

wherein the livestock, poultry and aquatic products category mainly comprises livestock and poultry meat, poultry eggs, dairy products and aquatic products;

the fruits, vegetables, grains and oils category comprises three subclasses: fruits, vegetables, and grains and oils;

the special economic crops category comprises tea, Chinese herbal medicines, flowers and plants, and nuts;

finally, the others category includes some processed agricultural products and specialty varieties. This classification scheme ensures comprehensive coverage and improves management and recognition efficiency.

Specifically, in step S6 of the present invention, the fused feature vector is input into the MLP classifier, which maps the fused feature to the corresponding agricultural product category to obtain the classification result. The MLP classifier is based on a neural network design and comprises three layers, namely an input layer, a hidden layer and an output layer, with adjacent layers fully connected, wherein:

;

where W₂ is the weight matrix of the second layer and b₂ is the bias term; similarly, the output of the second layer undergoes a nonlinear transformation through the ReLU activation function:;

The final output layer outputs the result of the multiple transformations, and the original score of the output layer can be obtained: ;

Assume that the agricultural products have C classification categories, where W₃ denotes the weight matrix of the output layer, b₃ the output layer bias term, and z the raw score of the output layer. The Softmax activation function converts the output layer score z into class probabilities, expressed as:

$p_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}$, $i = 1, \dots, C$;

After the probability of each class is obtained, the cross entropy loss is calculated between the class probabilities predicted by the model and the one-hot encoding of the true class, measuring the difference between the model output probability and the true label;

wherein the cross entropy loss function is:

$L = -\sum_{i=1}^{C} y_i \log p_i$, where y_i is the one-hot encoding of the true label and p_i is the predicted probability of class i;
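The forward pass, Softmax and cross entropy loss can be sketched as follows. This is a minimal illustration under stated assumptions: two ReLU-activated fully connected layers precede the output layer, the layer widths and random weights are hypothetical, and C = 4 is chosen only to match the four top-level categories.

```python
import numpy as np

rng = np.random.default_rng(2)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(z):
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

def mlp_forward(f, params):
    """Two ReLU layers followed by a Softmax output layer."""
    W1, b1, W2, b2, W3, b3 = params
    h1 = relu(W1 @ f + b1)    # first fully connected layer
    h2 = relu(W2 @ h1 + b2)   # second fully connected layer
    z = W3 @ h2 + b3          # raw output-layer scores
    return softmax(z)

def cross_entropy(p, y_onehot):
    """Cross entropy between predicted probabilities and a one-hot label."""
    return -float(np.sum(y_onehot * np.log(p + 1e-12)))

d_in, n1, n2, C = 8, 16, 12, 4   # hypothetical layer sizes; C classes
params = (
    rng.normal(size=(n1, d_in)) * 0.1, np.zeros(n1),
    rng.normal(size=(n2, n1)) * 0.1, np.zeros(n2),
    rng.normal(size=(C, n2)) * 0.1, np.zeros(C),
)
f = rng.normal(size=d_in)        # stand-in for a fused feature vector
p = mlp_forward(f, params)
y = np.eye(C)[2]                 # one-hot label for class index 2
loss = cross_entropy(p, y)
```

The small constant added inside the logarithm only guards against log(0); it is an implementation convenience, not part of the loss definition.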

The MLP model is then trained: the cross entropy loss function and the back propagation algorithm are used to obtain the gradient of the loss with respect to each weight, and the weights are updated dynamically; during training, the data set is iterated over multiple times, and in each training epoch the prediction is computed by forward propagation and the model error is evaluated with the loss function;

After training, the generalization capability of the model is further evaluated on a test set. Specifically, the fruits, vegetables, grains and oils category is selected from the agricultural product classification system for accuracy verification. First, text data and image data for fruit, vegetable, grain and oil products on the market are collected widely to ensure that the data are representative and diverse. Second, 20% of the agricultural product data is randomly split off from the fused feature vectors as an independent test set, so that the training set and the test set come from the same source. Because the agricultural product data set is large, the invention uses a direct test method: the test set data are input into the MLP model, and Accuracy (AC), Precision (PE), Recall (RE), F1-score and similar metrics are calculated. The larger the values of these evaluation metrics, the better the classification performance. They are defined as follows:

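$AC = \frac{TP + TN}{TP + TN + FP + FN}$, $PE = \frac{TP}{TP + FP}$, $RE = \frac{TP}{TP + FN}$, $F1 = \frac{2 \cdot PE \cdot RE}{PE + RE}$;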

where TP (True Positive) is the number of samples correctly predicted as the category, FP (False Positive) is the number of samples that do not belong to the category but are predicted as it, FN (False Negative) is the number of samples that belong to the category but are incorrectly predicted as other categories, and TN (True Negative) is the number of samples that neither belong to the category nor are predicted as it.
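The four metrics can be computed per category from the TP, FP, FN and TN counts defined above. The sketch below uses the standard definitions; the helper names and the toy labels are hypothetical and carry no experimental meaning.

```python
def confusion_counts(y_true, y_pred, cls):
    """Count TP, FP, FN and TN for one category treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    tn = sum(t != cls and p != cls for t, p in pairs)
    return tp, fp, fn, tn

def metrics(tp, fp, fn, tn):
    """Return Accuracy, Precision, Recall and F1-score, guarding against zero division."""
    ac = (tp + tn) / (tp + fp + fn + tn)
    pe = tp / (tp + fp) if tp + fp else 0.0
    re = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * pe * re / (pe + re) if pe + re else 0.0
    return ac, pe, re, f1

# Toy labels for illustration only.
y_true = ["fruit", "fruit", "vegetable", "grain", "vegetable", "fruit"]
y_pred = ["fruit", "vegetable", "vegetable", "grain", "fruit", "fruit"]

counts = confusion_counts(y_true, y_pred, "fruit")   # (2, 1, 1, 2)
ac, pe, re, f1 = metrics(*counts)                    # each equals 2/3 for this toy data
```

For a multi-class task the per-category values are usually averaged (macro or weighted); the text does not say which averaging is used.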

Multiple rounds of testing are carried out on the fruit, vegetable, grain and oil products, and the resulting test metrics are compared with those of the original method. The experimental results show that the method markedly improves the key evaluation metrics of accuracy, precision, recall and F1-score. The same test is then carried out on each of the other three categories of agricultural products, and the test results are summarized. The summary shows that the dual-stream attention mechanism combined with multi-modal feature fusion markedly improves the accuracy and robustness of agricultural product classification, with better performance and stronger generalization in particular on complex, large-scale data sets.

In summary, the invention adopts dual-stream attention integration to extract multi-level features of the text modality and the image modality respectively;

In text data processing, three different levels of features are extracted using the TF-IDF, Word2Vec and BERT models;

In image data processing, three different levels of features are extracted using the SIFT, HOG and LBP models;

The text features and the image features are then integrated through adaptive layer weights and a self-attention mechanism respectively, to obtain the fused feature vectors of the two modalities;

By using multi-level features, the classification method combines shallow and deep features, which improves its classification decision capability;

The classification method can maintain relatively high performance when part of the information is missing or the data quality is low, captures contextual association features, and improves the stability of the system;

In addition, the invention uses a cross-modal relation matrix to fuse the text feature vectors and the image feature vectors, and obtains enhanced text and image features by mapping the feature vectors into a higher-dimensional space, which improves the model's understanding of multi-dimensional information and captures the associations between different modalities;

When processing large-scale, complex data sets, the invention can effectively calculate the importance of the features and weaken the interference of redundant information, thereby improving the discriminative power of the feature representation and the accuracy and stability of the classification decision.

The above embodiments are merely illustrative of a preferred embodiment, but are not limited thereto. In practicing the present invention, appropriate substitutions and/or modifications may be made according to the needs of the user.

The number of equipment and the scale of processing described herein are intended to simplify the description of the present invention. Applications, modifications and variations of the present invention will be readily apparent to those skilled in the art.

Although embodiments of the invention have been disclosed above, they are not limited to the use listed in the specification and embodiments. It can be applied to various fields suitable for the present invention. Additional modifications will readily occur to those skilled in the art. Therefore, the invention is not to be limited to the specific details and illustrations shown and described herein, without departing from the general concepts defined in the claims and their equivalents.

Claims

1. The agricultural product classification method integrating double-flow attention integration and cross-mode integration is characterized by comprising the following steps of:

The method comprises the steps of mapping the text feature vectors to the same dimension, calculating weights, calculating the normalized weight of each feature, and calculating the fused feature vector by a weighted fusion method, wherein the Query, Key and Value in the self-attention mechanism are derived from V_SIFT, V_HOG and V_LBP respectively, the obtained Query vectors are denoted Q_SIFT, Q_HOG and Q_LBP, the obtained Key vectors are denoted K_SIFT, K_HOG and K_LBP, and the obtained Value vectors are denoted V'_SIFT, V'_HOG and V'_LBP; the dot product of Query and Key is calculated to obtain the similarity score between each pair of features, and a softmax operation is applied to each score to obtain the normalized weights;

2. The method for classifying agricultural products by combining dual-stream attention integration and cross-modal fusion according to claim 1, wherein the step of extracting features of the preprocessed text data by using a text feature model comprises:

extracting the statistical-information-layer text feature vector T_TF-IDF using the TF-IDF model;

converting words into high-dimensional vectors using the Word2Vec model, wherein the continuous bag-of-words model (CBOW) is used to predict the center word from its context in the agricultural product data set; after CBOW training, each word is mapped to a fixed-dimension vector, capturing the text feature vector T_Word2Vec;

extracting the deep-semantic context feature vector T_BERT using the BERT model.

3. The method for classifying agricultural products by combining dual-flow attention integration and cross-modal fusion according to claim 2, wherein the step of extracting features of the preprocessed image data by using an image feature model comprises:

4. The method for classifying agricultural products by fusing dual-flow attention integration and cross-modal fusion according to claim 3, wherein in the step of mapping the text feature vectors to the same dimension, the text feature vectors T_TF-IDF, T_Word2Vec and T_BERT are mapped to the same dimension d, expressed as:

$T'_k = W_k T_k$, $k \in \{\text{TF-IDF}, \text{Word2Vec}, \text{BERT}\}$;

In the formula, W_k represents a learnable parameter for mapping the different features to the same dimension; T_k represents a text feature vector; T'_k represents the text feature vector after dimension mapping;

In the step of calculating the weight, the weight w is expressed as:

;

wherein w is a three-dimensional vector, each feature having a normalized weight; Representing a learnable attention weight matrix;

in the step of calculating the fused feature vector by the weighted fusion method, the fused feature vector T _text is expressed as:

$T_{text} = w_1 T'_{TF\text{-}IDF} + w_2 T'_{Word2Vec} + w_3 T'_{BERT}$;

5. The method for classifying agricultural products by fusion of dual-flow attention integration and cross-modal fusion as recited in claim 4, wherein in the step of obtaining a similarity score between each pair of features by calculating a dot product of Query and Key, the similarity score is expressed as:

;

In the step of deriving the normalized weights, the normalized weight a_ij is expressed as a_ij = softmax(Attention Score_ij), where a_ij represents the degree of attention of the i-th feature to the j-th feature;

the weighted summation of the features by the calculated attention weights is expressed as:

$V_{image} = a_{SIFT} V'_{SIFT} + a_{HOG} V'_{HOG} + a_{LBP} V'_{LBP}$;

wherein a_SIFT, a_HOG and a_LBP represent the contribution degrees of the three image features SIFT, HOG and LBP in the final fused feature, and the Value vectors of the three features are weighted and summed to obtain the weighted fused image feature vector V_image.

6. The method for classifying agricultural products by fusing dual-flow attention integration and cross-modal fusion according to claim 5, wherein the step of performing high-power transformation on the integrated text features and image features and splicing the transformed features with the original features comprises the steps of:

And ;

Text feature vector: , is the word weight extracted by TF-IDF; Word vector components calculated by Word2 Vec; is the contextual semantic feature extracted by BERT, k=2, available: ;

;

The enhanced features T'_text and V'_image are obtained after splicing.

7. The agricultural product classification method for fusing dual-flow attention integration and cross-modal fusion according to claim 6, wherein the step of constructing a relation matrix between text and image features by calculating pearson correlation coefficients, and fusing the text and image features by weighting according to the relation matrix to obtain final fusion features comprises the steps of:

constructing a cross-modal relation matrix R by calculating the Pearson correlation coefficient ρ, expressed as:

$\rho(e_i, e_j) = \frac{\mathrm{cov}(e_i, e_j)}{\sigma_{e_i}\,\sigma_{e_j}} = \frac{E\left[(e_i - \mu_{e_i})(e_j - \mu_{e_j})\right]}{\sigma_{e_i}\,\sigma_{e_j}}$;

wherein cov(e_i, e_j) is the covariance of e_i and e_j, σ_ei and σ_ej are the standard deviations of e_i and e_j, and μ_ei and μ_ej are the means of e_i and e_j respectively;

T_text and V_image are substituted for e_i and e_j, and ρ is used to construct the relation fusion matrix;

the text-feature-to-image-feature relation fusion matrix R_TI is obtained using the Pearson correlation coefficient; V_image and T_text are then substituted for e_i and e_j and the above steps are repeated to obtain the image-feature-to-text-feature relation matrix R_IT;

After the relation matrices R_TI and R_IT are obtained, the image features and the text features are weighted respectively, wherein the image features are weighted through the text-to-image weighted relation matrix R_TI, expressed as: $V_{weighted} = R_{TI}\, V'_{image}$, where V_weighted represents the image features generated by weighting with the text-to-image relation matrix; the text features are weighted through the image-to-text weighted relation matrix R_IT, expressed as: $T_{weighted} = R_{IT}\, T'_{text}$, where T_weighted represents the text features generated by weighting with the image-to-text relation matrix;

T'_text and V'_image are the enhanced text features and image features respectively;

finally, the text features and the image features are fused by a weighted average, expressed as: $F = \alpha\, T_{weighted} + \beta\, V_{weighted}$, where F is the final fused feature vector, and α and β are hyperparameters that control how much the text features and the image features contribute to the final fused feature.

8. The method for classifying agricultural products by fusing dual-flow attention integration and cross-modal fusion according to claim 7, wherein in the step of inputting the fused feature vector into an MLP classifier, the MLP classifier maps the fused feature to a corresponding agricultural product classification category to obtain an agricultural product classification result, the MLP classifier is based on a neural network design and comprises three layers, namely an input layer, a hidden layer and an output layer, wherein different layers of the neural network of the MLP are fully connected, and the method comprises the steps of:

and transmitting the output of the first layer to a second layer of full-connection layer for processing, wherein the neuron number of the hidden layer is h ₂, and the calculation formula is expressed as follows: ;

The final output layer outputs the result of the multiple transformations, and the original score of the output layer can be obtained: Wherein W ₃ represents a weight matrix of the output layer, b ₃ represents an output layer bias term, and z represents an output layer original score;

The probability of converting the score z of the output layer into a category by the Softmax activation function is expressed as:

$p_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}$;

wherein e^{z_i} represents the score after exponential transformation, e is the base of the natural logarithm, and z_i is the raw output-layer score computed by the output layer of the neural network; $\sum_{j=1}^{C} e^{z_j}$ is the normalization factor, the sum of the exponentially transformed scores of all classes, which makes the probabilities of all classes sum to 1; C represents the number of classification categories of the agricultural products, and p_i represents the predicted probability of the i-th class;

after the probability of each class is obtained, the cross entropy loss is calculated between the class probabilities predicted by the model and the one-hot encoding of the true class, measuring the difference between the model output probability and the true label, wherein the cross entropy loss function is: $L = -\sum_{i=1}^{C} y_i \log p_i$;

wherein y_i is the one-hot encoding of the true label and p_i is the model's predicted probability for class i;