CN117611922A

CN117611922A - Class-incremental learning image classification method, device and medium based on prompt tuning

Info

Publication number: CN117611922A
Application number: CN202311801017.2A
Authority: CN
Inventors: 李伟伟; 陈红阳
Original assignee: Zhejiang Lab
Current assignee: Zhejiang Lab
Priority date: 2023-12-26
Filing date: 2023-12-26
Publication date: 2024-02-27

Abstract

The invention relates to a class-incremental learning image classification method, device and medium based on prompt tuning. The method comprises the following steps: acquiring an input image sample, preprocessing the image sample, and encoding it through an input embedding layer to obtain an embedding vector; based on the embedding vector, obtaining a feature vector using an image encoder, and matching a cluster center based on the feature vector; based on the matched cluster center, obtaining the corresponding prompt vector and classifier; and fusing the prompt vector with the embedding vector, obtaining a new feature vector using the image encoder, and obtaining a predicted image classification result using the classifier based on the new feature vector, wherein the cluster centers are obtained by clustering in each round of incremental learning. Compared with the prior art, the method reduces the impact of catastrophic forgetting and achieves high classification accuracy.

Description

Class-incremental learning image classification method, device and medium based on prompt tuning

Technical Field

The invention relates to the technical field of artificial intelligence, in particular to a class-incremental learning image classification method, device and medium based on prompt tuning.

Background

Class-incremental learning, also called continual learning, takes its inspiration from the way humans learn. Humans constantly encounter new things and new knowledge while retaining the ability to remember and apply old knowledge. Researchers have therefore focused on how machine learning algorithms can learn continuously like humans, accumulating knowledge while coping with changing environments and tasks rather than forgetting what was learned before. Conventional machine learning models are typically trained on a static dataset, and once training is complete no further learning takes place. In the real world, however, data changes dynamically and new tasks and information emerge continuously.

One of the main challenges of class-incremental learning is catastrophic forgetting: the model may forget previously learned knowledge when it learns a new task. This happens because a conventional model updates its parameters as it learns from new data, so the old parameters are overwritten. Common solutions currently fall into three groups: regularization-based, replay-based, and dynamic-network-structure-based methods. Regularization-based methods limit the plasticity of the model by restricting the learning rate of parameters that are important to previous tasks. Specifically, a quadratic loss term penalizes the difference between the model parameters and the expected values learned on previous tasks, where the expected values are calculated from the training data of previous tasks and the parameter importance weights. Replay-based methods construct a data buffer that stores samples of old tasks; during training, old and new samples are mixed and fed to the model together so that old knowledge is retained. Dynamic-network-structure-based methods maintain the plasticity of the model by adding new network parameters for each new task without overwriting old knowledge. They perform best among the three groups, but the number of model parameters grows sharply as tasks accumulate, which is unacceptable.

In summary, there is currently a lack of a class-incremental learning image classification method that enables machine learning models to continuously learn and adapt to new data and tasks.

Disclosure of Invention

The invention aims to overcome the defects of the prior art by providing a class-incremental learning image classification method, device and medium based on prompt tuning, so as to overcome or partially overcome the problem of catastrophic forgetting.

The aim of the invention can be achieved by the following technical scheme:

The invention provides a class-incremental learning image classification method based on prompt tuning, which comprises the following steps:

acquiring an input image sample, preprocessing the image sample, and encoding it through an input embedding layer to obtain an embedding vector;

based on the embedding vector, obtaining a feature vector using an image encoder, and matching a cluster center based on the feature vector;

based on the matched cluster center, obtaining the corresponding prompt vector and classifier;

fusing the prompt vector with the embedding vector, obtaining a new feature vector using the image encoder, and obtaining a predicted image classification result using the classifier based on the new feature vector,

wherein the cluster centers are obtained by clustering in each round of incremental learning.

As a preferred technical solution, the clustering process for obtaining the cluster centers in each round of incremental learning comprises the following steps:

acquiring a plurality of pairs of sample tuples and label tuples in the current round's learning task;

based on the encoded vectors of the sample tuples, extracting features using the image encoder, and obtaining cluster centers with the minimum Euclidean distance through hierarchical clustering;

and fusing the encoded vectors of the sample tuples with the initialized prompt vector, inputting the resulting fused vector into the image encoder, obtaining a classification result from the output of the image encoder using the classifier corresponding to the current round's training task, and completing the current round of learning based on the classification result and the label tuples.

As a preferred technical solution, the cluster center is used as a key, and the prompt vector and the classifier from each round of class-incremental learning are used as the value corresponding to that key.
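The key-value arrangement can be pictured with a small sketch. It is only an illustration under stated assumptions: the names `TaskValue`, `PromptStore`, and `add_task` are hypothetical and do not come from the patent, and plain Python lists stand in for tensors.

```python
from dataclasses import dataclass, field
from typing import List

Vector = List[float]


@dataclass
class TaskValue:
    """Value stored for one task (hypothetical container)."""
    prompt: List[Vector]      # m prompt tokens
    classifier: List[Vector]  # weight rows of a linear classifier


@dataclass
class PromptStore:
    """Hypothetical store: cluster centers are keys, TaskValue objects are values."""
    keys: List[List[Vector]] = field(default_factory=list)  # keys[t] = centers of task t
    values: List[TaskValue] = field(default_factory=list)   # values[t] = value of task t

    def add_task(self, centers: List[Vector], value: TaskValue) -> int:
        """Register one round of incremental learning and return its task ID."""
        self.keys.append([list(c) for c in centers])
        self.values.append(value)
        return len(self.values) - 1

    def num_keys(self) -> int:
        """Total number of cluster centers saved across all tasks."""
        return sum(len(centers) for centers in self.keys)


store = PromptStore()
for task in range(10):
    centers = [[float(task), float(i)] for i in range(50)]  # toy 2-D centers
    store.add_task(centers, TaskValue(prompt=[[0.0, 0.0]], classifier=[[0.0, 0.0]]))
```

With 10 tasks and 50 centers per task (the settings the description reports for ImageNet-R), `store.num_keys()` is 500, while only 10 values are stored, because all centers of one task share the same prompt vector and classifier.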

As a preferred technical solution, the process of obtaining the predicted image classification result using the classifier based on the new feature vector includes:

selecting dimensions 0 to m-1 of the new feature vector, averaging them to generate the vector required for classification, and inputting that vector into the classifier to obtain the predicted image classification result, wherein m is the length of the prompt vector.
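The averaging step can be sketched as follows. This is an interpretation, not the patent's code: it assumes the new feature vector is a sequence of token features, so that "0 to m-1" refers to the first m positions, which are the ones occupied by the prompt vector after fusion. The function name `pool_prompt_positions` is hypothetical.

```python
from typing import List


def pool_prompt_positions(features: List[List[float]], m: int) -> List[float]:
    """Average positions 0..m-1 of a (tokens x dim) feature matrix into one vector."""
    if not 0 < m <= len(features):
        raise ValueError("m must be between 1 and the number of tokens")
    dim = len(features[0])
    return [sum(row[d] for row in features[:m]) / m for d in range(dim)]


# Toy example: 3 tokens of dimension 2, prompt length m = 2.
features = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
x_cls = pool_prompt_positions(features, m=2)
```

Only the first two rows contribute here; the third row, standing in for an image patch position, is ignored by the pooling.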

As a preferred technical solution, the input embedding layer and the image encoder are pre-trained, and their parameters are kept frozen during class-incremental learning and prediction.

As a preferred technical solution, the image encoder adopts a Transformer architecture based on a multi-head attention mechanism.

As a preferred technical solution, the preprocessing includes one or more of random cropping, random flipping, image resizing, normalization and brightness adjustment.

In another aspect of the present invention, there is provided a class-incremental learning image classification device based on prompt tuning, including:

the data preprocessing module is used for preprocessing an input image sample;

the picture embedding module is used for converting the preprocessed image sample into an embedded vector;

the feature extraction module is used for extracting key features of the image from the embedded vector to obtain a feature vector;

the selector module is used for matching the clustering center based on the feature vector to obtain a corresponding prompt vector and a classifier;

the feature fusion module is used for fusing the prompt vector and the embedded vector to obtain a new feature vector;

an image encoder module for performing feature extraction for the new feature vector;

and the classification prediction module is used for obtaining a predicted image classification result by using the classifier based on the output of the image encoder module.

As a preferred technical solution, the device further comprises:

the clustering module is used for clustering all sample tuples of the current learning task after encoding and feature extraction during incremental learning, and storing the cluster centers in a list to be used as keys for subsequently matching prompt vectors;

the prompt vector initialization module is used for initializing a prompt vector for a learning task;

and the training module is used for defining the loss function, selecting an optimizer and updating the parameters.

In another aspect of the invention, a computer-readable storage medium is provided that includes one or more programs for execution by one or more processors of an electronic device, the one or more programs including instructions for performing the class-incremental learning image classification method based on prompt tuning described above.

Compared with the prior art, the invention has at least one of the following beneficial effects:

(1) Reduced impact of catastrophic forgetting: in the prediction process, the features produced by the embedding layer and the image encoder are first matched against the cluster centers obtained by clustering each learning task during class-incremental learning, which yields the prompt vector and classifier corresponding to the matched cluster center, and prediction is then performed with them. Because each task has an independent parameter space, tasks do not interfere with one another, which partially overcomes catastrophic forgetting. In addition, prompt learning has few trainable parameters, so the memory footprint remains acceptable even if a new prompt vector is created for each task.

(2) High classification accuracy: during class-incremental learning, cluster centers with the minimum Euclidean distance are obtained through hierarchical clustering. Compared with other clustering methods, hierarchical clustering tolerates noise and outliers better, is not limited by cluster shape or size, and can find clusters of different sizes and shapes, which further improves the classification accuracy for pictures.

Drawings

FIG. 1 is a schematic diagram of the training phase of the class-incremental learning image classification method based on prompt tuning in an embodiment;

FIG. 2 is a schematic diagram of the prediction phase of the class-incremental learning image classification method based on prompt tuning in an embodiment;

FIG. 3 is a schematic diagram of the class-incremental learning image classification device based on prompt tuning in the training phase in an embodiment;

FIG. 4 is a schematic diagram of the class-incremental learning image classification device based on prompt tuning in the prediction phase in an embodiment;

FIG. 5 is a schematic diagram of the model framework in an embodiment;

FIG. 6 is a schematic diagram of an electronic device in an embodiment.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.

Example 1

To address the problems in the prior art, this embodiment provides a class-incremental learning image classification method based on prompt tuning.

Prompt tuning, also known as prompt learning, is a paradigm that follows the pre-training + fine-tuning ("pre-train + fine-tune") paradigm. Unlike conventional fine-tuning, it does not modify the parameters of the pre-trained model; tasks are completed by training prompt tokens added at the embedding layer. In the pre-training + fine-tuning paradigm, the model adapts to downstream tasks by fine-tuning the parameters of the pre-trained model. Prompt learning instead adapts to the downstream task by adjusting the input data so that the data distribution of the downstream task is as close as possible to that of the pre-trained model. Prompt learning has fewer trainable parameters, shorter training time, and makes efficient use of the model.

The method adopts a prompt learning strategy that trains a dedicated prompt vector for each task independently, which mitigates catastrophic forgetting and improves utilization of the pre-trained model. Introducing a prompt vector for each new task keeps the model adaptable to new knowledge, while freezing the prompt vectors of old tasks helps retain learned knowledge and so keeps the model stable overall. Together these provide a flexible solution for class-incremental learning with deep learning models.
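A rough count makes the "fewer training parameters" point concrete. The sketch below rests on assumptions that are not in the patent: a prompt length of 10, a linear classifier with a bias term, 10 classes per task, and a backbone of roughly 86 million parameters (the commonly quoted size of ViT-B/16). Only the embedding dimension 768 is taken from the description.

```python
def per_task_trainable_params(prompt_len: int, embed_dim: int, num_classes: int) -> int:
    """Trainable parameters added for one task: prompt tokens plus a linear classifier."""
    prompt_params = prompt_len * embed_dim
    classifier_params = embed_dim * num_classes + num_classes  # weights + bias (bias assumed)
    return prompt_params + classifier_params


BACKBONE_PARAMS = 86_000_000  # approximate size of ViT-B/16, kept frozen (assumption)

per_task = per_task_trainable_params(prompt_len=10, embed_dim=768, num_classes=10)
total_for_10_tasks = 10 * per_task
fraction_of_backbone = total_for_10_tasks / BACKBONE_PARAMS
```

Under these assumed values, each task adds 15,370 trainable parameters and ten tasks add 153,700 in total, which is well under 1% of the frozen backbone.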

The method can be divided into a training phase and a prediction phase.

The flow chart of the training phase is shown in FIG. 1, where it is split into two main branches. The left branch is responsible for acquiring the cluster centers of the current task, while the right branch trains the prompt vector and classification head of the current task. The right branch is designed to introduce task-specific information into the model: fusing the prompt vector and training a linear classifier improves the model's performance on the current task.

The pre-trained model is constructed as follows:

The backbone network is a ViT-B/16 network $f = f_r \circ f_e$, where $f_e$ is the input embedding layer and $f_r$ is the image encoder.

The parameters are as follows:

Patch_size: the size of the small blocks into which the input image is divided, typically 16x16.

Num_layers: the number of layers of the Transformer module, typically 12.

Embedding_dim: the dimension of each patch after it is embedded in vector space, typically 768.

Num_heads: the number of heads of the multi-head attention mechanism in the Transformer module, typically 12.

Mlp_dim: the dimension of the fully connected layer in the Transformer module, typically 3072.

Dropout: the probability that each neuron is randomly deactivated during training, typically 0.1.

The backbone network is pre-trained on the ImageNet-21k dataset; all parameters of the network keep their default values and remain frozen throughout training.
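The listed hyperparameters can be collected in a small configuration object, which also makes two derived quantities easy to check. This is a sketch only: the class name `ViTConfig`, its field names, and the 224x224 input size are assumptions for illustration, not values stated in the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ViTConfig:
    """Hypothetical container for the ViT-B/16 settings listed in the description."""
    patch_size: int = 16
    num_layers: int = 12
    embedding_dim: int = 768
    num_heads: int = 12
    mlp_dim: int = 3072
    dropout: float = 0.1

    @property
    def head_dim(self) -> int:
        """Per-head dimension of the multi-head attention."""
        return self.embedding_dim // self.num_heads

    def num_patches(self, image_size: int) -> int:
        """Number of patches for a square image of side image_size."""
        if image_size % self.patch_size != 0:
            raise ValueError("image_size must be a multiple of patch_size")
        return (image_size // self.patch_size) ** 2


config = ViTConfig()
patches = config.num_patches(image_size=224)  # 224 is an assumed input size
```

For an assumed 224x224 input, the image is split into 14 x 14 = 196 patches, and each of the 12 attention heads works on 768 / 12 = 64 dimensions.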

The training phase comprises the following steps:

S101, acquiring a new task.

First, an image classification dataset is divided into a plurality of sub-datasets $D = \{D_1, \dots, D_T\}$, where each task $D_t$ includes a tuple of input samples and the corresponding tuple of labels. Each sub-dataset represents one task, and the tasks are learned in turn in the arranged task order (the order may also be shuffled).
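One simple way to form such sub-datasets is to give each task an equal, disjoint block of class labels. The sketch below assumes contiguous class ranges and uses hypothetical helper names (`split_classes_into_tasks`, `build_task_datasets`); the patent does not prescribe how classes are assigned to tasks.

```python
from typing import Any, List, Sequence, Tuple


def split_classes_into_tasks(num_classes: int, num_tasks: int) -> List[List[int]]:
    """Assign an equal number of consecutive class IDs to each task."""
    if num_classes % num_tasks != 0:
        raise ValueError("num_classes must be divisible by num_tasks")
    per_task = num_classes // num_tasks
    return [list(range(t * per_task, (t + 1) * per_task)) for t in range(num_tasks)]


def build_task_datasets(
    samples: Sequence[Any], labels: Sequence[int], task_classes: List[List[int]]
) -> List[List[Tuple[Any, int]]]:
    """Group (sample, label) pairs into one sub-dataset per task."""
    tasks = []
    for classes in task_classes:
        wanted = set(classes)
        tasks.append([(x, y) for x, y in zip(samples, labels) if y in wanted])
    return tasks


# 100 classes in 10 tasks of 10 classes each, as in the CIFAR-100 setup.
cifar_tasks = split_classes_into_tasks(num_classes=100, num_tasks=10)
toy = build_task_datasets(["a", "b", "c", "d"], [0, 1, 2, 3], split_classes_into_tasks(4, 2))
```

The same helper with `num_classes=200, num_tasks=10` gives 20 classes per task, matching the ImageNet-R split reported in the experiments.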

S102, data preprocessing.

In the left branch, data preprocessing is performed on each sample $x$ of the new task $D_t$, including random cropping, random flipping, image resizing, normalization, brightness adjustment and other operations.

S103, converting the picture into a vector.

The preprocessed sample $x$ is encoded by the embedding layer into a two-dimensional vector $x_e = f_e(x)$.

S104, extracting features using the pre-trained model.

The picture embedding vector $x_e$ is input into the image encoder $f_r$ to extract the feature vector $x_f = f_r(x_e)$. The extracted feature vectors are saved to a list for use in subsequent clustering. The above steps are repeated until feature extraction has been completed for all samples of the current task.

S105, clustering and storing a clustering center.

The current task is clustered with a hierarchical clustering algorithm, and the cluster centers are stored in a list. These cluster centers serve as keys for subsequently looking up the corresponding prompt vectors.

Compared with other clustering algorithms such as k-means, the hierarchical clustering algorithm adopted by the method is more robust: it tolerates noise and outliers better, is not limited by cluster shape or size, and can find clusters of different sizes and shapes. Taking the ImageNet-R dataset as an example, the dataset is split into 10 sub-datasets, and a ViT network pre-trained on ImageNet-21k is used for feature extraction on each subset. Each task is then clustered, with the number of cluster centers set to 50, so a total of 500 cluster centers are saved for the 10 subsets. On the test set, a KNN algorithm matches each test sample to its nearest cluster center, and the list ID of that cluster center is taken as the task ID.

Experiments show that, on the test set, clustering with the k-means algorithm matches the correct task ID with an accuracy of 90.74%, whereas clustering with the hierarchical clustering algorithm reaches 93.91%, an improvement of about 3 percentage points. This result supports the robustness of hierarchical clustering for complex task clustering.
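The clustering step can be illustrated with a deliberately naive agglomerative procedure. The patent does not state the linkage criterion or the implementation, so this sketch assumes centroid linkage (repeatedly merging the two clusters whose centroids are closest) and takes the centroid of each final cluster as its center; the function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def centroid(points: List[Vector]) -> Vector:
    """Mean of a non-empty list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def agglomerative_centers(points: List[Vector], num_clusters: int) -> List[Vector]:
    """Merge the two clusters with the closest centroids until num_clusters remain."""
    if not 1 <= num_clusters <= len(points):
        raise ValueError("num_clusters must be between 1 and len(points)")
    clusters = [[list(p)] for p in points]  # start with one cluster per point
    while len(clusters) > num_clusters:
        centers = [centroid(c) for c in clusters]
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centers[i], centers[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    return [centroid(c) for c in clusters]


# Two well-separated groups of toy 2-D features.
points = [[0, 0], [0, 1], [10, 10], [10, 11]]
centers = agglomerative_centers(points, num_clusters=2)
```

This loop is cubic in the number of samples and is meant only to show the idea. In practice a library routine would normally replace it, for example scikit-learn's `AgglomerativeClustering` followed by a per-cluster mean; whether the authors used that routine is not stated.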

S106, initializing a prompt vector.

In the right branch, an initialization operation is performed first, which creates a prompt vector of length m, $P_t = [p_1, p_2, \dots, p_m]$, and a linear classifier $H_t$ specific to the current task. The prompt vector and linear classifier are the task-specific values corresponding to the keys mentioned above.

S107, fusing with the picture vector.

Each input sample $x$ is first encoded by the embedding layer $f_e$ to obtain the encoded two-dimensional vector $x_e = f_e(x)$ (the same as in S103). Then the class_token is removed to obtain $x'_e$, and the prompt vector $P_t$ and $x'_e$ are fused by vector concatenation to obtain the fused vector $x_p = [P_t; x'_e]$.
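The fusion by concatenation can be sketched on plain lists. The sketch assumes that the class_token occupies position 0 of the embedded sequence, as is conventional for ViT but not stated in the patent; the function name `fuse_prompt` and its flag are hypothetical.

```python
from typing import List

Vector = List[float]


def fuse_prompt(prompt: List[Vector], embedded: List[Vector],
                drop_class_token: bool = True) -> List[Vector]:
    """Concatenate the m prompt tokens in front of the patch embeddings."""
    # Assumption: the class_token is the first row of the embedded sequence.
    patches = embedded[1:] if drop_class_token else embedded
    if prompt and patches and len(prompt[0]) != len(patches[0]):
        raise ValueError("prompt and patch embeddings must share the same dimension")
    return [list(tok) for tok in prompt] + [list(tok) for tok in patches]


# Toy shapes: 1 class token + 4 patch tokens of dimension 3, prompt length m = 2.
embedded = [[9.0, 9.0, 9.0]] + [[float(i)] * 3 for i in range(4)]
prompt = [[0.5, 0.5, 0.5], [0.25, 0.25, 0.25]]
fused = fuse_prompt(prompt, embedded)
```

Because the prompt tokens are placed first, they occupy positions 0 to m-1 of the fused sequence, which is why the later averaging step reads exactly those positions from the encoder output.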

S108, extracting the characteristics by using the pre-training model.

$x_p$ is input into the image encoder $f_r$ to obtain the feature vector $x_{pf} = f_r(x_p)$. Dimensions 0 to m-1 of $x_{pf}$ are selected and averaged to generate the class_token required for subsequent classification (denoted $x_{cls}$).

S109, calculating classification loss and optimization.

$x_{cls}$ is input into the linear classifier to obtain the prediction $y_{pre} = H_t(x_{cls})$. Next, the classification loss is calculated by comparing the prediction $y_{pre}$ with the corresponding label $y$, and an optimizer is used to minimize it. This process adjusts the task-specific prompt vector and linear classifier so that they better fit the characteristics and classification needs of the current task.
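A minimal numeric sketch of this step is given below. The patent names neither the loss nor the optimizer, so softmax cross-entropy and plain gradient descent are assumptions here. For brevity the sketch updates only the linear classifier; in the method the prompt vector is trained as well, which requires backpropagating through the frozen encoder with an automatic differentiation framework.

```python
import math
from typing import List, Tuple

Vector = List[float]


def linear(weights: List[Vector], bias: Vector, x: Vector) -> Vector:
    """Logits of a linear classifier: one weight row and one bias per class."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits: Vector) -> Vector:
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: Vector, label: int) -> float:
    return -math.log(softmax(logits)[label])


def sgd_step(weights: List[Vector], bias: Vector, x: Vector, label: int,
             lr: float = 0.1) -> Tuple[List[Vector], Vector]:
    """One gradient-descent update of the classifier on a single sample."""
    probs = softmax(linear(weights, bias, x))
    grad = [p - (1.0 if c == label else 0.0) for c, p in enumerate(probs)]  # dL/dlogits
    new_weights = [[w - lr * g * v for w, v in zip(row, x)] for row, g in zip(weights, grad)]
    new_bias = [b - lr * g for b, g in zip(bias, grad)]
    return new_weights, new_bias


# Toy x_cls of dimension 2, three classes, zero-initialized classifier.
x_cls, y = [1.0, 2.0], 1
weights, bias = [[0.0, 0.0] for _ in range(3)], [0.0, 0.0, 0.0]
loss_before = cross_entropy(linear(weights, bias, x_cls), y)
weights, bias = sgd_step(weights, bias, x_cls, y)
loss_after = cross_entropy(linear(weights, bias, x_cls), y)
```

With a zero-initialized classifier the initial loss equals log 3, and one update on the same sample lowers it, which is the behaviour the optimizer is expected to produce on each training batch.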

Referring to FIG. 2, a schematic diagram of the prediction phase is shown, which includes the following steps:

S201, inputting a picture.

An input sample $x$ is received.

S202, data preprocessing.

Data preprocessing is performed on the input sample, including random cropping, random flipping and other operations.

S203, embedding the picture into a vector.

The preprocessed sample $x$ is encoded by the embedding layer into a two-dimensional vector $x_e = f_e(x)$.

S204, extracting features by using the pre-training model.

The picture embedding vector $x_e$ is input into the image encoder $f_r$ to extract features, giving the feature vector $x_f = f_r(x_e)$.

S205, selecting a prompt vector using the extracted features.

The feature vector $x_f$ is used as the query vector, and the Euclidean distances to all cluster center keys are calculated: distances = euclidean(query, keys).

Next, the argmin function selects the index of the cluster center with the smallest distance, index = argmin(distances), which determines the prompt vector $P_{index} = P[index]$ and the linear classifier $head_{index} = H[index]$ corresponding to the input.

The prompt vector $P_{index}$ is fused with the picture embedding vector $x_e$ by vector concatenation to obtain the fused vector $x_p = [P_{index}; x_e]$.
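The selection step can be sketched directly from the `euclidean` and `argmin` expressions above. Because each task stores several cluster centers, the sketch adds a hypothetical `key_to_task` list that maps each stored center to its task ID; that mapping, the `select` function, and the string placeholders for prompts and classifier heads are assumptions for illustration.

```python
import math
from typing import Any, List, Sequence, Tuple


def euclidean(query: Sequence[float], keys: Sequence[Sequence[float]]) -> List[float]:
    """Euclidean distance from the query to every cluster center key."""
    return [math.dist(query, key) for key in keys]


def argmin(values: Sequence[float]) -> int:
    """Index of the smallest value (first one on ties)."""
    return min(range(len(values)), key=values.__getitem__)


def select(query: Sequence[float], keys: Sequence[Sequence[float]],
           key_to_task: Sequence[int], prompts: Sequence[Any],
           heads: Sequence[Any]) -> Tuple[Any, Any]:
    """Return the prompt vector and classifier of the task owning the nearest center."""
    distances = euclidean(query, keys)
    index = key_to_task[argmin(distances)]  # task ID of the nearest cluster center
    return prompts[index], heads[index]


# Toy setup: task 0 owns two centers, task 1 owns one.
keys = [[0.0, 0.0], [1.0, 1.0], [10.0, 10.0]]
key_to_task = [0, 0, 1]
prompts, heads = ["P0", "P1"], ["H0", "H1"]
chosen = select([9.0, 9.0], keys, key_to_task, prompts, heads)
```

A query near (10, 10) is routed to task 1, while a query near the origin is routed to task 0; no task label is needed at prediction time.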

S204, extracting features by using the pre-training model.

$x_p$ is input into the image encoder $f_r$ to obtain the feature vector $x_f = f_r(x_p)$. Dimensions 0 to m-1 of $x_f$ are selected and averaged to generate the class_token required for subsequent classification, denoted $x_{cls}$.

S206, classification prediction.

$x_{cls}$ is input into the linear classifier for classification to obtain the prediction $y_{pre} = head_{index}(x_{cls})$.

The inference process makes full use of the task-specific prompt vector and linear classifier, and classifies the input sample by fusing them with the pre-trained model. It retains the task-specific information learned during the training phase, which supports the model's performance on new tasks.

Referring to FIG. 5, a network architecture diagram of the method is shown, which includes a pre-trained ViT network and a plurality of task-specific cluster centers, prompt vectors and classifiers. All parameters remain frozen except for the prompt vectors and classifiers, which are trainable. For an input sample, features are first extracted by the pre-trained model and treated as the query vector. The Euclidean distances between the query vector and all keys are then calculated, and the value corresponding to the key closest to the query vector is selected. The keys here are the cluster centers of each task, and each value consists of the prompt vector and classifier unique to that task.

The selected prompt vector is inserted at the embedding layer, concatenated with the embedded patches of the image, and then input into the encoder of the network. Finally, the encoded vector is classified by a linear classifier. The key point of the network structure is that task-specific information is fused effectively while the pre-trained model stays frozen, which ensures the stability and learning efficiency of the network. This design allows the method to adapt quickly to new tasks in class-incremental learning while retaining knowledge of old tasks.

To illustrate the advantages of the method, it was tested in a high-performance computing environment running Linux, configured as follows:

Hardware environment:

GPU: the experiment uses 8 V100 graphics processing units (GPUs), which provide the parallel computing power needed for model training and inference.

CPU: the computing environment is equipped with a high-performance central processing unit (CPU) to support normal operation of the system and to assist in computing tasks.

Memory: the system has enough memory capacity to meet the data loading and storage requirements of large-scale deep learning model training.

Software environment:

Operating system: Ubuntu 20.04.4 LTS, which is widely supported by deep learning frameworks and tools and provides a stable basis for the experiments.

Deep learning framework: PyTorch version 1.13 was used for training and inference.

CUDA: CUDA version 12.1 was used for GPU-accelerated computation.

Related software packages: auxiliary tools and packages such as NumPy and sklearn may also be used in the experiments to support data processing and scientific computing.

Table 1 shows the results of comparing the present method with other recent methods on three datasets: CIFAR-100, 5-datasets and ImageNet-R. On the CIFAR-100 dataset, the classes are divided into 10 tasks, each containing 10 categories. On the 5-datasets dataset, there are 5 tasks, each with different categories. On the ImageNet-R dataset, the classes are divided into 10 tasks, each containing 20 categories. The evaluation metric is the prediction accuracy on each task; the accuracies of all tasks are then averaged to obtain the final accuracy on the dataset.
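The evaluation metric described above amounts to two small computations, sketched here with hypothetical function names and toy labels rather than the experimental data.

```python
from typing import List, Sequence


def task_accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of correct predictions on one task."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and of equal length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def average_accuracy(per_task_accuracies: List[float]) -> float:
    """Final accuracy on a dataset: the mean of the per-task accuracies."""
    if not per_task_accuracies:
        raise ValueError("at least one task accuracy is required")
    return sum(per_task_accuracies) / len(per_task_accuracies)


# Toy example with two tasks.
acc_task_0 = task_accuracy([1, 2, 3, 4], [1, 2, 0, 4])
acc_task_1 = task_accuracy([5, 5, 5, 5], [5, 6, 7, 8])
final_accuracy = average_accuracy([acc_task_0, acc_task_1])
```

Averaging per task rather than per sample gives every task the same weight, regardless of how many test samples it contains.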

Table 1 test comparative results

As can be seen from the table, the method shows clear advantages on all three datasets and outperforms previous methods from recent years. This result reflects the advantages of the method in several respects:

(1) Higher accuracy: the method achieves higher accuracy on all three datasets, which indicates better accuracy and reliability on the tasks. Compared with previous methods, the invention is better at resisting catastrophic forgetting.

(2) Lower memory usage: the method is memory-friendly and needs no extra memory to store historical data, which reduces the pressure on system resources. Although no replay buffer is used, the method still achieves a clear advantage in accuracy.

(3) Higher model utilization: the method keeps the parameters of the pre-trained model frozen throughout training and adjusts the data distribution of the downstream task by training a task-specific prompt vector, so that the data distribution is closer to that of the pre-trained model. The downstream task is thus handled better and the pre-trained model is fully used.

(4) Stronger generalization ability: the method is simple and effective and is in theory applicable to all deep learning tasks, including but not limited to image classification. For example, the method can be transferred to video understanding tasks, where a similarly notable effect can be achieved.

Example 2

This embodiment provides a class-incremental learning image classification device based on prompt tuning.

Referring to FIG. 3, in the training phase, the image classification device includes:

The data preprocessing module, which preprocesses the input image data so that it is suitable for model training. Preprocessing may include image resizing, normalization, brightness adjustment and similar operations to improve the stability and performance of the model.

The picture embedding module, which converts the input image into an embedded representation so that the information in the image can be processed by the model. In a Vision Transformer (ViT), this is typically done by dividing the image into a series of fixed-size patches and flattening the pixel values of each patch into a vector, which serves as the embedded representation of the image.

The feature extraction module, which extracts key features of the image by further processing the embedded representation. In ViT, this module typically includes multiple Transformer layers that apply self-attention to the image embedding and capture the relationships among the patches within the image.

The clustering module, which clusters the feature vectors extracted from all sample data of the current task using a clustering algorithm and stores the cluster centers in a list to be used as keys for subsequently matching prompt vectors.

The prompt vector initialization module, which initializes the prompt vector for the current task. Adding new parameters for each new task maintains the plasticity of the model, while freezing the prompt vectors of old tasks maintains its stability.

The feature fusion module, which fuses the task-specific prompt vector with the picture embedded representation. Adjusting the parameters of the prompt vector changes the data distribution of the downstream task so that it better fits the data distribution of the pre-trained model.

The image encoder module, which encodes the fused vector of the prompt vector and the picture embedding into higher-level features, typically low-dimensional vectors or matrices. This more abstract representation captures key features of the input data and helps the model understand and process the input.

The classification prediction module, which maps the high-level representation of the image to an output class. This module typically includes one or more fully connected layers that learn the mapping from the extracted features to the given classes, thereby implementing image classification.

The training module, which covers the definition of the loss function, the selection of the optimizer and the updating of model parameters, and adjusts the model on the training data to improve its performance. In deep learning, the training module is the core component for backpropagation and gradient descent; it optimizes the model parameters through repeated iteration so that the model better fits the task.

Referring to fig. 4, in the prediction phase, the image classification apparatus includes:

The selector module, which selects the task center closest to the input and thereby determines the prompt vector corresponding to the current input.

The feature fusion module, which fuses the prompt vector chosen by the selector with the picture embedded representation. Adjusting the parameters of the prompt vector changes the data distribution of the current task so that it better fits the data distribution of the pre-trained model.

The clustering module uses hierarchical clustering. Compared with other clustering algorithms such as k-means, hierarchical clustering is relatively robust to noise and outliers, is not limited by the shape and size of clusters, and can find clusters of different sizes and shapes. On the ImageNet-R dataset, the probability of correctly selecting the prompt vector with hierarchical clustering is about 3 percentage points higher than with k-means.

The selector module works by selecting the task center closest to the input features, which determines the prompt vector corresponding to the current input. During inference, features are extracted from a given input using the pre-trained model and used as the query vector; the Euclidean distances between the query and all cluster center keys are calculated, and the prompt vector corresponding to the cluster center with the smallest distance is selected. This selection mechanism matches each input to the most relevant task center in class-incremental learning and then selects the corresponding prompt vector, so that the model adapts effectively to new tasks.

Example 3

Referring to FIG. 6, this embodiment provides an electronic device, which includes a processor, an internal bus, a network interface, a memory and a nonvolatile memory, and may also include hardware required by other services. The processor reads the corresponding computer program from the nonvolatile memory into the memory and then runs it to implement the methods shown in FIG. 1 and/or FIG. 2 described above.

Example 4

This embodiment provides a computer-readable storage medium comprising one or more programs for execution by one or more processors of an electronic device, the one or more programs including instructions for performing the class-incremental learning image classification method based on prompt tuning of embodiment 1.

While the invention has been described with reference to certain preferred embodiments, those skilled in the art will understand that various changes and equivalent substitutions may be made without departing from the scope of the invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims

1. A class-incremental learning image classification method based on prompt tuning, characterized by comprising the following steps:

2. The class-incremental learning image classification method based on prompt tuning according to claim 1, wherein the clustering process for obtaining the cluster centers in each round of class-incremental learning comprises the following steps:

3. The class-incremental learning image classification method based on prompt tuning according to claim 1, wherein the cluster center is used as a key, and the prompt vector and the classifier from each round of class-incremental learning are used as the value corresponding to that key.

4. The class-incremental learning image classification method based on prompt tuning according to claim 1, wherein the step of obtaining a predicted image classification result using the classifier based on the new feature vector comprises:

5. The class-incremental learning image classification method based on prompt tuning according to claim 1, wherein the input embedding layer and the image encoder are pre-trained and their parameters are kept frozen during class-incremental learning and prediction.

6. The class-incremental learning image classification method based on prompt tuning according to claim 1, wherein the image encoder adopts a Transformer architecture based on a multi-head attention mechanism.

7. The class-incremental learning image classification method based on prompt tuning according to claim 1, wherein the preprocessing comprises one or more of random cropping, random flipping, image resizing, normalization and brightness adjustment.

8. A class-incremental learning image classification device based on prompt tuning, comprising:

the data preprocessing module is used for preprocessing an input image sample;

9. The class-incremental learning image classification device based on prompt tuning according to claim 8, further comprising:

10. A computer-readable storage medium comprising one or more programs for execution by one or more processors of an electronic device, the one or more programs including instructions for performing the class-incremental learning image classification method based on prompt tuning of any of claims 1-7.