
CN116012646A - Image data labeling method, electronic device and computer readable medium - Google Patents

Image data labeling method, electronic device and computer readable medium

Info

Publication number
CN116012646A
Authority
CN
China
Prior art keywords
image data
prediction
labeling
result
pool
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211678386.2A
Other languages
Chinese (zh)
Inventor
侯志申
王塑
周昕宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Kuangshi Jinzhi Technology Co ltd
Beijing Megvii Technology Co Ltd
Original Assignee
Beijing Megvii Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Megvii Technology Co Ltd filed Critical Beijing Megvii Technology Co Ltd
Priority to CN202211678386.2A priority Critical patent/CN116012646A/en
Publication of CN116012646A publication Critical patent/CN116012646A/en
Pending legal-status Critical Current


Abstract

An image data labeling method, an electronic device and a computer readable medium. The method comprises: obtaining image data to be labeled from a to-be-labeled data pool, and inputting the image data to be labeled into a prediction model to obtain a prediction result of a target object; adding the image data containing the prediction result to a prediction result pool; obtaining image data containing the prediction result from the prediction result pool, inputting it into a calibration module, and obtaining, through the calibration module, a labeling result produced by calibrating the prediction result; adding the image data containing the labeling result output by the calibration module to a labeled data pool; and obtaining image data containing the labeling result from the labeled data pool, and training the prediction model with it. The method and the device can improve both the labeling speed and the accuracy of the prediction model.

Description

Image data labeling method, electronic device and computer readable medium
Technical Field
The present invention relates to the field of deep learning technology, and more particularly, to a method for labeling image data, an electronic device, and a computer-readable medium.
Background
The excellent performance of deep learning depends on large amounts of labeled data, yet labeling data is a very time-consuming and labor-intensive task. For data in specialized fields in particular, labelers must have professional knowledge to label accurately, which increases the cost of training labeling personnel. In addition, labelers inevitably make labeling errors, and correcting those errors consumes further labeling cost.
In the image data labeling of classification tasks, a common way to accelerate labeling is to pre-label images with a trained model and then manually confirm and modify the pre-labels, increasing labeling speed by alternating model prediction with manual confirmation. However, labeling image detection data is harder than labeling classification data: on one hand, a detection label includes not only the target class but also the target's position in the image, and labeling and confirming the position of a prediction box consumes more time; on the other hand, an image may contain multiple targets, each of which must be treated as an independent individual while also bearing some relationship to the others, which further increases labeling effort.
Disclosure of Invention
This summary introduces a selection of concepts in simplified form that are further described in the detailed description. It is not intended to identify the key or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
According to an aspect of the present invention, there is provided a method of labeling image data, the method comprising: obtaining image data to be annotated from a data pool to be annotated, and inputting the image data to be annotated into a prediction model to obtain a prediction result of a target object; adding the image data containing the prediction result into a prediction result pool; obtaining image data containing the predicted result from the predicted result pool, inputting the image data containing the predicted result into a calibration module, and obtaining a labeling result obtained by calibrating the predicted result through the calibration module; adding the image data which is output by the calibration module and contains the labeling result into a labeled data pool; and acquiring image data containing the labeling result from the labeled data pool, and training the prediction model by using the image data containing the labeling result.
In one embodiment, the image data containing the annotation result further comprises a containing box containing at least one target object; the method further comprises the steps of: intercepting the image data in the containing frame to obtain a sub-image, and adding the sub-image into the data pool to be marked; and after the labeling of the sub-image is completed, updating the labeling result of the sub-image into the image data corresponding to the sub-image and containing the labeling result.
In one embodiment, the containing box is a containing box marked by the calibration module, or the containing box is a containing box generated based on a prediction result output by the prediction model.
In one embodiment, the obtaining image data containing the prediction result from the prediction result pool includes: and extracting the image data containing the prediction result from the image data of each level according to a target proportion, wherein the image data of the Nth level is obtained by cutting a containing frame in the image data of the N-1 th level, and N is greater than or equal to 1.
In one embodiment, the prediction result includes a position of the prediction frame and a confidence of the prediction frame; the obtaining the image data containing the prediction result from the prediction result pool comprises the following steps: determining entropy values of a plurality of prediction frames in the image data according to the confidence degrees of the prediction frames; and determining the marking difficulty index of the image data according to the entropy values of the plurality of prediction frames, and obtaining at least one image data with the lowest marking difficulty index.
In one embodiment, before the image data to be annotated is acquired from the data pool to be annotated, and is input into the prediction model, the method further comprises: obtaining image data to be marked from the data pool to be marked, inputting the image data to be marked into the calibration module, and obtaining an initialization marking result obtained by marking the image data to be marked through the calibration module; and adding the image data containing the initialization labeling result output by the calibration module into the labeled data pool, and training the prediction model by using the image data containing the initialization labeling result.
In one embodiment, the prediction model includes a prediction module and a regression module, where the prediction module is configured to identify a target object in the image data to be annotated, so as to obtain a primary prediction result; the regression module is used for acquiring local image data intercepted from the image data based on the primary prediction result and identifying a target object in the local image data so as to obtain a secondary prediction result; and the prediction result output by the prediction model is the secondary prediction result.
A second aspect of an embodiment of the present invention provides an electronic device including a memory and a processor, the memory storing a computer program for execution by the processor, the computer program, when executed by the processor, performing the method of annotating image data as described above.
A third aspect of the embodiments of the present invention provides a computer-readable medium having stored thereon a computer program which, when run, performs the image data annotation method as described above.
A fourth aspect of the embodiments of the present invention provides a computer program product comprising a computer program/instruction which, when executed by a processor, implements a method of labelling image data as described above.
According to the image data labeling method, the electronic equipment and the computer readable medium, a prediction result is generated based on the prediction model, and the prediction result is calibrated to obtain the labeling result, so that the labeling speed can be increased; meanwhile, the image data containing the labeling result is utilized to train the prediction model in the labeling process, so that the accuracy of the prediction model is improved.
Drawings
The above and other objects, features and advantages of the present invention will become more apparent from the following more particular description of embodiments of the present invention, as illustrated in the accompanying drawings. The accompanying drawings are included to provide a further understanding of embodiments of the invention and are incorporated in and constitute a part of this specification; they illustrate the invention and, together with the description of its embodiments, serve to explain the invention without limiting it. In the drawings, like reference numerals generally refer to like parts or steps.
FIG. 1 shows a schematic block diagram of an example electronic device for implementing a method of labeling image data according to an embodiment of the invention;
FIG. 2 shows a schematic flow chart of a method of labeling image data according to an embodiment of the invention;
FIG. 3 shows a system configuration diagram of a labeling method of image data according to an embodiment of the present invention;
FIG. 4 shows a schematic diagram of an initialization phase according to an embodiment of the invention;
FIG. 5 shows a schematic diagram of an annotation interface for an initialization stage, according to an embodiment of the invention;
FIG. 6 shows a schematic diagram of a quick annotation stage according to an embodiment of the invention;
FIG. 7 shows a schematic diagram of an annotation interface for a quick annotation stage according to an embodiment of the invention;
FIG. 8 shows a schematic diagram of a predictive model according to an embodiment of the invention;
Fig. 9 shows a schematic block diagram of an electronic device according to an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, exemplary embodiments according to the present invention will be described in detail with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, embodiments of the present invention, and it should be understood that the present invention is not limited by the example embodiments described herein. Based on the embodiments of the invention described in this application, all other embodiments obtained by a person skilled in the art without inventive effort shall fall within the scope of the invention.
In recent years, artificial-intelligence-based research in computer vision, deep learning, machine learning, image processing, image recognition and related fields has advanced significantly. Artificial intelligence (AI) is an emerging science and technology that studies and develops theories, methods, techniques and application systems for simulating and extending human intelligence. AI is a comprehensive discipline involving many technical areas, such as chips, big data, cloud computing, the Internet of things, distributed storage, deep learning, machine learning and neural networks. Computer vision is an important branch of AI; specifically, it enables machines to recognize the world. Computer vision technologies generally include face recognition, liveness detection, fingerprint recognition and anti-counterfeit verification, biometric feature recognition, face detection, pedestrian detection, object detection, pedestrian recognition, image processing, image recognition, image semantic understanding, image retrieval, character recognition, video processing, video content recognition, behavior recognition, three-dimensional reconstruction, virtual reality, augmented reality, simultaneous localization and mapping (SLAM), computational photography, and robot navigation and positioning. With the research and progress of AI technology, these techniques have found application in many fields, such as security, city management, traffic management, building management, park management, face-based access, face attendance, logistics management, warehouse management, robots, intelligent marketing, computational photography, mobile phone imaging, cloud services, smart homes, wearable devices, unmanned and autonomous driving, intelligent healthcare, face payment, face unlocking, fingerprint unlocking, identity verification, smart screens, smart televisions, cameras, the mobile internet, live streaming, beauty and makeup applications, medical aesthetics, and intelligent temperature measurement.
The existing image data labeling method mainly has the following problems:
1. For a small target object in image data, labeling is often completed by zooming the image; this requires repeatedly alternating between searching for targets at large scale and confirming labeling boxes at small scale, which increases the complexity of the labeling process;
2. When the target object is too small, the prediction model has difficulty detecting it accurately; in this case manual labeling is needed, so the advantage of model assistance is lost. Moreover, once the accuracy of the prediction model is insufficient, modifying a prediction box is more complex than drawing a labeling box from scratch.
In view of the above problems, embodiments of the present invention provide an image data labeling method, an electronic device, and a computer-readable medium. First, an example electronic device 100 for implementing the image data labeling method of an embodiment of the present invention is described with reference to fig. 1.
As shown in fig. 1, electronic device 100 includes one or more processors 102, one or more storage devices 104, an input device 106, an output device 108, and an image sensor 110, which are interconnected by a bus system 112 and/or other forms of connection mechanisms (not shown). It should be noted that the components and structures of the electronic device 100 shown in fig. 1 are exemplary only and not limiting, as the electronic device may have other components and structures as desired.
The processor 102 may be a Central Processing Unit (CPU) or other form of processing unit having data processing and/or instruction execution capabilities, and may control other components in the electronic device 100 to perform desired functions.
The storage 104 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM) and/or cache memory. The non-volatile memory may include, for example, Read-Only Memory (ROM), hard disks, flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium and may be executed by the processor 102 to implement client functions and/or other desired functions in embodiments of the present invention as described below. Various applications and data, such as data used and/or generated by the applications, may also be stored in the computer-readable storage medium.
The input device 106 may be a device used by a user to input instructions and may include one or more of a keyboard, mouse, microphone, touch screen, and the like.
The output device 108 may output various information (e.g., images or sounds) to the outside (e.g., a user), and may include one or more of a display, a speaker, and the like.
The image sensor 110 may take images (e.g., photographs, videos, etc.) desired by the user and store the taken images in the storage device 104 for use by other components.
For example, an example electronic device for implementing the image data labeling method according to embodiments of the invention may be implemented as a smartphone, a tablet computer, or the like.
Next, an image data labeling method 200 according to an embodiment of the present invention will be described with reference to fig. 2. As shown in fig. 2, the method 200 for labeling image data according to the embodiment of the present invention includes the following steps:
in step S210, obtaining image data to be annotated from a data pool to be annotated, and inputting the image data to be annotated into a prediction model to obtain a prediction result of a target object;
in step S220, adding the image data containing the prediction result to a prediction result pool;
in step S230, image data including the prediction result is obtained from the prediction result pool, the image data including the prediction result is input to a calibration module, and a labeling result obtained by calibrating the prediction result is obtained through the calibration module;
in step S240, the image data including the labeling result output by the calibration module is added to the labeled data pool;
in step S250, image data including the labeling result is obtained from the labeled data pool, and the prediction model is trained using the image data including the labeling result.
According to the image data labeling method 200, a prediction result is generated based on the prediction model, and the prediction result is calibrated through the calibration module to obtain a labeling result, so that the labeling speed of the image data can be increased; meanwhile, the image data containing the labeling result is utilized to train the prediction model in the labeling process, so that the accuracy of the prediction model is improved. The labeling method provided by the embodiment of the invention can be used for labeling image data of license plates, faces, characters and other arbitrary objects.
Referring to fig. 3, a system configuration diagram of the image data labeling method according to an embodiment of the present invention is shown. Image data to be labeled is obtained from the to-be-labeled data pool, and the prediction model produces a prediction result for at least one target object. The prediction result includes the position of a prediction box surrounding the target object, the confidence of the prediction box, and so on. Since the prediction results generated by the prediction model may contain missed or false detections, the image data including the prediction result output by the prediction model is added to a prediction result pool; image data containing prediction results is then selected from the prediction result pool and sent to the calibration module, which can push it to a user for calibration, for example to adjust the position of a prediction box, delete a prediction box that does not contain a target object, or add labeling boxes for unrecognized target objects. The labeled data pool stores the image data containing labeling results after calibration by the calibration module; on one hand the labeled image data can be exported, and on the other hand it can be used to train the prediction model to improve its accuracy. These steps are repeated until the labeling of all image data to be labeled is finally completed. The following description mainly takes user calibration as an example, but in some embodiments the calibration module may also calibrate the image data containing prediction results using other machine learning models.
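The patent does not prescribe any particular implementation of this loop, but a minimal sketch can make the flow concrete (hypothetical Python; the pools are modeled as simple queues, and `predict`, `calibrate`, and `train` are stubs standing in for the prediction model, the calibration module, and model training):

```python
from collections import deque
from dataclasses import dataclass, field

# Minimal sketch of the three-pool loop described above; all names are
# illustrative stand-ins, not interfaces defined by the patent.

@dataclass
class Image:
    pixels: object
    predictions: list = field(default_factory=list)
    labels: list = field(default_factory=list)

def predict(image):            # prediction model: boxes + confidences (stub)
    return [((0, 0, 10, 10), 0.9)]

def calibrate(predictions):    # calibration module: user fixes the predictions (stub)
    return list(predictions)

def train(labeled_pool):       # retrain the prediction model on calibrated labels (stub)
    pass

unlabeled = deque(Image(pixels=None) for _ in range(4))  # data pool to be labeled
prediction_pool, labeled = deque(), []                   # prediction result / labeled pools

while unlabeled or prediction_pool:
    if unlabeled:
        img = unlabeled.popleft()
        img.predictions = predict(img)           # S210: predict target objects
        prediction_pool.append(img)              # S220: add to prediction result pool
    if prediction_pool:
        img = prediction_pool.popleft()
        img.labels = calibrate(img.predictions)  # S230: calibrate the predictions
        labeled.append(img)                      # S240: add to labeled data pool
    if labeled:
        train(labeled)                           # S250: train the prediction model
```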
Illustratively, the overall labeling flow may be divided into two phases: an initialization phase and a quick labeling phase. In the initialization stage, the prediction model is not sufficiently trained, and more manual labeling is needed by a user, so that the labeling speed is slower. In the quick labeling stage, the prediction model has a certain inference prediction capability, and can output relatively accurate prediction results, and only the prediction results output by the prediction model are required to be calibrated at the moment, so that the labeling speed is higher. In some embodiments, the prediction model may be a model with a certain inference prediction capability, and at this time, a stage of label initialization may be omitted, so as to directly obtain a prediction result output by the prediction model, and calibrate the prediction result by a user.
Illustratively, the initialization phase may be divided into two parts: system initialization and label initialization. System initialization acquires all image data to be labeled and initializes the to-be-labeled data pool from it; the labeled data pool and the prediction result pool are initialized to empty.
After the system initialization is completed, the image data to be annotated is obtained from the data pool to be annotated, the image data to be annotated is input into the calibration module and pushed to a user for annotation, and an initialization annotation result obtained by the user for annotating the image data to be annotated is obtained through the calibration module. And then, adding the image data containing the initialized labeling result into the labeled data pool, and training the prediction model by utilizing the image data containing the initialized labeling result.
During labeling, the targets to be labeled may be too small or too numerous. For an overly small target, the detection capability of the prediction model is poor and it is difficult for the model to output a qualifying prediction box, while a user must manually zoom the image when labeling a small target, which consumes labeling time. When there are many targets, labeling a single image takes long, so a long time is needed to label enough data to train the prediction model; the early-stage prediction model is therefore undertrained and struggles to reach good detection performance. Meanwhile, the probability that the prediction model detects all targets in an image correctly is low, so the user must manually label every undetected target before submitting, and the amount of training data grows slowly.
To address these problems, the embodiment of the present invention adopts a labeling mode with containing boxes: during calibration, a larger containing box can be used to label an overly small target, where the containing box contains at least one target object. After image data with a containing box is obtained, the image data inside the containing box can be cropped out and added to the to-be-labeled data pool. In subsequent labeling, the region inside the containing box can be enlarged and then handed to the prediction model for detection and to the calibration module for calibration; moreover, the prediction model can run inference on it after being trained more fully, yielding a more accurate prediction result. Similarly, when there are many target objects, the user may label only some of them and cover the rest with containing boxes (a containing box may contain multiple target objects), so that as much image data as possible is added to the labeled data pool as early as possible, and the deferred regions can be labeled once the prediction model performs better.
Illustratively, a containing box may be labeled through the calibration module. The image data containing the labeling result and the containing box is added to the labeled data pool; meanwhile, the image data inside the containing box is cropped out as a sub-image and added to the to-be-labeled data pool to await subsequent prediction or labeling. After a sub-image is labeled, its labeling result can be updated into the image data containing the labeling result that corresponds to the sub-image; once all sub-images of the containing boxes in one image data are labeled, the whole image data is considered labeled. Assuming the image data added to the to-be-labeled data pool during system initialization is level 0, a sub-image cropped from a containing box of level-0 image data is level 1, a sub-image cropped from a level-1 sub-image is level 2, and so on; the image data of the Nth level is cropped from a containing box in the image data of the (N-1)th level, where N is greater than or equal to 1. When a sub-image contains only labeling boxes and no containing boxes, the level-0 image data corresponding to that sub-image is said to be fully labeled.
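A minimal sketch of this crop-and-merge bookkeeping (hypothetical Python; the dictionary fields, the `(x0, y0, x1, y1)` box convention and NumPy-style pixel slicing are assumptions):

```python
# Hypothetical sketch: crop containing boxes into sub-images one level deeper,
# then merge finished sub-image labels back into the parent image data.

def spawn_subimages(image, containing_boxes, unlabeled_pool):
    """Cut each containing box out of `image` and queue it for later labeling."""
    for (x0, y0, x1, y1) in containing_boxes:
        sub = {
            "pixels": image["pixels"][y0:y1, x0:x1],  # region inside the box
            "level": image["level"] + 1,              # level N from level N-1
            "parent": image,                          # so results can be merged back
            "offset": (x0, y0),                       # maps labels to parent coordinates
            "labels": [],
        }
        unlabeled_pool.append(sub)

def merge_back(sub):
    """Update the sub-image's labeling result into its parent image data."""
    ox, oy = sub["offset"]
    for (x0, y0, x1, y1) in sub["labels"]:
        sub["parent"]["labels"].append((x0 + ox, y0 + oy, x1 + ox, y1 + oy))
```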
When the annotated data pool is not empty, training of the predictive model using the image data in the annotated data pool that contains the annotation result is started. Illustratively, when training a predictive model using image data with labeling results and containing boxes, no sample loss is calculated for the regions in the containing boxes. Specifically, neither positive nor negative sample loss is calculated for the region in the containing box, thereby avoiding that the region in the containing box affects the accuracy of the prediction model.
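One way to realize this masking (a sketch, assuming per-location classification losses on a feature grid with a known stride; the patent does not fix these details):

```python
import torch

def masked_detection_loss(per_location_loss, containing_boxes, stride):
    """Zero the loss at grid locations inside any containing box.

    per_location_loss: (H, W) tensor of per-location sample losses.
    containing_boxes: iterable of (x0, y0, x1, y1) in image coordinates.
    """
    mask = torch.ones_like(per_location_loss)
    for (x0, y0, x1, y1) in containing_boxes:
        # map image coordinates onto the feature grid and blank the region,
        # so neither positive nor negative sample loss is counted there
        mask[int(y0 // stride): int(y1 // stride) + 1,
             int(x0 // stride): int(x1 // stride) + 1] = 0.0
    return (per_location_loss * mask).sum() / mask.sum().clamp(min=1.0)
```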
In the training process, the prediction model can output a prediction result, so that in the initialization stage, the calibration module can acquire image data to be annotated from the data pool to be annotated for annotation, and can acquire the image data which contains the prediction result and is output by the prediction model from the prediction result pool for calibration.
In the early stage of labeling, many containing boxes may be drawn, so the prediction model receives too few positive samples and struggles to obtain enough positive-sample information, which seriously hampers its training. To ensure that sufficient positive samples are labeled during the initialization stage, the calibration module in the embodiment of the present invention uses a hierarchical sampling strategy that samples image data from the different levels in proportion, preventing calibration from concentrating on the original level-0 image data and generating excessive containing boxes. Specifically, as shown in fig. 4, when image data to be labeled is obtained from the to-be-labeled data pool, it is extracted from the image data of each level according to a target ratio. Similarly, when image data to be calibrated is obtained from the prediction result pool, the image data containing prediction results is also extracted from each level at a target ratio, as sketched below.
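A sketch of such stratified sampling (hypothetical Python; the per-level ratios are free parameters the patent leaves open):

```python
import random
from collections import defaultdict

def stratified_sample(pool, target_ratio, batch_size):
    """Draw a batch from `pool` with a fixed fraction per hierarchy level.

    pool: list of image dicts carrying a 'level' key.
    target_ratio: e.g. {0: 0.5, 1: 0.3, 2: 0.2} (illustrative values).
    """
    by_level = defaultdict(list)
    for img in pool:
        by_level[img["level"]].append(img)
    batch = []
    for level, frac in target_ratio.items():
        k = min(len(by_level[level]), round(frac * batch_size))
        batch.extend(random.sample(by_level[level], k))  # sample without replacement
    return batch
```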
The calibration interface of the initialization stage in one embodiment of the present invention is shown in fig. 5. The calibration module pushes the image data to be labeled to the front end for the user to calibrate. During calibration the user may choose to label a target box, a containing box, or an ignore box, and the box type may be selected with the number keys, e.g., 1, 2 and 3 for the target box, containing box and ignore box, respectively. A box is drawn by left-clicking and dragging; an inaccurately drawn box can be adjusted by clicking and dragging it; a wrongly drawn box can be deleted with a right click. When labeling is complete, the "final confirm" button can be clicked, or the enter key pressed, to submit. Illustratively, labeling difficulty is high in the initialization stage, so when image data to be labeled is obtained from the to-be-labeled data pool or the prediction result pool and input to the calibration module, one image at a time is input.
After the initialization stage, the prediction model has acquired a certain prediction capability, so the process can transition from the initialization stage to the fast labeling stage: while the prediction model continues to be trained, it performs inference on image data to be labeled to produce prediction results. In general, the fast labeling stage is a process in which training and inference of the prediction model alternate repeatedly with calibration by the calibration module. In the fast labeling stage the calibration difficulty is reduced, and at least two items of image data can be input to the calibration module at a time.
Illustratively, if the number of image data in the labeled data pool is less than a first threshold and/or the number of training iterations of the prediction model is less than a second threshold, the initialization stage is maintained. If the number of image data in the labeled data pool is greater than or equal to the first threshold and/or the training iterations are greater than or equal to the second threshold, the fast labeling stage is entered.
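A toy sketch of this switch (hypothetical Python; the threshold values are illustrative, and the `and` combination is just one of the "and/or" variants the text allows):

```python
FIRST_THRESHOLD = 200    # minimum labeled images (assumed value)
SECOND_THRESHOLD = 1000  # minimum training iterations (assumed value)

def current_stage(num_labeled, num_train_steps):
    """Return which labeling stage the system is in."""
    if num_labeled >= FIRST_THRESHOLD and num_train_steps >= SECOND_THRESHOLD:
        return "fast_labeling"
    return "initialization"
```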
As shown in fig. 6, in the fast labeling stage the prediction model samples several items of data to be labeled from the to-be-labeled data pool for inference, and the image data containing the resulting predictions is added to the prediction result pool; meanwhile, the calibration module samples several items of image data containing prediction results from the prediction result pool according to a corresponding ordering strategy and pushes them to the user for calibration. While the user is labeling, the prediction model keeps running inference on the data to be labeled and updating the prediction result pool, ensuring that the prediction results in the pool grow more accurate. In addition, the prediction model is continuously trained during this process, improving its detection capability. As the amount of data in the labeled data pool grows, the image data available for training the prediction model is also enriched, so the prediction model can be trained more fully.
Illustratively, as shown in fig. 8, to address the problem that prediction model performance is poor when the data volume is small, the embodiment of the present invention adds a separate regression module after the prediction module of the prediction model to refine the results output by the prediction module. The prediction module is used to identify target objects in the image data to be labeled to obtain a primary prediction result; the regression module is used to take local image data cropped from the image data based on the primary prediction result and to identify target objects in the local image data to obtain a secondary prediction result. The prediction result output by the prediction model is the secondary prediction result output by the regression module.
When the prediction model is used to produce predictions, N images of standard size to be labeled (including but not limited to RGB images) are input to the prediction module, which outputs the primary prediction result of each image, specifically the position, size and confidence of each primary prediction box. During training, cross-entropy loss is computed from the confidence of the primary prediction boxes and the positive/negative sample labels obtained from the labeled data pool, and for prediction boxes belonging to positive samples, an IoU loss is computed between the position and size of the primary prediction box and those of the true labeled box, realizing regression of the prediction boxes. Illustratively, the backbone of the prediction module may be implemented with networks such as VGG, ResNet or EfficientNet, using prediction frameworks such as FCOS or YOLOX; training of the prediction module can be carried out with optimizers such as SGD or Adam, with an initial learning rate of 1e-4.
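A sketch of these two loss terms (hypothetical PyTorch with assumed tensor conventions; matching positive boxes to ground-truth boxes happens outside this snippet):

```python
import torch
import torch.nn.functional as F

def iou_loss(pred_boxes, gt_boxes):
    """1 - IoU for matched axis-aligned boxes given as (x0, y0, x1, y1) rows."""
    x0 = torch.max(pred_boxes[:, 0], gt_boxes[:, 0])
    y0 = torch.max(pred_boxes[:, 1], gt_boxes[:, 1])
    x1 = torch.min(pred_boxes[:, 2], gt_boxes[:, 2])
    y1 = torch.min(pred_boxes[:, 3], gt_boxes[:, 3])
    inter = (x1 - x0).clamp(min=0) * (y1 - y0).clamp(min=0)
    area_p = (pred_boxes[:, 2] - pred_boxes[:, 0]) * (pred_boxes[:, 3] - pred_boxes[:, 1])
    area_g = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    iou = inter / (area_p + area_g - inter).clamp(min=1e-6)
    return (1.0 - iou).mean()

def detection_loss(confidences, labels, pos_pred_boxes, pos_gt_boxes):
    """Cross-entropy on confidences vs. 0/1 labels, plus IoU loss on positives."""
    cls = F.binary_cross_entropy_with_logits(confidences, labels.float())
    reg = iou_loss(pos_pred_boxes, pos_gt_boxes)  # positive samples only
    return cls + reg
```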
It should be noted that, besides the actual labeling boxes, the training data also includes containing boxes that were generated or labeled in order to defer labeling; neither positive nor negative sample loss is calculated for the regions inside containing boxes.
The regression module takes the output of the prediction module as its input. If a primary prediction result output by the prediction module can be matched to a real target box, the local image data cropped and enlarged around the primary prediction box is used as the input of the regression module, and its output is the secondary prediction result, specifically the position, size and confidence of the secondary prediction box. The corresponding real target box serves as the label of the regression module, and a mean squared error is minimized between the output secondary prediction box and the corresponding real target box, realizing the training of the regression module. Illustratively, the regression module may use a network such as ResNet or MobileNet as its backbone and be trained with an optimizer such as SGD or Adam, with an initial learning rate of 1e-3. Because the input used by the regression module is local image data cropped around the primary prediction result, it can focus on the target object of interest while surrounding irrelevant background is filtered out.
In some embodiments, to take advantage of the prior knowledge carried by the primary prediction result itself, the primary prediction box may be input into the regression module as a special anchor box, so that the regression module outputs the offset of the secondary prediction box relative to the anchor box. Specifically, if the center point coordinates and the width and height of the primary prediction box are written as x_d, y_d, w_d, h_d, and the center point coordinates and the width and height of the labeled box are written as x_g, y_g, w_g, h_g, the ground truth output by the regression module takes the standard anchor-offset form:

    t_x = (x_g − x_d) / w_d,    t_y = (y_g − y_d) / h_d
    t_w = log(w_g / w_d),       t_h = log(h_g / h_d)

Accordingly, the coordinates of the secondary prediction box may be obtained by inverting this parameterization:

    x = x_d + w_d · t_x,    y = y_d + h_d · t_y
    w = w_d · exp(t_w),     h = h_d · exp(t_h)
Inputting the primary prediction box into the regression module as prior knowledge in this way provides guidance to the regression module and further improves the performance of the model.
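A sketch of this encoding and its inverse (hypothetical Python; the `(cx, cy, w, h)` center-size box convention is assumed):

```python
import math

def encode(anchor, target):
    """Regression targets for `target` relative to the primary box `anchor`."""
    xd, yd, wd, hd = anchor   # primary prediction box (the anchor)
    xg, yg, wg, hg = target   # labeled (ground-truth) box
    return ((xg - xd) / wd, (yg - yd) / hd,
            math.log(wg / wd), math.log(hg / hd))

def decode(anchor, offsets):
    """Invert `encode` to recover the secondary prediction box."""
    xd, yd, wd, hd = anchor
    tx, ty, tw, th = offsets
    return (xd + wd * tx, yd + hd * ty,
            wd * math.exp(tw), hd * math.exp(th))
```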
The accuracy of the prediction results finally output by the prediction model cannot be fully guaranteed, and image data may still contain many unlabeled targets that would otherwise have to be labeled entirely by hand. The embodiment of the present invention therefore chooses to generate containing boxes from low-quality prediction results and input them to the calibration module, shortening the time spent on user confirmation and manual box drawing. Specifically, at least two prediction boxes in the same image data may be merged into one containing box, or a single prediction box in the image data may be enlarged into a containing box.
Illustratively, parameters such as the size, shape and confidence of the prediction boxes may be used as criteria for generating containing boxes. For example, whether prediction boxes need to be merged can be decided from their areas, with at least two prediction boxes whose areas are smaller than a third threshold merged into a containing box. During merging, a greedy algorithm can search over these prediction boxes and repeatedly merge them in the way that incurs the minimum loss, until they are merged into a final containing box, where the loss is calculated as:

    L = α·L_area + β·L_time + γ·L_shape + λ·L_score

Here L_area denotes the remaining area to label after merging, L_time the labeling time required after merging, L_shape the aspect ratio of the merged containing box, and L_score the entropy of the prediction boxes remaining after merging, calculated from the confidence of each prediction box.
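A sketch of scoring one candidate merge under this loss (hypothetical Python; the unit weights and the stand-ins for L_time and L_score are assumptions, since the patent does not define them concretely):

```python
import math

def binary_entropy(p, eps=1e-6):
    p = min(max(p, eps), 1.0 - eps)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def merge_loss(boxes, confidences, alpha=1.0, beta=1.0, gamma=1.0, lam=1.0):
    """Score merging `boxes` (each (x0, y0, x1, y1)) into one containing box."""
    x0 = min(b[0] for b in boxes); y0 = min(b[1] for b in boxes)
    x1 = max(b[2] for b in boxes); y1 = max(b[3] for b in boxes)
    hull_area = (x1 - x0) * (y1 - y0)
    box_area = sum((b[2] - b[0]) * (b[3] - b[1]) for b in boxes)
    l_area = hull_area - box_area                 # leftover area to label (ignores overlap)
    l_time = len(boxes)                           # proxy for labeling time (assumption)
    l_shape = max((x1 - x0) / (y1 - y0),
                  (y1 - y0) / (x1 - x0))          # aspect ratio of the merged box
    l_score = sum(binary_entropy(p) for p in confidences)  # entropy term (assumption)
    return alpha * l_area + beta * l_time + gamma * l_shape + lam * l_score
```

A greedy search would then repeatedly merge whichever candidate pair yields the smallest `merge_loss`.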
In some embodiments, in order to adapt to more labeling tasks and improve flexibility and robustness, an adaptive prediction box conversion scheme may further be adopted: an end-to-end model learns different conversion modes for different data pools, and a trained prediction box conversion model converts at least two prediction boxes into a containing box.
Specifically, each prediction box output by the model is represented by a vector b_i^k ∈ R^d, where i indexes the image data, k indexes the prediction box within that image, and d = 6 is the dimension of the prediction box vector: four dimensions for the top-left and bottom-right coordinates of the box, one for its confidence score, and one for its type (i.e., whether the box is a containing box or not). For each image, the prediction box information can thus be represented as a k × d sequence; after it is input into the prediction box conversion model, a k × d output representing the converted result is obtained. Illustratively, the prediction box conversion model may be implemented with the encoder-decoder structure of a Transformer, with the training data being the pre-conversion and post-conversion prediction boxes collected during the labeling process. Using a learnable conversion model adaptively realizes the conversion from prediction boxes to containing boxes, so the model can be targeted to different data pools while sparing the delicate design of hand-crafted conversion rules.
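A sketch of such a conversion model (hypothetical PyTorch; the model width, layer counts, and feeding the same sequence to both encoder and decoder are assumptions, not details fixed by the patent):

```python
import torch
import torch.nn as nn

class BoxConverter(nn.Module):
    """Maps a k x 6 sequence of prediction boxes to a converted k x 6 sequence."""
    def __init__(self, d_box=6, d_model=64, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Linear(d_box, d_model)
        self.transformer = nn.Transformer(d_model=d_model, nhead=nhead,
                                          num_encoder_layers=num_layers,
                                          num_decoder_layers=num_layers,
                                          batch_first=True)
        self.project = nn.Linear(d_model, d_box)

    def forward(self, boxes):           # boxes: (batch, k, 6)
        x = self.embed(boxes)
        y = self.transformer(x, x)      # decode conditioned on the same sequence
        return self.project(y)          # converted boxes: (batch, k, 6)

converted = BoxConverter()(torch.rand(1, 5, 6))  # one image with 5 prediction boxes
```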
When image data containing prediction results is obtained from the prediction result pool and input to the calibration module, the several items of image data that are easiest to label can be selected and input to the calibration module. Specifically, the entropy values of the prediction boxes are determined from the confidences of the multiple prediction boxes in the image data; the labeling difficulty index of the image data is then determined from the entropy values of those prediction boxes, the image data is sorted by difficulty index, and at least one item of image data with the lowest labeling difficulty index is input to the calibration module. Illustratively, the labeling difficulty index is calculated as follows:
    D = Σ_k −[ p_k · log(p_k) + (1 − p_k) · log(1 − p_k) ]

where p_k denotes the confidence of the kth prediction box and the sum runs over the prediction boxes in the image; for a containing box, the confidence is taken as the highest confidence among the prediction boxes inside it.
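A sketch of this ranking (hypothetical Python; it assumes the sum-of-binary-entropies form above and a simple data layout):

```python
import math

def difficulty_index(confidences, eps=1e-6):
    """Sum the binary entropy of each prediction box's confidence."""
    total = 0.0
    for p in confidences:
        p = min(max(p, eps), 1.0 - eps)
        total += -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))
    return total

def easiest_images(images, n=1):
    """images: list of (image_id, [box confidences]); return the n easiest."""
    return sorted(images, key=lambda item: difficulty_index(item[1]))[:n]
```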
Illustratively, the calibration interface of the fast labeling stage is shown in fig. 7. Through the calibration interface the user may modify the position of a prediction box, add a prediction box, or delete a prediction box. As in the initialization stage, the user may choose whether to label a target box or a containing box as needed. After confirming and completing the labeling, clicking the "final confirm" button or pressing enter submits the result: the calibration module adds the image data containing the labeling result to the labeled data pool for subsequent training of the prediction model, and if the image data still contains a containing box, the region inside it is cropped out and added to the to-be-labeled data pool as a sub-image awaiting subsequent sampling, prediction and calibration. When a cropped sub-image has been labeled, the calibration module updates its labeling result into the corresponding level-0 image data; that is, the level-0 image data is finished only when all regions inside its containing boxes are labeled. When all image data to be labeled has been labeled, "export sds" can be clicked to save the labeled data. Labeled image data can also be saved at any time during the labeling process.
In tests, the image data labeling method provided by the embodiment of the present invention improves labeling speed by 2-10 times; for example, in a license plate detection task the labeling speed can be improved 2-3 times, and in a cup labeling task it can be improved 7-8 times. Using containing boxes effectively raises labeling speed: for example, the license plate labeling speedup rises from 1.4 times without containing boxes to 1.8 times with them. After the regression model is added, the license plate labeling speedup further rises to 2.4 times. Besides effectively improving labeling speed, the labeling method of the embodiment of the present invention also improves the labeling experience; for example, when labeling small target objects, the user no longer needs to manually zoom the interface.
Based on the description, the image data labeling method according to the embodiment of the invention generates a prediction result based on the prediction model, and the calibration module calibrates the prediction result to obtain a labeling result, so that the labeling speed of the image data can be increased; meanwhile, the image data containing the labeling result is utilized to train the prediction model in the labeling process, so that the accuracy of the prediction model is improved.
The method for labeling image data according to the embodiment of the invention can be implemented in a device, an apparatus or a system having a memory and a processor.
By way of example, the image data annotation method according to the embodiment of the present invention may be deployed at a personal terminal, such as a smart phone, a tablet computer, a personal computer, or the like. Alternatively, the method for labeling image data according to the embodiment of the invention can be deployed at a server (or cloud). Alternatively, the labeling method of the image data according to the embodiment of the invention can be distributed and deployed at the server (or cloud) and the personal terminal. Alternatively, the labeling method of the image data according to the embodiment of the present invention may be distributed and deployed at different personal terminals.
The exemplary steps included in the image data labeling method according to the embodiment of the present invention have been described above. Next, an electronic device according to an embodiment of the present invention is described with reference to fig. 9, which shows a schematic block diagram of an electronic device 900 according to an embodiment of the present invention. The electronic device 900 includes a memory 910 and a processor 920. The memory 910 stores program code for implementing the corresponding steps of the image data labeling method according to the embodiment of the present invention. The processor 920 is configured to run the program code stored in the memory 910 to perform the corresponding steps of the image data labeling method 200 according to the embodiment of the present invention, which has been described above and is not repeated here.
Furthermore, according to an embodiment of the present application, there is also provided a storage medium on which program instructions are stored, which program instructions, when executed by a computer or a processor, are adapted to carry out the respective steps of the labeling method of image data of the embodiments of the present application. The storage medium may include, for example, a memory card of a smart phone, a memory component of a tablet computer, a hard disk of a personal computer, read-only memory (ROM), erasable programmable read-only memory (EPROM), portable compact disc read-only memory (CD-ROM), USB memory, or any combination of the foregoing storage media. The computer-readable storage medium may be any combination of one or more computer-readable storage media.
According to an embodiment of the present application, there is also provided a computer program product including a computer program/instruction which, when executed by a processor, implements the method for labeling image data described above.
According to the image data labeling method, the electronic equipment and the computer readable medium, a prediction result is generated based on the prediction model, and the prediction result is calibrated through the calibration module to obtain the labeling result, so that the image data labeling speed can be increased; meanwhile, the image data containing the labeling result is utilized to train the prediction model in the labeling process, so that the accuracy of the prediction model is improved.
Although the illustrative embodiments have been described herein with reference to the accompanying drawings, it is to be understood that the above illustrative embodiments are merely illustrative and are not intended to limit the scope of the present invention thereto. Various changes and modifications may be made therein by one of ordinary skill in the art without departing from the scope and spirit of the invention. All such changes and modifications are intended to be included within the scope of the present invention as set forth in the appended claims.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, e.g., the division of the elements is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple elements or components may be combined or integrated into another device, or some features may be omitted or not performed.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in order to streamline the invention and aid in understanding one or more of the various inventive aspects, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the description of exemplary embodiments of the invention. However, the method of the present invention should not be construed as reflecting the following intent: i.e., the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
It will be understood by those skilled in the art that all of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or units of any method or apparatus so disclosed, may be combined in any combination, except combinations where the features are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings), may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features but not others included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.
Various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or other suitable processor may be used in practice to implement some or all of the functionality of some of the modules according to embodiments of the present invention. The present invention can also be implemented as an apparatus program (e.g., a computer program and a computer program product) for performing a portion or all of the methods described herein. Such a program embodying the present invention may be stored on a computer readable medium, or may have the form of one or more signals. Such signals may be downloaded from an internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. do not denote any order. These words may be interpreted as names.
The foregoing description is merely illustrative of specific embodiments of the present invention and the scope of the present invention is not limited thereto, and any person skilled in the art can easily think about variations or substitutions within the scope of the present invention. The protection scope of the invention is subject to the protection scope of the claims.

Claims (10)

1. A method for labeling image data, characterized in that the method comprises:
acquiring image data to be labeled from a to-be-labeled data pool, and inputting the image data to be labeled into a prediction model to obtain a prediction result of a target object;
adding the image data containing the prediction result to a prediction result pool;
acquiring the image data containing the prediction result from the prediction result pool, inputting the image data containing the prediction result into a calibration module, and obtaining, through the calibration module, a labeling result produced by calibrating the prediction result;
adding the image data containing the labeling result output by the calibration module to a labeled data pool; and
acquiring the image data containing the labeling result from the labeled data pool, and training the prediction model with the image data containing the labeling result.

2. The method for labeling image data according to claim 1, characterized in that:
the image data containing the labeling result further includes a containing frame, the containing frame containing at least one target object;
the method further comprises: cropping the image data within the containing frame to obtain a sub-image, and adding the sub-image to the to-be-labeled data pool; and
after labeling of the sub-image is completed, updating the labeling result of the sub-image into the image data containing the labeling result that corresponds to the sub-image.

3. The method for labeling image data according to claim 2, wherein the containing frame is a containing frame labeled by the calibration module, or the containing frame is a containing frame generated based on the prediction result output by the prediction model.

4. The method for labeling image data according to claim 2 or 3, characterized in that acquiring the image data containing the prediction result from the prediction result pool comprises:
extracting the image data containing the prediction result from the image data of each level according to a target ratio, wherein the image data of the Nth level is obtained by cropping the containing frames in the image data of the (N-1)th level, N being greater than or equal to 1.

5. The method for labeling image data according to claim 1, characterized in that the prediction result includes positions of prediction boxes and confidences of the prediction boxes, and acquiring the image data containing the prediction result from the prediction result pool comprises:
determining entropy values of the prediction boxes according to the confidences of a plurality of prediction boxes in the image data; and
determining a labeling difficulty index of the image data according to the entropy values of the plurality of prediction boxes, and acquiring at least one piece of image data with the lowest labeling difficulty index.

6. The method for labeling image data according to claim 1, characterized in that, before acquiring the image data to be labeled from the to-be-labeled data pool and inputting the image data to be labeled into the prediction model, the method further comprises:
acquiring image data to be labeled from the to-be-labeled data pool, inputting the image data to be labeled into the calibration module, and obtaining, through the calibration module, an initialization labeling result produced by labeling the image data to be labeled; and
adding the image data containing the initialization labeling result output by the calibration module to the labeled data pool, and training the prediction model with the image data containing the initialization labeling result.

7. The method for labeling image data according to claim 1, characterized in that the prediction model includes a prediction module and a regression module, wherein:
the prediction module is configured to recognize the target object in the image data to be labeled to obtain a primary prediction result;
the regression module is configured to acquire local image data cropped from the image data based on the primary prediction result and recognize the target object in the local image data to obtain a secondary prediction result; and
the prediction result output by the prediction model is the secondary prediction result.

8. An electronic device, characterized in that the electronic device comprises a memory and a processor, the memory storing a computer program to be run by the processor, wherein the computer program, when run by the processor, performs the method for labeling image data according to any one of claims 1-7.

9. A computer-readable medium, characterized in that a computer program is stored on the computer-readable medium, and the computer program, when run, performs the method for labeling image data according to any one of claims 1-7.

10. A computer program product comprising a computer program/instructions, characterized in that the computer program/instructions, when executed by a processor, implement the method for labeling image data according to any one of claims 1-7.
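To make the workflow of claim 1 concrete for implementers, the following is a minimal sketch of the closed loop between the to-be-labeled pool, the prediction result pool, the labeled pool, the prediction model, and the calibration module. Every name in it (LabelingPipeline, predict, calibrate, train) is a hypothetical stand-in chosen for illustration, not the applicant's implementation; the calibration module could be a human annotation interface or any other correction procedure.

```python
# Illustrative sketch of the claim 1 loop; all class and method names are
# hypothetical stand-ins, not the applicant's actual implementation.
from collections import deque

class LabelingPipeline:
    def __init__(self, model, calibrator):
        self.model = model                # prediction model (claim 1)
        self.calibrator = calibrator      # calibration module, e.g. an annotator UI
        self.unlabeled_pool = deque()     # to-be-labeled data pool
        self.prediction_pool = deque()    # pool of images with model predictions
        self.labeled_pool = []            # pool of images with calibrated labels

    def predict_step(self, batch_size=8):
        # Move images through the prediction model into the prediction result pool.
        for _ in range(min(batch_size, len(self.unlabeled_pool))):
            image = self.unlabeled_pool.popleft()
            predictions = self.model.predict(image)        # boxes + confidences
            self.prediction_pool.append((image, predictions))

    def calibrate_step(self, batch_size=8):
        # The calibrator corrects model predictions into final labeling results.
        for _ in range(min(batch_size, len(self.prediction_pool))):
            image, predictions = self.prediction_pool.popleft()
            labels = self.calibrator.calibrate(image, predictions)
            self.labeled_pool.append((image, labels))

    def train_step(self):
        # Retrain (or fine-tune) the prediction model on the calibrated labels.
        if self.labeled_pool:
            self.model.train(self.labeled_pool)
```

Because the three steps only exchange data through the pools, they can run concurrently: the model keeps improving while annotation proceeds, so later predictions need less and less correction.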
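Claims 2 to 4 describe a hierarchy in which containing frames are cropped into sub-images that re-enter the to-be-labeled pool, and in which sampling from the prediction result pool draws from each hierarchy level according to a target ratio. A rough sketch under assumed conventions (numpy-style image arrays, a per-item level field, and a level-to-fraction mapping for the target ratio) follows.

```python
# Rough sketch of the sub-image hierarchy in claims 2-4; the field names,
# the image indexing convention, and the form of the target ratio are all
# assumptions made for illustration.

def crop_containing_frames(item, frames):
    """Crop each containing frame of an image into a level-(N+1) sub-image."""
    sub_items = []
    for (x0, y0, x1, y1) in frames:
        sub_items.append({
            "image": item["image"][y0:y1, x0:x1],  # crop of the containing frame
            "level": item["level"] + 1,            # level N comes from level N-1
            "parent": item,                        # kept so the sub-image's label
        })                                         # can be written back (claim 2)
    return sub_items

def sample_by_level(prediction_pool, target_ratio, batch_size):
    """Draw a batch from each hierarchy level according to a target ratio.

    `target_ratio` maps level -> fraction of the batch,
    e.g. {0: 0.5, 1: 0.3, 2: 0.2}.
    """
    batch = []
    for level, fraction in target_ratio.items():
        candidates = [it for it in prediction_pool if it["level"] == level]
        batch.extend(candidates[:int(round(fraction * batch_size))])
    return batch
```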
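Claim 5 ranks images by a labeling difficulty index derived from the entropy of the prediction box confidences and selects the images with the lowest index, i.e. those whose predictions should be quickest to confirm. The claim fixes neither the entropy formula nor the aggregation, so the sketch below assumes binary entropy per box, H(p) = -p log p - (1-p) log(1-p), and a plain sum over boxes.

```python
import math

def box_entropy(confidence, eps=1e-7):
    # Binary entropy of a box confidence p; it peaks (at log 2) when
    # p = 0.5, i.e. when the model is least certain about the box.
    p = min(max(confidence, eps), 1.0 - eps)
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)

def difficulty_index(box_confidences):
    # Aggregate per-box entropies into one image-level index; the claim
    # does not fix the aggregation, so a plain sum is assumed here.
    return sum(box_entropy(c) for c in box_confidences)

def pick_easiest(images_with_confidences, k=1):
    # Claim 5 selects the image(s) with the LOWEST difficulty index.
    ranked = sorted(images_with_confidences,
                    key=lambda item: difficulty_index(item[1]))
    return ranked[:k]

# Example: an image whose boxes are confidently scored ranks as easier
# than one full of near-0.5-confidence boxes.
easy = ("img_a", [0.97, 0.99, 0.95])
hard = ("img_b", [0.52, 0.48, 0.55])
print(pick_easiest([easy, hard], k=1)[0][0])  # -> img_a
```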
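Claim 7 splits the prediction model into a prediction module that detects on the full image and a regression module that re-predicts on the region cropped around each primary detection, the refined secondary result being the model's output. A coarse-to-fine sketch, with all names and the box format assumed for illustration:

```python
def two_stage_predict(image, prediction_module, regression_module):
    """Coarse-to-fine inference as in claim 7 (names are illustrative).

    The prediction module proposes boxes on the full image; the regression
    module then re-runs on each cropped region, and the refined secondary
    predictions are what the model outputs.
    """
    secondary_results = []
    primary_results = prediction_module(image)      # primary prediction result
    for box in primary_results:
        x0, y0, x1, y1 = box["position"]
        local_patch = image[y0:y1, x0:x1]           # local image data
        refined = regression_module(local_patch)    # secondary prediction
        for r in refined:
            # Map the refined box back into full-image coordinates.
            rx0, ry0, rx1, ry1 = r["position"]
            r["position"] = (rx0 + x0, ry0 + y0, rx1 + x0, ry1 + y0)
            secondary_results.append(r)
    return secondary_results
```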
CN202211678386.2A 2022-12-26 2022-12-26 Image data labeling method, electronic device and computer readable medium Pending CN116012646A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211678386.2A CN116012646A (en) 2022-12-26 2022-12-26 Image data labeling method, electronic device and computer readable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211678386.2A CN116012646A (en) 2022-12-26 2022-12-26 Image data labeling method, electronic device and computer readable medium

Publications (1)

Publication Number Publication Date
CN116012646A true CN116012646A (en) 2023-04-25

Family

ID=86033038

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211678386.2A Pending CN116012646A (en) 2022-12-26 2022-12-26 Image data labeling method, electronic device and computer readable medium

Country Status (1)

Country Link
CN (1) CN116012646A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116863197A (en) * 2023-05-24 2023-10-10 蘑菇车联信息科技有限公司 Data migration methods, devices, computer equipment and storage media

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
DANIEL MUNOZ et al.: "Stacked Hierarchical Labeling", ECCV 2010, Part VI, LNCS, vol. 6316, 31 December 2010, pages 57-70 *
FISHER YU et al.: "LSUN: Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop", arXiv:1506.03365v2, 19 June 2015, pages 1-9 *
SHAOQING REN et al.: "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, 30 June 2017, pages 1137-1149 *
XIN WANG et al.: "IDK Cascades: Fast Deep Learning by Learning not to Overthink", arXiv:1706.00885v2, 20 June 2017, pages 1-10 *

Similar Documents

Publication Publication Date Title
CN111797893B (en) Neural network training method, image classification system and related equipment
CN110033018B (en) Graph similarity judging method and device and computer readable storage medium
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
WO2021227726A1 (en) Methods and apparatuses for training face detection and image detection neural networks, and device
US20210248181A1 (en) Electronic device and control method thereof
US11508173B2 (en) Machine learning prediction and document rendering improvement based on content order
CN110516705A (en) Target tracking method, device and computer-readable storage medium based on deep learning
CN114358203A (en) Training method and device for image description sentence generation module and electronic equipment
WO2021139191A1 (en) Method for data labeling and apparatus for data labeling
CN111507285A (en) Face attribute recognition method and device, computer equipment and storage medium
CN113705293A (en) Image scene recognition method, device, equipment and readable storage medium
CN115081615A (en) Neural network training method, data processing method and equipment
CN113408329B (en) Artificial intelligence-based video processing method, device, equipment, and storage medium
CN115984093A (en) Depth estimation method based on infrared image, electronic device and storage medium
CN114693997B (en) Image description generation method, device, equipment and medium based on transfer learning
CN115841605A (en) Target detection network training and target detection method, electronic device and storage medium
CN113762331B (en) Relational self-distillation method, device, system and storage medium
CN115563976A (en) Text prediction method, model building method and device for text prediction
Hu et al. Progeo: Generating prompts through image-text contrastive learning for visual geo-localization
CN119251680A (en) Remote sensing image classification method based on prompt learning
CN111126243A (en) Image data detection method, device and computer-readable storage medium
Liu et al. NAMER: Non-Autoregressive Modeling for Handwritten Mathematical Expression Recognition
CN113780066A (en) Pedestrian re-identification method, device, electronic device and readable storage medium
CN116012646A (en) Image data labeling method, electronic device and computer readable medium
CN119887681B (en) Industrial defect detection method based on vision-language cues

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20241210

Address after: 21st Floor, Building C3, Future Science and Technology City, No. 999 Gaoxin Avenue, Donghu New Technology Development Zone, Wuhan City, Hubei Province 430073
Applicant after: Wuhan Kuangshi Jinzhi Technology Co.,Ltd.
Country or region after: China
Applicant after: MEGVII (BEIJING) TECHNOLOGY Co.,Ltd.

Address before: 316-318, block a, Rongke Information Center, No.2, South Road, Academy of Sciences, Haidian District, Beijing, 100190
Applicant before: MEGVII (BEIJING) TECHNOLOGY Co.,Ltd.
Country or region before: China