WO2019017990A1 - Unified learning embedding - Google Patents
Unified learning embedding
- Publication number
- WO2019017990A1 (PCT/US2017/062222)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- neural network
- verticals
- loss function
- machine learning
- processing apparatus
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
- G06N3/08—Learning methods
- G06F18/24143—Distances to neighbourhood prototypes, e.g. restricted Coulomb energy networks [RCEN]
- G06N20/00—Machine learning
- G06N3/0464—Convolutional networks [CNN, ConvNet]
- G06N3/09—Supervised learning
- G06V10/454—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
- G06V10/764—Image or video recognition using pattern recognition or machine learning, using classification, e.g. of video objects
- G06V10/776—Validation; Performance evaluation
- G06V10/82—Image or video recognition using neural networks
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06N3/045—Combinations of networks
Description
- This specification relates to training a unified neural network model.
- Neural networks are machine learning models that employ one or more layers of operations to generate an output, e.g., a classification, for a received input.
- Some neural networks include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as input to the next layer in the network, i.e., the next hidden layer or the output layer of the network. Some or all of the layers of the network generate an output from a received input in accordance with current values of a respective set of parameters.
- Some neural networks include one or more convolutional neural network layers.
- Each convolutional neural network layer has an associated set of kernels.
- Each kernel includes values established by a neural network model created by a user.
- For example, kernels can identify particular image contours, shapes, or colors. Kernels can be represented as a matrix structure of weight inputs.
- Each convolutional layer can also process a set of activation inputs. The set of activation inputs can also be represented as a matrix structure.
- a data processing apparatus determines a learning target for each object vertical in a group of object verticals.
- the data processing apparatus can determine each learning target based on two or more embedding outputs of the neural network.
- Each embedding output may be generated by separate specialized models that are individually trained using a triplet loss function.
- Each specialized model is configured to identify data associated with a particular object vertical.
- a unified machine learning model is generated when the data processing apparatus trains a neural network to identify data associated with each object vertical in the group.
- the data processing apparatus trains the neural network based on an L2-loss function and using the respective learning targets of the specialized models.
- the data processing apparatus uses the trained neural network to generate a unified machine learning model.
- the unified model can be configured to identify particular electronic data items that include object representations of items that are in each of the object verticals.
- the method includes: determining, by the data processing apparatus and for the neural network, respective learning targets for each of a plurality of object verticals, wherein each object vertical defines a distinct category for an object that belongs to the vertical; training, by the data processing apparatus and based on a first loss function, the neural network to identify data associated with each of the plurality of object verticals, where the neural network is trained using the respective learning targets; and generating, by the data processing apparatus and using the neural network trained based on the first loss function, a unified machine learning model configured to identify items that are included in the data associated with each of the plurality of object verticals.
- determining respective learning targets for the neural network further includes: training, by the data processing apparatus and based on a second loss function, at least one other neural network to identify data associated with each of the plurality of object verticals; in response to training, generating, by the data processing apparatus, two or more embedding outputs, where each embedding output indicates a particular learning target and includes a vector of parameters that correspond to the data associated with a particular object vertical; and generating, by the data processing apparatus and using the at least one other neural network trained based on the second loss function, respective machine learning models, each machine learning model being configured to use a particular embedding output.
- determining respective learning targets for the neural network further includes: providing, for training the neural network, the respective learning targets generated from respective separate models.
- each of the plurality of object verticals corresponds to a particular category of items and the data associated with each of the plurality of object verticals includes image data of an item in the particular category of items.
- the particular category is an apparel category and items of the particular category include at least one of: handbags, shoes, dresses, pants, or outerwear; and wherein the image data indicates an image of at least one of: a particular handbag, a particular shoe, a particular dress, a particular pant, or particular outerwear.
- each of the respective machine learning models is configured to identify data associated with a particular object vertical and within a first degree of accuracy; and the unified machine learning model is configured to identify data associated with each of the plurality of object verticals and within a second degree of accuracy that exceeds the first degree of accuracy.
- determining the respective learning targets for each of the plurality of object verticals includes: analyzing the two or more embedding outputs, each embedding output corresponding to a particular object vertical of the plurality of object verticals; and based on the analyzing, determining the respective learning targets for each of the plurality of object verticals.
- the first loss function is an L2-loss function and generating the unified machine learning model includes: generating a particular unified machine learning model that minimizes a computational output associated with the L2-loss function.
- the neural network includes a plurality of neural network layers that receive multiple layer inputs, and where training the neural network based on the first loss function includes: performing batch normalization to normalize layer inputs to a particular neural network layer; and minimizing covariate shift in response to performing the batch normalization.
- the second loss function is a triplet loss function and generating the respective machine learning models includes: generating a particular machine learning model based on associations between an anchor image, a positive image, and a negative image.
- implementations of this and other aspects include corresponding systems, apparatus, and computer programs, configured to perform the actions of the methods, encoded on computer storage devices.
- a computing system of one or more computers or circuits can be so configured by virtue of software, firmware, hardware, or a combination of them installed on the system that in operation cause the system to perform the actions.
- One or more computer programs can be so configured by virtue of having instructions that, when executed by data processing apparatus, cause the apparatus to perform the actions.
- the described teachings include processes for using a neural network to generate a unified machine learning model using an L2-loss function, where the unified model can be used to identify or recognize a variety of objects for multiple object verticals (e.g., dresses, handbags, shoes).
- the unified model generated according to the described teachings can be used to locate or retrieve the same or similar items.
- an item's appearance can change with lighting, viewpoints, occlusion, and background conditions.
- Distinct object verticals can also have different characteristics such that images from a dress vertical may undergo more deformations than those from a shoe vertical. Hence, because of these distinctions, separate models are trained to identify items in each object vertical.
- FIG. 1 illustrates a neural network architecture for generating a machine learning model based on a first loss function.
- FIG. 2 illustrates a neural network architecture for generating a machine learning model based on a second loss function.
- FIG. 3 illustrates example graphical representations of embedding data relating to different object verticals.
- FIG. 4 is an example flow diagram of a process for generating a unified machine learning model for multiple object verticals based on a particular loss function.
- FIG. 5 illustrates a diagram that includes graphical representations of respective embedding models for object verticals that correspond to a particular apparel category.
- FIG. 6 illustrates a diagram with computing functions for obtaining image data for training one or more machine learning models.
- Machine learning systems can be trained, using deep neural networks, to recognize particular categories of items based on learned inferences.
- Deep neural networks can generate inferences based on analysis of input data received by the machine learning system.
- Trained machine learning systems may produce or generate one or more specialized models that use particular sets of learned inferences for identification or recognition of particular categories of items.
- a category or object vertical can correspond to a variety of items or objects, such as automobiles, animals, human individuals, and various physical objects, for example represented in image data.
- an object vertical can also correspond to audio signal data.
- specialized models may significantly outperform general models, e.g., models that are trained to recognize items associated with a wide range of object verticals. Therefore, item recognition models, generated using deep neural networks, are often trained separately for different object verticals.
- An object vertical defines a distinct category for an object that belongs to the vertical, e.g., for apparel, object verticals may be the item categories of hats, shoes, shirts, jackets, etc.
- an item recognition system that includes groups of specialized models for identifying items of different object verticals can be expensive to deploy and may not be sufficiently scalable.
- the subject matter described in this specification includes systems and methods for generating a unified machine learning model using a neural network on a data processing apparatus.
- the unified embedding model can be generated using a deep neural network that utilizes learning targets indicated by embedding outputs generated by respective specialized models.
- the neural network (or deep neural network) can be used by an example machine learning system to learn inferences for identifying a variety of items across multiple object verticals, e.g., item categories corresponding to various apparel classes.
- a machine learning system that includes a neural network can access respective learning targets for each object vertical in a group of object verticals.
- the system can determine the respective learning targets based on two or more embedding outputs of the neural network.
- Each embedding output can be generated by respective specialized models that are individually trained using a triplet loss function and to identify data (e.g., an image of a luxury handbag) associated with a particular object vertical (e.g., handbags).
- the data processing apparatus of the system trains the neural network to identify data associated with each vertical in the group of object verticals.
- the neural network can be trained using the respective learning targets of the specialized models and based on an L2-loss function.
- the data processing apparatus uses the trained neural network to generate a unified machine learning model configured to identify particular data items (e.g., Brand name sneakers, luxury purse, luxury blouse, etc.) associated with each object vertical in the group (e.g., shoes, handbags, tops/shirts, etc.).
- FIG. 1 illustrates a neural network system architecture 100 ("system 100") for generating an example machine learning model based on a first loss function.
- Generating a machine learning model can include system 100 performing neural network computations associated with inference workloads.
- computations for inference workloads can include processing neural network inputs (e.g., input activations) through layers of a neural network.
- Each layer of a neural network can include a set of parameters (e.g., weights) and processing an input through a neural network layer can include computing dot products using the input activations and parameters as operands for the computations.
- System 100 generally includes an example neural network indicated by neural net architecture 102.
- a neural network of architecture 102 can include a base network 103, a pooling layer 104, a first connected layer set 106, a second connected layer set 108, and embedding outputs 110.
- Base network 103 can include a subset of neural network layers of architecture 102.
- a deep neural network can include a base network 103 that includes multiple convolutional layers. These convolutional layers can be used to perform complex computations for computer based recognition of various items included in a variety of image data.
- base network 103 can be an Inception v2, an Inception v3, an Inception v4, or another related neural net structure.
- Architecture 102 can include a variety of additional neural network layers that perform various functions associated with inference computations for training a machine learning model.
- pooling layer 104 can be an average pooling layer or max pooling layer that performs functions related to pooling output activations for down-sampling operations.
- the down-sampling operations can reduce a size of output datasets by modifying certain spatial dimensions that relate to an input dataset.
- Connected layer sets 106, 108 can be respective sets of fully connected layers that include artificial neurons that have full connections to all activations in a previous layer.
- Embedding outputs 110 can correspond to one or more output feature sets that include a vector of floats/parameters for a given output dimension (64-d, 256-d, etc.). As described in more detail below, embedding outputs 110 are formed, produced, or generated when the example neural network of system 100 is trained to perform certain computational functions for object/item recognition or identification.
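- A minimal sketch of this architecture is shown below (a Keras-style assumption, not the patent's own implementation): a convolutional base network, a pooling layer 104, connected layer sets 106 and 108, and an L2-normalized embedding output 110. The layer widths and the 64-d output dimension are illustrative.

```python
# Sketch of the embedding tower of architecture 102, assuming TensorFlow/Keras.
# InceptionV3 stands in for base network 103; widths 1024/512 and the 64-d
# embedding are illustrative, not values taken from the patent.
import tensorflow as tf

def build_embedding_tower(input_shape=(224, 224, 3), embedding_dim=64):
    base = tf.keras.applications.InceptionV3(          # base network 103
        include_top=False, weights=None, input_shape=input_shape)
    x = tf.keras.layers.GlobalAveragePooling2D()(base.output)  # pooling layer 104
    x = tf.keras.layers.Dense(1024, activation="relu")(x)      # connected layer set 106
    x = tf.keras.layers.Dense(512, activation="relu")(x)       # connected layer set 108
    x = tf.keras.layers.Dense(embedding_dim)(x)                # embedding output 110
    # Unit-normalize so distances between embeddings are directly comparable.
    outputs = tf.keras.layers.Lambda(
        lambda t: tf.math.l2_normalize(t, axis=1))(x)
    return tf.keras.Model(base.input, outputs)
```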
- System 100 can include one or more processors and other related circuit components that form one or more neural networks.
- Example processors can include central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), or other related processor architectures.
- System 100 can include multiple computers, computing servers, and other computing devices that each include processors and memory that stores compute logic or software instructions that are executable by the processors.
- system 100 includes one or more processors, memory, and data storage devices that collectively form one or more neural networks of architecture 102.
- Processors of architecture 102 process instructions for execution by system 100, including instructions stored in the memory or on the storage devices. Execution of the stored instructions can cause performance of the machine learning processes described herein.
- system 100 is configured to perform a variety of computing operations related to machine learning processes. For example, system 100 performs learning operations 112 and 114 as well as a variety of other operations related to training a neural network to generate one or more specialized machine learning models. In some implementations, system 100 executes programmed code or software instructions to perform computations associated with learning operations 112 and 114. As described in more detail below, learning operations 112 and 114 are executed by system 100 to train respective specialized learning models based on a triplet loss function indicated by computing logic 116.
- Learning operation 112 includes system 100 using the neural network of architecture 102 to generate model training data.
- the model training data can correspond to embedding outputs that are produced by system 100 when the system is trained to generate a particular specialized model.
- system 100 generates multiple distinct specialized models and produces individual sets of embedding outputs, where a particular set of embedding outputs corresponds to a particular specialized model.
- separate specialized models can be generated to recognize and retrieve apparel items for different apparel categories (e.g., dresses, tops, handbags, etc.).
- embedding models can also be generated for recognizing images from various websites or other user-produced images, e.g., "street" or "real life" digital images captured using mobile devices/smartphones.
- a sub-network for each vertical such as dresses, handbags, eyewear, and pants can be fine-tuned independently.
- the results of the model training can enable a machine learning system to produce up to eleven separate specialized models that each correspond to one of eleven verticals.
- a "vertical" or "object vertical” corresponds to an object or item category.
- an object vertical can be an apparel item category such as dresses, handbags, eyewear, pants, etc.
- using separate models for object recognition of items (e.g., apparel items) in a particular category/vertical can result in substantially accurate item recognition results.
- system 100 trains a neural network of architecture 102 using image data for apparel items that are each associated with different apparel categories. For example, system 100 can use image data for multiple different types of handbags to train the neural network. System 100 can then generate a specialized model to identify or recognize particular types of handbags based on embedding outputs that are produced in response to training the neural network.
- a particular set of embedding outputs can correspond to a particular specialized model.
- a first set of embedding outputs can correspond to neural network training data for learned inferences used to generate a first model for recognizing certain shirts/blouses or tops (e.g., see operation 114).
- a second set of embedding outputs can correspond to neural network training data for learned inferences used to generate a second model for recognizing certain jeans/pants/skirts or bottoms.
- Each set of embedding outputs includes embedding feature vectors that can be extracted and used for object or item retrieval.
- extracted sets of embedding feature vectors can correspond to respective learning targets and an embedding output of a trained neural network model can include these embedding feature vectors.
- One or more learning targets can be used to train a machine learning system (e.g., system 100) to generate particular types of specialized computing models. For example, as discussed below with reference to features of FIG. 2, multiple distinct learning targets can be used to train at least one unified machine learning model that recognizes items that are associated with multiple different verticals or categories.
- system 100 executes learning operation 114 to generate specialized learning models based on model training data determined at learning operation 112.
- determining or generating model training data corresponds to an example process of "learning" individual embedding models.
- system 100 uses a two-stage approach when training (e.g., a first stage) and when extracting embedding feature vectors for object retrieval (e.g., a second stage).
- the first stage can include localizing and classifying an apparel item of the image data.
- classifying an apparel item of the image data includes system 100 determining an object class label for the apparel item of the image data.
- system 100 can use an example object detector to analyze the image data.
- System 100 can then use the analysis data to detect an object or apparel item of the image data that includes object attributes that are associated with handbags. Based on this analysis and detection, system 100 can then determine that the object class label for the apparel item is a "handbag" class label.
- system 100 includes object detection architecture that is a single-shot multibox detector (SSD) used with a base network 103 that is an Inception v2 base network.
- system 100 can be configured to use or include a variety of other object detection architectures and base network combinations.
- the SSD can be an example computing module of system 100 that executes program code to cause performance of one or more object detection functions.
- this SSD detector module can provide bounding boxes that bound an object of the image data.
- the SSD can further provide apparel class labels that indicate whether the bounded object is a handbag, an eyewear item, or a dress.
- object pixels of the image data can be cropped and various features can then be extracted on the cropped image using a particular embedding model of system 100.
- Sub-process steps associated with the first stage of the two-stage process can be used to train a specialized embedding model based on a variety of image data.
- system 100 can proceed to the second stage and train specialized embedding models to compute similarity features for object retrieval. For example, system 100 can perform embedding model training using a triplet loss function indicated by computing logic 116. More specifically, system 100 uses triplet ranking loss to learn feature embeddings for each individual item/object vertical or category.
- a triplet includes an anchor image, a positive image, and a negative image.
- system 100 seeks to produce embeddings such that the positive image gets close to the anchor image while the negative is pushed away from the anchor image in a feature space of the neural network. Embeddings learned from triplet training are used to compute image similarity.
- the triplet ranking loss can be expressed as equation (1): loss(a, p, n) = max(0, D(f(a), f(p)) − D(f(a), f(n)) + α), where: i) α is the margin enforced between the positive and negative pairs; ii) f(I) is the feature embedding for image I; and iii) D(f_x, f_y) is the distance between the two feature embeddings f_x and f_y.
- a positive image is the same product (e.g., Chanel handbag) as the anchor image, while the negative image is of another product but in the same apparel vertical (e.g., luxury handbag).
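- As an illustration, the triplet loss of equation (1) can be written as a batched function; the sketch below assumes L2-normalized embeddings, squared Euclidean distance for D, and an illustrative margin value.

```python
# Sketch of the triplet ranking loss in equation (1), assuming TensorFlow.
import tensorflow as tf

def triplet_loss(f_anchor, f_positive, f_negative, margin=0.2):
    # Squared Euclidean distances between anchor/positive and anchor/negative.
    d_pos = tf.reduce_sum(tf.square(f_anchor - f_positive), axis=1)
    d_neg = tf.reduce_sum(tf.square(f_anchor - f_negative), axis=1)
    # Hinge term: push the negative at least `margin` farther than the positive.
    return tf.reduce_mean(tf.maximum(d_pos - d_neg + margin, 0.0))
```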
- system 100 executes computing logic for semi-hard negative mining functions for obtaining negative image data. For example, system 100 can access online/web-based resources and use semi-hard negative mining to identify strong negative object images. System 100 can then use these object images to enhance or improve effectiveness of the training for a particular specialized model.
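- One plausible reading of semi-hard negative mining is sketched below: for a given anchor-positive pair, a semi-hard negative is farther from the anchor than the positive but still inside the margin. The helper and its parameters are illustrative, not defined by the patent.

```python
# Sketch of per-anchor semi-hard negative selection, assuming NumPy arrays of
# precomputed distances: `d_pos` is the anchor-positive distance and
# `d_to_candidates` holds anchor-to-candidate-negative distances.
import numpy as np

def pick_semihard_negative(d_pos, d_to_candidates, margin=0.2):
    # Semi-hard: farther than the positive, but still violating the margin.
    mask = (d_to_candidates > d_pos) & (d_to_candidates < d_pos + margin)
    idx = np.flatnonzero(mask)
    # Choose the hardest of the semi-hard negatives, if any exist.
    return idx[np.argmin(d_to_candidates[idx])] if idx.size else None
```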
- model training based on the triplet loss function of logic 116 produces training data such as embedding outputs that include feature vectors. Extracted sets of embedding feature vectors can correspond to respective learning targets.
- system 100 can determine respective learning targets based on the triplet loss model training data. System 100 can then use two or more of the learning targets to train a machine learning system (e.g., system 200 of FIG. 2) to generate at least one unified computing model, as described below.
- FIG. 2 illustrates a neural network system architecture 200 ("system 200") for generating an example machine learning model based on a second loss function, e.g., an L2-loss function.
- system 200 includes substantially the same features as system 100 described above.
- system 200 includes an L2 normalization layer 212 that is described in more detail below.
- system 200 is a sub-system of system 100 and can be configured to execute the various computing functions of system 100 described above.
- System 200 is configured to learn or generate a unified embedding model that is trained based on learned inferences. These learned inferences enable object recognition of various item groupings, where each grouping corresponds to distinct object verticals or categories (e.g., apparel categories). System 200 learns one or more unified models by combining training data produced when system 100 is used to train respective specialized models as described above. In some related model learning/training scenarios, combining training data from separate models to generate a unified model can cause performance degradation when triplet loss is used to train the unified model.
- system 200 can generate a unified embedding model that achieves equivalent performance and recognition accuracy when compared to individual specialized models. Further, the unified model can have the same, or even less, model complexity as a single individual specialized model. Hence, this specification describes improved processes and methods for easing or reducing the difficulties in training model embeddings for multiple verticals such that a unified model can be generated.
- separate specialized models can be first trained to achieve a desired threshold level of accuracy for recognizing objects included in image data.
- the separate models can be trained using system 100 and based on the triplet loss function of computing logic 116. Embedding outputs of each separately trained model are then used as the learning targets to train an example unified model.
- a particular specialized model can have an example accuracy metric of 0.661, where the model accurately identifies certain handbags 66.1% of the time.
- a unified model generated according to the described teachings can achieve accurate object recognition results that exceed the accuracy metric of the object recognition results of the particular specialized model (e.g., 0.661).
- a unified model generated according to the described teachings can have an object retrieval or recognition accuracy of 0.723 accuracy metric, or 72.3 percent accuracy, for a handbags apparel category.
- accurately recognizing/identifying an object using the unified model includes determining a category of the object (e.g., "handbag"), determining an owner or designer of the object (e.g., "Chanel" or "Gucci"), and/or determining a type/style of the handbag (e.g., "Chanel 2.55 classic flap bag").
- identifying an object by the unified model can include the model retrieving (e.g., object retrieval) associated image data that includes a graphical representation of the object.
- system 200 is configured to generate a unified model that can execute multiple tasks for accurate object recognition and retrieval that, in prior systems, are performed by separate models but with reduced accuracy.
- the described teachings include methods and processes for improvements in emulating separate models' embedding outputs (e.g., learning targets) through use of an L2-loss function.
- training a unified model based on triplet loss and by combining training data for two different object verticals (e.g., handbags and shoes) can generate a unified model that performs object recognition of items in those verticals with reasonable accuracy.
- however, using triplet loss when combining training data for three or more different object verticals may result in a unified model that performs with substantially poor object recognition accuracy.
- the poor accuracy results from difficult and complex computing challenges that occur when training a unified model based on a triplet loss function for several distinct verticals.
- this specification proposes a learning scheme that uses embedding outputs from specialized models as learning targets such that L2-loss can be used instead of triplet loss.
- Use of the L2-loss function eases the training difficulty associated with generating the unified model and provides for more efficient use of a neural network's feature space.
- the end result is a unified model that can achieve the same (or greater) retrieval accuracy as a number of separate specialized models, while having the model complexity of a single specialized model.
- system 200 uses the respective learning targets of the separate models to learn a unified learning model such that embeddings generated from this unified model are the same as (or very close to) the embeddings of separate specialized models generated by system 100.
- system 200 uses a neural network of architecture 102 to determine respective learning targets for each object vertical in a group of object verticals. Each of the respective learning targets can be based on a particular embedding output of the neural network.
- FIG. 2 shows computing operations for generating a unified machine learning model.
- system 200 accesses learning targets that correspond to feature embeddings for the respective specialized models.
- system 200 generates unified model training data that correspond to feature embeddings for detecting objects of various verticals.
- the feature embeddings are based on neural network inference computations that occur during unified model learning.
- let V = {V_i}, for i = 1, ..., N, be a set of object verticals.
- let M = {M_i} be a set of embedding models, where each M_i is the model learned for vertical V_i.
- let f_{s,j} denote the feature embeddings generated from M_s for image I_j.
- At learning operation 220, system 200 generates a unified machine learning model configured to identify particular items included in example image data.
- the image data can be associated with each object vertical of the group and the unified model is generated using a neural network trained based on a particular loss function (e.g., L2-loss).
- system 200 is configured to learn a model U, such that the features produced from model U are the same as features produced from the separate specialized models generated by system 100.
- let f_{u,j} denote the feature embeddings generated from model U.
- a learning goal of system 200 is to determine a model U which can minimize a computational output associated with the following loss function, shown as equation (2): L(U) = Σ_j ‖ f_{u,j} − f_{s,j} ‖².
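- A sketch of this objective appears below: the unified model's embeddings are regressed onto the fixed learning targets produced by the specialized models, which is one straightforward realization of equation (2).

```python
# Sketch of the L2 learning objective of equation (2), assuming TensorFlow.
import tensorflow as tf

def unified_l2_loss(f_unified, f_target):
    # f_unified: embeddings f_{u,j} from model U for a batch of images.
    # f_target: fixed learning targets f_{s,j} precomputed by the
    # per-vertical specialized models (treated as constants).
    return tf.reduce_mean(
        tf.reduce_sum(tf.square(f_unified - tf.stop_gradient(f_target)), axis=1))
```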
- system 200 is configured to generate a unified model that has an output dimension 107 that is 256-d.
- a single specialized model generated at learning operation 114 can have an output dimension 107, e.g., 4096-d, that is larger than, or substantially larger than, the 256-d output dimension of the unified model.
- use of L2-loss provides for a less complex and less difficult training process than triplet loss.
- the L2-loss function provides for less complexity and difficulty in the application of learning techniques, such as batch normalization.
- neural network layer inputs can be normalized to allow for higher learning rates.
- batch normalization can be used to achieve desired threshold accuracy metrics (e.g., 0.60 or higher) with fewer training steps when compared to learning techniques used with other loss functions.
- covariate shift can be minimized in response to system 200 performing batch normalization functions that are applied via L2 normalization layer 212.
- deep neural networks can include multiple layers in a sequence. Training deep neural networks is often complicated by the fact that a distribution of each layer's inputs can change during model training, for example, as parameters of previous layers in a sequence change.
- Such changes can slow down a speed with which a model can be trained using a deep neural network, thereby requiring slower learning rates and careful parameter initialization.
- Parameter changes that adversely affect training speed can be described as neural network internal covariate shift.
- batch normalization processes for normalizing layer inputs can be performed to resolve or minimize adverse effects on training speed that are caused by covariate shift.
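- As a concrete illustration of where such normalization could sit, the sketch below inserts batch normalization between a fully connected layer and its activation; the placement and layer widths are assumptions, not patent-specified values.

```python
# Sketch of a dense block with batch normalization, assuming TensorFlow/Keras.
import tensorflow as tf

def dense_bn_block(units):
    return tf.keras.Sequential([
        tf.keras.layers.Dense(units, use_bias=False),  # bias is redundant before BN
        tf.keras.layers.BatchNormalization(),          # normalize the layer inputs
        tf.keras.layers.ReLU(),                        # activation after normalization
    ])
```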
- a learning approach that uses L2-loss to generate a unified model allows for the use of increased amounts of unlabeled data relative to triplet loss.
- for example, triplet loss training requires labels that indicate a product identity (e.g., "Chanel 2.55 classic flap bag"), whereas L2-loss requires only the vertical labels, which can be generated automatically by an example localization/classification model.
- use of L2-loss can reduce processor utilization and increase system bandwidth for additional computations by foregoing computations for determining product identities.
- the described teachings also include methods for selecting vertical data combinations for producing a particular model (e.g., a unified model or other related combined model forms) that can be used to reduce a number of specialized models.
- This particular combined model can be successfully learned and can have a comparable object recognition accuracy that is similar to, the same as, or greater than the recognition accuracy of each separate specialized model.
- system 200 can include computing logic for determining which embeddings data from different verticals for specialized models can be combined to produce an example combined model.
- system 200 can progressively add embeddings data from other verticals. While adding embeddings data, system 200 can perform sample item recognition tasks to determine whether a model learned from the combined data causes observed accuracy degradation.
- system 200 can steadily add embeddings data from other verticals until accuracy degradation is observed.
- system 200 determines a particular combination of verticals for a number of specialized models, where each specialized model is used for item recognition across a subset of verticals. Further, system 200 can determine a particular combination of verticals for a number of specialized models while also maintaining a threshold level of accuracy. System 200 can then use feature embeddings for specialized models that correspond to particular verticals in the subset and produce a combined model based on the feature embeddings.
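- The sketch below illustrates one way this progressive combination could be organized; `train_combined` and `accuracy` are hypothetical hooks standing in for the system's training and evaluation logic, and the tolerance is illustrative.

```python
# Sketch of progressively grouping verticals until accuracy degradation
# is observed. All helper functions and the tolerance are illustrative.
def combine_verticals(verticals, specialized_acc, train_combined, accuracy,
                      tolerance=0.01):
    groups, current = [], []
    for v in verticals:
        candidate = current + [v]
        model = train_combined(candidate)  # hypothetical training hook
        degraded = any(
            accuracy(model, u) < specialized_acc[u] - tolerance
            for u in candidate)
        if degraded and current:
            groups.append(current)  # freeze the last non-degraded group
            current = [v]           # start a new group with this vertical
        else:
            current = candidate
    if current:
        groups.append(current)
    return groups
```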
- FIG. 3 shows graphical representations of embeddings data 300 for different object verticals in a feature space of an example neural network.
- the graphical representations generally indicate that a unified model trained (e.g., learned) based on the described teachings can provide more efficient and broader use of a feature space of the neural network.
- the described learning approach that uses L2-loss can efficiently train a unified model by taking advantage of pre-established feature mappings (e.g., learning targets) learned for separate specialized models.
- Embeddings data 300 includes t-distributed stochastic neighbor embedding (t-SNE) visualizations generated from feature embeddings of separate specialized models.
- embeddings data 300 includes two thousand images from each vertical 302, 304, and 306, where the data is projected down to 2D space for visualization.
- FIG. 3 indicates that the feature embeddings f_{s,j} are separated across verticals 302, 304, 306 in the feature space.
- the embeddings f_{s,j} for each vertical use only a part of the dimensional space (e.g., 64-d), and therefore one unified model can be learned to combine embedding outputs for each apparel vertical included in embeddings data 300 (e.g., 8 total verticals).
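- A visualization along these lines can be produced as sketched below; the random features are placeholders for the real f_{s,j} embeddings of FIG. 3.

```python
# Sketch of the t-SNE projection described for FIG. 3, using scikit-learn.
import numpy as np
from sklearn.manifold import TSNE

embeddings = np.random.rand(6000, 64)        # placeholder for real f_{s,j} features
labels = np.repeat([302, 304, 306], 2000)    # 2,000 images per vertical
points_2d = TSNE(n_components=2).fit_transform(embeddings)  # project to 2D space
```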
- FIG. 4 is an example flow diagram of a process for generating a unified machine learning model for multiple object verticals based on a particular loss function.
- Process 400 corresponds to an improved process for generating unified machine learning models, where the generated models have item recognition accuracy metrics that are at least equal to an accuracy metric of two or more distinct specialized models.
- Process 400 can be implemented using system 100 or 200 described above, where system 100 can perform all described functionality associated with sub-system 200.
- Process 400 includes block 402 where system 100 determines respective learning targets for each object vertical in a group of object verticals.
- a neural network of system 100 determines respective learning targets based on two or more embedding outputs of the neural network.
- the object verticals can be apparel categories, such as dresses, shoes, or handbags.
- each vertical can correspond to an embedding output that is produced when a particular model is trained to identify or recognize apparel or clothing items in a vertical.
- Example apparel items can include cocktail dresses, basketball sneakers, or brand name monogram handbags.
- system 100 trains the neural network to identify data associated with each object vertical in the group based on a first loss function (e.g., L2-loss).
- the neural network is trained using the respective learning targets that were determined for each object vertical. For example, given an image file or image data, system 100 can train the neural network to at least: i) identify a dress item in an image based on analysis of pixel data of the image; ii) identify a shoe item in an image based on analysis of pixel data of the image; or iii) identify a handbag item in an image based on analysis of pixel data of the image.
- system 100 generates a unified machine learning model that is configured to identify items that are included in the data associated with each object vertical of the group of verticals.
- data processing apparatus of system 100 can use the neural network trained based on the first loss function to generate the unified machine learning model that performs one or more of the object recognition functions described herein.
- determining the respective learning targets includes: i) training the neural network to identify data associated with each of the object verticals, where the neural network is trained based on a second loss function; and ii) generating at least two embedding outputs, where each embedding output indicates a particular learning target of the respective learning targets.
- each embedding output can include a vector of floats (or parameters) that correspond generally to attributes of the image data associated with a particular object vertical.
- system 100 generates respective machine learning models, where each of the models are generated using the neural network trained based on the second loss function (e.g., triplet loss) that is different than the first loss function.
- each of the models may use a vector of floats for a particular embedding output to identify apparel or clothing items for a particular object vertical.
- generating the embeddings occurs in response to training the neural network.
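- The two phases of this process can be tied together as sketched below; `train_with_triplet_loss` and `train_with_l2_loss` are hypothetical helpers, not interfaces defined by the patent.

```python
# Sketch tying process 400 together: specialized models are learned per
# vertical with triplet loss (the second loss function), and their embedding
# outputs become the learning targets for L2 training of the unified model
# (the first loss function).
def build_unified_model(vertical_datasets, train_with_triplet_loss,
                        train_with_l2_loss):
    specialized = {v: train_with_triplet_loss(data)
                   for v, data in vertical_datasets.items()}
    targets = {v: specialized[v].embed(data.images)   # learning targets
               for v, data in vertical_datasets.items()}
    return train_with_l2_loss(vertical_datasets, targets)
```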
- generating a unified machine learning model can include combining training data associated with different apparel verticals. To calibrate object identification and retrieval performance (e.g., determine learning targets), triplet loss is first used to learn embeddings for each vertical. A goal for vertical combination can be to use fewer numbers of individual specialized models, but without any observed retrieval accuracy degradation.
- Table 1 shows examples of retrieval accuracy metrics of an (1) individual model, (2) dress-top joint, or combined, model, and (3) dress-top-outerwear joint model, on verticals for "dresses", "tops", and “outerwear.”
- the dress-top joint model performs very similarly to, or slightly better than, the individual models on "dresses" and "tops"; however, the dress-top joint model does poorly with regard to retrieval accuracy of apparel items in the "outerwear" vertical category.
- a dress-top-outerwear joint model can cause significant accuracy degradation on all three verticals.
- Accuracy data of Table 1 indicates that some verticals can be combined to achieve better accuracy than individually trained models, but only to a certain extent, after which the model training difficulties of the triplet loss function cause accuracy degradation (described above).
- An example process of system 100, 200 can include combining different verticals of training data.
- nine apparel verticals can be combined into four groups, where one combined model is trained for each group.
- more or fewer than nine apparel verticals can be combined into a particular number of groups based on user requirements.
- the four groups are shown in Table 2 and can have comparable performance retrieval accuracy as the individually trained models of each group.
- "clean triplets” are used to fine-tune each of the four models, where the clean triplets are obtained from "image search” triplets (described below).
- system 200 can be configured to fine-tune model performance using the clean data to accomplish effective improvements in retrieval accuracy for each of the four models.
- a unified model for all nine verticals can be trained, or learned, and then generated based on the above described teachings.
- a generated model is deployed for operational use by multiple users.
- the unified model can receive image data transmitted by a user, where the user seeks to obtain identifying data about a particular object or apparel item included in an image.
- the unified model is configured to receive image data for an image, identify or recognize an apparel item in the image, and determine identifying information about the apparel item in the image. In some instances, the unified model is configured to retrieve a related image of the apparel item, and provide, for output to the user via a mobile device, identifying information about the apparel item and/or the related image of the apparel item.
- FIG. 5 shows a diagram 500 that includes graphical representations of respective embedding models for object verticals that correspond to a particular apparel category. Moreover, the depictions of diagram 500 can correspond to a process, executable by system 100, 200, for extracting one or more feature embeddings. As described above, in the context of apparel recognition, an example two-stage approach can be used for extracting feature embeddings associated with image data of an item or apparel object.
- a clothing item can be first detected and localized in the image.
- second, an embedding (e.g., a vector of floats) is computed from the detected and cropped portion of the image.
- the embedding is used by system 100, 200 to compute image similarity for item/object retrieval.
- the embedding obtained from the cropped image can correspond to a learning target for an apparel vertical to which the clothing item belongs.
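- The two-stage extraction can be sketched as follows; `detector` and `embedding_models` are placeholders for the trained SSD detector and per-vertical embedding models described above.

```python
# Sketch of the two-stage feature extraction: localize/classify, crop, embed.
def extract_embedding(image, detector, embedding_models):
    box, vertical_label = detector(image)                 # stage 1: localize and classify
    crop = image[box.top:box.bottom, box.left:box.right]  # crop the object pixels
    return embedding_models[vertical_label](crop)         # stage 2: compute the embedding
```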
- each embedding model 508, 510, 512, and 514 corresponds to object retrieval functions for identifying and retrieving certain apparel items.
- the apparel items can correspond to items depicted in the particular cropped portions of image data.
- arrow 515 indicates that uniform embedding model 516 corresponds to object retrieval functions for identifying and retrieving apparel items for objects depicted in each of the cropped portions of image data in block 506.
- Block 506 includes respective cropped images that each include a representation of a clothing or apparel item for an object vertical category of the embedding model.
- a first cropped image data that depicts a handbag corresponds to embedding model 510 for identifying and retrieving image data for handbags.
- a second cropped image data that depicts pants corresponds to embedding model 512 for identifying and retrieving image data for pants.
- each cropped image data at block 506 corresponds to unified embedding model 516 for identifying and retrieving image data for various types of apparel items.
- FIG. 6 shows an example diagram 600 including computing functions for obtaining image data for training one or more machine learning models.
- Diagram 600 can correspond to computing functions executable by one or more computing modules of system 100 or 200 described above.
- training data relating to images is first collected from search queries.
- the search queries can be accessed from an example search system that receives and stores large volumes of image search queries.
- system 100 can be configured to access a query data storage device, such as a privately owned query repository that stores thousands (e.g., 200,000) of user queries submitted using Google Image Search.
- the search queries can include specific product or apparel item names.
- parsing logic for an example query text parser is executed by system 100 to obtain an apparel class label for each text query obtained from the query repository.
- the text queries can be associated with nine distinct verticals: i) dresses, ii) tops, iii) footwear, iv) handbags, v) eyewear, vi) outerwear, vii) skirts, viii) shorts, and ix) pants.
- system 100 selects a particular number of top rated images (e.g., 30 images) for each search query, where images are rated based on image pixel quality and an extent to which query text data accurately describes an apparel item of the image.
- Image data of the top rated images can be used to form the "image search triplets" described above, where a triplet includes a positive image, a negative image, and an anchor image.
- System 100 can identify at least a subset of triplets (e.g., 20,000 triplets for each object vertical) for system or user rating and verification as to the correctness of each image in the triplet.
- rating and image verification includes determining whether an anchor image and a positive image of the triplet are from the same product/vertical category.
- Subsets of triplet images that are rated and verified as correct can be used to form a second set of triplets referred to herein as "clean triplets".
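- One illustrative way to represent the two triplet datasets is sketched below; the field names are assumptions for illustration, not structures defined by the patent.

```python
# Sketch of "image search" triplets and their verified "clean triplet" subset.
from dataclasses import dataclass

@dataclass
class Triplet:
    anchor: str             # image of the queried product
    positive: str           # same product as the anchor
    negative: str           # different product from the same vertical
    verified: bool = False  # set by system/user rating and verification

def clean_triplets(image_search_triplets):
    # Keep only triplets rated as correct (anchor and positive match).
    return [t for t in image_search_triplets if t.verified]
```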
- base network 103 can be initialized from a model pre-trained using one or more types of image data (e.g., ImageNet data).
- the same training data can be used for learning the unified embedding model as was used for triplet feature learning of two or more specialized models.
- generating a unified embedding learning model may only require vertical label data, which can be obtained via a localizer/classifier, as described above.
- unified embedding learning can be generated using the same training images as the training images generated during triplet embedding learning.
- Embodiments of the subject matter and the operations described in this specification can be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them.
- Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions, encoded on computer storage medium for execution by, or to control the operation of, data processing apparatus.
- the program instructions can be encoded on an artificially-generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus.
- a computer storage medium can be, or be included in, a computer-readable storage device, a computer-readable storage substrate, a random or serial access memory array or device, or a combination of one or more of them.
- a computer storage medium is not a propagated signal
- a computer storage medium can be a source or destination of computer program instructions encoded in an artificially-generated propagated signal.
- the computer storage medium can also be, or be included in, one or more separate physical components or media (e.g., multiple CDs, disks, or other storage devices).
- the term "data processing apparatus” encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, a system on a chip, or multiple ones, or combinations, of the foregoing.
- the apparatus can include special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
- the apparatus can also include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, a cross-platform runtime environment, a virtual machine, or a combination of one or more of them.
- the apparatus and execution environment can realize various different computing model infrastructures, such as web services, distributed computing and grid computing infrastructures.
- a computer program (also known as a program, software, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, declarative or procedural languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, object, or other unit suitable for use in a computing environment.
- a computer program may, but need not, correspond to a file in a file system.
- a program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
- a computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
- the processes and logic flows described in this specification can be performed by one or more programmable processors executing one or more computer programs to perform actions by operating on input data and generating output.
- the processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
- processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer.
- a processor will receive instructions and data from a read-only memory or a random access memory or both.
- the essential elements of a computer are a processor for performing actions in accordance with instructions and one or more memory devices for storing instructions and data.
- a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer need not have such devices.
- a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device (e.g., a universal serial bus (USB) flash drive), to name just a few.
- Devices suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.
- the processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
- to provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user, and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer.
- a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on the user's device in response to requests received from the web browser.
- Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back-end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front-end component, e.g., a user computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back-end, middleware, or front-end components.
- the components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network.
- Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), an inter-network (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks).
- the computing system can include users and servers.
- a user and a server are generally remote from each other and typically interact through a communication network. The relationship of user and server arises by virtue of computer programs running on the respective computers and having a user-server relationship to each other.
- a server transmits data (e.g., an HTML page) to a user device (e.g., for purposes of displaying data to and receiving user input from a user interacting with the user device).
- data generated at the user device (e.g., a result of the user interaction) can be received from the user device at the server.
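The exchange described in the last few paragraphs can be made concrete with a minimal sketch that uses only Python's standard-library http.server; the port, form field, and page content are invented for this example and are not taken from the specification.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The server transmits data (an HTML page) to the user device.
        body = (b"<html><body><form method='post'>"
                b"<input name='q'><input type='submit'></form></body></html>")
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_POST(self):
        # Data generated at the user device (here, the submitted form)
        # is received from the user device at the server.
        length = int(self.headers.get("Content-Length", 0))
        user_input = self.rfile.read(length)
        self.log_message("received %s", user_input)
        self.send_response(204)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), Handler).serve_forever()
```

Any of the back-end, middleware, or front-end arrangements listed above reduces to this same pattern: the server emits a document, and the user device returns data generated by the user interaction.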
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Multimedia (AREA)
- Data Mining & Analysis (AREA)
- Medical Informatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Databases & Information Systems (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Human Computer Interaction (AREA)
- Biodiversity & Conservation Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Image Analysis (AREA)
- Image Processing (AREA)
Abstract
The invention relates to a computer-implemented method for generating a unified machine learning model using a neural network on a data processing apparatus. The method includes determining, by the data processing apparatus, respective learning targets for each of a plurality of object verticals. The data processing apparatus determines the respective learning targets based on at least two embedding outputs of the neural network. The method also includes training, by the data processing apparatus, the neural network to identify data associated with each of the plurality of object verticals. The data processing apparatus trains the neural network using the respective learning targets and based on a first loss function. The data processing apparatus uses the trained neural network to generate a unified machine learning model, the model being configured to identify particular data items associated with each of the plurality of object verticals.
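To make the abstract concrete, here is a minimal sketch of training one shared embedding network over several object verticals. It is not the patented method: it assumes PyTorch and a triplet-style choice of the first loss function (in the spirit of the FaceNet paper listed under Non-Patent Citations below), and all names in it (EmbeddingNet, train_step, the vertical labels) are invented for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingNet(nn.Module):
    """Toy backbone that maps images to L2-normalized embedding vectors."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.proj = nn.Linear(64, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Unit-norm embeddings keep distances comparable across verticals.
        return F.normalize(self.proj(self.features(x)), dim=1)

def train_step(model, optimizer, batches_by_vertical, margin: float = 0.2):
    """One update over triplets drawn from several object verticals.

    `batches_by_vertical` maps a vertical name (e.g., "shoes") to an
    (anchor, positive, negative) image batch; each vertical's learning
    target is expressed through relative distances between the network's
    embedding outputs, and one shared model is updated for all of them.
    """
    loss_fn = nn.TripletMarginLoss(margin=margin)  # stand-in "first loss function"
    optimizer.zero_grad()
    total = torch.zeros(())
    for vertical, (anchor, positive, negative) in batches_by_vertical.items():
        total = total + loss_fn(model(anchor), model(positive), model(negative))
    total.backward()
    optimizer.step()
    return float(total)

if __name__ == "__main__":
    model = EmbeddingNet()
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    fake = lambda: torch.randn(8, 3, 64, 64)  # stand-in image batches
    batches = {v: (fake(), fake(), fake()) for v in ("shoes", "dresses")}
    print("loss:", train_step(model, opt, batches))
```

The point the abstract makes is captured in the loop: a single set of network weights receives gradients from every vertical's learning target, so the trained model embeds items from all verticals into one space rather than requiring a separate per-vertical model.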
Priority Applications (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202510270671.8A CN120298851A (zh) | 2017-07-17 | 2017-11-17 | Method and apparatus for learning unified embedding |
| US16/494,842 US20200090039A1 (en) | 2017-07-17 | 2017-11-17 | Learning unified embedding |
| EP17812137.2A EP3642764B1 (fr) | 2017-07-17 | 2017-11-17 | Learning unified embedding |
| CN201780089483.9A CN110506281B (zh) | 2017-07-17 | 2017-11-17 | Learning unified embedding |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US201762533535P | 2017-07-17 | 2017-07-17 | |
| US62/533,535 | 2017-07-17 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2019017990A1 true WO2019017990A1 (fr) | 2019-01-24 |
Family
ID=60655079
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2017/062222 Ceased WO2019017990A1 (fr) | 2017-11-17 | Learning unified embedding |
Country Status (4)
| Country | Link |
|---|---|
| US (1) | US20200090039A1 (fr) |
| EP (1) | EP3642764B1 (fr) |
| CN (2) | CN120298851A (fr) |
| WO (1) | WO2019017990A1 (fr) |
Families Citing this family (11)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN108596882B (zh) * | 2018-04-10 | 2019-04-02 | 中山大学肿瘤防治中心 | Method and apparatus for recognizing pathology images |
| EP3847588A1 (fr) * | 2019-11-27 | 2021-07-14 | Personalized data model using closed data |
| US11449717B2 (en) * | 2020-03-12 | 2022-09-20 | Fujifilm Business Innovation Corp. | System and method for identification and localization of images using triplet loss and predicted regions |
| US20230117881A1 (en) * | 2020-04-01 | 2023-04-20 | NEC Laboratories Europe GmbH | Method and system for learning novel relationships among various biological entities |
| AU2021259170B2 (en) | 2020-04-21 | 2024-02-08 | Google Llc | Supervised contrastive learning with multiple positive examples |
| CN111652356B (zh) * | 2020-06-01 | 2025-02-14 | 深圳前海微众银行股份有限公司 | Neural network model protection method, apparatus, device, and readable storage medium |
| EP4189301A4 (fr) * | 2020-07-30 | 2024-08-28 | System and methods for determining operational relationships in building automation and control networks |
| CN114579294B (zh) * | 2020-12-02 | 2024-07-12 | 上海交通大学 | Container auto-scaling system supporting prediction of service load surges in cloud-native environments |
| KR20240032283A (ko) * | 2022-09-02 | 2024-03-12 | 삼성전자주식회사 | Method of training an image representation model and computing device performing the same |
| TWI857629B (zh) * | 2023-05-31 | 2024-10-01 | 宏碁股份有限公司 | Display method for a key-information interface, method for building a key-information inference model, and electronic device applying the same |
| WO2024263488A1 (fr) * | 2023-06-19 | 2024-12-26 | Tesla, Inc. | Clip search with multimodal queries |
Family Cites Families (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20080065273A1 (en) * | 2006-08-28 | 2008-03-13 | Dan Gerrity | Method and device for adaptive control |
| US9053436B2 (en) * | 2013-03-13 | 2015-06-09 | Dstillery, Inc. | Methods and system for providing simultaneous multi-task ensemble learning |
| US9477908B2 (en) * | 2014-04-10 | 2016-10-25 | Disney Enterprises, Inc. | Multi-level framework for object detection |
| CN108604227B (zh) * | 2016-01-26 | 2023-10-24 | 皇家飞利浦有限公司 | System and method for neural clinical paraphrase generation |
| CN106649886A (zh) * | 2017-01-13 | 2017-05-10 | 深圳市唯特视科技有限公司 | Image retrieval method using deep supervised hashing with triplet labels |
| US9990687B1 (en) * | 2017-01-19 | 2018-06-05 | Deep Learning Analytics, LLC | Systems and methods for fast and repeatable embedding of high-dimensional data objects using deep learning with power efficient GPU and FPGA-based processing platforms |
| CN106897390B (zh) * | 2017-01-24 | 2019-10-15 | 北京大学 | Precise object retrieval method based on deep metric learning |
| CN106951911B (zh) * | 2017-02-13 | 2021-06-29 | 苏州飞搜科技有限公司 | Fast multi-label image retrieval system and implementation method |
- 2017
- 2017-11-17 US US16/494,842 patent/US20200090039A1/en active Pending
- 2017-11-17 EP EP17812137.2A patent/EP3642764B1/fr active Active
- 2017-11-17 CN CN202510270671.8A patent/CN120298851A/zh active Pending
- 2017-11-17 CN CN201780089483.9A patent/CN110506281B/zh active Active
- 2017-11-17 WO PCT/US2017/062222 patent/WO2019017990A1/fr not_active Ceased
Patent Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2016100717A1 (fr) * | 2014-12-17 | 2016-06-23 | Google Inc. | Generating numeric embeddings of images |
| US20160321522A1 (en) * | 2015-04-30 | 2016-11-03 | Canon Kabushiki Kaisha | Devices, systems, and methods for pairwise multi-task feature learning |
Non-Patent Citations (5)
| Title |
|---|
| DEVASHISH SHANKAR ET AL: "Deep Learning based Large Scale Visual Recommendation and Search for E-Commerce", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 7 March 2017 (2017-03-07), XP080754872 * |
| SCHROFF FLORIAN ET AL: "FaceNet: A unified embedding for face recognition and clustering", 2015 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), IEEE, 7 June 2015 (2015-06-07), pages 815 - 823, XP032793492, DOI: 10.1109/CVPR.2015.7298682 * |
| SERGEY IOFFE ET AL: "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift", 2 March 2015 (2015-03-02), pages 1 - 11, XP055266268, Retrieved from the Internet <URL:http://arxiv.org/pdf/1502.03167v3.pdf> [retrieved on 20160418] * |
| SONG YANG ET AL: "Learning Unified Embedding for Apparel Recognition", 2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW), IEEE, 22 October 2017 (2017-10-22), pages 2243 - 2246, XP033303688, DOI: 10.1109/ICCVW.2017.262 * |
| VIJAY KUMAR B G ET AL: "Learning Local Image Descriptors with Deep Siamese and Triplet Convolutional Networks by Minimizing Global Loss Functions", 2016 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), IEEE, 27 June 2016 (2016-06-27), pages 5385 - 5394, XP033021734, DOI: 10.1109/CVPR.2016.581 * |
Cited By (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| TWI725707B (zh) * | 2019-01-29 | 2021-04-21 | 荷蘭商Asml荷蘭公司 | Method for categorizing substrates subjected to a semiconductor process and constructing an associated decision model, and associated computer program and non-transitory computer program carrier |
| WO2021038592A3 (fr) * | 2019-08-30 | 2021-04-01 | Tata Consultancy Services Limited | System and method for handling popularity bias in item recommendations |
| US12051099B2 (en) | 2019-08-30 | 2024-07-30 | Tata Consultancy Services Limited | System, method, and non-transitory machine readable information storage medium for handling popularity bias in item recommendations |
| WO2021183151A1 (fr) * | 2020-03-11 | 2021-09-16 | Google Llc | Cross-example softmax and/or cross-example negative mining |
| CN115244527A (zh) * | 2020-03-11 | 2022-10-25 | 谷歌有限责任公司 | Cross-example softmax and/or cross-example negative mining |
| US20230267722A1 (en) * | 2021-02-25 | 2023-08-24 | Mitsubishi Electric Corporation | Loss contribution detecting method and loss contribution detecting system |
| US12417625B2 (en) * | 2021-02-25 | 2025-09-16 | Mitsubishi Electric Corporation | Loss contribution detecting method and loss contribution detecting system |
Also Published As
| Publication number | Publication date |
|---|---|
| EP3642764B1 (fr) | 2025-03-12 |
| CN120298851A (zh) | 2025-07-11 |
| CN110506281A (zh) | 2019-11-26 |
| CN110506281B (zh) | 2025-03-25 |
| US20200090039A1 (en) | 2020-03-19 |
| EP3642764A1 (fr) | 2020-04-29 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| EP3642764B1 (fr) | Learning unified embedding | |
| Liu et al. | Localization guided learning for pedestrian attribute recognition | |
| Tan et al. | Attention-based pedestrian attribute analysis | |
| Zhang et al. | Detection of co-salient objects by looking deep and wide | |
| US11494616B2 (en) | Decoupling category-wise independence and relevance with self-attention for multi-label image classification | |
| Li et al. | Multi-attribute learning for pedestrian attribute recognition in surveillance scenarios | |
| Liu et al. | Fashion parsing with weak color-category labels | |
| Li et al. | Pedestrian Attribute Recognition by Joint Visual-semantic Reasoning and Knowledge Distillation. | |
| Yang et al. | Robust face alignment under occlusion via regional predictive power estimation | |
| Zhao et al. | Fine-grained multi-human parsing | |
| Xia et al. | An evaluation of deep learning in loop closure detection for visual SLAM | |
| Chen et al. | Modeling fashion | |
| Lu et al. | Co-bootstrapping saliency | |
| CN108229559A (zh) | Clothing detection method and apparatus, electronic device, program, and medium | |
| CN112906730A (zh) | Information processing method and apparatus, and computer-readable storage medium | |
| Pu et al. | Learning recurrent memory activation networks for visual tracking | |
| Zhan et al. | DeepShoe: An improved Multi-Task View-invariant CNN for street-to-shop shoe retrieval | |
| Karaoglu et al. | Detect2rank: Combining object detectors using learning to rank | |
| Wang et al. | Negative deterministic information-based multiple instance learning for weakly supervised object detection and segmentation | |
| Peng et al. | Learning weak semantics by feature graph for attribute-based person search | |
| Liu et al. | Classification of fashion article images based on improved random forest and VGG-IE algorithm | |
| CN113822134A (zh) | Video-based instance tracking method, apparatus, device, and storage medium | |
| Wang et al. | Toward Real-World Multi-View Object Classification: Dataset, Benchmark, and Analysis | |
| Galiyawala et al. | Person retrieval in surveillance videos using attribute recognition | |
| Nogueira et al. | Pointwise and pairwise clothing annotation: combining features from social media |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 17812137; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | ENP | Entry into the national phase | Ref document number: 2017812137; Country of ref document: EP; Effective date: 20200120 |
| | WWG | WIPO information: grant in national office | Ref document number: 201780089483.9; Country of ref document: CN |