WO2021030339A1 - Method and system for efficient multimodal object recognition - Google Patents
Method and system for efficient multimodal object recognition
- Publication number
- WO2021030339A1 (PCT/US2020/045754)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- image
- modality
- processor
- labels
- color
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B07—SEPARATING SOLIDS FROM SOLIDS; SORTING
- B07C—POSTAL SORTING; SORTING INDIVIDUAL ARTICLES, OR BULK MATERIAL FIT TO BE SORTED PIECE-MEAL, e.g. BY PICKING
- B07C5/00—Sorting according to a characteristic or feature of the articles or material being sorted, e.g. by control effected by devices which detect or measure such characteristic or feature; Sorting by manually actuated devices, e.g. switches
- B07C5/34—Sorting according to other particular properties
- B07C5/342—Sorting according to other particular properties according to optical properties, e.g. colour
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/254—Fusion techniques of classification results, e.g. of results related to same input data
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/56—Extraction of image or video features relating to colour
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/60—Type of objects
- G06V20/62—Text, e.g. of license plates, overlay texts or captions on TV images
- G06V20/63—Scene text, e.g. street names
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/60—Type of objects
- G06V20/68—Food, e.g. fruit or vegetables
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/09—Recognition of logos
Definitions
- This invention relates to the field of autonomous retail checkout and vision-based object recognition for package sorting and picking.
- Self-checkout lanes have existed in food retail for several years, but only about 10% of all checkouts occur through those lanes. These self-checkout lanes save the retailer significant costs, which is why more and more retailers are moving toward cashierless checkout processes. But with many of these systems, customers have to do more work by scanning their own items. For example, Walmart's new fast lane requires customers to scan their items individually with an app on their mobile phone as they shop, before they proceed to a specific cashierless checkout lane. Sam's Club recently unveiled a new cashierless store with a similar checkout process that involved customers using an app on their mobile phone. However, consumers do not want the added burden of having to scan items themselves.
- Distribution centers and package sortation facilities utilize humans to position packages onto conveyor belts so that barcodes or labels on packages are always facing in a particular direction and orientation (typically up). This facilitates the reading of the barcodes (or labels) by static laser scanners placed above or parallel to the conveyor belts. Barcodes are scanned at a certain rate that typically translates to the speed of the conveyor. If barcodes and/or labels could be read faster and from any orientation/direction, the sortation speed could be increased, and workers would no longer need to position packages in prescribed orientations.
- The ability to localize and decode barcodes and labels in real-time and from any orientation maximizes efficiencies and minimizes labor costs, and enables new solutions to come to fruition in markets such as retail, order fulfillment and distribution.
- The method and system described herein is a solution that eliminates one of the most painful pain points for the consumer when shopping in a physical store: the checkout line. Checkout is an expensive and time-consuming process for the person scanning the products, be it the store's cashier or the customer.
- The disclosed method and system eliminates the need for humans to position packages such that barcodes and labels are always oriented in the same direction.
- The method and system allows barcodes and packages to be localized and decoded in-flight as packages/objects move through the field of view of a camera at high speed. Barcodes are read in real-time by high-speed/high-frame-rate digital cameras such that motion blur and other disturbances are minimized.
- The method and system also identifies packages based on other modalities such as a graphic logo, text, color or weight if the barcode is unreadable (or in conjunction with a barcode read, to raise the accuracy of the system).
- Video frames (from one or more cameras) are fed into algorithms that localize and identify the barcodes and other modalities in the image at very high speeds. After barcode localization, the barcode is decoded using on-board software and the product, price and other attributes are identified. In some cases, this happens by using its SKU and the retailer's (or other) database(s). Similarly, other identifying modalities from the package or label information may be localized and processed, and the product, price and/or other attributes are identified using the decoded description and additional, contextual and correlated information located in a database.
4.0 SUMMARY
- the present invention provides an elegant solution to the needs described above and offers numerous additional benefits and advantages, as will be apparent to persons of skill in the art.
- a unique system for recognition of an object is disclosed.
- the system includes a visual sensor to take an image of the object and a processor connected to the visual sensor.
- the processor (a) receives the image from the sensor; (b) localizes the object within the image; (c) generates coordinates within the image that correspond to a region of interest within which the object has been localized; (d) analyzes the region of interest to identify object labels, wherein the object labels are based on at least one modality selected from a group consisting of: a barcode, logo, text, and color; (e) for each modality, establishes an object identification list based on the object labels, wherein the object identification list identifies candidates; (f) for each candidate, calculates an arbitration score by (i) calculating a recognition value based on the accuracy of the modality and either (1) the conditional probability based on one or more of the identified object labels or (2) the average probability of the existence of one or more of the identified object labels, and (ii) summing the recognition values; and (g) identifies the object based on a ranking of the arbitration scores.
- the system may use two modalities selected from the group consisting of: a barcode, logo, text, and color. Additional modalities may be used, including weight (wherein the system would have a scale connected to the processor) and location (wherein the system would have a location sensor connected to the processor).
- the processor may decimate the image, convert the image to a monochromatic image, and/or process the image with at least one processing technique selected from a group consisting of: high dynamic range conversion, image registration, denoising, contrast stretching, super-resolution conversion, and white-balancing.
- the processor may also receive a plurality of images in step (a) and then perform change detection that detects changes in the plurality of images based on a comparison to a static background.
- the processor may apply weights to the recognition values prior to summing the recognition values.
- the sensor may be an RGB sensor, and prior to step (d) the processor may transform the image into a YCbCr image, and create a first image based on the Y channel of the YCbCr image and a second image based on the CbCr channels of the YCbCr image.
- the barcode modality may be based on the first image, and it may have a conditional probability of 1.
- the color modality may be based on the second image.
- the system may have a touch screen and/or a speaker connected to the processor.
- a light may also be connected to the processor, and prior to step (a) the processor may activate the light.
- the system can also have a transceiver to allow the processor to send data to and receive data from a central processor.
- the system may be mounted on a mobile shopping cart or a self-checkout stand.
- FIG. 1 A is a first portion of the system and method flow chart.
- FIG. 1B is a second portion of the system and method flow chart.
- FIG. 1C is a third portion of the system and method flow chart.
- FIG. 2 is a front and back of a representative package.
- FIG. 3 is the front of a representative package.
- FIG. 4A is a graph representing the conditional probability function approximated by a shifted sigmoid function having an inflection point of 3.
- FIG. 4B is a graph representing the performance score of a text classifying system with an average accuracy of 95% and an inflection value of 3.
- FIG. 4C is a graph representing the performance score of a text classifying system with an average accuracy of 75% and an inflection value of 3.
- FIG. 4D is a graph representing the performance score of a text classifying system with an average accuracy of 50% and an inflection value of 3.
- FIG. 5A shows the fronts of representative packages.
- FIG. 5B is a zoomed-in portion of the front of a representative package shown in FIG. 5A.
- FIG. 6 is a schematic of the various components of a cart-mounted system.
- FIG. 7 is a schematic of several cart-mounted systems and a self-checkout stand mounted system in data communication with a central processor.
- FIG. 8 illustrates the system used with a conveyor and a robotic arm.
- A connection, relationship or communication between two or more entities does not necessarily mean a direct, unimpeded connection, as a variety of other entities or processes may reside or occur between any two entities. Consequently, an indicated connection does not mean a direct, unimpeded connection unless otherwise noted.
- An object can possess many different signal modalities. These modalities can be color, weight, shape, and volume. Some classes of objects might include specialized modalities such as barcode or logo. The method and system described herein uses the following modalities for the purpose of object recognition: barcode, graphic logo, text, color, and weight. The overall method is shown in FIGS. 1A through 1C, and will now be discussed in detail.
I. Image Capture and Object Localization Subcomponent (FIG. 1A)
- the image sensor or camera 105 captures a full resolution RGB image in step 110.
- the image sensor may instead capture a monochromatic image using, for example, a black and white camera. That image is then decimated to improve the efficiency as only the minimally necessary data is used for processing.
- the full-resolution RGB or monochromatic image is not necessary. Therefore, the output image from the cameras can be decimated in step 115 to a lower resolution by sampling every n-th pixel of the original image in both the vertical and horizontal dimensions, where n is a natural number. Decimation is preferred over resampling since it is a lower-complexity operation.
- In step 115, if necessary, the decimated RGB image is converted to a monochrome image.
- the first way is to convert from RGB color space to YCbCr color space and retain only the Y channel image.
- the pixel conversion is a simple linear transformation (matrix multiply), which should incur a small computational cost if the image resolution is sufficiently small.
- the second way discards the blue and red channels from the RGB image and retains only the green channel image.
- The green channel, situated between the red and blue channels in the frequency spectrum, has the broadest responsivity among the RGB channels, so it can be used as a proxy luminance channel.
- This method is even more computationally efficient at creating a monochrome image because it simply keeps one channel of the signal and discards the other two.
- Because the green channel in a typical Bayer sensor (with an RGBG configuration) has twice as many pixels as the red or blue channel, it has relatively good signal-to-noise characteristics and can be used as a replacement for the Y signal, albeit with a narrower spectral sensitivity.
- step 115 thus has an input of a full-resolution RGB image and an output of either a decimated Y-channel or green-channel image (step 120), or an input of a full-resolution monochromatic image and an output of a decimated monochrome image (step 120).
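- As an illustration only (not part of the original disclosure), the decimation and monochrome-conversion step described above might be sketched as follows; the decimation factor n and the BT.601 luma weights are assumptions, since the text does not fix either choice.

```python
import numpy as np

def decimate_and_monochrome(rgb: np.ndarray, n: int = 4, use_green_proxy: bool = False) -> np.ndarray:
    """Decimate a full-resolution RGB image and reduce it to a single channel.

    rgb: H x W x 3 uint8 array; n: keep every n-th pixel in both dimensions.
    """
    small = rgb[::n, ::n, :]                 # decimation: sample every n-th pixel
    if use_green_proxy:
        return small[:, :, 1]                # green channel as a proxy luminance signal
    # Otherwise keep the Y channel of a YCbCr conversion (BT.601 luma weights assumed).
    r, g, b = small[:, :, 0], small[:, :, 1], small[:, :, 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    return y.astype(np.uint8)
```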
- Regions of Interest (ROI)
- a ROI localizer is used to detect the presence of the object in an image.
- an optional Change Detector 125 for each camera can be used to identify the object versus the background of an image.
- Step 125 is a change detector that identifies areas of an image that have changed from a static background.
- the Background Image, B, of a static-background camera can be established by averaging a set of m input images, I, before an object is presented to the camera:
- I(x,y) is the pixel value of the image I at coordinate (x,y).
- B can incorporate new images over time.
- the weighted sum of images changes by gradually incorporating new images periodically.
- subsequent images can be compared to the background image to establish the Changed Binary Image, C(x, y). If the pixel difference between the background image and the input image exceeds the pixel threshold map, T(x,y), then that pixel is deemed "changed":
- the cropped image from the change detector in a static-background camera, or the entire image from a non-static background camera can then be sent to the object localizer 130 to extract the ROI portion of the image for further classification.
- the optional step 125 has an input of a decimated-resolution monochrome image and an output of the coordinates of a cropped and decimated image.
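- A minimal sketch of the change detector, assuming NumPy arrays and assuming the missing equations take the straightforward mean-background and thresholded-difference forms implied by the text:

```python
import numpy as np

def background_image(frames: list) -> np.ndarray:
    """B(x, y): average of m object-free frames from a static-background camera."""
    return np.mean(np.stack(frames, axis=0), axis=0)

def changed_binary_image(frame: np.ndarray, background: np.ndarray,
                         threshold_map: np.ndarray) -> np.ndarray:
    """C(x, y) = 1 where |I(x, y) - B(x, y)| exceeds the threshold map T(x, y)."""
    return (np.abs(frame.astype(float) - background) > threshold_map).astype(np.uint8)

def changed_crop_coords(changed: np.ndarray):
    """Bounding box of the changed pixels, used to crop the decimated image."""
    ys, xs = np.nonzero(changed)
    if ys.size == 0:
        return None                          # no change detected in this frame
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```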
- For both static and non-static background cameras, an object localizer 130 would be implemented. Classifiers such as deep neural networks are computationally intensive, so it is very inefficient to send a large image to the classifier if most of the image does not contain the object of interest. With the use of an object localizer 130, the goal is to remove as much of the image as possible using a relatively small amount of compute resources, and only send the likely areas of the image to the classifier for object classification.
- One implementation of the object localizer 130 is a cascade classifier in which the presence of the object's essential attributes is checked serially.
- the object localizer 130 is a light-weight, pre-trained classifier that does not need object feature input in real-time; rather this classifier may be trained off-line with object feature data.
- the cascade classifier is a suitable method since it is a serial greedy algorithm that looks for the presence of the most essential object features first, and if the next essential feature is absent, then the image portion is rejected. In other words, if large portions of an image do not contain the object, those areas can be rejected efficiently by a cascade classifier. This is computationally more efficient than neural networks where the classification decision is made only after all features have been extracted.
- the output of this object localizer 130 is an image region based on the decimated resolution of the original image. This is achieved by taking the image coordinates produced by the localizer and remapping them from the decimated dimensions to the full dimensions of the image.
- step 130 has an input of a decimated resolution monochrome image and an output of coordinates of the detected potential object based on decimated image size.
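- As a rough illustration, the cascade-classifier localizer could be realized with OpenCV's pre-trained cascade machinery; the cascade file path, decimation factor, and detection defaults below are placeholders rather than values taken from the disclosure.

```python
import cv2

def localize_object(decimated_gray, cascade_path: str, decimation: int = 4):
    """Run a pre-trained cascade classifier on the decimated monochrome image and
    remap each detection from decimated coordinates to full-resolution coordinates."""
    cascade = cv2.CascadeClassifier(cascade_path)      # trained off-line on object features
    detections = cascade.detectMultiScale(decimated_gray)
    # Each detection is (x, y, w, h) in the decimated image; scale back up.
    return [(x * decimation, y * decimation, w * decimation, h * decimation)
            for (x, y, w, h) in detections]
```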
- II. Image Enhancement Subcomponent (FIG. 1B)
- RGB-to-YCbCr converter 150 performs the linear transformation that converts an image from the RGB color space to the YCbCr color space. Though the two color spaces contain the same amount of information, the transformation enables brightness and color signals to be readily separable. Specifically, the RGB color space represents color signals using 3 color channels, while the YCbCr color space represents color using 1 luminance channel and 2 independent color channels.
- There are two advantages to using YCbCr for color-matching applications: 1) it separates the luminance portion of the signal (Y) from the chrominance portion of the signal (CbCr: Cr for the Red-Green dimension, Cb for the Blue-Yellow dimension), so that color matching can be more accurate because it no longer depends on the brightness of the object; and 2) the Euclidean distance between related colors is more consistent, so that colorimetric comparison (the distance between two colors) is more reliable.
- In contrast, the RGB color space has the luminance signal mingled into all three color channels, such that changing the luminance alone (without changing the color itself) will result in all three channel values being changed, whereas in the YCbCr color space, if only the luminance is changed, then only the Y channel is impacted and the Cb and Cr channels remain unchanged.
- step 150 has an input of the coordinates of the detected potential object (based on decimated image size) and a full-resolution RGB image, and an output of the potential object's Y image in full resolution (step 155) or the potential object's CbCr image in full resolution (step 160).
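- The linear RGB-to-YCbCr transformation of step 150 can be written as a single matrix multiply per pixel; the BT.601 coefficients below are one common convention and are an assumption, since the text does not name a specific YCbCr variant.

```python
import numpy as np

# Assumed BT.601 full-range RGB -> YCbCr matrix.
_RGB_TO_YCBCR = np.array([[ 0.299,     0.587,     0.114   ],
                          [-0.168736, -0.331264,  0.5     ],
                          [ 0.5,      -0.418688, -0.081312]])

def rgb_to_ycbcr(rgb: np.ndarray) -> np.ndarray:
    """Convert an H x W x 3 RGB image to YCbCr with one matrix multiply per pixel."""
    ycc = rgb.astype(float) @ _RGB_TO_YCBCR.T
    ycc[..., 1:] += 128.0                    # center the two chroma channels
    return ycc
```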
- the system may pre-process the images before passing them to the uni-modality/multi-modality classifier (step 200, FIG. 1C). If improved image quality is not needed, an unprocessed image package 195 is passed to the uni-modality/multi-modality classifier (step 200, FIG. 1C). If improved image quality is needed, then multiple consecutive, temporally adjacent images generated by the RGB-to-YCbCr Converter can be buffered together in step 170 before they are sent to the subsequent image processing stage.
- In step 175, three conditions may satisfy the criterion of whether a sufficient number of frames have been collected: (1) the number of frames exceeds a threshold; (2) the spatial coordinates of the frames (generated by the localizer) are no longer contiguous, indicating a new object has entered the field of view (FOV) of the camera; or (3) the elapsed time from the first buffered frame has exceeded a threshold. Since buffering introduces latency, this buffering step 170 should only be performed if real-time performance constraints are satisfied. Thus steps 165, 170 and 175 have an input of multiple single-object images, and an output of a buffered list of object images (step 180).
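- The three buffering criteria of step 175 could be checked as sketched below; every numeric threshold is a placeholder, and the contiguity test (an IoU comparison of consecutive ROIs) is one plausible reading of "no longer contiguous".

```python
import time

def _iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def buffer_is_sufficient(frames, rois, start_time,
                         max_frames=8, max_seconds=0.25, iou_floor=0.1) -> bool:
    """Return True when enough temporally adjacent frames have been buffered."""
    if len(frames) >= max_frames:                          # (1) frame-count threshold
        return True
    if len(rois) >= 2 and _iou(rois[-2], rois[-1]) < iou_floor:
        return True                                        # (2) ROIs no longer contiguous
    return (time.monotonic() - start_time) > max_seconds   # (3) elapsed-time threshold
```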
- step 185 has an input of a buffered set of object images (from step 180), and an output of a single enhanced/processed image (step 190).
- III. Object Identification Subcomponent (FIG. 1C)
- the modality of an object can be any meta-data that can be extracted from an object using a sensor or sensors.
- these modalities can be shapes, edges, textures, contrast, colors, texts, graphics, etc. If the object is a retail package, then the modalities can also include brand logo and barcode.
- the modality can be weight 210.
- the uni-modality/multi-modality classifier may access object meta-data 202 when identifying an object across the various modalities.
- a uni-modality/multi-modality classifier 200 can be implemented using a neural network, a cascade classifier, a decision tree, or other methods that map a high-dimension signal to a low-dimension label output. Many such classifiers output a single label, but a classifier can also output a set of labels, or a NULL label in the case of non-classification.
- the classification of an image can be formally described as q = f(I(x,y)), where:
- I(x,y) is the input object image
- f(·) is a classifier that maps the image to a label q
- the accuracy of a classifier can be measured by r:
- TP is the count of True Positives
- TN is the count of True Negatives
- FP is the count of False Positives
- FN is the count of False Negatives.
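- The formula for r did not survive extraction; given the four counts defined above, it is presumably the standard accuracy ratio:

$$r = \frac{TP + TN}{TP + TN + FP + FN}$$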
- Classification accuracy is dependent on the algorithm, but it can also be dependent on its training data if it is a data-driven learning system such as a neural network or a cascade classifier. As such when a classifier’s algorithm or training data are changed or improved, then the value of r for the classifier might change as well.
- the performance of a multi-modal classification system is highly dependent on the accuracy of each individual classifier. The optimal multi-modal system configuration will need to be recomputed when the accuracy of a classifier changes.
- the uni-modality/multi-modality classifier 200 inputs a single object image and the output is a list of candidate object IDs that can be passed to the multi-modal arbitrator 215.
- the set of object IDs includes object ID lists based on different modalities (200-1, 200-2, 200-3).
- Each classifier for each modality may implement a pre-determined threshold for a confidence level, below which object IDs will not be considered reliable and will not be passed to the multi-modal arbitrator 215.
- a well-trained classifier should only report one object ID; however, in practice, it is possible that more than one object ID is reported by each classifier for each modality.
- a barcode may be a modality used in the uni-modal/multi-modality classifier 200.
- Barcodes are designed to uniquely identify a retail product; even for a product that is sold by weight, a barcode can be used to uniquely identify the product and its price. As such, there is a 1-to-1 mapping from most barcodes to a product identity, and vice versa. In other words, if the barcode is correctly recognized, then there is almost no ambiguity in the identity of the product. The only time this may not be the case is when the barcode is specially formatted by the retailer to describe and price items sold by weight, such as meat or deli counter items. In such a case, the barcode could define a 'group' or 'class' of items such as meat, bakery, deli, etc. and a price, instead of a specific type of item and price as would usually be the case. To put it formally:
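- The formal statement was lost in extraction; consistent with the 1-to-1 mapping described here (and with the summary's note that the barcode modality may have a conditional probability of 1), a plausible reconstruction is:

$$P(q_k \mid b_k) = 1$$

where $b_k$ is the barcode assigned to product $q_k$.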
- Most retail packages also contain one or more readily identifiable brand or product or graphics logos so that customers can easily distinguish one brand of products from another.
- logos and brand names are also a modality that can be used in the classifier to generate an object ID.
- the object identity can be narrowed to a small set of related products of the same brand.
- If a logo is correctly recognized on an object, then there is a bounded ambiguity, namely u possibilities, in the identity of the product, where u is the number of products bearing the same logo on their packages.
- u_j is the number of individual products having the logo l_j on their packages.
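- The missing relation presumably spreads the probability uniformly over the products sharing a logo (a reconstruction, assuming a uniform prior over those products):

$$P(q_k \mid l_j) = \frac{1}{u_j}$$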
- the large(r) texts on a package can also serve as a modality used in the uni-modal classifier to generate an object ID.
- a text can generally serve as a non-unique identifier of a product where there is a one-to-many mapping from a text to a product identity.
- Unlike a logo, which is typically present only once on a face of a package, there can be multiple instances of text, such that a package might be uniquely identified when these texts are recognized in conjunction. For example, in FIG. 3, if "keto", "biscuits", and "mix" are recognized on the same package, then the joint conditional probability might be sufficiently close to one.
- the recognition performance metric can be defined as:
- Equation 17 can be further simplified as:
- For example, if three texts were recognized on the package, then Equation 19 becomes:
- the system can optimize the number of texts to identify the object. Additional texts would not necessarily increase the accuracy of the system, and would actually slow down the system’s processing. For example, if the text classifier has an average accuracy of 0.95, then, on average, the optimal number of texts to decode would be six as shown in FIG. 4B. If the text classifier has an average accuracy of 0.75, then, on average, the optimal number of texts to decode would be four as shown in FIG. 4C. And shown in FIG. 4D, if the text classifier has an average accuracy of only 0.5, then, on average, the optimal number of texts to decode would be two.
- a shifted sigmoid is a good approximation (FIG. 4A, Eq. 25) since decoding very few texts is unlikely to uniquely identify an object, and decoding beyond a certain higher number of texts will not increase the probability of success.
- this shifted sigmoid function is sufficient for illustration purposes. The actual probability distribution can be established empirically from the system data instead.
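- A small sketch of how the text-modality score could be evaluated, assuming Equation 19 has the form R_n = r^n · P(q | t_1, ..., t_n) with the joint conditional probability approximated by the shifted sigmoid of FIG. 4A; under these assumptions the optimal text counts come out to six, four and two for accuracies 0.95, 0.75 and 0.5, matching FIGS. 4B-4D.

```python
import math

def text_performance_score(n_texts: int, r_text: float, inflection: float = 3.0) -> float:
    """Assumed form of Equation 19: r^n times a shifted-sigmoid joint probability."""
    p_joint = 1.0 / (1.0 + math.exp(-(n_texts - inflection)))   # FIG. 4A approximation
    return (r_text ** n_texts) * p_joint

def optimal_text_count(r_text: float, max_texts: int = 12) -> int:
    """Number of decoded texts that maximizes the score (cf. FIGS. 4B-4D)."""
    return max(range(1, max_texts + 1),
               key=lambda n: text_performance_score(n, r_text))
```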
- Color can also be used to determine the identity of an object. While color is not the most discriminative or predictive modality in object recognition, it can nevertheless serve to discriminate related objects that may otherwise have similar information in other modalities. For example, a retail package may be presented to the system with the barcode occluded or otherwise not visible to the cameras, but a text classifier might nevertheless be able to recognize the two largest texts on the package, namely "Snaxly" and "KETO" (see FIG. 5A).
- The system is then able to determine that the package is either the "Snaxly KETO cheddar biscuits" or the "Snaxly KETO cinnamon" package. By identifying the dominant colors of the object, or the dominant colors of other modalities inherent in the object, further discrimination and a more accurate recognition can be made.
- the color modality classification can be applied independently of other modalities, such as to the image of the entire object.
- Color can be defined as a {Cb, Cr} vector in the YCbCr color space. Dominant colors can be identified by running a clustering algorithm on the image using only the color Cb (blue vs. yellow) and Cr (red vs. green) channels output by the RGB-to-YCbCr Converter in Step 150 of the processing pipeline shown in FIG. 1B.
- the dominant colors in the first package are {blue, brown, white}, while the dominant colors in the second package are {brown, red, white}.
- the performance metric formulation is the same as that of the text modality, such that Equation 19 becomes:
- r c is the average accuracy of the color classifier
- n is the number of dominant colors
- c x is the detected color
- q is the object ID
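- A brief sketch of the dominant-color step, assuming k-means clustering over the (Cb, Cr) pixels of the object region; the number of clusters k is an illustrative choice rather than a value given in the text.

```python
import numpy as np
from sklearn.cluster import KMeans

def dominant_colors(cbcr: np.ndarray, k: int = 3) -> np.ndarray:
    """Return the k dominant (Cb, Cr) cluster centers of an H x W x 2 chroma image,
    ordered from most to least dominant by cluster size."""
    pixels = cbcr.reshape(-1, 2).astype(float)       # one (Cb, Cr) vector per pixel
    km = KMeans(n_clusters=k, n_init=10).fit(pixels)
    counts = np.bincount(km.labels_, minlength=k)
    return km.cluster_centers_[np.argsort(counts)[::-1]]
```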
- color classification can be applied in conjunction with other modalities, such as the detected texts on an object or a logo.
- the colors of the recognized text can be combined with texts as a multi-modality, text-color.
- the formulation of the text-color multi-modality’s performance number can simply be constituted by replacing the conditional probability in Equation 19 with the added conditions:
- To generalize the performance score to any combination of modalities in a single multi-modal classifier, Equation (29) becomes:
- n is the number of symbols being combined in the multi-modal classifier
- s_x is a uni-modal symbol
- r_x is the average accuracy of the uni-modal classifier for s_x
- q_k is the object ID
- R_n is the multi-modal performance score for classifying an attribute that is composed of n uni-modal symbols.
- For FIG. 5B, Equation 29 becomes:
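- The generalized form itself is missing from the text; by analogy with the per-modality formulation, it is presumably the product of the per-symbol accuracies times the joint conditional probability:

$$R_n = \left(\prod_{x=1}^{n} r_x\right) P(q_k \mid s_1, s_2, \ldots, s_n)$$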
- Weight is another modality that can be used to identify an object. Since there is at most one reading of an object’s weight, Equation 19 for the weight modality simply becomes:
- the location of the cart can serve as an additional modality.
- the system may have a location sensor 211 (FIG. 1C) that tracks the location of the shopping cart 212, which is then fed to the uni-modality/multi-modality classifier 200.
- the object meta-data 202 may include a map of the store along with the locations of the products within the store. From this data, the uni-modality/multi-modality classifier 200 can provide a list of candidate object IDs that are within proximity of the cart.
- a combined score can be computed by performing a weighted sum from each of the modalities in the multi-modal arbitrator 215.
- This arbitration score can be computed differently based on whether the prior probability distribution of each modality is known or not. If the prior probability distribution of a modality is known, then the arbitration score, s_q, for a specific object, q, can be defined as a weighted sum over the modalities, where r_j is the average accuracy of each uni-modal/multi-modal classifier, m_j is the modality of the classifier, n is the total number of modalities used, and w_j is the weight constant for each modality. The average accuracy of a classifier is an empirically measurable value. Then the multi-modal label output would simply be the label with the highest arbitration score.
- the multi-modal arbitrator 215 takes as input the object IDs for each modality used (200-1, 200-2, 200-3), and outputs a ranking of those object IDs in step 220. From this ranking, the system can identify the object, or, if the scores in the ranking do not meet a required threshold, the system can report that the object cannot be identified. It should be understood that the identification of an object may include identifying the label on an object, which would be useful in the sorting application of the system 10.
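- A compact sketch of the multi-modal arbitrator, assuming the weighted-sum form s_q = Σ_j w_j · r_j · P_j(q | labels) suggested by the description; the dictionary-based interface, the modality names, and the threshold value are illustrative, not part of the disclosure.

```python
def arbitrate(candidates_by_modality: dict, accuracies: dict, weights: dict,
              threshold: float = 0.5):
    """Rank candidate object IDs by a weighted sum of per-modality recognition values.

    candidates_by_modality maps a modality name to {object_id: probability}, where the
    probability is the conditional (or average) probability reported by that classifier.
    """
    scores = {}
    for modality, candidates in candidates_by_modality.items():
        for obj_id, prob in candidates.items():
            scores[obj_id] = scores.get(obj_id, 0.0) + (
                weights[modality] * accuracies[modality] * prob)
    ranking = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if not ranking or ranking[0][1] < threshold:
        return None, ranking        # object cannot be identified with enough confidence
    return ranking[0][0], ranking   # best object ID plus the full ranking


# Illustrative usage with made-up modality names and values:
best, ranked = arbitrate(
    {"barcode": {"sku-123": 1.0}, "text": {"sku-123": 0.8, "sku-456": 0.4}},
    accuracies={"barcode": 0.99, "text": 0.95},
    weights={"barcode": 1.0, "text": 0.5},
)
```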
- FIG. 6 illustrates an example of the system 10.
- the processor 100 is connected to several image sensors 105, each of which can take images from unique vantage points to better ensure that the images received by the processor 100 can indeed be decoded and labels can be identified.
- the system 10 may further include a light 102 that is activated by the processor 100 and further enhances the quality of the images.
- a proximity sensor 102-1 may be connected to the processor 100 to assist in triggering the illumination of the light 102.
- When the proximity sensor detects an object, the light 102 may be activated. This triggering function may also be achieved by coarse object detection on the images from the image sensor 105.
- the processor can instantaneously actuate the light 102, allowing the image sensors 105 to capture the well-illuminated images needed for object identification. Triggering the light 102 only when needed will assist in prolonging the battery life of the system 10.
- a scale 205 may be used to implement the weight modality.
- the different processes, such as the uni-modality/multi-modality classifier 200 and the multi-modal arbitrator 215, may be connected to a database 202/217.
- the system 10 may include a touch display 103 connected to the processor 100, that may report the identification of each item as it is processed and may further include a total tally of all the items and their prices for objects in the shopping cart.
- An audio/visual indicator 101, such as a speaker or a light such as a color LED, can be used to provide even more information to the customer.
- the speaker may sound a pleasant beep when the system identifies the object, but a buzzer sound if the system 10 cannot arbitrate the object within the required threshold.
- the color LED may glow green to indicate that the system was able to recognize the object, and red to indicate the system 10's inability to arbitrate the object, prompting the user to redo the action.
- an antenna 104 connected to a transceiver 104-1 and the processor 100 may be used to maintain data connectivity between the individual cart-based systems and each other, or with a global system arbitrator on premises or in the cloud. This connectivity can be used to update the machine-learning models on a near real-time basis, which increases the accuracy of the system as a whole.
- In FIG. 7, several shopping-cart-mounted systems 10 are in bilateral data communication with a central processor 15, as is a system mounted at a self-checkout stand 16.
- the data connectivity can also be used to report the final cart contents to the central processor 15, so that the central processor can charge the appropriate account and can update its inventory.
- the system 10 may be mounted adjacent to a conveyor system 17, which can then be used to sort packages/objects 18, and also used during the induction of packages into the sorting system. This is an improvement over current methods that utilize humans to position packages onto conveyor belts with the barcodes or labels facing in a particular direction and orientation. Static laser scanners can then read the barcodes/labels.
- the system 10 can read the barcodes/labels faster and from any orientation/direction in real time, which increases the sortation speed and minimizes labor costs.
- the system 10 (or at least the image sensor 105 of the system 10) may be mounted to the robotic sorting arm 19, as shown in FIG. 8. In this configuration, the sensor of the system 10 may be able to see more faces of the package/object, which can assist in more accurately arbitrating the identification of the object or its label.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- Image Analysis (AREA)
Abstract
A system for recognition of an object includes a visual sensor connected to a processor. The processor (a) receives an image from the sensor; (b) localizes the object within the image; (c) generates image coordinates that correspond to a region of interest within which the object has been localized; (d) analyzes the region of interest to identify object labels, the object labels being based on at least one modality; (e) for each modality, establishes an object identification candidate list based on the object labels; (f) for each candidate, calculates an arbitration score by calculating a recognition value based on the accuracy of the modality and either the conditional probability based on one or more of the identified object labels, or the average probability of the existence of one or more of the identified object labels; and (g) identifies the object based on the ranking of the arbitration scores.
Applications Claiming Priority (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US201962886965P | 2019-08-15 | 2019-08-15 | |
| US62/886,965 | 2019-08-15 | ||
| US202062960033P | 2020-01-12 | 2020-01-12 | |
| US62/960,033 | 2020-01-12 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2021030339A1 (fr) | 2021-02-18 |
Family
ID=74571190
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2020/045754 (WO2021030339A1, Ceased) | Method and system for efficient multimodal object recognition | 2019-08-15 | 2020-08-11 |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2021030339A1 (fr) |
- 2020-08-11: WO application PCT/US2020/045754 filed (published as WO2021030339A1 (fr)), status: Ceased
Patent Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US6124560A (en) * | 1996-11-04 | 2000-09-26 | National Recovery Technologies, Inc. | Teleoperated robotic sorting system |
| US20040074967A1 (en) * | 2002-10-10 | 2004-04-22 | Fujitsu Limited | Bar code recognizing method and decoding apparatus for bar code recognition |
| US20100217678A1 (en) * | 2009-02-09 | 2010-08-26 | Goncalves Luis F | Automatic learning in a merchandise checkout system with visual recognition |
| US20190057435A1 (en) * | 2016-02-26 | 2019-02-21 | Imagr Limited | System and methods for shopping in a physical store |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN117171699A (zh) * | 2023-08-04 | 2023-12-05 | 长城汽车股份有限公司 | 用户画像分类方法、装置、存储介质以及终端 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US12175686B2 (en) | Item identification using multiple cameras | |
| US12217441B2 (en) | Item location detection using homographies | |
| Winlock et al. | Toward real-time grocery detection for the visually impaired | |
| US20200151692A1 (en) | Systems and methods for training data generation for object identification and self-checkout anti-theft | |
| US12223710B2 (en) | Image cropping using depth information | |
| US20240395009A1 (en) | Reducing a search space for item identification using machine learning | |
| EP2490171B1 (fr) | Recherche d'image de personne utilisant une demande textuelle de vêtement | |
| US8548203B2 (en) | Sequential event detection from video | |
| US20250054303A1 (en) | Hand detection trigger for item identification | |
| US10824902B2 (en) | Mislabeled product detection | |
| US12229714B2 (en) | Determining dimensions of an item using point cloud information | |
| Choi et al. | Fast human detection for indoor mobile robots using depth images | |
| US12400206B2 (en) | Identifying barcode-to-product mismatches using point of sale devices and overhead cameras | |
| US10503961B2 (en) | Object recognition for bottom of basket detection using neural network | |
| US20240020857A1 (en) | System and method for identifying a second item based on an association with a first item | |
| CN116824705B (zh) | 智能购物车购物行为判别方法 | |
| US20210166037A1 (en) | Image recognition device for detecting a change of an object, image recognition method for detecting a change of an object, and image recognition system for detecting a change of an object | |
| US12340360B2 (en) | Systems and methods for item recognition | |
| Liu et al. | An ultra-fast human detection method for color-depth camera | |
| CN118097519A (zh) | 基于商品轨迹分析的智能购物车购物行为分析方法及系统 | |
| Selvam et al. | Batch normalization free rigorous feature flow neural network for grocery product recognition | |
| WO2021030339A1 (fr) | Procédé et système pour la reconnaissance efficace d'objet multimodal | |
| Tian et al. | Reliably detecting humans in crowded and dynamic environments using RGB-D camera | |
| Vaneeta et al. | Real-time object detection for an intelligent retail checkout system | |
| Melebari et al. | No Bag Left Behind |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 20853362 Country of ref document: EP Kind code of ref document: A1 |
|
| NENP | Non-entry into the national phase |
Ref country code: DE |
|
| 32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 25.05.2022) |
|
| 122 | Ep: pct application non-entry in european phase |
Ref document number: 20853362 Country of ref document: EP Kind code of ref document: A1 |