WO2024261115A1 - Control of a household robot (Commande d'un robot domestique) - Google Patents
Control of a household robot
- Publication number
- WO2024261115A1 (PCT/EP2024/067207, EP2024067207W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- household
- instruction
- classifier
- robot
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/20—Control system inputs
- G05D1/22—Command input arrangements
- G05D1/228—Command input arrangements located on-board unmanned vehicles
- G05D1/2285—Command input arrangements located on-board unmanned vehicles using voice or gesture commands
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/20—Control system inputs
- G05D1/24—Arrangements for determining position or orientation
- G05D1/243—Means capturing signals occurring naturally from the environment, e.g. ambient optical, acoustic, gravitational or magnetic signals
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/10—Terrestrial scenes
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/1822—Parsing for meaning understanding
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D2105/00—Specific applications of the controlled vehicles
- G05D2105/10—Specific applications of the controlled vehicles for cleaning, vacuuming or polishing
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D2107/00—Specific environments of the controlled vehicles
- G05D2107/40—Indoor domestic environment
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D2109/00—Types of controlled vehicles
- G05D2109/10—Land vehicles
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D2111/00—Details of signals used for control of position, course, altitude or attitude of land, water, air or space vehicles
- G05D2111/10—Optical signals
Definitions
- the present invention relates to the control of a household robot.
- the invention relates to the intelligent control of the household robot depending on a voice instruction from a user.
- a household robot is designed to perform a predetermined function in a household.
- the household robot can be designed to work on a floor area and preferably to clean it, for example by sweeping, wiping or vacuuming.
- the household is usually mapped and the household robot is controlled according to a predetermined strategy with respect to the created map data.
- different strategies are predetermined, for example to take into account the type of surface, to avoid a place that is dangerous for the household robot or to work on areas of the surface that are subject to different stresses at different times or in different ways.
- In order to make the household robot perform spontaneous cleaning at a predetermined location, the robot usually has to be transported to that location or manually controlled there. It has been proposed to use voice recognition to recognize a voice command and to control a household robot depending on the command.
- US 2021/0401255 A1 concerns a robot that can be trained using machine learning.
- US 9 983 592 B2 proposes a robot and a control method.
- One object underlying the present invention is to provide an improved technique for controlling a household robot by voice.
- the invention solves this problem by means of the subject matter of the independent claims. Subclaims give preferred embodiments.
- a method for controlling a household robot comprises steps of detecting a verbal instruction from a user, which includes a description of a position in the household; determining a position intended by the user based on the description of the position included in the instruction using a classifier; and controlling the household robot to the determined position.
- the classifier comprises a first encoder for providing a first encoding for a text and a second encoder for providing a second encoding for a view; the encoders are trained so that first and second encodings, which are determined for a text and a view of the household associated with the text, are as similar to one another as possible.
- the description of the position in the instruction can, for example, be formulated in natural language.
- a BART encoder such as that described in "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension" by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer (see https://doi.org/10.48550/arXiv.1910.13461) can be used.
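- purely as an illustration (and not as part of the claimed subject matter), a first encoding for an instruction text could be derived from a pretrained BART model roughly as sketched below; the model name "facebook/bart-base" and the mean pooling over the encoder token states are assumptions of this sketch.

```python
# Illustrative sketch only: deriving a text encoding (first encoding) with a
# pretrained BART model. Model name and mean pooling are assumptions.
import torch
from transformers import AutoTokenizer, BartModel

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
model = BartModel.from_pretrained("facebook/bart-base")

def encode_text(text: str) -> torch.Tensor:
    """Return a single vector for the given text."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the encoder token states of the last layer into one vector.
    return outputs.encoder_last_hidden_state.mean(dim=1).squeeze(0)

vector = encode_text("in front of the blue sofa")
print(vector.shape)  # torch.Size([768]) for bart-base
```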
- VQGAN: Vector Quantised Generative Adversarial Network
- the classifier is initially set up to use the encoders to encode similar texts and images as similarly as possible and different texts or images as differently as possible. Encoding is not necessarily limited to a predetermined number of classes. In a further embodiment, the classifier is set up to treat the determination of an encoding in the manner of a regression problem. An encoder can thus be formed that encodes a predetermined area of a view into an encoding, so that the encoding describes a position in the household.
- the classifier can in particular comprise a zero-shot classifier, which can assign a description or a term contained therein to an image even when this assignment is not included in the training data used to train the classifier.
- the classifier can predict such an assignment, so that the description can be recognized considerably more reliably.
- the verbal instruction can include a freer or more precise description of the position.
- a user can control the household robot more effectively using a verbal instruction.
- the hit rate when the household robot finds the intended position can be improved compared to conventional techniques.
- the user can formulate the instruction or description freely and does not have to adhere to a predetermined vocabulary or special machine-understandable semantics.
- the position can be determined more precisely than with known techniques so that the user can gain better control over the household robot. This can also enable the household robot to be used in a complex or large household.
- an encoder is designed to compress an input into a vector, wherein the vector has significantly fewer dimensions than the input.
- An encoder may comprise an autoencoder.
- the encoders are preferably implemented as an artificial neural network.
- an encoder may be designed as a large language model (LLM) so that it can create encodings based on speech or text.
- the encodings can each comprise a vector, with the encoders trained to maximize a vector product between corresponding encodings and minimize a vector product between non-corresponding encodings.
- This type of training can also be called “contrastive learning”.
- such a similarity between two vectors is also called cosine similarity; analogously, a measure of similarity can also be determined on the basis of a distance between the vectors in space.
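- purely as an illustration, the two measures mentioned above (cosine similarity and a distance between the vectors) can be computed as follows; the example vectors are hypothetical encodings.

```python
# Minimal sketch of the similarity measures discussed above.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: dot product of the normalized vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Distance-based measure: a small distance corresponds to a high similarity."""
    return float(np.linalg.norm(a - b))

text_enc = np.array([0.2, 0.9, 0.1])    # hypothetical first encoding
view_enc = np.array([0.25, 0.85, 0.0])  # hypothetical second encoding
print(cosine_similarity(text_enc, view_enc), euclidean_distance(text_enc, view_enc))
```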
- the encoders are trained on views of objects that are usually found in a household.
- the trained classifier may already have a certain amount of "world knowledge" on the basis of which location finding in a current household can be improved.
- Such an object may comprise a structural element of a household, for example a door, a window, a staircase, a threshold or a passage to another room.
- the object may also comprise a piece of furniture, in particular a seat, a piece of bedroom furniture, a piece of work furniture, an item of entertainment furniture or a piece of decorative furniture.
- the object can also include a home textile, for example a carpet, a runner, a cushion, a tablecloth, a curtain or a textile wall hanging.
- the home textile can also be combined with the furniture, for example in the form of a corduroy sofa or a fabric armchair.
- a text that is assigned to a view preferably relates to an object that is shown in the view.
- the text preferably does not include a designation in the sense of a "label" used in machine learning, but is unstructured.
- the text can be in the form of a caption or a short description.
- the text can be in natural language and provide additional information that goes beyond the description of the object. For example, if a furniture manufacturer's catalog is used as a source of learning data for the classifier, a text can include a possible use, a reference to an advantageous combination with another piece of furniture, or a suggested use. Using such information, the classifier can learn connections between objects that later allow it to make a statement about a text-image combination that was not directly included in the learning data.
- an object shown on a view can be determined whose second encoding is as similar as possible to a first encoding of the linguistic instruction.
- the method can be based on a composite, comprehensive view of the household.
- Second encodings can be determined for objects shown in the view.
- a recorded linguistic instruction or a description included in it, more specifically a term included in it, can be converted into a first encoding.
- objects - more precisely: image contents of the view - can be determined that have a maximum similarity to the linguistic instruction. In other words, for different areas of the view it can be determined how likely it is that an object outlined by the instruction or description is shown there.
- a position of this object can be used to control the household robot. If several objects are found whose second encoding is more similar to the first encoding than the predetermined threshold, the object with the greatest similarity can be selected, a query can be put to the user, or the objects can be used one after the other to control the household robot.
- the latter option can be used in particular if the description includes a vocabulary that represents an all-quantifier, for example "all" or "every". Such a vocabulary can also be recognized using conventional means.
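- a possible, non-binding selection logic for the case of several matches is sketched below; the threshold value and the word list for all-quantifiers are assumptions of the sketch.

```python
# Illustrative selection logic when several image regions exceed the similarity
# threshold; the threshold and quantifier words are assumptions.
ALL_QUANTIFIERS = {"all", "every", "each"}

def select_targets(candidates, description, threshold=0.3):
    """candidates: list of (similarity, position) pairs for regions of a view."""
    hits = [c for c in candidates if c[0] >= threshold]
    if not hits:
        return []                                   # no match: e.g. query the user
    hits.sort(key=lambda c: c[0], reverse=True)
    if any(word in ALL_QUANTIFIERS for word in description.lower().split()):
        return [pos for _, pos in hits]             # drive to all matches in turn
    return [hits[0][1]]                             # otherwise: best match only

print(select_targets([(0.8, (2.0, 1.5)), (0.5, (4.2, 0.3))], "every carpet"))
```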
- the position described or intended by the user can be determined on the basis of a position associated with the specific view.
- a number of views of the household are available, more preferably such that the views partially overlap so that a seamless composite view can be formed.
- For each view it can be known from which position in the household the view was taken.
- the position of an object shown in a view can be determined with respect to a position from which the view was taken.
- the position described can be the position from which the view was taken.
- the position described or intended by the user is determined on the basis of a location-relative statement relating to an object mentioned in the verbal instruction.
- the location-relative statement can include a local preposition such as "in front of”, “behind”, “on”, “under”, etc.
- the statement can be resolved depending on an orientation, whereby an orientation is usually set from the position from which the view was taken to the position at which the object is located. In particular, statements such as "left of" or "right of” can be correctly recognized.
- the specification may refer to a geometry of a room in which the object is located. This may be the case if the object is a feature or element of the room, for example a door or a window. But even information such as "in the middle" can be resolved in this way.
- composite information can be recognized that includes several positions that are related in a specific way. For example, a position "between" a first and a second position can be described in this way. A non-point-like location such as a walking route can also be described in this way.
- the classifier comprises a Contrastive Language-Image Pretraining (CLIP) classifier.
- One such classifier has been proposed by OpenAI; a description of the approach can be found in "Learning Transferable Visual Models From Natural Language Supervision" by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever, published on February 26, 2021. The entire disclosure of this article by Radford et al. is hereby incorporated into the present disclosure by reference.
- the classifier described in the present disclosure can be easily trained to be used to control a domestic robot in a household using the specified procedure of said article by Radford et al. To this end, it is proposed to train the classifier with information concerning a typical or assumed household, as described in more detail herein.
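- the following sketch merely illustrates how a pretrained CLIP model could score a natural-language description against candidate views of the household; the model name, the image file names and the use of logits_per_image as the similarity score are assumptions of the sketch, not a prescription of the present disclosure.

```python
# Illustrative use of a pretrained CLIP model to compare a description with
# candidate views. Model name and file names are assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

views = [Image.open(p) for p in ["view_kitchen.jpg", "view_livingroom.jpg"]]
inputs = processor(text=["a blue sofa"], images=views,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# One score per view for the single text prompt; the highest score marks the
# view most likely to show the described object.
scores = outputs.logits_per_image.squeeze(1)
print(scores.tolist(), int(scores.argmax()))
```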
- the household robot is further preferably controlled to carry out a predetermined function at the determined position. If the household robot is a floor cleaning robot, a predetermined cleaning function can be carried out.
- the cleaning function can include vacuuming, wiping, mopping or sweeping.
- the function is determined on the basis of the instruction.
- a verb can be determined from the instruction, which is contained in a list of predetermined, known verbs.
- the list can include all functions that can be fulfilled by the household robot and can be relatively short if the household robot is specialized in solving a certain problem, such as cleaning a floor area.
- the household robot can also have another function.
- the household robot can help to clean up the house.
- the robot can grasp and/or manipulate an object, in particular bring it to a predetermined location.
- the instruction can also specify an object that the robot should manipulate.
- the instruction can also include further details about the manipulation, for example the type (e.g. grasping, moving) or a target, such as where the object should be moved to.
- the instruction may be acoustic, so that a user or generally a person in the area of the household robot can simply provide the instruction by spoken language, wherein the acoustic instruction is converted into a textual representation.
- a known technique of speech-to-text conversion may be used.
- the conversion may be performed locally, in particular by the household appliance or by another household appliance in the same household, or a detected acoustic instruction may be sent to an external location for recognition or conversion and a result recorded.
- the external location may comprise a server or a service in a cloud.
- a keyword that begins an instruction may be recognized locally and a remaining instruction may alternatively be recognized locally or remotely.
- Conversion from speech to text may also be performed in a distributed manner using multiple processing devices. For this purpose, processing devices of multiple household appliances present in the household may be used.
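- as a non-binding sketch, detecting the keyword and converting a recorded utterance could look roughly as follows with the SpeechRecognition package; for brevity the keyword is checked on the already converted text, the audio file name is an assumption, and the cloud recognizer merely stands in for the external location discussed herein.

```python
# Illustrative speech-to-text step; keyword, audio source and recognizer
# backend are assumptions of this sketch.
import speech_recognition as sr

KEYWORD = "robi"
recognizer = sr.Recognizer()

def instruction_from_audio(wav_path):
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)
    # A remote recognition service stands in for the "external location".
    text = recognizer.recognize_google(audio, language="en-US")
    if text.lower().startswith(KEYWORD):
        return text[len(KEYWORD):].strip(" ,")  # instruction following the keyword
    return None                                 # keyword not detected

print(instruction_from_audio("command.wav"))
```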
- the household robot can be controlled to drive through the household and capture views of the household. This process can occur before a verbal instruction is evaluated in the manner described herein. It is preferred that the household robot drives through the household at least once and creates views in the process. As described, it is preferred that the views partially overlap each other so that all parts of the household are captured in one view.
- the views can be restricted to a predetermined perspective and/or a predetermined scanning angle. For example, a vertical scanning angle can be specified so that no view of an object that is above or below the scanning angle can be created.
- the household robot can already have a suitable scanning device, preferably a camera; alternatively also a radar sensor or a LiDAR sensor.
- the creation of the views can be done together with mapping the household.
- a map of the household can be created or updated using a SLAM (Simultaneous Localization and Mapping) process. If information is collected for mapping, information for the views can thus be collected at the same time. In one embodiment, the same samples can be used for mapping and for the views.
- An existing scan can be updated periodically. This can involve creating just one scan or multiple scans.
- a view of the household is captured while mapping or editing the household.
- a method for training a household robot which can be used to make the household robot more controllable using the method described here.
- the household robot is controlled so that it drives through the household and captures views of the household. All or selected captured views are converted into a second encoding using the second encoder described above.
- both encoders, i.e. the first and the second encoder, were previously trained with images and associated image descriptions in text form, so that the encodings of text and image (i.e. the first and second encodings) are very similar if they describe the same object.
- the data pairs required for this were either manually labeled or collected by browsing the Internet.
- the CLIP classifier relies in particular on browsing the Internet to collect training data and also refers to the encoding as embedding.
- the captured views can be selected, for example, using object recognition. If the system discovers a relevant object that has not yet been sufficiently studied or understood, particularly meaningful views of the object could be used for learning. For example, the household robot could ask the user: "Whenever I drive through the living room, there is always a big blue thing. Please take a look at the attached picture. What do you call that?"
- the classifier can be further improved or fine-tuned in order to be able to better recognize the semantic meaning of the images and to integrate it more effectively into the encodings. This can be done in particular with field data (i.e. views of the end user's real households) recorded by household robots that are already in use.
- the labels or text descriptions associated with the views can either be created by manual labelers or by feedback functions from the robot owners (i.e. the users) in the associated app.
- Initial encodings are created on the basis of this human feedback on the selected views.
- the classifier can then be adapted based on first and second encodings that belong to or relate to one another. In this way, the classifier is optimized specifically for the specific location of use, which can be quite typical for its region. These include, among other things, the type of data to be expected and the objects depicted in it: e.g. the perspective (usually from below), focal length, distortion, noise level, direction of view, the general recording quality of the robot sensors as well as the expected spatial conditions and the objects to be seen in them.
- this can also be used to provide classifiers that are optimized for a certain region. For example, an average household in China or Iran will look different from a typical household in Germany. It is particularly preferred that pairs of views and descriptions entered or at least confirmed by users and/or the corresponding associated first and second encodings are collected in a central data lake of the household appliance manufacturer in order to provide classifiers on their basis that become increasingly better as the number of household robots in the field increases.
- the invention therefore comprises a method for training a household robot in order to make it more controllable using the method described above.
- the training method can, for example, comprise the following steps: controlling the household robot so that it drives through the household and captures views of the household; converting all or selected captured views into second encodings by means of the second encoder; selecting views; obtaining human feedback in text form on the selected views; creating first encodings on the basis of this feedback; and adapting the classifier on the basis of first and second encodings that belong to one another.
- the views can be selected using object recognition and/or relevance determination, for example.
- the relevance determination can be carried out by machine (e.g. if the classifier is unsure what it is) and/or through user feedback.
- the classifier can be adjusted locally on the household robot and/or centrally by the home appliance manufacturer. For this purpose, collected data pairs of selected views and human feedback or the associated first and second encodings can be transferred to a server or cloud service of the home appliance manufacturer.
- a query can be processed as follows, for example. If, for example, a user's verbal instruction is used to search for a specific location or object, this verbal instruction is converted into a first encoding using the first encoder and then (to put it simply) compared with all previously recorded image embeddings, i.e. the second encodings. This can be done, for example, using the matrix described below. Alternatively or additionally, an implementation using an artificial neural network and/or a function approximator would be conceivable. The image embedding, i.e. the second encoding, with the shortest distance to the text embedding being searched for, i.e. the first encoding, is selected and the associated recording location is passed on to the robot navigation as a target.
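- a minimal sketch of this query path is given below; encode_text, the stored view records and navigate_to are assumed helpers that stand in for the first encoder, the previously recorded second encodings and the robot navigation.

```python
# Sketch of the query path described above; helper names are assumptions.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def resolve_instruction(instruction, records, encode_text, navigate_to):
    """records: list of dicts with 'embedding' (second encoding) and 'pose'."""
    query = encode_text(instruction)                  # first encoding of the text
    best = max(records, key=lambda r: cosine(query, r["embedding"]))
    navigate_to(best["pose"])                         # recording location as target
    return best
```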
- a control device for a household robot comprises a detection device for detecting a verbal instruction from a user, wherein the instruction comprises a description, in particular a natural language description, of a position in the household; a classifier; and a processing device for determining a position intended by the user based on the description of the position included in the instruction by means of the classifier; and for controlling the household robot to the determined position.
- the classifier comprises a first encoder for providing a first encoding for a text and a second encoder for providing a second encoding for a view; wherein the encoders are trained so that first and second encodings, which are determined for a text and a view of the household associated with the text, are as similar to one another as possible.
- the classifier can be formed or comprised by the processing device.
- the processing device is preferably designed to partially or completely carry out a method described herein.
- the processing device can be electronic and comprise, for example, an integrated circuit, a programmable logic module or a programmable microcomputer.
- the method can be implemented in the form of a configuration or as a computer program product with program code means for the processing device.
- the configuration or the computer program product can be stored on a computer-readable data carrier.
- Features or advantages of the method can be transferred to the device and/or vice versa. It should also be noted that some method steps relate to the inference phase and others to the training phase.
- a household robot comprises a control device described herein.
- the household robot can be set up to carry out a predetermined task in a household, for example cleaning or tidying up.
- the household robot can move through the household.
- the household robot comprises a floor cleaning robot that is set up to work on a floor surface.
- the floor cleaning robot preferably comprises a camera for optically scanning an environment and a microphone for detecting an acoustic verbal instruction.
- Figure 1 shows a system
- Figure 2 is a flow chart of a method
- Figure 3 shows the pre-training of an exemplary classifier
- Figure 4 shows the determination of a position using a classifier.
- FIG. 1 shows an exemplary system 100.
- In a household 105 there are a household robot 110 and an object 115.
- a user 120 of the household robot 110 is usually also present in the household 105.
- the household robot 110 comprises a control device 125 which is configured to receive a voice instruction from the user 120, to extract a description of a position from the instruction, wherein the position may relate to the object 115, to find the position in the household 105, to control the household robot 110 to the position and to control a predetermined function of the household robot 110 there.
- the control device 125 comprises a processing device 130 and a classifier 135, which can also be part of the processing device 130 or can be formed by the processing device 130. Furthermore, a preferably optical scanning device 140 and an acoustic scanning device 145 are provided.
- the optical scanning device 140 preferably comprises a camera, particularly preferably a color camera.
- the camera 140 can capture images in a color spectrum visible to humans or in another, in particular expanded, color spectrum.
- the camera 140 can also provide depth information, for example if it is a depth camera, a stereo camera, a ToF camera or a combination of an optical camera and a LiDAR sensor.
- the optics of the camera 140 has predetermined properties such as a focal length or an opening angle of a scanning area.
- the camera 140 is usually attached immovably to the household robot 110. In other embodiments, the camera 140 can be pivoted horizontally and/or vertically or a focal length of the camera 140 can be controllable.
- the acoustic scanning device 145 preferably comprises a microphone 145 or an arrangement of several microphones.
- the microphone 145 is designed to detect an acoustic utterance in the area of the household robot 110, which was usually uttered by the user 120.
- the user 120 can also be located outside the household 105; in this case, his voice can be transmitted to the household robot 110, for example by telephone or via a communication network.
- the detection of the vocal utterance of the user 120 can take place at a base station of the household robot 110 or at a communication device of the user 120, for example a smartphone.
- the vocal utterance can be transmitted to the household robot 110, preferably wirelessly.
- the processing device 130 can be connected to a local memory 150, which is set up to store graphic information.
- one or more views of the household 105 which are captured by the camera 140, can be stored in the local memory 150.
- the views can be in uncoded or coded form.
- a drive device can be accessed that is designed to move the household robot 110 in the household.
- the household robot 110 usually comprises at least one drive wheel that can roll on a surface.
- An optional wireless communication device 160 can be designed to communicate with a device that can be used for voice input.
- a location 165 external to the household robot 110 is typically located outside the household 105 and is configured to perform predetermined data processing.
- the external location 165 can perform speech-to-text conversion.
- an acoustic instruction can be transmitted to the external location 165; the external location 165 can use speech recognition to create a written text that represents the spoken text; and the written text can be transmitted to the household robot 110.
- the acoustic instruction can also be sent to the external location 165 from a device other than the household robot 110.
- the user 120 can use a mobile device (smartphone) to capture speech and acoustic data of the captured speech can also be transmitted from there directly to the external location 165.
- the conversion of the acoustic into textual data can take place in the mobile device.
- the mobile device can take on the function of the external location 165.
- the classifier 135 preferably comprises an artificial neural network (ANN) and is trained by means of contrastive learning to determine, on the basis of linguistic input data, a section in a view of the household 105 in which an object 115 referenced in the linguistic input data is located with the highest possible probability.
- Figure 2 shows a flow chart of a method 200 for controlling a household robot 110.
- the method 200 can be carried out by means of a system 100.
- part of the method 200 may be carried out by an external location 165.
- Such a part may in particular relate to preparatory processing in order to create or train a classifier 135.
- creating a classifier 135 may be complex, but a created classifier 135 may be easily used within a variety of control devices 125.
- An image preferably shows an object 115 that is typically found in a household 105.
- An example object 115 comprises a structural element of a typical household 105, i.e. practically a structural feature that physically determines the household 105, such as a wall, a window or a door.
- the object 115 can also comprise, for example, a piece of furniture, for example an armchair, a sofa, a chair, a cupboard or a table.
- the object 115 can comprise a home textile, for example a carpet, a floor covering, a cushion or a curtain.
- a text associated with an image usually has no fixed format.
- the text can name or describe an object 115 shown in an image.
- the text can include additional information about the object 115, for example an explanation of a non-visible part or feature, a possible use, a categorization or a reference to another object 115.
- the text can describe the image or the object 115, for example in the manner of a caption in a photo album or an explanation in a catalog or reference work. It is generally preferred that the texts do not exceed a predetermined length, for example about ten words or about 200 characters.
- the images and texts can be used as training material for a classifier 135.
- the classifier 135 can be created using contrast-based learning so that it assigns similar texts or terms to the same class and different texts or terms to different classes.
- a text or term can be assigned a first encoding, with texts or terms with similar content receiving first encodings that are as similar as possible and different texts or terms receiving first encodings that are as different as possible.
- different images can be assigned second encodings, with similar images or images of similar objects 115 receiving second encodings that are as similar as possible and different images or images of different objects 115 receiving second encodings that are as different as possible.
- a first encoder, which creates the first encodings, and a second encoder, which creates the second encodings, are trained in such a way that for an image of an object 115 to which a text or term is assigned, the first encoding is as close as possible to the second. For pairs of images and texts that are not assigned to each other, however, the first and second encodings are as different as possible.
- This type of encoder creation is also called contrast-based learning.
- a classifier 135 can be provided on the basis of the encoders.
- the classifier 135 is configured to determine a distance between a first encoding and a second encoding, wherein the distance represents a similarity.
- Steps 205 to 215 can be performed once to create a classifier 135, which can subsequently be used as often as desired and in as many environments as desired.
- the household robot 110 can be controlled to travel around a household 105 in which it is to be used subsequently. To do this, the household robot 110 can move in the household 105 in such a way that the boundaries of the household 105 can be detected on all sides. Images can be created in the area of the boundaries and within the household 105, for example using the camera 140. The position at which an image was captured and the orientation in which the household robot 110 was located when the image was scanned can be recorded for an image. This information can later make it possible to determine the position of an object 115 shown in an image in the household 105. Such a determination can evaluate depth information included in the image.
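- the following sketch merely illustrates how each captured image could be stored together with the pose of the household robot 110 so that positions of depicted objects 115 can later be derived; the field names and types are assumptions.

```python
# Illustrative capture record kept during the exploratory drive.
from dataclasses import dataclass
from typing import List, Optional
import numpy as np

@dataclass
class CapturedView:
    image: np.ndarray                   # H x W x 3 camera image
    position: tuple                     # (x, y) of the robot in the household map
    heading_deg: float                  # orientation of the robot when capturing
    depth: Optional[np.ndarray] = None  # optional per-pixel depth information

captured_views: List[CapturedView] = []

def record_view(image, position, heading_deg, depth=None):
    captured_views.append(CapturedView(image, position, heading_deg, depth))
```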
- one or more views of the household can be created based on one or more images.
- a view can be based on several images and thus cover a larger area than would be possible with one image alone.
- a view can optically cover a room, whereby the view can include optical information that can come from a large horizontal angle range in the manner of a panoramic shot, in the extreme case from a full circle of 360°.
- a view does not have to be length- or angle-accurate, but should allow the position of a depicted object 115 to be determined from its position on the image.
- several views can be combined in order to provide a combined view. Such a combined view may make it easier to locate an object 115 in the household 105.
- Steps 220 to 230 may be performed in a specific household 105 to provide as complete images of the household 105 as possible, on which an object 115 can be found that may be part of a verbal description of a position. Since typically not all objects 115 in the household 105 are immovable, steps 220 to 230 may be repeated at regular intervals to provide updated views. Optionally, a view may also be partially updated by combining it with information from a current image. It should be noted that scans of the household 105 may also come from another source. For example, the user 120 may also scan the household 105 with a separate camera, or the scan may be performed by a first household robot 110 and made available to a second household robot 110.
- the household robot 110 is networked with another household appliance in the household 105.
- a detection or report from the other household appliance can then be used to determine further information regarding the household 105. For example, if a dishwasher reports that it is being emptied, the household robot 110 can know where the dishwasher is and that tiles have been laid in front of it. The household robot 110 can therefore automatically navigate to that location and, for example, wipe in front of it.
- After pre-training the classifier 135 and providing scans of the household 105, the household robot 110 can be controlled on the basis of a verbal position indication.
- a verbal instruction to the household robot 110 can be recorded.
- the instruction can include a keyword, followed by a description of a position in the household and optionally a reference to an activity to be carried out there.
- the instruction can be: "Robi, vacuum in front of the blue sofa."
- the keyword ("Robi") can be evaluated locally and can be used to trigger a speech-to-text conversion of a following text. This conversion can take place in a step 240. The conversion can be carried out locally by the control device 125 or by the external location 165. The result of the conversion should then be available on the household robot 110.
- the instruction can be analyzed based on its textual representation.
- the instruction can be broken down into components and certain words can be recognized directly. For example, in the present example, it can be determined that the part relating to the "blue sofa" refers to an object that can be used to determine the position. The local preposition "in front of" can also be recognized.
- in a step 250, one or more words can be directly recognized that describe a position in more detail or refer to another position.
- a list of predetermined local prepositions can be available, with which words of the textual description are compared.
- a verb can be determined that can refer to an activity to be carried out by the household robot 110 at the described location.
- the instruction "vacuume” is directed, for example, to a floor cleaning robot 110 with a suction device.
- a description of an object 115 can be extracted from the instruction. In the given example, the description refers to a blue sofa 115.
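- purely by way of illustration, the decomposition of the textual instruction into verb, local preposition and object description could look roughly as follows; the word lists are assumptions and would in practice reflect the functions of the household robot 110.

```python
# Illustrative decomposition of the textual instruction; the verb and
# preposition lists are assumptions.
KNOWN_VERBS = {"vacuum", "wipe", "mop", "sweep"}
LOCAL_PREPOSITIONS = ("in front of", "behind", "next to", "between", "on", "under")

def parse_instruction(text):
    text = text.lower().rstrip(".!")
    verb = next((v for v in KNOWN_VERBS if v in text.split()), None)
    preposition = next((p for p in LOCAL_PREPOSITIONS if p in text), None)
    # What remains after verb and preposition is taken as the object description.
    description = text
    if verb:
        description = description.replace(verb, "", 1)
    if preposition:
        description = description.split(preposition, 1)[1]
    return verb, preposition, description.strip()

print(parse_instruction("vacuum in front of the blue sofa"))
# ('vacuum', 'in front of', 'the blue sofa')
```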
- the position of the object 115 on a view can be determined with respect to the position description, on the basis of the scans and using the trained classifier 135. For this purpose, it can be determined for a large number of representations on one or more views how similar the representation is to the location description.
- the similarity can be determined in the manner of a distance, more precisely in the manner of a vector product between two vectors that include a first encoding of the location description and a second encoding of the representation.
- a view of the household 105 may be displayed in false colors, with the color of each pixel representing a similarity of the representation in its region to the description.
- a region of the view may be determined in which a predetermined spatial cluster of similarities above a predetermined threshold exists.
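- the region scoring behind such a similarity map could be sketched as follows; crop size, stride, the threshold and the helper encode_region (standing in for the second encoder) are assumptions.

```python
# Sketch of the region scoring behind the false-colour similarity map.
import numpy as np

def similarity_map(view, text_enc, encode_region, crop=96, stride=48):
    """Similarities between the text encoding and second encodings of
    overlapping crops of the view (a numpy image array)."""
    h, w = view.shape[:2]
    rows = []
    for y in range(0, h - crop + 1, stride):
        row = []
        for x in range(0, w - crop + 1, stride):
            region_enc = encode_region(view[y:y + crop, x:x + crop])
            sim = np.dot(text_enc, region_enc) / (
                np.linalg.norm(text_enc) * np.linalg.norm(region_enc))
            row.append(sim)
        rows.append(row)
    return np.array(rows)

def best_region(sim_map, threshold=0.3):
    """Index of the strongest response, or None if nothing exceeds the threshold."""
    if sim_map.max() < threshold:
        return None
    return np.unravel_index(sim_map.argmax(), sim_map.shape)
```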
- a position described or intended by the user can be determined.
- a position of an object 115 shown in the area of the cluster can first be determined on the basis of the view. For example, with respect to a position and/or orientation that the household robot 110 assumed when scanning an image for the view, a position of the object 115 in the household 105 can then be determined.
- the described position can be determined more precisely with respect to the position of the object 115 based on a local preposition included in the description.
- a position of the "blue sofa" 115 in the household could therefore first be determined. Assuming a predetermined direction system, a position or an area in the household 105 that lies "in front of" the object 115 can then be determined.
- the direction system can be based on a current position and/or orientation of the user 120 or the household robot 110.
- a direction system can also be predetermined, for example with respect to a compass direction or a predetermined direction and/or position in the household 105.
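- resolving a local preposition such as "in front of" into a target point relative to the object 115 could, under the assumptions noted below, look roughly like this; the reference heading and the offset of 0.5 m are assumptions of the sketch.

```python
# Sketch of resolving "in front of" into a target point; the reference heading
# and the 0.5 m offset are assumptions.
import math

def position_in_front_of(object_xy, reference_heading_deg, offset_m=0.5):
    """Point offset_m in front of the object, where "front" is defined by the
    direction system, e.g. the heading from which the view was captured."""
    x, y = object_xy
    heading = math.radians(reference_heading_deg)
    # "In front of" the object, as seen from the reference direction, lies
    # between the observer and the object, i.e. back along the heading.
    return (x - offset_m * math.cos(heading), y - offset_m * math.sin(heading))

print(position_in_front_of((3.0, 2.0), reference_heading_deg=90.0))  # (3.0, 1.5)
```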
- the function to be performed by the household robot 110 can be determined. This determination can be made on the basis of a verb in the instruction recognized in step 250. If the household robot 110 is only set up to perform a single function, this function can be assumed in any case. It may also be sufficient to control the household robot 110 to the specific position, for example to await further instructions there.
- a recognized element of the instruction may be provided to the user 120 to ensure that the instruction was processed correctly. If more than one position was determined based on the instruction, the user 120 may be asked to select one or more of the positions. The method 200 may continue when the user 120 has confirmed the recognized element(s).
- the household robot 110 can be controlled to the specific position. This control can be based on a map of the environment of the household 105, which can be available from the control device 125.
- the movement of the household robot 110 can be used to capture an image of the household 105, which can be used to update or generate a view.
- the household robot 110 can be controlled to carry out the function determined in step 270 when it has reached the specific position.
- Figure 3 shows an illustration of a pre-training of an exemplary classifier 135 (cf. steps 210, 215 of the method 200 in Figure 2).
- Training data comprises a plurality of texts 305 and a plurality of images 310, wherein each text 305 is bijectively assigned to an image 310.
- a first encoder 315 creates first encodings 325 (T_i) for the texts 305 and a second encoder 320 creates second encodings 330 (l_i) for the images 310.
- Each encoding 325, 330 comprises a vector, so that a similarity or a distance between the vectors or encodings 325, 330 can be determined by forming a vector product. Such a metric is known as cosine similarity.
- the pre-training is carried out as contrastive learning, in which the encoders 315, 320 are successively changed so that they produce encodings 325, 330 that meet predetermined conditions. More precisely, the encoders 315, 320 are preferably implemented as artificial neural networks, wherein a learning goal when processing the texts 305 and images 310 is to determine first encodings 325 and second encodings 330 that are as similar to each other as possible when a text 305 is assigned to an image 310, and otherwise as different as possible.
- the distances (I_i · T_i) to be determined between the resulting first encodings 325 (T_1 ... T_N) and second encodings 330 (I_1 ... I_N) are shown in a matrix 335 in Figure 3.
- the main diagonal of the matrix 335 comprises combinations of first and second encodings 325, 330 that are assigned to one another, the remaining elements comprise combinations that cannot be traced back to an assignment of the output data 305, 310.
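- the learning objective over the matrix 335 corresponds to a contrastive (CLIP-style) objective; a sketch is given below, in which the batch composition, the temperature value and the use of a symmetric cross-entropy are assumptions of the sketch.

```python
# Sketch of the contrastive objective over the matrix 335: encodings that belong
# together lie on the main diagonal and are pushed towards high similarity, all
# other combinations towards low similarity. Temperature is an assumption.
import torch
import torch.nn.functional as F

def contrastive_loss(text_encs, image_encs, temperature=0.07):
    """text_encs, image_encs: (N, d) batches of first and second encodings,
    where row i of both tensors stems from the same text/image pair."""
    t = F.normalize(text_encs, dim=-1)
    i = F.normalize(image_encs, dim=-1)
    logits = i @ t.T / temperature             # matrix of scaled cosine similarities
    targets = torch.arange(len(logits))        # matching pairs on the main diagonal
    loss_images = F.cross_entropy(logits, targets)    # image -> text direction
    loss_texts = F.cross_entropy(logits.T, targets)   # text -> image direction
    return (loss_images + loss_texts) / 2
```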
- the pre-training can be considered complete when the described learning objective has been sufficiently well achieved.
- the pre-training can include a large number of texts 305 and associated images 310; the texts 305 and images 310 usually have to be encoded many times until the encoders 315 and 320 have the required properties.
- the encoders 315 and 320 can together form a classifier 135.
- Figure 4 shows the determination of a text 305 for an image 310 by means of a classifier 135, which comprises a first encoder 315 and a second encoder 320.
- a reverse procedure can be followed to determine an object 115 for a description.
- a second encoding 330 is created for the image 310, which is then compared with all known first encodings 325 of the texts of the training data.
- the comparison again involves forming a vector product and corresponds to determining a distance between the encodings 325, 330. Due to the way in which the encoders 315 and 320 are created, it can be concluded that there are similarities between the content shown in the image 310 and the content of the descriptions 305. The smaller the distance between two encodings 325, 330, the greater the similarity or correspondence between the respective associated contents.
- the first encoding 325 whose vector product (I_i · T_i) with the second encoding 330 indicates the greatest similarity, i.e. the smallest distance between the encodings, can be determined.
- it can be checked whether the similarity of the associated first and second encodings 325, 330 is above a corresponding threshold value or, equivalently, whether their distance is below a predetermined threshold value.
- the text 305 that is assigned to the first encoding 325 can be determined as the most probable text 305 that can be assigned to the image 310.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Radar, Positioning & Navigation (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Multimedia (AREA)
- Automation & Control Theory (AREA)
- Remote Sensing (AREA)
- Health & Medical Sciences (AREA)
- Aviation & Aerospace Engineering (AREA)
- Acoustics & Sound (AREA)
- Human Computer Interaction (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Manipulator (AREA)
Abstract
The present invention relates to a method (200) for controlling a household robot (110), comprising the steps of detecting a verbal instruction from a user, the instruction comprising a description of a position in the household; using a classifier (135) to determine a position intended by the user on the basis of the description of the position included in the instruction, the classifier (135) comprising a first encoder (315) for providing a first encoding (325) for a text and a second encoder (320) for providing a second encoding (330) for a view, and the encoders (315, 320) being trained such that first and second encodings (325, 330), which are determined for a text and a view of the household associated with the text, are as similar to one another as possible; and controlling the household robot (110) to the intended position.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| DE102023205876.6A DE102023205876B4 (de) | 2023-06-22 | 2023-06-22 | Steuern eines Haushaltsroboters |
| DE102023205876.6 | 2023-06-22 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024261115A1 (fr) | 2024-12-26 |
Family
ID=91663884
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/EP2024/067207 Pending WO2024261115A1 (fr) | 2023-06-22 | 2024-06-20 | Commande d'un robot domestique |
Country Status (2)
| Country | Link |
|---|---|
| DE (1) | DE102023205876B4 (fr) |
| WO (1) | WO2024261115A1 (fr) |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US9983592B2 (en) | 2013-04-23 | 2018-05-29 | Samsung Electronics Co., Ltd. | Moving robot, user terminal apparatus and control method thereof |
| US20190202062A1 (en) * | 2018-01-04 | 2019-07-04 | Samsung Electronics Co., Ltd. | Mobile home robot and controlling method of the mobile home robot |
| US20200097012A1 (en) * | 2018-09-20 | 2020-03-26 | Samsung Electronics Co., Ltd. | Cleaning robot and method for performing task thereof |
| US20210401255A1 (en) | 2019-06-14 | 2021-12-30 | Lg Electronics Inc. | Artificial intelligence robot and method of operating the same |
Family Cites Families (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| DE102018207588A1 (de) | 2018-05-16 | 2019-11-21 | BSH Hausgeräte GmbH | Erstellen einer Umgebungskarte |
| DE102022100849A1 (de) | 2021-01-15 | 2022-07-21 | RobArt GmbH | Situationsbewertung mittels objekterkennung in autonomen mobilen robotern |
- 2023-06-22: German application DE102023205876.6A filed (patent DE102023205876B4, active)
- 2024-06-20: international application PCT/EP2024/067207 filed (publication WO2024261115A1, pending)
Patent Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US9983592B2 (en) | 2013-04-23 | 2018-05-29 | Samsung Electronics Co., Ltd. | Moving robot, user terminal apparatus and control method thereof |
| US20190202062A1 (en) * | 2018-01-04 | 2019-07-04 | Samsung Electronics Co., Ltd. | Mobile home robot and controlling method of the mobile home robot |
| US20200097012A1 (en) * | 2018-09-20 | 2020-03-26 | Samsung Electronics Co., Ltd. | Cleaning robot and method for performing task thereof |
| US20210401255A1 (en) | 2019-06-14 | 2021-12-30 | Lg Electronics Inc. | Artificial intelligence robot and method of operating the same |
Non-Patent Citations (2)
| Title |
|---|
| ALEC RADFORD ET AL: "Learning Transferable Visual Models From Natural Language Supervision", 26 February 2021 (2021-02-26) |
| YAO LEWEI ET AL: "FILIP: FINE-GRAINED INTERACTIVE LANGUAGE-IMAGE PRE-TRAINING", 9 November 2021 (2021-11-09), XP093204838, Retrieved from the Internet <URL:https://arxiv.org/pdf/2111.07783> [retrieved on 20240913] * |
Also Published As
| Publication number | Publication date |
|---|---|
| DE102023205876A1 (de) | 2024-12-24 |
| DE102023205876B4 (de) | 2025-03-20 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 24735596; Country of ref document: EP; Kind code of ref document: A1 |