WO2024226946A2

WO2024226946A2 - Systems and methods for feature detection de-duplication and panoramic image generation

Info

Publication number: WO2024226946A2
Application number: PCT/US2024/026470
Authority: WO
Inventors: Charles DELEDALLE; Ian Pegg; Ashwin RAMANATHAN; Kirill PIOROZHENKO
Original assignee: Brain Corp
Current assignee: Brain Corp
Priority date: 2023-04-28
Filing date: 2024-04-26
Publication date: 2024-10-31
Anticipated expiration: 2025-10-28
Also published as: WO2024226946A3

Abstract

Systems and methods for feature detection de-duplication and panoramic image generation are disclosed herein. According to at least one non-limiting exemplary embodiment, multiple images of scenes captured by robots may often include overlap, thereby depicting individual features multiple times. In determining a count of those features present in the physical space, the individual detections must be de-duplicated. Accordingly, a depth mesh is employed to account for 3-dimensional geometry of the image scenes to perform more accurate image projection which thereby converges the multiple feature detections into a count which more accurately reflects the physical space.

Description

SYSTEMS AND METHODS FOR FEATURE DETECTION DE DUPLICATION AND PANORAMIC IMAGE GENERATION

Copyright

[0001 ] A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all copyright rights whatsoever.

Background

Technological Field

[0002] The present application relates generally to robotics, and more specifically to systems and methods for feature detection de-duplication and panoramic image generation.

Summary

[0003] The foregoing needs are satisfied by the present disclosure, which provides for, inter alia, systems and methods for feature detection de-duplication and panoramic image generation.

[0004] Exemplary embodiments described herein have innovative features, no single one of which is indispensable or solely responsible for their desirable attributes. Without limiting the scope of the claims, some of the advantageous features will now be summarized. One skilled in the art would appreciate that as used herein, the term robot may generally be referred to autonomous vehicle or object that travels a route, executes a task, or otherwise moves automatically upon executing or processing computer readable instructions.

[0005] According to at least one non-limiting exemplary embodiment, a robot and a method for generating counts of features sensed by the robot is disclosed. The robot comprises, inter alia, a memory comprising computer readable instructions stored thereon; and at least one processor configured to execute the computer readable instructions to receive a set of one or more images from the robot, wherein the set of images are each localized to a first set of corresponding camera locations; perform a bundle adjustment process to determine a depth of a plurality of key points in the set of images, the bundle adjustment process yielding a second set of camera locations; construct a mesh via the plurality of key points based in part on a depth of each key point of the plurality of key points; detect one or more features in the set of images; project one or more regions occupied by the one or more detected features onto the mesh using a camera projection matrix and the second set of camera locations; and determine a set of counts, the set of counts comprising a number of each of the one or

1

SUBSTITUTE SHEET (RULE 26) more features, wherein the counts are based on a total number of the respective one or more features projected to different locations on the mesh.

[0006] Further, wherein the at least one processor is further configured to execute the computer readable instructions to, discretize the mesh into at least one of a plurality of regions and a plurality of pixels; and determine a color value of at least one of the plurality of regions and pixels of the mesh by projecting pixel color values of the images onto the mesh from the second set of camera locations, wherein the determined color value of the at least one of the plurality of regions and pixels of the mesh is based on color values of all pixels projected thereon. Furthermore, wherein the at least one processor is further configured to execute the computer readable instructions to, project the pixel color values of at least one of the plurality of regions and pixels of the mesh onto a designated plane to produce an orthographic panoramic perspective, the projection comprising an orthogonal projection onto the designated plane.

[0007] Additionally, the at least one processor is further configured to execute the computer readable instructions to, darken one or more pixels of the orthographic panoramic perspective based on depth of the one or more pixels, wherein depth and darkness of the one or more pixels is directly related; determine one or more key point correspondences between the images when performing the bundle adjustment process, wherein determining one or more key point correspondences comprises identifying a first key point in a first image of the set of one or more images and identifying a second key point in a second image of the set of one or more images, wherein the first key point and the second key point key point depict a same feature; and determine the second set of camera locations based on an imagespace location of the first key point and an image-space location of the second key point using epipolar geometry, wherein the correspondences used by the bundle adjustment process are removed if the resulting second set of camera locations deviates from the first set of camera locations by greater than a threshold amount.

[0008] Furthermore, the at least one processor is further configured to execute the computer readable instructions to, determine that two or more feature detections overlap on the mesh following the projection; compare the overlap to a first threshold, wherein the overlap being greater than the first threshold resolves the two or more detections as a singular count; and compare the overlap to a second threshold, wherein the overlap being less than the second threshold resolves the two or more detections as two or more counts.

[0009] These and other objects, features, and characteristics of the present disclosure, as well as the methods of operation and functions of the related elements of structure and the combination of parts and economies of manufacture, will become more apparent upon consideration of the following description and the appended claims with reference to the accompanying drawings, all of which form

2

SUBSTITUTE SHEET (RULE 26) a part of this specification, wherein like reference numerals designate corresponding parts in the various figures. It is to be expressly understood, however, that the drawings are for the purpose of illustration and description only and are not intended as a definition of the limits of the disclosure. As used in the specification and in the claims, the singular form of “a”, “an”, and “the” include plural referents unless the context clearly dictates otherwise.

Brief Description of the Drawings

[0010] The disclosed aspects will hereinafter be described in conjunction with the appended drawings, provided to illustrate and not to limit the disclosed aspects, wherein like designations denote like elements.

[0011 ] FIG. 1 A is a functional block diagram of a robot in accordance with some embodiments of this disclosure.

[0012] FIG. IB is a functional block diagram of a controller or processor in accordance with some embodiments of this disclosure.

[0013] FIG. 2 is a functional block diagram of a server network coupled to a plurality of robots, devices, and data sources in accordance with some embodiments of this disclosure.

[0014] FIG. 3A(i-ii) depict a robot and a scanning module for use in imaging features within the environment of the robot, according to an exemplary embodiment.

[0015] FIG. 3A(iii) depicts a special purpose robot configured to image features within its environment, according to an exemplary embodiment.

[0016] FIG. 3B includes a plurality of overlapping images which encompass singular features to generate duplicate feature detections, according to an exemplary embodiment.

[0017] FIG. 4 is a functional block diagram of a system configured to ingest robot imagery and odometry and produce a feature report comprising counts of detected features, according to an exemplary embodiment.

[0018] FIG. 5A-B illustrate a bundle adjustment process, according to an exemplary embodiment.

[0019] FIG. 5C(i-ii) depict improper and proper correspondences between key points of two images, according to an exemplary embodiment.

[0020] FIG. 5D depicts a 3-dimensional mesh produced using depth information extracted from a bundle adjustment process, according to an exemplary embodiment.

[0021] FIG. 6 depicts projection of images onto a 3-dimensional mesh, de -duplication of feature detections using bounding box locations, and generation of an orthographic panoramic view,

3

SUBSTITUTE SHEET (RULE 26) according to an exemplary embodiment.

[0022] FIG. 7 is a process flow diagram illustrating a method for a processor of a server to generate a feature report, according to an exemplary embodiment.

Detailed Description

[0024] Currently, retail and commercial spaces often contain a large number of individual products, objects, or other features, wherein proper and up-to-date tracking of these features within the environment may be critical to operational efficiency. Contemporary solutions include, for example, human associates using hand-held scanners which communicate with inventory databases to track inventory /stock of a given item as the items are received into the environment. Similarly, point of sales scanners may also track inventory leaving the environment. There is, however, a missing link between these two endpoints: the sales floor. Although tracking items entering and leaving an environment has been solved using a plurality of contemporary solutions, there still exists a need for a human being to identify low/out of stock items and accordingly restock them on the sales floor which can be a difficult and time consuming task for large commercial environments with thousands of different products, each arranged in a specific manner on a display/shelf. Accordingly, there is a need in the art to automatically produce counts of items/features within an environment, localize such features to specific places in the environment, and to provide human associates with a readily understandable view of the item displays/sales floor to quickly make actionable decisions (e.g., identify and restock a low stock item, move a misplaced item, etc.).

[0025] Various aspects of the novel systems, apparatuses, and methods disclosed herein are described more fully hereinafter with reference to the accompanying drawings. This disclosure can, however, be embodied in many different forms and should not be constmed as limited to any specific structure or function presented throughout this disclosure. Rather, these aspects are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art. Based on the teachings herein, one skilled in the art would appreciate that the scope of the disclosure is intended to cover any aspect of the novel systems, apparatuses, and methods disclosed herein, whether implemented independently of, or combined with, any other aspect of the disclosure. For example, an apparatus may be implemented or a method may be practiced using any number of the aspects set forth herein. In addition, the scope of the disclosure is intended to cover such an apparatus or method that is practiced using other structure, functionality, or structure and functionality in addition to or other than the various aspects of the disclosure set forth herein. It should

4

SUBSTITUTE SHEET (RULE 26) be understood that any aspect disclosed herein may be implemented by one or more elements of a claim. [0026] Although particular aspects are described herein, many variations and permutations of these aspects fall within the scope of the disclosure. Although some benefits and advantages of the preferred aspects are mentioned, the scope of the disclosure is not intended to be limited to particular benefits, uses, and/or objectives. The detailed description and drawings are merely illustrative of the disclosure rather than limiting, the scope of the disclosure being defined by the appended claims and equivalents thereof.

[0027] The present disclosure provides for systems and methods for feature detection deduplication and panoramic image generation.

[0028] As used herein, a feature may comprise one or more numeric values (e.g., floating point, decimal, a tensor of values, etc.) characterizing an input from a sensor unit including, but not limited to, detection of an object, parameters of the object (e.g., size, shape color, orientation, edges, etc.), color values of pixels of an image, depth values of pixels of a depth image, brightness of an image, the image as a whole, changes of features overtime (e.g., velocity, trajectory, etc. of an object), sounds, spectral energy of a spectrum bandwidth, motor feedback (i.e., encoder values), sensor values (e.g., gyroscope, accelerometer, GPS, magnetometer, etc. readings), a binary categorical variable, an enumerated type, a character/string, or any other characteristic of a sensory input. The features discussed herein are ones designated by humans based on their desired level of abstraction. For instance, a shelf may comprise a feature of a store, items on that shelf could be a feature of the shelf and/or store, those items may further contain features such as notable shapes, patterns, text, etc. and so forth, wherein the systems and methods herein enable accurate counting and depiction of any of such features (i.e., the shelf, items on the shelf, aspects of those items, etc.) a user may desire.

[0029] As used herein, a feature detection, detecting features, or other ‘detections’ of features corresponds to any identification, sensing, or other detection of a feature, be that once or multiple times. For example, consider twenty images which all depict the same rock, that rock would therefore produce twenty detections of ‘rock’ even if there is only one rock feature present in the physical space. In other words, the number of “rock features detected” is twenty.

[0030] As used herein, a feature count, count of features, or other ‘counts’ refer to the number of items, features, objects, etc. present within an environment. Feature counts differ from feature detections in that feature counts refer to the number of those features physically present in physical space, whereas feature detections refer to the number of times the given feature was sensed using sensors (e.g., cameras). Continuing with the prior example, despite twenty images being taken of a rock and twenty feature detections of ‘rock’ being generated, only one rock is counted in the final count of the ‘rock’ feature. The present disclosure aims to transform multiple feature detections from

5

SUBSTITUTE SHEET (RULE 26) overlapping images into accurate feature counts which represent the state of the environment.

[0031] As used herein, a robot may include mechanical and/or virtual entities configured to carry out a complex series of tasks or actions autonomously. In some exemplary embodiments, robots may be machines that are guided and/or instructed by computer programs and/or electronic circuitry. In some exemplary embodiments, robots may include electro-mechanical components that are configured for navigation, where the robot may move from one location to another. Such robots may include autonomous and/or semi-autonomous cars, floor cleaners, rovers, drones, planes, boats, carts, trams, wheelchairs, industrial equipment, stocking machines, mobile platforms, personal transportation devices (e.g., hover boards, SEGWAY® vehicles, etc.), trailer movers, vehicles, and the like. Robots may also include any autonomous and/or semi-autonomous machine for transporting items, people, animals, cargo, freight, objects, luggage, and/or anything desirable from one location to another.

[0032] As used herein, network interfaces may include any signal, data, or software interface with a component, network, or process including, without limitation, those of the FireWire (e.g., FW400, FW800, FWS800T, FWS1600, FWS3200, etc.), universal serial bus (“USB”) (e.g., USB l.X, USB 2.0, USB 3.0, USB Type-C, etc.), Ethernet (e.g., 10/100, 10/100/1000 (Gigabit Ethernet), 10-Gig- E, etc.), multimedia over coax alliance technology (“MoCA”), Coaxsys (e.g., TVNET™), radio frequency tuner (e.g., in-band or OOB, cable modem, etc.), Wi-Fi (802.11), WiMAX (e.g., WiMAX (802.16)), PAN (e.g., PAN/802.15), cellular (e.g., 3G, 4G, or 5G including LTE/LTE-A/TD-LTE/TD- LTE, GSM, etc. variants thereof), IrDA families, etc. As used herein, Wi-Fi may include one or more oflEEE-Std. 802.11, variants oflEEE-Std. 802.11, standards related to IEEE- Std. 802.11 (e.g., 802.11 a/b/g/n/ac/ad/af/ah/ai/aj/aq/ax/ay), and/or other wireless standards.

[0033] As used herein, processor, microprocessor, and/or digital processor may include any type of digital processing device such as, without limitation, digital signal processors (“DSPs”), reduced instruction set computers (“RISC”), complex instruction set computers (“CISC”) processors, microprocessors, gate arrays (e.g., field programmable gate arrays (“FPGAs”)), programmable logic device (“PLDs”), reconfigurable computer fabrics (“RCFs”), array processors, secure microprocessors, and application-specific integrated circuits (“ASICs”). Such digital processors may be contained on a single unitary integrated circuit die or distributed across multiple components.

[0034] As used herein, computer program and/or software may include any sequence or human or machine cognizable steps which perform a function. Such computer program and/or software may be rendered in any programming language or environment including, for example, C/C++, C#, Fortran, COBOL, MATLAB™, PASCAL, GO, RUST, SCALA, Python, assembly language, markup languages (e.g., HTML, SGML, XML, VoXML), and the like, as well as object-oriented environments such as the Common Object Request Broker Architecture (“CORBA”), JAVA™ (including J2ME, Java

6

SUBSTITUTE SHEET (RULE 26) Beans, etc.), Binary Runtime Environment (e.g., “BREW”), and the like.

[0035] As used herein, connection, link, and/or wireless link may include a causal link between any two or more entities (whether physical or logical/virtual), which enables information exchange between the entities.

[0036] As used herein, computer and/or computing device may include, but are not limited to, personal computers (“PCs”) and minicomputers, whether desktop, laptop, or otherwise, mainframe computers, workstations, servers, personal digital assistants (“PDAs”), handheld computers, embedded computers, programmable logic devices, personal communicators, tablet computers, mobile devices, portable navigation aids, J2ME equipped devices, cellular telephones, smart phones, personal integrated communication or entertainment devices, and/or any other device capable of executing a set of instructions and processing an incoming data signal.

[0037] Detailed descriptions of the various embodiments of the system and methods of the disclosure are now provided. While many examples discussed herein may refer to specific exemplary embodiments, it will be appreciated that the described systems and methods contained herein are applicable to any kind of robot. Myriad other embodiments or uses for the technology described herein would be readily envisaged by those having ordinary skill in the art, given the contents of the present disclosure.

[0038] Advantageously, the systems and methods of this disclosure at least: (i) enable autonomous feature scanning in complex environments by improving feature count accuracy; (ii) improve efficiency of human associates by autonomously gathering product/feature information of a given environment; (iii) enable humans to rapidly identify and localize products/features they may wish to purchase, restock, or move; and (iv) ensure feature counts in reports are free from duplicate detections. Other advantages are readily discernable by one having ordinary skill in the art given the contents of the present disclosure.

[0039] FIG. 1A is a functional block diagram of a robot 102 in accordance with some principles of this disclosure. As illustrated in FIG. 1A, robot 102 may include controller 118, memory 120, user interface unit 112, sensor units 114, navigation units 106, actuator unit 108, operating system 110, and communications unit 116, as well as other components and subcomponents (e.g., some of which may not be illustrated). Although a specific embodiment is illustrated in FIG. 1 A, it is appreciated that the architecture may be varied in certain embodiments as would be readily apparent to one of ordinary skill given the contents of the present disclosure. As used herein, robot 102 may be representative at least in part of any robot described in this disclosure.

[0040] Controller 118 may control the various operations performed by robot 102. Controller 118 may include and/or comprise one or more processors (e.g., microprocessors) and other peripherals.

7

SUBSTITUTE SHEET (RULE 26) As previously mentioned and used herein, processor, microprocessor, and/or digital processor may include any type of digital processing device such as, without limitation, digital signal processors (“DSPs”), reduced instruction set computers (“RISC”), complex instruction set computers (“CISC”), microprocessors, gate arrays (e.g., field programmable gate arrays (“FPGAs”)), programmable logic device (“PLDs”), reconfigurable computer fabrics (“RCFs”), array processing devices, secure microprocessors and application-specific integrated circuits (“ASICs”). Peripherals may include hardware accelerators configured to perform a specific function using hardware elements such as, without limitation, encryption/description hardware, algebraic processors (e.g., tensor processing units, quadradic problem solvers, multipliers, etc.), data compressors, encoders, arithmetic logic units (“ALU”), and the like. Such digital processors may be contained on a single unitary integrated circuit die, or distributed across multiple components.

[0041] Controller 118 may be operatively and/or communicatively coupled to memory 120. Memory 120 may include any type of integrated circuit or other storage device configured to store digital data including, without limitation, read-only memory (“ROM”), random access memory (“RAM”), non-volatile random access memory (“NVRAM”), programmable read-only memory (“PROM”), electrically erasable programmable read-only memory (“EEPROM”), dynamic randomaccess memory (“DRAM”), Mobile DRAM, synchronous DRAM (“SDRAM”), double data rate SDRAM (“DDR/2 SDRAM”), extended data output (“EDO”) RAM, fast page mode RAM (“FPM”), reduced latency DRAM (“RLDRAM”), static RAM (“SRAM”), flash memory (e.g., NAND/NOR), memristor memory, pseudostatic RAM (“PSRAM”), etc. Memory 120 may provide computer-readable instructions and data to controller 118. For example, memory 120 may be a non-transitory, computer- readable storage apparatus and/or medium having a plurality of instructions stored thereon, the instructions being executable by a processing apparatus (e.g., controller 118) to operate robot 102. In some cases, the computer-readable instructions may be configured to, when executed by the processing apparatus, cause the processing apparatus to perform the various methods, features, and/or functionality described in this disclosure. Accordingly, controller 118 may perform logical and/or arithmetic operations based on program instructions stored within memory 120. In some cases, the instructions and/or data of memory 120 may be stored in a combination of hardware, some located locally within robot 102, and some located remote from robot 102 (e.g., in a cloud, server, network, etc.).

[0042] It should be readily apparent to one of ordinary skill in the art that a processor may be internal to or on board robot 102 and/or may be external to robot 102 and be communicatively coupled to controller 118 of robot 102 utilizing communication units 116 wherein the external processor may receive data from robot 102, process the data, and transmit computer-readable instructions back to controller 118. In at least one non-limiting exemplary embodiment, the processor may be on a remote

8

SUBSTITUTE SHEET (RULE 26) server (not shown).

[0043] In some exemplary embodiments, memory 120, shown in FIG. 1A, may store a library of sensor data. In some cases, the sensor data may be associated at least in part with objects and/or people. In exemplary embodiments, this library may include sensor data related to objects and/or people in different conditions, such as sensor data related to objects and/or people with different compositions (e.g., materials, reflective properties, molecular makeup, etc.), different lighting conditions, angles, sizes, distances, clarity (e g., blurred, obstructed/occluded, partially off frame, etc.), colors, surroundings, and/or other conditions. The sensor data in the library may be taken by a sensor (e.g., a sensor of sensor units 114 or any other sensor) and/or generated automatically, such as with a computer program that is configured to generate/ simulate (e.g., in a virtual world) library sensor data (e.g., which may generate/simulate these library data entirely digitally and/or beginning from actual sensor data) from different lighting conditions, angles, sizes, distances, clarity (e.g., blurred, obstructed/occluded, partially off frame, etc.), colors, surroundings, and/or other conditions. The number of images in the library may depend at least in part on one or more of the amount of available data, the variability of the surrounding environment in which robot 102 operates, the complexity of objects and/or people, the variability in appearance of objects, physical properties of robots, the characteristics of the sensors, and/or the amount of available storage space (e.g., in the library, memory 120, and/or local or remote storage). In exemplary embodiments, at least a portion of the library may be stored on a network (e.g., cloud, server, distributed network, etc.) and/or may not be stored completely within memory 120. As yet another exemplary embodiment, various robots (e.g., that are commonly associated, such as robots by a common manufacturer, user, network, etc.) may be networked so that data captured by individual robots are collectively shared with other robots. In such a fashion, these robots may be configured to learn and/or share sensor data in order to facilitate the ability to readily detect and/or identify errors and/or assist events.

[0044] Still referring to FIG. 1A, operative units 104 may be coupled to controller 118, or any other controller, to perform the various operations described in this disclosure. One, more, or none of the modules in operative units 104 may be included in some embodiments. Throughout this disclosure, reference may be to various controllers and/or processors. In some embodiments, a single controller (e.g., controller 118) may serve as the various controllers and/or processors described. In other embodiments different controllers and/or processors may be used, such as controllers and/or processors used particularly for one or more operative units 104. Controller 118 may send and/or receive signals, such as power signals, status signals, data signals, electrical signals, and/or any other desirable signals, including discrete and analog signals to operative units 104. Controller 118 may coordinate and/or manage operative units 104, and/or set timings (e.g., synchronously or asynchronously), turn off/on

9

SUBSTITUTE SHEET (RULE 26) control power budgets, receive/send network instructions and/or updates, update firmware, send interrogatory signals, receive and/or send statuses, and/or perform any operations for running features of robot 102.

[0045] Returning to FIG. 1A, operative units 104 may include various units that perform functions for robot 102. For example, operative units 104 includes at least navigation units 106, actuator units 108, operating system 110, user interface units 112, sensor units 114, and communication units 116. Operative units 104 may also comprise other units such as specifically configured task units (not shown) that provide the various functionality of robot 102. In exemplary embodiments, operative units 104 may be instantiated in software, hardware, or both software and hardware. For example, in some cases, units of operative units 104 may comprise computer implemented instructions executed by a controller. In exemplary embodiments, units of operative unit 104 may comprise hardcoded logic (e.g., ASICS). In exemplary embodiments, units of operative units 104 may comprise both computer- implemented instructions executed by a controller and hardcoded logic. Where operative units 104 are implemented in part in software, operative units 104 may include umts/modules of code configured to provide one or more functionalities.

[0046] In exemplary embodiments, navigation units 106 may include systems and methods that may computationally construct and update a map of an environment, localize robot 102 (e.g., find the position) in a map, and navigate robot 102 to/from destinations. The mapping may be performed by imposing data obtained in part by sensor units 114 into a computer-readable map representative at least in part of the environment. In exemplary embodiments, a map of an environment may be uploaded to robot 102 through user interface units 112, uploaded wirelessly or through wired connection, or taught to robot 102 by a user.

[0047] In exemplary embodiments, navigation units 106 may include components and/or software configured to provide directional instructions for robot 102 to navigate. Navigation units 106 may process maps, routes, and localization information generated by mapping and localization units, data from sensor units 114, and/or other operative units 104.

[0048] Still referring to FIG. 1A, actuator units 108 may include actuators such as electric motors, gas motors, driven magnet systems, solenoid/ratchet systems, piezoelectric systems (e.g., inchworm motors), magneto strictive elements, gesticulation, and/or any way of driving an actuator known in the art. By way of illustration, such actuators may actuate the wheels for robot 102 to navigate a route; navigate around obstacles; repose cameras and sensors, etc. According to exemplary embodiments, actuator unit 108 may include systems that allow movement of robot 102, such as motorize propulsion. For example, motorized propulsion may move robot 102 in a forward or backward direction, and/or be used at least in part in turning robot 102 (e.g., left, right, and/or any other direction).

10

SUBSTITUTE SHEET (RULE 26) By way of illustration, actuator unit 108 may control if robot 102 is moving or is stopped and/or allow robot 102 to navigate from one location to another location.

[0049] Actuator unit 108 may also include any sy stem used for actuating and, in some cases actuating task units to perform tasks. For example, actuator unit 108 may include driven magnet systems, motors/engines (e.g., electric motors, combustion engines, steam engines, and/or any type of motor/engine known in the art), solenoid/ratchet system, piezoelectric system (e g., an inchworm motor), magneto strictive elements, gesticulation, and/or any actuator known in the art.

[0050] According to exemplary embodiments, sensor units 114 may comprise systems and/or methods that may detect characteristics within and/or around robot 102. Sensor units 114 may comprise a plurality and/or a combination of sensors. Sensor units 114 may include sensors that are internal to robot 102 or external to robot 102, and/or have components that are partially internal and/or partially external to robot 102. In some cases, sensor units 114 may include one or more exteroceptive sensors, such as sonars, light detection and ranging (“LiDAR”) sensors, radars, lasers, cameras (including video cameras (e.g., red-blue -green (“RBG”) cameras, infrared cameras, three-dimensional ("3D") cameras, thermal cameras, etc.), time of flight (“ToF”) cameras, structured light cameras, etc.), antennas, motion detectors, microphones, and/or any other sensor known in the art. According to some exemplary embodiments, sensor units 114 may collect raw measurements (e.g., currents, voltages, resistances, gate logic, etc.) and/or transformed measurements (e.g., distances, angles, detected points in obstacles, etc.). In some cases, measurements may be aggregated and/or summarized. Sensor units 114 may generate data based at least in part on distance or height measurements. Such data may be stored in data structures, such as matrices, arrays, queues, lists, stacks, bags, etc.

[0051 ] According to exemplary embodiments, sensor units 114 may include sensors that may measure internal characteristics of robot 102. For example, sensor units 114 may measure temperature, power levels, statuses, and/or any characteristic of robot 102. In some cases, sensor units 114 may be configured to determine the odometry of robot 102. For example, sensor units 114 may include proprioceptive sensors, which may comprise sensors such as accelerometers, inertial measurement units (“IMU”), odometers, gyroscopes, speedometers, cameras (e.g. using visual odometry), clock/timer, and the like. Odometry may facilitate autonomous navigation and/or autonomous actions of robot 102. This odometry may include robot 102’s position (e.g., where position may include robot’s location, displacement and/or orientation, and may sometimes be interchangeable with the term pose as used herein) relative to the initial location. Such data may be stored in data structures, such as matrices, arrays, queues, lists, stacks, bags, etc. According to exemplary embodiments, the data structure of the sensor data may be called an image.

11

SUBSTITUTE SHEET (RULE 26) [0052] According to exemplary embodiments, sensor units 114 may be in part external to the robot 102 and coupled to communications units 116. For example, a security camera within an environment of a robot 102 may provide a controller 118 of the robot 102 with a video feed via wired or wireless communication channel(s). In some instances, sensor units 114 may include sensors configured to detect a presence of an object at a location such as, for example and without limitation, a pressure or motion sensor may be disposed at a shopping cart storage location of a grocery store, wherein the controller 118 of the robot 102 may utilize data from the pressure or motion sensor to determine if the robot 102 should retrieve more shopping carts for customers.

[0053] According to exemplary embodiments, user interface units 112 may be configured to enable a user to interact with robot 102. For example, user interface units 112 may include touch panels, buttons, keypads/keyboards, ports (e.g., universal serial bus (“USB”), digital visual interface (“DVI”), Display Port, E-Sata, Firewire, PS/2, Serial, VGA, SCSI, audioport, high-defmition multimedia interface (“HDMI”), personal computer memory card international association (“PCMCIA”) ports, memory card ports (e.g., secure digital (“SD”) and miniSD), and/or ports for computer-readable medium), mice, rollerballs, consoles, vibrators, audio transducers, and/or any interface for a user to input and/or receive data and/or commands, whether coupled wirelessly or through wires. Users may interact through voice commands or gestures. User interface units 112 may include a display, such as, without limitation, liquid crystal display (“ECDs”), light-emitting diode (“LED”) displays, LED LCD displays, in-plane -switching (“IPS”) displays, cathode ray tubes, plasma displays, high definition (“HD”) panels, 4K displays, retina displays, organic LED displays, touchscreens, surfaces, canvases, and/or any displays, televisions, monitors, panels, and/or devices known in the art for visual presentation. According to exemplary embodiments user interface units 112 may be positioned on a body of robot 102. According to exemplary embodiments, user interface units 112 may be positioned away from the body of robot 102 but may be communicatively coupled to robot 102 (e.g., via communication units including transmitters, receivers, and/or transceivers) directly or indirectly (e.g., through a network, server, and/or a cloud). According to exemplary embodiments, user interface units 112 may include one or more projections of images on a surface (e.g., the floor) proximally located to the robot, e.g., to provide information to the occupant or to people around the robot. The information could be the direction of future movement of the robot, such as an indication of moving forward, left, right, back, at an angle, and/or any other direction. In some cases, such information may utilize arrows, colors, symbols, etc.

[0054] According to exemplary embodiments, communications unit 116 may include one or more receivers, transmitters, and/or transceivers. Communications unit 116 may be configured to send/receive a transmission protocol, such as BLUETOOTH®, ZIGBEE®, Wi-Fi, induction wireless

12

SUBSTITUTE SHEET (RULE 26) data transmission, radio frequencies, radio transmission, radio-frequency identification (“RFID”), nearfield communication (“NFC”), infrared, network interfaces, cellular technologies such as 3G (3.5G, 3.75G, 3GPP/3GPP2/HSPA+), 4G (4GPP/4GPP2/LTE/LTE-TDD/LTE-FDD), 5G (5GPP/5GPP2), or 5G LTE (long-term evolution, and variants thereof including LTE-A, LTE-U, LTE-A Pro, etc.), highspeed downlink packet access (“HSDPA”), high-speed uplink packet access (“HSUPA”), time division multiple access (“TDMA”), code division multiple access (“CDMA”) (e.g., IS-95A, wideband code division multiple access (“WCDMA”), etc.), frequency hopping spread spectrum (“FHSS”), direct sequence spread spectrum (“DSSS”), global system for mobile communication (“GSM”), Personal Area Network (“PAN”) (e.g., PAN/802.15), worldwide interoperability for microwave access (“WiMAX”), 802.20, long term evolution (“LTE”) (e.g., LTE/LTE-A), time division LTE (“TD- LTE”), global system for mobile communication (“GSM”), narrowband/frequency-division multiple access (“FDMA”), orthogonal frequency-division multiplexing (“OFDM”), analog cellular, cellular digital packet data (“CDPD”), satellite systems, millimeter wave or microwave systems, acoustic, infrared (e.g., infrared data association (“IrDA”)), and/or any other form of wireless data transmission. [0055] Communications unit 116 may also be configured to send/receive signals utilizing a transmission protocol over wired connections, such as any cable that has a signal line and ground. For example, such cables may include Ethernet cables, coaxial cables, Universal Serial Bus (“USB”), FireWire, and/or any connection known in the art. Such protocols may be used by communications unit 116 to communicate to external systems, such as computers, smart phones, tablets, data capture systems, mobile telecommunications networks, clouds, servers, or the like. Communications unit 116 may be configured to send and receive signals comprising numbers, letters, alphanumeric characters, and/or symbols. In some cases, signals may be encrypted, using algorithms such as 128-bit or 256-bit keys and/or other encryption algorithms complying with standards such as the Advanced Encryption Standard (“AES”), RSA, Data Encryption Standard (“DES”), Triple DES, and the like. Communications unit 116 may be configured to send and receive statuses, commands, and other data/information. For example, communications unit 116 may communicate with a user operator to allow the user to control robot 102. Communications unit 116 may communicate with a server/network (e.g., a network) in order to allow robot 102 to send data, statuses, commands, and other communications to the server. The server may also be communicatively coupled to computer(s) and/or device(s) that may be used to monitor and/or control robot 102 remotely. Communications unit 116 may also receive updates (e.g., firmware or data updates), data, statuses, commands, and other communications from a server for robot 102.

[0056] In exemplary embodiments, operating system 110 may be configured to manage memory 120, controller 118, power supply 122, modules in operative units 104, and/or any software,

13

SUBSTITUTE SHEET (RULE 26) hardware, and/or features of robot 102. For example, and without limitation, operating system 110 may include device drivers to manage hardware recourses for robot 102.

[0057] In exemplary embodiments, power supply 122 may include one or more batteries, including, without limitation, lithium, lithium ion, nickel-cadmium, nickel-metal hydride, nickelhydrogen, carbon-zinc, silver-oxide, zinc-carbon, zinc-air, mercury oxide, alkaline, or any other type of battery known in the art. Certain batteries may be rechargeable, such as wirelessly (e.g., by resonant circuit and/or a resonant tank circuit) and/or plugging into an external power source. Power supply 122 may also be any supplier of energy, including wall sockets and electronic devices that convert solar, wind, water, nuclear, hydrogen, gasoline, natural gas, fossil fuels, mechanical energy, steam, and/or any power source into electricity.

[0058] One or more of the units described with respect to FIG. 1A (including memory' 120, controller 118, sensor units 114, user interface unit 112, actuator unit 108, communications unit 116, navigation unit 106, and/or other units) may be integrated onto robot 102, such as in an integrated system. However, according to some exemplary embodiments, one or more of these units may be part of an attachable module. This module may be attached to an existing apparatus to automate so that it behaves as a robot. Accordingly, the features described in this disclosure with reference to robot 102 may be instantiated in a module that may be attached to an existing apparatus and/or integrated onto robot 102 in an integrated system. Moreover, in some cases, a person having ordinary skill in the art would appreciate from the contents of this disclosure that at least a portion of the features described in this disclosure may also be run remotely, such as in a cloud, network, and/or server.

[0059] As used herein, a robot 102, a controller 118, or any other controller, processor, or robot performing a task, operation or transformation illustrated in the figures below comprises a controller executing computer readable instructions stored on a non-transitory computer readable storage apparatus, such as memory 120, as would be appreciated by one skilled in the art.

[0060] Next referring to FIG. IB, the architecture of a processor or processing device 138 is illustrated according to an exemplary embodiment. As illustrated in FIG. IB, the processing device 138 includes a data bus 128, a receiver 126, a transmitter 134, at least one processor 130, and a memory 132. The receiver 126, the processor 130 and the transmitter 134 all communicate with each other via the data bus 128. The processor 130 is configurable to access the memory 132 which stores computer code or computer readable instructions in order for the processor 130 to execute the specialized algorithms. As illustrated in FIG. IB, memory 132 may comprise some, none, different, or all of the features of memory 120 previously illustrated in FIG. 1A. The algorithms executed by the processor 130 are discussed in further detail below. The receiver 126 as shown in FIG. IB is configurable to receive input signals 124. The input signals 124 may comprise signals from a plurality of operative

14

SUBSTITUTE SHEET (RULE 26) units 104 illustrated in FIG. 1A including, but not limited to, sensor data from sensor units 114, user inputs, motor feedback, external communication signals (e.g., from a remote server), and/or any other signal from an operative unit 104 requiring further processing. The receiver 126 communicates these received signals to the processor 130 via the data bus 128. As one skilled in the art would appreciate, the data bus 128 is the means of communication between the different components — receiver, processor, and transmitter — in the processing device. The processor 130 executes the algorithms, as discussed below, by accessing specialized computer-readable instructions from the memory 132. Further detailed description as to the processor 130 executing the specialized algorithms in receiving, processing and transmitting of these signals is discussed above with respect to FIG. 1A. The memory 132 is a storage medium for storing computer code or instructions. The storage medium may include optical memory (e.g., CD, DVD, HD-DVD, Blu-Ray Disc, etc.), semiconductor memory (e.g., RAM, EPROM, EEPROM, etc.), and/or magnetic memory' (e.g., hard-disk drive, floppy-disk drive, tape drive, MRAM, etc.), among others. Storage medium may include volatile, nonvolatile, dynamic, static, read/write, read-only, random-access, sequential-access, location-addressable, file-addressable, and/or content-addressable devices. The processor 130 may communicate output signals to transmitter 134 via data bus 128 as illustrated. The transmitter 134 may be configurable to further communicate the output signals to a plurality of operative units 104 illustrated by signal output 136.

[0061] One of ordinary skill in the art would appreciate that the architecture illustrated in FIG. IB may illustrate an external server architecture configurable to effectuate the control of a robotic apparatus from a remote location. That is, the server may also include a data bus, a receiver, a transmitter, a processor, and a memory that stores specialized computer readable instructions thereon. [0062] One of ordinary skill in the art would appreciate that a controller 118 of a robot 102 may include one ormore processing devices 138 and may further include other peripheral devices used for processing information, such as ASICS, DPS, proportional-integral-derivative (“PID”) controllers, hardware accelerators (e.g., encryption/decryption hardware), and/or other peripherals (e.g., analog to digital converters) described above in FIG. 1A. The other peripheral devices when instantiated in hardware are commonly used within the art to accelerate specific tasks (e.g., multiplication, encryption, etc.) which may alternatively be performed using the system architecture of FIG. IB. In some instances, peripheral devices are used as a means for intercommunication between the controller 118 and operative units 104 (e.g., digital to analog converters and/or amplifiers for producing actuator signals). Accordingly, as used herein, the controller 118 executing computer readable instructions to perform a function may include one or more processing devices 138 thereof executing computer readable instructions and, in some instances, the use of any hardware peripherals known within the art. Controller 118 may be illustrative of various processing devices 138 and peripherals integrated into a single circuit

15

SUBSTITUTE SHEET (RULE 26) die or distributed to various locations of the robot 102 which receive, process, and output information to/from operative units 104 of the robot 102 to effectuate control of the robot 102 in accordance with instructions stored in a memory 120, 132. For example, controller 118 may include a plurality of processing devices 138 for performing high level tasks (e.g., planning a route to avoid obstacles) and processing devices 138 for performing low-level tasks (e.g., producing actuator signals in accordance with the route).

[0063] FIG. 2 illustrates a server 202 and communicatively coupled components thereof in accordance with some exemplary embodiments of this disclosure. The server 202 may comprise one or more processing units depicted in FIG. IB above, each processing unit comprising at least one processor 130 and memory 132 therein in addition to, without limitation, any other components illustrated in FIG. IB. The processing units may be centralized at a location or distributed among a plurality of devices (e.g., a cloud server). Communication links between the server 202 and coupled devices may comprise wireless and/or wired communications, wherein the server 202 may further comprise one or more coupled antenna to effectuate the wireless communication. The server 202 may be coupled to a host 204, wherein the host 204 may correspond to a high-level entity (e.g., an admin) of the server 202. The host 204 may, for example, upload software and/or firmware updates for the server 202 and/or coupled devices 208, connect or disconnect devices 208 to the server 202, or otherwise control operations of the server 202. External data sources 206 may comprise any publicly available data sources (e.g., public databases such as weather data from the national oceanic and atmospheric administration (NOAA), satellite topology data, public records, etc.) and/or any other databases (e.g., private databases with paid or restricted access) of which the server 202 may access data therein. Devices 208 may comprise any device configured to perform a task at an edge of the server 202. These devices may include, without limitation, internet of things (loT) devices (e.g., stationary CCTV cameras, smart locks, smart thermostats, etc.), external processors (e.g., external CPUs or GPUs), and/or external memories configured to receive and execute a sequence of computer readable instructions, which may be provided at least in part by the server 202, and/or store large amounts of data.

[0064] Lastly, the server 202 may be coupled to a plurality of robot networks 210, each robot network 210 comprising a local network of at least one robot 102. Each separate network 210 may comprise one or more robots 102 operating within separate environments from each other. An environment may comprise, for example, a section of a building (e.g., a floor or room) or any space in which the robots 102 operate. Each robot network 210 may comprise a different number of robots 102 and/or may comprise different types of robot 102. For example, network 210-2 may comprise a scrubber robot 102, vacuum robot 102, and a gripper arm robot 102, whereas network 210-1 may only

16

SUBSTITUTE SHEET (RULE 26) comprise a robotic wheelchair, wherein network 210-2 may operate within a retail store while network 210-1 may operate in a home of an owner of the robotic wheelchair or a hospital. Each robot network 210 may communicate data including, but not limited to, sensor data (e.g., RGB images captured, LiDAR scan points, network signal strength data from sensors 202, etc.), IMU data, navigation and route data (e.g., which routes were navigated), localization data of objects within each respective environment, and metadata associated with the sensor, IMU, navigation, and localization data. Each robot 102 within each network 210 may receive communication from the server 202 including, but not limited to, a command to navigate to a specified area, a command to perform a specified task, a request to collect a specified set of data, a sequence of computer readable instructions to be executed on respective controllers 118 of the robots 102, software updates, and/or firmware updates. One skilled in the art may appreciate that a server 202 may be further coupled to additional relays and/or routers to effectuate communication between the host 204, external data sources 206, devices 208, and robot networks 210 which have been omitted for clarity. It is further appreciated that a server 202 may not exist as a single hardware entity, rather may be illustrative of a distributed network of non-transitory memories and processors.

[0065] According to at least one non-limiting exemplary embodiment, each robot network 210 may comprise additional processing units as depicted in FIG. IB above and act as a relay between individual robots 102 within each robot network 210 and the server 202. For example, each robot network 210 may represent a plurality of robots 102 coupled to a single Wi-Fi signal, wherein the robot network 210 may comprise in part a router or relay configurable to communicate data to and from the individual robots 102 and server 202. That is, each individual robot 102 is not limited to being directly coupled to the server 202, external data source 206, and devices 208.

[0066] One skilled in the art may appreciate that any determination or calculation described herein may comprise one or more processors of the server 202, devices 208, and/or robots 102 of networks 210 performing the determination or calculation by executing computer readable instructions. The instructions may be executed by a processor of the server 202 and/or may be communicated to robot networks 210 and/or devices 208 for execution on their respective controllers/processors in part or in entirety (e.g., a robot 102 may calculate a coverage map using measurements collected by itself or another robot 102). Advantageously, use of a centralized server 202 may enhance a speed at which parameters may be measured, analyzed, and/or calculated by executing the calculations (i.e., computer readable instructions) on a distributed network of processors on robots 102 and devices 208. Use of a distributed network of controllers 118 of robots 102 may further enhance functionality of the robots 102 as the robots 102 may execute instructions on their respective controllers 118 during times when the robots 102 are not in use by operators of the robots 102.

17

SUBSTITUTE SHEET (RULE 26) [0067] FIG. 3A(i) depicts a robot 102 configured with a special purpose modular attachment 302 configured to capture high quality images of its environment, according to an exemplary embodiment. The robot 102 in this example is a robotic floor scrubber configured to clean and scrub floors beneath itself as it navigates, however it is appreciated that the specific task of the base robot 102 is not limiting. For instance, the robot 102 could alternatively be an item transport robot, a vacuum, a security robot, a personal assistant robot, or any other ground-navigating robot 102 capable of supporting the module 302. To scan for features, the robot 102 drives as it captures images from its right-hand side, and/or left-hand side in alternative embodiments. Effectively, the robot 102 swipes the camera 306 array horizontally across objects to be scanned for features as the cameras 306 capture images.

[0068] The module 302 is shown separate from the robot 102 in FIG. 3 A(ii), according to an exemplary embodiment. The scanning module 302 may contain a connection interface 310 comprising mechanical connectors 312, 314 for securing the module 302 to the robot 102 as well as electrical/data connectors (obscured from view) to ensure communication between a processor of the module 302 and the controller 118 of the robot 102. The processor of the module, as well as other circuitry components, are housed within the module body 320. Optionally, the module 302 may further contain various non- imagmg sensors, such as a planar LiDAR curtain 316 which measures distances along field of view 318. This curtain LiDAR 316 may enable the controller 118 of the robot 102 to consider potential collisions as a result of the added height and change of form factor due to the module 302 even if its base sensor units 114 are not capable of doing so.

[0069] The module 302 further contains two cameras 306 in this embodiment. In other embodiments, three or more cameras may be used. While a single camera may be used with the systems and methods disclosed herein, it is less preferable because extracting orthographic views from a single perspective is difficult, as discussed further below. Additionally, multiple cameras arranged vertically enable imaging of tall objects from orthographic perspectives, such as tall supermarket shelves. The cameras 306 may be adjacent to controllable lights 322 to ensure the scene is properly illuminated. In some embodiments, the module 302 may be further coupled to an upper reserve camera 308 via another connection interface 304. The upper reserve camera 308 may be angled upwards to enable capture of high-up shelves, such as reserve storage in a warehouse setting. In the illustrated embodiment, the upper reserve camera 308 points in the opposing direction as the other two cameras 306 as this robot 102 and module 302 embodiment is configured for scanning aisles in a warehouse, wherein the robot 102 will pass by all shelves eventually using switch-back or S-pattem maneuvers through the aisles. Other embodiments may include the reserve camera 308 being configured in the same direction as the other cameras 306, a 360° view camera, or may simply not include this camera 308. Fig. 3A(iii) shows a

18

SUBSTITUTE SHEET (RULE 26) special purpose scanning robot 340 in which a scanning module 302 is mounted on a mobility chassis 341 for moving scanning module 302 around an environment. The robot 340 is purpose built for scanning its environment and does not include other functions not related to its scanning and processing functionalities.

[0070] A key feature of the scanning module 302 and/or special purpose scanning robot 340 is that cameras 306 include overlapping fields of view vertically. It is also assumed that the frequency of these cameras is great enough such that there is horizontal overlap between successive images as the robot 102 travels. FIG. 3B depicts four images 324 captured by two cameras 306 sequentially as a robot 102 navigates with a scanning module 302 coupled thereto, according to an exemplary embodiment. The top two images 324-Al and 324-A2 are captured by an upper “camera A” 306 at a first time and a second time. The bottom two images 324-B1 and 324-B2 are captured by a different, lower “camera B” 306 at the same first and the second times. Additional rows of images 324 and cameras 306 may be considered in some embodiments, however for clarity and without limitation only two are presently considered. Three features are depicted represented by a triangle 326, a circle 328 and a square 330. The triangle 326 is depicted twice, in image 324-Al and 324-B 1 taken from different cameras 306. The square 330 is also depicted twice, although in different images 324-Al and 324-A2 from “camera A” 306. Lastly, the circle 328 is depicted in all four images 324. Using the individual images, the triangle 326 and square 330 are detected twice, and the circle 328 is detected four times, despite there being only one of each feature in the actual environment. When the four images 324 are overlaid on each other, it is readily apparent that, despite being imaged multiple times, there exist only one triangle 326, circle 328, and square 330. When performing feature identification using the raw image data, it is required that duplicate feature detections are filtered to reflect the true counts of the environmental features 326, 328, 330 more accurately. It may be preferable to perform feature identification on the individual images 324 as opposed to, for example, a single stitched image as this increases the likelihood of detecting the features 326, 328, 330 in at least one image, which is an overall goal: to detect and count all present features. Further, accurate stitching requires that no features 326, 328, 330 be overlooked or duplicated in the stitched image, which is a constraint that can only be applied once the features 326, 328, 330 per image are determined, as will be discussed further below.

[0071] FIG. 4 is a functional block diagram illustrating a system 400 configured to ingest images captured by two or more cameras on a robot 102 and produce a feature report detailing the features detected within the images, according to an exemplary embodiment. An overall goal of the system 400 is to receive the input images, identify features depicted therein, localize the features to points in the environment, and report the findings in a feature report 422. The feature report 422 should only indicate the presence of actual features along with the number of those features which are present

19

SUBSTITUTE SHEET (RULE 26) in the environment. There are two primary aspects of this system: (i) the feature identification itself in block 404, and (ii) construction of a stitched image. Recall from FIG. 3B that a robot 102 capturing images of its environment may produce a plurality of images each depicting the same objects 326, 328, 330, wherein the feature report 422 should only report the presence of that one, singular object without duplication or stitching over (i.e., skipping) a feature.

[0072] The robot 102 may collect a plurality of images, and/or other sensory information, as it navigates around its environment. While navigating, the robot 102 may also track its position and orientation over time to construct a computer readable map of the environment based on sensing and localizing objects using its sensor units 114. In some embodiments, the robot 102 may utilize a special purpose sensor accessory or module to capture these images, such as module 302 shown and described in FIG. 3A(i-iii) above. In some embodiments, the robot 102 may be pre -configured to capture high quality images of its environment without the need of an additional module (for example, robot 340). Since the robot controller 118 is already localizing the robot 102 continuously, the controller 118 may further correlate captured images of features to locations where the robot 102 was during capture of the images in block 402. The images, location data, and corresponding metadata (e.g., timestamps, robot identifiers, route identifiers, etc.) may be communicated to a server 202 for further feature identification. The server 202 and robot 102 may also share the computer readable map used by the robot 102 to navigate the route, where the images were captured as part of the localization data.

[0073] According to at least one non-limiting exemplary embodiment, a robot 102 may perform any or all of the functional blocks depicted in server 202 in Fig. 4, provided it comprises sufficient computing recourses to do so. It is appreciated, however, that the amount of processing recourses needed to perform the functional blocks may exceed the capabilities of a given robot 102 and/or may greatly encumber the robot 102 from performing other tasks, where it is often preferable, though not required, to off-load this processing either in whole or in part to enable the robot 102 to continue doing other tasks. Use of an external server may also enable processing of the images when the robot 102 is low power, or otherwise unavailable (e.g., powered off, low network connectivity, etc.). [0074] The feature identification block 404 is configured to receive the input images and identify features within the image. The feature identification block 404 may include, for example without limitation, (convolutional) neural networks, pattern recognition models, referential image databases, and/or other conventional methods known in the art for identifying a set of features within an image. The set of features is a pre -determined list of objects, traits, or other features. For commercial settings, the set of features may include the products sold in a store. In other settings, the features could be specific people, objects or places expected in that setting. There are a plurality of conventional methods within the art for configuring a network to identify features in imagery such as, for instance,

20

SUBSTITUTE SHEET (RULE 26) Google’s Vertex Al Vision®, a cloud based platform for executing artificial intelligence models frequently used for feature detection in imagery; or AutoKeras, which allows users to configure artificial neural networks for a desired task. The specific method and/or model(s) used to perform the feature identification 404 is largely beyond the scope of this disclosure and is an active area of research and development, where any method of feature identification which provides a bounding box or other image-space location of an identified feature (e.g., semantic segmentation) would be operable with the systems and methods of the present disclosure. The feature identification block 404 produces a tabulated list of detected features along with bounding boxes which encompass the pixels of those features within the images, for each image received by the robot 102. For retail settings, the identification of the features may come in the form of a shelf keeping unit (“SKU”), universal product code (“UPC”), and/or other standardized inventory tracking identifiers.

[0075 ] Block 406 represents a depth stitching pipeline configured to produce stitched imagery in an orthographic view. The stitched imagery must not (i) duplicate any features, (ii) skip any features, and must (iii) minimally distort the overall image in the orthographic view, thereby necessitating consideration of depth of scene. While capturing a panoramic image of a flat surface or distant objects without distortion is fairly trivial in the art, often the features being imaged are not on a flat surface.

[0076] For instance, in retail settings, products are arranged on shelves with depth, and the products themselves may contain depth. Simple image stitching of two adjacent images requires the displacement of the camera and correlating the same notable features in both images. To display the stitched image in an orthographic view, the displacement of the object as seen by the two camera positions needs to be known, thereby requiring depth. To illustrate, consider a shelf lined with a plurality of identical products, e.g., a plurality of boxes of a cereal flavor/brand, the products are arranged in rows on the shelf, wherein one of the rows is lower stock than the rest thereby making the leading box in that row (i.e., the one seen closest to the camera) to appear smaller than the other adjacent rows of boxes. A stitching process which does not consider depth may yield artifacts as the stitching attempts to cone late like features inbetween images, wherein the depth difference yields different interframe displacement of the low-stock cereal box row as compared to the other rows. The stitching process cannot be loaded with a priori knowledge that all of the boxes are the same size, as this would require a manual input for each and every item. The further away (i.e., low stock) row of boxes would appear to move less than the other boxes in between frames, despite all rows of boxes being static objects. This distortion is further increased when considering the stitched image must also be vertically stitched to account for both (or any number of) cameras 306 of the scanning module 302. This complicates simple stitching using feature-based correlations and camera translation data alone. Accordingly, depth of scene must also be considered to minimize artifacts and distortion in the stitched

21

SUBSTITUTE SHEET (RULE 26) images to represent the true state of the environment more accurately.

[0077] Sub-block 408 includes the processor(s) of the server 202 performing a bundle adjustment, or commonly referred to as block bundle adjustment. Bundle adjustment, as further described below in FIG. 5A-B, uses epipolar geometry of sequential images to estimate depth of pixels within those images. Bundle adjustment yields a matrix of key points which are tracked between images, where the depth of those key points can be estimated based on their inter-frame motion and epipolar geometry. The key points used by bundle adjustment do not need to be the same as feature identification. For instance, the key points may include salient color transitions, comers, round edges, certain letters/numbers/text, and/or other easily resolvable and trackable points (i.e., single pixels) in images. Using epipolar geometric constraints and the location/displacement of the cameras during acquisition of each image from the robot 102, the depth of each of the key points can be extracted from the sequential imagery to yield a 3-dimensional (“3D”) point cloud which can be transformed into a 3D point mesh in block 410. The mesh is formed by connecting nearest neighboring points of the point cloud together to form a network of triangles, each vertex of each triangle now containing (x, y, depth) coordinates in the image-space, as shown in more detail in FIG. 5D below. Tracking of the key points may further yield poses or positions of the cameras (camera poses block 412) which captured the images. Typically the camera poses from bundle adjustment 408 are of improved accuracy compared to normal odometry used by the robot 102, as further described below in FIG. 5A-C.

[0078] The 3D point mesh 410 is an estimate of the variable depth of field for the imaged object, such as a shelf with varying stock items. By projecting each image onto the mesh in the following block 414 and using the known camera location/position 412 for each image, artifacts and other distortions are substantially reduced. Projecting the collection of images onto the mesh further enables calculation of an orthographic view of the stitched image, which further reduces distortions and improves discemability of the features within the image.

[0079] The resulting stitched image with depth 418 comprises an orthographic view of a sequence of images of an object, such as a shelf in a retail store for example. The controller 118 (see Fig. 1A) may be configured with a computer readable map, which enables the robot 102 to determine which groups of sequential images should be stitched together. For instance, when the controller 118 detects the robot 102 is in a location where image collection should occur, the controller 118 and/or processors of the module 302 begin capturing a plurality images which are grouped together to be stitched via the depth stitching block 406. The processor(s) of the server 202 may stitch together all the images for a given group of sequential images to produce a singular stitched image with depth for each group of images.

[0080] Next, the feature identifications 404 are combined with the stitched image in a de-

22

SUBSTITUTE SHEET (RULE 26) duplication block 420. The de-duplication block 420 ensures that the features of the stitched image correspond to a single detection of said feature. To illustrate, recall in FIG. 3C a singular feature, such as the circle 328, could appear multiple times despite only one circle 328 existing in the environment. The feature identification block 404 produces four counts of the circle feature 328 when analyzing the images 324-Al, 324-A2, 324-B1, and 324-B2; however the circle 328 would only appear once in the stitched imagery. The de-duplication block 420 considers the (x, y) location for bounding boxes for each feature detected in the feature identification block 404. By maintaining those (x, y) positions in the image in combination with the known position of the cameras 412 from the bundle adjustment 408 and known stitching parameters (i.e., (x, y) location in the mesh for each image added to the stitched image), the resulting (x, y) position of each bounding box can be resolved for the overall stitched image such that the duplicated counts are eliminated.

[0081] Once the duplicated counts are eliminated, and the singular stitched image is produced, the final counts (without duplications) can be communicated to a device 208 and included in a feature report 422. The stitched image may also be communicated, although it may be preferable to require the device 208 to request the stitched imagery only when needed, such as by a user, due to added data transmission costs. The device 208 may comprise, for example and without limitation, a cell phone, personal computer, another server 202, etc. such as a device 208 of a store owner where the robot 102 is capturing imagery. The device 208 may also, in some embodiments, include databases, such as inventory tracking databases, wherein the feature counts could correspond to product counts useful for tracking inventory of an environment.

[0082] FIG. 5A-D illustrate a process of bundle adjustment and construction of a 3 -dimensional mesh for use in image stitching, according to an exemplary embodiment. First, in FIG. 5A, an exemplary shelf 502 containing a plurality of objects 504 is to be scanned for features, wherein the features in this example are the type of object 504. In this simplified example, the object type could be the shape, e.g., rectangle, oval, triangle, etc. In a more practical example, the objects 504 could represent different products on the shelf 502, wherein scanning for features requires identification of the products using, for example, a UPC, SKU, or other (alphanumeric identifier which may be specialized to the environment (e.g., a SKU) and/or universal codes (e.g., a UPC). Identifying the features may, in some applications, further include identifying specified characteristics of the features such as quantity, price (e.g., on a website, internal database, mean suggested retail price, or other reference price), and displayed price (i.e., the price listed on an adjacent price tag or price label which may differ from a reference price). In some embodiments, characteristics such as “item missing” (i.e., detecting an empty space) may also generate an identification. In some embodiments, characteristics such as damage to products may also be identified as a “damaged product” feature.

23

SUBSTITUTE SHEET (RULE 26) [0083] The robot 102 may navigate by the shelf and, upon being within a threshold range of the shelf, begin to capture a plurality of images thereof. Some exemplary camera 306 positions are shown below the shelf 502. The robot 102 in this embodiment includes two cameras 306-A and 306-B which are vertically arranged. The cameras 306-A, 306-B are each further depicted in three positions 306-Al, 306-A2, 306-A3 and 306-B 1, 306-B2, and 306-B3, respectively, to show that a single object can be depicted numerous times from various angles as the robot passes the object. For a robot moving from left to right in Fg. 5A, a specific object would appear to move from right to left in sequential images taken at positions 306-Al, 306-A2, 306-A3 and 306-B1, 306-B2, and 306-B3.

[0084] Superimposed on the objects 504 are a plurality of key points 506 extracted during the bundle adjustment process. The selected key points are pixels in images which may be salient, such as sharp comers, color transitions, certain shapes and the like, which can be readily resolved and identified in sequential imagery. These key points 506 are not to be confused with the features to be identified, despite the fact that the key points 506 are being selected based on features or components of the objects 504. The key points 506 are assumed to be on static objects. One exemplary key point 506 is depicted with lines of sight 508 from the cameras 306-A, 306-B to show how the singular key point can be viewed multiple times from various perspectives.

[0085] To better illustrate the epipolar geometry used to calculate the depth of each key point 506, FIG. 5B illustrates epipolar geometry used to calculate depth of a given pixel 512 in a first image using a second image. FIG. 5B depicts two image planes 510-1, 510-2 corresponding to two sensors represented by a pin-hole camera model. The sensors are depicted as points corresponding to the pinhole camera model, wherein two positions 306-Al and 306-A2 of the camera 306-A are being shown. It is appreciated that the two image planes 510-1, 510-2 could also correspond to images from different cameras, such as images captured by cameras 306-A and 306-B at positions 306-Al and 306-B1, respectively. The pixel 512 is an arbitrary pixel in the image plane 510-1. Based on (1) the camera projection matrix, which is an intrinsic tensor typically measured by manufacturers of the cameras, and (2) only the first image, the depth of the pixel 512 cannot be resolved. That is, the pixel could represent an object anywhere on the epiline 514 which could extend infinitely. Assuming the pixel depicts a static object that is also seen in the second image 510-2, the location of that same feature of the static object represented by the pixel 512 can be identified in the second image 510-2, and triangulation can be utilized to determine where along the epiline 514 the object represented by the pixel 512 lies. Three exemplary locations for the pixel in the second image frame 510-2 are shown, each corresponding to a different depth along the epiline 514 and different location on image plane 510-2, as indicated by lines 514. Accordingly, the depth of the pixel 512 can be triangulated and calculated based on its location in the second image frame 510-2. In the illustration shown, the actual depth of object 512 is indicated on

24

SUBSTITUTE SHEET (RULE 26) epiline 514 where it intersects the middle line of lines 515.

[0086] The exemplary configuration in FIG. 5B depicts an epipole 515, shown by a dashed line, being within the second image frame 210-2, wherein one skilled in the art may appreciate that the epipole 515 may exist outside the image plane 210-2 such as for the two-camera module 302 depicted in FIG. 3B wherein neither camera 306 sees or is pointed towards the other camera 306 or its prior locations. Since the vertical camera displacement on the module 302 is known, the epipole 515 can be readily calculated to determine the epiplane as shown in FIG. 5B, the epiplane comprises the plane formed by epipole 515 and the intersection between epilines 514 between both image frames 510-1, 510-2. Similarly, the lateral displacement between sequential images is also known and measured by the controller 118 which generates the epipole 515 for a single camera in between frames.

[0087] The two image planes 510-1, 510-2 could represent either two cameras 306-A, 306-B or could represent a single camera capturing two images from two positions (e.g., images from positions 306-Al, 306-A2) which depict the same scene containing the subject of the pixel 512. Returning to FIG. 5A, each key point 506 is treated as the pixel 512 in performing depth analysis via bundle adjustment. The depth analysis can be performed on each pair of images from the pair of cameras 306- A, 306-B, or can be performed on pairs of images which are captured sequentially, e.g., the image captured at positions 306-Al and 306-A2. Optimally, all configurations are considered and the value for the depth is an average value calculated via any combination of image pairs used, provided the pair both depict the feature denoted by the same key point 506. That is, there is no requirement that bundle adjustment be performed only between sequential images or between two contemporaneous captured images from two cameras, and would preferably be performed between all pairs of images which all depict the same key point 506 in order to determine the depth of that key point 506 with additional data for improved accuracy. In some embodiments, epipolar geometry as shown in FIG. 5B can be simultaneously applied to a plurality of images pairs to determine depth using multiple images contemporaneously using an optimization process.

[0088] A key requirement for bundle adjustment to yield accurate depth information is that the key points 506 must be properly corresponded to the exact same point/object/feature in all images, in this case all six images taken at the six camera positions. Stated another way, the object/feature depicted by the pixel 512 must be corresponded to the same object/feature in the second image 510-2 to determine depth, wherein improper correspondence would produce an incorrect depth along epiline 514. This is shown via three different points in the second image frame 510-2 corresponding to three respective distances along epiline 514. Using visual features alone may be insufficient, especially in retail environments where the objects 504 are arranged adjacent to identical objects 504 such as the row of identical rectangles, hexagons, triangles, etc. on shelves 502. Using the translation of the robot

25

SUBSTITUTE SHEET (RULE 26) 102 between successive images as measured by odometry, navigation units 106, sensor units 114, etc., improper correspondences can be filtered out.

[0089] To illustrate further, FIG. 5C(i-ii) depicts two scenarios: an improper correspondence in FIG. 5C(i) and a proper correspondence in FIG. 5C(ii), according to an exemplary embodiment. The visual scene in this example has been reduced to the worst-case scenario for correspondence matching: all features 504 appear identical, are arranged evenly, are free from other contextual features (e.g., backgrounds, other features, etc.), and only one key point 506 at the same, top-left comer of all the features 504 is considered. It is appreciated that the objects 504 are static and do not move, wherein illustration of the objects 504 in different positions is indicative of the relative motion of the scene due to the moving camera 306 on a robot 102. The images are vertically aligned in FIG. 5C(i-ii) for visual clarity only, wherein a controller 118 may not need to do such visual alignment.

[0090] First, in FIG. 5C(i), an improper correspondence is made. The image frames 516-1, 516-2 depict the field of view of the camera, e.g. , camera 306-A or 306-B, used to image the four boxes 504, wherein only three boxes 504 appear in any given frame. For illustration, the boxes have been numerically labeled to be discernable to the present viewer/reader; however, it is appreciated that the system at this stage does not discern these features 504 as distinct or label them distinctly as shown.

[0091 ] A proper correspondence would comprise corresponding, as shown by arrows 518, the key point 506 of box 1 to the same key point 506 of box 1. Since box 1 has moved out of frame in image 516-2, however, there should not be any correspondence. Similarly, the key point 506 of box 2 in image 516-1 should be corresponded to the key point 506 of box 2 in image 516-2. However, due to the identical nature of the scenario the improper correspondence is made between the key point 506 on box 1 in image 516-1 and the key point 506 on box 2 in the latter image 516-2. This improper correspondence would indicate that the robot 102 has barely, if at all, moved between capturing the image 516-1 and 516-2. However, as shown by the boxes 1 and 4 being illustrated out of frame, the camera has actually moved by approximately one box length between capturing images 516-1 and 516-2.

[0092] To summarize, image data alone may be insufficient in performing correspondence matching needed for bundle adjustment, especially in environments where the features 514 are repeated identically across the scene. Accordingly, the odometry received by the robot 102 may aid in the correspondence matching by filtering out poor correspondences which yield inaccurate camera translations. Stated another way, the improper correspondence shown in FIG. 5C(i) would indicate that the robot 102 has not moved, or barely moved, which would substantially differ from the odometry of the robot 102 by at least a threshold amount. Accordingly, the correspondences in FIG. 5C(i) can be determined to be erroneous since the calculated translation using the image correspondences differs

26

SUBSTITUTE SHEET (RULE 26) from the odometry, and the processors of the server 202 may attempt to perform a different correspondence as shown in FIG. 5 C(ii) . FIG. 5 C(ii) depicts proper correspondence, with the key points 506 on box two and three in the first image 516-1 corresponding properly to the same boxes two and three in the second image 516-2. Since box four was not depicted in the first image 516-1, there is no correspondence in the second image. Similarly for box one, there is no correspondence since box one has moved out of frame of the second image 516-2.

[0093] These two correspondences, as shown by arrows 518 being non-vertical when the image frames 516-1, 516-2 are vertically aligned in Fig. 5C(ii), would indicate there is horizontal movement of the scene in between image frame 516-1 and 516-2. Since the features 514 are assumed to be static, the movement as perceived in the image frames 516-1, 516-2 should approximately match the displacement as measured by the robot 102 odometry. Stated another way, the horizontal component of the depicted arrows 518 in FIG. 5C(ii) correspond to the horizontal translation of the robot 102 in between images 516-1 and 516-2. Typically, odometry from the robot 102 is of lower resolution and more prone to noise than the image-based calculation of translation. Accordingly, the two estimates for robot 102 translation (odometry versus image-space translation of key points 506) may not be exactly the same, however the two methods of localization should be approximately equal (e g., 10% or less difference in robot 102 translation for proper correspondences), wherein the odometry, despite being noisy and of lower resolution, may still be utilized to filter some improper correspondences.

[0094] It is appreciated that use of odometry to filter incorrect correspondences between sequential images may also be utilized to filter incorrect correspondences between contemporaneously captured images from two different cameras. Rather than utilize odometry to constrain the potential correspondences as discussed above, the controller 118 may utilize the known camera displacement between the two cameras 306-A, 306-B to constrain potential correspondences. That is, the key points 506 corresponded between two contemporaneous images should yield the spatial separation between those two cameras (via the epipolar geometry' discussed in FIG. 5B) and, if it differs substantially (i.e., excluding potential errors from noise and limited camera resolution), a poor correspondence may be identified.

[0095] Returning to FIG. 5A, the initial spatial displacement of the cameras 306-A, 306-B between successive images are measured by the controller 118 to provide the bundle adjustment process with an initial pose estimate. This measurement of displacement comprises an aspect of the odometry received by the server which corresponds to the images, wherein the controller 118 of the robot 102 may provide time stamps for the acquired images which may be corresponded to the location of the robot 102 and its cameras 306-A, 306-B at the time the images were captured. This initial estimate should be sufficiently accurate to determine if a given key point 506 would be present in a successive

27

SUBSTITUTE SHEET (RULE 26) or prior image, but may not be of sufficient resolution to rely upon for pixel-wise image stitching. Bundle adjustment further utilizes epipolar constraints to provide an estimation for the true camera displacement in between images. This estimation is further constrained since the cameras 306-A, 306- B, and/or other cameras if present, of the module and/or scanning device are positioned at fixed and known locations with respect to each other. Using the key points 506 at known (x, y, z) positions, the orientation of the cameras 306-A, 306-B, and the apparent image-space translation of properly corresponded key points 506 between successive images, the position of the cameras 306-A or 306-B can be calculated. This calculated translation, and thereby the position, of the cameras 306-A, 306-B would typically be of higher resolution than the typical odometry used by the controller 118 to move the robot 102 as it is performed using per-pixel image elements. This improved localization will be essential in de -duplication and creating accurately stitched images.

[0096] The bundle adjustment process calculates the depth for each key point 506. Using the depth information for the key points 506, a mesh 520 of the shelf 502 can be extracted as shown in FIG. 5D, according to the exemplary embodiment. The mesh 520 is constructed by connecting the key points 506 in 3D (x, y, z) space. The connected key points 506 may comprise any n>2 nearest neighboring points to a given key point 506. The mesh 520 in the illustrated embodiment comprises triangles, however other shapes are considered applicable.

[0097] FIG. 5A shows that the depth of the shelf unit 502 may generate some key points 506 which are behind the objects 504 of the bottom row and other key points 506 on the leading edge of the shelf. The depth values of the key points 506 which correspond to objects 504 on the top and bottom rows of objects 504 would produce a lower depth value than the five key points 506 which lie between the rows of objects 504 and a greater depth value than key points on the leading edge of the shelf 502. Accordingly, the mesh 520 shown in FIG. 5D reflects this change in depth as a function of horizontal and vertical position along the shelf 502. Further, any object 504 which is not at the leading edge of the shelf 502 would generate key points 506 with a different depth value than the other objects, which is essential for determining a size of the object when being projected onto the orthographic view plane in the stitched image. In instances where the shelf is transparent, or the display comprises a table or ledge with no backing, some key points 506 may be generated on background features which are much further in depth from other key points 506 on the foreground objects/display/shelf.

[0098] It is appreciated that the illustrated embodiment is a simplified mesh 520 comprising a few key points 506 for clarity. Bundle adjustment algorithms common within the art may calculate hundreds of key points 506 per image, wherein the mesh 520 formed therefrom would be of significantly higher resolution than depicted.

[0099] To summarize briefly, following the bundle adjustment the server 202 processor(s)

28

SUBSTITUTE SHEET (RULE 26) have calculated the mesh 520 that models the approximate depth of the scene in the images and pose estimates for the cameras 306-A, 306-B during acquisition of each image processed during the bundle adjustment. A stitched image can be generated via projecting the individual images onto the depth mesh 520. As part of producing the stitched image, the server 202 processor(s) must de-duplicate repeated feature detections (e.g., as shown in FIG. 3C and discussed above). Recall in FIG. 4, a separate block 404 has identified a set of features for each image captured, however this set of features includes duplications due to overlap in camera field of view (FoV). The feature identification block 404 produces, for each detected feature, a name or identifier (e.g., a UPC or other alphanumeric), and a bounding box around the feature, wherein the bounding box location in the image-space is also calculated. As used herein, a bounding box will be defined by the (x, y) location of its bottom left pixel, however it is appreciated that this is merely for clarity and is not intended to be limiting. The bounding boxes, however, do not differentiate between two items having the same identifier (e.g., SKU/UPC). For instance, a series of identical products on a shelf would all be identified with the same identifier, wherein the series should generate multiple counts of the product.

[00100] FIG. 6 depicts two images 602, 604 captured by a single camera 306 sequentially being projected onto a mesh 520 in order to account for duplications, according to an exemplary embodiment. The exemplary feature in the images 602, 604 is encompassed by a bounding box 606 with an imagespace location defined by its bottom left comer, shown by a square 608. In the second image 604, the bounding box 606 has moved due to the robot 102 and camera 306 moving in between the images 602, 604.

[00101] Also depicted in FIG. 6 is a top-down view of the scene encompassed by the two images 602, 604. The top-down view depicts a slice of the mesh 520 taken at a height corresponding to the height of the bottom left comer 608 of the bounding boxes 606. The bounding boxes 606 are shown separately for two images 602, 604 corresponding to images taken at respective camera positions 610 and 612. Assume both bounding boxes 606 represent the same object and correspondence matching was properly performed. Camera positions 610, 612 are calculated via the bundle adjustment process discussed herein. The projection matrices of the cameras 306 further define the projection of a given pixel onto the environment as shown by lines 614 which extend from the sensor origins at positions 610, 612 through the bottom left pixel of the bounding box 606 of the image respective planes 616 and into the scene.

[00102] FIG. 6 may alternatively depict a vertically oriented geometry, wherein the two image planes 616 are representing two vertically disposed cameras 306 capturing two images simultaneously. Moving left to right on the page may be analogous to moving up or down in 2D space. Lines 614 may depict the vertical constraints as to the location in 2D space of the feature within the bounding box 606.

29

SUBSTITUTE SHEET (RULE 26) In practice, both vertical and horizontal constraints may be utilized during the projection and deduplication process to determine the 3D location on the mesh of the given feature, in this case pixel 608.

[00103] FIG. 6 may alternatively be depicting projection of individual images onto the mesh 520, wherein any readily definable pixel of the images 602, 604, and other images is used instead of the bottom-left pixel 608 of the bounding box 606, which is/are projected onto the mesh 520.

[00104] Accordingly, by using two (or more) images 602, 604 which include two respective feature identifications, shown by bounding boxes 606, of the same object/feature in conjunction with the mesh 520 and bundle adjustment pose estimates at positions 610, 612, duplicate features can be substantially reduced and verified to be eliminated from the resulting stitched image. The duplicates are removed by more accurately, via use of the depth mesh 520, projecting the images 602, 604 onto the mesh 520. Since the mesh 520 models the 3D spatial geometry of the scene, the locations where the pixels 608 are projected onto the mesh 520 should, for any given pair of images containing the same pixel 608, converge as shown by lines 614. In turn, artifacts are reduced substantially yielding a duplicate-free panoramic. Consider for a moment embodiments in which the depth mesh 520 is not employed; accordingly, the pixel 608 in the two images 602, 604 would be projected at two locations where the line s 614 intersect the singular plane 618 of the shelf, yielding a duplicate feature in the final panoramic image. By implementing the depth mesh 520, the chances of a given pixel 608 in both images 602, 604 converging to a single point in 3D space is greatly improved, and any non-convergence would be reduced substantially since these would be the result of the mesh 520 (more specifically, spaces inbetween key points 506) comprising estimates of the depth values in between key points 506 (i.e., the edges of the mesh 520 are estimated via interpolation between key points 506).

[00105] By projecting the images onto the mesh 520, or more specifically the pixel color values

(or averages thereof) of the individual images as captured by the cameras 306 onto the mesh as shown with the exemplary pixel 608, the depth mesh 520 may be populated with a plurality of color values. The mesh 520 would encompass a wider range than any individual image, wherein projecting the images onto the mesh 520 thereby produces a stitched image which is free from duplications due to the spatial constraints used to project the pixels of the images 602, 604 onto the mesh 520. Use of the bottom left pixel 608 of a bounding box 606 is illustrated to depict how duplicate detections in individual images are removed due to spatial constraints (i.e., use of the depth mesh 520 which enables lines 614 to converge to a single point thereon), however it is appreciated that the pixel 608 could be any pixel of the images 602, 604 depicting an identified feature (e.g., pixels within the bounding box 606), background, or other object(s).

[00106] Since depth of field is considered when performing the projection, a substantial

30

SUBSTITUTE SHEET (RULE 26) majority of lines 614 extending from a camera position 610, 612 through a pixel of the image planes 616 should converge on the mesh 520, provided both pixels depict the same static feature. When the lines 614 do not converge, artifacts, duplicate pixels, and other irregularities may form, which are reduced substantially when using the mesh 520 as opposed to projection onto a flat plane (e.g., as shown by plane 618 in FIG. 6 comprising two intersection points with lines 614 corresponding to the same bounding box comer 608). Projection on a flat plane may produce minimal distortion if the depth of field in the scene is negligible compared to the distance of the camera to the scene (e.g., a panoramic of a mountain, wherein the mountain is tens of miles away and the difference in depth is at most a couple miles). However, for robots 102 which scan shelves, displays, storage systems, etc. for features, these shelves may include variance in depth which is of the same magnitude as the distance of the camerato the shelf (e.g., the robot 102 may be 5 feet from the shelf, and the depth variance of the shelf may be 2-4 feet). Since the shelves are closer and their depth variance is on the same order of magnitude as the distance of the camera to the shelf, the parallax motion of a given point on an object close to the edge of the shelf and close to the back of the shelf will both vary in their image-space locations in sequential imagery, which, if not accounted for via a mesh 520, would produce artifacts. Conversely, returning to the mountain example, a point on the mountain tens of miles away captured by two spatially separated cameras would include an apparent inter-frame motion approximately equal to the apparent motion of the entire mountain, thus projection onto a flat plane would include negligible artifacts as opposed to a close-up image of a shelf where the difference in inter-frame motion between a point 506 at the front edge of the self and a point 506 at the back edge of the shelf is significantly larger.

[00107] The process of identifying the 3D location of a given bounding box 606 as shown in FIG. 6 could be repeated for every feature identified by the feature identification block 404. Once complete, duplicate feature identifications will have been omitted and a final stitched image, which considers depth, produced. Accordingly, final counts of the identified features can be calculated since duplicate identifications have been removed. These counts may represent, for example without limitation, counts of products on a shelf. In some embodiments, no object (i.e., empty space) could constitute a feature to be identified to track low/out of stock items. Empty space can be further verified to be empty space, as opposed to e.g., a dimly lit feature, using the depth mesh 520, wherein a sharp spike in depth in the mesh 520 (e.g., greater than a threshold based on the average depth of the mesh 520) would be produced proximate to or encompassing the empty space. In some embodiments, the server 202 may further utilize historical data of the location to determine if, in the past, a feature was present at that location which is now feature-free and has large depth to confirm that the space should contain a feature which is presently missing.

[00108] According to at least one non-limiting exemplary embodiment, the counts of the

31

SUBSTITUTE SHEET (RULE 26) detected products can be determined and de-duplicated without the need for a stitched image to be produced. Similar to projecting images onto the mesh 520, the spatial region occupied by each feature is defined by the bounding box 606. By projecting the bounding boxes 606 onto the mesh 520 spatially (i.e., without considering color values of the image, and only the image space occupied by the bounding box 606) duplicate detections can be removed from the final count of the features. In other words, although the present disclosure shows and describes de-duplication from the perspective of creating a single, spatially consistent panoramic image, one skilled in the art would appreciate that the same spatial considerations (i.e., projection onto the mesh 520) would also effectively de-duplicate features without needing to produce a viewable image with color.

[00109] Once the images 602, 604 and their depicted features and scenes (e g., bounding box 606 and other pixels of the images) are projected onto the mesh 520, the color values of the mesh 520 can then be projected onto a designated shelf plane 618 to produce a panoramic image of the shelf, display, or scene in an orthographic view. The shelf plane 618 comprises a plane in 3-dimensional space that extends vertically from the floor and runs substantially parallel to the shelf, display, or direction of travel of the robot 102 as it collects the images. The shelf plane 618 may, in some embodiments, include the outermost boundary of the shelf, display, or object being scanned for features, which may be localized onto a computer readable map using sensor units 114. The depth mesh 520 may differ from the shelf plane 618 by a variable distance shown in part by ray 620. In some exemplary embodiments, a human may provide the location of one or more shelf planes 618 using annotations on the computer readable map, wherein the human indicates regions on the map that include objects to be scanned. In alternative embodiments, the plane 618 may be the closest point (i.e., minimum depth) of the mesh 520 to the locations of the cameras 306. In some embodiments, the plane 618 may be spaced from the depth mesh 520 by a constant value with respect to a minimum depth point on the mesh 520. These regions, which are typically rectangular although not required to be, would then define the shelf plane 618 based on their edge closest to the cameras 306. These regions on the map which indicate objects to be scanned may cause the robot 102 to begin and end collection of images upon navigating proximate to these regions. Images collected while the robot 102 navigates proximate to these regions may be grouped or binned as a series of images of a discrete object and may be processed separately using the systems and methods of the present disclosure from other series of images of different discrete objects to be scanned for features. That is, each group or bin of sequential images of a discrete object to be scanned for features may produce a corresponding panoramic image and feature report 422.

[00110] Projection vectors or rays 620 depict the projection of the color values of the depth mesh 520 onto the designated shelf plane 618. Rays 620 should always be orthogonal to the designated shelf plane 618 in order to produce the desired orthographic view of the scene.

32

SUBSTITUTE SHEET (RULE 26) [00111] According to at least one non-limiting exemplary embodiment, the color values of the pixels projected onto the shelf plane 618 could be darkened based on their distance to the shelf plane 618. Orthographic views do not contain any perspective or depth information, wherein separating out background and foreground becomes a difficult and unintuitive task. Darkening the pixels based on the length of projection vectors 620 (i.e., based on the depth of the mesh 520) would darken background pixels with respect to the foreground, thereby adding perspective back into the image. Such darkening may be preferable for short shelves or displays being imaged in front of larger ones wherein both displays are present in the images. This may, however, require that the shelf plane 618 be lower in depth (i.e., closer to the robot 102 and/or camera 306) than the depth mesh 520 itself, as shown in FIG. 6 for example.

[00112] FIG. 7 is a process flow diagram illustrating a method 700 embodied on one or more processors of a server 202 configured to produce a final report 422, according to an exemplary embodiment. The final report 422 as described above includes, at minimum, a count of the number of identified features sensed by the robot 102 and their location in 3D space. In some cases, the report 422 may further contain a stitched image of the scene containing the features to be identified and counted. Method 700 may alternatively be executed on one or more controllers 118 of a robot 102 and/or one or more processors of a module 302 configured for imaging features.

[00113] Method 700 begins with block 702, which includes the one or more processors receiving images from a robot 102. The images are localized to a first set of camera locations based on odometry estimates from the controller 118 during acquisition ofthe images. The controller 118 of the robot 102 continuously localizes itself using data from navigation units 106, actuator units 108 (e.g., feedback), and sensor units 114. In turn, since the position of the cameras on the robot 102 are at known (and often fixed) locations on the robot 102, the position ofthe cameras can be calculated. The first set of camera locations would include a resolution approximately equal to the resolution of the odometry/localization capabilities of the robot 102. The images received in block 702 comprise sequentially captured images of a scene without discontinuity. For instance, the robot 102 may capture images in some locations but not others, wherein the images received are bundles or groups of images which are captured continuously and sequentially within the designated areas.

[00114] Block 704 includes the one or more processors identifying features within the images and assigning a bounding box to the features. In some embodiments, the identification of features may be performed on the server 202 via the one or more processors embodying feature identification models, such as convolutional neural networks for example. In other embodiments, the identification of features may be performed via another entity, such as another server. The method(s) used for identifying the features is/are not intended to be limiting and may include any contemporary method known in the art,

33

SUBSTITUTE SHEET (RULE 26) such as neural networks; other forms of computer vision and learning; or referential databases, where images are compared to large databases of images of the features to be identified. The bounding box corresponding to each feature is assigned an identifier which corresponds to the feature. For instance, a bounding box encompassing a cereal box could be identified using a SKU, UPC, or other (alpha)numeric value that corresponds to the particular brand, flavor, etc. of the cereal box. Each bounding box includes an image-space location encompassing the corresponding feature. There may exist a plurality of individual feature detections for a single given feature at this stage due to image overlap. In some embodiments, block 704 could be performed contemporaneously with or after block 706 described next. In some embodiments, some characteristics of the features are also identified such as their listed price (e.g., based on adjacent or affixed price labels).

[00115] Block 706 includes the one or more processors performing a bundle adjustment using the plurality of images received in block 702 and a first set of camera locations. The first set of camera locations provides the bundle adjustment algorithm with an initial estimate as to the positions of the cameras during acquisition of the images received in block 702. This initial estimate improves correspondence matching, as shown and described in FIG. 5C above, by filtering incorrect correspondences, wherein proper correspondence matching is necessary for the bundle adjustment to yield accurate results. Key points 506 can be extracted for each image and held under epipolar constraints as described in FIG. 5B above in order to evaluate a depth of field for each key point 506. The first set of camera positions enables an approximate initial estimate as to the true position of the camera, wherein the bundle adjustment would provide a more resolute estimate of the true camera position. The bundle adjustment process achieves this by optimizing the calculated image-space displacement of key points 506 between images to determine depth, while also optimizing the camera displacement between the two images so as to conform to the calculated depth, where the robot 102 odometry provides an initial guess on the camera locations. This initial guess may be utilized as a constraint on improper correspondences (see FIG. 5C). Accordingly, the bundle adjustment process calculates (i) the depth of scene for each key point 506, and (ii) a second set of camera locations, which is of a finer resolution than the odometry used previously in the first set of camera locations. By connecting each key point 506 to its nearest neighboring key points 506 in 3D (x, y, z) space for all images received in block 704, a depth mesh 520 is produced. The depth mesh defines a plurality of 3D surfaces, each containing vertices defined by the key points 506.

[00116] The first set of camera locations may define the 3D location of the object being scanned for features in the overall environment. This set of locations may be sufficiently accurate to roughly approximate the locations of the features overall, but may be insufficient in reprojecting these features in a singular stitched image. For instance, the first set of locations may be accurate enough to define

34

SUBSTITUTE SHEET (RULE 26) sections, shelves, displays, or other ‘bins’ of features for processing using methods herein. The second set of locations may be sufficiently resolute to enable such projection but too granular for mapping the locations of the features on a global environment scale. For instance, the second set of locations may be resolute enough to determine where specifically on/in a section, shelf, display, or other ‘bin’ a given identified feature exists and project the image(s) of the feature onto a mesh 520.

[00117] Block 708 includes the one or more processors projecting the images received in block 702 onto the depth mesh 520 created in block 706. The projection comprises, for each pixel of a given image, determining a corresponding location on the mesh 520 for that pixel (e.g., using a camera proj ection matrix, as shown in FIG. 6 via lines 614), and assigning the color value (e.g., RGB, greyscale, etc.) of that pixel to the mesh 520. A camera projection matrix may denote the unit vectors for each pixel of an image which denotes the direction of projection for the given pixel (i.e., angle of lines 614 shown in FIG. 6). The mesh 520 itself may be comprised of or discretized into pixels, preferably of higher resolution (i.e., smaller pixels) than the images themselves, wherein the projection involves assigning pixels in mesh 520 color values based on the color values of the images. By projecting the images onto the mesh 520, pixels depicting common features between multiple images should have those common features project onto the same location on the mesh 520, thereby removing (or consolidating) duplicate detections.

[00118] This projection onto the depth mesh 520 further accounts for the locations of the bounding boxes in the image-space, determined in block 704, and second set of camera locations. FIG. 6 depicts this visually with lines 614, which extend from a bounding box 606 and converge on a point 608 on the depth mesh 520. The second set of camera locations defines the camera positions 610 and 612 and the image-space location of comer 608 of the bounding box 606 constrains the location of the same feature when being projected onto the mesh 520, thereby excluding duplications from the final count and panoramic image. The spatial consideration during projection of images from the second set of camera locations onto the calculated depth mesh 520 effectively removes duplicates caused by (i) overlapping images, and (ii) artifacts (e.g., artifacts caused by lines 614 intersecting the shelf plane 618 twice in FIG. 6). The same projection process could be performed for every pixel, or on groups/blocks of pixels, of the images received in block 702 as a method for producing the overall stitched image. Stated another way, due to consideration of the 3D geometry of the scene using the depth mesh 520, the duplicate feature detections should be all projected to the same location on the mesh 520 (e.g., as shown in FIG. 6) and, thereby, effectively remove duplicate detections from the final panoramic image. [00119] According to at least one non-limiting exemplary embodiment, block 710 is optional and can be skipped if the end user of the feature report does not desire a colorized panoramic image, but still desires to know the counts of features within their environment. In such embodiments where

35

SUBSTITUTE SHEET (RULE 26) the panoramic image is not required, block 710 is executed without projecting the color values onto a designated plane 618, wherein only the 3D locations of the bounding boxes are projected onto the designated plane 618 to determine a final count of the bounding boxes on the shelf in block 712.

[00120] Block 710 includes the one or more processors projecting the color values of the depth mesh 520 assigned in block 708 onto a designated plane to generate an orthographic view. This projection includes, for every point or pixel on the mesh 520, projecting the color value of that pixel/point onto a flat plane. The projection is orthogonal to the plane at all points along the plane. This orthogonal projection is shown by ray 620 in FIG. 6, wherein similar rays 620 for other points on the mesh 520 are projected in the same direction (i.e., parallel to ray 620). By performing this second projection after the color values of the mesh 520 are assigned, an orthographic and panoramic view of the scene is created. Orthographic perspectives may be advantageous in viewing long shelfs from a singular and uniform perspective which does not include any angular perspective.

[00121] According to at least one non-limiting exemplary embodiment, the one or more processors may further modify the color values projected onto the designated plane 618 based on the distance of those points of the mesh 520 to the plane 618. As discussed above, orthographic views are advantageous in viewing a long shelf from a uniform perspective. A drawback, however, is that all perspective in depth is lost to the viewer, wherein separating background from a foreground requires prior knowledge of the scene and is not always intuitive. Accordingly, the one or more processors may darken (or lighten, if preferred) the pixel color values as a function of depth (i.e., length of projection ray 620) of the mesh 520, wherein farther away points on the mesh 520 are darkened. Darkening the background pixels in the orthographic perspective provides the viewer with immediate context that the darker regions are background, thus making the background regions easier to ignore when viewing the foreground, despite the image itself not containing any perspective.

[00122] According to at least one non-limiting exemplary embodiment, the colorized panoramic image may not be necessary or requested by an end consumer, wherein the end consumer may simply desire to know the location and count of the detected features without viewing them visually (e.g., in an image or video stream). A similar projection from the depth mesh 520 onto the designated plane is still be performed, however the projection of the color values may be ignored. The bounding boxes for the identified features, however, should be projected onto the designated plane. By projecting only the spatial areas occupied by the detected features (i.e., bounding boxes), the designated plane would then include a plurality of bounding boxes each corresponding to a unique, non-duplicated feature detection. These bounding boxes on the designated plane may then be counted to yield the counts for the features corresponding to the bounding boxes.

[00123] Block 712 includes the one or more processors calculating a final count of the

36

SUBSTITUTE SHEET (RULE 26) identified features. The final count should be free from duplicate counts resulting from duplicate detections of a same feature in multiple images, wherein the duplicates have been removed due to accurate projection of the imaged features onto the depth mesh 520. The final count of the detected features would correspond to the number of bounding boxes corresponding to those features depicted on the designated plane (blocks 708-710) in the orthographic perspective. In other words, due to the de-duplication, the number of bounding boxes remaining after the projection(s) in blocks 708-710 correspond to the number of features imaged. In some embodiments, the second set of camera locations is utilized to determine the final feature count. Using the known location of the camera from the second set of camera locations and intrinsic parameters of the camera 306 (e g., its field of view, projection matrix, etc.), the spatial location of each feature identified can be mapped. Mapping of these features could include but does not require generating a stitched image.

[00124] The final count for a given feature may comprise the total number of identifications of that feature detected at different locations, wherein the locations of a given detection may be determined by the size of the corresponding bounding box 606. A different location, in this context, more specifically means non-overlapping regions on the mesh 520 (i.e., two bounding boxes 606 projected onto the place on the mesh 520 would produce one count). In some embodiments, a threshold tolerance may be implemented which would resolve two overlapping bounding boxes projected onto two slightly different, yet overlapping, locations on the mesh 520. Consider, for now, that the two bounding boxes correspond to the same feature being detected. A first threshold may be utilized if the overlap is substantial (e.g., 90% or more) to resolve the two bounding boxes as a single count of the feature if their overlap is equal to or above than the first threshold. A second threshold may be utilized if the overlap is minimal (e.g., 20% or less) to resolve the two bounding boxes as two separate objects to be counted twice if their overlap is equal to or less than the second threshold. If the two bounding boxes correspond to different features, they both contribute to the count for those respective two features if they overlap less than the second threshold. Since the correspondence matching is performed using visual features, rarely should two different feature detections have conflicting locations during projection, however if they are above the first threshold in overlap, an error or “unresolved” count should be denoted at this location. Incorrect predictions often are more troublesome than “unresolved” predictions as there is a potential to mislead. The threshold tolerance would account for small reprojection errors as a result of the mesh 520 being an approximation of the depth of scene without adding additional counts of the features. The percentage overlap could be along a vertical or horizontal axis, or as a percentage of the areas of the bounding boxes 606.

[00125] Stated another way, the one or more processors of the server 202 have, due to the projection of the images onto the mesh 520, identified the spatial location of each bounding box 606

37

SUBSTITUTE SHEET (RULE 26) for each identified feature. Due to image overlap, a plurality of feature detections of a single physical object within a plurality of overlapping images thereof would be projected onto the same location on the mesh 520, wherein the processors may determine that only one such feature exists within the spatial region occupied by those bounding boxes on the mesh 520.

[00126] It will be recognized that while certain aspects of the disclosure are described in terms of a specific sequence of steps of a method, these descriptions are only illustrative of the broader methods of the disclosure, and may be modified as required by the particular application. Certain steps may be rendered unnecessary or optional under certain circumstances. Additionally, certain steps or functionality may be added to the disclosed embodiments, or the order of performance of two or more steps permuted. All such variations are considered to be encompassed within the disclosure disclosed and claimed herein.

[00127] While the above detailed description has shown, described, and pointed out novel features of the disclosure as applied to various exemplary embodiments, it will be understood that various omissions, substitutions, and changes in the form and details of the device or process illustrated may be made by those skilled in the art without departing from the disclosure. The foregoing description is of the best mode presently contemplated of carrying out the disclosure. This description is in no way meant to be limiting, but rather should be taken as illustrative of the general principles of the disclosure. The scope of the disclosure should be determined with reference to the claims.

[00128] While the disclosure has been illustrated and described in detail in the drawings and foregoing description, such illustration and description are to be considered illustrative or exemplary and not restrictive. The disclosure is not limited to the disclosed embodiments. Variations to the disclosed embodiments and/or implementations may be understood and effected by those skilled in the art in practicing the claimed disclosure, from a study of the drawings, the disclosure and the appended claims.

[00129] It should be noted that the use of particular terminology when describing certain features or aspects of the disclosure should not be taken to imply that the terminology is being redefined herein to be restricted to include any specific characteristics of the features or aspects of the disclosure with which that terminology is associated. Terms and phrases used in this application, and variations thereof, especially in the appended claims, unless otherwise expressly stated, should be construed as open ended as opposed to limiting. As examples of the foregoing, the term “including” should be read to mean “including, without limitation,” “including but not limited to,” or the like; the term “comprising” as used herein is synonymous with “including,” “containing,” or “characterized by,” and is inclusive or open-ended and does not exclude additional, unrecited elements or method steps; the term “having” should be interpreted as “having at least;” the term “such as” should be interpreted

38

SUBSTITUTE SHEET (RULE 26) as "such as, without limitation;” the term "includes” should be interpreted as “includes but is not limited to;” the term “example” is used to provide exemplary instances of the item in discussion, not an exhaustive or limiting list thereof, and should be interpreted as "‘example, but without limitation;” adjectives such as “known,” “normal,” “standard,” and terms of similar meaning should not be construed as limiting the item described to a given time period or to an item available as of a given time, but instead should be read to encompass known, normal, or standard technologies that may be available or known now or at any time in the future; and use of terms like “preferably,” “preferred,” “desired,” or “desirable,” and words of similar meaning should not be understood as implying that certain features are critical, essential, or even important to the structure or function of the present disclosure, but instead as merely intended to highlight alternative or additional features that may or may not be utilized in a particular embodiment. Likewise, a group of items linked with the conjunction “and” should not be read as requiring that each and every one of those items be present in the grouping, but rather should be read as “and/or” unless expressly stated otherwise. Similarly, a group of items linked with the conjunction “or” should not be read as requiring mutual exclusivity among that group, but rather should be read as “and/or” unless expressly stated otherwise. The terms “about” or “approximate” and the like are synonymous and are used to indicate that the value modified by the term has an understood range associated with it, where the range may be ±20%, ±15%, ±10%, ±5%, or ±1%. The term “substantially” is used to indicate that a result (e.g., measurement value) is close to a targeted value, where close may mean, for example, the result is within 80% of the value, within 90% of the value, within 95% of the value, or within 99% of the value. Also, as used herein “defined” or “determined” may include “predefined” or “predetermined” and/or otherwise determined values, conditions, thresholds, measurements, and the like.

SUBSTITUTE SHEET (RULE 26)

Claims

WHAT IS CLAIMED IS:

1. A method for generating counts of features detected by a robot, the method comprising one or more processors of a server: receiving a set of one or more images from the robot, wherein the set of images are each localized to a first set of corresponding camera locations; performing a bundle adjustment process to determine a depth of a plurality of key points in the set of images, the bundle adjustment process yielding a second set of camera locations; constructing a mesh via the plurality of key points based in part on a depth of each key point of the plurality of key points; detecting one or more features in the set of images; projecting one or more regions occupied by the one or more detected features onto the mesh using a camera projection matrix and the second set of camera locations; and determining a set of counts, the set of counts comprising a number of each of the one or more features, wherein the counts are based on a total number of the respective one or more features projected to different locations on the mesh.

2. The method of Claim 1, further comprising: discretizing the mesh into at least one of a plurality of regions and a plurality of pixels; determining a color value of at least one of the plurality of regions and pixels of the mesh by projecting pixel color values of the images onto the mesh from the second set of camera locations, wherein the determined color value of the at least one of the plurality of regions and pixels of the mesh is based on color values of all pixels projected thereon.

3. The method of Claim 2, further comprising: projecting the pixel color values of at least one of the plurality of regions and pixels of the mesh onto a designated plane to produce an orthographic panoramic perspective, the projection comprising an orthogonal projection onto the designated plane.

40

SUBSTITUTE SHEET (RULE 26)

4. The method of Claim 3, further comprising: darkening one or more pixels of the orthographic panoramic perspective based on depth of the one or more pixels, wherein depth and darkness of the one or more pixels is directly related.

5. The method of Claim 1, further comprising: determining one or more key point correspondences between the images when performing the bundle adjustment process, wherein determining one or more key point correspondences comprises identifying a first key point in a first image of the set of one or more images and identifying a second key point in a second image of the set of one or more images, wherein the first key point and the second key point key point depict a same feature; and determining the second set of camera locations based on an image-space location of the first key point and an image-space location of the second key point using epipolar geometry; wherein the correspondences used by the bundle adjustment process are removed if the resulting second set of camera locations deviates from the first set of camera locations by greater than a threshold amount.

6. The method of Claim 1, further comprising, determining that two or more feature detections overlap on the mesh following the projection; comparing the overlap to a first threshold, wherein the overlap being greater than the first threshold resolves the two or more detections as a singular count; and comparing the overlap to a second threshold, wherein the overlap being less than the second threshold resolves the two or more detections as two or more counts.

7. A robotic system for generating counts of features detected by the robotic system, comprising: a memory comprising computer readable instructions stored thereon; and at least one processor configured to execute the computer readable instructions to, receive a set of one or more images from a robot, wherein the set of images are each localized to a first set of corresponding camera locations; perform a bundle adjustment process to determine a depth of a

41

SUBSTITUTE SHEET (RULE 26) plurality of key points in the set of images, the bundle adjustment process yielding a second set of camera locations; construct a mesh via the plurality of key points based in part on a depth of each key point of the plurality of key points; detect one or more features in the set of images; project one or more regions occupied by the one or more detected features onto the mesh using a camera projection matrix and the second set of camera locations; and determine a set of counts, the set of counts comprising a number of each of the one or more features, wherein the counts are based on a total number of the respective one or more features projected to different locations on the mesh.

8. The robotic system of Claim 7, wherein the at least one processor is further configured to execute the computer readable instructions to, discretize the mesh into at least one of a plurality of regions and a plurality of pixels; determine a color value of at least one of the plurality of regions and pixels of the mesh by projecting pixel color values of the images onto the mesh from the second set of camera locations, wherein the determined color value of the at least one of the plurality of regions and pixels of the mesh is based on color values of all pixels projected thereon.

9. The robotic system of Claim 8, wherein the at least one processor is further configured to execute the computer readable instructions to, project the pixel color values of at least one of the plurality of regions and pixels of the mesh onto a designated plane to produce an orthographic panoramic perspective, the projection comprising an orthogonal projection onto the designated plane.

10. The robotic system of Claim 9, wherein the at least one processor is further configured to execute the computer readable instructions to, darken one or more pixels of the orthographic panoramic perspective based on depth of the one or more pixels, wherein depth and darkness of the one or more pixels

42

SUBSTITUTE SHEET (RULE 26) is directly related.

11. The robotic system of Claim 7, wherein the at least one processor is further configured to execute the computer readable instructions to, determine one or more key point correspondences between the images when performing the bundle adjustment process, wherein determining one or more key point correspondences comprises identifying a first key point in a first image of the set of one or more images and identifying a second key point in a second image of the set of one or more images, wherein the first key point and the second key point key point depict a same feature; and determine the second set of camera locations based on an image-space location of the first key point and an image-space location of the second key point using epipolar geometry; wherein the correspondences used by the bundle adjustment process are removed if the resulting second set of camera locations deviates from the first set of camera locations by greater than a threshold amount.

12. The robotic system of Claim 7, wherein the at least one processor is further configured to execute the computer readable instructions to, determine that two or more feature detections overlap on the mesh following the projection; compare the overlap to a first threshold, wherein the overlap being greater than the first threshold resolves the two or more detections as a singular count; and comparing the overlap to a second threshold, wherein the overlap being less than the second threshold resolves the two or more detections as two or more counts.

13. A non-transitory computer readable medium comprising computer readable instructions stored thereon, that when executed by at least one processor configure the at least one processor to, receive a set of one or more images from a robot, wherein the set of images are each localized to a first set of corresponding camera locations; perform a bundle adjustment process to determine a depth of a plurality of key points in the set of images, the bundle adjustment process yielding a second set of camera locations;

43

SUBSTITUTE SHEET (RULE 26) construct a mesh via the plurality of key points based in part on a depth of each key point of the plurality of key points; detect one or more features in the set of images; project one or more regions occupied by the one or more detected features onto the mesh using a camera projection matrix and the second set of camera locations; and determine a set of counts, the set of counts comprising a number of each of the one or more features, wherein the counts are based on a total number of the respective one or more features projected to different locations on the mesh.

44

SUBSTITUTE SHEET (RULE 26)