HK1171531A1 - Detection of body and props - Google Patents
- Publication number
- HK1171531A1 (application HK12112171.2A)
- Authority
- HK
- Hong Kong
- Prior art keywords
- target
- tracking
- prop
- depth
- pixels
Landscapes
- User Interface Of Digital Computer (AREA)
- Image Analysis (AREA)
Abstract
The present invention discloses detection of body and props. A system and method for detecting and tracking targets including body parts and props is described. In one aspect, the disclosed technology acquires one or more depth images, generates one or more classification maps associated with one or more body parts and one or more props, tracks the one or more body parts using a skeletal tracking system, tracks the one or more props using a prop tracking system, and reports metrics regarding the one or more body parts and the one or more props. In some embodiments, feedback may occur between the skeletal tracking system and the prop tracking system.
Description
Technical Field
The present invention relates to computer applications, and more particularly to object detection techniques.
Priority Claim
The present application claims priority to United States patent application No. 12/454,628, entitled "Human Body Pose Estimation," filed May 20, 2009, which in turn claims priority to provisional patent application No. 61/174,878, entitled "Human Body Pose Estimation," filed May 1, 2009. The entire contents of each of the above applications are incorporated herein by reference.
Background
In a typical computing environment, a user of a computing application, such as a multimedia application or a computer game, uses an input device to control aspects of the computing application. Common input devices used to control computing applications include controllers, keyboards, joysticks, remote controls, mice, and the like. Recently, computing gaming applications have begun to use cameras and gesture recognition software to provide a natural user interface. Using a natural user interface, a user's body parts and movements can be detected, interpreted, and used to control a game character or other aspect of a computing application.
Disclosure of Invention
Techniques for detecting, analyzing, and tracking targets, including body parts and props, are described. In one embodiment, the natural user interface system includes a target detection and tracking system. In one embodiment, the target detection and tracking system includes a target suggestion system and a target tracking system. The target suggestion system identifies one or more candidate body parts and one or more candidate prop locations within a particular field of view. In one example, the target suggestion system assigns a probability of belonging to one or more candidate body parts and/or props to one or more pixels in a particular depth image. Since the target suggestion system may generate many false positives, the target tracking system is used to coordinate one or more candidate body parts and/or props and correctly output the identified body parts and/or props.
In one embodiment, the disclosed technology obtains one or more depth images, generates one or more classification maps associated with one or more body parts and one or more props, tracks the one or more body parts using a skeletal tracking system, tracks the one or more props using a prop tracking system, and reports metrics related to the one or more body parts and the one or more props. In some embodiments, feedback may occur between the skeletal tracking system and the prop tracking system.
In some embodiments, the physical movement of one or more game players holding one or more items (e.g., game items such as plastic toy sword or guitar) is tracked and interpreted as real-time user controls that adjust and/or control portions of the electronic game. For example, a game player holding a real tennis racket or similar physical object may control the virtual racket in real time in the game space while playing a virtual tennis game.
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
Drawings
FIGS. 1A and 1B depict one embodiment of a target detection and tracking system that tracks users.
FIG. 1C depicts one embodiment of a target detection and tracking system that tracks users.
FIG. 2 depicts one embodiment of a target detection and tracking system.
FIG. 3 illustrates an example embodiment of a depth image.
FIG. 4 illustrates an example of a computing environment in accordance with embodiments of the invention.
FIG. 5 illustrates an example of a computing environment in accordance with embodiments of the invention.
FIG. 6A is a flow chart describing one embodiment of a process for detecting and tracking one or more targets.
FIG. 6B is a flow chart describing one embodiment of a process for generating one or more classification maps.
FIG. 6C is a flow chart describing one embodiment of a process for generating one or more classification maps.
Fig. 7 depicts an original image and a corresponding segmented image.
FIG. 8 depicts three training images that have been modified with a 3-D model.
Fig. 9A-9C depict depth images and corresponding segmented images.
Detailed Description
Techniques for detecting, analyzing, and tracking targets, including body parts and props, are described. In one embodiment, the natural user interface system includes a target detection and tracking system. In one embodiment, the target detection and tracking system includes a target suggestion system and a target tracking system. The target suggestion system identifies one or more candidate body parts and one or more candidate prop locations within a particular field of view. In one example, the target suggestion system assigns a probability of belonging to one or more candidate body parts and/or props to one or more pixels in a particular depth image. Since the target suggestion system may generate many false positives, the target tracking system is used to coordinate one or more candidate body parts and/or props and correctly output the identified body parts and/or props.
FIGS. 1A and 1B depict one embodiment of the target detection and tracking system 10 with a user 18 playing a boxing game. The target detection and tracking system 10 may be used to detect, identify, analyze, and/or track human targets, such as user 18, and/or non-human targets, such as props (not shown) held by user 18.
As shown in FIG. 1A, the target detection and tracking system 10 may include a computing environment 12. The computing environment 12 may include a computer, a gaming system or console, and so forth. In one embodiment, the computing environment 12 may include hardware components and/or software components such that the computing environment 12 may be used to execute an operating system and applications such as gaming applications, non-gaming applications, and the like. In one embodiment, the computing environment 12 may include a processor, such as a standardized processor, a specialized processor, a microprocessor, or the like, that may execute instructions stored on a processor-readable storage device for performing the processes described herein.
As shown in FIG. 1A, the target detection and tracking system 10 may also include a capture device 20. In one embodiment, the capture device 20 may be used to visually monitor one or more targets, including one or more users such as user 18. Gestures (including poses) performed by the one or more users may be captured, analyzed, and tracked in order to perform one or more controls or actions on a user interface of an operating system or application.
The user may create gestures by moving his or her body. A gesture may comprise a motion or pose of the user that can be captured as image data and parsed for meaning. A gesture may be dynamic, comprising motion such as mimicking a pitch. A gesture may also be a static pose, such as holding one's forearms crossed. A gesture may also incorporate a prop, such as swinging a mock sword.
In one embodiment, the capture device 20 may capture image and audio data related to one or more users and/or objects. For example, the capture device 20 may be used to capture information related to partial or full body movements, gestures, and speech of one or more users. The information captured by the capture device 20 may be received by the computing environment 12 and/or a processing element within the capture device 20 and used to present, interact with, and control aspects of the gaming or other application. In one example, the capture device 20 captures image and audio data related to a particular user, and the computing environment 12 processes the captured information to identify the particular user by executing facial and voice recognition software.
In one embodiment, the target detection and tracking system 10 may be connected to an audiovisual device 16, such as a television, a monitor, a high-definition television (HDTV), or the like, that may provide game or application visuals and/or audio to a user, such as the user 18. For example, the computing environment 12 may include a video adapter such as a graphics card and/or an audio adapter such as a sound card that may provide audiovisual signals associated with a game application, non-game application, or the like. The audiovisual device 16 may receive the audiovisual signals from the computing environment 12 and may then output game or application visuals and/or audio associated with the audiovisual signals to the user 18. In one embodiment, the audiovisual device 16 may be connected to the computing environment 12 via, for example, an S-video cable, a coaxial cable, an HDMI cable, a DVI cable, a VGA cable, or the like.
As shown in FIGS. 1A and 1B, the application executing on the computing environment 12 may be a boxing game that the user 18 may be playing. The computing environment 12 may use the audiovisual device 16 to provide a visual representation of a boxing opponent 22 to the user 18. The computing environment 12 may also use the audiovisual device 16 to provide a visual representation of a player avatar 24 that the user 18 may control with his or her movements. For example, as shown in FIG. 1B, the user 18 may throw a punch in physical space to cause the player avatar 24 to throw a punch in game space. In one embodiment, the computing environment 12 and the capture device 20 of the target detection and tracking system 10 may be used to recognize and analyze the punch of the user 18 in physical space such that the punch may be interpreted as a game control of the player avatar 24 in game space.
In one embodiment, the user movements may be interpreted as controls that may correspond to actions other than controlling the player avatar 24. For example, the user 18 may use the movements to end the game, pause the game, save the game, select a level, view high scores, communicate with friends, and so forth. In another embodiment, the target detection and tracking system 10 interprets movement of the target as operating system and/or application control outside the realm of gaming. For example, virtually any controllable aspect of an operating system and/or application may be controlled by movement of an object, such as user 18. In another embodiment, the user 18 may use the movement to select a game or other application from the main user interface. Thus, the full range of motion of user 18 may be obtained, used, and analyzed in any suitable manner to interact with an application or operating system.
As shown in FIG. 1C, a human target such as user 18 may hold an object such as a racket 21. In one embodiment, user 18 may hold an object, such as a prop, while interacting with the application. In such embodiments, the movement of both the person and the object may be used to control the application. For example, the motion of a player holding the racket 21 may be tracked and utilized to control an on-screen racket 23 in an application simulating a tennis game. In another embodiment, the motion of a player holding a toy weapon, such as a plastic sword, may be tracked and utilized to control a corresponding weapon in an electronic combat game. In certain embodiments, other objects, including one or more gloves, balls, bats, clubs, guitars, microphones, sticks, pets, animals, drums, and the like, may also be tracked. The tracked object may map closely to a particular game or application (e.g., a real tennis racket used in a virtual tennis game) or may be a more abstract representation (e.g., a torch or flashlight representing a lightsaber).
In certain embodiments, one or more of the objects tracked by the target detection and tracking system 10 may be active objects. An active object may include one or more sensors for providing information, such as acceleration or orientation information, to the target detection and tracking system 10. In contrast, inactive (passive) objects provide no additional information to the target detection and tracking system 10. The ability to combine visual tracking information with real-time position, acceleration, and/or orientation information from an active object may allow the target detection and tracking system 10 to improve its target tracking performance, especially when motion blur is an issue because the capture device is capturing high-speed movements (e.g., swinging a baseball bat). In one embodiment, an active prop includes accelerometers, magnetometers, and gyroscopes, and transmits acceleration, magnetic field, and orientation information to the target detection and tracking system.
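For illustration only, the following Python sketch shows one way an active prop's sensor report could be combined with a visually tracked estimate. The packet fields, the complementary-filter blend, and the parameter names are assumptions made for this example; they are not taken from the patent.

```python
# Hypothetical sketch (not from the patent): blending an active prop's sensor
# report with a visually tracked orientation estimate.
from dataclasses import dataclass

@dataclass
class ActivePropPacket:
    acceleration: tuple[float, float, float]    # m/s^2 from the accelerometer
    magnetic_field: tuple[float, float, float]  # microtesla from the magnetometer
    orientation: tuple[float, float, float]     # roll, pitch, yaw in radians from the gyroscope

def fuse_yaw(visual_yaw: float, packet: ActivePropPacket, alpha: float = 0.7) -> float:
    """Blend the camera-derived yaw with the prop-reported yaw.

    A larger alpha trusts the onboard sensor more, which helps when motion
    blur degrades the visual estimate (e.g., a fast baseball-bat swing).
    """
    sensor_yaw = packet.orientation[2]
    return alpha * sensor_yaw + (1.0 - alpha) * visual_yaw
```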
In some embodiments, one or more of the objects tracked by the target detection and tracking system 10 may be inactive objects. In one embodiment, inactive objects may be augmented with one or more markers, such as IR retro-reflective markers, to improve object detection and tracking. In another embodiment, both inactive and active props may be augmented with one or more IR retro-reflective markers.
Suitable examples of the target detection and tracking system 10 and its components are found in the following co-pending patent applications, all of which are hereby incorporated by reference: U.S. patent application serial No. 12/475,094, entitled "Environment And/Or Target Segmentation," filed May 29, 2009; U.S. patent application serial No. 12/511,850, entitled "Auto Generating a Visual Representation," filed July 29, 2009; U.S. patent application serial No. 12/474,655, entitled "Gesture Tool," filed May 29, 2009; U.S. patent application serial No. 12/603,437, entitled "Pose Tracking Pipeline," filed October 21, 2009; U.S. patent application serial No. 12/475,308, entitled "Device for Identifying and Tracking Multiple Humans Over Time," filed May 29, 2009; U.S. patent application serial No. 12/575,388, entitled "Human Tracking System," filed October 7, 2009; U.S. patent application serial No. 12/422,661, entitled "Gesture Recognizer System Architecture," filed April 13, 2009; U.S. patent application serial No. 12/391,150, entitled "Standard Gestures," filed February 23, 2009; and U.S. patent application serial No. 12/474,655, entitled "Gesture Tool," filed May 29, 2009.
FIG. 2 illustrates one embodiment of the target detection and tracking system 10 including a capture device 20 and a computing environment 12. The target detection and tracking system 10 may be used to recognize human and non-human targets in a capture area (with or without specialized sensing devices attached to the subjects), uniquely identify them, and track them in three-dimensional space. In one embodiment, the capture device 20 may be a depth camera (or depth sensing camera) configured to capture video with depth information, including a depth image that may include depth values, via any suitable technique including, for example, time-of-flight, structured light, stereo imaging, or the like. In one embodiment, capture device 20 may include a depth sensing image sensor. In one embodiment, the capture device 20 may organize the calculated depth information into "Z layers," or layers that may be perpendicular to a Z axis extending from the depth camera along its line of sight.
As shown in FIG. 2, the capture device 20 may include an image camera component 32. In one embodiment, the image camera component 32 may be a depth camera that may capture a depth image of a scene. The depth image may include a two-dimensional (2-D) pixel area of the captured scene where each pixel in the 2-D pixel area may represent a depth value, such as a distance in, for example, centimeters, millimeters, or the like of an object in the captured scene from the camera.
As shown in FIG. 2, the image camera component 32 may include an IR light component 34, a three-dimensional (3-D) camera 36, and an RGB camera 38 that may be used to capture the depth image of the capture area. For example, in time-of-flight analysis, the IR light component 34 of the capture device 20 may emit infrared light onto the capture area and may then use sensors to detect the backscattered light from the surfaces of one or more targets and objects in the capture area with, for example, the 3-D camera 36 and/or the RGB camera 38. In certain embodiments, the capture device 20 may include an IR CMOS image sensor. In some embodiments, pulsed infrared light may be used so that the time difference between an outgoing light pulse and a corresponding incoming light pulse may be measured and used to determine a physical distance from the capture device 20 to a particular location on a target or object in the capture area. Furthermore, the phase of the outgoing light wave may be compared to the phase of the incoming light wave to determine a phase shift. The phase shift may then be used to determine a physical distance from the capture device to a particular location on the targets or objects.
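As a hedged illustration of the two time-of-flight relationships described above (pulse round-trip timing and phase shift of a modulated wave), the following Python sketch converts each measurement to a distance; the modulation frequency and example values are arbitrary.

```python
import math

C = 299_792_458.0  # speed of light in m/s

def distance_from_pulse(round_trip_seconds: float) -> float:
    """Pulsed time-of-flight: the light travels out and back, so halve the path."""
    return C * round_trip_seconds / 2.0

def distance_from_phase_shift(phase_shift_rad: float, modulation_hz: float) -> float:
    """Continuous-wave time-of-flight: a phase shift of 2*pi corresponds to one
    full modulation wavelength of round-trip travel."""
    wavelength = C / modulation_hz
    return (phase_shift_rad / (2.0 * math.pi)) * wavelength / 2.0

# Example: a 1.5 rad shift at 30 MHz modulation is roughly 1.19 m
print(distance_from_phase_shift(1.5, 30e6))
```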
In one embodiment, time-of-flight analysis may be used to indirectly determine a physical distance from the capture device 20 to a particular location on the targets or objects by analyzing the intensity of the reflected beam of light over time via various techniques including, for example, shuttered light pulse imaging.
In another example, the capture device 20 may use structured light to capture depth information. In this analysis, patterned light (i.e., light displayed as a known pattern such as a grid pattern or a stripe pattern) may be projected onto the capture area via, for example, the IR light component 34. Upon striking the surface of one or more targets or objects in the capture area, the pattern may become deformed in response. Such a deformation of the pattern may be captured by, for example, the 3-D camera 36 and/or the RGB camera 38 and analyzed to determine a physical distance from the capture device to a particular location on the targets or objects.
In some embodiments, two or more different cameras may be incorporated into an integrated capture device. For example, a depth camera and a video camera (e.g., an RGB video camera) may be incorporated into a common capture device. In some embodiments, two or more separate capture devices may be used cooperatively. For example, a depth camera and a separate video camera may be used. When a video camera is used, the video camera may be used to provide: target tracking data, confirmation data to correct for target tracking, image capture, facial recognition, high precision tracking of a finger (or other small feature), light sensing, and/or other functions.
In one embodiment, the capture device 20 may include two or more physically separated cameras that may view the capture area from different angles to obtain visual stereo data that may be resolved to generate depth information. Depth may also be determined by capturing images using multiple detectors (which may be monochromatic, infrared, RGB) or any other type of detector, and performing parallax calculations. Other types of depth image sensors may also be used to create the depth image.
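The parallax calculation mentioned above can be illustrated with the standard stereo triangulation relation Z = f·B/d; the sketch below is a generic example, not the capture device's actual calibration pipeline.

```python
def depth_from_disparity(focal_length_px: float,
                         baseline_m: float,
                         disparity_px: float) -> float:
    """Classic pinhole stereo relation: Z = f * B / d.

    focal_length_px: focal length expressed in pixels
    baseline_m: distance between the two camera centers in meters
    disparity_px: horizontal shift of the same point between the two images
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_length_px * baseline_m / disparity_px

# Example: f = 580 px, baseline = 7.5 cm, disparity = 20 px -> about 2.18 m
print(depth_from_disparity(580.0, 0.075, 20.0))
```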
As shown in FIG. 2, the capture device 20 may include a microphone 40. Microphone 40 may include a transducer or sensor that may receive sound and convert it into an electrical signal. In one embodiment, the microphone 40 may be used to reduce feedback between the capture device 20 and the computing environment 12 in the target detection and tracking system 10. Additionally, the microphone 40 may be used to receive audio signals that may also be provided by the user to control applications such as gaming applications, non-gaming applications, etc. that may be executed by the computing environment 12.
In one embodiment, the capture device 20 may include a processor 42 that may be in operative communication with the image camera component 32. The processor 42 may include a standard processor, a special purpose processor, a microprocessor, or the like. Processor 42 may execute instructions that may include instructions for storing a profile, receiving a depth image, determining whether a suitable target may be included in a depth image, converting a suitable target into a skeletal representation or model of the target, or any other suitable instruction.
It will be appreciated that at least some of the target analysis and tracking operations may be performed by processors contained within one or more capture devices. The capture device may include one or more onboard processing units configured to perform one or more target analysis and/or tracking functions. Further, the capture device may include firmware that facilitates updating such onboard processing logic.
As shown in FIG. 2, the capture device 20 may include a memory component 44, and the memory component 44 may store instructions executable by the processor 42, images or frames of images captured by a 3-D camera or an RGB camera, a user profile, or any other suitable information, images, or the like. In one example, the memory component 44 may include Random Access Memory (RAM), Read Only Memory (ROM), cache, flash memory, a hard disk, or any other suitable storage component. As shown in FIG. 2, the memory component 44 may be a separate component in communication with the image capture component 32 and the processor 42. In another embodiment, the memory component 44 may be integrated into the processor 42 and/or the image capture component 32. In one embodiment, some or all of the components 32, 34, 36, 38, 40, 42, and 44 of the capture device 20 shown in FIG. 2 are housed in a single housing.
As shown in FIG. 2, the capture device 20 may communicate with the computing environment 12 via a communication link 46. The communication link 46 may be a wired connection including, for example, a USB connection, a firewire connection, an ethernet cable connection, etc., and/or a wireless connection such as a wireless 802.11b, 802.11g, 802.11a, or 802.11n connection, etc. The computing environment 12 may provide a clock to the capture device 20 that may be used to determine when to capture, for example, a scene via the communication link 46.
In one embodiment, the capture device 20 may provide depth information and images captured by, for example, the 3-D camera 36 and/or the RGB camera 38 to the computing environment 12 via the communication link 46. The computing environment 12 may then use the depth information and captured images to, for example, create virtual screens, change user interfaces, and control applications such as games or word processors.
As shown in FIG. 2, the computing environment 12 includes a gestures library 192, structural data 198, a gesture recognition engine 190, a depth image processing and object reporting module 194, and an operating system 196. The depth image processing and object reporting module 194 uses the depth images to track the motion of objects, such as users and other objects. To assist in tracking objects, the depth image processing and object reporting module 194 uses the gestures library 192, the structural data 198, and the gesture recognition engine 190.
In one example, the structure data 198 includes structural information about objects that can be tracked. For example, a skeletal model of a human may be stored to help understand the user's movements and recognize body parts. In another example, structural information about inanimate objects (such as props) may also be stored to help identify these objects and to help understand movement.
In one example, the gestures library 192 may include a collection of gesture filters, each comprising information about a gesture that may be performed by the skeletal model. The gesture recognition engine 190 may compare the data captured by the capture device 20, in the form of the skeletal model and the movements associated with it, to the gesture filters in the gestures library 192 to identify when a user (as represented by the skeletal model) has performed one or more gestures. Those gestures may be associated with various controls of an application. Thus, the computing environment 12 may use the gesture recognition engine 190 to interpret movements of the skeletal model and to control the operating system 196 or an application based on those movements.
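For illustration, a toy "gesture filter" might look like the following Python sketch. The joint naming, frame format, and threshold are hypothetical; the actual gestures library 192 and gesture recognition engine 190 are not specified at this level of detail here.

```python
# Illustrative sketch only: a toy gesture filter that checks whether a tracked
# right hand rises fast enough to count as a "raise hand" gesture.
from typing import Mapping, Sequence

Joint = tuple[float, float, float]          # (x, y, z) in meters, camera space
SkeletonFrame = Mapping[str, Joint]         # joint name -> position for one frame

def raise_hand_filter(frames: Sequence[SkeletonFrame],
                      min_rise_m: float = 0.25) -> bool:
    """Return True if the right hand rose at least min_rise_m over the window."""
    if len(frames) < 2:
        return False
    start_y = frames[0]["hand_right"][1]
    end_y = frames[-1]["hand_right"][1]
    return (end_y - start_y) >= min_rise_m
```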
In one embodiment, depth image processing and object reporting module 194 may report the identity of each object detected and the position and/or orientation of the object for each frame to operating system 196. The operating system 196 will use this information to update the position or movement of objects (e.g., avatars) or other images in the display, or to perform actions on a provided user interface.
For more information about the gesture recognition engine 190, see U.S. patent application 12/422,661, "Gesture Recognizer System Architecture," filed April 13, 2009, which is incorporated herein by reference in its entirety. More information about recognizing gestures may be found in U.S. patent application 12/391,150, "Standard Gestures," filed February 23, 2009, and U.S. patent application 12/474,655, "Gesture Tool," filed May 29, 2009, both of which are incorporated herein by reference in their entirety. More information on motion detection and tracking can be found in U.S. patent application 12/641,788, "Motion Detection Using Depth Images," filed December 18, 2009, and U.S. patent application 12/475,308, "Device for Identifying and Tracking Multiple Humans Over Time," both of which are incorporated herein by reference in their entirety.
FIG. 3 illustrates an example embodiment of a depth image 60 that may be received by a target detection and tracking system, such as the target detection and tracking system 10 and/or the computing environment 12 of FIGS. 1A-1C. In one embodiment, the depth image 60 may be an image or frame of a scene captured by the 3-D camera 36 and/or the RGB camera 38 of the capture device 20, such as described above with reference to FIG. 2. As shown in fig. 3, depth image 60 may include a human target 62 and one or more non-human targets 64 in a captured scene, such as a wall, table, monitor, and so forth. In one example, depth image 60 may include a plurality of observed pixels, where each observed pixel has an associated depth value. For example, the depth image 60 may include a two-dimensional (2-D) pixel area of the captured scene where each pixel in the 2-D pixel area may represent a depth value, such as a length or distance, e.g., in centimeters, millimeters, or the like, of an object or target in the captured scene from the capture device.
Referring back to FIG. 2, in one embodiment, once a depth image is received, the depth image may be downsampled to a lower processing resolution so that the depth image may be more easily used and/or processed faster with less computational overhead. In addition, one or more highly variable and/or noisy depth values may be removed and/or smoothed from the depth image, and portions of missing and/or removed depth information may be filled in and/or reconstructed. In one embodiment, a depth image (such as depth image 60) may be downsampled for use in combination with an image from an RGB camera (such as camera 38) or an image captured by any other detector to determine the shape and size of the target.
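A minimal sketch of the downsampling and noise-smoothing steps described above might look like the following; the 3x3 median fill and the use of zero as a missing-depth sentinel are illustrative choices, not the claimed processing.

```python
import numpy as np

def downsample_depth(depth: np.ndarray, factor: int = 2) -> np.ndarray:
    """Reduce resolution by keeping every factor-th pixel in each dimension."""
    return depth[::factor, ::factor]

def smooth_invalid_depth(depth: np.ndarray, invalid: int = 0) -> np.ndarray:
    """Fill pixels with missing depth (value `invalid`) using the median of
    their valid 3x3 neighbors, leaving other pixels untouched."""
    out = depth.copy()
    h, w = depth.shape
    for y in range(h):
        for x in range(w):
            if depth[y, x] != invalid:
                continue
            window = depth[max(0, y - 1):y + 2, max(0, x - 1):x + 2]
            valid = window[window != invalid]
            if valid.size:
                out[y, x] = int(np.median(valid))
    return out
```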
FIG. 4 illustrates an example of a computing environment that may be used to implement the computing environment 12 of FIG. 2, including a multimedia console (or gaming console) 100. As shown in FIG. 4, the multimedia console 100 has a Central Processing Unit (CPU) 101 having a level one cache 102, a level two cache 104, and a flash ROM (Read Only Memory) 106. The level one cache 102 and the level two cache 104 temporarily store data and thus reduce the number of memory access cycles, thereby improving processing speed and throughput. The CPU 101 may be provided with more than one core, and thus additional level one and level two caches 102 and 104. The flash ROM 106 may store executable code that is loaded during an initial phase of a boot process when the multimedia console 100 is powered ON.
A Graphics Processing Unit (GPU) 108 and a video encoder/video codec (coder/decoder) 114 form a video processing pipeline for high speed and high resolution graphics processing. Data is carried from the graphics processing unit 108 to the video encoder/video codec 114 via a bus. The video processing pipeline outputs data to an A/V (audio/video) port 140 for transmission to a television or other display. A memory controller 110 is connected to the GPU 108 to facilitate processor access to various types of memory 112, such as, but not limited to, RAM (Random Access Memory).
The multimedia console 100 includes an I/O controller 120, a system management controller 122, an audio processing unit 123, a network interface controller 124, a first USB host controller 126, a second USB controller 128, and a front panel I/O subassembly 130 that are preferably implemented on a module 118. The USB controllers 126 and 128 serve as hosts for peripheral controllers 142(1) -142(2), a wireless adapter 148, and an external memory device 146 (e.g., flash memory, external CD/DVD ROM drive, removable media, etc.). The network interface 124 and/or wireless adapter 148 provide access to a network (e.g., the Internet, home network, etc.) and may be any of a wide variety of various wired or wireless adapter components including an Ethernet card, a modem, a Bluetooth module, a cable modem, and the like.
System memory 143 is provided to store application data that is loaded during the boot process. A media drive 144 is provided and may comprise a DVD/CD drive, hard drive, or other removable media drive, among others. The media drive 144 may be internal or external to the multimedia console 100. Application data may be accessed via the media drive 144 for execution, playback, etc. by the multimedia console 100. The media drive 144 is connected to the I/O controller 120 via a bus, such as a Serial ATA bus or other high speed connection (e.g., IEEE 1394).
The system management controller 122 provides various service functions related to ensuring availability of the multimedia console 100. The audio processing unit 123 and the audio codec 132 form a corresponding audio processing pipeline with high fidelity and stereo processing. Audio data is transmitted between the audio processing unit 123 and the audio codec 132 via a communication link. The audio processing pipeline outputs data to the A/V port 140 for reproduction by an external audio player or device having audio capabilities.
The front panel I/O subassembly 130 supports the functionality of the power button 150 and the eject button 152, as well as any LEDs (light emitting diodes) or other indicators exposed on the outer surface of the multimedia console 100. The system power supply module 136 provides power to the components of the multimedia console 100. A fan 138 cools the circuitry within the multimedia console 100.
The CPU 101, GPU 108, memory controller 110, and various other components within the multimedia console 100 are interconnected via one or more buses, including serial and parallel buses, a memory bus, a peripheral bus, and a processor or local bus using any of a variety of bus architectures. By way of example, these architectures may include a Peripheral Component Interconnect (PCI) bus, a PCI-Express bus, and the like.
When the multimedia console 100 is powered ON, application data may be loaded from the system memory 143 into memory 112 and/or caches 102, 104, and executed on the CPU 101. The application may present a graphical user interface that provides a consistent user experience when navigating to different media types available on the multimedia console 100. In operation, applications and/or other media contained within the media drive 144 may be launched or played from the media drive 144 to provide additional functionalities to the multimedia console 100.
The multimedia console 100 may be operated as a standalone system by simply connecting the system to a television or other display. In the standalone mode, the multimedia console 100 allows one or more users to interact with the system, watch movies, or listen to music. However, with the integration of broadband connectivity made available through the network interface 124 or the wireless adapter 148, the multimedia console 100 may further be operated as a participant in a larger network community.
When the multimedia console 100 is powered on, a set amount of hardware resources may be reserved for system use by the multimedia console operating system. These resources may include a memory reservation (e.g., 16 MB), CPU and GPU cycle reservations (e.g., 5%), a networking bandwidth reservation (e.g., 8 Kbps), and so on. Because these resources are reserved at system boot time, the reserved resources do not exist from the application's perspective.
In particular, the memory reservation is preferably large enough to contain the launch kernel, concurrent system applications, and drivers. The CPU reservation is preferably constant such that if the reserved CPU usage is not used by the system applications, the idle thread will consume any unused cycles.
With regard to the GPU reservation, lightweight messages generated by system applications (e.g., popups) are displayed by using a GPU interrupt to schedule code to render popup into an overlay. The amount of memory required for the overlay depends on the overlay area size, and the overlay preferably scales with the screen resolution. Where the concurrent system application uses a full user interface, it is preferable to use a resolution that is independent of the application resolution. A scaler may be used to set this resolution so that there is no need to change the frequency and cause a TV resynch.
After the multimedia console 100 boots and system resources are reserved, concurrent system applications execute to provide system functionality. The system functions are encapsulated in a set of system applications that execute within the reserved system resources described above. The operating system kernel identifies whether the thread is a system application thread or a game application thread. The system applications are preferably scheduled to run on the CPU 101 at predetermined times and intervals in order to provide a consistent system resource view for the application. The scheduling is done to minimize cache disruption caused by the gaming application running on the console.
When the concurrent system application requires audio, audio processing is scheduled asynchronously with respect to the gaming application due to time sensitivity. The multimedia console application manager controls the gaming application audio level (e.g., mute, attenuate) when system applications are active.
The input devices (e.g., controllers 142(1) and 142(2)) are shared by the gaming application and the system application. Rather than reserving resources, the input devices are switched between the system application and the gaming application so that each has a focus of the device. The application manager preferably controls the switching of the input stream, without the gaming application's knowledge, and a driver maintains state information regarding focus switches. In some embodiments, the capture device 20 of FIG. 2 may be an additional input device to the multimedia console 100.
FIG. 5 illustrates another example of a computing environment that may be used to implement the computing environment 12 of FIG. 2. The computing environment of FIG. 5 is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the disclosed subject matter. Neither should the computing environment 12 of FIG. 2 be interpreted as having any dependency or requirement relating to any one or combination of components illustrated in the exemplary operating environment of FIG. 5. In some embodiments, each illustrated computing element may include circuitry configured to instantiate certain aspects of the present disclosure. For example, the term circuitry used in this disclosure may include dedicated hardware components configured to perform functions through firmware or switches. In other examples, the term circuitry may include a general purpose processing unit, memory, etc., configured by software instructions that implement logic that may be used to perform functions. In an embodiment where circuitry includes a combination of hardware and software, an implementer may write source code embodying logic and the source code can be compiled into machine readable code that can be processed by the general purpose processing unit.
In FIG. 5, computing system 220 includes a computer 241, which typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 241 and includes both volatile and nonvolatile media, removable and non-removable media. The system memory 222 includes computer storage media in the form of volatile and/or nonvolatile memory such as Read Only Memory (ROM) 223 and Random Access Memory (RAM) 260. A basic input/output system 224 (BIOS), containing the basic routines that help to transfer information between elements within computer 241, such as during start-up, is typically stored in ROM 223. RAM 260 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 259. By way of example, and not limitation, FIG. 5 illustrates operating system 225, application programs 226, other program modules 227, and program data 228.
The computer 241 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example, FIG. 5 illustrates a hard disk drive 238 that reads from or writes to non-removable, nonvolatile magnetic media, a magnetic disk drive 239 that reads from or writes to a removable, nonvolatile magnetic disk 254, and an optical disk drive 240 that reads from or writes to a removable, nonvolatile optical disk 253 such as a CD ROM or other optical media. Other removable/non-removable, volatile/nonvolatile computer storage media that can be used in the exemplary operating environment include, but are not limited to, magnetic tape cassettes, flash memory cards, digital versatile disks, digital video tape, solid state RAM, solid state ROM, and the like. The hard disk drive 238 is typically connected to the system bus 221 through a non-removable memory interface such as interface 234, and magnetic disk drive 239 and optical disk drive 240 are typically connected to the system bus 221 by a removable memory interface, such as interface 235.
The drives and their associated computer storage media discussed above and illustrated in FIG. 5, provide storage of computer readable instructions, data structures, program modules and other data for the computer 241. In FIG. 5, for example, hard disk drive 238 is illustrated as storing operating system 258, application programs 257, other program modules 256, and program data 255. Note that these components can either be the same as or different from operating system 225, application programs 226, other program modules 227, and program data 228. Operating system 258, application programs 257, other program modules 256, and program data 255 are given different numbers here to illustrate that, at a minimum, they are different copies. A user may enter commands and information into the computer 241 through input devices such as a keyboard 251 and pointing device 252, commonly referred to as a mouse, trackball or touch pad. Other input devices (not shown) may include a microphone, joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 259 through a user input interface 236 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port, game port or a Universal Serial Bus (USB). The cameras 34, 36 and capture device 20 of FIG. 2 may define additional input devices for the computer 241. A monitor 242 or other type of display device is also connected to the system bus 221 via an interface, such as a video interface 232. In addition to the monitor, computers may also include other peripheral output devices such as speakers 244 and printer 243, which may be connected through an output peripheral interface 233.
The computer 241 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 246. The remote computer 246 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 241, although only a memory storage device 247 has been illustrated in FIG. 5. The logical connections depicted in FIG. 5 include a Local Area Network (LAN) 245 and a Wide Area Network (WAN) 249, but may also include other networks. Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets and the Internet.
When used in a LAN networking environment, the computer 241 is connected to the LAN 245 through a network interface or adapter 237. When used in a WAN networking environment, the computer 241 typically includes a modem 250 or other means for establishing communications over the WAN 249, such as the Internet. The modem 250, which may be internal or external, may be connected to the system bus 221 via the user input interface 236, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 241, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation, FIG. 5 illustrates remote application programs 248 as residing on memory device 247. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used.
In one embodiment, computing system 220 may be configured to represent each target with a model. As described in more detail below, information derived from such a model may be compared to information obtained from a capture device, such as a depth camera, so that the base scale or shape of the model, as well as its current pose, may be adjusted to more accurately represent the modeled target. The model may be represented by one or more polygonal meshes, by a set of mathematical primitives, and/or by other suitable machine representations of the object being modeled.
FIG. 6A is a flow chart describing one embodiment of a process for detecting and tracking one or more targets. In some embodiments, the one or more targets may include body parts and props of the human game player. In some embodiments, the particular object of one or more targets may represent a combination of a body part and a prop. For example, the particular object may include a baseball glove and an upper portion of a forearm.
The process of FIG. 6A may be performed by one or more computing devices. Each step in the process of FIG. 6A may be performed by the same or different computing devices as those used in the other steps, and each step need not be performed by a single computing device. In one embodiment, the process of FIG. 6A is performed by a computing environment, such as the computing environment of FIG. 2.
At step 602, one or more depth images are obtained from a source, such as capture device 20 of FIG. 2. In some embodiments, the source may be a depth camera configured to obtain depth information about the target through suitable techniques such as time-of-flight analysis, structured light analysis, stereo vision analysis, or other suitable techniques. In one embodiment, the obtained depth image may include a plurality of observed pixels, where each observed pixel has one or more observed depth values including depth information for a target viewed from the source. The obtained depth image may optionally be represented as a pixel matrix including, for each pixel address, a depth value indicating a world space depth from the plane of the depth camera, or another suitable reference plane, to the surface at that pixel address. In one embodiment, the obtained depth image may be downsampled to a lower resolution image. In another embodiment, the obtained depth image may be filtered to remove and/or smooth one or more highly variable and/or noisy depth values. Such highly variable and/or noisy depth values in the obtained depth image may originate from a number of different sources, such as random and/or systematic errors occurring during the image capture process, imperfections and/or distortions due to the capture device, and so forth.
At step 604, one or more of the obtained depth images may be processed to distinguish foreground objects to be tracked from non-target objects or other background elements. As used herein, the term "background" is used to describe anything in an image that is not part of one or more objects to be tracked. The background may include elements in front of (i.e., closer to the depth camera than) the target or targets to be tracked. Distinguishing foreground elements to be tracked from negligible background elements may increase tracking efficiency and/or simplify downstream processing.
In one embodiment, each data point (e.g., pixel) in the obtained depth image may be assigned a segmentation value (or index) that identifies whether the particular data point belongs to a foreground element or to a non-target background element. The segmentation value may represent a discrete index value or a fuzzy index value indicating the probability that a pixel belongs to a particular target and/or background element. In one example, each of one or more targets included within the foreground image may be assigned a different segmentation value. For example, a pixel corresponding to a first game player may be assigned a player index equal to 1, a pixel corresponding to a second player may be assigned a player index equal to 2, and a pixel not corresponding to a target player may be assigned a background index equal to 0. In another embodiment, pixels or other data points assigned a background index may be excluded from consideration in one or more subsequent processing steps. In some embodiments, the processing step of distinguishing foreground pixels from background pixels may be omitted.
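For illustration only, the following sketch assigns segmentation indices in the spirit described above, using a simple depth threshold and connected-component labeling (via scipy); treating each connected foreground region as one candidate target is an assumption made for this example, not the patented segmentation.

```python
import numpy as np
from scipy import ndimage  # used only for simple connected-component labeling

BACKGROUND_INDEX = 0

def segment_targets(depth_mm: np.ndarray, max_target_depth_mm: int = 3500) -> np.ndarray:
    """Assign a crude segmentation index to every pixel of a depth image.

    Pixels with valid depth nearer than the threshold are treated as
    foreground; each connected foreground region gets an index 1, 2, ...,
    and everything else keeps BACKGROUND_INDEX (0).
    """
    foreground = (depth_mm > 0) & (depth_mm < max_target_depth_mm)
    labels, _count = ndimage.label(foreground)
    return labels.astype(np.uint16)  # 0 = background, 1..N = candidate targets
```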
At step 606, foreground pixel assignment is performed. Foreground pixel assignment may include analyzing one or more foreground pixels to determine which of one or more targets (including body parts and props) are likely to be associated with the one or more foreground pixels. Various different foreground pixel assignment techniques may be used to evaluate to which of one or more targets (or machine representations of one or more targets) a particular pixel is likely to belong. In one embodiment, both depth information and color information are used in determining which probabilities to assign to a particular foreground pixel or a particular group of foreground pixels.
In one embodiment, machine learning may be used to assign a target index and/or a target probability distribution to each foreground pixel. The machine learning method uses information learned from a prior-trained collection of known poses (e.g., a training set of segmented images) to analyze foreground pixels. In one example, a stateless approach may be used to assign a target index or distribution to each foreground pixel without any prior context (i.e., without knowledge of a prior frame). In some embodiments, a machine learning method of foreground pixel assignment may utilize one or more decision trees to analyze each foreground pixel of interest in the obtained depth image. Such an analysis may determine a best-guess target assignment for the pixel, along with the confidence that the best guess is correct.
In some embodiments, the best guess may include a probability distribution over two or more possible targets, and the confidence may be represented by the relative probabilities of the different possible targets. At each node of the decision tree, an observed depth value comparison between two pixels is made, and depending on the result of this comparison, a subsequent depth value comparison between two pixels is made at a child node of the decision tree. These comparison results at each node determine the pixels to be compared at the next node. The end nodes of each decision tree result in a target classification and associated confidence in that classification.
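The per-node depth comparisons described above can be sketched as follows. The depth-normalized probe offsets, the dictionary-based tree layout, and the sentinel value for off-image probes are assumptions for illustration; they stand in for, rather than reproduce, the trained decision trees.

```python
import numpy as np

def depth_offset_feature(depth_mm: np.ndarray, x: int, y: int,
                         u: tuple[int, int], v: tuple[int, int],
                         far_depth: float = 1e6) -> float:
    """Depth-difference feature: compare two probe pixels offset from (x, y).

    The offsets are scaled by 1/depth so the probe pattern covers a roughly
    constant world-space area whether the target is near or far.
    """
    d = max(float(depth_mm[y, x]), 1.0)  # guard against missing depth

    def probe(offset: tuple[int, int]) -> float:
        px = x + int(round(offset[0] * 1000.0 / d))
        py = y + int(round(offset[1] * 1000.0 / d))
        if 0 <= py < depth_mm.shape[0] and 0 <= px < depth_mm.shape[1]:
            return float(depth_mm[py, px])
        return far_depth  # off-image probes look like distant background

    return probe(u) - probe(v)

def classify_pixel(depth_mm: np.ndarray, x: int, y: int, node: dict) -> dict:
    """Walk a binary tree of (u, v, threshold) tests down to a leaf holding a
    per-target probability distribution, e.g. {"hand": 0.7, "racket": 0.2}."""
    while "distribution" not in node:
        f = depth_offset_feature(depth_mm, x, y, node["u"], node["v"])
        node = node["left"] if f < node["threshold"] else node["right"]
    return node["distribution"]
```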
In some embodiments, subsequent decision trees may be used to iteratively refine the best guess assigned to the one or more targets for each pixel and the confidence that the best guess is correct. For example, once pixels have been classified with a first classification tree (based on neighboring depth values), refined classification may be performed to classify each pixel using a second decision tree looking at previously classified pixels and/or depth values. A third traversal may be used to further refine the classification of the current pixel by looking at previously classified pixels and/or depth values. It will be appreciated that virtually any number of iterations may be performed, with fewer iterations resulting in less computational expense, and more iterations likely to provide more accurate classification and/or confidence.
In some embodiments, a decision tree may be constructed during a training mode, wherein samples of known models of known poses (e.g., a training set of segmented images) are analyzed to determine questions (i.e., tests) that may be asked at each node of the decision tree to produce an accurate pixel classification.
In one embodiment, the foreground pixel assignments are stateless, meaning that pixel assignments are made without reference to a prior state (or prior image frame). One example of stateless processing for assigning a probability that a particular pixel or group of pixels represents one or more objects is sample processing. Sample processing uses a machine learning approach that takes a depth image and classifies each pixel by assigning it a probability distribution over the one or more objects to which it may correspond. For example, a given pixel (which is in fact part of a tennis racket) may be assigned a 70% likelihood that it belongs to a tennis racket, a 20% likelihood that it belongs to a ping-pong racket, and a 10% likelihood that it belongs to the right arm. Sample processing may take millions of pre-classified training samples (e.g., segmented images) as input, learn relationships between sets of pixels within the pre-classified training samples, and generate a segmented image based on a particular depth image. In one example, sample processing may produce a classification map in which pixels are classified by their probability of belonging to a particular object (e.g., a body part or prop). Sample processing is also described in U.S. patent application serial No. 12/454,628, entitled "Human Body Pose Estimation," which is hereby incorporated by reference in its entirety.
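Building on the tree-traversal sketch above, a classification map could be assembled by averaging the leaf distributions of a forest over every foreground pixel. This is again a schematic sketch; the forest structure and target names are hypothetical, not the cited sample processing.

```python
import numpy as np

def classification_map(depth_mm: np.ndarray, forest: list, targets: list[str],
                       foreground_mask: np.ndarray) -> np.ndarray:
    """Build an H x W x len(targets) map of per-pixel target probabilities by
    averaging the leaf distributions of every tree in a trained forest."""
    h, w = depth_mm.shape
    probs = np.zeros((h, w, len(targets)), dtype=np.float32)
    for y in range(h):
        for x in range(w):
            if not foreground_mask[y, x]:
                continue
            for tree in forest:
                dist = classify_pixel(depth_mm, x, y, tree)  # from the sketch above
                for i, name in enumerate(targets):
                    probs[y, x, i] += dist.get(name, 0.0)
            probs[y, x] /= len(forest)
    return probs
```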
In another embodiment, sample processing and centroid generation are used to generate probabilities that particular objects, such as body parts and/or props, have been correctly identified. A centroid may have an associated probability that the captured object has been correctly identified as a given object (such as a hand, face, or prop). In one embodiment, centroids for the user's head, shoulders, elbows, wrists, and hands are generated. Sample processing and centroid generation are further described in U.S. patent application No. 12/825,657, entitled "Skeletal Joint Recognition and Tracking System," and U.S. patent application No. 12/770,394, entitled "Multiple Centroid Condensation of Probability Distribution Clouds." The entire contents of each of the above applications are incorporated herein by reference.
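One simple way to condense a per-pixel probability layer into a centroid with an associated confidence, in the spirit of the centroid generation described above, is a probability-weighted average; the confidence measure below is an illustrative stand-in, not the method of the cited applications.

```python
import numpy as np

def target_centroid(probs: np.ndarray, depth_mm: np.ndarray, target_index: int,
                    min_prob: float = 0.5):
    """Condense one probability layer of a classification map into a centroid.

    Returns ((cx, cy, mean_depth_mm), confidence) or None if no pixel clears
    min_prob. Confidence here is just the mean probability of the contributing
    pixels, standing in for a per-centroid score.
    """
    layer = probs[:, :, target_index]
    ys, xs = np.nonzero(layer >= min_prob)
    if xs.size == 0:
        return None
    weights = layer[ys, xs]
    cx = float(np.average(xs, weights=weights))
    cy = float(np.average(ys, weights=weights))
    cz = float(np.average(depth_mm[ys, xs], weights=weights))
    confidence = float(weights.mean())
    return (cx, cy, cz), confidence
```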
At step 607, one or more classification maps are generated. As shown in fig. 6A, step 607 may receive input from steps 602, 604, and 606. In one embodiment, a first classification map corresponding to a body part target is generated, and a second classification map corresponding to a prop target is generated. In another embodiment, a unified classification map is generated that covers multiple targets, including both body part targets and prop targets. In one example of a method for generating a unified classification map, the training set provided to the machine learning technique for implementing step 606 includes segmented images comprising one or more body parts and one or more props. In one example, each pixel in the segmented image is identified as one of a body part, an object, or a background.
FIG. 6B is a flow chart describing another embodiment of a process for generating one or more classification maps. The process described in FIG. 6B is merely one example of a process for implementing step 607 in FIG. 6A. The process of fig. 6B may be performed by one or more computing devices. Each step of the process of fig. 6B may be performed by the same or different computing devices as those used in the other steps, and each step need not be performed by a single computing device. In one embodiment, the process of FIG. 6B is performed by a game console.
In FIG. 6B, a classification map is first generated for body part targets from a depth image. In one embodiment, the classification map of step 654 may be generated using the probability assignment of step 606, whereby foreground pixels are assigned probabilities of belonging to one or more body part targets. At step 656, body parts may be identified from the classification map generated in step 654. In one embodiment, a particular body part is identified where the probability assigned to one or more pixels of representing that body part is greater than 90%. At step 657, the identified body parts are removed from the depth image (or a derivative of the depth image). In some embodiments, the background is also removed. At step 658, object recognition is performed on the depth image with the identified body parts removed, in order to identify one or more props. In one embodiment, sample processing may be used to perform the object recognition. Other suitable object recognition techniques may also be used. At step 659, a classification map of one or more props is generated based on the results of step 658. One advantage of performing step 606 with a training set that does not include props (followed by a separate object recognition process) is that the object recognition of step 658 can detect props more efficiently than performing step 606 with a training set that includes props.
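The FIG. 6B flow can be summarized in a short sketch: threshold the body-part classification map, remove the identified pixels from the depth image, and hand the remainder to a separate prop recognizer. The 90% threshold comes from the embodiment above; the `recognize_props` callable is a hypothetical stand-in for the object recognition of step 658.

```python
import numpy as np

BODY_PART_THRESHOLD = 0.90  # "greater than 90%" per the embodiment above

def fig6b_pipeline(depth_mm, body_probs, recognize_props):
    """Sketch of the FIG. 6B flow: identify high-confidence body-part pixels,
    remove them from the depth image, then run a separate object recognizer
    on what remains to propose props."""
    best = body_probs.max(axis=2)                 # highest body-part probability per pixel
    identified_body = best > BODY_PART_THRESHOLD  # step 656
    remaining = depth_mm.copy()
    remaining[identified_body] = 0                # step 657: strip identified body parts
    prop_map = recognize_props(remaining)         # steps 658-659 (hypothetical callable)
    return identified_body, prop_map
```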
FIG. 6C is a flow chart describing another embodiment of a process for generating one or more classification maps. The process described in FIG. 6C is merely one example of a process for implementing step 607 in FIG. 6A. The process of fig. 6C may be performed by one or more computing devices. Each step of the process of fig. 6C may be performed by the same or different computing devices as those used in the other steps, and each step need not be performed by a single computing device. In one embodiment, the process of FIG. 6C is performed by a game console.
In FIG. 6C, a classification map is first generated from the depth image for prop targets. Prop targets include active props and/or inactive props. In one embodiment, the classification map of step 663 may be generated using the probability assignment of step 606, whereby foreground pixels are assigned probabilities of belonging to one or more prop targets. At step 665, props may be identified from the classification map generated at step 663. In one embodiment, a particular prop is identified where the probability assigned to one or more pixels of representing that particular prop is greater than 90%. At step 667, the identified props are removed from the depth image (or a derivative of the depth image). In some embodiments, the background is also removed. In one embodiment, a "don't care" value is assigned to the pixels associated with one or more of the removed props. This "don't care" value may be used by subsequent processing steps to disregard depth information associated with the removed pixels. This information may be helpful to subsequent classification steps because the removed pixels were associated with one or more props that may have been in front of a body part (i.e., a body part being identified or classified in subsequent processing steps may have been occluded by one or more props). At step 668, object recognition is performed on the depth image with the identified props removed in order to identify one or more body parts. In one embodiment, sample processing may be used to perform object recognition. In one example, steps 604 and 606 may be used with a new training set that includes segmented body part images. Other suitable object recognition techniques may also be used. At step 669, a classification map for one or more body parts is generated based on the results of step 668.
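A sketch of the prop-first removal with a "don't care" sentinel is shown below; the sentinel value and the 90% threshold default are assumptions for illustration, not values taken from the disclosure.

```python
import numpy as np

DONT_CARE = -1.0   # sentinel meaning "ignore this pixel's depth in later steps"

def remove_props(depth, prop_probs, prop_threshold=0.9):
    """Mark confident prop pixels as "don't care" so that a later body-part pass can
    skip them (the removed prop may have occluded the body part behind it)."""
    cleaned = depth.astype(float)                  # float copy so the sentinel fits
    for probs in prop_probs.values():
        cleaned[probs > prop_threshold] = DONT_CARE
    return cleaned
```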
Referring back to FIG. 6A, in step 610, model parsing and tracking is performed. In one embodiment, model parsing and tracking includes model fitting 608, skeletal tracking 620, and prop tracking 622. In one embodiment, model parsing and tracking 610 may receive one or more classification maps based on one or more of: the original depth images from step 602, the foreground/background information from step 604, and the foreground pixel probability assignments from step 606.
In one embodiment, model fitting 608 is used to fit one or more possible computer models to one or more obtained images and/or one or more classification maps. The one or more computer models may include machine representations of modeled targets (e.g., machine representations of body parts or props). In certain embodiments, model fitting involving lines, planes, or more complex geometries may be applied to track objects in three-dimensional space. In some examples, the model may include one or more data structures representing the target as a three-dimensional model comprising rigid or deformable shapes, or body parts. Each target (e.g., a human and/or prop) or portion of a target may be characterized as a mathematical primitive, examples of which include, but are not limited to, a sphere, an anisotropically scaled sphere, a cylinder, an anisotropic cylinder, a smooth cylinder, a box, a beveled box, a prism, and so forth. In some examples, the target may be modeled using a parameterized three-dimensional model. In some examples, a model may include a negative space (i.e., a space that should contain nothing). In one example, a steering wheel containing an empty space may be modeled with a three-dimensional model that includes a negative space associated with the empty space. In another example, the space at the end of a baseball bat may be modeled as a negative space.
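A possible data structure for such primitive-based models, including negative space, is sketched below; the field names and the example dimensions are hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass, field

@dataclass
class Primitive:
    kind: str                  # "sphere", "cylinder", "box", ...
    center: tuple              # (x, y, z) in world space, metres
    params: dict               # e.g. {"radius": 0.03, "length": 0.8}
    negative: bool = False     # True if this volume should contain nothing

@dataclass
class PropModel:
    name: str
    primitives: list = field(default_factory=list)

# A baseball bat as a cylinder, plus a negative-space sphere just beyond its tip
# (the region that should be empty when the bat is held normally).
bat = PropModel("baseball_bat", [
    Primitive("cylinder", (0.0, 0.0, 0.4), {"radius": 0.03, "length": 0.8}),
    Primitive("sphere",   (0.0, 0.0, 0.95), {"radius": 0.1}, negative=True),
])
```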
In one embodiment, during model fitting 608, the human target is modeled as a skeleton comprising a plurality of skeleton points, each skeleton point having a three-dimensional position in world space. Each skeletal point may correspond to an actual joint of the human target, an extremity of the human target, and/or a point that is not anatomically directly linked to the human target. Each skeletal point has at least three degrees of freedom (e.g., world space x, y, z). In one example, a skeleton with 31 skeleton points may be defined by 93 values.
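The 31-point, 93-value parameterization mentioned above could be held in a flat vector as in this sketch; the class layout is an assumption made for illustration.

```python
import numpy as np

NUM_JOINTS = 31        # 31 skeleton points x 3 degrees of freedom = 93 values

class Skeleton:
    """World-space skeleton stored as an (x, y, z) position per skeleton point."""

    def __init__(self, positions=None):
        self.positions = (np.zeros((NUM_JOINTS, 3)) if positions is None
                          else np.asarray(positions, dtype=float).reshape(NUM_JOINTS, 3))

    def as_vector(self):
        """Flatten to the 93-value parameter vector."""
        return self.positions.reshape(-1)

    @classmethod
    def from_vector(cls, vec):
        """Rebuild a skeleton from a 93-value parameter vector."""
        return cls(vec)
```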
In some embodiments, various model fitting methods may use depth information, background information, prop information, body part information, and/or previously trained anatomical and motion information to map one or more computer models onto the obtained images. For example, the body part information may be used to find one or more candidate locations for one or more skeletal bones. Subsequently, multiple plausible skeletons may be assembled that include skeletal bones at different combinations of the one or more candidate locations. Each plausible skeleton may then be scored, and the scored proposals may be combined into a final estimate. In one embodiment, model fitting 608 includes two components: a body-part proposer that extracts candidate locations from the foreground pixel assignments 606 independently for each body part (e.g., finding the centroid of each body part); and a skeleton generator that merges the candidates into a complete skeleton.
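Assembling plausible skeletons from per-part candidate locations could look like the following sketch; exhaustively combining candidates and capping the count are simplifications, not the disclosed skeleton generator.

```python
from itertools import product

def assemble_skeleton_hypotheses(candidates, max_hypotheses=1000):
    """Yield full-skeleton hypotheses built from per-part candidate locations.

    candidates: dict of body-part name -> list of candidate (x, y, z) locations
                (e.g., centroids proposed from the pixel assignments of step 606).
    """
    parts = sorted(candidates)
    for count, combo in enumerate(product(*(candidates[p] for p in parts))):
        if count >= max_hypotheses:          # cap the combinatorial explosion
            return
        yield dict(zip(parts, combo))        # one plausible skeleton
```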
Referring back to FIG. 6A, in one embodiment, the process for detecting and tracking one or more targets may be implemented by a target suggestion system and a target tracking system. The target suggestion system may implement steps 602, 604, 606, and 607 to identify one or more candidate targets. One or more candidate targets may be identified within one or more classification maps. The target tracking system may implement steps 610 and 612 in order to reconcile the one or more candidate targets and correctly report the identified targets. In one example, skeletal tracking system 620 consumes one or more candidate targets assigned as candidate body parts, while prop tracking system 622 consumes one or more candidate targets assigned as candidate props. In another example, skeletal tracking system 620 consumes a first classification map associated with one or more candidate body parts, while prop tracking system 622 consumes a second classification map associated with one or more candidate props.
Referring back to FIG. 6A, in one embodiment, skeletal tracking system 620 works by connecting one or more body part suggestions (or candidates) in various ways in order to generate a large number of (partial or whole) skeletal hypotheses. To reduce computational complexity, certain parts of the skeleton (such as the head and shoulders) may be resolved first, followed by other parts (such as the arms). These skeletal hypotheses are then scored in various ways, and the scores and other information are used to select the best hypothesis and to resolve where the correct body parts actually are. Similarly, prop tracking system 622 considers one or more prop suggestions (or candidates), generates prop hypotheses, scores the generated prop hypotheses, and selects the best hypothesis to determine the correct prop. In one embodiment, at step 610, the positions and/or orientations of one or more previous high-scoring hypotheses from a previous image are used to help score the generated hypotheses. For example, a previous determination of the position and orientation of a tennis racket in a previous image may be used to score the position and orientation of the tennis racket in the current image.
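A temporal scoring term of the kind described, favoring hypotheses near the previous frame's high-scoring estimate, might look like this sketch; the pose representation (position plus unit quaternion) and the linear weighting are assumptions.

```python
import numpy as np

def temporal_score(hypothesis_pose, previous_pose, weight=1.0):
    """Non-positive score adjustment: zero when the hypothesis matches the previous
    frame's estimate, increasingly negative as position or orientation diverge.

    Each pose is (position xyz, orientation as a unit quaternion).
    """
    (p1, q1), (p0, q0) = hypothesis_pose, previous_pose
    position_gap = np.linalg.norm(np.asarray(p1, dtype=float) - np.asarray(p0, dtype=float))
    orientation_gap = 1.0 - abs(float(np.dot(q1, q0)))   # 0 when orientations agree
    return -weight * (position_gap + orientation_gap)
```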
In one embodiment, feedback may occur between skeletal tracking system 620 and prop tracking system 622. In one example, skeletal tracking system 620 receives prop tracking information from prop tracking system 622. Prop tracking information includes position and orientation information related to one or more props. The prop tracking information is considered when scoring the generated skeletal hypotheses. For example, in the case where a particular prop (e.g., a tennis racket or baseball bat) is located close to a particular body part (e.g., a hand or arm), the score given to a skeletal hypothesis may be rewarded. The location may be a 3-D location in three-dimensional space or a 2-D location in two-dimensional space. Similarly, in the event that a particular prop is not within a threshold distance of a particular body part with which that prop is typically associated, the score given to a particular hypothesis may be reduced (or penalized). In some embodiments, the reward or penalty applied by a particular cost function (e.g., the number of points given to a particular body part hypothesis) may be linear or non-linear.
In another example, prop tracking system 622 receives skeletal tracking information from skeletal tracking system 620. Skeletal tracking information includes position and orientation information related to one or more body parts. The skeletal tracking information is considered when scoring the generated prop hypotheses. For example, the score given to a prop hypothesis may be rewarded where the location of a particular body part (e.g., the head) is near a particular prop (e.g., a hat). The location may be a 3-D location in three-dimensional space or a 2-D location in two-dimensional space. Similarly, the score given to a particular hypothesis may be reduced (or penalized) in the event that a particular body part is not within a threshold distance of a particular prop typically associated with that body part. In some embodiments, the reward or penalty applied by a particular cost function (e.g., the score given to a particular prop hypothesis) may be linear or non-linear. Feedback data relating to the user's body may be particularly helpful in situations where tracking an object is difficult (e.g., when an object is moving into and out of view quickly, or when an object is moving at high speed relative to the capture device's ability to capture the object's motion). For example, in the case where a game player swings a baseball bat, if tracking of the baseball bat is lost, tracking may be recovered by considering the position of the game player's hands. In certain embodiments, prop tracking 622 is performed in parallel with skeletal tracking 620.
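The reward/penalty feedback between the two trackers could be expressed as a simple score adjustment like the sketch below; the prop-to-body-part associations, the distance threshold, and the linear scheme are illustrative assumptions, not details taken from the disclosure.

```python
import numpy as np

# Hypothetical associations between props and the body parts they are usually near.
PROP_TO_BODY_PART = {"tennis_racket": "hand", "baseball_bat": "hand", "hat": "head"}

def feedback_adjustment(skeleton_hyp, prop_tracks, threshold=0.3, reward=1.0, penalty=1.0):
    """Adjust a skeletal hypothesis score using prop tracking feedback.

    skeleton_hyp: dict of body-part name -> (x, y, z) location in the hypothesis.
    prop_tracks:  dict of prop name -> (x, y, z) location from the prop tracking system.
    A hypothesis is rewarded when an associated prop lies within `threshold` metres of
    the expected body part, and penalized otherwise.
    """
    adjustment = 0.0
    for prop, prop_pos in prop_tracks.items():
        part = PROP_TO_BODY_PART.get(prop)
        if part is None or part not in skeleton_hyp:
            continue
        distance = np.linalg.norm(np.asarray(prop_pos, dtype=float)
                                  - np.asarray(skeleton_hyp[part], dtype=float))
        adjustment += reward if distance <= threshold else -penalty
    return adjustment
```

The same function, with the roles of props and body parts swapped, would serve the opposite direction of the feedback, from skeletal tracking to prop tracking.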
At step 612, the determination of the correctly identified target from step 610 is reported and made available to other applications. Reporting may be performed in any suitable manner. In one example, an Application Programming Interface (API) may be used to report one or more selected targets. For example, such APIs may be configured to communicate position, velocity, acceleration, confidence in position, velocity, and/or acceleration, and/or other information related to one or more selected targets.
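A reporting interface of the kind described might expose a per-target record such as the one below; the field names are illustrative only and do not reflect an actual API.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class TargetReport:
    """One reported target, made available to applications at step 612."""
    target_id: str                        # e.g. "right_hand" or "tennis_racket"
    kind: str                             # "body_part" or "prop"
    position: Tuple[float, float, float]
    velocity: Tuple[float, float, float]
    acceleration: Tuple[float, float, float]
    confidence: float                     # 0..1 confidence in the estimates above

def report_targets(targets, publish):
    """Hand each TargetReport to an application-supplied callback."""
    for target in targets:
        publish(target)
```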
FIG. 7 depicts an original image 750 and a segmented body image 752 based on the original image 750. In one embodiment, the segmented body image 752 distinguishes one or more pixel regions associated with particular body part targets by assigning a particular color to each of the one or more pixel regions. The original image 750 may come from a number of sources, including a capture device, such as capture device 20 in FIG. 2, or a graphics package or other 3-D rendering program. In one embodiment, the original image 750 represents a particular pose of a user (such as the user 18 in FIGS. 1A-1C). In one embodiment, the target detection and tracking system 10 of FIG. 2 may receive the original image 750 and generate the segmented body image 752 using the process described with reference to FIG. 6A. In one example, the classification maps generated by step 607 of FIG. 6A may include segmented images. In one embodiment, one or more segmented images, each corresponding to a particular pose, may be used as part of a training set (i.e., training examples) for a machine learning method. The training set may include thousands, millions, or any number of segmented images.
In one embodiment, one or more training images of a training set may be modified with a 3-D model of a particular object or prop. The 3-D model may include one or more data structures that represent the particular object as a three-dimensional shape. In another embodiment, one or more training images of the training set may be rendered using a 3-D model of a particular object or prop.
In FIG. 8, three training images 942, 944, and 946 have each been modified with a 3-D model of a prop. The segmented image 942 has been modified with a tennis racket. The segmented image 944 has been modified with a sword; in this case, the modified segmented image may be discarded from the training set because the object penetrates the user's body. The segmented image 946 has been modified with a baseball bat. Retrofitting an existing human pose training set with props, and/or automatically generating a new training set including props based on the existing human pose training set, is less expensive than creating a training set from newly captured movements of human poses and props. In some embodiments, the human subject does not touch or hold an object or prop in the training image. For example, a ball passed between two game players will be in mid-air and not in direct contact with either player.
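Compositing a rendered prop into an existing segmented training image, and discarding examples where the prop penetrates the body, might be approximated as below; the penetration test and the reserved prop label are simplifying assumptions made for this sketch.

```python
import numpy as np

PROP_LABEL = 100   # hypothetical label id reserved for the composited prop

def composite_prop(depth, labels, prop_depth, prop_mask, body_labels):
    """Composite a rendered prop into a segmented training example.

    depth, labels:         HxW depth image and per-pixel label map of a human pose.
    prop_depth, prop_mask: HxW rendered prop depth and boolean coverage mask.
    body_labels:           iterable of label ids that count as body.
    Returns the modified (depth, labels), or None when the prop lies behind the body
    surface it overlaps (treated here as penetration, so the example is discarded).
    """
    overlap = prop_mask & np.isin(labels, list(body_labels))
    if np.any(prop_depth[overlap] > depth[overlap]):
        return None
    out_depth, out_labels = depth.astype(float), labels.copy()
    occludes = prop_mask & (prop_depth < out_depth)     # prop is in front of the scene
    out_depth[occludes] = prop_depth[occludes]
    out_labels[occludes] = PROP_LABEL
    return out_depth, out_labels
```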
Because there is a tradeoff associated with the number of body parts and objects that can be detected simultaneously, in some embodiments the number of body part targets may be limited. For example, rather than searching for 32 different body parts, the body part targets may include only the head, neck, left and right shoulders, left and right upper torso, upper and lower arms, and hands. In some embodiments, one or more prop targets may include multiple locations. For example, a tennis racket may be composed of a handle and a head.
Once a detection and tracking system (such as detection and tracking system 10 of FIG. 2) has been trained with a training set that includes segmented body parts and props, classification maps covering both body part targets and prop targets may be generated. In FIG. 9A, an original depth image of a gloved human is used to generate a segmented image that includes both predicted body parts and props. As shown in the segmented image of FIG. 9A, the glove on the user's right hand may be classified as a prop target, along with a plurality of body part targets (e.g., left and right shoulders). In FIG. 9B, an original depth image of a user holding a baseball bat may be used to generate a segmented image in which the baseball bat is classified as a prop target.
In one embodiment, a plurality of props may be classified along with a plurality of body parts. In FIG. 9C, a depth image 912 of a user holding a baseball bat and throwing a football into the air is used to generate a segmented image 916 in which the baseball bat and football are classified as prop targets. In one embodiment, color and/or pattern information received from a capture device may be used to help further distinguish between targets that are similar in shape and size. In one example, a basketball and a soccer ball may be distinguished based on color information. In another example, a soccer ball pattern of alternating black pentagons and white hexagons may be used to help distinguish a soccer ball from other objects having similar shapes and sizes.
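Color-based disambiguation between similarly shaped props could be as simple as a nearest-mean-color rule, sketched below; the reference colors are made-up values, and a real system would likely use richer color or pattern features.

```python
import numpy as np

# Hypothetical mean RGB reference colors for props of similar shape and size.
REFERENCE_COLORS = {"basketball": (160, 80, 40), "soccer_ball": (200, 200, 200)}

def disambiguate_by_color(color_image, mask):
    """Return the prop name whose reference color is closest to the mean color of the
    masked pixel region (color_image is HxWx3, mask is an HxW boolean array)."""
    mean_rgb = color_image[mask].mean(axis=0)
    return min(REFERENCE_COLORS,
               key=lambda name: np.linalg.norm(mean_rgb - np.asarray(REFERENCE_COLORS[name])))
```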
In one embodiment, detecting and/or tracking a user's selection of a particular prop, or the introduction of a particular prop into the field of view, may trigger an application to select a particular application mode. In one example, a game player picking up a football will cause a sports application to select a game mode associated with football. In another example, a particular game may allow a game player to select and use three different objects (e.g., a gun, a baseball bat, and a power saw) based on which of the associated items the game player is holding.
The disclosed technology is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with the technology include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The disclosed technology may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, software and program modules as described herein include routines, programs, objects, components, data structures, and other types of structures that perform particular tasks or implement particular abstract data types. Hardware or a combination of hardware and software may be substituted for the software modules described herein.
The disclosed technology may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
For purposes of this document, references in the specification to "an embodiment," "one embodiment," "some embodiments," or "another embodiment" are used to describe different embodiments and do not necessarily refer to the same embodiment.
For purposes herein, a connection may be a direct connection or an indirect connection (e.g., via another party).
For purposes herein, the term "set" of objects refers to a "set" of one or more objects.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.
Claims (10)
1. A method for detecting one or more targets, comprising:
obtaining (602) one or more depth images (750, 912) from one or more depth sensing image sensors, a first depth image of the one or more depth images comprising a plurality of pixels;
generating (607) a classification map (752, 916) associated with the first depth image, the generating step comprising assigning to one or more pixels of the plurality of pixels a probability that the one or more pixels are associated with a particular target of the one or more targets, the one or more targets including a first target representing at least a portion of a first body part and a second target representing at least a portion of a first prop;
fitting (608) at least one of one or more computer models to at least a portion of the classification map, the one or more computer models including at least a first model of the first target and at least a second model of the second target;
performing skeletal tracking (620) on the first target;
performing prop tracking (622) on the second target; and
reporting (612) a first location of the first target and a second location of the second target.
2. The method of claim 1, further comprising:
obtaining one or more color images from one or more color sensing image sensors, the step of performing prop tracking comprising using color information from the one or more color images to assist in tracking the second target.
3. The method of claim 1, wherein:
the generating step is performed using a machine learning technique that uses a training set of segmented images that includes one or more modified images.
4. The method of claim 1, wherein:
the step of performing skeletal tracking includes receiving location information about the second target, the location information being considered in determining a first location of the first target.
5. The method of claim 1, wherein:
the step of performing prop tracking includes receiving location information about the first target, the location information being considered in determining a second location of the second target.
6. The method of claim 1, further comprising:
switching a game mode based on the reporting step.
7. The method of claim 1, further comprising:
receiving orientation information from the first prop, the step of performing prop tracking comprising using the orientation information to assist in tracking the first prop.
8. The method of claim 1, wherein:
the second model includes one or more negative spaces.
9. A system for detecting one or more targets, comprising:
means for obtaining one or more depth images from one or more depth sensing image sensors, a first depth image of the one or more depth images comprising a plurality of pixels;
means for generating a classification map associated with the first depth image, the means for generating comprising means for assigning to one or more pixels of the plurality of pixels a probability that the one or more pixels are associated with a particular target of the one or more targets, the one or more targets including a first target representing at least a portion of a first body part and a second target representing at least a portion of a first prop;
means for fitting at least one of one or more computer models to at least a portion of the classification map, the one or more computer models including at least a first model of the first target and at least a second model of the second target;
means for performing skeletal tracking on the first target;
means for performing prop tracking on the second target; and
means for reporting a first location of the first target and a second location of the second target.
10. A method for detecting one or more targets, the method comprising the steps of:
obtaining (602) one or more depth images from one or more depth sensing image sensors, a first depth image of the one or more depth images comprising a plurality of pixels;
generating (607) a classification map associated with the first depth image, the generating step comprising assigning to one or more pixels of the plurality of pixels a probability that the one or more pixels are associated with a particular target of one or more targets, the one or more targets including a first target representing at least a portion of a first body part and a second target representing at least a portion of a first prop;
performing skeletal tracking (620) on the first target, the step of performing skeletal tracking comprising receiving location information about the second target, the location information being considered in determining a first location of the first target;
performing prop tracking (622) on the second target, the step of performing prop tracking comprising receiving location information about the first target, the location information being taken into account in determining a second location of the second target; and
reporting (612) a first location of the first target and a second location of the second target.