
WO2002013535A2 - Video encoder using image from a secondary image sensor - Google Patents

Video encoder using image from a secondary image sensor

Info

Publication number
WO2002013535A2
Authority
WO
WIPO (PCT)
Prior art keywords
image
video
encoding
parameter
images
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/EP2001/008538
Other languages
French (fr)
Other versions
WO2002013535A3 (en)
Inventor
Michael Bakhmutsky
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Koninklijke Philips NV
Original Assignee
Koninklijke Philips Electronics NV
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Koninklijke Philips Electronics NV filed Critical Koninklijke Philips Electronics NV
Priority to KR1020027004498A priority Critical patent/KR20020064794A/en
Priority to JP2002518086A priority patent/JP2004506354A/en
Priority to EP01969495A priority patent/EP1310102A2/en
Publication of WO2002013535A2 publication Critical patent/WO2002013535A2/en
Publication of WO2002013535A3 publication Critical patent/WO2002013535A3/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/167Position within a video image, e.g. region of interest [ROI]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/30Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability
    • H04N19/37Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability with arrangements for assigning different transmission priorities to video input data or to video coded data
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/136Incoming video signal characteristics or properties
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/10Cameras or camera modules comprising electronic image sensors; Control thereof for generating image signals from different wavelengths
    • H04N23/11Cameras or camera modules comprising electronic image sensors; Control thereof for generating image signals from different wavelengths for generating image signals from visible and infrared light wavelengths
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/20Cameras or camera modules comprising electronic image sensors; Control thereof for generating image signals from infrared radiation only
    • H04N23/23Cameras or camera modules comprising electronic image sensors; Control thereof for generating image signals from infrared radiation only from thermal infrared radiation

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Toxicology (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Closed-Circuit Television Systems (AREA)
  • Image Processing (AREA)
  • Studio Devices (AREA)

Abstract

A secondary sensor is provided that senses the same scene as a video camera. The image from the secondary sensor is used to identify areas of the video image corresponding to objects of interest. The identified areas of interest can then be encoded at a finer level of detail than the other areas in the video image. A preferred secondary sensor for detecting animate objects, such as humans in a videoconference scene, is a conventional infrared heat sensor matrix. By encoding the areas of the video image corresponding to ambient temperature regions of the heat sensor matrix at a generally lower level of detail, the available bandwidth can be allocated for transmitting the higher temperature regions at a finer level of detail or a higher frame rate. The secondary image may also be used as a 'front end filter' to conventional object recognition applications.

Description

Using a secondary sensor for optimized video communications
1. Field of the Invention
This invention relates to the field of video communications, and in particular to a method and system that facilitates an optimized transmission of images based on a coupling of a video camera with a secondary sensor, such as a heat sensor mosaic.
2. Description of Related Art
Video communications consume a relatively large transmission bandwidth, and a number of systems have been developed and continue to be developed to reduce the required bandwidth, or to optimize the use of existing bandwidth. An MPEG encoding of a stream of images, for example, uses a variety of techniques to reduce the amount of data that needs to be transmitted or stored. For ease of reference, the term bandwidth is used herein to include the amount of encoded data required to either store or transmit video images. A discrete cosine transform (DCT) is used to reduce the size of the encoded information spatially within each image frame, or portion of a frame. Motion estimation techniques are used to reduce the size of the encoded information temporally, based on the amount of difference, or movement, between successive images. Quantization is used to reduce the size of the encoded information based on the degree of detail required, or to reduce the size, and thus the detail, based on available bandwidth. Each of these techniques is intended to optimize the allocation of bandwidth to different characteristics of the image, without introducing noticeable visible anomalies when the received image is decoded and displayed. Even with the bandwidth-optimizing techniques of MPEG encoding, some compromises are required for low-bandwidth systems. For example, video images communicated over the Internet are typically constrained to small-size images, providing substantially less resolution than a full-resolution DVD version of the same stream of images. Video images communicated for videoconferencing are typically encoded at less than half the frame rate of conventional television broadcasts, producing delayed and discontinuous images on the display.
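As a rough illustration of the quantization technique described above, the following Python sketch transforms a single 8x8 pixel block with a DCT and divides the coefficients by a quantization step size. The function names and step values are illustrative assumptions, not part of this disclosure; a real MPEG encoder additionally applies per-coefficient weighting matrices and entropy coding.

```python
# A toy sketch of quantized DCT coding: an 8x8 block is transformed and its
# coefficients divided by a quantization step size, so a larger step discards
# more detail and needs fewer bits. Names and values are illustrative.
import numpy as np
from scipy.fft import dctn, idctn

def encode_block(block, q_step):
    """DCT-transform an 8x8 pixel block and quantize the coefficients."""
    coeffs = dctn(block.astype(float), norm="ortho")
    return np.round(coeffs / q_step).astype(int)

def decode_block(qcoeffs, q_step):
    """Dequantize and inverse-transform back to approximate pixel values."""
    return idctn(qcoeffs.astype(float) * q_step, norm="ortho")

block = np.random.randint(0, 256, (8, 8))   # stand-in for camera luminance
fine = encode_block(block, q_step=4)        # many nonzero coefficients kept
coarse = encode_block(block, q_step=64)     # mostly zeros: far fewer bits
print(np.count_nonzero(fine), np.count_nonzero(coarse))
```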
US patent 5,475,433 "FUZZY-CONTROLLED CODING METHOD AND APPARATUS THEREFOR" (sic), issued 12 December 1995 for Je-chang Jeong, and incorporated by reference herein, teaches a further method of optimizing the MPEG encoding of video images by adjusting the parameters of the aforementioned encoding techniques based on a combination of characteristics. For example, a sequence of images with a large amount of motion is encoded at a lower level of detail than a relatively static image, based on the premise that the lack of detail will not be visibly detectable in a fast-moving scene. In like manner, the degree of complexity of the image and its brightness, and the amount of available bandwidth, are used to adjust the quantization level, and therefore the amount of detail, of the transmitted image.
Other techniques have been proposed for improving the bandwidth allocation process, most of which rely on the segregation of images into "objects", or "object regions". MPEG-4, for example, allows for the separation of an object from its background, and thereby allows the object to be encoded at a different, typically finer, level of detail than the background. This encoding technique is expected to be particularly well suited for videoconferencing, wherein the majority of the limited bandwidth is allocated to the human 'objects' in the scene, with minimal bandwidth being allocated to background scenes. In this manner, although movements in the background may appear staggered and potentially blurred, the human objects in the scene will appear clearly, and potentially at a higher frame rate that reduces delays and discontinuities. These object-dependent encoding techniques are also expected to facilitate graphic art effects, wherein select objects can be encoded with different emphasis than the background scene, or other objects. These advanced techniques for allocating bandwidth or providing graphic art effects to objects of interest in an encoded image, however, require the recognition of each object in the image. Object recognition is a complex processing task that currently requires processing equipment that is beyond the feasible cost range for consumer devices. The high cost and relatively low accuracy of current object recognition devices preclude their use in most applications that could benefit from an optimized encoding, such as video conferencing and Internet video communications.
It is an object of this invention to provide a method and system for object recognition that facilitates an optimization of bandwidth allocation of video images. It is a further object of this invention to provide a low cost video system having an object-based resource allocation. It is a further object of this invention to provide a low cost video system that facilitates an optimization of bandwidth allocation. It is a further object of this invention to provide a means of distinguishing an object from the background of an image. These objects and others are achieved by providing a secondary sensor that senses the same scene as a video camera. The secondary image is used to identify areas of the video image corresponding to objects of interest. The identified areas of interest can then be encoded at a finer level of detail than the other areas in the video image. A preferred secondary sensor for detecting animate objects, such as humans in a videoconference scene, is a conventional infrared heat sensor matrix. By encoding the areas of the video image corresponding to ambient temperature regions of the heat sensor matrix at a very coarse level of detail, the available bandwidth can be allocated for transmitting the higher temperature regions at a higher level of detail, or at a higher frame rate. The secondary image may also be used as a "front end filter" to conventional object recognition applications, thereby increasing the efficiency and accuracy of these applications.
The invention is explained in further detail, and by way of example, with reference to the accompanying drawings wherein:
Fig. 1 illustrates an example block diagram of an encoding system in accordance with this invention.
Fig. 2 illustrates an example camera system in accordance with this invention.
Fig. 3 illustrates an example flow diagram of an encoding system in accordance with this invention.
Throughout the drawings, the same reference numerals indicate similar or corresponding features or functions.
FIG. 1 illustrates an example block diagram of an encoding system 100 in accordance with this invention. The encoding system 100 includes a source of a video image 110, a corresponding secondary image 120, and an encoder 150. For ease of reference, the term image is used herein to define an array of values corresponding to items within a field of view of a collection device. For example, the video image 110 generally corresponds to an array of values associated with the collection of visible light within the field of view of a video camera. This array of values may be in any of a variety of formats, and although represented as an array of values in the figures, may be a serial stream of values.
As discussed further below, the secondary image 120 in accordance with this invention is not a derivative of the video image 110, but is a representation of substantially the same scene as the video image, collected via an alternative sensor to the sensor that is used to collect the video image 110. In a preferred embodiment, the secondary image 120 is a representation of the scene collected via an infrared heat sensor, although other secondary sensing devices may be used as well. Preferably, the secondary sensor captures a characteristic of the scene that facilitates the recognition of potential objects of interest 101 in the video image 110. An infrared sensor is particularly well suited for detecting animate objects, such as humans, even when the object is fully clothed. Another sensor, such as a detector of particular visible colors, may be used, for example, when the particular color is associated with potential objects of interest. As illustrated, the resolution of the secondary image 120 may be different than the video image 110. In a low-cost embodiment, for example, the secondary image 120 may be a 64x64 array of thermal values, whereas the video image 110 may be a 330x485, or larger, array of luminance and chrominance values. The resolution of the secondary image 120 is selected based on a cost/performance tradeoff. The resolution of the secondary image 120 determines the accuracy of determining the shape of the object of interest 101, and thereby the degree of encoding optimization that can be achieved, but the cost of a sensor to produce a high-resolution image 120 may be substantially higher than the cost of a sensor that produces a low-resolution image 120. Such a high cost may be warranted, for example, in a professional system that is used to identify a newscaster in a scene, and substitutes appropriate background images based on the news content.
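To make the resolution mismatch concrete, the following sketch assumes a simple linear correspondence between the two images and maps each cell of a 64x64 thermal array to the pixel rectangle it covers in a 330x485 video image. The function name, and the reading of the dimensions as rows by columns, are assumptions for illustration, not the patent's specification.

```python
# Map a thermal cell (region 121) to the video-image pixels it covers
# (regions 111), assuming a simple linear correspondence.
def thermal_to_video_rect(row, col, thermal_dim=(64, 64), video_dim=(330, 485)):
    """Return (top, left, bottom, right) pixel bounds of the video-image
    area covered by the thermal cell (row, col)."""
    th_h, th_w = thermal_dim
    v_h, v_w = video_dim
    top, left = row * v_h // th_h, col * v_w // th_w
    bottom, right = (row + 1) * v_h // th_h, (col + 1) * v_w // th_w
    return top, left, bottom, right

print(thermal_to_video_rect(0, 0))     # (0, 0, 5, 7): one cell ~ 5x7 pixels
print(thermal_to_video_rect(63, 63))   # bottom-right cell of the array
```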
For ease of reference and understanding, this invention is presented using the paradigm of an identification of potential objects or regions of interest in the video image based on thermal emissions, and an adjustment of the level of detail provided in the encoding of these objects or regions of interest. As will be evident to one of ordinary skill in the art in view of this disclosure, other characteristics of a secondary image can be used to control the encoding of the video images, such as an identification of objects based on a particular color. In like manner, other encoding parameters, such as brightness, color intensity, frame rate, and so on, may be adjusted in dependence upon the detected characteristics. In the context of this invention, any parameter or characteristic that affects the encoding of an image is termed an "encoding parameter". For example, in lieu of directly adjusting the encoding level of detail for background regions, the luminance and chrominance values within these regions may be set to a constant value, thereby minimizing the information content that needs to be encoded for these regions. As illustrated in FIG. 1, the characteristics of the secondary image 120 are used to control the encoding parameters 160 that are used by the encoder 150 for encoding the video image 110. For example, the object 101 is illustrated as being overlaid on the images 110, 120. In the aforementioned infrared sensor example, if this object 101 is a source of heat, the sensor regions corresponding to the infrared image 120 that are overlaid by the infrared emitting object 101 will have higher sensed values than the surrounding regions. Regions that are partially overlaid by the infrared emitting object 101 will have an average sensed value that is lower than the regions that are overlaid completely by the infrared emitting object 101, but higher than the regions that do not contain a source of infrared emissions. If, as illustrated, a region 121 of the secondary image 120 contains a characteristic (high thermal sensed value) that corresponds to the presence of an animate object (warm body) in that region 121, the encoder 150 encodes the regions 111 of the video image 110 corresponding to this region 121 at a finer level of detail than regions in the secondary image 120 that do not exhibit the presence of a warm body. This level of detail can be changed, for example, by modifying the quantization step size used in the quantization of DCT values in an MPEG encoding. Other encoding parameters 160 may be adjusted, in addition to, or in lieu of, the quantization parameter. For example, the perception of a higher frame rate can be achieved by transmitting frames containing the regions of interest more often than frames that contain the other regions.
Note that the characteristic of the region 121 may be one of many parameters 160 that affect the level of detail of the encoding of the corresponding regions 111 in the video image 110. For example, a "fuzzy-logic" system such as presented in the aforementioned US patent 5,475,433 may be used to determine an encoding level of detail that is dependent upon a variety of factors, including one or more characteristics of the secondary image 120. Copending US patent application "MOTION-ANALYSIS BASED BUFFER REGULATION SCHEME", serial number 09/220,292, filed 23 December 1998 for Shing-Chi Tzou, Zhiyong Wang, and Janwun Lee, Attorney Docket PHA 23,597, incorporated by reference herein, discloses the use of an image map that contains a nominal value that is used to determine the quantization step size for each MPEG-sized block in a video image. The nominal value of each block is dynamically adjusted, based on the current as well as prior characteristics of the block. As in the cited US patent 5,475,433, this nominal value is adjusted to produce a coarser level of detail for a "dynamic" block whose content changes quickly. The use of an image map allows for a continuously improved rendering of the video image. For example, a "static" block in an image is progressively encoded in finer and finer detail, subject to bandwidth availability, so that any potential "lulls" in bandwidth utilization can be used to improve the picture quality. A preferred combination of this invention with the copending invention would favor the progressively finer encoding of the regions of interest identified by the secondary image 120, rather than potentially less interesting regions. That is, for example, identified regions of interest would be given higher priority for allocation of the available bandwidth, and the regions of less interest would be allocated bandwidth after the interesting regions are rendered at a predefined acceptable level of detail.
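The priority scheme just suggested can be sketched as a simple greedy allocation. The one-unit cost model and all names below are assumptions for illustration, not the actual buffer-regulation scheme of the copending application.

```python
# A toy sketch of the combined scheme: refinement steps are spent on regions
# of interest first, and on background regions only once the interesting
# ones reach an acceptable level of detail.
def allocate_refinements(regions, budget, target_quality=5):
    """regions: list of dicts {'interesting': bool, 'quality': int}.
    Each refinement step costs one bandwidth unit and raises quality by one."""
    ordered = sorted(regions, key=lambda r: not r["interesting"])  # ROI first
    while budget > 0:
        pending = [r for r in ordered if r["quality"] < target_quality]
        if not pending:
            break                      # everything at acceptable detail
        pending[0]["quality"] += 1     # refine the highest-priority region
        budget -= 1
    return regions

frame = [{"interesting": i in (2, 3), "quality": 0} for i in range(6)]
print(allocate_refinements(frame, budget=12))  # regions 2 and 3 reach 5 first
```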
As will be evident to one of ordinary skill in the art in view of this disclosure, a variety of techniques can be used to correlate the characteristics of the regions of the secondary image 120 to the level of detail of the regions of the video image 110. A filtering, or interpolation, of the characteristics of the regions of the secondary image 120 can be used to determine a corresponding quantization factor for each region, or block, of the video image 110, to minimize discontinuities at the edges of each region of the secondary image 120, using techniques common in the art. In an explicit object-identification scheme, the secondary image 120 can be used as a "front-end" filter to a conventional object-recognition application, as sketched below. In such an embodiment, the object-recognition application is configured to prioritize the search for potential objects to the areas of interest identified by the characteristics of the regions of the secondary image 120. Similarly, if the object-recognition application is designed to find objects that are known to correspond to a minimum size area relative to the secondary image 120, the search can be restricted to the areas of the secondary image that contain contiguous blocks having the desired characteristic that occupy the minimum size area. When the object-recognition application recognizes an object of interest, the encoder 150 can encode the individual regions of the video image 110 at a finer level of detail, or, if the encoding directly supports object-dependent encoding, such as an MPEG-4 encoding, the encoder 150 encodes the identified regions as an explicit object, with an associated quantization parameter. The specific details of the encoding and its associated level of detail dependencies will be dependent upon the particular encoding scheme employed, and other techniques for optimizing the level of detail based on an identification of an object or region of interest will be evident to one of ordinary skill in the art in view of this disclosure.
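A minimal sketch of this front-end filter, assuming the thermal map is a numeric array: contiguous hot cells are grouped by connected-component labelling, and only groups meeting a minimum area are handed to the (hypothetical) object-recognition search. The threshold and area values are illustrative.

```python
# Restrict the object search to contiguous hot areas of the thermal map
# that exceed a heat threshold and a minimum size.
import numpy as np
from scipy import ndimage

def candidate_regions(thermal, threshold=30.0, min_area=16):
    """Yield bounding-box slices of contiguous hot areas worth searching."""
    hot = thermal > threshold
    labels, n_blobs = ndimage.label(hot)              # connected components
    for i, sl in enumerate(ndimage.find_objects(labels)):
        if np.sum(labels[sl] == i + 1) >= min_area:   # enforce minimum size
            yield sl                                   # search only this area

thermal = np.zeros((64, 64))
thermal[20:30, 15:25] = 37.0      # a warm-body-sized blob of hot cells
thermal[5, 5] = 37.0              # an isolated hot cell: filtered out as noise
for sl in candidate_regions(thermal):
    print(sl)                     # only the 10x10 blob is reported
```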
FIG. 2 illustrates an example camera system 200 in accordance with this invention. The camera system 200 includes a camera 210 for collecting video images (110 in FIG. 1), and a secondary sensor 220 for collecting secondary images (120 in FIG. 1). In order for the secondary image 120 to correspond to the video image 110, the field of view 215 of the camera 210 and the field of view 225 of the sensor 220 should substantially correspond. In an ideal embodiment, the same optic system that is used by the camera to produce the video image 110 would be used to produce the secondary image 120, via a sensor 220 that is integral to the camera 210, so that an exact correspondence can be achieved. However, as illustrated in FIG. 2, an exact correspondence is not required. FIG. 2 illustrates a secondary sensor 220 that is adjacent the camera 210, illustrative of a configuration for a sensor 220 that is provided as an "option" to a conventional video camera 210, or as a removable item on a camera 210 that includes an integral encoder (150 of FIG. 1) in accordance with this invention.
Depending upon the particular configuration of the sensor 220 relative to the camera 210, there will be a region 275 wherein the fields of view 215, 225 substantially correspond. Within this region 275, the correspondence between the images 110, 120 is substantially linear, as illustrated in FIG. 1. Depending upon the accuracy desired, a mapping between the images 110, 120 in regions beyond the substantially corresponding region 275 can be defined in terms of a more complex coordinate transformation, using approximation techniques common in the art. If the camera 210 has a variable-zoom capability, the field of view 215 will contract or expand accordingly. In an ideal embodiment, the change of zoom in the camera 210 will effect a corresponding change of the field of view 225 of the secondary sensor. Alternatively, in a lower-cost embodiment, the field of view 225 may be fixed. In this embodiment, the field of view 225 is set to a "typical" field, within which objects of interest are likely to appear. The regions of the video image 110 in the field of view 215 of the camera 210 that are beyond the field of view 225 of the secondary sensor 220, because of a zoomed-out setting of the camera 210, in this embodiment are set to a default coarse level of detail setting. In like manner, regions of the secondary image 120 that are beyond the field of view 215 of the camera 210, because of a zoomed-in setting of the camera 210, are ignored, except as necessary to effect the aforementioned interpolation of characteristic values to prevent edge discontinuities. Ancillary methods for improving the correlation between the images 110, 120 may also be used. For example, the appropriate coordinate transformation may be determined by comparing characteristics of the images 110, 120 and using least-square-error curve fitting techniques, common in the art, to determine the appropriate parameters of the coordinate transformation between the images 110, 120.
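The least-square-error fitting mentioned above can be sketched under the assumption that the correspondence within the region 275 is a per-axis scale and offset (video = a * sensor + b). The matched coordinate pairs below are made up for illustration; in practice they would come from comparing characteristics of the images 110, 120.

```python
# Fit the per-axis parameters of the coordinate transformation between the
# sensor image and the video image from matched point pairs.
import numpy as np

def fit_axis(sensor_coords, video_coords):
    """Least-squares fit of video = a * sensor + b for one axis."""
    A = np.vstack([sensor_coords, np.ones_like(sensor_coords)]).T
    (a, b), *_ = np.linalg.lstsq(A, video_coords, rcond=None)
    return a, b

sensor_x = np.array([3.0, 17.0, 40.0, 60.0])
video_x = sensor_x * 5.2 + 12.0 + np.random.normal(0.0, 0.5, 4)  # noisy matches
print(fit_axis(sensor_x, video_x))  # recovers roughly (5.2, 12.0)
```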
Any of a variety of devices, common in the art, may be used to provide the secondary sensor 220 of FIG. 2 for creating the secondary image 120 of FIG. 1. In the infrared field, thermal imaging arrays are commonly available. Commonly available thermal arrays provide images (120 in FIG. 1) having 64x64 regions (121); larger and smaller arrays are also available. US patent 6,031,231 "INFRARED FOCAL PLANE ARRAY", issued 29 February 2000 to Kimata et al., and incorporated by reference herein, provides an overview of two-dimensional infrared focal plane arrays of temperature detecting units that are arranged on semiconductor substrates. US patent 4,868,391 "INFRARED LENS ARRAYS", issued 19 September 1989 to Antoine Y. Messiou, and incorporated by reference herein, provides an array of fresnel lenses that are arranged at different angles to provide a wide field of view, the array being configured as a substantially flat sheet. In the '391 patent, each of the lenses has a common focal point, energizing a single temperature detecting unit. In a preferred low-cost embodiment of this invention, an array of fresnel lenses is arranged to direct thermal energy to a plurality of temperature detecting units on a semiconductor substrate. The output from the temperature detecting units corresponds to the image 120 of FIG. 1.
Note that the fields of view of the individual detecting units within the sensor 220 need not be uniform. That is, for example, in a preferred embodiment of this invention, the fresnel lenses corresponding to the perimeter regions of the image 120 have a wider field of view than the fresnel lenses corresponding to the center region of the image 120, because it is likely that objects or regions of interest will generally be located near the center of the video image 110. Note, also, that the sensor 220 may correspond to a conventional infrared camera. In such an embodiment, the infrared camera 220 and the video camera 210 are mounted on a common carrier, and controlled by a common control system. Each of the cameras 210, 220 provides its corresponding image 110, 120 to an encoder 150 for processing as discussed above. The encoder 150 may be located in a device that reads the images 110, 120 directly from the camera 210 and sensor 220, and may be embedded within either the camera 210 or sensor 220. In like manner, the encoder 150, camera 210, and sensor 220 may be embodied as a single device. The encoder 150 may also be an independent device that acquires the images 110, 120 from recordings or transmissions from the camera 210 and sensor 220. Preferably, a time-stamp is provided for each image 110, 120, to facilitate a synchronization between the video images 110 and secondary images 120. Note that the frame rate of the camera 210 and sensor 220 need not be identical, provided only that the secondary images 120 can be substantially correlated in time with the video images 110. These and other system configuration options will be evident to one of ordinary skill in the art in view of this disclosure.
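One simple way to achieve the time correlation just described (an assumption for illustration, not specified in the text) is to pair each video frame with the secondary image whose time-stamp is nearest, so the two frame rates need not match:

```python
# Pair each video frame with the secondary image nearest in time.
import bisect

def nearest_secondary(video_ts, secondary_ts):
    """Index of the secondary frame closest in time; secondary_ts is sorted."""
    i = bisect.bisect_left(secondary_ts, video_ts)
    return min((j for j in (i - 1, i) if 0 <= j < len(secondary_ts)),
               key=lambda j: abs(secondary_ts[j] - video_ts))

sec_ts = [0.0, 0.5, 1.0, 1.5]              # thermal sensor at 2 frames/s
for t in (0.03, 0.36, 0.70, 1.03):         # video frame capture times
    print(t, "->", nearest_secondary(t, sec_ts))
```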
FIG. 3 illustrates an example flow diagram of an encoding system in accordance with this invention. For convenience and ease of understanding, this flow diagram is presented with reference to the objects of FIGs. 1 and 2, and in the context of a straightforward MPEG encoding, without the details of alternative embodiments presented above. As would be evident to one of ordinary skill in the art, the invention is not limited to this example.
At 310, the correspondence between the secondary image 120 and the video image 110 is determined, as discussed above. At 320, a default quantization factor is determined. This default quantization factor corresponds to a quantization step size in a conventional MPEG encoding that produces a relatively coarse level of detail. This default factor may be determined based on available bandwidth, prior image quality, overall complexity or dynamics of prior images, and so on. For convenience, this default quantization factor is allocated to each region of the video image 110, at 330, and then selectively modified, via the loop 340-360, based on the characteristics of the secondary image 120, such as a thermal-object-outline derived from the secondary image 120.
Each region 121 of the secondary image 120 is successively processed in the loop 340-360. In this example, a simple threshold test, at 345, is used to determine whether each region corresponds to a "region of interest". Each region 121 of the secondary image 120 has an associated characteristic, such as a resistance or a voltage corresponding to the detected heat within the region 121, and a measure of this characteristic is used to determine whether or not the region is a "region of interest". If the measure exceeds the threshold, the quantization factor of the corresponding regions 111 of the video image 110 is adjusted so as to effect an encoding at a finer level of detail, at 350. As noted above, the loop 340-360 may be replaced by a continuous determination of an appropriate quantization factor for each region 111 of the video image 110 based on an interpolation of the measures of each region 121. In like manner, the loop 340-360 may be replaced or augmented by a fuzzy logic system as discussed in US patent 5,475,433, or a progressive approach as discussed in copending application 09/220,292, discussed above. In like manner, the loop 340-360 may be replaced by a conventional object-recognition system that uses the measures of the characteristics of the image 120 to facilitate an efficient object search, also discussed above.
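Steps 320 through 360 can be condensed into the following sketch, which assigns a coarse default quantization step to every video block (330) and a finer step wherever the corresponding thermal cell exceeds the threshold (345, 350). The specific step sizes, the threshold, and the linear region mapping are illustrative assumptions.

```python
# Build a per-block quantization map for one video frame from the thermal
# image, following the flow of FIG. 3.
import numpy as np

def quantization_map(thermal, video_blocks=(30, 22),
                     threshold=30.0, default_q=64, fine_q=8):
    """Per-block quantization steps for one video frame."""
    q = np.full(video_blocks, default_q)      # step 330: coarse default
    th_h, th_w = thermal.shape
    for r in range(th_h):                     # loop 340-360 over regions 121
        for c in range(th_w):
            if thermal[r, c] > threshold:     # threshold test 345
                br = r * video_blocks[0] // th_h
                bc = c * video_blocks[1] // th_w
                q[br, bc] = fine_q            # step 350: finer detail here
    return q

thermal = np.zeros((64, 64))
thermal[20:30, 15:25] = 37.0                  # warm body seen by the sensor
print(np.unique(quantization_map(thermal), return_counts=True))
```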
At 370, the video image 110 is encoded, using the quantization factors determined above based on the secondary image 120. The encoding and quantization factors may also be dependent on other parameters, such as available bandwidth, degree of complexity and movement, and so on, using techniques common in the art, or as disclosed in the copending US patent application 09/220,292.
The foregoing merely illustrates the principles of the invention. Other embodiments and applications will be evident to one of ordinary skill in the art in view of this disclosure. For example, although the invention is presented in terms of optimizing the bandwidth required to transmit images, the encoding schemes presented herein are equally applicable for optimizing the storage requirements for storing images, and can be used to optimize the capacity of recording media, such as video tape. It will thus be appreciated that those skilled in the art will be able to devise various arrangements which, although not explicitly described or shown herein, embody the principles of the invention and are included within the scope of the following claims.

Claims

CLAIMS:
1. A video encoding system (100) that is configured to receive at least one video image (110) and at least one corresponding secondary image (120), comprising: an encoder (150) that encodes each region (111) of a plurality of regions of the video image (110) using an encoding parameter (160) that is dependent upon a characteristic of a corresponding region (121) of the secondary image (120), and produces thereby an encoding of the video image (110).
2. The video encoding system (100) of claim 1, further including: an image detector (210) that is sensitive to visible light within a first field of view (215), and thereby produces the at least one video image (110) corresponding to the first field of view (215), and a heat detector that is sensitive to infrared emissions within a second field of view (225) that substantially corresponds to at least a portion of the first field of view (215) of the image detector (210), and thereby produces the corresponding secondary image (120).
3. The video encoding system (100) of claim 1, wherein the at least one corresponding secondary image (120) provides an object-related pattern (101), and the encoder (150) is configured to encode objects within the video image (110) based on the object-related pattern (101).
4. The video encoding system (100) of claim 3, further including an object-recognition system that facilitates a recognition of the object-related pattern (101) based on the at least one corresponding secondary image (120).
5. The video encoding system (100) of claim 1, wherein the encoder (150) is further configured to encode each region (111) of the plurality of regions based on at least one of: a motion parameter, a complexity parameter, a brightness parameter, and a bandwidth parameter.
6. The video encoding system (100) of claim 1, wherein the encoding parameter (160) corresponds to a level of detail of the encoding of the video image (110).
7. The video encoding system (100) of claim 6, wherein the characteristic of the corresponding region (121) of the secondary image (120) is a measure of a temperature associated with the corresponding region (121) of the secondary image (120).
8. The video encoding system (100) of claim 7, wherein the encoder (150) is further configured to encode each region (111) of the plurality of regions based on at least one of: a motion parameter, a complexity parameter, a brightness parameter, and a bandwidth parameter.
9. A camera system (200) comprising: a video camera (210) that collects video images (110) corresponding to a first field of view (215) of the video camera (210), a secondary detector (220), operably attached to the video camera (210), that collects secondary images (120) corresponding to a second field of view (225) that substantially corresponds to at least a segment of the first field of view (215), to facilitate a subsequent recognition of regions of interest (101) within the video images (110), based on the associated secondary images (120).
10. The camera system (200) of claim 9, wherein the secondary detector (220) comprises a thermal detector.
11. The camera system (200) of claim 9, further including an encoder (150) that is configured to encode the video images (110) in dependence upon characteristics of the corresponding secondary images (120) and to produce thereby an encoded output.
12. The camera system (200) of claim 11, further including at least one of: a transmitter that is configured to transmit the encoded output to a receiver, and a recorder that is configured to store the encoded output.
13. The camera system (200) of claim 11, further including an object recognition system that uses the secondary images (120) to facilitate a recognition of an object-related pattern (101), and wherein the encoder (150) is configured to encode objects within the video image (110) based on the object-related pattern (101).
14. The camera system (200) of claim 11, wherein the encoder (150) is further configured to encode the video images (110) based on at least one of: a motion parameter, a complexity parameter, a brightness parameter, and a bandwidth parameter.
15. The camera system (200) of claim 11, wherein the characteristics of the corresponding secondary images (120) correspond to a measure of thermal emissions within the second field of view (225).
16. The camera system (200) of claim 11, wherein the encoder (150) is configured to encode the video images (110) using quantization factors that are dependent upon the characteristics of the corresponding secondary images (120).
17. The camera system (200) of claim 16, wherein the quantization factors are further dependent upon at least one of: a motion parameter, a complexity parameter, a brightness parameter, and a bandwidth parameter.
18. A method of encoding a video image (110) comprising: receiving a secondary image (120) corresponding to at least a portion of the video image (110), determining (310) the correspondence between the secondary image (120) and the video image (110), associating (350) an encoding factor to each region (111) of a plurality of regions of the video image (110) in dependence upon a characteristic of a corresponding region (121) of the secondary image (120), and encoding (370) each region (111) of the plurality of regions of the video image (110) based on the associated encoding factor.
19. The method of claim 18, wherein the secondary image (120) comprises a thermal map.
20. The method of claim 18, wherein the encoding parameter (160) affects a level of detail of the encoding of each region (111) of the video image (110).
PCT/EP2001/008538 2000-08-08 2001-07-23 Video encoder using image from a secondary image sensor Ceased WO2002013535A2 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
KR1020027004498A KR20020064794A (en) 2000-08-08 2001-07-23 Using a secondary sensor for optimized video communications
JP2002518086A JP2004506354A (en) 2000-08-08 2001-07-23 Use of secondary sensors for optimal image communication
EP01969495A EP1310102A2 (en) 2000-08-08 2001-07-23 Video encoder using image from a secondary image sensor

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US63468200A 2000-08-08 2000-08-08
US09/634,682 2000-08-08

Publications (2)

Publication Number Publication Date
WO2002013535A2 true WO2002013535A2 (en) 2002-02-14
WO2002013535A3 WO2002013535A3 (en) 2002-06-13

Family

ID=24544796

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2001/008538 Ceased WO2002013535A2 (en) 2000-08-08 2001-07-23 Video encoder using image from a secondary image sensor

Country Status (5)

Country Link
EP (1) EP1310102A2 (en)
JP (1) JP2004506354A (en)
KR (1) KR20020064794A (en)
CN (1) CN1393111A (en)
WO (1) WO2002013535A2 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2658245A1 (en) * 2012-04-27 2013-10-30 BlackBerry Limited System and method of adjusting camera image data
US8994845B2 (en) 2012-04-27 2015-03-31 Blackberry Limited System and method of adjusting a camera based on image data
CN109727417A (en) * 2017-10-27 2019-05-07 安讯士有限公司 Method and controller for controlling a video processing unit to facilitate detection of newcomers

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2050088B1 (en) * 2006-07-28 2015-11-11 Koninklijke Philips N.V. Private screens self distributing along the shop window

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5764803A (en) * 1996-04-03 1998-06-09 Lucent Technologies Inc. Motion-adaptive modelling of scene content for very low bit rate model-assisted coding of video sequences
AUPP340798A0 (en) * 1998-05-07 1998-05-28 Canon Kabushiki Kaisha Automated video interpretation system
US6496607B1 (en) * 1998-06-26 2002-12-17 Sarnoff Corporation Method and apparatus for region-based allocation of processing resources and control of input image formation

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2658245A1 (en) * 2012-04-27 2013-10-30 BlackBerry Limited System and method of adjusting camera image data
US8994845B2 (en) 2012-04-27 2015-03-31 Blackberry Limited System and method of adjusting a camera based on image data
CN109727417A (en) * 2017-10-27 2019-05-07 安讯士有限公司 Method and controller for controlling a video processing unit to facilitate detection of newcomers
US11164008B2 (en) 2017-10-27 2021-11-02 Axis Ab Method and controller for controlling a video processing unit to facilitate detection of newcomers in a first environment

Also Published As

Publication number Publication date
EP1310102A2 (en) 2003-05-14
KR20020064794A (en) 2002-08-09
JP2004506354A (en) 2004-02-26
WO2002013535A3 (en) 2002-06-13
CN1393111A (en) 2003-01-22

Similar Documents

Publication Publication Date Title
US8416303B2 (en) Imaging apparatus and imaging method
US8605185B2 (en) Capture of video with motion-speed determination and variable capture rate
US5751378A (en) Scene change detector for digital video
EP1431912B1 (en) Method and system for determining an area of importance in an archival image
US6961083B2 (en) Concurrent dual pipeline for acquisition, processing and transmission of digital video and high resolution digital still photographs
US20080129857A1 (en) Method And Camera With Multiple Resolution
EP0725536A2 (en) Method and apparatus for image sensing with dynamic range expansion
US6873727B2 (en) System for setting image characteristics using embedded camera tag information
US20060140445A1 (en) Method and apparatus for capturing digital facial images optimally suited for manual and automated recognition
US7733380B1 (en) Method and/or architecture for controlling encoding parameters using integrated information from camera ISP
US20070092244A1 (en) Camera exposure optimization techniques that take camera and scene motion into account
US20080240586A1 (en) Image distribution apparatus, communication terminal apparatus, and control method thereof
US20090290645A1 (en) System and Method for Using Coded Data From a Video Source to Compress a Media Signal
TW200939779A (en) Intelligent high resolution video system
EP1425707A2 (en) Image segmentation by means of temporal parallax difference induction
US20030169818A1 (en) Video transcoder based joint video and still image pipeline with still burst mode
WO2008107713A1 (en) Controlled high resolution sub-image capture with time domain multiplexed high speed full field of view reference video stream for image based biometric applications
JPH05191718A (en) Image pickup device
US8120675B2 (en) Moving image recording/playback device
US7889265B2 (en) Imaging apparatus, control method for the imaging apparatus, and storage medium storing computer program which causes a computer to execute the control method for the imaging apparatus
WO2002013535A2 (en) Video encoder using image from a secondary image sensor
JP4243034B2 (en) Encoder
CN106101530A (en) A kind of method that high-speed adaptability night vision image strengthens
JPH03230691A (en) Digital electronic still camera
KR100457302B1 (en) Auto tracking and auto zooming method of multi channel by digital image processing

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): CN JP KR

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR

ENP Entry into the national phase

Ref country code: JP

Ref document number: 2002 518086

Kind code of ref document: A

Format of ref document f/p: F

WWE Wipo information: entry into national phase

Ref document number: 1020027004498

Country of ref document: KR

121 Ep: the epo has been informed by wipo that ep was designated in this application
WWE Wipo information: entry into national phase

Ref document number: 01802968X

Country of ref document: CN

AK Designated states

Kind code of ref document: A3

Designated state(s): CN JP KR

AL Designated countries for regional patents

Kind code of ref document: A3

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR

WWP Wipo information: published in national office

Ref document number: 1020027004498

Country of ref document: KR

WWE Wipo information: entry into national phase

Ref document number: 2001969495

Country of ref document: EP

WWP Wipo information: published in national office

Ref document number: 2001969495

Country of ref document: EP

WWW Wipo information: withdrawn in national office

Ref document number: 2001969495

Country of ref document: EP