WO2000056070A1

WO2000056070A1 - Videophone with audio source tracking

Info

Publication number: WO2000056070A1
Application number: PCT/US2000/007384
Authority: WO
Inventors: Daisuke Terasawa
Original assignee: Qualcomm Inc
Current assignee: Qualcomm Inc
Priority date: 1999-03-18
Filing date: 2000-03-20
Publication date: 2000-09-21
Anticipated expiration: 2001-09-18
Also published as: AU4016800A

Abstract

A video teleconferencing device (100) uses pattern recognition technology and image tracking technology to generate control signals indicative of the relative position of the user's mouth with respect to an audio input device (106). The directionality of the audio input device (106) is altered to compensate for changes in the position of the user's mouth. A self-contained wireless version of the device includes a transmitter and receiver. A desktop version of the device may be wireless or may include a connector to provide a network connection.

Description

VIDEOPHONE WITH AUDIO SOURCE TRACKING

FIELD OF THE INVENTION

The present invention is related generally to a videophone and, more particularly, to a videophone having a tracking audio input device.

BACKGROUND OF THE INVENTION

Wireless communication devices, such as cellular telephones and other personal communication devices are widely used as a supplement to, or replacement for, conventional telephone systems. In addition to functioning as a replacement for a conventional telephone, wireless communication devices offer the advantage of portability, thus enabling the user to establish a wireless communication link between virtually any two locations on earth.

In addition to conventional voice communication, wireless communication devices also provide features such as voicemail. Other, more advanced wireless communication devices offer the opportunity for video teleconferencing. These "videophones" include a small, solid-state camera and a video display that enable the user to conduct a video teleconference via a wireless communication link. Operation of such a videophone requires the device be located at some distance from the user to enable the solid-state video input device to capture an image of the user. At such distances, audio reception becomes more difficult. Therefore, it can be appreciated that there is a significant need for a system and method that will allow satisfactory operation of the audio portion of the system. The present invention provides this and other advantages as will be apparent from the following detailed description and accompanying figures.

SUMMARY OF THE INVENTION

The present invention is directed to an audiovisual communication device and comprises a receiver to receive data, including video data, from a location remote from the communication device. A display is coupled to the receiver to display video images corresponding to the received video data. A video input device is provided to sense a video image of a user and to generate video data corresponding to the sensed video image. An audio input device is also provided to sense speech signals from the user. The audio input device is responsive to control signals to orient a directional sensitivity of the audio input device toward the mouth of the user. The system further comprises an image recognition processor to analyze the generated video data and thereby identify and track the position of the user's mouth. The image recognition processor generates control signals for the audio input device to orient the directional sensitivity of the audio input device toward the mouth of the user.

The received data may also include audio data. The device can include an audio output device to provide audible signals relating to the received audio data. The device may also include a transmitter to transmit electrical signals generated by the audio input device and may also transmit the video data generated by the video input device.

The device may further comprise a connector that couples the device to a network, such as a public switched telephone network. Alternatively, the receiver and transmitter may be wireless devices that receive and transmit data via a wireless connection. In one embodiment, the wireless communication device may be contained within a housing sized to be held in the hand of the user.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a perspective view of a wireless communication device having audio and video capabilities.

FIG. 2 illustrates the operation of the wireless communication device of FIG. 1 to capture a video image of the user and to direct the audio input device toward the user's mouth.

FIG. 3 is a functional block diagram of an exemplary embodiment of the wireless communication device of the present invention.

FIG. 4 illustrates the operation of a directional audio input device that is maneuvered to direct the audio input device toward the user's mouth. FIG. 5 is a perspective view of a desktop version of the communication device of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

The present invention is directed to a technique for adjusting an audio input device so as to track the position of the user's mouth and thereby enhance the detection of the user's voice. The system described herein operates in conjunction with a video input device and uses image processing technology to track the position of the user's mouth and thereby direct the audio input device in a manner that tracks the position of the user's mouth. Although described herein as a cellular telephone, those skilled in the art can appreciate that the present invention is applicable to other forms of communication, such as PCS, mobile radio telephones, mobile radio, and the like.

The present invention is embodied in a wireless communication system 100 illustrated in FIG. 1. The system 100 includes a housing 102 that contains many components, such as an audio output device 104, an audio input device 106, and a keypad 110. A transmitter 128 and receiver 130 (see FIG. 3) are also contained within the housing 102. The transmitter 128 and receiver 130 are coupled to an antenna 112 illustrated in FIG. 1 in the extended or operational position.

The system 100 also includes a display 116 and a video input device 118. In an exemplary embodiment, the display 116 is a liquid crystal display (LCD). Unlike many conventional wireless communication devices, the display 116 is a high resolution color display to allow the display of video images received by the receiver 130 (see FIG. 3). The video input device 118 may be a conventional vidicon or charge-coupled device (CCD) to detect the image of the user. As will be described in greater detail below, image processing technology is used to track the position of the user's mouth and within the detected image to generate control signals related thereto. The control signals are used to direct the audio input device 106.

The principles of operation of the system 100 may be more readily understood with respect to FIG. 2. It should be noted that FIG. 2 is not drawn to scale and does not accurately represent the relative size and position of the user's head with respect to the system 100. However, FIG. 2 is provided merely to illustrate the fundamental principles of operation of the system 100. In normal operation, the system 100 may be conveniently contained in the user's hand and held at arm's length from the user's head. This allows the video input device 118 to have a sufficient area of coverage 120 that includes the user's entire head. When held at arm's length, the user's arm may become fatigued and shake, resulting in a wobbly picture. Known technologies may be readily implemented with the system 100 to provide a stabilized video image. Another drawback of extended use is that the shaking hand of the user causes the position and orientation of the audio input device 106 to vary with respect to the user's mouth, resulting in voice dropout or other unreliable operation of the audio portion of the system.

To overcome this problem, the system 100 tracks the position of the user's mouth and generates control signals related thereto. For example, image processing and pattern recognition technologies can be used to identify the initial location of the user's mouth and to track changes in the position of the user's mouth with respect to the system 100. The image processing system generates control signals that are used to direct the audio input device 106 in the direction of the user's mouth to maximize the sensitivity of the audio input device in the direction of the user's mouth. The audio input device 106 is steerable (either electronically or mechanically) so that it provides a directivity pattern 122, as illustrated in FIG. 2. Various forms of the audio input device 106 are described below. The system 100 is illustrated in the functional block diagram of

FIG. 3 and includes a central processing unit (CPU) 124, which controls operation of the system. A memory 126, which may include both read-only memory (ROM) and random access memory (RAM), provides instructions and data to the CPU 124. A portion of the memory 126 may also include non- volatile random access memory (NVRAM).

Also contained within the housing 102 is the transmitter 128 and receiver 130 to allow transmission and reception of data, such as audio and video communications, between the system 100 and a remote location, such as a cell site controller (not shown). The transmitter 128 and receiver 130 may be combined into a transceiver 132. The antenna 112 is attached to the housing 102 and electrically coupled to the transceiver 132. Although the antenna 112 illustrated in FIG. 1 is extended from the housing 102, this is not necessary for satisfactory operation of the system 100. The antenna 112 may be a fixed antenna extending from the housing 102 or may be contained completely within the housing. The operation of the transmitter 128, receiver 130, and antenna 112 is well known in the art and need not be described herein.

As previously discussed, the audio output device 104 operates in a conventional manner to provide audio signals to the user. The audio output device 104 may be a conventional speaker or, alternatively, may be a headset worn by the user.

The display 116 may be used to display alphanumeric data as well as the video images. The receiver 130 receives data, which may include audio data, alphanumeric digital data, and video data. Those skilled in the art can appreciate that the three forms of data described above may all be transmitted as digitized data. That is, the audio data, video data, and alphanumeric data may be each digitized and transmitted to the system 100 in a well-known fashion. The receiver 130 may conveniently receive and demodulate the data to recover the audio, alphanumeric, and video data. The video data may be further processed for use with the display 116.

The system 100 also generates video data using the video input device 118. The video input device 118 can be any form of video device, such as a vidicon tube, charge coupled device (CCD) or the like. The present invention is not limited by the specific form of the video input device 118.

For operation as a video teleconferencing device, the video signal generated by the video input device 118 may be transmitted by the transmitter 128 to a location remote from the system 100.

In addition, the signals generated by the video input device 118 are analyzed by an image processor 138. The image processor 138 tracks the position of the user's mouth within the image and generates control signals related thereto. As will be discussed in greater detail below, the control signals generated by the image processor 138 are used to control the directionality of the audio input device 106. Image stabilization circuits are used in some conventional video cameras to stabilize the image even when the camera is shaking. For example, a hand-held camera is subject to vibrations due to the user's inability to hold the camera completely still. However, known image stabilization techniques track the position of a primary object within the field of view and adjust the video signal to compensate for variations in the position of the objects with respect to the video camera. Thus, minor vibrations and jitter are overcome to a large extent by the image stabilization techniques.

Similar techniques may be used by the system 100 to track the location of the user's mouth within the field of view even though the user's mouth may move with respect to the video input device 118 and with respect to the audio input device 106. As the position of the user's mouth changes, the change in position is detected by the video input device 118 and the relative change in position of the user's mouth is determined by the image processor 138. The control signals generated by the image processor 138 are coupled to a directional control circuit 140. The directional control circuit 140 directs the directivity pattern 122 (see FIG. 1) to track the user's mouth and thereby provide more reliable detection of the user's voice.

The various components described above are coupled together by a bus system 142. The bus system 142 may comprise a control bus, address bus, status bus, as well as a data bus. For the sake of clarity, the various buses are illustrated in FIG. 3 as the bus system 142. Those skilled in the art will appreciate that some of the components illustrated in FIG. 3 may be implemented by the CPU 124 executing instructions from the memory 126. For example, the image processor 138 and directional control unit 140 may be implemented by the CPU 124. Alternatively, these components may be independent devices. However, FIG. 3 illustrates each of these elements as a separate component since each performs a separate function.

The specific implementation details of the directional control circuit 140 depend on the form of the audio input device 106. For example, the audio input device 106 may comprise an array of microphones having relatively broad directional response. The directional control circuit 140 may comprise a phase delay circuit such that the combination of the audio input device 106 and directional control circuit 140 form a phased array microphone assembly. The control signals generated by the image processor 138 are used to set the delay times for individual delay lines and thereby focus the directivity pattern 122

(see FIG. 2) of the microphone array toward the mouth of the user. Such phased array microphone systems are known in the art and are described, by way of example, in "The Phased Array Microphone By Charge Transfer Device" by

Minoru Murayama et al., 1981.

Alternatively, the audio input device 106 may comprise a plurality of microphones that have an enhanced directional selectivity using beamforming technologies, such as described in "A Self-Steering Digital Microphone Array," by Walter Kellermann, IEEE, 1991, and in "Calibration, Optimization and DSP Implementation of Microphone Array for Speech Processing," by A. Wang et al., IEEE, 1966. The image processor 138 generates beamforming control signals that are used to focus the directivity pattern 122 (see FIG. 2) toward the user's mouth and thereby track changes in the position of the user's mouth.

In yet another alternative embodiment, illustrated in FIG. 4, the audio input device 106 is a single, highly-directional microphone or other audio input device. In this embodiment, the audio input device 106 is mounted so as to pivot in the X and Y directions. Motors 144 control the pivotal position of the audio input device 106. In this embodiment, the control signals generated by the image processor 138 are used to control the position of the motors 144 and in turn control the directivity pattern 120 (see FIG. 2) of the audio input device 106. Those skilled in the art will appreciate that it is desirable to minimize the power consumed by the motors 144 if the system 100 is battery powered. The motors 144 may be low-power motors or solid-state devices that minimize power consumption. If the system 100 is battery powered, other alternative embodiments described above may be used with an array of microphones.

Although described above as a wireless communication device that may be battery operated, the principles of the present invention may be extended to other forms of video teleconferencing devices, such as illustrated in FIG. 5. The system 100 illustrated in FIG. 5 may be line powered. In this embodiment, the display 116 may be a conventional video cathode ray tube (CRT) display. The video input device 118 may be contained within the housing 102 or may be a separate device, such as a video camera mounted atop the housing 102, or mounted in another convenient place. The audio output device 104 may also be contained within the housing 102 or a separate device, as illustrated in FIG. 5. The audio input device 106 may also be contained within the housing 102 or externally mounted. The external mounted audio input device may be particularly useful with an array of microphones.

Alternatively, the single microphone and motors 144 (see FIG. 4) may be contained within the housing 102 to minimize exposure to dirt and dust. The system 100 illustrated in FIG. 5 may also be a wireless communication system, as illustrated in the functional block diagram of FIG. 3. However, the system 100 may also include a connector 150 to hardwire the system to a network 152, such as a switched telephone network.

The operation of the system 100 advantageously allows the audio input device 106 to track the position of the user's mouth and thereby improve detection of the user's voice. If the user stands up or sits down, the video input device 118 detects the changes in the position of the user's mouth and generates appropriate signals related to such movement. The direction and amount of movement is determined by the image processor 138 and appropriate control signals are sent to the directional control circuit 140 so as to alter the directivity pattern 120 of the audio input device 106 in a corresponding manner. Similarly, if the user moves to the left or right, such movement is detected by the video input device 118 and the amount and direction of movement is determined by the image processor 138. The control signals generated by the image processor 138 are provided to the directional control circuit 140. Thus, the system 100 utilizes video image pattern recognition and image tracking technology to track the position of the user's mouth and generates appropriate control signals whereby the audio input device tracks the user's mouth.

It is to be understood that although various embodiments and advantages of the present invention have been set forth in the foregoing description, the above description is illustrative only, and changes may be made in detail, yet remain within the broad principles of the invention. For example, FIG. 1 illustrates the use of a stand alone wireless communication device while FIG. 4 illustrates a line powered video teleconferencing device. The principles of the present invention may be readily implemented in either embodiment. In addition, a multitude of known components may be used for the audio input and output devices and for the video input and output devices. Furthermore, a variety of known technologies may be used to direct the directivity pattern of the audio input device in the direction of the user's mouth. Therefore, the present invention is to be limited only by the appended claims.

What is claimed is:

Claims

1. A wireless audiovisual communication device comprising: a housing sized to be held in the hand of a user; a receiver to receive data, including video data, from a location remote from the communication device; a display mounted to the housing to display video images corresponding to the received video data; a video input device mounted to the housing to sense a video image of the user and to generate video data corresponding to the sensed video image; an audio input device mounted to the housing to sense speech signals from the user and to generate electrical signals related thereto, the audio input device having a controllable sensitivity in a selected direction; and a directional control circuit to orient the selected direction toward the mouth of the user.

2. The device of claim 1, further including an image recognition processor to analyze the generated video data to thereby identify and track the position of the user's mouth, the image recognition processor generating control signals for the directional control circuit to orient the selected direction toward the mouth of the user.

3. The device of claim 1 wherein the audio input device is a phased array audio input device and the directional control circuit supplies phase control signals to orient the selected direction toward the mouth of the user.

4. The device of claim 3 wherein the phased array audio input device comprises a plurality of microphones and the directional control circuit generates phase delay control signals to orient the selected direction toward the mouth of the user.

5. The device of claim 1 wherein the audio input device comprises a plurality of microphones and the directional control circuit comprises a beamforming circuit that generates control signals to orient the selected direction toward the mouth of the user.

6. The device of claim 1, further comprising a transmitter to transmit the electrical signals generated by the audio input device.

7. The device of claim 6 wherein the transmitter further transmits the video data generated by the video input device.

8. The device of claim 7 wherein the image recognition processor analyzes the video data generated by the video input device and generates an image stabilized video image and the transmitter transmits the stabilized video image.

9. An audiovisual communication device comprising: a receiver to receive data, including video data, from a location remote from the communication device; a display coupled to the receiver to display video images corresponding to the received video data; a video input device mounted to sense a video image of a user and to generate video data corresponding to the sensed video image; an audio input device mounted to sense speech signals from the user, the audio input device being responsive to control signals to orient a directional sensitivity of the audio input device toward the mouth of the user; and an image recognition processor to analyze the generated video data to thereby identify and track the position of the user's mouth, the image recognition processor generating control signals for the audio input device to orient the directional sensitivity toward the mouth of the user.

10. The device of claim 9, further comprising an audio output device coupled to the receiver, the received data also including received audio data that is provided to the audio output device.

11. The device of claim 9, further comprising a transmitter to transmit the electrical signals generated by the audio input device.

12. The device of claim 11 wherein the transmitter further transmits the video data generated by the video input device.

13. The device of claim 9, further comprising a connector to couple the audiovisual communication device to a network, the receiver receiving the video data via the network.

14. The device of claim 13 wherein the network is a switched telephone network.

15. The device of claim 9 wherein the receiver is a wireless receiver and receives the data from the remote location via a wireless connection.

16. A method for communicating with audiovisual communication device, the method comprising: receiving data, including video data, from a location remote from the communication device; displaying video images corresponding to the received video data; sensing a video image of a user and generating video data corresponding to the sensed video image; analyzing the generated video data to thereby identify and track the position of the user's mouth and generating control signals relating to the position of the user's mouth; orienting a directional sensitivity of an audio input device toward the mouth of the user in response to the control signals; and sensing speech signals from the user using the audio input device and generating electrical signals related thereto.

17. The method of claim 16 wherein the received data also includes audio data, the method further comprising generating audible signals based on the received audio data.

18. The method of claim 16, further comprising transmitting the electrical signals generated by the audio input device.

19. The method of claim 16, further comprising transmitting the generated video data.

20. The method of claim 16, further comprising coupling the audiovisual communication device to a network, the received data being received via the network.

21. The method of claim 20 wherein the network is a switched telephone network.

22. The method of claim 16 wherein the receiver is a wireless receiver and receives the data from the remote location via a wireless connection.

23. The method of claim 16 wherein the audio input device is a phased array audio input device and orienting the directional sensitivity comprises generating phase delay control signals to orient the directional sensitivity toward the mouth of the user.

24. The method of claim 16 wherein the audio input device comprises a plurality of microphones and orienting the directional sensitivity comprises generating beamforming signals to orient the directional sensitivity toward the mouth of the user.