WO2025058649A1

WO2025058649A1 - Displaying representation of user based on audio signal during videoconference

Info

Publication number: WO2025058649A1
Application number: PCT/US2023/074340
Authority: WO
Inventors: Ruofei DU; Xun Qian
Original assignee: Google LLC
Current assignee: Google LLC
Priority date: 2023-09-15
Filing date: 2023-09-15
Publication date: 2025-03-20
Anticipated expiration: 2026-03-15

Abstract

A method comprises generating, based on images captured by a camera during a videoconference, a three-dimensional model of a first user participating in the videoconference; determining, based on an audio signal from the first user, that the first user is speaking, via the videoconference, with a second user; and based on determining that the first user is speaking to the second user, displaying, to a third user, a representation of the first user and a representation of the second user where the representation of the first user is based on a rotation of the three-dimensional model of the first user so that the representation of the first user is facing toward the representation of the second user.

Description

DISPLAYING REPRESENTATION OF USER BASED ON AUDIO SIGNAL DURING VIDEOCONFERENCE

TECHNICAL FIELD

[0001] This description relates to videoconferencing.

BACKGROUND

[0002] Videoconferences can present representations of multiple participants while outputting audio based on speech by the participants.

SUMMARY

[0003] A system displays representations of users within a videoconference based on audio signals caused by user speech during the videoconference. The system generates three- dimensional models of the users based on images captured by cameras during the videoconference. The system rotates the models to generate representations of the users that are facing toward each other when users are speaking to each other. A determination that the users are speaking to each other is based on the audio signals.

[0004] A method includes generating, based on images captured by a camera during a videoconference, a three-dimensional model of a first user participating in the videoconference; determining, based on an audio signal from the first user, that the first user is speaking, via the videoconference, with a second user; and based on determining that the first user is speaking to the second user, displaying, to a third user, a representation of the first user and a representation of the second user where the representation of the first user is based on a rotation of the three- dimensional model of the first user so that the representation of the first user is facing toward the representation of the second user.

[0005] A non-transitory computer-readable storage medium includes instructions stored thereon that, when executed by at least one processor, are configured to cause a computing system to generate, based on images captured by a camera during a videoconference, a three-dimensional model of a first user participating in the videoconference; determine, based on an audio signal from the first user, that the first user is speaking, via the videoconference, with a second user; and based on determining that the first user is speaking to the second user, display, to a third user, a representation of the first user and a representation of the second user where the representation of the first user is based on a rotation of the three-dimensional model of the first user so that the representation of the first user is facing toward the representation of the second user.

[0006] A computing system includes at least one processor and a non-transitory computer- readable storage medium comprising instructions stored thereon. When executed by the at least one processor, the instructions are configured to cause the computing system to generate, based on images captured by a camera during a videoconference, a three-dimensional model of a first user participating in the videoconference; determine, based on an audio signal from the first user, that the first user is speaking, via the videoconference, with a second user; and based on determining that the first user is speaking to the second user, display, to a third user, a representation of the first user and a representation of the second user where the representation of the first user is based on a rotation of the three-dimensional model of the first user so that the representation of the first user is facing toward the representation of the second user.

[0007] A method includes generating, based on images captured by a camera during a videoconference, a three-dimensional model of a first user participating in the videoconference; determining, based on an audio signal from the first user, that the first user is speaking, via the videoconference, with a second user; and based on determining that the first user is speaking to the second user, displaying, a representation of the first user where the representation of the first user is based on a rotation of the three-dimensional model of the first user so that the representation of the first user is facing toward the second user.

[0008] A non-transitory computer-readable storage medium includes instructions stored thereon that, when executed by at least one processor, are configured to cause a computing system to generate, based on images captured by a camera during a videoconference, a three-dimensional model of a first user participating in the videoconference; determine, based on an audio signal from the first user, that the first user is speaking, via the videoconference, with a second user; and based on determining that the first user is speaking to the second user, display a representation of the first user where the representation of the first user is based on a rotation of the three- dimensional model of the first user so that the representation of the first user is facing toward the second user.

[0009] A computing system includes at least one processor and a non-transitory computer- readable storage medium comprising instructions stored thereon. When executed by the at least one processor, the instructions are configured to cause the computing system to generate, based on images captured by a camera during a videoconference, a three-dimensional model of a first user participating in the videoconference; determine, based on an audio signal from the first user, that the first user is speaking, via the videoconference, with a second user; and based on determining that the first user is speaking to the second user, display a representation of the first user where the representation of the first user is based on a rotation of the three-dimensional model of the first user so that the representation of the first user is facing toward the second user.

[0010] The details of one or more implementations are set forth in the accompanying drawings and the description below. Other features will be apparent from the description and drawings, and from the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

[0011] FIG. 1 is a network diagram showing computers of users participating in a videoconference.

[0012] FIG. 2A shows a display representing all three remote users of the videoconference.

[0013] FIG. 2B shows the display representing one remote user who is speaking to a local user during the videoconference.

[0014] FIG. 2C shows the display representing two remote users who are talking to each other during the videoconference.

[0015] FIG. 2D shows the display representing all three remote users of the videoconference, with an expanded representation of a remote user who is speaking.

[0016] FIG. 3 shows an end-to-end workflow for building a shared virtual meeting scene.

[0017] FIG. 4 shows a decision-tree algorithm for determining how to represent remote users during a videoconference.

[0018] FIG. 5 is a block diagram showing a computing system for representing remote users during a videoconference.

[0019] FIG. 6 is a flowchart showing a method for representing remote users during a videoconference.

[0020] FIG. 7 is a block diagram of a pipeline for generating a representation of the local user based on a depth map.

[0021] FIG. 8 is a diagram that includes a neural network for generating a depth map.

[0022] FIG. 9 shows a depth camera and a camera capturing images of a person to train the neural network.

[0023] FIG. 10 shows a pipeline for rendering the representation of the local user.

[0024] FIG. 11 is a block diagram of a computing device that generates a representation of the local user based on the depth map.

[0025] FIG. 12A shows a portrait depth estimation model.

[0026] FIG. 12B shows a resblock, included in the portrait depth estimation model of FIG.

12 A, in greater detail.

[0027] Like reference numbers refer to like elements.

DETAILED DESCRIPTION

[0028] Videoconferences can include multiple users speaking at different times. A technical problem with videoconferences is the difficulty in viewing or identifying the user(s) who is (or are) speaking. A technical solution to this technical problem is to identify the user(s) who is (or are) speaking based on audio signals received based on speech of the user(s). A technical benefit of identifying the user(s) who is (or are) speaking based on audio signals is that a computing system can present representations of the user(s) who is (or are) speaking with greater detail or focus without input from a user who is watching or listening.

[0029] A computing system can generate three-dimensional models of the faces or heads of users participating in the videoconference and send the three-dimensional models to the local computing systems. The three-dimensional models can reduce the data required to send representations of the users compared to images captured by cameras. The computing system can determine, based on audio signals, that a pairwise conversation is taking place between a first user and a second user. Based on determining that the pairwise conversation is taking place between a first user and a second user, the computing system can display representations of the first user and second user to a third user. The representation of the first user and second user can be based on rotations of the three-dimensional models of the faces of the first user and second user so that the representation of the first user is facing toward the second user and the representation of the second user is facing toward the first user. The first user and second user will then appear to the third user to be speaking to each other.

[0030] FIG. l is a network diagram showing computers of users participating in a videoconference. A local user 104 is sitting at a desk in front of a local computing device 102. The local computing device 102 includes a display 106 that can present and/or display representations of remote users participating in the videoconference. The local computing device 102 includes a camera 108 that can capture images and/or video of the local user 104. The local computing device 102 includes one or more human interface devices (HIDs) 110, such as a keyboard, mouse, and/or trackpad, that can receive and/or process input from the local user 104. While not labeled in FIG. 1, the local computing device 102 includes one or more microphones that can capture and/or process audio signals such as speech from the local user 104. While not labeled in FIG. 1, the local computing device 102 includes one or more speakers that can provide audio output, such as speech from remote users participating in the videoconference.

[0031] The local computing device 102 can communicate with a server 116 and multiple remote computing devices 120, 122, 124 via a network 114. The network 114 can include any network via which multiple computing devices communicate, such as a local area network (LAN), metropolitan area network (MAN), wide area network (WAN), or the Internet. The server 116 can facilitate videoconferences between multiple computing devices, such as the computing devices 102, 120, 122, 124. The remote computing devices 120, 122, 124 can have similar functionalities to the local computing device 102. The remote computing devices 120, 122, 124 can serve and/or interact with remote users, that is, users that are located remotely from the local user 104 and/or each other. In an example, a first remote computing device 120 can serve and/or interact with a first remote user 204A (shown in FIGs. 2A, 2B, 2C, and 2D), a second remote computing device 122 can serve and/or interact with a second remote user 204B (shown in FIGs. 2 A, 2C, and 2D), and/or a third remote computing device 124 can serve and/or interact with a third remote user 204C (shown in FIGs. 2A and 2D). While four computing devices serving four users are shown in FIG. 1, any number of users and/or computing devices can participate in a videoconference.

[0032] During the videoconference, the speaker(s) that is or are speaking can change. The local computing device 102 and/or server 116 can determine which speaker(s) is or are speaking based on audio signals received by the computing devices 102, 120, 122, 124. The local computing device 102 and/or server 116 can change the representation of the users based on which speaker(s) is or are speaking. For example, if the local user 104 is speaking, then the local computing device 102 can present representations of all of the other (remote) users participating in the videoconference. If the local user 104 is speaking to a single remote user, then the local computing device 102 can present a representation of only the single remote user with whom the local computing device 102 is speaking. If two remote users are talking to each other, which can be considered a pairwise conversation, then the local computing device 102 can present representations of only the two remote users who are talking to each other. When the local computing device 102 presents representations of only the two remote users who are talking to each other, the local computing device 102 and/or server 116 can rotate the representations of the heads and/or faces of the two remote users who are talking to each other so that the representations of the heads and/or faces of the two remote users are facing toward each other. If one of the remote users is speaking to the other users in general, then the local computing device 102 and/or server 116 can present representations of all of the remote users with an expanded version of the remote user who is speaking.

[0033] In some examples, the local computing device 102 and/or server 116 generates a three-dimensional model of a face, head, and/or other portion of the local user 104. The remote computing devices 120, 122, 124 and/or server 116 can generate three-dimensional models of the faces, heads, and/or other portions of the remote users interacting with the respective remote computing devices 120, 122, 124. In some examples, the local computing device 102 and remote computing devices 120, 122, 124 send the three-dimensional models to the server 116, reducing the data needed to send a representation of the respective user. The rotations of the heads and/or portions of the heads and/or faces can be performed on the three-dimensional models. In some examples, the rotations are performed by the server 116. In some examples, the server 116 sends the three-dimensional models to the computing devices 102, 120, 122, 124, and the rotations are performed by the computing devices 102, 120, 122, 124.

[0034] FIG. 2A shows the display 106 representing all three remote users 204 A, 204B, 204C of the videoconference. The videoconference includes the local user 104, a first remote user 204A interacting with the first remote computing device 120, a second remote user 204B interacting with the second remote computing device 122, and a third remote user 204C interacting with the third remote computing device 124.

[0035] In this example, the local computing device 102 and/or server 116 has determined that the local user 104 is speaking to the other (remote) users 204A, 204B, 204C participating in the videoconference. In this example, the local user 104 can be considered to be speaking to a general audience. In some examples, when the local user 104 is speaking to the general audience as in FIG. 2A, sizes of representations of users 204A, 204B, 204C can be reduced compared to sizes of representations of two users in a one-on-one (pairwise) conversation shown in FIG. 2C. The local computing device 102 and/or server 116 has determined that the local user 104 is speaking to the other (remote) users 204A, 204B, 204C participating in the videoconference based on audio signals, such as based on speech being recognized only from audio signals received by microphones included in the local computing device 102. Based on determining that the local user 104 is speaking to the other (remote) users 204A, 204B, 204C participating in the videoconference, the display 106 included in the local computing device 102 displays representations of all of the other (remote) users 204A, 204B, 204C participating in the videoconference. The representations of the other (remote) users 204A, 204B, 204C can be based on three-dimensional models of the users 204A, 204B, 204C. The representations of the users 204A, 204B, 204C can be presented against and/or within a shared background 202A.

[0036] FIG. 2B shows the display 106 representing one remote user 204A who is speaking to the local user 104 during the videoconference. The display 106 displays and/or presents the representation of the first remote user 204A against a background 202B generated by the local computing device 102, the server 116, and/or the first remote computing device 120.

[0037] In this example, the local computing device 102 and/or server 116 has determined that the first remote user 204A is speaking to the local user 104. The first remote user 204A and the local user 104 are engaging in a one-on-one (pairwise) conversation. The local computing device 102 and/or server 116 has determined that the first remote user 204A is speaking to the local user 104 based on audio signals received by the first remote computing device 120 and/or local computing device 102. In some examples, the local computing device 102 and/or server 116 determined that the first remote user 204A is speaking to the local user 104 based on audio signals and/or speech signals received and/or processed by speakers included in both the local computing device 102 and the first remote computing device 120, indicating that the local user 104 and the first remote user 204A are speaking to each other. In some examples, the local computing device 102 and/or server 116 determined that the first remote user 204A is speaking to the local user 104 based on recognizing a name, nickname, or title of the local user 104 spoken by the first remote user 204A. In some examples, the local computing device 102 and/or server 116 determined that the first remote user 204Ais speaking to the local user 104 based on a gaze of the first remote user 204A, such as the first remote user 204A looking at a representation of the local user 104 on a display included in the first remote computing device 120.

[0038] FIG. 2C shows the display 106 representing two remote users 204 A, 204B who are talking to each other during the videoconference. The two remote users 204 A, 204B are engaging in a one-one-one (pairwise) conversation with each other. In some examples, determining that the two remote users 204A, 204B are engaging in a one-one-one (pairwise) conversation with each other is based on silence from the local user 104 for at least a silence threshold period of time. In this example, the heads and/or faces of the remote users 204A, 204B are rotated so that the remote users 204A, 204B are facing each other. In this example, the head and/or face of the first remote user 204A is presented on a left portion of the display 106 and is rotated to face to the right, toward the right portion of the display 106 where the head and/or face of the second remote user 204B is presented. In this example, the head and/or face of the second remote user 204B is presented on a right portion of the display 106 and is rotated to face to the left, toward the left portion of the display 106 where the head and/or face of the first remote user 204A is presented. In the example shown in FIG. 2C, the representation of the first remote user 204A is shown against a background 202C generated by the local computing device 102, server 116, and/or first remote computing device 120. In the example shown in FIG. 2C, the representation of the second remote user 204B is shown against a background 203C generated by the local computing device 102, server 116, and/or second remote computing device 122. The display 106 presents a divider 206 between the backgrounds 202C, 203C.

[0039] In this example, the local computing device 102 and/or server 116 has determined that the first remote user 204A is speaking to the second remote user 204B in a pairwise conversation. The local computing device 102 and/or server 116 has determined that the first remote user 204A is speaking to the second remote user 204B based on audio signals received by the first remote computing device 120 and/or second remote computing device 122. In some examples, the local computing device 102 and/or server 116 determined that the first remote user 204Ais speaking to the second remote user 204B based on audio signals and/or speech signals received and/or processed by speakers included in both the first remote computing device 120 and second remote computing device 122, indicating that the first remote user 204A and second remote user 204B are speaking to each other. In some examples, the local computing device 102 and/or server 116 determined that the first remote user 204Ais speaking to the second remote user 204B based on recognizing a name, nickname, or title of the second remote user 204B spoken by the first remote user 204A and/or a name, nickname, or title of the first remote user 204A spoken by the second remote user 204B. In some examples, the local computing device 102 and/or server 116 determined that the first remote user 204A is speaking to the second remote user 204B based on a gaze of the first remote user 204A, such as the first remote user 204A looking at a representation of the second remote user 204B on a display included in the first remote computing device 120, and/or based on a gaze of the second remote user 204B, such as the second remote user 204B looking at a representation of the first remote user 204A on a display included in the second remote computing device 122.

[0040] In some examples, the second user, previously referred to as the second remote user, is in a same room and/or location as the local user 104. When the second user is in the same room as the local user 104, the display 106 will present the representation of the first remote user 204A, without presenting a representation of the second user. When the local computing device 102 and/or server 116 has determined that the first remote user 204A is speaking to the second user in a pairwise conversation, the local computing device 102 and/or server 116 will rotate the three-dimensional model so that the representation of the first remote user 204A appears to be looking at the second user. The local computing device 102 and/or server 116 can determine that the first remote user 204Ais speaking to the second user in a pairwise conversation in a similar manner as described herein with respect to the first remote user 204A and second remote user 204B. The local computing device 102 and/or server 116 may have determined a location of the second user based, for example, on images captured by the camera 108 included in the local computing device 102. The local computing device 102 and/or server 116 can rotate the three- dimensional model so that the representation of the first remote user 204A appears to be looking at the second user while the local user 104 is providing input to the local computing device 102 and/or in between inputs by the local user 104 into the local computing device 102.

[0041] FIG. 2D shows the display representing all three remote users 204A, 204B, 204C of the videoconference, with an expanded representation of the third remote user 204C who is speaking. In this example, the third remote user 204C can be considered to be speaking to a general audience. In this example, the display 106 presents representations of all three of the remote users 204A, 204B, 204C. The representation of the third remote user 204C, who is speaking, is enlarged, magnified, and/or expanded. The enlargement, magnification, and/or expansion of the representation of the third remote user 204C brings focus to the representation of the third remote user 204C so that the local user 104 will look at the representation of the user (the third remote user 204C) who is speaking. The display 106 presents the three remote users 204 A, 204B, 204C against a shared background 202D.

[0042] In this example, the local computing device 102 and/or server 116 has determined that the third remote user 204C is speaking to the other users 104, 204 A, 204B participating in the videoconference. The local computing device 102 and/or server 116 has determined that the third remote user 204C is speaking to the other users 104 204 A, 204B participating in the videoconference based on audio signals, such as based on speech being recognized only from audio signals received by microphones included in the third remote computing device 124.

[0043] FIG. 3 shows an end-to-end workflow for building a shared virtual meeting scene. The end-to-end workflow is shown from the perspective of the local computing device 102. A server 308 can correspond to the server 116 shown and described with respect to FIG. 1.

[0044] In some examples, a microphone 326 corresponds to the microphone included in the local computing device 102. In some examples, audio input 328 is received by the local computing device 102. In some examples, a camera 330 corresponds to the camera 108 included in the local computing device 102. In some examples, a color (e.g. red-green-blue/RGB) image 332 is captured by the camera 108 included in the local computing device 102. In some examples, a two-dimensional (2D) screen corresponds to the display 106. In some examples, a rendering camera 336 is presented by the 2D screen 334. In some examples, a portrait avatar 338 is presented by the 2D screen 334.

[0045] In some examples, a local data channel 302 is implemented by the local computing device 102. In some examples, a local audio channel 304 is implemented by the local computing device 102. In some examples, a local video channel 306 is implemented by the local computing device 102.

[0046] In some examples, a remote video channel 310 is implemented by the server 116, 308 and/or by one of the remote computing devices 120, 122, 124. In some examples, a remote audio channel 312 is implemented by the server 116, 308 and/or by one of the remote computing devices 120, 122, 124. In some examples, a remote data channel 314 is implemented by the server 116, 308 and/or by one of the remote computing devices 120, 122, 124.

[0047] The local computing device 102 sends audio input 328 to the local audio channel 304. The local computing device 102, server 116, and/or remote computing devices 120, 122, 124 can determine that the first user is speaking based on audio input 328 corresponding to speech by the first user 104.

[0048] The local computing device 102 can perform speech recognition on the audio input 328 to generate speech text 316. The speech text 316 can include a transcript or other text generated based on the audio input 328. The local computing device 102 can provide the speech text 316 to the local data channel 302. The local computing device 102, server 116, 308, and/or remote computing device(s) 120, 122, 124 can determine which remote user 204 A, 204B, 204C the local user 104 is speaking to based on content of the speech text 316. For example, if the speech text 316 includes a title, name, and/or reference to one of the remote users 204A, 204B, 204C, then the local computing device 102, server 116, 308, and/or remote computing device(s) 120, 122, 124 can determine that the local user 104 is speaking to the remote user 204A, 204B, 204C for whom the speech text 316 includes the title, name, and/or reference.

[0049] The camera 108, 330 can capture one or more color images 332 of the local user 104. The local computing device 102 and/or server 116, 308 can generate a depth image 318 of the local user 104 based on the color image(s) 332. The depth image 318 can include a three- dimensional model of the face, head, and/or other body parts of the local user 104. Generation of the depth image 318 and/or three-dimensional model is described below with respect to FIGs. 7 through 12B. The local computing device 102 can provide the color image(s) 332 and/or depth image 318 to the server 116, 308 via the local video channel 306.

[0050] The local computing device 102 can perform face detection on the color image(s) 332. The local computing device 102 can perform face detection by a facial detection algorithm, such as a genetic algorithm or eigen-face technique. After detecting the face of the local user 104, the local computing device 102 can determine a head position 320 of the local user 104. The head position 320 can include a location of the head of the local user 104 with respect to the camera 108, 330. The local computing device 102 can render the camera 108, 330 based in part on the determined head position 320. The local computing device 102 can render the camera 108, 330 by, for example, focusing on a face of the local user 104.

[0051] The server 116, 308 can communicate via the computing devices 102, 120, 122, 124 via video channels, including the local video channel 306 and a remote video channel 310 associated with each of the remote computing devices 120, 122, 124. In some examples, the server 116, 308 communicates with the computing devices 102, 120, 122, 124 via an audio and video communication protocol such as WebRTC (Web Real-Time Communication). The server 116, 308 can implement a custom shader 322 via the remote video channel 310. The custom shader 322 can modify the portrait avatar 338, such as by changing a color and/or shading of the portrait avatar 338. The local computing device 102 can also receive the three-dimensional models of the remote users 204A, 204B, 204C from the server 116, 308 and/or remote computing devices 120, 122, 124.

[0052] The portrait avatar 338 can include representations of the remote users 204 A, 204B, 204C. The portrait avatar 338 can be generated based on the determination of which remote users 204A, 204B, 204C should be presented in the 2D screen 334 and/or display 106. The determination of which remote users 204A, 204B, 204C should be presented in the 2D screen 334 and/or display 106 can be based on audio data received from the remote computing devices 120, 122, 124 via the remote audio channel 312, and/or based on determinations made by a speech- aware attention transition algorithm 324.

[0053] The speech-aware attention transition algorithm 324 can determine which remote users 204A, 204B, 204C the local computing device 102 should present representations of. The speech-aware attention transition algorithm 324 can determine which remote users 204A, 204B, 204C the local computing device 102 should present representations of based on audio signals, based on the speech text 316, and/or based on remote data received from the remote computing devices 120, 122, 124 via remote data channel 314. The remote data can include text, such as transcriptions of speech spoken by the remote uses 204 A, 204B, 204C. An example of the speech- aware attention transition algorithm 324 is described in greater detail with respect to FIG. 4.

[0054] FIG. 4 shows a decision-tree algorithm for determining how to represent remote users during a videoconference. The decision-tree algorithm shown in FIG. 4 is an example of the speech-aware attention transition algorithm 324. The decision-tree algorithm shown in FIG. 4 can be performed by the local computing device 102, the remote computing devices 120, 122, 124, the server 116, or tasks within the decision-tree algorithm can be distributed between the local computing device 102, the remote computing devices 120, 122, 124, and/or the server 116.

[0055] The decision-tree algorithm determines which users 104, 204A, 204B, 204C should be presented, and how the users 104, 204 A, 204B, 204C should be presented, on the display 106. The decision-tree algorithm determines which and how the users 104, 204A, 204B, 204C are presented by determining which users 104, 204A, 204B, 204C are speaking. In some examples, the decision-tree algorithm determines which users 104, 204A, 204B, 204C are speaking, and/or to whom the users 104, 204A, 204B, 204C are speaking, based on which computing devices 102, 120, 122, 124 receive audio input. In some examples, the decision-tree algorithm determines which users 104, 204 A, 204B, 204C are speaking, and/or to whom the users 104, 204 A, 204B, 204C are speaking, based on which computing devices 102, 120, 122, 124 receive audio input recognizable as human speech. In some examples, the decision-tree algorithm determines which users 104, 204A, 204B, 204C are speaking, and/or to whom the users 104, 204A, 204B, 204C are speaking, based on recognizable text transcribed from audio input received by the computing devices 102, 120, 122, 124, such as text addressing, identifying, and/or referring to other users 104, 204A, 204B, 204C. In some examples, the decision-tree algorithm determines which users 104, 204A, 204B, 204C are speaking, and/or to whom the users 104, 204A, 204B, 204C are speaking, based on gaze detection, such as determining which user representations a user 104, 204A, 204B, 204C is looking at. In some examples, a user can be considered to be speaking to another user for whom the user is looking at the representation. In some examples, a user can be considered to be listening to another user for whom the user is looking at the representation. The computing devices 102, 120, 122, 124 and/or server 116 can generate representations of the users, such as the representations shown in FIGs. 2A, 2B, 2C, and/or 2D, based on the determination of who is speaking to whom. When a determination of who is speaking to whom changes, the computing devices 102, 120, 122, 124 and/or server 116 can change the representations of the users.

[0056] The algorithm starts (402). After starting (402), the algorithm includes determining a speech state of the local user 104 (404). In some examples, the algorithm can determine the speech state of the local user 104 (404) based on whether audio signals are received and/or processed by the microphone included in the local computing device 102. In some examples, the algorithm can determine the speech state of the local user 104 (404) based on whether audio signals recognized as human speech are received and/or processed by the microphone included in the local computing device 102. In some examples, the algorithm can determine the speech state of the local user 104 (404) based on whether audio signals transcribed into speech text by the local computing device 102 are received and/or processed by the microphone included in the local computing device 102. In some examples, the algorithm can determine that the local user 104 is speaking based on receiving audio signals for at least a speech threshold period of time.

[0057] If the algorithm determines that the speech state of the local user 104 is talking and/or speaking, then the algorithm determines whether the local user 104 is making an announcement (406) to the remote users 204 A, 204B, 204C or the local user 104 is talking to a particular user (410). In some examples, the algorithm can determine that the local user 104 is talking to a particular user (410) based on gaze detection determining that the local user 104 is focusing a gaze of the local user 104 on a representation of a particular remote user in the display 106. In some examples, the algorithm can determine that the local user 104 is talking to a particular user (410) based on audio signals received by the local computing device 102 including a name, title, or reference to the particular user. In some examples, the algorithm can determine that the local user 104 is talking to a particular user (410) based on a gaze of the particular user focusing on a representation of the local user 104 in the remote computing device but gazes of the other remote users not focusing on the representation of the local user 104. If the algorithm determines that the local user 104 is talking to a particular user (410), then the display 106 of the local computing device 102 can present a representation of only the particular user (user ‘A’ in FIG. 4) and present the one-on-one view 454 with the one remote user 204A as shown in FIG. 2B. If the algorithm determines that the speech state of the local user 104 is talking but the local user 104 is not talking to a particular user, then the algorithm can determine that the local user 104 is making an announcement (406). The local user 104 making an announcement (406) can be considered to be speaking to a general audience. If the algorithm determines that the local user 104 is making an announcement (406), then the display 106 of the local computing device 102 can present a full view 452 of the remote users 204A, 204B, 204C, as shown in FIG. 2A.

[0058] If the algorithm determines that the speech state of the local user 104 is not talking and/or mute (412), then the algorithm can determine whether an announcement exists (414). The algorithm can determine whether an announcement exists (414) based on whether a single remote user 204A, 204B, 204C is talking to the remaining users 104, 204A, 204B, 204C. The algorithm can determine whether the single remote user 204A, 204B, 204C is talking to the remaining users 104, 204A, 204B, 204C based on audio signals and/or video signals. The algorithm can determine whether an announcement exists in a similar manner to determining whether the local user 104 is making an announcement (406) or talking to a particular user (410). In some examples, if no other users are talking, then the algorithm determines that an announcement exists. In some examples, if other users are talking, then an announcement does not exist. In some examples, if gazes of the remaining users are focused on the single remote user 204A, 204B, 204C who is talking, then the algorithm determines that an announcement exists. In some examples, if gazes of the remaining users are divided between two different users, then an announcement does not exist. If the algorithm determines that an announcement does exist, then the display 106 of the local computing device 102 will present all of the remote users (labeled uses ‘A’ and ‘B’ in FIG. 4) in a full view 456 of the remote users 204A, 204B, 204C shown in FIG. 2D. In the full view 456, the representation of the remote user 204C is expanded to draw attention of the local user 104 to the user 204C.

[0059] If the algorithm determines that an announcement does not exist, then the algorithm will determine whether a pair exists (416). A pair is two users talking to each other. The algorithm can determine whether a pair exists (416) based on determining whether two users, other than the local user 104, are talking to each other. In some examples, the algorithm determines that the two users are talking to each other based on alternations of receiving audio inputs from the remote computing devices 120, 122, 124 associated with the two remote users 204 A, 204B, 204C. In some examples, the algorithm determines that the two users are talking to each other based on gazes of the two users looking at representations of each other in the displays of their respective computing devices 120, 122, 124. In some examples, the algorithm determines that the two users are talking to each other based on text transcribed from audio signals received from the respective computing devices 120, 122, 124 identifying the other user, such as by name, nickname, or title.

[0060] If the algorithm determines that a pair does exist, then the algorithm can determine whether a single pair exists (418). The algorithm can determine whether a single pair exists (418) by determining whether pairs can be identified between two distinct pairs of users. If a single pair exists, then the display 106 of the local computing device 102 presents the two users in the pair (indicated as ‘A’ and ‘B’ in FIG. 4) in a pairwise view of 458. An example of the pairwise view is shown in FIG. 2C. In some examples, the local computing device 102, server 116, and/or remote computing device 120, 122, 124 rotates the representations of the users 204 A, 204B so that users 204A, 204A appear to be looking toward and/or facing each other. [0061] If the algorithm determines that the number of pairs is not equal to one, and/or more than one pair exists, then the display 106 of the local computing device 102 presents all of the remote users 204A, 204B, 204C (there could be four or more remote users) in a full view 460. In some examples, the local computing device 102, server 116, and/or remote computing device 120, 122, 124 rotates the representations of each pair of remote users so that the users in each pair appear to be looking toward and/or facing each other.

[0062] If the algorithm determines that a pair does not exist, then the algorithm will determine whether a talk-to state exists (420). The algorithm can determine whether the talk-to state exists (420) based on whether audio signals recognized as speech are received from any of the remote computing devices 120, 122, 124. If no audio signals recognized as speech are received from any of the remote computing devices 120, 122, 124, then the algorithm will determine that a talk-to state does not exist. If the algorithm determines that a talk-to state does not exist, then the display 106 of the local computing device 102 will present a full view 468 of all the remote users 204A, 204B, 204C. An example of the full view 468 is shown in FIG. 2A.

[0063] If audio signals recognized as speech are received from at least one of the remote computing devices 120, 122, 124, then the algorithm will determine that a talk-to state does exist. If the algorithm determines that a talk-to state does exist, then the algorithm will determine whether the number of users being talked and/or spoken to is one (422). In some examples, the algorithm determines whether the number of users being talked and/or spoken to is one (422) based, for example, on whether the user speaking identifies more than one user by name, title, or nickname. In some examples, the algorithm determines whether the number of users being talked and/or spoken to is one (422) based, for example, on whether a gaze of the user speaking is directed to a representation of more than one user. If the number of users being talked and/or spoken to is not equal to one, then the display 106 of the local computing device 102 can present a full view 466 of the remote users 204A, 204B, 204C (denoted ‘A’ and ‘B’) in FIG. 4. An example of the full view 466 is shown in FIG. 2D. In the example of the full view 466, the user who is speaking can be considered to be speaking to a general audience.

[0064] If the algorithm determines that the number of users being talked or spoken to is one, then the algorithm can determine whether the discussion involves the local user 104 (424). In some examples, the algorithm can determine whether the discussion involves the local user 104 (424) based, for example, on whether text transcribed from speech by the user speaking identifies the local user 104 by name, nickname, or title. In some examples, the algorithm can determine whether the discussion involves the local user 104 (424) based, for example, on whether a gaze of the user speaking is directed to a representation of the local user 104 on a display of the remote computing device 120, 122, 124 associated with the user speaking. If the algorithm determines that the discussion involves the local user 104, then the display 106 included in the local computing device 102 can present only the user who is speaking in a one-one-one view 462. An example of the one-on-one view 462 is shown in FIG. 2B. If the algorithm determines that the discussion does not involve the local user 104, then the display 106 included in the local computing device 102 can present a pairwise view 464 of the user who is speaking (denoted ‘A’ in FIG. 4) and the user who is being spoken to (denoted ‘B’ in FIG. 4). An example of the pairwise view 464 is shown in FIG. 2C. In some examples, the representations of the users in the pairwise view can be rotated so that the users appear to be looking at and/or toward each other.

[0065] FIG. 5 is a block diagram showing a computing system 500 for representing remote users during a videoconference. The computing system 500 can be an example of the local computing device 102, the server 116, and/or any of the remote computing devices 120, 122, 124. The computing system 500 can implement methods, functions, and/or techniques performed individually by any of the local computing device 102, the server 116, and/or any of the remote computing devices 120, 122, 124, and/or methods, functions, and/or techniques performed in combination by any of the local computing device 102, the server 116, and/or any of the remote computing devices 120, 122, 124.

[0066] The computing system 500 can include a model generator 502. The model generator 502 can generate a three-dimensional model, such as a depth model, of one or more users 104, 204A, 204B, 204C. The model generator 502 can generate a three-dimensional model, such as a depth model, of one or more users 104, 204 A, 204B, 204C based on one or more images captured by cameras included in the local computing device 102 and/or remote computing devices 120, 122, 124 during the videoconference. The generation of the three-dimensional model such as the depth model is described further below with respect to FIGs. 7 through 12B.

[0067] The computing system 500 can include a model rotator 504. The model rotator 504 can rotate the three-dimensional model and/or depth model of a user. The model rotator 504 can rotate the three-dimensional model and/or depth model of a user to make the user appear to be facing and/or looking toward another user. In some examples, the model rotator 504 rotates the three-dimensional model and/or depth model of the user using a three-dimensional point rotation algorithm.

[0068] The computing system 500 can include a speech state determiner 506. The speech state determiner 506 can determine a speech state of the videoconference. The speech state determiner 506 can determine, for example, whether the local user 104 is speaking and/or making an announcement to the remote users 204A, 204B, 204C (as shown in the example shown in FIG. 2A), whether the local user 104 is having a one-on-one (pairwise) conversation with a remote user 204A (as shown in the example shown in FIG. 2B), whether two remote uses 204 A, 204B are having a one-one-one (pairwise) conversation (as shown in the example of FIG. 2C), or whether a remote user 204C is speaking and/or making an announcement to the remaining users in the videoconference (as shown in the example shown in FIG. 2D).

[0069] The speech state determiner 506 can determine the speech state based, for example, on audio signals received by the local computing device 102 and/or remote computing devices 120, 122, 124, and/or determinations of gazes of the users 104, 204A, 204B, 204C toward representations of other users presented by the displays of associated computing devices 102, 120, 122, 124. In some examples, the speech state determiner 506 implements the decision-tree algorithm described above with respect to FIG. 4.

[0070] The computing system 500 can include a layout determiner 508. The layout determiner 508 can determine a layout for presenting representations of the users based on the speech state determined by the speech state determiner 506. In some examples, the layout determiner 508 can determine that a full presentation of the other users, as shown in FIG. 2A, should be presented if the local user 104 is speaking and/or making an announcement to the remote users 204A, 204B, 204C. In some examples, the layout determiner 508 can determine that a presentation of only one remote user should be presented, as shown in FIG. 2B, should be presented if the local user 104 is speaking with the one remote user. In some examples, the layout determiner 508 can determine that a presentation of two remote users facing toward each other, as shown in FIG. 2C, should be presented if the two remote users are speaking to each other. In some examples, the layout determiner 508 can determine that a full presentation of the other users, with an expanded representation of a remote user who is speaking, as shown in FIG. 2D, should be presented if one of the remote users is speaking to the other users in the videoconference.

[0071] The computing system 500 can include an avatar state determiner 510. An avatar can be a representation of a user. The avatar can be generated based on the three-dimensional model and/or depth model of the user. The avatar state determiner 510 can determine how the user should be represented on a display. The avatar state determiner 510 can determine how the user should be represented on a display based on the speech state determined by the speech state determiner 506 and/or based on the presentation determined by the layout determiner 508. In some examples, the avatar state determiner 510 can determine that a user should be represented as facing forward if the local user 104 is talking as shown in FIGs. 2A and 2B or if a remote user is talking to the group as shown in FIG. 2D. In some examples, the avatar state determiner 510 can determine that a first remote user should be represented as facing toward a second remote user if the first remote user is talking to the second remote user, as shown in FIG. 2C.

[0072] The computing system 500 can include a representation generator 512. The representation generator 512 can generate representations of the remote users. The representation generator 512 can generate representations of the remote users based on the three-dimensional models and/or depth models of the users and the avatar state determined by the avatar state determiner 510 and/or the layout determined by the layout determiner 508.

[0073] The computing system 500 can include at least one processor 514. The at least one processor 514 can execute instructions, such as instructions stored in at least one memory device 516, to cause the computing system 500 to perform any combination of methods, functions, and/or techniques described herein.

[0074] The computing system 500 can include at least one memory device 516. The at least one memory device 516 can include a non-transitory computer-readable storage medium. The at least one memory device 516 can store data and instructions thereon that, when executed by at least one processor, such as the processor 514, are configured to cause the computing system 500 to perform any combination of methods, functions, and/or techniques described herein. Accordingly, in any of the implementations described herein (even if not explicitly noted in connection with a particular implementation), software (e.g., processing modules, stored instructions) and/or hardware (e.g., processor, memory devices, etc.) associated with, or included in, the computing system 500 can be configured to perform, alone, or in combination with computing system 500, any combination of methods, functions, and/or techniques described herein.

[0075] The computing system 500 may include at least one input/output node 518. The at least one input/output node 518 may receive and/or send data, such as from and/or to, a server, and/or may receive input and provide output from and to a user. The input and output functions may be combined into a single node, or may be divided into separate input and output nodes. The input/output node 518 can include, for example, a microphone, a camera, a display such as a touchscreen, a speaker, a microphone, one or more buttons, and/or one or more wired or wireless interfaces for communicating with other computing devices.

[0076] FIG. 6 is a flowchart showing a method for representing remote users during a videoconference. The method can be performed by the computing system 500.

[0077] The method includes generating a model (602). Generating the model (602) can include generating, based on images captured by a camera during a videoconference, a three- dimensional model of a first user participating in the videoconference. The method includes determining that the first user is speaking to or with a second user (604). Determining that the first user is speaking to or with a second user (604) can include determining, based on an audio signal from the first user, that the first user is speaking, via the videoconference, to or with the second user. The method includes displaying a representation of the first user and second user (606). Displaying the representation of the first user and second user (606) includes, based on determining that the first user is speaking to the second user, displaying, to a third user, a representation of the first user and a representation of the second user where the representation of the first user is based on a rotation of the three-dimensional model of the first user so that the representation of the first user is facing toward the representation of the second user.

[0078] In some examples, the audio signal includes a name of the second user spoken by the first user.

[0079] In some examples, the audio signal includes speech by the first user for at least a speech threshold period of time.

[0080] In some examples, the audio signal is a first audio signal; the representation of the first user is a first representation of the first user; and the method further includes determining, based on a second audio signal, that the first user is speaking to a general audience within the videoconference; and based on determining that the first user is speaking to the general audience, displaying, to the third user, a second representation of the first user, the second representation of the first user presenting the first user as facing toward the third user.

[0081] In some examples, the method further includes, based on determining that the first user is speaking to the general audience reducing a size of the representation of the first user; and reducing a size of the representation of the second user.

[0082] In some examples, the audio signal is a first audio signal; the representation of the second user is a first representation of the second user; and the method further includes determining, based on a second audio signal from the second user, that the second user is engaging in a one-on-one conversation with the third user within the videoconference; and based on determining that the second user is engaging in the one-on-one conversation with the third user, displaying, to the third user, a second representation of the second user, the second representation of the second user presenting the second user as facing toward the third user.

[0083] In some examples, the determining that the second user is engaging in the one-on- one conversation with the third user within the videoconference is further based on silence from the first user for a silence threshold period of time.

[0084] In some examples, the determining that the second user is engaging in the one-on- one conversation with the third user within the videoconference is further based on the second audio signal including speech by the second user for at least a speech threshold period of time.

[0085] In some examples, the determining that the first user is speaking to the second user is based on the audio signal from the first user and an eye gaze of the first user.

[0086] In some examples, the determining that the first user is speaking to the second user is based on the audio signal from the first user and an audio signal from the second user.

[0087] In some examples, the determining that the first user is speaking to the second user is based on the audio signal from the first user and an eye gaze from the second user.

[0088] In some examples, the displaying the representation of the first user and the second user is based on determining that the first user is speaking to the second user and that the second user is speaking to the first user.

[0089] As discussed above, the local computing device 102 and/or server 116 can generate a three-dimensional model of a face, head, and/or other body part of the local user 104. The local computing device can receive a video stream of the local user 104 and generate a depth map based on the video stream. The local computing device 102 and/or server 116 can generate a representation of the local user 104, such as a three-dimensional model, based on the depth map and the video stream. The representation can include a video representing the face of the local user 104, and can include head movement, eye movement, mouth movement, and/or facial expressions. The local computing device 102 can send the representation to a remote computing device 120, 122, 124 for viewing by a remote user 204A, 204B, 204B with whom the local user 104 is communicating via videoconference. The remote computing devices 120, 122, 124 can similarly generate three-dimensional models and/or depth maps of the remote users 204A, 204B, 204C.

[0090] To reduce data required for videoconferencing, a computing device such as the local computing device 102, server 116, and/or remote computing device 120, 122, 124 can generate a three-dimensional model in a form of a depth map based on the video stream, and generate a representation of the local user 104 based on the depth map and the video stream. The representation of the local user can include a three-dimensional (3D) model and/or avatar generated in real time that includes head movement, eye movement, mouth movement, and/or facial expressions corresponding to such movements by the local user.

[0091] The computing device can generate the depth map based on a depth prediction model. The depth prediction model may have been previously trained based on images, for example same images, of persons captured by both a depth camera and a color (such as red-green- blue (RGB)) camera. The depth prediction model can include a neural network that was trained based on images of persons captured by both the depth camera and the color camera. The computing device can generate the depth map based on the depth prediction model and a single color (such as red-green-blue (RGB)) camera. The generation of the depth map based on the depth prediction model and the single color camera reduces the hardware needed to generate the representation (e.g., a 3D representation) of the local user. The generation of the depth map based on the depth prediction model and the single color camera reduces the hardware needed to generate the representation of the local user for viewing by a remote user in, for example, a video conference with the local user. In other words, multiple cameras capturing images of the local user (e g., multiple cameras from different perspectives capturing images of the local user) may not be needed to produce a 3D representation of the local user for viewing by, for example, a remote user.

[0092] The local computing device 102 can send the representation of the local user 104 to one or more remote computing devices 120, 122, 124. The representation can realistically represent the local user 104 while relying on less data than an actual video stream of the local user 104. In some examples, a plugin for a web browser can implement the methods, functions, and/or techniques described herein.

[0093] The representation of the local user 104 can be a three-dimensional model and/or representation of the local user. The three-dimensional representation of the local user can be valuable in the context of virtual reality (VR) and/or augmented reality (AR) glasses, because the remote computing device can rotate the three-dimensional representation of the local user in response to movement of the VR and/or AR glasses. For example, a single camera can be used to capture a local user and a 3D representation of the local user can be generated for viewing a remote user using, for example, VR (e.g., a VR head mounted display) and/or AR glasses.

[0094] FIG. 7 is a block diagram of a pipeline for generating a representation of the local user 104 based on a depth map. The pipeline can include the camera 108. The camera 108 can capture images of the local user 104. The camera 108 can capture images of the face and/or other body parts of the local user 104. The camera 108 can capture images and/or photographs that are included in a video stream of the local user 104.

[0095] The camera 108 can send a first video stream to a facial landmark detection model 702. The facial landmark detection model 702 can be included in the local computing device 102 and/or the server 116. The facial landmark detection model 702 can include Shape Preserving with GAts (SPIGA), AnchorFace, Teacher Supervises Students (TS3), or Joint Voxel and Coordinate Regression (JVCR), as non-limiting examples. The facial landmark detection model 702 can determine a location of the face of the local user 104 within a frame and/or first video stream. The facial landmark detection model 702 can determine a location of the face of the local user 104 based on facial landmarks, which can also be referred to as facial features of the user. In some examples, the local computing device 102 and/or server 116 can crop the image and/or frame based on the determined location of the face of the local user 104. The local computing device 102 and/or server 116 can crop the image and/or frame based on the determined location of the face of the local user 104 to include only portions of the image and/or frame that are within a predetermined distance of the face of the local user 104 and/or within a predetermined distance of predetermined portions (such as chin, cheek, or eyes) of the face of the local user 104.

[0096] In some examples, the local computing device 102 and/or server 116 can determine a head pose 704 based on the facial landmarks determined by the facial landmark detection model 702. The head pose 704 can include a direction that the local user 104 is facing and/or a location of a head of the local user 104. [0097] In some examples, the local computing device 102 can adjust the camera 108 (706) and/or the server 116 can instruct the local computing device 102 to adjust the camera 108 (706). The local computing device 102 can adjust the camera 108 (706) by, for example, changing a direction that the camera 108 is pointing and/or by changing a location of focus of the camera 108.

[0098] After and/or while adjusting the camera 108, the local computing device 102 can add the images of the local user 104 captured by the camera 108 within the first video stream to a rendering scene 708. The rendering scene 708 can include images and/or representations of the users and/or persons participating in the videoconference, such as the representations 204A, 204B, 204C of remote users shown in FIGs. 2A, 2B, 2C, and 2DThe representations of remote users 204A, 204B, 204C received from remote computing devices 120, 122, 124 and/or the server 116 and presented by and/or on the display 106 can be modified representations of the images captured by cameras included in the remote computing devices 120, 122, 124 to reduce the data required to transmit the images.

[0099] The camera 108 can send a second video stream to a depth prediction model 710. The second video stream can include a representation of the face of the local user 104. The depth prediction model 710 can create a three-dimensional model of the face of the local user 104, as well as other body parts and/or objects held by and/or in contact with the local user 104. The three-dimensional model created by the depth prediction model 710 can be considered a depth map 712, discussed below. In some examples, the depth prediction model 710 can include a neural network model. An example neural network model that can be included in the depth prediction model 710 is shown and described with respect to FIG. 8. In some examples, the depth prediction model 710 can be trained by capturing simultaneous and/or concurrent images of persons with both a depth camera and a color (such as red-green-blue (RGB)) camera. An example of training the depth prediction model 710 by capturing simultaneous and/or concurrent images of persons with both a depth camera and a color camera is shown and described with respect to FIG. 9.

[00100] The depth prediction model 710 can generate a depth map 712 based on the second video stream. The depth map 712 can include a three-dimensional representation of portions of the local user 104. In some examples, the depth prediction model 710 can generate the depth map 712 by generating a segmented mask using the body segmentation application programming interface (API) of, for example, TensorFlow.js, masking the images and/or frames with the segmented mask, and passing the masked images and/or frames into a Portrait Depth API to obtain the depth map.

[00101] In some examples, the depth prediction model 710 can generate the depth map 712 by creating a grid of triangles with vertices. In some examples, the grid is a 256x192x2 grid. In some examples, each cell in the grid includes two triangles. In each triangle, an x value can indicate a value for a horizontal axis within the image and/or frame, a y value can indicate a value for a vertical axis within the image and/or frame, and a z value can indicate a distance from the camera 108. In some examples, the z values are scaled to have values between zero (0) and one (1). In some examples, the depth prediction model 710 can discard, and/or not render, triangles for which a standard deviation of the three z values exceeds a discrepancy threshold, such as 0.1. The discarding and/or not rendering of triangles for which the standard deviation of the three z values exceeds the discrepancy threshold avoids bleeding artifacts between the face of the local user 104 and the background included in the frame.

[00102] The depth map 712 can include distances of various portions of the face and/or other body parts of the local user 104 with respect to the camera 108 and/or distances of various portions of the face and/or other body parts of the local user 104 with respect to each other. In some examples, the depth map 712 is a lower-resolution tensor, such as a 256x192x1 tensor. In some examples, the depth map 712 can include values between zero (0) and one (1) to indicate relative distances from the pixel to the camera 108 that captured the representation of the local user 104, such as zero indicating the closest to the camera 108 and one indicating the farthest from the camera 108. In some examples, the depth map 712 is stored on a graphics processing unit (GPU) and rendered into a GPU buffer. In some examples, the depth map 712 is stored together with the frame for streaming to remote clients, such as the remote computing devices 120, 122, 124.

[00103] The local computing device 102 and/or server 116 can combine the depth map 712 with the second video stream and/or a third video stream to generate a representation 714 of the local user 104. The representation 714 can include a three-dimensional avatar that looks like the local user 104 and simulates movements by the local user 104. The representation 714 can represent and/or display head movements, eye movements, mouth movements, and/or facial expressions by the local user 104. In some examples, the representation 714 can include a grid of vertices and/or triangles. The cells in the grid can include two triangles with each triangle including three z values indicating distances and/or depths from the camera 108.

[00104] The local computing device 102 and/or server 116 can send the representation 714 to a remote computing device 716, such as any of the remote computing devices 120, 122, 124. The remote computing device 716 can present the representation 714 on a display included in the remote computing device 716. The remote computing device 716 can also send to the local computing device 102, either directly to the local computing device 102 or via the server 116, a representation of another person participating in the videoconference, such as the representations of the remote users 204A, 204B, 204C. The local computing device 102 can include the representation of the remote user 204A, 204B, 204C in the rendering scene 708, such as by including the representation in the display 106.

[00105] FIG. 8 is a diagram that includes a neural network 808 for generating a depth map. The methods, functions, and/or modules described with respect to FIG. 8 can be performed by and/or included in the local computing device 102, the server 116, and/or distributed between the local computing device 102 and server 116. The neural network 808 can be trained using both a depth camera and a color (such as RGB) camera as described with respect to FIG. 9.

[00106] Video input 802 can be received by the camera 108. The video input 802 can include, for example, high-resolution red-green-blue (RGB) input, such as 1,920 pixels by 720 pixels, received by the camera 108. The video input 802 can include images and/or representations of the local user 104 and background images. The representations of the local user 104 may not be centered within the video input 802. The representations of the local user 104 may be on a left or right side of the video input 802, causing a large portion of the video input 802 to not include any portion of the representations of the local user 104.

[00107] The local computing device 102 and/or server 116 can perform face detection 804 on the received video input 802. The local computing device 102 and/or server 116 can perform face detection 804 on the received video input 802 based on a facial landmark detection model 702, as discussed above with respect to FIG. 7. Based on the face detection 804, the local computing device 102 and/or server 116 can crop the images included in the video input 802 to generate cropped input 806. The cropped input 806 can include smaller images and/or frames that include the face of the local user 104 and portions of the images and/or frames that are a predetermined distance from the face of the local user 104. In some examples, the cropped input 806 can include lower resolution than the video input 802, such as including low-resolution color (such as RGB) input and/or video, such as 192 pixels by 256 pixels. The lower resolution and/or lower number of pixels of the cropped input 806 can be the result of cropping the video input 802.

[00108] The local computing device 102 and/or server 116 can feed the cropped input 806 into the neural network 808. The neural network 808 can perform background segmentation 810. The background segmentation 810 can include segmenting and/or dividing the background into segments and/or parts. The background that is segmented and/or divided can include portions of the cropped input 806 other than the representation of the local user 104, such as a wall and/or chair. In some examples, the background segmentation 810 can include removing and/or cropping the background from the image(s) and/or cropped input 806.

[00109] A first layer 812 of the neural network 808 can receive input including the cropped input cropped input 806 and/or the images in the video stream with the segmented background. The input received by the first layer 812 can include low-resolution color input similar to the cropped input 806, such as 256x192x3 RGB input. The first layer 812 can perform a rectified linear activation function (ReLU) on the input received by the first layer 812, and/or apply a three-by-three (3x3) convolutional filter to the input received by the first layer 812. The first layer 812 can output the resulting frames and/or video to a second layer 814.

[00110] The second layer 814 can receive the output from the first layer 812. The second layer 814 can apply a three-by-three (3x3) convolutional filter to the output of the first layer 812, to reduce the size of the frames and/or video stream. The size can be reduced, for example, from 256 pixels by 192 pixels to 128 pixels by 128 pixels. The second layer 814 can perform a rectified linear activation function on the reduced frames and/or video stream. The second layer 814 can also perform max pooling on the reduced frames and/or video stream, reducing the dimensionality and/or number of pixels included in the frames and/or video stream. The second layer 814 can output the resulting frames and/or video stream to a third layer 816 and to a first half 826A of an eighth layer.

[00111] The third layer 816 can perform additional convolutional filtering (such as three- by-three convolutional filtering), perform a rectified linear activation function, and/or max pooling on the frames and/or video stream received from the second layer 814 to further reduce the dimensionality and/or number of pixels included in the frames and/or video stream. The number of pixels included in the frames and/or video stream can be reduced, for example, from 128 pixels by 128 pixels to 128 pixels by 64 pixels. The third layer 816 can output the resulting frames and/or video stream to a fourth layer 818 and to a first half 824A of a seventh layer.

[00112] The fourth layer 818 can perform additional convolutional filtering (such as three- by-three convolutional filtering), perform a rectified linear activation function, and/or max pooling on the frames and/or video stream received from the third layer 816 to further reduce the dimensionality and/or number of pixels included in the frames and/or video stream. The number of pixels included in the frames and/or video stream can be reduced, for example, from 128 pixels by 64 pixels to 64 pixels by 32 pixels. The fourth layer 818 can output the resulting frames and/or video stream to a fifth layer 820 and to a first half 822A of a sixth layer.

[00113] The fifth layer 820 can perform additional convolutional filtering (such as three- by-three convolutional filtering), perform a rectified linear activation function, and/or max pooling on the frames and/or video stream received from the fourth layer 818 to further reduce the dimensionality and/or number of pixels included in the frames and/or video stream. The number of pixels included in the frames and/or video stream can be reduced, for example, from 64 pixels by 32 pixels to 32 pixels by 32 pixels. The fifth layer 820 can output the resulting frames and/or video stream to a second half 822B of a sixth layer.

[00114] The sixth layer, which includes the first half 822Athat received the output from the fourth layer 818 and the second half 822B that received the output from the fifth layer 820, can perform up convolution on the frames and/or video stream to increase the dimensionality and/or number of pixels included in the frames and/or video stream. The up convolution can double the dimensionality and/or number of pixels in each dimension, such as by increasing the number of pixels from 32x32 to 64x(32+32). The sixth layer can output the up-convolved frames and/or video stream to a second half 824B of the seventh layer.

[00115] The seventh layer, which includes the first half 824A that received the output from the third layer 816 and the second half 824B that received the output from the second half 822B of the sixth layer, can perform further up convolution on the frames and/or video stream to increase the dimensionality and/or number of pixels included in the frames and/or video stream. The up convolution can double the dimensionality and/or number of pixels in each dimension, such as by increasing the number of pixels from 64x64 to 128x(64+64). The seventh layer can output the up-convolved frames and/or video stream to a second half 826B of the eighth layer.

[00116] The eighth layer, which includes the first half 826A that received the output from the second layer 814 and the second half 826B that received the output from the second half 824B of the seventh layer, can perform further up convolution on the frames and/or video stream to increase the dimensionality and/or number of pixels included in the frames and/or video stream. The up convolution can double the dimensionality and/or number of pixels, such as by increasing the number of pixels from 128x128 to 128x(128+128). The eighth layer can output the up- convolved frames and/or video stream to a ninth layer 828.

[00117] The ninth layer 828 can receive the output from the eighth layer. The ninth layer 828 can perform further up convolution on the frames and/or video stream received from the eighth layer. The ninth layer 828 can also reshape the frames and/or video stream received from the eighth layer. The up-convolving and reshaping performed by the ninth layer 828 can increase the dimensionality and/or pixels in the frames and/or video stream. The frames and/or video stream with the increased dimensionality and/or pixels can represent a silhouette 830 of the local user 104.

[00118] The local computing device 102 and/or server 116 can generate a depth map 832 based on the silhouette 830. The depth map can include distances of various portions of the local user 104. The distances can be distances from the camera 108 and/or distances and/or directions from other portions of the local user. The local computing device 102 and/or server 116 can generate the depth map 832 by generating a segmented mask using the body segmentation application programming interface (API) of TensorFlow.js, masking the images and/or frames with the segmented mask, and passing the masked images and/or frames into, for example, a Portrait Depth API to obtain the depth map.

[00119] The local computing device 102 and/or server 116 can generate a real-time depth mesh 836 based on the depth map 832 and a template mesh 834. The template mesh 834 can include colors of the representation of the local user 104 captured by the camera 108. The local computing device 102 and/or server 116 can project the colors from the frames onto the triangles within the depth map 832 to generate the real-time depth mesh 836. The real-time depth mesh 836 can include a three-dimensional representation of the local user 104, such as a three-dimensional avatar, that represents the local user 104. The three-dimensional representation of the local user 104 can mimic the movements and facial expressions of the local user 104 in real time.

[00120] The local computing device 102 and/or server 116 can generate an image 838, and/or stream of images 838, based on the real-time depth mesh 836. The server 116 and/or remote computing device 120, 122, 124 can add the image 838 and/or stream of images to an image 840 and/or video stream that includes multiple avatars. The image 840 and/or video stream that includes the multiple avatars can include representations of multiple uses.

[00121] FIG. 9 shows a depth camera 904 and a camera 906 capturing images of a person 902 to train the neural network 808. The depth camera 904 and camera 906 can each capture multiple images and/or photographs of the person 902. The depth camera 904 and camera 906 can capture the multiple images and/or photographs of the person 902 concurrently and/or simultaneously. The images and/or photographs can be captured at multiple angles and/or distances, which can be facilitated by the person 902 rotating portions of the body and/or face of the person 902 and moving toward and away from the depth camera 904 and camera 906. In some examples, the images and/or photographs captured by the depth camera 904 and camera 906 can be timestamped to enable matching the images and/or photographs that were captured at same times. The person 902 can move, changing head poses and/or facial expressions, so that the depth camera 904 and camera 906 capture images of the person 902 (particularly the person’s 902 face) from different angles and with different facial expressions.

[00122] The depth camera 904 can determine distances to various locations, portions, and/or points on the person 902. In some examples, the depth camera 904 includes a stereo camera, with two cameras that can determine distances based on triangulation. In some examples, the depth camera 904 can include a structured light camera or coded light depth camera that projects patterned light onto the person 902 and determines the distances based on differences between the projected light and the images captured by the depth camera 904. In some examples, the depth camera 904 can include a time of flight camera that sweeps light over the person 902 and determines the distances based on a time between sending the light and capturing the light by a sensor included in the depth camera 904.

[00123] The camera 906 can include a color camera, such as a red-green-blue (RGB) camera, that generates a two-dimensional grid of pixels. The camera 906 can generate the two- dimensional grid of pixels based on light captured by a sensor included in the camera 906.

[00124] A computing system 908 can receive the depth map from the depth camera 904 and the images (such as grids of pixels) from the camera 906. The computing system 908 can store the depth maps received from the depth camera 904 and the images received from the camera 906 in pairs. The pairs can each include a depth map and an image that capture the person 902 at the same time. The pairs can be considered training data to train the neural network 808. The neural network 808 can be trained by comparing depth data based on images captured by the depth camera 904 to color data based on images captured by the camera 906. The computing system 908, and/or another computing system, can train the neural network 808 based on the training data to determine depth maps based on images that were captured by a color (such as RGB) camera, such as the camera 108 included in the local computing device 102.

[00125] The computing system 908, and/or another computing system in communication with the computing system 908, can send the training data, and/or the trained neural network 808, along with software (such as computer-executable instructions), to one or more other computing devices, such as the local computing device 102, server 116, and/or remote computing devices 120, 122, 124, enabling the one or more other computing devices to perform any combination of methods, functions, and/or techniques described herein.

[00126] FIG. 10 shows a pipeline 1000 for rendering the representation of the local user 104. The cropped input 806 can include a reduced portion of the video input 802, as discussed above with respect to FIG. 8. The cropped input 806 can include a captured image and/or representation of a face of a user and some other body parts, such as the representation of the local user 104. In some examples, the cropped input 806 can include virtual images of users, such as avatars of users, rotated through different angles of view.

[00127] The pipeline can include segmenting the foreground to generate a modified input 1002. The segmentation of the foreground can result in the background of the cropped input 806 being eliminated in the modified input 1002, such as causing the background to be all black or some other predetermined color. The foreground that is eliminated can be parts of the cropped input 806 that are not part of or in contact with the local user 104.

[00128] The pipeline 1000 can pass the modified input 1002 to the neural network 808. The neural network 808 can generate the silhouette 830 based on the modified input 1002. The neural network 808 can also generate the depth map 832 based on the modified input 1002. The pipeline 1000 can generate the real-time depth mesh 836 based on the depth map 832 and the template mesh 834. The real-time depth mesh 836 can be used by the local computing device 102, server 116, and/or remote computing device 120, 122, 124 to generate an image 838 that is a representation of the local user 104.

[00129] FIG. 11 is a block diagram of a computing device 1100 that generates a representation of the local user 104 based on the depth map 712. The representation of the local user 104 can include a three-dimensional model of the local user 104. Features and/or functionalities of the computing device 1100 can be included in the computing system 500.

[00130] The computing device 1100 can include a camera 1102. The camera 1102 can be an example of the camera 108. The camera 1102 can capture color images, including digital images, of a user, such as the local user 104 or the remote users 204A, 204B, 204C.

[00131] The computing device 1100 can include a stream processor 1104. The stream processor 1104 can process streams of video data captured by the camera 1102. The stream processor 1104 can send, output, and/or provide the video stream to the facial landmark detection model 702, to a cropper 1110, and/or to the depth prediction model 710.

[00132] The computing device 1100 can include the facial landmark detection model 702. The facial landmark detection model 702 can find and/or determine landmarks on a representation of a face of the local user 104.

[00133] The computing device 1100 can include a location determiner 1106. The location determiner 1106 can determine a location of the face of the local user 104 within a frame based on the landmarks found and/or determined by the facial landmark detection model 702.

[00134] The computing device 1100 can include a camera controller 1108. The camera controller 1108 can control the camera 1102 based on the location of a face of the local user 104 determined by the location determiner 1106. The camera controller 1108 can, for example, cause the camera 1102 to rotate and/or change direction and/or depth of focus.

[00135] The computing device 1100 can include a cropper 1110. The cropper 1110 can crop the image(s) captured by the camera 1102. The cropper 1110 can crop the image(s) based on the location of the face of the local user 104 determined by the location determiner 1106. The cropper 1110 can provide the cropped image(s) to a depth map generator 1112.

[00136] The computing device 1100 can include the depth prediction model 710. The depth prediction model 710 can determine depths of objects for which images were captured by the camera 1102. The depth prediction model 710 can include a neural network, such as the neural network 808 described above.

[00137] The computing device 1100 can include the depth map generator 1112. The depth map generator 1112 can generate the depth map 712 based on the depth prediction model 710 and the cropped image received from the cropper 1110. [00138] The computing device 1100 can include an image generator 1114. The image generator 1114 can generate the image and/or representation 714 that will be sent to the remote computing device 120, 122, 124. The image generator 1114 can generate the image and/or representation 714 based on the depth map 712 generated by the depth map generator 1112 and a video stream and/or images received from the camera 1102.

[00139] The computing device 1100 can include at least one processor 1116. The at least one processor 1116 can execute instructions, such as instructions stored in at least one memory device 1118, to cause the computing device 1100 to perform any combination of methods, functions, and/or techniques described herein.

[00140] The computing device 1100 can include at least one memory device 1118. The at least one memory device 1118 can include a non-transitory computer-readable storage medium. The at least one memory device 1118 can store data and instructions thereon that, when executed by at least one processor, such as the processor 1116, are configured to cause the computing device 1100 to perform any combination of methods, functions, and/or techniques described herein. Accordingly, in any of the implementations described herein (even if not explicitly noted in connection with a particular implementation), software (e.g., processing modules, stored instructions) and/or hardware (e.g., processor, memory devices, etc.) associated with, or included in, the computing device 1100 can be configured to perform, alone, or in combination with computing device 1100, any combination of methods, functions, and/or techniques described herein.

[00141] The computing device 1100 may include at least one input/output node 1120. The at least one input/output node 1120 may receive and/or send data, such as from and/or to, a server, and/or may receive input and provide output from and to a user. The input and output functions may be combined into a single node, or may be divided into separate input and output nodes. The input/output node 1120 can include, for example, a microphone, a camera, a display such as a touchscreen, a speaker, a microphone, one or more buttons, and/or one or more wired or wireless interfaces for communicating with other computing devices.

[00142] FIG. 12A shows a portrait depth estimation model 1200. The portrait depth estimation model 1200 reduces a data size of the representation of the user. The portrait depth estimation model 1200 can be included in the computing device 1100 described with reference to FIG. 11. The portrait depth estimation model 1200 reduces a data size of the representation of the user before sending the representation of the user to the remote computing device. Reducing the data size of the representation of the user reduces the data sent between the computing device and the remote computing device.

[00143] The portrait depth estimation model 1200 performs foreground segmentation 1204 on a captured image 1202. The captured image 1202 is an image and/or photograph captured by a camera included in the computing device 1100, such as the camera 108, 1102. The foreground segmentation 1204 can be performed by a body segmentation module. The foreground segmentation 1204 includes removing a background from the image and/or photograph of the user captured by the camera. The foreground segmentation 1204 segments the foreground by removing the background from the captured image 1202 so that the foreground is remaining. The foreground segmentation 1204 can have similar features to the foreground segmentation that generates the modified input 1002 based on the cropped input 806, as discussed above with respect to FIG. 10. The foreground is the portrait (face, hair, and/or bust) of the user. The foreground segmentation 1204 results in a cropped image 1206 of the user. The cropped image 1206 includes only the image of the user, without the background.

[00144] After performing the foreground segmentation 1204, the portrait depth estimation model 1200 performs downscaling 1276 on the cropped image 1206 to generate a downscaled image 1274. The downscaling can be performed by a deep learning method such as U-Net. The downscaled image 1274 is a version of the cropped image 1206 that includes less data to represent the image of the user than the captured image 1202 or cropped image 1206.

[00145] The downscaling 1276 can include receiving the cropped image 1206 as input 1208. The input 1208 can be provided to a convolution module 1210. The convolution module 1210 can iteratively perform convolution, normalization, convolution (a second time), and addition on the input 1208.

[00146] The output of the convolution module 1210 can be provided to a series of residual blocks (Resblock) 1212, 1214, 1216, 1218, 1220, 1222 and to concatenation blocks 1226, 1230 1234, 1238, 1242, 1246. A residual block 1280 is shown in greater detail in FIG. 12B. The residual blocks 1212, 1214, 1216, 1218, 1220, 1222, as well as residual blocks 1228, 1232, 1236, 1240, 1244, 1248 either perform weighting operations on values within layers or skip the weighting operations and provide the value to a next layer.

[00147] After the values of the input 1208 have passed through the residual blocks 1212, 1214, 1216, 1218, 1220, 1222, the resulting values are provided to a bridge 1224. The bridge 1224 performs normalization, convolution, normalization (a second time), and convolution (a second time) on the values received from the residual blocks 1212, 1214, 1216, 1218, 1220, 1222. The residual blocks 1212, 1214, 1216, 1218, 1220, 1222 also provide their respective resulting values to the concatenation blocks 1226, 1230 1234, 1238, 1242. The values of the residual blocks 1228, 1232, 1236, 1240, 1244, 1248 are provided to normalization blocks 1250, 1254, 1258, 1262, 1266, 1270, which generate outputs 1252, 1256, 1260, 1264, 1268, 1272. The final output 1272 generates the downscaled image 1274.

[00148] FIG. 12B shows a resblock 1280, included in the portrait depth estimation model of FIG. 12A, in greater detail. The resblock 1280 is an example of the residual blocks 1212, 1214, 1216, 1218, 1220, 1222, 1228, 1232, 1236, 1240, 1244, 1248. The residual block 1280 can include a normalization block 1282, convolution block 1284, normalization block 1286, convolution block 1288, and addition block 1290. The resblock 1280 can perform normalization, convolution, normalization, convolution (a second time), and addition, or skip these operations and provide an output value that is equal to the input value.

[00149] Implementations of the various techniques described herein may be implemented in digital electronic circuitry, or in computer hardware, firmware, software, or in combinations of them. Implementations may implemented as a computer program product, i.e., a computer program tangibly embodied in an information carrier, e.g., in a machine-readable storage device, for execution by, or to control the operation of, data processing apparatus, e.g., a programmable processor, a computer, or multiple computers. A computer program, such as the computer program(s) described above, can be written in any form of programming language, including compiled or interpreted languages, and can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program can be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a communication network.

[00150] Method steps may be performed by one or more programmable processors executing a computer program to perform functions by operating on input data and generating output. Method steps also may be performed by, and an apparatus may be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).

[00151] Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both. Elements of a computer may include at least one processor for executing instructions and one or more memory devices for storing instructions and data. Generally, a computer also may include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. Information carriers suitable for embodying computer program instructions and data include all forms of non-volatile memory, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory may be supplemented by, or incorporated in special purpose logic circuitry.

[00152] To provide for interaction with a user, implementations may be implemented on a computer having a display device, e.g., a cathode ray tube (CRT) or liquid crystal display (LCD) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input.

[00153] Implementations may be implemented in a computing system that includes a back-end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front-end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation, or any combination of such back-end, middleware, or front-end components. Components may be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (LAN) and a wide area network (WAN), e.g., the Internet.

[00154] While certain features of the described implementations have been illustrated as described herein, many modifications, substitutions, changes and equivalents will now occur to those skilled in the art. It is, therefore, to be understood that the appended claims are intended to cover all such modifications and changes as fall within the true spirit of the embodiments of the invention.

Claims

WHAT IS CLAIMED IS:

1. A method comprising: generating, based on images captured by a camera during a videoconference, a three- dimensional model of a first user participating in the videoconference; determining, based on an audio signal from the first user, that the first user is speaking, via the videoconference, with a second user; and based on determining that the first user is speaking to the second user, displaying, to a third user, a representation of the first user and a representation of the second user where the representation of the first user is based on a rotation of the three-dimensional model of the first user so that the representation of the first user is facing toward the representation of the second user.

2. The method of claim 1, wherein the audio signal includes a name of the second user spoken by the first user.

3. The method of either of claims 1 or 2, wherein the audio signal includes speech by the first user for at least a speech threshold period of time.

4. The method of either of any of the preceding claims, wherein: the audio signal is a first audio signal; the representation of the first user is a first representation of the first user; and the method further includes: determining, based on a second audio signal, that the first user is speaking to a general audience within the videoconference; and based on determining that the first user is speaking to the general audience, displaying, to the third user, a second representation of the first user, the second representation of the first user presenting the first user as facing toward the third user.

5. The method of claim 4, further comprising, based on determining that the first user is speaking to the general audience: reducing a size of the representation of the first user; and reducing a size of the representation of the second user.

6. The method of any of the preceding claims, wherein: the audio signal is a first audio signal; the representation of the second user is a first representation of the second user; and the method further includes: determining, based on a second audio signal from the second user, that the second user is engaging in a one-on-one conversation with the third user within the videoconference; and based on determining that the second user is engaging in the one-on-one conversation with the third user, displaying, to the third user, a second representation of the second user, the second representation of the second user presenting the second user as facing toward the third user.

7. The method of claim 6, wherein the determining that the second user is engaging in the one-on-one conversation with the third user within the videoconference is further based on silence from the first user for a silence threshold period of time.

8. The method of either of claims 6 or 7, wherein the determining that the second user is engaging in the one-on-one conversation with the third user within the videoconference is further based on the second audio signal including speech by the second user for at least a speech threshold period of time.

9. The method of any of the preceding claims, wherein the determining that the first user is speaking to the second user is based on the audio signal from the first user and an eye gaze of the first user.

10. The method of any of the preceding claims, wherein the determining that the first user is speaking to the second user is based on the audio signal from the first user and an audio signal from the second user.

11. The method of any of the preceding claims, wherein the determining that the first user is speaking to the second user is based on the audio signal from the first user and an eye gaze from the second user.

12. The method of any of the preceding claims, wherein the displaying the representation of the first user and the second user is based on determining that the first user is speaking to the second user and that the second user is speaking to the first user.

13. A method comprising: generating, based on images captured by a camera during a videoconference, a three- dimensional model of a first user participating in the videoconference; determining, based on an audio signal from the first user, that the first user is speaking, via the videoconference, with a second user; and based on determining that the first user is speaking to the second user, displaying a representation of the first user where the representation of the first user is based on a rotation of the three-dimensional model of the first user so that the representation of the first user is facing toward the second user.

14. The method of claim 13, wherein the displaying the representation of the first user is performed by a computing system that is in a same room as the second user.

15. The method of claim 13, wherein the displaying the representation of the first user is performed by a computing system that is in a same room as the second user and a third user, the third user providing input to the computing system.

16. A non-transitory computer-readable storage medium comprising instructions stored thereon that, when executed by at least one processor, are configured to cause a computing system to perform the method of any of the preceding claims.

17. A computing system comprising: at least one processor; and a non-transitory computer-readable storage medium comprising instructions stored thereon that, when executed by the at least one processor, are configured to cause the computing system to perform the method of any of claims 1-15.