
WO2014071076A1 - Conferencing for participants at different locations

Info

Publication number
WO2014071076A1
WO2014071076A1 (application PCT/US2013/067877, US 2013 067877 W)
Authority
WO
WIPO (PCT)
Prior art keywords
audio data
segments
location
audio
locations
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/US2013/067877
Other languages
French (fr)
Inventor
Ronald David GUTMAN
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Publication of WO2014071076A1

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 12/00 - Data switching networks
    • H04L 12/02 - Details
    • H04L 12/16 - Arrangements for providing special services to substations
    • H04L 12/18 - Arrangements for providing special services to substations for broadcast or conference, e.g. multicast
    • H04L 12/1813 - Arrangements for providing special services to substations for broadcast or conference, e.g. multicast for computer conferences, e.g. chat rooms
    • H04L 12/1827 - Network arrangements for conference optimisation or adaptation

Definitions

  • Segment player 470 performs the following functions:
  • Obtains discussion status 272 from the Central computing system 260, tracks the activity of the participant using this client 210, keeps playback status, provides related information to the user interface 430, and provides participant status, mentioned above, to the New Segment Control unit 450.
  • Provides the index of the currently playing segment 150 to the New Segment Control Unit 450. The index can be used in formatting new segments (note field 550 in Fig. 5).

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

In a teleconference, audio data represent audio signals (air oscillations) from different participants without mixing the audio signals from different locations even if participants speak simultaneously; each participant's audio is not obscured by other participants. All participants' audio data are queued in a common queue (150) based on the time the audio was generated, and/or on the participants' priorities, and/or other information. The audio is played at each location in the queue's order. Other features are also provided.

Description

CONFERENCING FOR PARTICIPANTS AT DIFFERENT LOCATIONS
Ronald David Gutman
CROSS REFERENCE TO RELATED APPLICATIONS
[0001] The present application claims priority and benefit of U.S. provisional application no. 61/721,032 filed November 1, 2012, incorporated herein by reference.
BACKGROUND OF THE INVENTION
[0002] The present invention relates to telecommunications networks, and more particularly to teleconferencing.
[0003] Teleconferencing seeks to extend the usual telephone capability to more than two people, so as to allow any number of people to communicate remotely as if they were talking face to face, e.g. in the same room. Teleconferencing equipment receives the audio signals from each person (each conference participant), mixes the audio, and sends the mixed audio to each participant. Additionally, if video conferencing is also available, then a participant may receive images (e.g. photographs or computer screen images) from one or more other participants. The images can be displayed on a computer monitor.
[0004] Improved teleconferencing facilities are desirable.
SUMMARY
[0005] This section summarizes some features of the invention. Other features may be described in the subsequent sections. The invention is defined by the appended claims, which are incorporated into this section by reference.
[0006] Some teleconferencing embodiments of the present invention step away from imitating face-to-face interaction between participants, and such embodiments enhance a teleconference with features not available in face-to-face interaction. In particular, in some embodiments, the participants' audio is not mixed and hence not obscured by other participants. Thus, some embodiments allow people at different locations to have a discussion using a voice network or data network in such a way that the following benefits and conveniences are provided:
[0007] 1. No person's spoken contribution is missed by any other person. To be heard in the discussion, a speaker makes very little effort. In some embodiments, the speaker may start to speak as soon as the words come to mind. In some embodiments, the speaker can start speaking at any pause in the discussion and his contribution will be heard regardless of what other participants do - other participants may start speaking at the same time and/or be distracted without missing any participant's contribution.
[0008] 2. Even if two or more participants speak at the same time, they are not heard as speaking at the same time; their contributions to the discussion can be heard one at a time, in sequence.
[0009] 3. Participation schedules need not be precisely coordinated: a person can join the conference late yet still participate and hear all of the discussion. This can be achieved, for example, by recording each participant's contribution for later reproduction by any participant including those who join late.
[0010] 4. Interruptions, such as a call on another phone or other distraction, also do not cause any of the discussion to be missed by any person. The person being distracted can hear the other participants' contributions later during the conference if the contributions are recorded. The other participants thus do not have to wait for the distracted person; they can continue the discussion, or they can listen to earlier recorded contributions if desired.
[0011] 5. A speaker can pause briefly without being interrupted by another speaker who starts speaking at the pause - both speakers can speak at the same time.
[0012] 6. A moderator is not needed. A moderator can help establish priorities but is not needed to achieve the previously mentioned benefits.
[0013] 7. Some embodiments do not need a moderator to prioritize speakers.
[0014] 8. If one person is doing a presentation including video, questions can be asked at any time but can be heard by other participants later; any visual context shown by the presenter at the instant of the question can be automatically provided to each participant when the question is played to the participant.
[0015] 9. Muting is automatically applied to reduce noise from locations where no one is speaking to make a contribution.
[0016] There are many situations in which some embodiments are useful. Some of these situations are as follows:
[0017] 1. A meeting among several people of an organization when those people are in different locations. Some of them might be traveling and some working in home offices. An example of a kind of meeting that might benefit significantly is a brainstorming session because each person can contribute at the time his idea occurs. Long pauses in the discussion are less burdensome because if multiple participants speak at the same time, then each participant can listen to other participants' audio during a pause.
[0018] 2. A lecture given over the internet. Students will ask questions. A student can ask a question at any time and the question will be placed in a queue. The lecturer might ask questions to be answered by the students, and the students' answers can be placed in a queue.
[0019] 3. Other kinds of presentations to an audience when questions are asked by the presenter or the audience.
[0020] 4. A chat among sport fans about a game in progress. In cases like this, the system might not provide any video; the users (chat participants) are presumed to each use their own favorite means of observing the game - TV or internet.
[0021] 5. Persons reporting about and coordinating response to an emergency such as one caused by an earthquake or hurricane.
[0022] 6. Any other discussion in which the participants are at different locations and participate at approximately, but not necessarily exactly, the same time.
[0023] In this document, an individual participant is sometimes referred to as "he" meaning "he or she".
[0024] The invention is not limited to the features and advantages described above except as defined by the appended claims.
BRIEF DESCRIPTION OF THE DRAWINGS
[0025] Fig. 1 illustrates audio data flow in some embodiments of the present invention.
[0026] Fig. 2 is a block diagram of a teleconferencing system according to some embodiments of the present invention.
[0027] Fig. 3 is a block diagram of a central computing system used in teleconferencing according to some embodiments of the present invention.
[0028] Fig. 4 is a block diagram of a teleconferencing participant's system according to some embodiments of the present invention.
[0029] Fig. 5 is a block diagram of an audio segment according to some embodiments of the present invention.
DESCRIPTION OF SOME EMBODIMENTS
[0030] The embodiments described in this section illustrate but do not limit the invention. The invention is defined by the appended claims.
[0031] Some embodiments of this invention include methods that do at least the following:
[0032] 1. Capture each person's spoken contribution when it occurs, but, instead of transmitting it to other participants immediately, store the contribution as a segment of the overall discussion. For example, Figure 1 illustrates a conference of four participants at respective four locations 110A, 110B, 110C, 110D. In this schematic illustration, each location has a microphone 120 and a speaker device 130. (The microphone converts the audio signals into electrical signals, and the speaker device performs the opposite transformation, as known in the art.) A segment of the discussion is defined as audio data that records one person's continuously spoken contribution. The microphone 120 at each location 110X (i.e. 110A, 110B, 110C, 110D) generates a respective segment 140X as shown in the Figure. Actually, a location 110X may generate multiple segments or no segments, as any participant might contribute more than one segment at any given time in the discussion, but a segment 140 (i.e. 140X) contains an uninterrupted contribution from one person. However, in some cases, where several participants share a single conference room at each location 110 (possibly with a single microphone 120), one segment 140 might contain contributions from more than one of the participants in the conference room. The end of a segment is automatically determined by a minimum-length pause (see the sketch after this list).
[0033] 2. Place the segments in a sequence (e.g. sequence 150 in Figure 1). This process can be called serialization; the segments are serialized. The sequence could be created on a single storage device or system (e.g. 264 in Figure 2) by a central computing device such as a computer server (e.g. 260 in Figure 2). Below, reference number 150 is used to refer both to the sequence of serialized segments and to an individual segment 140 in the sequence.
[0034] 3. Provide the serialized segments 140 in the determined sequence 150 to each participant (e.g. each speaker device 130) when each participant is ready to hear it. Every participant hears the same sequence 150, but not necessarily at the same time. An exception is that a participant might or might not hear the segments 140 created from his own contributions. When a participant is speaking to create a new segment 140, the audio output of the serialized discussion is paused for him at his speaker device 130 until he finishes speaking, when the output is resumed for him.
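As an illustrative sketch only of step 1 above, the following Python fragment shows one way a segment could be ended by a minimum-length pause. The frame length, RMS silence threshold, and pause duration are hypothetical values chosen for the example; the patent does not specify them.

    import numpy as np

    FRAME_MS = 20            # assumed analysis frame length
    SILENCE_RMS = 500.0      # assumed RMS silence threshold for int16 samples
    MIN_PAUSE_MS = 800       # assumed "minimum length pause" that ends a segment

    def split_into_segments(frames):
        """Group consecutive audio frames into segments; a segment ends once
        silence has lasted at least MIN_PAUSE_MS."""
        segments, segment, silent_ms = [], [], 0
        for frame in frames:                       # frame: np.int16 array, FRAME_MS long
            rms = np.sqrt(np.mean(frame.astype(np.float64) ** 2))
            is_silent = rms < SILENCE_RMS
            if segment or not is_silent:           # ignore silence before speech starts
                segment.append(frame)
            if is_silent:
                silent_ms += FRAME_MS
                if segment and silent_ms >= MIN_PAUSE_MS:
                    segments.append(np.concatenate(segment))   # one segment 140
                    segment, silent_ms = [], 0
            else:
                silent_ms = 0
        if segment:                                # flush a segment still open at the end
            segments.append(np.concatenate(segment))
        return segments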
[0035] Many different embodiments of the invention can deal with a variety of situations and improve the usefulness in those situations. For example:
[0036] 1. A user interface at each location 110 can provide users (e.g. participants) with information about the discussion such as the following information:
[0037] - a. The total number of current participants (or of locations 110).
[0038] - b. For a listening participant (listening at his speaker device 130), the identity of the participant that the listening participant is currently hearing, that is, who contributed the segment 140 that is currently played by the speaker device 130.
[0039] - c. For a listening participant, the amount of time required to hear all of the remaining discussion not yet heard by the participant but already recorded by other participants.
[0040] - d. When a participant is speaking, whether his speaking has been recognized and is being stored as a discussion segment 140. This information might be displayed visually in a prominent way. In the case where the participant's speaking is not accepted due to limits imposed on the speaking participant, an audible alarm can be produced at the participant's location 110 or the playback of serialized discussion segments 150 can be continued without pause.
[0041] - e. If the contributions of a participant are limited in time in some way, the amount of additional speaking time that the participant can contribute at this point in time.
Some possible limitations are explained below.
[0042] 2. A user interface at a location 110 can also allow a participant to:
[0043] - a. Pause the playing of the discussion when the participant needs to refer to another source of information, answer a phone call, or attend to some other urgent need.
[0044] - b. Rewind the discussion if the participant needs to hear part of the discussion again to better understand it.
[0045] - c. Skip ahead after rewind to the first point not yet heard by the participant.
[0046] - d. Skip ahead to the end of a segment 140 contributed by the same participant.
In some embodiments, though, it might be the default for the participant to skip his own segment.
[0047] - e. Continue after pause, rewind, or skip.
[0048] 3. Sometimes it is desirable to control the proportion of each person's contribution to the discussion so that no person inappropriately dominates. At the same time, it is desirable for each participant to hear all of the other participants. Therefore, some embodiments can place limits on the following:
[0049] - a. The proportion of the discussion time attributed to one speaker.
[0050] - b. The amount of discussion unheard by a participant who is contributing a new segment. This helps to ensure that, when a user (a participant) makes a contribution, he is aware of most of what others have said, and it discourages a participant who wants to contribute from falling behind due to interruptions, rewinding, or listening to his own segments (which would take twice as much of his time as not listening to them).
[0051] These limits can be applied each time a participant begins to speak, to determine whether his speech will be stored as a new segment. As the participant speaks, they can also be applied to determine a cut-off time if needed.
[0052] 4. For some situations, such as a lecture, the limits applied can depend on the participant. The limits for a lecturer, for example, can be much larger. The lecturer naturally contributes a much larger proportion of the discussion. Another embodiment for lectures can place the lecturer's segments 140 alternately in the discussion sequence 150 since it is natural for the lecturer to answer each question from students and, when he asks a question, respond to each answer from a student.
[0053] 5. In situations where short contributions are preferred, participants can be encouraged to produce short segments by a serialization method that gives priority to shorter segments. Whenever there is more than one segment to be serialized, the shorter segments can be placed in the sequence first. This would also tend to give priority to questions and that is usually desirable. A very simple embodiment of this would sequence segments according to the actual time their production is completed. But a presenter's segments can be prioritized differently so that the presenter's answer to a question can be placed immediately following the question.
[0054] 6. The output of the serialized discussion sequence can include pauses of a desired length between the segments. This provides a participant who wishes to make a contribution with an obvious and convenient moment to do so. This is especially useful in embodiments that do not allow a user to start a new segment while listening to another segment. Such embodiments are useful where the computing devices used are not powerful enough for the speech processing needed to separate the segment being output from the new speech.
[0055] 7. Video can accompany the audio and be serialized with it. This assumes that each speaker has both a microphone available and a means to produce video like a web-cam or a laptop computer.
[0056] 8. When a presentation includes visual media, such as Microsoft PowerPoint slides, the presenter can provide all of the video and the video can be stored with the serialized discussion sequence. When a new segment is created from a question from the audience, it can be associated with a point in time of the video provided by the presenter. Then, when that audio segment is output, it can be accompanied by the video at that point in time providing the context for the question. If the video originally comes from the presenter's computer, the presenter has the option to take control back and switch the video in real-time, but that only affects what subsequent users will see.
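For feature 8 above, the following is a minimal sketch of how a discussion client might tag an audience question with the presenter's current video time so the matching slide can be shown later. The class name and field names are illustrative, not taken from the patent.

    import time

    class PresenterVideoClock:
        """Tracks the presenter's position in the shared video so that a question
        segment can be tagged with the video time at which it was asked."""
        def __init__(self):
            self._started_at = time.time()
        def current_offset(self):
            return time.time() - self._started_at

    def tag_question_with_video_context(segment, clock):
        # Store the presenter-video offset so later playback of this audio segment
        # can be accompanied by the slide or frame shown when the question was asked.
        segment["video_offset_s"] = clock.current_offset()
        return segment

    question = {"contributor": "student-7", "audio": b"..."}
    question = tag_question_with_video_context(question, PresenterVideoClock())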
[0057] The following discussions and the accompanying diagrams more fully explain some embodiments of the invention.
[0058] Figure 1 shows the main concept in which segments of audio ("new audio segments" 140 below) from different locations 110 are placed into a sequence 150 of segments, the "serialized audio segments" in the figure. The sequence 150 of segments can be streamed back to audio speakers 130 at each location 110 to be heard by the participants at those locations. The streaming to each location 110 can occur independently of the other locations.
[0059] Figure 2 shows four different locations 110A, 110B, 110C, 110D each of which has participants in the same discussion. Each of the four locations has a discussion client 210 which is a system dedicated to capturing contributions from the user or users (i.e. participants) at that location and delivering serialized discussion segments 150 back to the users at that location. Each location 110 in the figure is set up differently:
[0060] 1. Location 110A has only audio devices, namely an audio speaker 130 and a microphone 120. Other locations 110 might generate and view video as part of the discussion, but location 110A does not.
[0061] 2. Location 110B has a speaker device 130 and a microphone 120, and in addition has video devices including a web cam 220 and a display screen 240 for video.
[0062] 3. Location 110C is a mobile phone. The phone can be used for audio only as the phone includes a microphone and a speaker device (not shown), but the phone could be used for video also since the phone includes a screen and may or may not include a camera. The corresponding discussion client 210 is the phone's computer (not shown).
[0063] 4. Location 110D is a laptop computer which may be able to participate fully in the discussion, i.e. to provide both audio and video capture and display. The discussion client is the laptop computer's processor and memory and other associated circuitry (e.g. network interface card, etc.).
[0064] Each of these discussion clients 210 at locations 110A-110D communicates with central computing system 260 that stores the serialized discussion segments 150 and assigns to each segment an index, e.g. a whole number greater than zero, which is a number used to identify the segment. The central computing system 260 can be a computer on the Internet or can be a network of computers. New segments 140 are generated by each discussion client 210 and sent to the central computing system 260, which puts them in sequence 150. All of the segments 140 are sent back to each client 210 in the sequence 150 established by the central computing system 260. The central computing system 260 also provides configuration data 268 to clients 210 and the status 272 of the discussion in progress.
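The exchange between a discussion client 210 and the central computing system 260 might look like the following sketch. The HTTP endpoints, the base URL, and the use of the third-party requests library are assumptions made for illustration only; the patent does not specify a transport.

    import requests  # third-party HTTP client, used here only for brevity

    BASE = "https://conference.example.com"   # hypothetical address of central system 260

    def send_new_segment(client_id, segment):
        # A discussion client 210 uploads a new segment 140 for serialization.
        return requests.post(f"{BASE}/segments", json={"client": client_id, **segment}).json()

    def fetch_serialized_segments(after_index):
        # Each client pulls the common sequence 150 independently of the other clients.
        return requests.get(f"{BASE}/segments", params={"after": after_index}).json()

    def fetch_config_and_status():
        # Configuration data 268 and discussion status 272.
        config = requests.get(f"{BASE}/config").json()
        status = requests.get(f"{BASE}/status").json()
        return config, status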
[0065] Figure 3 shows how the central computing system 260 works in detail. At this point, it is useful to know what each discussion segment 140 contains in addition to the audio data 510 (see Fig. 5). Each segment 140, as generated by a discussion client 210 and transmitted to the central computing system 260, contains the following:
[0066] - The audio data 510 that represents the spoken contribution of a participant.
[0067] - The identity 520 of the participant (possibly of the corresponding location 110 or the microphone 120 or the discussion client 210) who made the contribution.
[0068] - The time stamp 530 identifying the absolute time when the contribution was made, that is, when the participant began speaking. The time stamp can be encoded according to a standard such as Unix time, which encodes time as the number of seconds after January 1, 1970.
[0069] - An indicator 540 that indicates whether or not the participant began speaking while his discussion client was paused between the playing of two segments.
[0070] - The index 550 of a related segment, which can be the segment played by the participant's discussion client 210 at the instant the participant began speaking or otherwise began contributing audio. If the participant began during a pause between two segments, the index can be the index of the last segment played by the participant's discussion client. If no segment of the discussion had yet been played by the participant's discussion client, then the related segment's index can be 0.
[0071] - A time stamp 560 relative to the beginning of the related segment, if any. The time stamp can be encoded by the number of seconds from the start of the other segment.
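A minimal sketch of the per-segment record described above (fields 510-560 of Fig. 5), written here as a Python dataclass; the field names are illustrative.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Segment:
        """One discussion segment 140 as sent from a discussion client 210 to the
        central computing system 260 (fields of Fig. 5)."""
        audio: bytes                  # 510: audio data of the contribution
        contributor_id: str           # 520: identity of the participant (or location/client)
        start_time_unix: float        # 530: absolute time the participant began speaking
        started_in_pause: bool        # 540: began between the playing of two segments?
        related_segment_index: int    # 550: segment playing when speech began (0 = none)
        related_offset_s: float       # 560: seconds from the start of the related segment
        index: Optional[int] = None   # sequence index assigned later by serializer 330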
[0072] Newly created segments 140 are directed to Speech processing unit 310 (Figure 3) within central computing system 260. Speech processing unit 310 cleans up the sound (510) in the segment 140. The speech processing unit might remove noise, but an important part of the cleanup is to remove sound from a related segment (indicated by 550) which was being played by discussion client 210 when the segment 140 was created. The related segment's sound might be picked up by the microphone 120 being used to capture the new segment 140. It is desirable to remove that part of the captured sound, if any, to make the contributor's presentation clearer. If the contributor uses a headset for microphone 120 or starts in a pause, the unwanted sound might be minor, but otherwise it can be significant. This processing is similar to echo cancellation, and prior-art techniques such as spectral subtraction might be used by the unit 310. In removing sound from the related segment (550), the original audio data 510 of the related segment can be used; the speech processing unit 310 can obtain this audio data from the serialized segment server 320 using the related segment index 550, and may use the relative time stamp 560 to locate the sound within the related segment (550) originally produced by the discussion client 210 at the time when the new segment 140 was captured. Only a short initial portion of the audio 510 of the new segment 140 is processed for this removal because the discussion client 210 pauses play of the related segment (550) shortly after the creation of the new segment 140 begins.
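A highly simplified spectral-subtraction sketch of this cleanup step, assuming equal-length NumPy frames of the microphone capture and of the related segment's playback, already time-aligned using the relative time stamp 560. A real implementation would add alignment, level estimation, and adaptive echo cancellation; the parameter values here are illustrative.

    import numpy as np

    def remove_related_playback(mic_frame, playback_frame, alpha=1.0, floor=0.02):
        """Crude spectral subtraction: subtract the magnitude spectrum of the audio
        that was being played (the related segment 550) from the microphone capture,
        keeping the microphone phase."""
        window = np.hanning(len(mic_frame))
        mic = np.fft.rfft(mic_frame * window)
        ref = np.fft.rfft(playback_frame * window)
        magnitude = np.maximum(np.abs(mic) - alpha * np.abs(ref), floor * np.abs(mic))
        return np.fft.irfft(magnitude * np.exp(1j * np.angle(mic)), n=len(mic_frame))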
[0073] It should be noted that these functions that improve the sound in the new segment 140 can be performed within the discussion client 210. If so, then the Speech processing unit 310 within the central computing system might not perform these functions. Both systems 310, 210 have access to the related segment data 510. The central system 310 might perform these functions because it may have more powerful processing capabilities, but the client system 210 might perform these functions because it can optimize the processing for the acoustic environment and because it does not need to do processing for other participants. Either way, the Speech processing unit 310 transmits the audio portion 510 of new segments 140 to the Segment Serializer 330.
[0074] It should be noted that segments 140 may arrive simultaneously from different discussion clients 210, and can be processed simultaneously as they arrive.
[0075] The Segment Serializer 330 takes new segment information as described above and places each segment 140 into the sequence 150 by assigning a segment index to the segment. In the embodiment being described, the segment index is a number that is greater for segments later in the sequence 150. A very simple embodiment of the serializer 330 can simply assign increasing indexes in the order that new segments initially arrive at the serializer. Other embodiments can take other factors into account as described above. Also, note that a new segment 140 arrives over time as it is being created, and the serializer 330 can wait until it has received all of the new segment before assigning an index to the new segment, especially since some possible rules for assigning the index use the length of the segment or the time of the segment's end. The segment serializer 330 accesses serialization rules 342 from the Discussion configuration unit 340. For example, the serializer 330 can work as follows (a sketch follows this list):
[0076] 1. When a new segment 140 begins to arrive, the serializer 330 checks the configuration 342 for privileges of the contributing participant. For example, the serialization rules may indicate that the new segment's contributor (identified by 520) has the privilege that his segments 140 appear alternately in the sequence (for example, if the contributor is a lecturer). If so, and if the last assigned segment 140 (i.e. the last segment assigned an index) was contributed by another participant, then the segment 140 is immediately assigned the next index.
[0077] 2. Otherwise, the new segment 140 is placed in a group of unassigned segments to which other rules 342 are applied. If enough discussion clients 210 are waiting for segments because the discussion clients have already played all assigned segments, then the serializer 330 can pick an incomplete new segment 140 and assign to it the next index so the segment can become available to the waiting clients 210. Serializer 330 can apply a rule such as picking a segment 140 based on its absolute time stamp 530. The next index can be assigned to the segment 140 with the earliest time stamp 530. Serializer 330 can also consider the priority of the contributor. The serializer can obtain current information about waiting clients from the Serialized Segment Server 320.
[0078] 3. Otherwise, if the current demand from clients 210 is low, other rules can be applied. As each of these segments 140 is completed, some of the other rules for serialization can be considered. For example, if a completed new segment 140 is short enough it can be assigned an index immediately. The segment can be deemed short enough if all other unassigned new segments are already longer. Alternately, the serializer can simply choose the earliest completed new segment 140 to be assigned the next index.
Serializer 330 can also consider the priority of the contributor.
[0079] 4. Immediately upon assigning an index to a new segment, the serializer starts transmitting the segment to the Serialized segment server 320.
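The sketch referenced above is one hypothetical realization of rules 1-3, reusing the Segment record sketched earlier; the rule ordering, the "alternating" key, and the use of segment length as a duration proxy are illustrative assumptions.

    def pick_segment(pending, last_contributor, waiting_clients, rules):
        """Choose which unassigned segment is serialized next, in the spirit of
        rules 1-3 above. 'pending' is a list of Segment records."""
        # Rule 1: a privileged contributor (e.g. a lecturer) alternates with the others.
        for seg in pending:
            if (seg.contributor_id in rules.get("alternating", set())
                    and seg.contributor_id != last_contributor):
                return seg
        # Rule 2: clients are already waiting, so release the earliest contribution (530).
        if waiting_clients > 0:
            return min(pending, key=lambda s: s.start_time_unix)
        # Rule 3: demand is low; prefer the shortest segment (questions tend to be short).
        return min(pending, key=lambda s: len(s.audio))

    def serialize_next(pending, last_contributor, next_index, waiting_clients, rules):
        """Assign the next sequence index, if any segment is pending."""
        if not pending:
            return None, next_index
        seg = pick_segment(pending, last_contributor, waiting_clients, rules)
        pending.remove(seg)
        seg.index = next_index          # places the segment into sequence 150
        return seg, next_index + 1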
[0080] The Serialized Segment Server 320 does the following:
[0081] 1. Receives newly assigned segments 140 from the Segment Serializer 330 and stores them in storage 264 with all of the associated attributes mentioned above (see Figure 5). The storing can begin while the segment 140 is still being transmitted from the serializer 330.
[0082] 2. Handles requests for serialized segments 150 from Discussion clients 210. A request can come while a serialized segment in sequence 150 has not been fully transmitted from the serializer 330, in which case the Serialized segment server 320 streams the segment 140 directly from the serializer 330.
[0083] 3. Handles requests from the Speech processing unit 310 for related speech segments. These might also be streamed or retrieved from the storage 264.
[0084] 4. Responds to requests from discussion clients 210 for discussion status 272. Status 272 can include:
[0085] - a. What serialization indices have been assigned.
[0086] - b. The total time of the audio data in each serialized segment 150 (for those segments which have not yet been completely received from serializer 330, the total time up to current time).
[0087] - c. The total time of the audio data of all serialized segments 150 (i.e. total time of the sound represented by the audio data).
[0088] - d. The participant identity 520 for the creator of each segment.
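A sketch of the discussion status 272 (items a-d above) as a plain dictionary, reusing the Segment record from the earlier sketch. The assumed audio format (16-bit, 16 kHz mono PCM) is illustrative only.

    def duration_s(seg, bytes_per_second=32000):
        # Assumes 16-bit, 16 kHz mono PCM; a partially received segment simply
        # reports its length so far, as item b above allows.
        return len(seg.audio) / bytes_per_second

    def discussion_status(segments):
        """Build the status 272 a discussion client can request (items a-d above);
        'segments' is the list of serialized Segment records stored by server 320."""
        return {
            "assigned_indices": [s.index for s in segments],                     # a
            "segment_durations_s": {s.index: duration_s(s) for s in segments},   # b
            "total_audio_s": sum(duration_s(s) for s in segments),               # c
            "contributors": {s.index: s.contributor_id for s in segments},       # d
        }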
[0089] In an alternate embodiment, the Segment Serializer 330 streams new segments 140 into the Serialized Segment Server 320 to begin storing each new segment 140 before the segment is assigned an index. The Segment Serializer 330 can later send the serialization sequence index to the Serialized Segment Server 320. Before being assigned a sequence index, a new segment 140 can be assigned a temporary index that the serializer 330 and Serialized Segment Server 320 use to reference the segment, or the segment can be referenced by the identity 520 of the participant from which the segment comes.
[0090] Discussion configuration unit 340 keeps the following information that other parts of the system, such as the Segment Serializer 330 and the Discussion Clients 210, can access:
[0091] - a. Privilege information on participants. The privilege information may include:
[0092] - Information on a presenter whose contributions are alternated with others.
[0093] - Information on other priorities for sequencing of contributions.
[0094] - b. Limit on the lengths of each new segment 140, i.e. on the time of the sound of the segment's audio 510. The limit can depend on the contributor.
[0095] - c. Limit on the total proportion of the time taken by all of the segments 140 that can come from one contributor. The limit can depend on the contributor.
[0096] - d. Limit on how much of the discussion sequence can be unheard by any participant when he begins contributing a new segment 140. The limit can depend on the contributor.
[0097] - e. The length of pauses to be inserted by the Discussion client 210 between the playing of consecutive segments 150.
[0098] The term "participant" for these purposes can mean all of the persons sharing one Discussion Client 210, since the system does not distinguish those persons from each other.
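A sketch of the configuration kept by unit 340 (items a-e above) as a Python dataclass; all names and default values are illustrative assumptions, not taken from the patent.

    from dataclasses import dataclass, field
    from typing import Dict, Set

    @dataclass
    class DiscussionConfig:
        """Configuration kept by unit 340 (items a-e above); defaults are illustrative."""
        alternating_contributors: Set[str] = field(default_factory=set)          # a: e.g. a lecturer
        priority: Dict[str, int] = field(default_factory=dict)                   # a: sequencing priority
        max_segment_s: Dict[str, float] = field(default_factory=dict)            # b: per-contributor length limit
        max_share_of_discussion: Dict[str, float] = field(default_factory=dict)  # c: e.g. 0.25 of total time
        max_unheard_s: Dict[str, float] = field(default_factory=dict)            # d: backlog allowed when speaking
        inter_segment_pause_s: float = 2.0                                       # e: pause inserted between segments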
[0099] Figure 4 shows a Discussion Client 210, which serves two purposes:
[00100] 1. Playing the discussion stream at one location 110.
[00101] 2. Capturing new contributions and transmitting them to the central computing system 260 as new segments 140.
[00102] Video buffer 410 and audio buffer 420 in the discussion client simply capture and store, in main memory (not shown) or elsewhere, the raw data from any video capture device such as a web cam 220 and audio capture device 120 such as a microphone so that the data is not lost before it can be processed.
[00103] User interface 430 displays discussion status 272 described above and accepts commands from the participant in one form or another such as a voice command or touch of a button. (User interface 430 may be combined with screen 240 to display both status 272 and video segments, and/or user interface 430 can be combined with user interfaces of other devices, such as 120 or 220.)
[00104] VAD (Voice Activity Detection) unit 440 performs voice activity detection, which means that it detects when new speech begins to occur. The detection is performed based on the signal from audio capture device 120. When VAD 440 detects new speech, VAD 440 alerts New Segment Control unit 450 to manage creation of a new segment 140. VAD unit 440 can use algorithms from prior art, such as counting zero crossings, to detect the start of new speech. The detection can err toward inferring start of speech when there is none because another unit, the speech processing unit 460, can compensate. In this approach, VAD 440 can make a quick judgment and the more complex analysis is only performed when VAD 440 detects start of speech. This design is useful when hardware is not sufficiently powerful or the playing of segments by Segment player 470 feeds sound back to microphone 120 confusing the VAD algorithm.
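A minimal first-pass VAD sketch in the spirit of unit 440, combining a zero-crossing-rate check with an energy check on int16 NumPy frames. The thresholds are illustrative, and unit 460 is expected to re-check the decision as described below.

    import numpy as np

    def zero_crossing_rate(frame):
        # Fraction of adjacent sample pairs whose signs differ.
        signs = np.sign(frame)
        return float(np.mean(signs[:-1] != signs[1:]))

    def speech_started(frame, rms_threshold=500.0, zcr_range=(0.02, 0.35)):
        """Cheap first-pass VAD: flag an int16 frame whose energy is above a
        threshold and whose zero-crossing rate is in a range typical of speech."""
        rms = np.sqrt(np.mean(frame.astype(np.float64) ** 2))
        zcr = zero_crossing_rate(frame)
        return rms > rms_threshold and zcr_range[0] <= zcr <= zcr_range[1]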
[00105] Speech processing unit 460 is very similar to the speech processing unit 310 in the Central computing system 260. As stated before, an embodiment might fully implement only one of the two speech processing units 460, 310 while the other unit simply passes the new segment data 510 through. However, the unit 460 in the Discussion Client 210 may also perform the following tasks:
[00106] 1. It formats the information that accompanies a new segment 140, as described above in connection with Figure 5, based on information from the New Segment Control Unit 450 as described below.
[00107] 2. It transmits the video stream as well as the audio. If unit 460 is directed by the New Segment Control Unit 450 not to transmit the new segment as described below (due to limit violations, for example), then unit 460 may also block transmission of the associated video, if any, captured by device 220.
[00108] If speech processing 460 fully implements reduction of noise and of sound from a related segment, then speech processing 460 can have these special features:
[00109] 1. It can learn parameters that describe how the acoustic environment affects the audio of the new segment. How ambient noise and sound from simultaneously played segments affect the audio of the new segment depends largely on the environment. So speech processing 460 might do a better job of reducing noise and sound from a related segment than the corresponding unit 310 in the Central computing system 260.
[00110] 2. Speech processing 460 can augment the VAD algorithm because speech processing 460 uses information about the sound from a simultaneously played segment and about how the input audio is affected by the simultaneously played segment. After removal of noise and sound of the simultaneously played segment from the audio input, speech processing 460 can test more accurately for the start of new speech. If unit 460 determines that new speech has not occurred (VAD was triggered by noise or playback), unit 460 signals the New Segment Control 450 that a new segment will not be created, and speech processing 460 does not transmit a new segment.
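One simple way to approximate the removal of the simultaneously played sound is to estimate the delay and gain with which the played audio reaches the microphone and then subtract a scaled, shifted copy from the captured audio. The sketch below is an illustrative assumption only (a practical system would more likely use adaptive acoustic echo cancellation); remove_played_audio and its parameters are hypothetical names.

```python
import numpy as np

def remove_played_audio(captured: np.ndarray,
                        played: np.ndarray,
                        max_delay: int = 2048) -> np.ndarray:
    """Estimate the delay and gain of the played segment as picked up by the
    microphone, then subtract it. A crude stand-in for echo cancellation;
    the estimated delay and gain correspond to the 'acoustic environment'
    parameters discussed above."""
    n = min(len(captured), len(played))
    cap = captured[:n].astype(np.float64)
    ref_src = played[:n].astype(np.float64)
    best_delay, best_corr = 0, -np.inf
    for d in range(min(max_delay, n)):          # pick the delay with the highest correlation
        seg = cap[d:]
        c = float(np.dot(seg, ref_src[:len(seg)]))
        if c > best_corr:
            best_corr, best_delay = c, d
    ref = np.zeros_like(cap)
    ref[best_delay:] = ref_src[:n - best_delay]
    denom = float(np.dot(ref, ref))
    gain = float(np.dot(cap, ref)) / denom if denom > 0 else 0.0
    return cap - gain * ref
```

After the subtraction, the residual can be re-tested with the VAD trigger; if it no longer indicates speech, the new segment is aborted as described in item 2 above.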
[00111] New Segment Control unit 450 directs the creation of new segments 140 as follows (a simplified flow sketch appears after this list):
[00112] 1. It waits for indications of a new segment 140 from two sources:
[00113] - a. User interface 430 which might provide a means for the user to indicate creation of a new segment (via voice command, button touch, or other human interface). To the user, this is a "record" command.
[00114] - b. VAD unit 440.
[00115] 2. When New Segment Control 450 has indication of a new segment 140, New Segment Control 450 applies rules based on participant status received from the Segment player 470 and on discussion configuration 268 from the Central computing system 260 to determine whether the new segment should be allowed. Possible rules are described above.
For example, New Segment Control 450 can also use the rules to compute how much additional time this participant can contribute to new segments, based on the current discussion status 272. New Segment Control 450 might use this information to enforce the rules, and can additionally transmit the information to user interface 430 for display.
[00116] 3. If New Segment Control 450 determines that a new segment should be created, New Segment Control 450 signals the Segment player 470 to pause any currently playing segments and signals the Speech processing unit 460 to transmit the new segment. At the same time New Segment Control 450 sends information to the Speech processing 460 on how to format the new segment. Such information may include:
[00117] - a. Identity of the participant.
[00118] - b. Status of playback at the time and information about any related segment being played at the time.
[00119] 4. New Segment Control 450 accepts any signal from the Speech processing unit 460 to abort the new segment and responds by signaling the Segment player 470 to resume any playback.
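The flow of items 1 through 4 can be summarized in code. The sketch below is a simplified assumption about one possible structure; the class and method names (NewSegmentControl, ParticipantStatus, participant_status, begin_new_segment, and so on) are hypothetical, and the rule check reuses the may_contribute helper from the earlier configuration sketch.

```python
from dataclasses import dataclass
from enum import Enum, auto

@dataclass
class ParticipantStatus:
    """Status supplied by the Segment player (illustrative fields)."""
    contributed_seconds: float
    total_seconds: float
    unheard_seconds: float

class Decision(Enum):
    ALLOW = auto()
    DENY = auto()

class NewSegmentControl:
    """Illustrative control flow for unit 450; collaborator objects are assumed
    to expose the hypothetical methods called below."""

    def __init__(self, limits, segment_player, speech_processing, user_interface):
        self.limits = limits
        self.player = segment_player
        self.speech = speech_processing
        self.ui = user_interface

    def on_new_segment_indicated(self, participant_id: str) -> Decision:
        # Item 2: apply rules based on participant status and discussion configuration 268.
        status: ParticipantStatus = self.player.participant_status(participant_id)
        if not may_contribute(self.limits,
                              status.contributed_seconds,
                              status.total_seconds,
                              status.unheard_seconds):
            self.ui.show_denied(participant_id)
            return Decision.DENY
        # Item 3: pause playback, then direct transmission and formatting of the new segment.
        self.player.pause()
        self.speech.begin_new_segment(participant_id,
                                      related_index=self.player.current_segment_index())
        return Decision.ALLOW

    def on_segment_aborted(self) -> None:
        # Item 4: speech processing found no real speech; resume any playback.
        self.player.resume()
```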
[00120] Audio segment buffer 482 and video segment buffer 484 monitor the Central computing system 260 for serialized segments and buffer the segments as they become available. A video segment contains the video data for the corresponding audio segment 140. Audio segment buffer 482 includes, for each segment 140 it stores, all segment information described above, including the serialized segment index. The speech processing unit 460 can access audio segment buffer 482 for a related segment 150 played during creation of a new segment 140 and can remove from the new segment 140 the sound played during creation of the new segment.
[00121] Segment player 470 performs the following functions:
[00122] 1. Accepts pause, rewind, skip, and continue commands from the user interface 430 and pause and continue commands from the New Segment Control unit 450.
[00123] 2. In accordance with those commands, transmits data from the audio segment buffer 482 and video segment buffer 484 (if any) to the audio and video playback devices 130, 240.
[00124] 3. Obtains discussion status 272 from the Central computing system 260, tracks the activity of the participant using this client 210, keeps playback status, and provides related information to the user interface 430 and provides participant status, mentioned above, to the New Segment Control unit 450. For example, in some embodiments, segment player 470:
[00125] - a. Provides the index of the currently playing segment 150 to the New Segment Control Unit 450. The index can be used in formatting new segments (note field 550 in Fig. 5).
[00126] - b. Provides the identity of the participant that created the currently playing segment 140 to the user interface 430 for display.
[00127] - c. Obtains the total time for serialized segments 150 not yet played. This information can be displayed and used by the New Segment Control unit 450. This information is approximately the amount of time required to hear all of the remaining discussion created so far.
[00128] - d. Provides the total time contributed by this participant and by all participants, so far, for display and use by the New Segment Control unit 450.
[00129] - e. Provides the total number of participants for display. (A bookkeeping sketch of items c, d, and e appears after this list.)
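The status bookkeeping in items c through e can be illustrated with a small data structure over the buffered, serialized segments. This is an assumption about one possible scheme; the Segment and SegmentPlayerStatus names and fields are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Segment:
    index: int                # serialized segment index (compare field 550 in Fig. 5)
    participant: str
    duration_seconds: float

@dataclass
class SegmentPlayerStatus:
    """Illustrative bookkeeping for Segment player 470."""
    buffered: List[Segment] = field(default_factory=list)
    next_to_play: int = 0     # position of the next serialized segment to play

    def remaining_seconds(self) -> float:
        # Item c: approximate time needed to hear the rest of the discussion created so far.
        return sum(s.duration_seconds for s in self.buffered[self.next_to_play:])

    def contributed_seconds(self) -> Dict[str, float]:
        # Item d: total time contributed so far, per participant.
        totals: Dict[str, float] = {}
        for s in self.buffered:
            totals[s.participant] = totals.get(s.participant, 0.0) + s.duration_seconds
        return totals

    def participant_count(self) -> int:
        # Item e: total number of participants seen so far.
        return len({s.participant for s in self.buffered})
```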
[00130] Anything described herein as a "unit" can be implemented in hardware with or without the use of software (hardware can include a software-programmed computer).
[00131] All of the systems, sub-systems, and units described herein can be parts of a system that serves many concurrent discussions. This may require suitable scaling, distribution of data and processing, and routing of requests and responses according to the discussions involved.
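When one system serves many concurrent discussions, requests and responses must be routed to the state of the discussion involved. The sketch below is an assumption about one trivial in-memory form of that routing; DiscussionRouter and DiscussionState are hypothetical names, and a scaled deployment would instead map each discussion identifier to a server shard.

```python
from typing import Dict, List

class DiscussionState:
    """Placeholder for per-discussion data: serialized segments, status 272, configuration 268."""
    def __init__(self):
        self.segments: List[dict] = []

    def add_segment(self, segment: dict) -> int:
        # Hypothetical handling: append the segment and return its serialized index.
        self.segments.append(segment)
        return len(self.segments) - 1

class DiscussionRouter:
    """Illustrative routing of requests to per-discussion state."""
    def __init__(self):
        self._discussions: Dict[str, DiscussionState] = {}

    def add_segment(self, discussion_id: str, segment: dict) -> int:
        # Create per-discussion state lazily; a scaled system would select a shard here.
        state = self._discussions.setdefault(discussion_id, DiscussionState())
        return state.add_segment(segment)
```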
[00132] The invention is not limited to the embodiments described above. Other embodiments and variations are within the scope of the invention, as defined by the appended claims.

Claims

1. A teleconferencing method comprising performing, by a teleconferencing system, operations of:
during a teleconference conducted by participants located at two or more locations interconnected by a telecommunications network, obtaining segments of audio data representing audio signals generated by the participants, each segment containing audio data from a respective one of the locations, the audio data in each segment being associated with a time at which the audio signals are assumed to have been generated;
serializing the segments received from two or more of the locations to establish an order of the audio data, the order being established based on one or more predefined rules for establishing the order, the serializing being performed even if audio signals of two or more of the segments overlap in time, the serializing thus allowing the audio signals overlapping in time to be reproduced from the audio data separately rather than mixed; and processing the segments taking the order into account.
2. The method of claim 1 wherein processing the segments comprises sending the segments' audio data over the telecommunications network to the locations.
3. The method of claim 2 further comprising sending to a location, for each segment sent to the location, information regarding an identity of the participant who contributed to the audio data currently being played by the teleconferencing system.
4. The method of claim 2 further comprising sending, to the locations, the total number of participants and/or locations.
5. The method of claim 2 further comprising sending, to the locations, the total time contributed by all participants.
6. The method of claim 1 wherein processing the segments comprises sending the segments' audio data over the telecommunications network to at least two of the locations, wherein each location is being sent at least the segments not generated at the location, wherein the segments sent to both of the locations are sent in the same order.
7. The method of claim 6 wherein the teleconference is conducted by three or more participants.
8. The method of claim 6 wherein the order takes into account one or more of: time when the audio signals were generated;
information on a priority and/or privilege of at least one segment's participant; the segments' lengths.
9. The method of claim 6 wherein the order takes into account that segments obtained from one of the locations are to be alternated in the serialization with segments obtained from one or more other locations.
10. The method of claim 6 wherein at least one of the segments is serialized while the segment is still incomplete.
11. The method of claim 6 wherein processing the segments comprises sending the segments' audio data over the telecommunications network to the locations, wherein each location is not being sent the segments generated at the location.
12. The method of claim 6 further comprising performing, by the teleconferencing system, operations of:
obtaining configuration data specifying, for at least one location, information on at least one of:
a privilege of the location's participant;
a limit on the length of each of the location's segments;
a limit on the total proportion of the time taken by audio data allowed to come from the location;
a limit on how much of audio data can be unheard by the location when the location receives an audio signal for the teleconference;
a length of pauses to be inserted at the location between the playing of consecutive segments.
13. The method of claim 12 wherein the configuration data specify, for at least one location, information on the privileges of the location's participant, the information on the privileges comprising information on sequencing of the location's participant's segments with segments of other participants.
14. The method of claim 12 wherein the configuration data specify, for at least one location, an indication that the participant's segments are to alternate with segments of other participants in said order.
15. The method of claim 6 wherein the locations include a first location and a second location;
wherein the audio data from the first location comprise first audio data associated with first video data from the first location;
wherein the audio data from the second location comprise second audio data which represent audio signals generated in association with the first audio data being played at the second location;
wherein the method further comprises providing the second audio data for being played at one or more locations other than the second location, and providing the first video data for being displayed while playing the second audio data.
16. The method of claim 6 wherein at least the end or the beginning of at least one of the segments of audio data representing audio signals generated by at least one of the participants ("first participant" below) is determined using a minimum-length pause in the audio signals generated by the first participant.
17. The method of claim 6 wherein at least the end or the beginning of at least one of the segments of audio data representing audio signals generated by at least one of the participants ("first participant" below) is determined using a maximum length for segments of the audio signals generated by the first participant.
18. The method of claim 6 wherein at least the end or the beginning of each of a plurality of segments of audio data representing audio signals generated by different participants is determined using maximum lengths for segments of the audio signals generated by said participants, wherein each maximum length depends on the participant.
19. The method of claim 6 wherein processing the segments comprises inserting pauses of a predetermined length between the segments.
20. The method of claim 6 wherein at least one segment ("first segment") represents audio signals generated at one location while playing, at said location, audio data from another; and
the method further comprises removing the played data from the first segment.
21. The method of claim 6 wherein in obtaining the segments, the audio data for the segments from each location are limited by a proportion of time for audio signals generated at the location relative to the audio signals generated at all the locations.
22. The method of claim 6 wherein in obtaining the segments, the audio data for the segments from each location are limited based on amount of audio data not yet played at the location.
23. A teleconferencing method comprising executing teleconferencing operations by a teleconferencing system located at a first location which is one of two or more locations interconnected by a telecommunications network, the teleconferencing operations being executed during a teleconference conducted by participants located at the two or more locations, the teleconferencing operations comprising:
(1) obtaining, by the teleconferencing system, a sequence of segments of audio data representing audio signals generated at one or more other locations, each segment containing audio data from a respective one of the locations, the audio data representing the audio signals from each location separately without mixing the audio signals from different locations, regardless of whether or not the audio signals at different locations were generated simultaneously;
wherein the teleconferencing system is operable to play the audio data;
(2) wherein the method further comprises recording audio signals generated at the first location to generate audio data representing the audio signals, and sending such audio data over the telecommunications network for use at the other locations.
24. The method of claim 23 wherein the teleconferencing operations are executed during the teleconference conducted by participants located at the three or more locations.
25. The method of claim 23 further comprising providing, via user interface of the teleconferencing system, information regarding an identity of the participant who contributed to the audio data currently being played by the teleconferencing system.
26. The method of claim 23 further comprising providing, via user interface of the teleconferencing system, the total number of participants and/or locations.
27. The method of claim 23 further comprising providing, via user interface of the teleconferencing system, the total time contributed by all participants.
28. The method of claim 23 further comprising providing, via user interface of the teleconferencing system, the time contributed at the first location.
29. The method of claim 23 further comprising providing, via user interface of the teleconferencing system, information regarding an amount of time required to hear all of the audio signals that have been recorded in the teleconference but have not been played.
30. The method of claim 23 further comprising:
obtaining, by the teleconferencing system, information on a limit condition imposed on recording of audio signals in operation (2);
(3) detecting an audio signal;
performing or not performing the operation (2) on the audio signal detected in (3) depending on whether or not performing the operation (2) on the audio signal in (3) would violate the limit condition.
31. The method of claim 30 further comprising providing, via user interface of the teleconferencing system, an indication that the operation (2) is not performed on the audio signal in (3) if the operation (2) is not performed on the audio signal in (3) due to violation of the limit condition.
32. The method of claim 30 further comprising providing, via user interface of the teleconferencing system, an additional amount of time allowed by the limit condition for the audio signal (3) if the audio signal is to be processed by operation (2).
33. The method of claim 30 wherein the limit condition includes a limit on a proportion of time for audio signals processed in operation (2).
34. The method of claim 30 wherein the limit condition includes a limit on an amount of audio data not yet played by the teleconferencing system at the time of (3).
35. The method of claim 23 further comprising receiving, by user interface of the teleconferencing system, a command to:
- pause playing of the audio data;
- rewind playing of the audio data;
- skip ahead after a rewind to a beginning of the audio data not yet played;
- skip ahead to an end of at least a portion of audio data recorded by the teleconferencing system;
- continue playing if the playing has been paused.
36. The method of claim 23 further comprising:
playing first audio data by the teleconferencing system, wherein the first audio data is obtained from another location and is associated with second audio data played at the other location at a time related to the time that the first audio data was produced at the other location, wherein the second audio data is associated with video data;
when playing the first audio data, the teleconferencing system discovering an association between the first audio data and the second audio data and between the second audio data and the video, and in response to this discovering, the teleconferencing system displaying the video data.
37. The method of claim 23 further comprising:
the recording comprises obtaining first audio data representing audio signals generated at the first location, and the method comprises determining that the first audio data is associated with second audio data played by the teleconferencing system at a time associated with the time of obtaining the first audio data; and operation (2) further comprises sending, over the telecommunications network, information on the second audio data as associated with the first audio data.
38. The method of claim 23 further comprising inserting pauses of a predetermined length between the segments when playing the segments' audio data.
39. The method of claim 23 wherein the recording is performed while playing audio data so that the data generated at the first location is affected by played audio data; and
the method further comprises removing the played data from the generated data.
40. The method of claim 23 wherein the teleconferencing system is operable to play the audio data generated at the first location.
41. The method of claim 23 wherein the teleconferencing system pauses the playing of the audio data when the teleconferencing system detects an audio signal not generated by the teleconferencing system.
42. The method of claim 23 wherein the teleconferencing system comprises user interface allowing a participant at the first location to indicate creation of a new segment of audio data to be sent over the telecommunications network for use at the other locations.
43. The method of claim 23 wherein the audio data generated at each location are subdivided into segments and segments from different locations are ordered in a common sequence determined by a central computer system, and at least the end or the beginning of at least one of the segments of audio data representing audio signals generated by at least one of the participants is determined using a minimum-length pause in the audio signals generated by said at least one of the participants.
44. The method of claim 23 wherein the audio data generated at each location are subdivided into segments and segments from different locations are ordered in a common sequence determined by a central computer system, and at least the end or the beginning of at least one of the segments of audio data representing audio signals generated by at least one of the participants ("first participant" below) is determined using a maximum length for segments of the audio signals generated by the first participant.
45. The method of claim 23 wherein the audio data generated at each location are subdivided into segments and segments from different locations are ordered in a common sequence, and at least the end or the beginning of each of a plurality of segments of audio data representing audio signals generated by different participants is determined using maximum lengths for segments of the audio signals generated by said participants, wherein each maximum length depends on the participant.
46. A teleconferencing method comprising:
during a teleconference conducted by participants located at two or more locations interconnected by a telecommunications network, performing operations of:
at each location, generating audio data representing audio signals generated at the location;
sending the audio data generated at each location to every other location; and receiving, at each location, audio data generated at every other location, and playing the received audio data at the location at which the audio data is received;
wherein, regardless of whether or not the audio signals or audio data are generated simultaneously at two or more locations, the audio data represent the audio signals from each location separately, without mixing the audio signals from different locations.
47. A data processing apparatus operable to communicate over a telecommunications network and perform the method of any one of claims 1 to 46.

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201261721032P 2012-11-01 2012-11-01
US61/721,032 2012-11-01

Publications (1)

Publication Number Publication Date
WO2014071076A1 true WO2014071076A1 (en) 2014-05-08

Family

ID=50628064

Family Applications (2)

Application Number Title Priority Date Filing Date
PCT/US2013/067877 Ceased WO2014071076A1 (en) 2012-11-01 2013-10-31 Conferencing for participants at different locations
PCT/US2013/068000 Ceased WO2014071152A1 (en) 2012-11-01 2013-11-01 Teleconferencing for participants at different locations

Family Applications After (1)

Application Number Title Priority Date Filing Date
PCT/US2013/068000 Ceased WO2014071152A1 (en) 2012-11-01 2013-11-01 Teleconferencing for participants at different locations

Country Status (1)

Country Link
WO (2) WO2014071076A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10645035B2 (en) * 2017-11-02 2020-05-05 Google Llc Automated assistants with conference capabilities

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080212747A1 (en) * 2001-01-24 2008-09-04 Microsoft Corporation Method and apparatus for serializing an asynchronous communication
US20090003247A1 (en) * 2007-06-28 2009-01-01 Rebelvox, Llc Telecommunication and multimedia management method and apparatus

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5765164A (en) * 1995-12-21 1998-06-09 Intel Corporation Apparatus and method for management of discontinuous segments of multiple audio, video, and data streams
US9258337B2 (en) * 2008-03-18 2016-02-09 Avaya Inc. Inclusion of web content in a virtual environment
US8665309B2 (en) * 2009-11-03 2014-03-04 Northrop Grumman Systems Corporation Video teleconference systems and methods for providing virtual round table meetings

Also Published As

Publication number Publication date
WO2014071152A1 (en) 2014-05-08

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13850060

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 13850060

Country of ref document: EP

Kind code of ref document: A1