
WO2025153655A1 - Fallback from augmented audio capture to real-world audio capture - Google Patents

Fallback from augmented audio capture to real-world audio capture

Info

Publication number
WO2025153655A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
audio data
examples
capture
control system
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/EP2025/051099
Other languages
French (fr)
Inventor
Janusz Klejsa
Heidi-Maria LEHTONEN
Ludvig NORING
Pawel JAROCH
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dolby International AB
Original Assignee
Dolby International AB
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dolby International AB filed Critical Dolby International AB
Publication of WO2025153655A1 publication Critical patent/WO2025153655A1/en
Pending legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/048 Interaction techniques based on graphical user interfaces [GUI]
    • G06F3/0484 Interaction techniques based on graphical user interfaces [GUI] for the control of specific functions or operations, e.g. selecting or manipulating an object, an image or a displayed text element, setting a parameter value or selecting a range
    • G06F3/04847 Interaction techniques to control parameter settings, e.g. interaction with sliders or dials
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16 Sound input; Sound output
    • G06F3/165 Management of the audio stream, e.g. setting of volume, audio stream path
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00 Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60 Control of cameras or camera modules
    • H04N23/63 Control of cameras or camera modules by using electronic viewfinders
    • H04N23/631 Graphical user interfaces [GUI] specially adapted for controlling image capture or setting capture parameters
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R3/00 Circuits for transducers, loudspeakers or microphones
    • H04R3/005 Circuits for transducers, loudspeakers or microphones for combining the signals of two or more microphones
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2400/00 Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S2400/15 Aspects of sound capture and related signal processing for recording or reproduction

Definitions

  • the methods may involve receiving, by a control system of a device, audio data from a microphone system and video data from a camera system.
  • the audio data may be received from a microphone system of the device and the video data may be received from a camera system of the device.
  • Some methods may involve identifying, by the control system and based at least in part on the audio data and the video data, two or more audio sources in an audio scene.
  • Some methods may involve estimating, by the control system and based on the audio data, at least one audio characteristic of each of the two or more audio sources.
  • Some methods may involve storing, by the control system, audio data and video data received during a capture phase. The storing may, in some examples, involve storing unmodified audio data received during the capture phase.
  • Some methods may involve controlling, by the control system, a display of the device to display images corresponding to the video data and to display, prior to and during the capture phase, a graphical user interface (GUI) overlaid on the images.
  • GUI graphical user interface
  • the GUI may include an audio source image corresponding to the at least one audio characteristic of each of the two or more audio sources.
  • the GUI may include one or more user input areas for receiving user input. Some methods may involve receiving, by the control system and prior to the capture phase, user input via the one or more user input areas. Some methods may involve causing, by the control system, audio data received during the capture phase to be modified according to the user input.
  • Some methods may involve receiving, by the control system and after the capture phase has begun, user input via the user input area. Some methods may involve causing, by the control system, audio data received throughout a duration of the capture phase to be modified according to the user input.
  • Some methods may involve classifying, by the control system, the two or more audio sources into two or more audio source categories.
  • the GUI may include a user input area portion corresponding to each of the two or more audio source categories.
  • the two or more audio source categories may include a background category and a foreground category.
  • the one or more user input areas may include at least one user input area configured for receiving user input regarding a selected level, or ratio of levels, for each of the two or more audio source categories.
  • Some methods may involve storing unmodified audio data received during the capture phase. Some methods may involve storing modified audio data that has been modified according to user input.
  • the identifying may involve detecting, by the control system and based at least in part on the video data, one or more potential sound sources. According to some examples, at least one of the one or more potential sound sources may not be indicated by the audio data.
  • Some methods may involve detecting, by the control system, one or more candidate sound sources for augmented audio capture.
  • the augmented audio capture may involve replacement of a candidate sound source by external audio or synthetic audio.
  • the GUI may include at least one user input area configured for receiving a user selection of a selected potential sound source or a selected candidate sound source.
  • the GUI may include at least one user input area configured for receiving a user selection of augmented audio capture.
  • the augmented audio capture may include at least one of external audio or synthetic audio for the selected potential sound source or the selected candidate sound source.
  • the GUI may include at least one user input area configured for receiving a user selection of a ratio between augmented audio capture and real-world audio capture.
  • causing the audio data received during the capture phase to be modified according to the user input may involve modifying audio data corresponding to a selected audio source or a selected category of audio sources. In some examples, causing the audio data received during the capture phase to be modified according to the user input may involve a beamforming process corresponding to a selected audio source.
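  • As an illustration of how such a per-source or per-category modification could be realized, the following minimal Python sketch remixes separated source signals after applying user-selected gains; it assumes the sources have already been separated, and the function and variable names are illustrative rather than taken from this disclosure.

```python
import numpy as np

def apply_category_gains(separated_sources, category_of, gains_db):
    """Remix separated source signals after applying the per-category
    gains selected by the user (e.g., via GUI sliders)."""
    mix = None
    for name, signal in separated_sources.items():
        gain = 10.0 ** (gains_db.get(category_of[name], 0.0) / 20.0)
        scaled = gain * signal
        mix = scaled if mix is None else mix + scaled
    return mix

# Example: boost the foreground talker, attenuate the background category.
sr = 48_000
sources = {
    "talker_1": 0.1 * np.random.randn(sr).astype(np.float32),
    "cafeteria_noise": 0.1 * np.random.randn(sr).astype(np.float32),
}
categories = {"talker_1": "foreground", "cafeteria_noise": "background"}
remixed = apply_category_gains(sources, categories,
                               {"foreground": 3.0, "background": -12.0})
```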
  • the apparatus may include an interface system that includes an input/output (I/O) system.
  • I/O input/output
  • the apparatus may include a control system including one or more processors.
  • the apparatus may include a display system including one or more displays and a touch sensor system proximate at least one of the one or more displays.
  • the control system may be configured to receive, via the interface system, audio data from a microphone system and video data from a camera system.
  • the control system may be configured to identify, based at least in part on the audio data and the video data, two or more audio sources in an audio scene.
  • the control system may be configured to estimate at least one audio characteristic of each of the two or more audio sources.
  • control system may be configured to store audio data and video data received during a capture phase.
  • control system may be configured to control a display of the display system to display images corresponding to the video data and to display, prior to and during the capture phase, a graphical user interface (GUI) overlaid on the images.
  • GUI graphical user interface
  • the GUI may include an audio source image corresponding to the at least one audio characteristic of each of the two or more audio sources.
  • the GUI may include one or more user input areas for receiving user input via the touch sensor system.
  • control system may be configured to receive, via the interface system and prior to the capture phase, user input via the one or more user input areas.
  • control system may be configured to classify the two or more audio sources into two or more audio source categories.
  • GUI may include a user input area portion corresponding to each of the two or more audio source categories.
  • the two or more audio source categories may include a background category and a foreground category.
  • the one or more user input areas may include at least one user input area configured for receiving user input regarding a selected level, or ratio of levels, for each of the two or more audio source categories.
  • Some such methods may involve receiving, by a control system of a device, audio data from a microphone system and video data from a camera system.
  • the audio data may be received from a microphone system of the device and the video data may be received from a camera system of the device.
  • Some methods may involve identifying, by the control system and based at least in part on the audio data and the video data, two or more audio sources in an audio scene.
  • Some methods may involve estimating, by the control system and based on the audio data, at least one audio characteristic of each of the two or more audio sources.
  • Some methods may involve storing, by the control system, audio data and video data received during a capture phase. The storing may, in some examples, involve storing unmodified audio data received during the capture phase.
  • Some methods may involve controlling, by the control system, a display of the device to display images corresponding to the video data and to display, prior to and during the capture phase, a graphical user interface (GUI) overlaid on the images.
  • GUI graphical user interface
  • the GUI may include an audio source image corresponding to the at least one audio characteristic of each of the two or more audio sources.
  • the GUI may include one or more user input areas for receiving user input. Some methods may involve receiving, by the control system and prior to the capture phase, user input via the one or more user input areas. Some methods may involve causing, by the control system, audio data received during the capture phase to be modified according to the user input.
  • Some methods may involve receiving, by the control system and after the capture phase has begun, user input via the user input area. Some methods may involve causing, by the control system, audio data received throughout a duration of the capture phase to be modified according to the user input.
  • Some methods may involve classifying, by the control system, the two or more audio sources into two or more audio source categories.
  • the GUI may include a user input area portion corresponding to each of the two or more audio source categories.
  • the two or more audio source categories may include a background category and a foreground category.
  • the one or more user input areas may include at least one user input area configured for receiving user input regarding a selected level, or ratio of levels, for each of the two or more audio source categories.
  • Some methods may involve creating, by the control system, an inventory of sound sources.
  • the inventory of sound sources may include actual sound sources and potential sound sources.
  • classifying the two or more audio sources into two or more audio source categories may be based, at least in part, on the inventory of sound sources.
  • Some methods may involve determining one or more types of actionable feedback regarding the audio scene.
  • the GUI may be based, in part, on the one or more types of actionable feedback.
  • Some methods may involve storing unmodified audio data received during the capture phase. Some methods may involve storing modified audio data that has been modified according to user input.
  • Some methods may involve creating and storing, by the control system, user input metadata corresponding to user input received via the user input area. In some examples, causing the audio data received during the capture phase to be modified according to the user input may involve post-capture audio processing based at least in part on the user input metadata.
  • the control system may be configured to perform at least a portion of the post-capture audio processing.
  • another control system, such as a control system of a cloud-based service (e.g., a server’s control system), may be configured to perform at least a portion of the post-capture audio processing.
  • the identifying may involve performing, by the control system, a first sound source separation process.
  • the post-capture audio processing may involve performing a second sound source separation process.
  • the identifying may involve detecting, by the control system and based at least in part on the video data, one or more potential sound sources. According to some examples, at least one of the one or more potential sound sources may not be indicated by the audio data.
  • Some methods may involve detecting, by the control system, one or more candidate sound sources for augmented audio capture.
  • the augmented audio capture may involve replacement of a candidate sound source by external audio or synthetic audio.
  • the GUI may include at least one user input area configured for receiving a user selection of a selected potential sound source or a selected candidate sound source.
  • the GUI may include at least one user input area configured for receiving a user selection of augmented audio capture.
  • the augmented audio capture may include at least one of external audio or synthetic audio for the selected potential sound source or the selected candidate sound source.
  • the GUI may include at least one user input area configured for receiving a user selection of a ratio between augmented audio capture and real-world audio capture.
  • causing the audio data received during the capture phase to be modified according to the user input may involve modifying audio data corresponding to a selected audio source or a selected category of audio sources. In some examples, causing the audio data received during the capture phase to be modified according to the user input may involve a beamforming process corresponding to a selected audio source.
  • Figure 1A illustrates a schematic block diagram of an example device architecture that may be used to implement various aspects of the present disclosure.
  • Figure 1B illustrates a schematic block diagram of an example central processing unit (CPU) implemented in the device architecture of Figure 1A that may be used to implement various aspects of the present disclosure.
  • CPU central processing unit
  • Figure 1C shows examples of a timeline and of time intervals during which some disclosed processes may occur.
  • Figure 2B is a flow diagram that outlines various example methods according to some disclosed implementations.
  • Figure 3A illustrates an example of a GUI that may be presented to indicate one or more audio scene preferences and one or more current audio scene characteristics.
  • Figure 3B is a block diagram that illustrates examples of customization layers that may be presented via one or more GUIs.
  • Figure 4A illustrates an example of a GUI that may be presented in accordance with the first customization layer of Figure 3B.
  • Figure 4B illustrates another example of a GUI that may be presented in accordance with the first customization layer of Figure 3B.
  • Figure 4C illustrates an example of a GUI that may be presented in accordance with the second customization layer of Figure 3B.
  • Figures 4D and 4E illustrate example elements of a GUI that may be presented in accordance with the third or fourth customization layers of Figure 3B.
  • Figure 5A is a flow diagram that outlines various example methods according to some disclosed implementations.
  • Figure 5B is a flow diagram that outlines various example methods according to some disclosed implementations.
  • Figure 5C shows a table that represents example elements of a video object data structure according to some disclosed implementations.
  • Figure 5D shows a table that represents example elements of an audio source inventory data structure according to some disclosed implementations.
  • Figure 6 shows examples of modules that may be implemented according to some disclosed examples.
  • Figure 7 shows example components of an analysis module that is configured for context hypothesis evaluation according to some disclosed examples.
  • Figure 8 is a flow diagram that outlines various example methods according to some disclosed implementations.
  • Figure 9 is a flow diagram that outlines various example methods according to some disclosed implementations.
  • Figure 10A represents elements of an audio source inventory according to some disclosed implementations.
  • Figure 10B is a flow diagram that outlines various example methods according to some disclosed implementations.
  • Figure 10C is a flow diagram that outlines additional example methods according to some disclosed implementations.
  • Figure 11A shows examples of media assets and a mixing module according to some disclosed examples.
  • Figure 11B is a flow diagram that outlines various example methods 1150 according to some disclosed implementations.
  • Figure 13 shows examples of media assets and an interpolator according to some disclosed examples.
  • Figure 15 is a block diagram of an example immersive voice and audio services (IVAS) coder/decoder (“codec”) framework for encoding and decoding IVAS bitstreams, according to one or more embodiments.
  • IVAS immersive voice and audio services
  • the term “includes” and its variants are to be read as open-ended terms that mean “includes, but is not limited to.”
  • the term “or” is to be read as “and/or” unless the context clearly indicates otherwise.
  • the term “based on” is to be read as “based at least in part on.”
  • the term “one example implementation” and “an example implementation” are to be read as “at least one example implementation.”
  • the term “another implementation” is to be read as “at least one other implementation.”
  • the terms “determined,” “determines,” or “determining” are to be read as obtaining, receiving, computing, calculating, estimating, predicting or deriving.
  • all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
  • An audio object is a mono or stereo audio waveform associated with positional metadata, which may facilitate rendering of the audio object for an arbitrary configuration of a playout system.
  • a microphone feed is a signal captured or being captured by a microphone or a microphone array connected to a mobile device performing the capture.
  • the “microphone feed” may also be processed by enhancement tools.
  • the microphone feed may be created at any layer of the audio capture stack implemented on a mobile device. In other words, we use the term “microphone feed” to refer to any signal (mono, multichannel, sound field, etc.) that is available to the disclosed computational capture system.
  • a video feed is a signal captured or being captured by a camera or multiple cameras of the mobile device.
  • the video feed is any video signal that is available to the disclosed computational capture system.
  • a standard media container is an interchangeable media format, which is supported by the software ecosystem associated with the mobile device.
  • the audio capture processes (also referred to herein as the “audio capture stack” or the “audio stack”) implemented by a mobile device are typically designed to maximize the quality of the captured audio according to some assumed artistic intent.
  • a mobile device user generally does not know whether the technology operating within the audio capture stack performs satisfactorily. This can lead to scenarios in which a user may give up on audio capture, for example, in challenging audio conditions, because the user does not know whether any useful audio can be captured in such a situation.
  • a user may not be aware of a solvable audio problem appearing during the audio capture that would require a corrective action from the user. For example, the performance of audio capture may often be improved by moving closer to an audio source, by changing the way the mobile device is held, etc.
  • the audio stack may contain powerful audio enhancement tools, but these tools typically have some limits on their performance (e.g., in terms of the minimum signal-to-noise ratio (SNR)).
  • SNR signal-to-noise ratio
  • the performance of these audio enhancement tools typically degrades as audio capture conditions worsen.
  • the user generally does not know about limits of the audio enhancement tools.
  • the user generally does not know how far the current operating point is from these audio enhancement tool limits.
  • the audio-video scene (also referred to herein as the audio scene) may need to be analyzed in order to identify the audio scene’s acoustic components, preferably along with one or more audio characteristics such as acoustic level information.
  • both actual and potential acoustic components may be determined.
  • the audio scene analysis may be available at any time after the capture app has been launched, and may be updated during the audio capture.
  • the visualization of the audio scene may include not only the presence of actual or potential audio sources, but may also provide estimates about contributions from the respective audio sources to the audio scene.
  • the estimates of audio source contributions to an audio scene may be derived, for example, from estimated levels of the respective audio sources.
  • information on the components of the audio scene may be filtered by estimating which audio sources are likely to be most relevant to the artistic intent of the user.
  • Some aspects of this disclosure involve providing real-time graphical feedback regarding the composition of an audio scene.
  • the graphical feedback may be overlaid on video images that are being obtained by the mobile device.
  • the graphical feedback may, in some examples, be provided via a graphical user interface (GUI) that includes a representation of one or more audio sources in an audio scene and one or more user input areas for receiving user input.
  • GUI graphical user interface
  • audio data received during a capture phase may be modified according to the user input.
  • Some examples involve providing or facilitating one or more types of audio data augmentation, which may occur before, during or after the capture process.
  • the audio data augmentation may involve replacement of a candidate sound source by external audio or synthetic audio, a beamforming process, or both.
  • Some disclosed examples may facilitate the insertion of audio signals that are not present in the microphone feed, the removal of one or more components of the audio scene, etc.
  • a control system may select one or more candidate sound sources for modification, for augmented audio capture, or both.
  • a user may select an audio source for modification, augmentation, or both via a GUI.
  • a GUI, which may be provided during a post-capture editing process, may include at least one user input area configured for receiving a user selection of a ratio between augmented audio capture and real-world audio capture.
  • the user input area may include a virtual slider, a virtual dial, etc.
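  • As a concrete illustration of the fallback concept, the sketch below blends a modified (“augmented”) capture with the stored unmodified (“real-world”) capture according to a slider position. It is a minimal example under the assumption that both versions are available as equal-length sample buffers; the names are illustrative and not taken from this disclosure.

```python
import numpy as np

def fallback_mix(augmented, real_world, ratio):
    """Crossfade between the augmented capture and the real-world capture.

    ratio = 1.0 keeps only the augmented (modified) audio; ratio = 0.0 falls
    back entirely to the unmodified real-world audio."""
    ratio = float(np.clip(ratio, 0.0, 1.0))
    return ratio * augmented + (1.0 - ratio) * real_world

# The virtual slider or dial position maps directly onto `ratio`.
augmented_clip = np.zeros(48_000, dtype=np.float32)   # placeholder buffers
real_world_clip = np.zeros(48_000, dtype=np.float32)
final_audio = fallback_mix(augmented_clip, real_world_clip, ratio=0.25)
```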
  • CPU 141, ROM 142, and RAM 143 are connected to one another via bus 144.
  • Input/output (I/O) interface 145 is also connected to bus 144.
  • the bus 144 and the I/O interface 145 are instances of “interface system” elements as that term is used in this disclosure.
  • the CPU 141 is an instance of what may be referred to as a “control system” or an element of a control system.
  • the ROM 142 and RAM 143 are instances of what may be referred to as a “memory system” or an element of a memory system.
  • In RAM 143, the data required when CPU 141 performs the various processes is also stored, as required.
  • The following components are connected to I/O interface 145: input unit 146, which may include a keyboard, a mouse, or the like; output unit 147, which may include a display system including one or more displays, a loudspeaker system including one or more loudspeakers, etc.; storage unit 148, including a hard disk or another suitable storage device; and communication unit 149, including a network interface card such as a network card (e.g., wired or wireless).
  • the communication unit 149 may be referred to herein as part of an interface system.
  • the input unit 146 may include a microphone system that includes one or more microphones.
  • the microphone system may include two, three or more microphones in different positions, enabling capture of audio signals in various formats (e.g., mono, stereo, spatial, immersive, and other suitable formats).
  • Removable medium 151 such as a magnetic disk, an optical disk, a magneto-optical disk, a flash drive or another suitable removable medium is mounted on drive 150, so that a computer program read therefrom is installed into storage unit 148, as required.
  • control system may be configured to control a display of a display system to display images corresponding to the video data and to display, prior to and during the capture phase, a graphical user interface (GUI) overlaid on the images.
  • GUI graphical user interface
  • the GUI may include an audio source image corresponding to the at least one audio characteristic of each of the two or more audio sources and one or more user input areas for receiving user input via the touch sensor system.
  • control system may be configured to receive, via the interface system and prior to the capture phase, user input via the one or more user input areas and to cause audio data received during the capture phase to be modified according to the user input.
  • various blocks shown in the flowcharts may be viewed as method steps, and/or as operations that result from operation of computer program code, and/or as a plurality of coupled logic circuit elements constructed to carry out the associated function(s).
  • embodiments of the present disclosure include a computer program product including a computer program tangibly embodied on a machine readable medium, the computer program containing program codes configured to carry out the methods as described above.
  • a machine-readable medium may be any tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
  • the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • a machine-readable medium may be non-transitory and may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing.
  • More specific examples of the machine-readable storage medium would include an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random-access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
  • RAM random-access memory
  • ROM read-only memory
  • EPROM or Flash memory erasable programmable read-only memory
  • CD-ROM portable compact disc read-only memory
  • Some disclosed methods may involve causing — e.g., by a control system — audio data received during the capture phase 116 to be modified according to at least some of the user input received via the GUI.
  • the audio data modification may or may not occur during the capture phase 116, depending on the implementation and the type of modification.
  • Some such methods may involve storing the modified audio data.
  • Some disclosed methods may involve creating and storing user input metadata corresponding to user input received via the user input area.
  • the post-capture editing process 124 may involve a process of enabling fallback to real-world capture 125.
  • both modified and unmodified versions of the captured audio data may be stored.
  • the process of enabling fallback to real-world capture 125 may involve allowing a user, for example during the post-capture editing process 124, to select a degree to which the final version of the audio will include modified or unmodified audio corresponding to at least one sound source.
  • the user may be presented with a GUI that includes a virtual slider, a dial, or another such virtual tool with which the user may interact to indicate the degree of audio modification in the final version of the audio.
  • the post-capture editing process 124 may involve a process 127 of enabling post-processing guided by registered user input.
  • the process 127 may involve referring to stored input metadata corresponding to user input received via the user input area during the pre-capture phase 114, during the capture phase, or both.
  • the post-capture editing process 124 may involve one or more video editing processes.
  • blocks 161, 163, 165 and 167 are performed during a pre-capture phase 114.
  • blocks 168, 169 and 171 are performed during a capture phase 116.
  • blocks 173 and 175 are performed during a post-capture phase 118.
  • the pre-capture phase 114, capture phase 116 and post-capture phase 118 may, for example, be as described herein with reference to Figure 1C.
  • the functions of blocks 163, 165, 166, 167, 168, 169 and 171 also may be referred to herein as an “audio interaction framework.” Processing may commence at block 161.
  • Block 166 involves determining whether a “start of capture” event has occurred.
  • block 166 may involve determining whether user input received via the above-referenced GUI corresponds with the initiation of the capture phase 116, during which time audio data received by the microphone system and video data received by the camera system are stored in memory.
  • the methods 160 may involve storing unmodified audio data received during the capture phase, storing modified audio data received during the capture phase — which may have been modified according to user input — or both. Processing may continue to block 168 or to block 167. If it is determined in block 166 that user input from the GUI corresponds to the initiation of the capture phase 116, in this example the process continues to block 168. If it is determined in block 166 that user input from the GUI does not correspond to the initiation of the capture phase 116, in this example the process continues to block 167.
  • the user input received via the GUI may be responsive to a customization option provided by the GUI, such as an option to increase the volume of a sound source, to decrease the volume of a sound source, to select a potential sound source, etc.
  • registering the user input may involve creating and storing metadata corresponding to the user input. Registered user input may be used to guide the modification of audio data during the capture phase 116, the post-capture phase 118, or both. Processing may continue to block 163.
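  • One plausible way to register such user input is to log each interaction as a small metadata record that post-capture processing can replay; the record fields below are assumptions chosen for illustration, not a format defined by this disclosure.

```python
import json
import time

def register_user_input(metadata_log, source_id, action, value):
    """Append one user-interaction record for later (possibly retro-causal)
    processing of the captured audio."""
    metadata_log.append({
        "timestamp": time.time(),  # when the interaction occurred
        "source_id": source_id,    # which audio source or category it targets
        "action": action,          # e.g., "set_level" or "select_augmentation"
        "value": value,            # e.g., a gain in dB or a chosen asset name
    })

log = []
register_user_input(log, "cafeteria_noise", "set_level", -12.0)
with open("capture_user_input.json", "w") as f:
    json.dump(log, f, indent=2)
```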
  • Block 168 involves “analyze audio scene and generate GUI with customization options.” According to some examples, block 168 may simply be a continuation of the “audio interaction framework” that includes block 163 and that continues from block 163 throughout the pre-capture phase 114 and the capture phase 116. Processing may continue to block 169.
  • Block 169 involves determining whether “user input is received from [the] GUI.” In this example the GUI is being presented during the capture phase 116. Processing may continue to block 168 or to block 170. If it is determined in block 169 that user input is received from the GUI, processing continues to block 170 in this example.
  • Block 170 involves determining whether an “end of capture” event has occurred.
  • block 170 may be responsive to user input, which may or may not be received via the above-referenced GUI.
  • block 170 may involve determining whether user input received via the above-referenced GUI corresponds with the end of the capture phase 116. If it is determined in block 170 that user input from the GUI corresponds to the end of the capture phase 116, in some examples the process continues to block 173. However, as noted elsewhere herein, there may be multiple instances of starting and ending capture prior to block 173. Accordingly, the arrow linking blocks 170 and 173 is broken in Figure 1D. If it is determined in block 170 that user input from the GUI does not correspond to the end of the capture phase 116, in this example the process continues to block 171.
  • Block 173 involves “post-capture editing.” According to some examples, block 173 may be performed by a control system of the device involved with audio and video capture, by one or more remote devices — such as one or more servers — or a combination thereof. In some examples, block 173 may involve causing audio data received during the capture phase to be modified according to registered user input, such as metadata corresponding to user input. According to some examples, block 173 may involve presenting a GUI that includes at least one user input area allowing the selection of a potential sound source or a candidate sound source. In some examples, user input may previously have been received corresponding to the selection of a potential sound source or a candidate sound source.
  • block 173 may involve presenting a GUI that includes at least one user input area — such as a virtual slider, a virtual dial, etc. — configured for receiving a user selection of a ratio between modified audio data corresponding to an augmented audio capture and unmodified audio data corresponding to a “real-world” audio capture.
  • the pre-capture phase 114, the capture phase 116, or both may involve performing a first sound source separation process.
  • block 173 may involve performing a second sound source separation process, for example by one or more servers. This is potentially advantageous because the second sound source separation process may be more accurate but may be too computationally intensive for the capture device to perform during the pre-capture phase 114 or the capture phase 116. Processing may continue to block 175.
  • user interaction #1 may correspond to an adjustment of a level of a sound source during the pre-capture phase 114.
  • such an adjustment may immediately affect the graphical feedback provided to the user.
  • a virtual level or volume control slider, dial, etc. may change its size, position, etc.
  • the effect of user interaction #1 is applied to the audio captured during the subsequent capture phase 116 and is therefore referred to as a “causal interaction.”
  • user interaction #2, received during the capture phase 116, may correspond to user input to adjust the ratio between acoustic background and acoustic foreground, to adjust the level of an audio source, etc.
  • the effect of user interaction #2 is applied in a “retro-causal” way, by applying the effect of user interaction #2 from the beginning of the capture of clip #1 at the start of capture time 206a.
  • the retro-causal effect may, for example, be automatically applied by the capture device during the capture phase 116, at the end of the capture phase 116, or during a post-capture phase 118.
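  • The difference between causal and retro-causal application can be summarized in a few lines of Python; the sketch below assumes the clip is buffered as a sample array and applies a user-requested gain either from the start of the clip or only from the sample at which the input was received (names are illustrative).

```python
import numpy as np

def apply_gain_adjustment(clip, gain_db, input_sample, retro_causal=True):
    """Apply a user-requested gain change to a buffered clip.

    With retro_causal=True the change takes effect from the beginning of the
    clip, as if the setting had been active all along; otherwise it is
    applied causally, from the sample at which the user input arrived."""
    gain = 10.0 ** (gain_db / 20.0)
    adjusted = clip.astype(np.float64).copy()
    start = 0 if retro_causal else input_sample
    adjusted[start:] *= gain
    return adjusted

clip = np.ones(10)  # toy clip
print(apply_gain_adjustment(clip, -6.0, input_sample=5, retro_causal=False))
```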
  • blocks 252, 253, 255, 256, 257 and 259 are performed during a capture phase 116.
  • blocks 263 and 265 are performed during a postcapture phase 118.
  • the capture phase 116 and post-capture phase 118 may, for example, be as described herein with reference to Figure 1C. Processing may commence at block 251.
  • Block 251 involves the “start of capture application.” In this example, block 251 corresponds with block 161 of Figure 1D. Processing may continue to block 252.
  • Block 252 involves “obtaining capture preferences.” In this example, the capture preferences are, or include, audio capture preferences. Such audio capture preferences may be obtained in various ways. According to some examples, one or more audio capture preferences may be obtained in block 252 according to explicit action by a user, for example, when the user explicitly sets the audio capture preferences by interacting with a GUI. Accordingly, block 252 may involve receiving user input from a GUI corresponding to one or more capture preferences. In some examples, block 252 may involve obtaining previously-obtained user preference data from a memory.
  • one or more audio capture preferences may be obtained by accessing a stored data structure that includes audio capture preferences.
  • audio capture preferences may be represented in the form of one or more lookup tables that include a list of audio sources along with a corresponding signal level characteristic for each of the audio sources.
  • the lookup tables may take the form of a data structure including a list of detected audio sources present in the audio scene.
  • the list of detected audio sources may include foreground audio objects and one or more background components of the audio scene.
  • the background components may include multiple context-dependent sound sources, such as multiple vehicle sound sources, background talkers and other street noise sound sources for a street scene, or background music, background talkers and dining- and eating-related sounds for a cafe scene.
  • each audio source may be associated with a semantic label, such as “human talker #1,” “human talker #2,” etc., for foreground audio objects, or “street noise,” “babble noise,” “wind noise,” etc., for background components.
  • each audio source in the data structure may be associated with a sequence of timestamps and a binary flag indicating the presence or absence of that audio source, a signal level measurement for that audio source and a detection probability for that audio source.
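  • A minimal Python rendering of such a lookup-table entry might look as follows; the field names mirror the elements listed above (semantic label, timestamps, presence flag, level and detection probability) but are otherwise illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class AudioSourceEntry:
    source_id: int
    label: str                 # e.g., "human talker #1" or "street noise"
    category: str              # "foreground" or "background"
    timestamps: list = field(default_factory=list)       # analysis times (s)
    present: list = field(default_factory=list)          # binary presence flags
    level_db: list = field(default_factory=list)         # per-frame level estimates
    detection_prob: list = field(default_factory=list)   # per-frame probabilities

inventory = {
    1: AudioSourceEntry(1, "human talker #1", "foreground"),
    2: AudioSourceEntry(2, "cafeteria noise", "background"),
}
inventory[1].timestamps.append(12.0)
inventory[1].present.append(1)
inventory[1].level_db.append(-23.5)
inventory[1].detection_prob.append(0.92)
```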
  • the semantic labels for the objects may be generated by an instance of an “audio tagger” that is being implemented by the control system of a device being used for audio capture.
  • an audio tagger may be implemented in many ways, depending on the particular implementation.
  • the audio tagger may be implemented as a trained neural network constructed according to the so-called YAMNet (Yet Another Multiscale Convolutional Neural Network) architecture, which may employ a MobileNet architecture operating on a mel spectrogram of an audio signal. Examples of MobileNets are described in Howard, Andrew G., “MobileNets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861 (2017), https://arxiv.org/pdf/1704.04861.
  • the output representing the confidence level may, for example, be a probability of object detection, such as a value between zero and one, where a value closer to 1.0 would indicate a large confidence and a value close to zero would indicate a low confidence.
  • a confidence level may be determined according to the estimated object detection probability, which may be computed by the same CNN.
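  • As a sketch of how such per-class detection probabilities can be obtained with an off-the-shelf tagger, the example below uses the publicly released YAMNet model from TensorFlow Hub (16 kHz mono input, 521 AudioSet classes) and averages its per-frame scores into a clip-level confidence; the averaging step is an illustrative choice, not a method prescribed by this disclosure.

```python
import numpy as np
import tensorflow_hub as hub  # assumes the published TF Hub YAMNet model is available

yamnet = hub.load("https://tfhub.dev/google/yamnet/1")

# One second of placeholder 16 kHz mono audio in the range [-1, 1].
waveform = np.random.uniform(-0.5, 0.5, 16_000).astype(np.float32)
scores, embeddings, log_mel = yamnet(waveform)

# Average the per-frame class scores and treat the result as a clip-level
# confidence ("detection probability") for each of the 521 classes.
mean_scores = scores.numpy().mean(axis=0)
top_class = int(mean_scores.argmax())
print(f"top class index: {top_class}, confidence: {mean_scores[top_class]:.2f}")
```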
  • object detection in the video associated with the audio scene may be used to perform an adjustment of the signal energy distribution. For example, if multiple talkers are simultaneously present in the scene, a video analysis can be used to identify the active talkers. The instantaneous signal energy associated with all of the current talkers may be distributed between the active talkers.
  • the previously-obtained user preference data may have been obtained during a previous capture phase 116 or during a previous segment of the current capture phase 116 — for example, when a previous clip was being obtained during the current capture phase 116.
  • block 252 may involve obtaining one or more audio capture preferences by analyzing an audio clip obtained during a previous capture phase 116— for example, by using analysis tools of the capture engine — and using the results of the analysis to populate settings of audio capture for the current capture phase 116.
  • the one or more audio capture preferences obtained in block 252 may include a preferred ratio between (a) the total audio signal level of all foreground audio components and (b) the total audio signal level of the background audio component(s).
  • Block 253 involves “capturing with feedback.”
  • block 253 involves obtaining and storing audio data and video data during the capture phase 116 and providing feedback, for example via a GUI.
  • the feedback provided in block 253 may involve signaling deviations in the audio scene being captured from typical scenarios, from audio capture preferences, or both.
  • the feedback provided in block 253 may indicate that a ratio between an acoustic background level and an acoustic foreground level for the current audio scene differs from some preferred level.
  • the feedback provided in block 253 may indicate that the level of one or more sound sources — such as blowing wind, traffic noise, etc. — is likely exceeding a desirable level. Processing may continue to block 255.
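  • A minimal sketch of how such deviation feedback could be derived is shown below: it compares the measured foreground-to-background level ratio against a preferred ratio and tolerance, both of which are assumed values used only for illustration.

```python
import numpy as np

def level_db(x):
    """RMS level of a signal block in dB (a small floor avoids log(0))."""
    return 20.0 * np.log10(np.sqrt(np.mean(np.square(x))) + 1e-12)

def ratio_feedback(foreground, background, preferred_db=15.0, tolerance_db=3.0):
    """Return a feedback message when the foreground/background ratio deviates
    from the preferred ratio by more than the tolerance."""
    ratio = level_db(foreground) - level_db(background)
    if ratio < preferred_db - tolerance_db:
        return "background too loud relative to the foreground"
    if ratio > preferred_db + tolerance_db:
        return "background much quieter than the preferred balance"
    return "foreground/background ratio within the preferred range"
```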
  • Block 257 involves “applying alterations including feedback.”
  • block 257 involves processing the received user input indicating the user’s desire to make an adjustment and providing feedback, for example via the GUI from which the input was received, indicating that one or more alterations have been applied or will be applied.
  • the alteration(s) may involve some type of augmented capture, such as a microphone beamforming process to enhance audio received from a selected sound source.
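  • A very simple form of such beamforming is delay-and-sum, sketched below under the assumption that the steering delays toward the selected source are already known from the array geometry; this is illustrative only, not the enhancement method of any particular audio stack.

```python
import numpy as np

def delay_and_sum(mic_signals, sample_rate, steering_delays_s):
    """Delay each microphone signal so the selected source adds coherently,
    then average across microphones.

    mic_signals: array of shape (num_mics, num_samples).
    steering_delays_s: per-microphone delay toward the selected source."""
    output = np.zeros(mic_signals.shape[1], dtype=np.float64)
    for signal, delay in zip(mic_signals, steering_delays_s):
        shift = int(round(delay * sample_rate))
        output += np.roll(signal, -shift)   # integer-sample approximation
    return output / mic_signals.shape[0]

mics = np.random.randn(3, 48_000)           # three microphones, one second
enhanced = delay_and_sum(mics, 48_000, [0.0, 1.2e-4, 2.4e-4])
```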
  • the augmented audio capture may involve replacement of audio corresponding to a selected potential sound source or a selected candidate sound source by external audio or synthetic audio.
  • such an adjustment may be performed in a “retro-causal” way, by applying the adjustment from the beginning of the captured clip.
  • the alteration(s) may be performed in a “causal” way, by applying the adjustment from the time the corresponding user input was received until the end of the captured clip.
  • the adjustment may be performed, at least in part, after the capture phase 116, for example during a post-capture editing process.
  • block 263 may involve replacement of a potential sound source or a candidate sound source by external audio or synthetic audio.
  • block 263 may involve the replacement of music detected in the audio data from a microphone system (which is an example of a “candidate sound source”) with downloaded music.
  • block 263 may involve adding external audio or synthetic audio corresponding to a selected potential sound source, for example adding external audio or synthetic audio corresponding to a ticking clock when the selected potential sound source is a clock detected via the video data.
  • block 263 may involve allowing a user to edit modified audio data by substituting another instance of external audio or synthetic audio, e.g., by allowing the user to select another type of background music, to select a different audio clip corresponding to a potential sound source — such as a different “ticking clock” audio clip, etc. Processing may continue to block 265.
  • Block 265 involves preparing one or more media files, storing one or more media files, transmitting one or more media files, or combinations thereof.
  • the one or more media files include the final version of the audio data resulting from the post-capture editing process of block 263.
  • block 265 may involve providing the final version of the audio data and the video data in a standard media container, such as Moving Picture Experts Group (MPEG)-4 Part 14 (MP4).
  • MPEG Moving Picture Experts Group
  • MP4 MPEG-4 Part 14
  • block 265 may involve transmitting a bitstream that includes the final version of the audio data and the video data.
  • the bitstream may be an immersive voice and audio services (IVAS) encoded bitstream.
  • IVAS immersive voice and audio services
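  • As an illustration of the media-container step (not of IVAS encoding itself), the sketch below muxes the final audio track and the captured video into an MP4 container by invoking ffmpeg, which is assumed to be installed; the file names are placeholders.

```python
import subprocess

def mux_to_mp4(video_path, audio_path, out_path):
    """Package the final audio and the captured video into an MP4 container,
    copying the video stream and encoding the audio to AAC."""
    subprocess.run(
        ["ffmpeg", "-y",
         "-i", video_path, "-i", audio_path,
         "-c:v", "copy", "-c:a", "aac",
         "-map", "0:v:0", "-map", "1:a:0",
         out_path],
        check=True,
    )

mux_to_mp4("capture_video.mp4", "final_audio.wav", "capture_final.mp4")
```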
  • the visualization of an audio scene may allow for customization of audio capture.
  • an audio customization interface may be implemented as layers overlaid on a video scene.
  • a GUI may provide multiple layers of customization, in which the basic layer involves feedback on the composition of the audio scene and more advanced customization options may be presented as one or more subsequent or optional layers.
  • the terms “customization layer” and “layer of customization” refer to one or more graphical features that may be presented on a GUI.
  • the one or more graphical features may be presented in a cumulative or additive manner.
  • Figure 3A illustrates an example of a GUI that may be presented to indicate one or more audio scene preferences and one or more current audio scene characteristics.
  • Figure 3A shows a GUI 300, an image corresponding to a video frame of a video clip 301, a person 305 in the current video clip 301, an audio source summary area 315 of the GUI 300, audio source information areas 330a and 330b within the audio source summary area 315, audio source level information areas 332a and 332b within the audio source information areas 330a and 330b, respectively, and an audio source level preference indication 333 in the audio source level information area 332b.
  • the GUI 300 is being overlaid on the video clip 301.
  • the GUI 300 and the video clip 301 may be presented on an apparatus such as the apparatus 101 of Figure 1A, for example on a display corresponding to the output unit(s) 147 shown in Figure 1A.
  • the apparatus 101 is a cell phone.
  • audio source level information area 332a provides information about a talker audio source, which corresponds to speech of the person 305 in this instance.
  • the audio source level information area 332b provides information about an audio source category, which is cafeteria background noise in this example, and which may include multiple individual audio sources.
  • the apparatus 101 may be configured to determine or estimate the audio source category, for example by implementing an audio source classifier such as disclosed herein to evaluate the background audio that is currently being received by a microphone system of the apparatus 101.
  • the audio source level information areas 332a and 332b indicate the estimated current audio levels corresponding to the talker and the cafeteria noise, respectively.
  • the audio source level preference indication 333 has been obtained from a template that corresponds to a particular type of background noise, which is cafeteria noise in this instance.
  • the audio source level preference indication 333 may correspond to one or more previously-selected audio levels for cafeteria noise, e.g., by a current user of the apparatus 101.
  • the audio source level information areas 332a and 332b each include a stippled portion 337a and a portion 337b with diagonal hash marks.
  • the portions 337b indicate the extent to which a current sound level has been modified according to user input.
  • the modifications indicated by the portions 337b may be an increase or a decrease.
  • the portion 337b in the audio source level information area 332a may indicate the extent to which the current sound level corresponding to the talker has been increased according to user input.
  • the portion 337b in the audio source level information area 332b may indicate the extent to which the current sound level corresponding to the cafeteria noise has been decreased according to user input.
  • the user input may have involved touching the audio source level information areas 332a and 332b, e.g., interacting with a virtual slider that is provided by the audio source level information areas 332a and 332b.
  • Figure 3B is a block diagram that illustrates examples of customization layers that may be presented via one or more GUIs.
  • the one or more GUIs may be presented on an apparatus such as the apparatus 101 of Figure 1A, for example on a display corresponding to the output unit(s) 147 shown in Figure 1A.
  • the apparatus may be a mobile device, such as a cell phone.
  • Figure 3B shows blocks 355, 360, 365 and 370, each of which corresponds to a customization layer that may be presented via one or more GUIs.
  • the blocks 355, 360, 365 and 370 are interconnected with two-way arrows labeled “user interaction,” indicating that a user may be able to interact with a GUI corresponding to any one of the blocks 355, 360, 365 and 370 in order to add a customization layer to a GUI that is currently being presented or to revert to a different GUI corresponding to another customization layer.
  • block 355 corresponds to a first customization layer that corresponds to various sets of possible graphical features for providing a basic level of feedback regarding the composition of an audio scene.
  • a GUI corresponding to the first customization layer may provide a graphical depiction of one or more audio sources, or groups of audio sources, in the current audio scene.
  • a GUI corresponding to the first customization layer may provide estimates of levels of what are estimated to be the most relevant audio sources present in the audio scene.
  • a GUI corresponding to the first customization layer may show images corresponding to one or more background audio sources and one or more foreground audio sources.
  • a GUI corresponding to the first customization layer may provide an estimate of the ratio of the level of the background audio source(s) to the level of the foreground audio source(s).
  • block 360 corresponds to a second customization layer that corresponds to various sets of possible graphical features to facilitate a user’s basic customization of one or more sound sources of the audio scene.
  • one or more additional graphical features corresponding to the second customization layer may include one or more areas for receiving user input for an adjustment of the ratio between audio levels of background and foreground audio components, for the adjustment of the level of one or more foreground components, etc.
  • the audio data corresponding to one or more sound sources may be modified, e.g., suppressed or augmented, according to received user input.
  • a user may be able to interact with a user input area of a GUI corresponding to block 355 in order to add one or more additional graphical features corresponding to the second customization layer of block 360.
  • block 365 corresponds to a third customization layer that corresponds to various sets of possible graphical features to facilitate a user’s advanced customization of one or more sound sources of the audio scene.
  • a user may be able to interact with a user input area of a GUI corresponding to block 355 or block 360 in order to add one or more additional graphical features corresponding to the third customization layer of block 365.
  • one or more additional graphical features corresponding to the third customization layer may include one or more areas for receiving user input for introducing audio components that are not currently present in the audio data being captured by a microphone system of the capture device, e.g., apparatus 101, or for which the acoustic signal is below a threshold level.
  • block 370 corresponds to a fourth customization layer that corresponds to various sets of possible graphical features to facilitate a user’s post-capture editing of one or more sound sources of a captured audio scene.
  • a user may be able to interact with a user input area of a GUI corresponding to block 355, block 360 or block 365 in order to add one or more additional graphical features corresponding to the fourth customization layer of block 370.
  • one or more additional graphical features corresponding to the fourth customization layer may include one or more areas for receiving user input indicating a user’s selection of a ratio between augmented audio capture and real-world audio capture.
  • the one or more additional graphical features may, for example, include a virtual slider, a virtual dial, etc.
  • Block 564 involves “identifying and classifying audio sources, and estimating audio source levels.”
  • block 564 involves identifying and classifying audio sources, and estimating audio source levels, by the control system and based at least in part on the potential audio sources identified in block 562, the audio scene classification of block 558a and the audio data received in block 555a.
  • the operations of block 564 may, in some examples, involve creating a data structure, such as a lookup table, that includes a list of detected audio sources present in the audio scene.
  • the semantic labels for audio sources may be generated by an instance of an audio tagger that is implemented by the control system.
  • an audio tagger may be implemented in many ways.
  • the audio tagger may be constructed by using a convolutional neural network (CNN).
  • CNN convolutional neural network
  • Such an audio tagger may operate on a spectral representation of an audio signal (e.g., a mel spectrogram) and may be configured to update a set of semantic labels or “tags” along with confidence intervals for these tags with a selected or pre-defined time resolution (e.g., 1 second).
  • the YAMNet architecture can be obtained by using the MobileNet-v1 architecture disclosed in Howard, Andrew G., "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861 (2017), https://arxiv.org/pdf/1704.04861, which is hereby incorporated by reference.
  • Such an architecture facilitates assigning semantic labels from a large set of semantic labels to segments of a waveform.
  • object detection in the video associated with the audio scene may be used to perform an adjustment of the signal energy distribution. For example, if multiple talkers are simultaneously present in the scene, video analysis can be used to identify the active talkers and the instantaneous signal energy associated with the talkers may be distributed between the active talkers. Processing may continue from block 564 to block 565.
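  • The redistribution step can be sketched as follows; splitting the energy equally between the video-identified active talkers is an assumption made for illustration, since the disclosure does not prescribe a particular weighting.

```python
def redistribute_talker_energy(total_energy, talkers, active_ids):
    """Assign the instantaneous signal energy attributed to all current
    talkers to those that the video analysis marks as active."""
    active = [t for t in talkers if t in active_ids] or list(talkers)
    share = total_energy / len(active)
    return {t: (share if t in active else 0.0) for t in talkers}

# Two talkers in frame, only one currently speaking according to the video.
print(redistribute_talker_energy(1.0, ["talker_1", "talker_2"], {"talker_1"}))
```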
  • the table 570 shows information regarding video objects in a cafe or cafeteria environment, such as that shown in Figures 4D and 4E.
  • the video object ID field 572 includes an identification number for each estimated video object.
  • the estimated video object type field 574 indicates an estimated video object type for every identified video object in the table 570.
  • the video scene classification of block 558b may correspond with a set of possible video object types.
  • the video scene classification of block 558b may have indicated that the video scene corresponds to a cafe or cafeteria environment and the set of possible video object types may include objects that are commonly found in a cafe or cafeteria environment.
  • field 576 indicates the control system’s estimation of whether, at the particular time corresponding to the table 570, each video object type is a potential audio source.
  • potted plants, empty chairs and empty tables are not considered to be potential audio sources, whereas people and clocks are considered to be potential audio sources.
  • field 578 indicates the control system’s estimation of whether, at the particular time corresponding to the table 570, each actual or potential audio source is likely to be a foreground audio source or a background audio source.
  • the people shown in the background of the video clip 401a, near the counter may be regarded as actual or potential background audio sources, whereas the person 405a near the center of the video clip 401a may be regarded as an actual or potential foreground audio source.
  • Figure 5D shows a table that represents example elements of an audio source inventory data structure according to some disclosed implementations.
  • the table 580 may be generated in block 564 or block 565 of Figure 5B.
  • the table 580 includes fields 582, 584, 586, 588, 590 and 592.
  • the number of fields, types of fields and order of fields shown in table 580 are merely examples.
  • a sound source inventory data structure may include an estimation probability for each estimated audio source type, a time stamp, etc.
  • Other examples of the table 580 may not include a field indicating how an actual or potential audio source was detected.
  • the field headings (“audio source ID,” etc.) are merely presented for the viewer’s convenience: an actual audio source inventory data structure would not necessarily include such field headings.
  • field 586 indicates how each actual or potential audio source was detected.
  • the “cafeteria noise” of audio source 1 and the “background conversations” of audio source 5 were primarily detected according to audio data received in block 555a, but were expected due to the video scene and video object classifications of blocks 558b and 560.
  • audio source 3 was only detected in the audio data, whereas audio source 6 was only detected in the video data.
  • field 588 indicates the control system’s estimation of whether, at the time corresponding to the table 580, each actual or potential audio source is likely to be a foreground audio source or a background audio source.
  • field 590 indicates the control system’s estimation of whether, at the particular time corresponding to the table 580, each actual audio source is likely to be audible to a human listener.
  • field 592 indicates the control system’s estimation of the audio signal level of each actual or potential audio source at the time corresponding to the table 580. According to these examples, the audio signal levels are indicated on a scale from zero to ten, with estimated audio values rounded up or down to whole integers.
  • an estimated audio signal level of 0.45 would be rounded down to 0 and an estimated audio signal level of 0.55 would be rounded up to 1.
  • a threshold may be applied to the estimated audio signal levels to estimate whether each actual audio source is likely to be audible to a human listener.
  • the threshold may be 0.3, 0.5, 0.8, 1.0, etc.
  • the threshold may vary depending on the level of one or more other audio sources, such as one or more background noise levels. Other examples may involve different ranges of audio signal levels, such as zero to one, zero to one hundred, etc.
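  • The audio source inventory described above might be represented along the lines of the following sketch; the class name, field names and the audibility threshold of 1.0 on the 0-10 scale are illustrative assumptions rather than a prescribed format.

```python
# Sketch of an audio source inventory entry reflecting the fields described
# for table 580. Field names and the audibility threshold are illustrative.
from dataclasses import dataclass

@dataclass
class AudioSourceEntry:
    source_id: int        # field 582: audio source ID
    source_type: str      # field 584: estimated audio source type
    detected_via: str     # field 586: "audio", "video", or "audio+video"
    foreground: bool      # field 588: foreground vs. background estimate
    level: int            # field 592: level rounded to a whole integer, 0-10 scale

    def audible(self, threshold: float = 1.0) -> bool:
        # Field 590: audibility estimated by comparing the level to a threshold.
        return self.level >= threshold

def make_entry(source_id, source_type, detected_via, foreground, raw_level):
    # Round the estimated level to a whole integer, as described above
    # (e.g., 0.45 rounds down to 0 and 0.55 rounds up to 1).
    return AudioSourceEntry(source_id, source_type, detected_via,
                            foreground, int(round(raw_level)))

inventory = [
    make_entry(1, "cafeteria noise", "audio+video", False, 3.4),
    make_entry(3, "talking offscreen", "audio", True, 6.7),
]
```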
  • Figure 6 shows examples of modules that may be implemented according to some disclosed examples.
  • the control system portion 606 is configured for implementing a first audio source separation module 605, a video object classifier 608, which includes a video object detection module 610 and a video object confidence estimation module 620, an audio source level estimation module 615, an analysis module 625, a GUI module 630, and a file generation module 635.
  • the inputs and outputs of the modules that are shown in Figure 6 may be provided during a capture phase, such as the capture phase 116 that is described with reference to Figures 1C and 2A.
  • the control system portion 606 may be a portion of the control system of a capture device or of a capture system that includes more than one device.
  • the first audio source separation module 605 is configured to determine and output separated audio source data 607a-607n for each of n separate audio sources based, at least in part, on received audio data 601 from a microphone system and received side information 604.
  • n may be an integer of two or more.
  • the term “audio source” has the same meaning as the term “sound source” in this disclosure.
  • a first sound source separator may be implemented by imposing — by a control system configured to implement the first audio source separation module 605 — a set of typical foreground objects (e.g. human talkers, musical instruments, etc.), and designing for that set specialized foreground vs background audio source separators.
  • An instance of such a separator for a specific signal category may include a neural network model operating on a time-frequency representation of an audio signal, trained to generate separated audio from unseparated audio in that time-frequency representation.
  • such a network may be implemented using a convolutional neural network that was developed for image segmentation, such as a U-Net autoencoding framework.
  • An instance of such a sound source separator for a specific signal category may be configured for estimating a pair of separated signals, where one signal contains all foreground objects belonging to a category and the other contains all the background.
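  • A separator of this kind may follow the mask-and-reconstruct pattern sketched below; here `mask_model` is a placeholder for a trained U-Net-style network, and the STFT parameters are illustrative assumptions.

```python
# Sketch of mask-based foreground/background separation in a time-frequency
# representation. `mask_model` stands in for a trained U-Net-style network;
# the STFT parameters are illustrative assumptions.
import numpy as np
import librosa

def separate_foreground_background(audio, mask_model, n_fft=1024, hop=256):
    # Time-frequency representation of the mixture.
    mixture = librosa.stft(audio, n_fft=n_fft, hop_length=hop)
    magnitude, phase = np.abs(mixture), np.angle(mixture)

    # The trained network predicts a foreground mask in [0, 1] for each bin.
    fg_mask = mask_model(magnitude)      # hypothetical model call
    bg_mask = 1.0 - fg_mask

    # Reconstruct a pair of signals: all foreground vs. all background.
    fg = librosa.istft(fg_mask * magnitude * np.exp(1j * phase), hop_length=hop)
    bg = librosa.istft(bg_mask * magnitude * np.exp(1j * phase), hop_length=hop)
    return fg, bg
```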
  • the estimation of audio source levels may, for example, be performed by measuring signal levels of the separated audio components of the mixed signal in the input to the first audio source separation module 605.
  • the source separation module 605 may be configured to select a particular instance of the sound source separator by using the side information 604 to define one or more sound source separation targets.
  • the side information 604 may be provided by the user or may be provided automatically by the control system based on analyzing the content of input video and audio data.
  • a post-capture editing process may involve performing a second sound source separation process, for example by one or more servers. This is potentially advantageous because the second sound source separation process may be more accurate but may be too computationally intensive for the capture device to perform during the pre-capture phase 114 or the capture phase 116.
  • the side information 604 may be, or may include, audio source labels assigned by an audio source classifier, video object source labels assigned by a video object classifier, or both.
  • the audio source classifier and video object classifier may, in some instances, be one or more of those described with reference to Figure 5B.
  • the audio source classifier may, in some instances, be configured to perform block 558a of Figure 5B, block 562 of Figure 5B, block 564 of Figure 5B, or combinations thereof.
  • the video object classifier may, in some instances, be configured to perform block 558b of Figure 5B, block 560 of Figure 5B, or both.
  • the separated audio source data 607a-607n may include separated audio components of the received audio data 601, as well as “tags” or other identification data, annotations, etc.
  • the first audio source separation module 605 may be configured to identify foreground audio scene components and background audio scene components.
  • the separated audio source data 607a-607n may include “tags” or other identification data, annotations, etc., indicating which separated audio components of the received audio data 601 are estimated to be foreground audio scene components and which are estimated to be background audio scene components.
  • the first audio source separation module 605 may be configured to implement what is referred to herein as an “audio tagger.”
  • an audio tagger may be, or may include, a convolutional neural network (CNN) implemented by a control system.
  • Some such audio taggers may operate on a spectral representation of an audio signal (e.g., a mel spectrogram) and may update a set of tags along with confidence intervals for these tags with some pre-defined time resolution (e.g., 1 second).
  • the CNN may be trained to provide an output including indications of tags (within a large set of all possible tags) corresponding to audio objects present in that segment along with another output representing a confidence level for the tags.
  • the output representing the confidence level may, for example, be a probability of object detection, such as a value between zero and one, where a value closer to 1.0 would indicate a large confidence and a value close to zero would indicate a low confidence.
  • a confidence level may be determined according to the estimated object detection probability, which may be computed by the same CNN.
  • the audio tagger may be implemented as a trained neural network constructed according to the so-called YAMNet architecture, as described in more detail elsewhere herein.
  • the audio source level estimation module 615 is configured to estimate an audio signal level for each audio component of the separated audio source data 607a-607n, and to output corresponding audio signal levels 617a-617n to the analysis module 625.
  • the separated audio source data 607a-607n also may be provided to the analysis module 625.
  • the audio signal levels 617a-617n may be, or may include, instantaneous audio signal levels.
  • the audio signal levels 617a-617n may be, or may include, audio signal levels that are averaged over a time interval, which may be a fixed time interval or a variable time interval.
  • the time interval may be on the order of tens of milliseconds, hundreds of milliseconds, seconds, tens of seconds, etc.
  • the audio source level estimation module 615 may be configured to find the distribution of signal level (or signal energy) between all foreground components of the audio scene and the overall signal level of the background audio scene components.
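  • The level estimation described above might be sketched as follows; the window length, dB convention and function names are illustrative assumptions rather than a prescribed implementation.

```python
# Sketch of per-source level estimation from separated components. The window
# length and the dB floor are illustrative assumptions.
import numpy as np

def source_levels_db(separated_sources, sample_rate, window_s=0.5, floor_db=-90.0):
    """RMS level (in dB) of the most recent window of each separated source."""
    n = int(window_s * sample_rate)
    levels = []
    for source in separated_sources:
        tail = source[-n:] if len(source) >= n else source
        rms = np.sqrt(np.mean(np.square(tail))) if len(tail) else 0.0
        levels.append(20.0 * np.log10(rms) if rms > 0 else floor_db)
    return levels

def foreground_energy_distribution(foreground_sources):
    """Fraction of total foreground energy contributed by each foreground source."""
    energies = np.array([float(np.sum(np.square(s))) for s in foreground_sources])
    total = energies.sum()
    return energies / total if total > 0 else energies
```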
  • the first audio source separation module 605 may be configured for both audio source separation and for audio source level estimation.
  • the video object classifier 608 is configured to detect objects in currently-obtained video data.
  • the video object classifier 608 may, for example, be implemented by a CNN.
  • the video object classifier 608 may implement one or more of the methods described in J. Redmon, et al., “You Only Look Once: Unified, Real-Time Object Detection” (arXiv:1506.02640v5 [cs.CV] May 9, 2016), which is hereby incorporated by reference, or another applicable video-based object detection method.
  • the video object detection module 610 is configured to determine video objects and output corresponding estimated video object data 612a-612v for each of v separate video objects based, at least in part, on received video data 602 from a camera system and received side information 604.
  • v may be an integer of two or more.
  • the side information 604 may be, or may include, video object source labels assigned by a video object classifier.
  • the video object classifier may, in some instances, be configured to perform block 558b of Figure 5B, block 560 of Figure 5B, or both.
  • the video object confidence estimation module 620 is configured to estimate a confidence level for each of the video objects corresponding to the video object data 612a-612v output by the video object detection module 610, and to output corresponding video object confidence levels 622a-622v to the analysis module 625.
  • the video object confidence levels 622a-622v may vary from zero to one, with zero being the lowest confidence level and one being the highest confidence level. Other examples may use other numerical ranges.
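  • Such per-object confidence levels in the zero-to-one range could be obtained along the lines of the following sketch, which uses a pre-trained torchvision detector; the detector choice, the torchvision version assumption and the 0.5 keep-threshold are illustrative only.

```python
# Sketch of video object detection with per-object confidence in [0, 1],
# using a pre-trained torchvision detector (torchvision >= 0.13 assumed).
# The 0.5 keep-threshold is an illustrative choice.
import torch
import torchvision

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

def detect_objects(frame, score_threshold=0.5):
    # `frame` is a float tensor of shape [3, H, W] with values in [0, 1].
    with torch.no_grad():
        output = detector([frame])[0]
    detections = []
    for box, label, score in zip(output["boxes"], output["labels"], output["scores"]):
        if float(score) >= score_threshold:
            detections.append({"box": box.tolist(),
                               "label": int(label),
                               "confidence": float(score)})
    return detections
```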
  • the video object data 612a-612v also may be provided to the analysis module 625.
  • the analysis module 625 is configured to output, to the GUI module 630, audio scene analysis information 627 based at least in part on received video object confidence levels 622a-622v and audio signal levels 617a-617n.
  • the audio scene analysis information 627 may be based at least in part on side information 604.
  • the audio scene analysis information 627 may be based at least in part on video data 602 received from the camera system, video object data 612a-612v, separated audio source data 607a-607n, or combinations thereof.
  • the analysis module 625 may, for example, be implemented by one or more CNNs.
  • the audio scene analysis information 627 may, for example, include audio source inventory data and audio scene analysis metadata.
  • the audio source inventory data may, in some examples, include a data structure similar to the audio source inventory data structure that is described with reference to Figure 5D.
  • the audio scene analysis metadata may, for example, include audio source labels, audio source level data, audio source location data, combinations thereof, etc.
  • the audio source level data may include data indicating the audio source signal level for each audio source, or group of audio sources, over a time interval (see the description of the average audio source level indications 433 of Figures 4D and 4E).
  • the audio source location data may, for example, include location data relative to the current video scene, such that audio object locations may be associated with one or more corresponding people, objects, etc., in the video scene.
  • the audio scene analysis information 627 may, in some examples, include audio source category information that groups the audio sources in an inventory of audio sources into two or more audio source categories, which may include a foreground category and a background category. Further details of one example of the analysis module 625 are provided below with reference to Figure 7.
  • the GUI module 630 is configured for audio scene visualization and user input collection via one or more types of GUIs. Accordingly, in this example, the GUI module 630 is configured for controlling one or more displays of a display system to present one or more types of GUIs. In some examples, the one or more types of GUIs may be overlaid on images corresponding to video data 602 that is received during a capture phase or a pre-capture phase, for example as described with reference to Figures 3A-5A. At least one of the GUIs may include an audio source image corresponding to at least one audio characteristic of one or more audio sources indicated by the audio scene analysis information 627.
  • At least one GUI provided by the GUI module 630 includes one or more user input areas for receiving user input.
  • the GUI module 630 is configured for receiving user input 628, for example via one or more touch screen locations corresponding to the one or more user input areas.
  • the user input 628 may indicate a user’s desire to increase or decrease a level of one or more audio sources, such as a level of a talker whose image is being presented concurrently with a GUI.
  • the user input 628 may indicate a selected potential sound source or a selected candidate sound source for augmented audio capture.
  • the augmented audio capture may, for example, involve replacement of a selected candidate sound source by external audio or synthetic audio, the addition of external audio or synthetic audio for a selected potential sound source, or combinations thereof.
  • the augmented audio capture may involve a beamforming process to enhance the audio of a selected candidate sound source.
  • the GUI module 630, or another module that is implemented by the control system portion 606, is configured for generating and storing metadata corresponding with one or more types of received user input. At least one of the one or more types of GUIs may, in some examples, be similar to one or more of those shown in Figures 3A and 4A-4E.
  • the file generation module 635 is configured to generate an audio asset stream 637 that includes the audio data 601 and video data 602 that is received during a capture phase.
  • the file generation module 635 — or another component of the control system portion 606 — may be configured to store an audio asset corresponding to the audio asset stream 637.
  • the audio asset stream 637 may include user input metadata, audio scene analysis metadata, or both.
  • the audio asset stream 637 may include modified audio data, modified video data, or both, that are modified during a capture phase.
  • the context of an audio scene may be associated with one or more assumptions about the intent of a capture device user regarding a particular scene, for example regarding what audio source or sources the user regards as being the most important.
  • Such an intent may be indicated by the user’s interaction with a GUI, such as a GUI provided by the GUI module 630 of Figure 6.
  • the intent of a capture device user regarding a particular scene may be indicated by received user input indicating the user’s intention to focus on human talkers detected in the scene.
  • user input may indicate that a user is relatively more interested in another type of sound source or potential sound source, such as a musician, an animal, a vehicle, an aircraft, a fountain or waterfall, etc.
  • user input indicating a user’s interest may be received via a GUI, such as input regarding a change in an audio source (for example, input regarding a desired signal level increase).
  • user input indicating a user’s interest may be, or may correspond with, a user’s control of a camera system, such as zooming in on a particular video object (such as a person), centering a particular video object in the current video frame, etc.
  • a control system (such as the control system of a capture device or a capture system) may be configured to select what the control system deems to be the most likely hypothesis regarding the context of the audio scene. Alternatively, or additionally, in some examples even if the control system has received one or more previous indications of the user’s intent, the control system may evaluate the most likely hypothesis regarding the current context of the audio scene and whether the user’s focus of attention may have changed. The control system may provide such a context hypothesis evaluation when, for example, a determined time interval has elapsed since the last clear indication of the user’s intent, when the indication of the user’s intent is ambiguous, etc.
  • Figure 7 shows example components of an analysis module that is configured for context hypothesis evaluation according to some disclosed examples.
  • the analysis module 625 is an instance of the analysis module 625 of Figure 6, is implemented by the control system portion 606 and is configured to generate audio scene analysis information 627.
  • the analysis module 625 includes an audio scene analysis module 705, a context hypothesis evaluation module 710 and an audio scene inventory filtering module 715.
  • the audio scene analysis module 705 is configured to generate scene analysis data 707 based at least in part on side information 604, audio signal levels 617a-617n for corresponding audio sources a-n and video object confidence levels 622a-622v for corresponding video objects a-v.
  • the audio scene analysis module 705 may be implemented via a CNN. According to some examples, the audio scene analysis module 705 may be configured to generate the audio scene analysis information 627 based at least in part on video data 602 received from the camera system.
  • the side information 604 may, for example, include audio source labels assigned by an audio source classifier, video object source labels assigned by a video object classifier, or both.
  • the scene analysis data 707 may, for example, include a set of audio scene components, which may be referred to as a “first set of audio scene components.”
  • the scene analysis data 707 may, in some examples, include audio source labels, audio source levels, audio source coordinates (e.g., relative to images in the input video data).
  • the audio scene analysis module 705 may estimate that the other audio scene foreground talker is an interviewer and/or is a person who is currently operating a capture device. However, in some alternative examples, the audio scene analysis module 705 may simply categorize the unseen foreground talker as a talking person not in the video scene, as shown in Figure 5D. This type of situation corresponds with the “talking offscreen” audio source label example in the GUI 400 of Figures 4D and 4E.
  • the audio scene analysis module 705 may be configured to determine whether one or more aspects of the current audio are inconsistent with a video-based analysis.
  • the scene analysis data 707 may indicate whether one or more aspects of the current audio are inconsistent with a video-based analysis.
  • the audio scene analysis module 705 may be configured to make comparisons between audio source levels and video object confidence levels.
  • the context hypothesis evaluation module 710 is configured to estimate a current scene context hypothesis 712 based, at least in part, on the scene analysis data 707 and side information 604. According to some examples, the context hypothesis evaluation module 710 may be configured to access a memory in which a list of context hypotheses is stored.
  • the list of context hypotheses may, for example, include one or more contexts involving the capture of audio and video corresponding to human talkers, one or more contexts involving the capture of audio and video corresponding to musical instruments, one or more contexts involving the capture of audio and video corresponding to a nature scene, etc.
  • each hypothesis may be associated with the presence of a set of audio sources (also referred to herein as audio objects or sound sources) that are typically associated with the corresponding context.
  • the set of audio sources may include foreground talkers, background talkers, automobiles, buses, trucks, dogs, sirens, etc.
  • the context hypothesis evaluation module 710 may be configured to score each hypothesis based on the presence of corresponding audio or video objects and the confidence of their detection.
  • the sets X_A and X_S are predefined for possible scene categories (by listing typical objects expected for such a scene). However, the number of objects included in X_A and X_S may differ for different scenes. Let the number of objects in the respective sets be denoted N_A and N_S, and let x_i ∈ X denote that an object x_i is included in the set X. Then, a utility function may be computed as

        U = ( Σ_{x_i ∈ X_A} p(x_i) + Σ_{x_i ∈ X_S} p(x_i) ) / ( N_A + N_S ),

    where p(x_i) represents the detection probability of object x_i. If object x_i is not detected in a scene, p(x_i) = 0; otherwise the detection probability can be estimated in the audio scene analysis module 705.
  • For example, the set X_S may comprise objects such as [dining table, person, etc.], and for a scene associated with a street capture, the set X_S may comprise objects such as [car, bus, street lamp, etc.].
  • Such sets of typical objects may be predefined for the most likely categories of scenes where audio capture will be performed (e.g., “outdoor nature”, “street”, “indoor restaurant”, “sports”, “default”, etc.).
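  • A sketch of scoring such predefined scene-context hypotheses from detection probabilities, in the spirit of the utility function above, is shown below; the hypothesis names, object lists and probabilities are examples only.

```python
# Sketch of scoring predefined scene-context hypotheses from detection
# probabilities. Hypotheses, object lists and probabilities are examples only.
def hypothesis_score(expected_objects, detection_probs):
    # Objects expected for the hypothesis but not detected contribute 0.
    total = sum(detection_probs.get(obj, 0.0) for obj in expected_objects)
    return total / len(expected_objects) if expected_objects else 0.0

hypotheses = {
    "indoor restaurant": ["dining table", "person", "background conversations"],
    "street": ["car", "bus", "street lamp"],
}

detected = {"person": 0.95, "dining table": 0.8, "background conversations": 0.6}

best = max(hypotheses, key=lambda name: hypothesis_score(hypotheses[name], detected))
print(best)  # -> "indoor restaurant" for the example probabilities above
```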
  • the audio scene inventory filtering module 715 is configured to generate audio scene analysis information 627 based at least in part on side information 604, scene analysis data 707 and a current scene context hypothesis 712.
  • the audio scene inventory filtering module 715 may be implemented via a CNN.
  • the audio scene analysis information 627 may be, or may include, what will be referred to herein as a “second set of audio scene components.”
  • the audio scene analysis information 627 may, in some examples, include audio source labels, audio source levels, audio source coordinates (e.g., relative to images in the input video data) regarding the second set of audio scene components.
  • the second set of audio scene components may, in some instances, be a subset of the first set of audio scene components of the scene analysis data 707.
  • the audio scene inventory filtering module 715 may be configured to prioritize and/or rank the first set of audio scene components to determine the second set of audio scene components according to current or recent audio scene activity, current or recent video scene activity, current or recent user indications.
  • the user indications may include user input, a current framing of the video scene, or both.
  • the user indications may, for example, correspond with a person or object at the center of the video scene, a person or object on which the camera system is currently focusing, etc.
  • Figure 8 is a flow diagram that outlines various example methods 800 according to some disclosed implementations.
  • the example methods 800 may be partitioned into blocks, such as blocks 805, 810, 815 and 820.
  • the various blocks may be described as operations, processes, methods, steps, acts or functions.
  • the blocks of methods 800, like those of other methods described herein, are not necessarily performed in the order indicated. In some implementations, one or more of the blocks of methods 800 may be performed concurrently. Moreover, some implementations of methods 800 may include more or fewer blocks than shown and/or described.
  • the blocks of methods 800 may be performed by one or more devices, for example, the device that is shown in Figure 1A.
  • Block 805 involves “retrieving, by a control system and from a memory system, an audio data asset file saved from a prior capture phase.”
  • the control system may be, or may include, the CPU 141 of Figure 1A.
  • the control system may, for example, include a general purpose single- or multi-chip processor, one or more digital signal processors (DSPs), one or more application specific integrated circuits (ASICs), one or more field programmable gate arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic, one or more discrete hardware components, or combinations thereof.
  • the control system may be a control system of a server.
  • the audio data asset file may correspond to an audio asset stream 637 that was produced by the file generation module 635 of Figure 6 and was saved in a memory system. Accordingly, the audio data asset file may include the audio data and video data that is received during a capture phase, user input metadata, audio scene analysis metadata, etc. Processing may continue to block 810.
  • Block 810 involves “implementing, by the control system, a second audio source separation process on audio data from the audio data asset file.”
  • the second audio source separation process may be relatively more precise than an audio source separation process that was used during the capture phase.
  • the second audio source separation process may be relatively more computationally intensive than an audio source separation process that was used during the capture phase.
  • the second audio source separation process may be implemented by a neural network trained by a regression-based process, which may be similar to the regression-based processes of some disclosed instances of the first audio source separation process that are described with reference to Figure 6.
  • the second audio source separation process may be implemented by a generative neural network.
  • the second audio source separation process may be implemented by a control system (e.g., of a server) according to an algorithm that includes a set of specialized generative separators.
  • Each generative separator of the set of specialized generative separators may, in some examples, be optimized for a specific audio signal category.
  • the metadata about audio objects created by the capture system (such as scene analysis metadata, user input metadata, or both) may, in some examples, be used in the second audio source separation process to select an instance of a specialized audio separation algorithm from the set of specialized generative separators. Processing may continue to block 815.
  • Block 815 involves “creating, by the control system, a remix of the audio scene.”
  • the remix will include the audio sources and related audio data produced by the second audio source separation process of block 810.
  • the remixing process of block 815 — or a process prior to the remixing process of block 815 — may be based, at least in part, on user input metadata.
  • the user input metadata may include metadata corresponding to a desired level increase for a particular audio source. This level increase may be implemented in block 815 (or another block of method 800).
  • block 815 (or another block of method 800) may involve replacement of a candidate sound source (e.g., as indicated by user input metadata) by external audio or synthetic audio.
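  • The remix step might be sketched as follows, applying per-source gains derived from user input metadata to the separated stems before summing; the stem alignment assumption, function names and dB convention are illustrative only.

```python
# Sketch of remixing separated stems with per-source gains derived from user
# input metadata. Assumes time-aligned, equal-length mono stems; the dB gain
# convention is an illustrative choice.
import numpy as np

def remix(stems, user_gain_db):
    """stems: source_id -> np.ndarray; user_gain_db: source_id -> dB offset."""
    mix = None
    for source_id, stem in stems.items():
        gain = 10.0 ** (user_gain_db.get(source_id, 0.0) / 20.0)
        mix = stem * gain if mix is None else mix + stem * gain
    return mix

# Example: boost a foreground talker by 6 dB and attenuate the background by 3 dB.
# remixed = remix({"talker": talker_stem, "background": bg_stem},
#                 {"talker": 6.0, "background": -3.0})
```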
  • Processing may continue to block 820.
  • Block 820 involves “storing, by the control system, the remix of the audio scene as an updated audio asset file.”
  • block 820 involves storing the actual output of remix block 815 or a modified version of the output of block 815.
  • the updated audio asset file may be a backwards-compatible audio asset file that may include the output of remix block 815, context metadata and a copy of the audio data from the audio data asset file received in block 805.
  • the updated audio asset file may be a standard media container, such as an MP4 file, an IVAS file, or a Dolby AC-4 file.
  • Figure 9 is a flow diagram that outlines various example methods 900 according to some disclosed implementations.
  • the example methods 900 may be partitioned into blocks, such as blocks 905, 910, 915, 920, 925, 930 and 935.
  • the various blocks may be described as operations, processes, methods, steps, acts or functions.
  • the blocks of methods 900, like those of other methods described herein, are not necessarily performed in the order indicated. In some implementations, one or more of the blocks of methods 900 may be performed concurrently. Moreover, some implementations of methods 900 may include more or fewer blocks than shown and/or described.
  • the blocks of methods 900 may be performed by one or more devices, for example, the device that is shown in Figure 1A.
  • Block 925 involves “estimating, by the control system and based on the audio data, at least one audio characteristic of at least the one or more selected audio sources.” In some examples, block 925 may involve estimating a current level of each of the one or more selected audio sources, an average level of each of the one or more selected audio sources, one or more other audio characteristics of at least the one or more selected audio sources, or combinations thereof. Processing may continue to block 930.
  • Block 930 involves “storing, by the control system, audio data and video data received during a capture phase.” Processing may continue to block 935.
  • the audio source level information areas 432a, 432b, 432c and 432d — and the average audio source level indications 433 — that are within the audio source information areas 430a, 430b, 430c and 430d, respectively, are examples of audio source images corresponding to the at least one audio characteristic.
  • method 900 may involve classifying, by the control system, the audio sources in the inventory of audio sources into two or more audio source categories.
  • the GUI may include a user input area portion corresponding to at least one of the two or more audio source categories.
  • one of the audio source categories may be a foreground category corresponding to the one or more selected audio sources.
  • one of the audio source categories may be a background category corresponding to one or more audio sources of the inventory of audio sources that were not in the subset of one or more selected audio sources.
  • method 900 may involve performing, by the control system, a first sound source separation process.
  • method 900 may involve updating, by the control system, an audio scene state based at least in part on the first sound source separation process, and causing, by the control system, the GUI to be updated according to an updated audio scene state.
  • the creating process of block 915 may be based, at least in part, on the first sound source separation process.
  • Some disclosed methods may involve performing post-capture audio processing on the audio data received during the capture phase.
  • the post-capture audio processing may involve a second sound source separation process that is more complex than the first sound source separation process.
  • method 900 may involve detecting, by the control system and based at least in part on the video data, one or more potential sound sources.
  • at least one of the one or more potential sound sources may not be indicated by the audio data.
  • at least one of the one or more potential sound sources may be indicated by the video data.
  • the inventory of audio sources may include the one or more potential sound sources.
  • method 900 may involve detecting, by the control system, one or more candidate sound sources for augmented audio capture.
  • the augmented audio capture may involve replacement of a candidate sound source by external audio or synthetic audio.
  • the GUI may include at least one user input area configured to receive a user selection of a selected potential sound source or a selected candidate sound source.
  • the GUI may include at least one user input area configured to receive a user selection of augmented audio capture.
  • the augmented audio capture may, for example, include external audio, synthetic audio, or both, for the selected potential sound source or the selected candidate sound source.
  • Some disclosed methods involve providing a GUI that includes at least one user input area configured to receive a user selection of a ratio between augmented audio capture and real-world audio capture.
  • the GUI may be provided during a post-capture editing process.
  • method 900 may involve causing, by the control system, the display to present audio source labels in the GUI.
  • the audio source labels in the audio source information areas 430a, 430b, 430c and 430d of Figures 4D and 4E provide examples.
  • at least one of the audio source labels may correspond to an audio source identified by the control system based on the audio data, the video data, or both.
  • method 900 may involve updating, by the control system, an estimate of a current audio scene and causing, by the control system, the GUI to be updated according to updated estimates of the current audio scene. Updating the estimate of the current audio scene may involve implementing, by the control system, an audio classifier, a video classifier, or both. In some examples, updating the estimate of the current audio scene may involve implementing, by the control system, an audiovisual classifier. According to some examples, the updated estimates of the current audio scene may include an updated level estimate for one or more audio sources.
  • Figure 10A represents elements of an audio source inventory according to some disclosed implementations.
  • the audio source inventory 1000 may be generated in block 564 or block 565 of Figure 5B, or in block 915 of Figure 9.
  • the audio source inventory 1000 may be provided via, or stored (at least temporarily) as, a data structure similar to the audio source inventory data structure that is described with reference to Figure 5D.
  • the audio source inventory 1000 may, in some examples, be provided as part of the audio scene analysis information 627 that is described with reference to Figure 6, or as part of the scene analysis data 707 that is described with reference to Figure 7.
  • the audio source inventory 1000 includes audio sources 1001 that are currently present in the received audio data and potential audio sources 1004 that are not currently present in the received audio data — or which are present but for which the corresponding audio data is below a threshold level — but which have been detected in the video feed.
  • the audio sources 1001 include a subset of selected audio sources 1002, which have been selected for possible augmentation or replacement.
  • the selected audio sources 1002 may have been selected based, at least in part, on a user’s actions, such as user input via a GUI, a user’s having zoomed in on a particular video object, a user’s framing (e.g., centering) of a particular video object, etc.
  • the selected audio sources 1002 may have been selected by a control system, such as a control system of a capture device.
  • the control system may have selected one or more audio sources for which the corresponding audio data is of low quality (such as background music that is partially masked by background noise), which is estimated to be near or below a level of human audibility, etc.
  • the control system may have selected one or more potential or actual audio sources based, at least in part, on an estimated audio or video scene context. For example, if the context is “nature scene” or the like, an animal in the video feed that is not currently producing sound, or is producing sound that is below a threshold, may be selected for possible augmentation or replacement.
  • the potential audio sources 1004 may include a subset of potential audio sources selected for association with synthetic audio data or with external audio data.
  • the subset of potential audio sources may, for example, be selected according to an estimated audio scene context, user preferences, or combinations thereof.
  • Figure 10B is a flow diagram that outlines various example methods 1005 according to some disclosed implementations.
  • the example methods 1005 may be partitioned into blocks, such as blocks 1010, 1015, 1020, 1025, 1030, 1035 and 1040.
  • the various blocks may be described as operations, processes, methods, steps, acts or functions.
  • the blocks of Figure 10B are performed during a capture phase according to some examples.
  • the blocks of methods 1005, like other methods described herein, are not necessarily performed in the order indicated.
  • one or more of the blocks of methods 1005 may be performed concurrently.
  • some implementations of methods 1005 may include more or fewer blocks than shown and/or described.
  • the blocks of methods 1005 may be performed by one or more devices, for example, the device that is shown in Figure 1A.
  • Block 1010 involves “receiving, by a control system of a device, audio data from a microphone system and video data from a camera system.”
  • the control system may be, or may include, the CPU 141 of Figure 1A.
  • the control system may, for example, include a general purpose single- or multi-chip processor, one or more digital signal processors (DSPs), one or more application specific integrated circuits (ASICs), one or more field programmable gate arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic, one or more discrete hardware components, or combinations thereof.
  • the microphone system may be a microphone system of a mobile device that includes the control system.
  • the camera system may include one or more cameras of a mobile device that includes the control system.
  • the microphone system, the camera system, or both may reside in one or more devices other than the device that includes the control system. Processing may continue to block 1015.
  • Block 1015 involves “creating, by the control system and based at least in part on the audio data, the video data, or both, an inventory of audio sources.”
  • the inventory of audio sources may, in some examples, correspond with the audio source inventory 1000 of Figure 10A. Accordingly, the inventory of audio sources may, in some examples, be provided via, or stored (at least temporarily) as, a data structure similar to the audio source inventory data structure that is described with reference to Figure 5D, may, in some examples, be provided as part of the audio scene analysis information 627 that is described with reference to Figure 6, or may, in some examples, be provided as part of the scene analysis data 707 that is described with reference to Figure 7. Processing may continue to block 1020.
  • Block 1020 involves “controlling, by the control system, a display of the device to provide a graphical user interface (GUI) including a representation of at least some audio sources of the inventory of audio sources.”
  • block 1020 may involve controlling a display of the apparatus 101 to present a GUI like that shown in Figure 3A or one of the GUIs shown in Figures 4C-4E.
  • the audio source information areas 430a, 430b, 430c and 430d, respectively, are examples of “a representation of at least some audio sources of the inventory of audio sources.”
  • block 1020 — or another block of the methods 1005 — may involve selecting a subset of the audio sources in the inventory of audio sources. According to some such examples, the audio scene inventory filtering module 715 of Figure 7 may select the subset of audio sources. Processing may continue to block 1025.
  • Block 1025 involves “receiving, by the control system, and via the GUI, user input regarding augmentation or replacement of audio data corresponding to one or more selected audio sources.”
  • the GUI may include one or more user input areas configured to receive user input.
  • block 1025 may involve receiving user input via a touch on the person 405a, on one of the other video objects or on one of the audio source information areas 420a, 420b or 420c shown in Figures 4A and 4B.
  • block 1025 may involve receiving user input via a touch on the person 405b, on one of the other video objects or on one of the audio source information areas 420d or 420e shown in Figure 4C.
  • block 1025 may involve receiving user input via a touch on the person 405a, on one of the other video objects or on one of the audio source information areas 430a, 430b, 430c or 430d shown in Figures 4D and 4E. Processing may continue to block 1030.
  • Block 1030 involves “creating, by the control system, metadata corresponding to the user input.” Processing may continue to block 1035.
  • Block 1035 involves “creating, by the control system, a media asset including the metadata, audio data and video data received during a capture phase.” In some examples, block 1035 may involve the operations of the file generation module 635 of Figure 6. Processing may continue to block 1040.
  • Block 1040 involves “storing, by the control system, the media asset in a memory.” According to some examples, block 1040 may involve storing the media asset in a memory of the capture device, storing the media asset in a memory of another device (such as a memory device of a cloud-based service), or both.
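  • One way such a media asset could be assembled and stored is sketched below as captured audio and video files plus a metadata sidecar; the directory layout, file names and metadata fields are illustrative assumptions, not a prescribed container format.

```python
# Sketch of storing a captured media asset as audio/video files plus a JSON
# metadata sidecar. The layout and field names are illustrative; a production
# system might instead embed the metadata in a media container.
import json
import pathlib
import shutil

def store_media_asset(asset_dir, audio_path, video_path, user_input_metadata):
    asset_dir = pathlib.Path(asset_dir)
    asset_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy(audio_path, asset_dir / "capture_audio.wav")
    shutil.copy(video_path, asset_dir / "capture_video.mp4")
    with open(asset_dir / "metadata.json", "w") as f:
        json.dump({"user_input": user_input_metadata}, f, indent=2)

# Example: record that the user asked to boost audio source 3 by 6 dB.
# store_media_asset("asset_001", "mic.wav", "cam.mp4",
#                   [{"source_id": 3, "action": "level_increase", "amount_db": 6}])
```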
  • Figure 10C is a flow diagram that outlines additional example methods 1070 according to some disclosed implementations.
  • the example methods 1070 may be partitioned into blocks, such as blocks 1075, 1080, 1085, 1090 and 1095.
  • the various blocks may be described as operations, processes, methods, steps, acts or functions.
  • the blocks of Figure 10C are performed during a post-capture editing phase according to some examples.
  • the blocks of methods 1070, like those of other methods described herein, are not necessarily performed in the order indicated.
  • one or more of the blocks of methods 1070 may be performed concurrently.
  • some implementations of methods 1070 may include more or fewer blocks than shown and/or described.
  • the blocks of methods 1070 may be performed by one or more devices, for example, the device that is shown in Figure 1A.
  • Block 1075 involves “obtaining, by a control system, a media asset including metadata, audio data and video data received during a capture phase, the metadata corresponding to augmentation or replacement of audio data corresponding to one or more user-selected audio sources.”
  • the control system may be, or may include, the CPU 141 of Figure 1A.
  • the control system may, for example, include a general purpose single- or multi-chip processor, one or more digital signal processors (DSPs), one or more application specific integrated circuits (ASICs), one or more field programmable gate arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic, one or more discrete hardware components, or combinations thereof.
  • block 1075 may involve obtaining metadata corresponding to a label generated by the control system indicating that one or more audio sources or potential audio sources have been selected for possible augmentation or replacement.
  • the metadata may correspond to user input indicating a user’s selection of one or more audio sources or potential audio sources, whether during a capture phase or a previous post-capture editing phase.
  • the metadata may correspond to user input indicating a user’s desire to alter at least one audio characteristic of an audio source, such as the level of the audio source.
  • Block 1080 involves “obtaining, by the control system, synthetic audio data, external audio data, or both, for augmentation or replacement of audio data corresponding to a user-selected audio source.”
  • block 1080 may involve obtaining one or more types of replacement audio data from a memory, via downloading or streaming, etc.
  • block 1080 may involve obtaining audio data corresponding to music, audio data corresponding to pre-recorded sound effects (which may be referred to herein as Foley effects), such as ticking clock effects, closing door effects, breaking glass effects, etc.
  • block 1080 may involve generating synthetic audio data, or obtaining generated synthetic audio data.
  • the synthetic audio data may be generated by a neural network. Processing may continue to block 1085.
  • Block 1085 involves “augmenting or replacing, by the control system, the audio data corresponding to the selected audio source, to produce modified audio data comprising the augmented audio data, the replacement audio data, or both.”
  • block 1085 may involve increasing or decreasing a level of an audio source, for example according to metadata corresponding to a user’s desire to increase or decrease the level of the audio source.
  • block 1085 may involve entirely replacing the audio data corresponding to the selected audio source, for example replacing recorded background music with stored, streamed or downloaded music.
  • block 1085, or another block of the methods 1070, may involve associating generated, stored, streamed or downloaded audio with a selected potential audio source.
  • if the selected potential audio source is a video object (such as a door, a clock, an animal, a fountain, etc.), the associated audio may be placed in the audio scene such that the apparent location of the associated audio matches the location of the video object. Processing may continue to block 1090.
  • Block 1095 involves “storing, by the control system, the modified media asset in a memory.”
  • block 1095 may involve storing the media asset in a memory of the capture device, storing the media asset in a memory of another device (such as a memory device of a cloud-based service), or both.
  • block 1095 may involve storing modified and unmodified audio data.
  • block 1095 may involve storing one or more types of metadata, such as metadata generated during a capture process, metadata corresponding to aspects of a post-capture editing process, etc.
  • Storing such metadata, along with video data and unmodified audio data, may produce what is referred to herein as a “backwards-compatible media asset.”
  • the audio portion of a backwards-compatible media asset may be referred to herein as a “backwards-compatible audio asset.”
  • the media asset 1105 may have been stored after a capture phase, whereas in other examples the media asset 1105 may have been stored after a post-capture editing phase. Accordingly, the modified mixed audio data 1125 may have been modified — at least in part — during a capture phase, during a post-capture editing phase, or both. The modification may have involved augmentation, replacement, or both.
  • the coefficients of the transformation matrix may be adjusted based on user input or default values may be used.
  • the mixing module 1120 may perform the transformation directly on time samples, such as Pulse Code Modulation (PCM) samples, whereas in other examples the mixing module 1120 may perform the transformation on time-frequency slots provided by a filterbank (e.g., a Quadrature Mirror Filter (QMF)).
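  • When the transformation is performed directly on time samples, it might be sketched as a simple matrix multiply over blocks of PCM samples, as shown below; the coefficients and signal shapes used here are placeholders that could be adjusted from user input or default values.

```python
# Sketch of applying a mixing transformation matrix directly to PCM samples.
# `pcm` has shape [num_samples, num_input_channels]; the coefficients shown
# are placeholders that could be adjusted from user input or default values.
import numpy as np

def apply_mix_matrix(pcm, mix_matrix):
    # Each output channel is a linear combination of the input channels.
    return pcm @ mix_matrix.T

# Example: derive a mid channel and an attenuated side channel from stereo input.
mix_matrix = np.array([[0.5, 0.5],
                       [0.5, -0.5]])
stereo_pcm = np.zeros((48000, 2), dtype=np.float32)  # placeholder: 1 s at 48 kHz
output = apply_mix_matrix(stereo_pcm, mix_matrix)
```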
  • Block 1170 involves “selecting, by the control system, one or more selected audio sources from the inventory of audio sources, wherein the one or more selected audio sources are selected for possible augmentation or replacement.”
  • the one or more selected audio sources may have been selected based, at least in part, on a user’s actions, such as user input previously received via a GUI, a user’s having zoomed in on a particular video object, a user’s framing (e.g., centering) of a particular video object, etc.
  • block 1180 may involve controlling a display of the apparatus 101 to present a GUI like that shown in Figure 3A or one of the GUIs shown in Figures 4C-4E.
  • the audio source information areas 430a, 430b, 430c and 430d, respectively, are examples in which a GUI indicates “the one or more selected audio sources.”
  • the GUI may present a representation of a potential audio source for which audio data is not currently being received — or for which audio data is currently being received at a level that is below a threshold level — but which is nonetheless a video object that has been identified by the control system as a potential audio source.
  • the GUI may represent one or more audio sources that are selected for possible augmentation or replacement differently from other audio sources, e.g., in a different color.
  • the GUI may include one or more user input areas configured to receive user input.
  • the audio source information areas may be configured to receive user input, e.g., to allow a user to select one or more audio sources for augmentation or replacement.
  • the GUI may represent one or more audio sources that are selected for possible augmentation or replacement with a textual prompt, associated with augmentation or replacement of audio data corresponding to the one or more selected audio sources.
  • the GUI may represent one or more selected audio sources with a prompt such as “modify?” or “augment or replace?”
  • Some examples of the methods 1150 may involve receiving, by the control system, user input via the GUI indicating augmentation or replacement of audio data corresponding to a selected audio source. Some such examples may involve augmenting or replacing, by the control system, the audio data corresponding to the selected audio source, to produce modified audio data.
  • the modified audio data may be augmented audio data or replacement audio data.
  • Some examples may involve labeling, by the control system, the augmented audio data or the replacement audio data and storing a label along with the augmented audio data or the replacement audio data.
  • the label may, for example, be a type of audio metadata.
  • Some examples of the methods 1150 may involve causing, by the control system, the GUI to indicate that the audio data corresponding to a selected audio source is augmented audio data or replacement audio data, in other words indicating that the audio data corresponding to the selected audio source has been augmented or replaced. Some examples of the methods 1150 may involve causing, by the control system, the GUI to indicate one or more audio sources corresponding to unmodified audio data.
  • Some examples of the methods 1150 may involve storing, by the control system, unmodified audio data corresponding to at least one audio source that has been selected for augmentation or replacement, e.g., after audio data corresponding to a selected audio source has been augmented or replaced.
  • Some examples of the methods 1150 may allow a user to interpolate between augmented or replacement audio data and “real world” or unmodified audio data. Some such examples may involve causing, by the control system, a GUI to indicate an audio source having corresponding augmented audio data or replacement audio data. Some such examples may involve causing, by the control system, the GUI to include one or more user input areas for receiving user input for modifying the augmented audio data or replacement audio data according to the unmodified audio data. The modifying may, for example, involve interpolating between the augmented audio data or replacement audio data and the unmodified audio data. In some examples, modifying may involve replacing the augmented audio data or replacement audio data with the unmodified audio data. According to some examples, the GUI may be displayed after the capture phase and during a post-capture review process.
  • Figure 12A shows examples of media assets and an interpolator according to some disclosed examples.
  • Figure 12A shows a media asset 1208, an interpolator 1220 and a media asset with artistic intent set 1218.
  • the media asset 1208 includes unmodified audio data 1205, modified unmixed audio data 1210 and context metadata 1215.
  • a control system portion 1206 is configured for implementing the interpolator 1220.
  • the media asset with artistic intent set 1218 includes a copy of the unmodified audio data 1205, adjusted modified unmixed audio data 1212 that is output by the interpolator 1220, and a copy of the context metadata 1215.
  • the media asset 1208 may have been stored after a capture phase, whereas in other examples the media asset 1208 may have been stored after a post-capture editing phase.
  • the modified unmixed audio data 1210 may have been modified — at least in part — during a capture phase, during a post-capture editing phase, or both.
  • the modification may have involved augmentation, replacement, or both.
  • although a post-capture editing phase has already begun in this example, it is not yet complete. Instead, a user is providing user input 1203 to the interpolator 1220 to adjust the unmodified audio data 1205, the modified unmixed audio data 1210, or both, to the user’s satisfaction. After the user has adjusted the unmodified audio data 1205, the modified unmixed audio data 1210, or both, to satisfactorily represent the user’s artistic intent, the user may terminate the interpolation process and cause adjusted modified unmixed audio data 1212 to be output by the interpolator 1220 and stored as part of the media asset with artistic intent set 1218.
  • the interpolator 1220 may be configured to interpolate between the modified unmixed audio data 1210 and the unmodified audio data 1205 according to user input 1203. Alternatively, or additionally, the interpolator 1220 may be configured to interpolate between the unmodified audio data 1205 and synthetic or external audio data 1207 according to user input 1203.
  • the synthetic or external audio data 1207 may, for example, include music, sound effects, etc., which may be pre-recorded or generated.
  • the interpolator 1220 may be configured to interpolate based, at least in part, on the context metadata 1215. For example, the interpolator 1220 may be configured to propose an initial amount of interpolation based on the context metadata 1215, which could be modified according to user input 1203 if the user so desires.
  • control system portion 1206 is also configured to copy the unmodified audio data 1205 and the context metadata 1215, and to store copies of the unmodified audio data 1205 and the context metadata 1215 as components of the media asset with artistic intent set 1218.
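  • As an illustration of the interpolation described above for Figure 12A, the following is a minimal sketch, assuming the unmodified and modified audio are equal-length sample arrays and that the context metadata is a simple dictionary; the function and key names are hypothetical and are not part of the disclosed implementations.

```python
# Minimal interpolation sketch (hypothetical names); not the disclosed implementation.
import numpy as np

def interpolate_audio(unmodified: np.ndarray,
                      modified: np.ndarray,
                      ratio: float) -> np.ndarray:
    """Blend modified and unmodified audio.

    ratio = 0.0 returns the unmodified ("real world") audio;
    ratio = 1.0 returns the fully augmented or replaced audio.
    """
    if unmodified.shape != modified.shape:
        raise ValueError("audio buffers must have the same shape")
    ratio = float(np.clip(ratio, 0.0, 1.0))
    return (1.0 - ratio) * unmodified + ratio * modified

def initial_ratio_from_context(context_metadata: dict) -> float:
    """Propose a starting ratio from context metadata (assumed keys)."""
    # For example, lean toward unmodified audio when speech is detected.
    return 0.3 if context_metadata.get("speech_present") else 0.8
```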
  • Figures 12B, 12C and 12D illustrate example elements of GUIs that may be presented during a post-editing process that includes interpolation.
  • the GUIs 1260a, 1260b and 1260c of Figures 12B, 12C and 12D, respectively, are provided on a display 1255 of an apparatus 1251, which is a cell phone and is an instance of the apparatus 101 of Figure 1A in these examples.
  • a control system (not shown) of the apparatus 1251 is controlling the display 1255 to present the GUIs 1260a, 1260b and 1260c.
  • the types, numbers and arrangements of elements shown in Figures 12B-12D are merely provided as examples.
  • the GUI 1260a of Figure 12B includes audio source information areas 1230a, 1230b, 1230c, 1230d and 1230e, and textual prompts 1262a and 1262b.
  • the audio source information areas 1230b, 1230d and 1230e have been selected by the control system as candidates for possible modification, or further modification, and are displayed differently from the audio source information areas 1230a and 1230c.
  • the audio source information areas 1230b, 1230d and 1230e have associated textual prompts 1262b, which inquire whether a user would like to modify the corresponding audio source.
  • the textual prompt 1262a encourages the user to touch an audio source area (meaning one of the audio source information areas 1230a-1230e) if the user would like to select a corresponding audio source for modification, or for further modification.
  • the GUI 1260b of Figure 12C has been presented in response to detecting a user’s touch in the audio source information area 1230e of Figure 12B.
  • the GUI 1260b of Figure 12C includes a textual prompt 1262c and an interpolation control 1264a, which includes a virtual slider 1266 in this example.
  • the interpolation control 1264a controls an interpolator — such as the interpolator 1220 of Figure 12A — according to an indicated ratio of modified audio to unmodified audio, ranging from 0/1 — meaning completely unmodified — to 1/1, which means completely modified.
  • the textual prompt 1262c encourages the user to move the virtual slider 1266 to select a desired ratio of modified audio to unmodified audio.
  • audio corresponding to the selected ratio may be provided by speakers of the apparatus 1251.
  • a video may also be presented, such as a video that shows a scene which includes a video object corresponding to the selected audio source.
  • the user may be able to select a different ratio for different parts — e.g., for various time intervals — of the audio corresponding to the audio source.
  • the alternative GUI 1260c of Figure 12D may be presented in response to detecting a user’s touch in the audio source information area 1230e of Figure 12B.
  • the GUI 1260c of Figure 12D includes a textual prompt 1262d and an interpolation control 1264b, which also includes a virtual slider 1266 in this example.
  • the interpolation control 1264b controls an interpolator according to an indicated percentage of modified audio to unmodified audio, ranging from 0 percent — meaning completely unmodified — to 100 percent, which means completely modified.
  • the textual prompt 1262d encourages the user to move the virtual slider 1266 to select a desired percentage of modified audio to unmodified audio.
  • audio corresponding to the selected percentage may be provided by speakers of the apparatus 1251.
  • the user may be able to select a different percentage for different time intervals of the audio corresponding to the audio source.
  • a video may also be presented, such as a video that shows a scene which includes a video object corresponding to the selected audio source.
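  • The per-interval ratio or percentage selection described above could, as one hypothetical sketch, be realized as a piecewise crossfade between the modified and unmodified audio; the interval format, sample rate and names below are assumptions for illustration only.

```python
# Sketch of applying different user-selected ratios to different time intervals
# (as the sliders of Figures 12C and 12D allow); names and formats are assumptions.
import numpy as np

def apply_interval_ratios(unmodified: np.ndarray,
                          modified: np.ndarray,
                          intervals,
                          sample_rate: int,
                          default_ratio: float = 1.0) -> np.ndarray:
    """intervals: iterable of (start_seconds, end_seconds, ratio), ratio in [0, 1]."""
    out = (1.0 - default_ratio) * unmodified + default_ratio * modified
    for start_s, end_s, ratio in intervals:
        a, b = int(start_s * sample_rate), int(end_s * sample_rate)
        out[a:b] = (1.0 - ratio) * unmodified[a:b] + ratio * modified[a:b]
    return out

# Example: 25 percent modified for the first two seconds, 90 percent afterwards (48 kHz):
# result = apply_interval_ratios(u, m, [(0.0, 2.0, 0.25), (2.0, 10.0, 0.90)], 48000)
```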
  • Figure 12E is a flow diagram that outlines various example methods 1270 according to some disclosed implementations.
  • the example methods 1270 may be partitioned into blocks, such as blocks 1272, 1275, 1277, 1280, 1282, 1285, 1287, 1290 and 1292.
  • the various blocks may be described as operations, processes, methods, steps, acts or functions.
  • the blocks of methods 1270, like those of other methods described herein, are not necessarily performed in the order indicated.
  • one or more of the blocks of methods 1270 may be performed concurrently.
  • some implementations of methods 1270 may include more or fewer blocks than shown and/or described.
  • the blocks of methods 1270 may be performed by one or more devices, for example, the device that is shown in Figure 1A.
  • Block 1272 involves “receiving, by a control system of a device, audio data from a microphone system.”
  • the control system may be, or may include, the CPU 141 of Figure 1A.
  • the control system may, for example, include a general purpose single- or multi-chip processor, one or more digital signal processors (DSPs), one or more application specific integrated circuits (ASICs), one or more field programmable gate arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic, one or more discrete hardware components, or combinations thereof.
  • the microphone system may be a microphone system of a mobile device that includes the control system. Processing may continue to block 1275.
  • Block 1275 involves “receiving, by the control system, video data from a camera system.”
  • the camera system may include one or more cameras of a mobile device that includes the control system.
  • the microphone system, the camera system, or both may reside in one or more devices other than the device that includes the control system. Processing may continue to block 1277.
  • Block 1277 involves “creating, by the control system and based at least in part on the audio data, the video data, or both, an inventory of audio sources.”
  • the inventory of audio sources may, in some examples, be as described herein with reference to Figure 10 A, as described herein with reference to block 1015 of Figure 10B, or both.
  • Some examples may involve detecting, by the control system and based at least in part on the video data, one or more potential sound sources. Accordingly, in some examples the inventory of audio sources may include one or more actual audio sources and one or more potential audio sources. In some examples, at least one of the potential sound sources may not be indicated by the audio data. Processing may continue to block 1280.
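  • As a purely illustrative sketch, an inventory of audio sources that includes both actual sources and potential sources detected only from the video might be represented as follows; the record and field names are assumptions, not a structure required by this disclosure.

```python
# Hypothetical records for an inventory of audio sources (block 1277);
# field names are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class AudioSourceEntry:
    source_id: int
    label: str                        # e.g., "talker", "background music", "dog"
    is_potential: bool = False        # detected from video only, not yet evident in the audio
    confidence: float = 0.0
    position: Optional[tuple] = None  # estimated direction or on-screen coordinates

@dataclass
class AudioSourceInventory:
    entries: list = field(default_factory=list)

    def actual_sources(self) -> list:
        return [e for e in self.entries if not e.is_potential]

    def potential_sources(self) -> list:
        return [e for e in self.entries if e.is_potential]
```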
  • Block 1280 involves “selecting, by the control system, at least a first selected audio source from the inventory of audio sources, the first selected audio source being selected for augmentation or replacement.”
  • the selected audio source was detected background music, which was replaced by a streamed version of the same music.
  • the methods 1270 may involve controlling a display to present, prior to or during the capture phase, an audio data modification GUI that includes a user prompt associated with augmentation or replacement of audio data corresponding to one or more selected audio sources of the inventory of audio sources.
  • the audio data modification GUI may, in some examples, include a user prompt associated with replacement of the audio data corresponding to a selected audio source.
  • the replacement may be associated with replacing the audio data corresponding to the selected audio source with synthetic audio data or with external audio data.
  • the textual prompt of Figure 4D provides one example of this type of user prompt.
  • the audio data modification GUI may include a user prompt, a virtual control, or combinations thereof, associated with augmentation or diminution of the audio data corresponding to a selected audio source.
  • the plus and minus symbols in the audio source information areas 430a, 430b, 430c and 430d of Figures 4D and 4E are examples of such virtual controls.
  • augmentation may be associated with a microphone beamforming process for augmentation of the audio data corresponding to the selected audio source. For example, if a user touches one of the plus symbols in Figures 4D and 4E, in some implementations the control system may initiate a microphone beamforming process.
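  • One simplified way to realize such beamforming-based augmentation is a delay-and-sum beamformer steered toward the selected source. The following is a rough sketch under assumed array geometry and naming; it is not the specific beamforming process of any disclosed implementation.

```python
# Simplified delay-and-sum beamformer sketch; geometry and names are assumptions.
import numpy as np

SPEED_OF_SOUND = 343.0  # meters per second

def delay_and_sum(mic_signals: np.ndarray,   # shape (num_mics, num_samples)
                  mic_positions: np.ndarray, # shape (num_mics, 3), meters
                  direction: np.ndarray,     # unit vector toward the selected source
                  sample_rate: int) -> np.ndarray:
    num_mics, num_samples = mic_signals.shape
    # Relative arrival-time offsets for a plane wave from the selected direction.
    delays = mic_positions @ direction / SPEED_OF_SOUND
    delays -= delays.min()  # make all offsets non-negative
    out = np.zeros(num_samples)
    for m in range(num_mics):
        shift = int(round(delays[m] * sample_rate))
        out[: num_samples - shift] += mic_signals[m, shift:]
    return out / num_mics
```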
  • one or more selected audio sources may have been selected based, at least in part, on a user’s actions, such as user input previously received via a GUI, a user’s having zoomed in on a particular video object, a user’s framing (e.g., centering) of a particular video object, etc.
  • the selecting may involve estimating which audio sources in the inventory of audio sources correspond to talkers.
  • the one or more selected audio sources do not include audio sources estimated to be talkers.
  • at least one audio source may have been selected by a control system, such as a control system of a capture device.
  • the control system may have selected one or more audio sources for which the corresponding audio data is of low quality (such as background music that is partially masked by background noise), is estimated to be near or below a level of human audibility, etc.
  • the control system may have selected one or more potential or actual audio sources based, at least in part, on an estimated audio or video scene context.
  • at least one audio source may be selected based, at least in part, on the video data, e.g., by selecting a video object that is an actual or potential audio source. Processing may continue to block 1282.
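  • A hypothetical heuristic illustrating the kind of automatic selection described above (low-quality or barely audible sources that are not estimated to be talkers) is sketched below; the thresholds and field names are assumptions for illustration.

```python
# Hypothetical selection heuristic for block 1280; thresholds and keys are assumptions.
AUDIBILITY_FLOOR_DB = -60.0  # assumed level below which audio is treated as barely audible
MIN_SNR_DB = 6.0             # assumed SNR below which capture is treated as low quality

def select_for_modification(inventory: list) -> list:
    """inventory: list of dicts with estimated per-source characteristics."""
    selected = []
    for entry in inventory:
        if entry.get("is_talker"):
            continue  # leave sources estimated to be talkers unmodified
        low_level = entry.get("level_db", 0.0) < AUDIBILITY_FLOOR_DB
        low_snr = entry.get("snr_db", float("inf")) < MIN_SNR_DB
        if low_level or low_snr:
            selected.append(entry["source_id"])
    return selected
```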
  • Block 1282 involves “augmenting or replacing, by the control system, audio data corresponding to the first selected audio source, to produce first modified audio data, the first modified audio data comprising at least one of first augmented audio data or first replacement audio data.”
  • the selected audio source was detected background music, which was replaced by a streamed version of the same music. This replacement is an example of block 1282.
  • block 1282 may involve augmentation of the audio data, such as a microphone beamforming process for augmentation of the audio data corresponding to a selected audio source. Processing may continue to block 1285.
  • Block 1285 involves “storing, by the control system, the first modified audio data.”
  • the modified unmixed audio data 1210 of Figure 12A is one example of stored modified audio data. Processing may continue to block 1287.
  • Block 1287 involves “storing, by the control system, audio data and video data received during a capture phase, the audio data including first unmodified audio data corresponding to at least the first selected audio source.”
  • the unmodified audio data 1205 of Figure 12A is one example of stored unmodified audio data. Processing may continue to block 1290.
  • Block 1290 involves “controlling, by the control system, a display of the device to present images corresponding to the video data and to display a post-capture graphical user interface (GUI) overlaid on the images, wherein the post-capture GUI indicates at least the first selected audio source and one or more user input areas to receive user input.”
  • the GUI may be like the GUIs shown in Figures 12B, 12C or 12D, with one or more additional images of video data that includes at least a selected audio source.
  • the GUI may, in some instances, include audio source labels, such as the audio source labels in the audio source information areas 1230a- 1230e of Figure 12B.
  • At least one of the audio source labels may correspond to an audio source or potential audio source identified by the control system based on the audio data, the video data, or both.
  • the post-capture GUI may include at least one user input area configured to receive a user selection of a ratio between modified and unmodified audio data, such as a ratio between the first modified audio data and the first unmodified audio data.
  • the at least one user input area may be, or may include, a slider.
  • the slider may be configured to allow a user to select a ratio between modified and unmodified audio data, a percentage of modification, etc. In some examples, the slider may be configured to allow a user to select a ratio from zero percent to 100 percent. Processing may continue to block 1292.
  • Some examples of the methods 1270 may involve receiving, by the control system, audio data modification user input via the audio data modification GUI indicating augmentation or replacement of audio data corresponding to the first selected audio source.
  • the editing may be effective to provide the first modified audio data responsive to the audio data modification user input.
  • Some examples may involve labeling, by the control system, the augmented audio data or the replacement audio data and storing a label along with the augmented audio data or the replacement audio data.
  • the label may, for example, be a type of audio metadata.
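  • A minimal sketch of storing such a label alongside augmented or replacement audio follows; the file layout and label keys are assumptions for illustration, not a required metadata format.

```python
# Sketch of storing a label (audio metadata) next to modified audio; layout is assumed.
import json

def store_with_label(path_prefix: str, audio_bytes: bytes, label: dict) -> None:
    with open(path_prefix + ".pcm", "wb") as f:
        f.write(audio_bytes)
    with open(path_prefix + ".label.json", "w") as f:
        json.dump(label, f, indent=2)

# Example label for a background-music source replaced by a streamed version:
# store_with_label("asset_0007/source_3_modified", pcm_bytes,
#                  {"source_id": 3, "modification": "replacement",
#                   "replacement_origin": "streamed_music"})
```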
  • Some examples of the methods 1270 may involve causing, by the control system, the post-capture GUI to indicate that the audio data corresponding to the first selected audio source is modified audio data. Some such examples may involve causing, by the control system, the post-capture GUI to indicate one or more audio sources corresponding to unmodified audio data.
  • Some examples of the methods 1270 may involve causing, by the control system, the display to display audio source labels in the audio data modification GUI or the post-capture GUI. In some such examples, at least one of the audio source labels corresponds to an audio source or potential audio source identified by the control system based on the audio data, the video data, or both.
  • the context metadata 1315 may be based, for example, on the estimated presence of human speech in audio captured by the microphone system during the capture phase. At least some of the context metadata 1315 may, for example, correspond to user input (e.g., obtained during the capture phase) regarding one or more audio sources, such as a desire to increase the signal level of a talker’s audio, to decrease the level of street noise, to enhance the audio corresponding to one or more musical performers, etc.
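  • As an illustration only, context metadata such as item 1315 might capture results of capture-time analysis together with user preferences about specific sources; the keys below are assumptions, not a defined schema.

```python
# Illustrative context metadata (hypothetical keys), combining capture-time analysis
# results with user preferences about particular audio sources.
context_metadata_example = {
    "speech_present": True,
    "estimated_scene": "street performance",
    "user_preferences": [
        {"source_id": 1, "label": "talker", "request": "increase_level"},
        {"source_id": 4, "label": "street noise", "request": "decrease_level"},
        {"source_id": 2, "label": "musician", "request": "enhance"},
    ],
}
```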
  • the cloud processing system 1325 may function automatically, without the need for human input. In some such examples, the cloud processing system 1325 may be configured to determine when the audio processing is complete and to cause the processing module 1312 to store the processed audio data 1320 as at least part of the processed and modified audio data 1325.
  • the processing module 1312 may perform at least some processing according to optional user input 1317.
  • the user input 1317 may control the processing module 1312, the toolset selection module 1302, or both.
  • the user may provide user input 1317 causing the processed audio data 1320 to be stored as at least part of the processed and modified audio data 1325.
  • the cloud processing system 1325 is also configured to copy the unmodified audio data 1305 and the context metadata 1315, and to store copies of the unmodified audio data 1305 and the context metadata 1315 as components of the backwards-compatible media asset 1330.
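  • A backwards-compatible media asset of this kind could, as a rough sketch, be packaged by keeping the unmodified audio so that legacy playback can fall back to it, with the processed audio and context metadata carried alongside; the container layout and helper names below are assumptions.

```python
# Sketch of assembling a backwards-compatible media asset; layout and names are assumed.
import json
import pathlib
import shutil

def package_media_asset(out_dir: str,
                        unmodified_audio_path: str,
                        processed_audio_path: str,
                        context_metadata: dict) -> None:
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Keep the original capture so older decoders/players can still use it.
    shutil.copy(unmodified_audio_path, out / "unmodified_audio.wav")
    shutil.copy(processed_audio_path, out / "processed_audio.wav")
    (out / "context_metadata.json").write_text(json.dumps(context_metadata, indent=2))
```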
  • Figure 14 is a flow diagram that outlines various example methods 1400 according to some disclosed implementations.
  • the example methods 1400 may be partitioned into blocks, such as blocks 1402, 1405, 1407, 1410, 1412, 1415, 1417 and 1420.
  • the various blocks may be described as operations, processes, methods, steps, acts or functions.
  • the blocks of methods 1400, like those of other methods described herein, are not necessarily performed in the order indicated. In some implementations, one or more of the blocks of methods 1400 may be performed concurrently. Moreover, some implementations of methods 1400 may include more or fewer blocks than shown and/or described.
  • the blocks of methods 1400 may be performed by one or more devices, for example, the device that is shown in Figure 1A.
  • Some blocks of the methods 1400 may be performed during a capture phase and other blocks of the methods 1400 may be performed during a post-capture editing phase.
  • the post-capture editing phase may or may not be performed on the same device(s) used for the capture phase, depending on the particular implementation.
  • Block 1402 involves “receiving, by a control system of a device, audio data from a microphone system.”
  • the control system may be, or may include, the CPU 141 of Figure 1A.
  • the control system may, for example, include a general purpose single- or multi-chip processor, one or more digital signal processors (DSPs), one or more application specific integrated circuits (ASICs), one or more field programmable gate arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic, one or more discrete hardware components, or combinations thereof.
  • the microphone system may be a microphone system of a mobile device that includes the control system. Processing may continue to block 1405.
  • Block 1410 involves “storing, by the control system, audio data and video data received during a capture phase.”
  • the unmodified audio data 1305 of Figure 13 is one example of stored audio data that was received during a capture phase.
  • Processing may continue to block 1412.
  • Block 1412 involves “controlling, by the control system, a display of the device to display images corresponding to the video data and to display, prior to and during the capture phase, a graphical user interface (GUI) overlaid on the images, wherein the GUI includes an audio source image corresponding to each of the two or more audio sources, and wherein the GUI includes one or more user input areas to receive user input.”
  • block 1412 may involve controlling a display of the apparatus 101 to present a GUI like that shown in Figure 3 A or one of the GUIs shown in Figures 4C-4E.
  • EEE1A A method, comprising: receiving, by a control system of a device, audio data from a microphone system; receiving, by the control system, video data from a camera system; creating, by the control system and based at least in part on the audio data, the video data, or both, an inventory of audio sources; selecting, by the control system, a subset of one or more selected audio sources from the inventory of audio sources; estimating, by the control system and based on the audio data, at least one audio characteristic of at least the one or more selected audio sources; storing, by the control system, audio data and video data received during a capture phase; and controlling, by the control system, a display of the device to display images corresponding to the video data and to display, prior to and during the capture phase, a graphical user interface (GUI) overlaid on the images, wherein the GUI includes an audio source image corresponding to the at least one audio characteristic of the subset of one or more selected audio sources.
  • EEE16A The method of claim EEE15A, wherein the GUI includes at least one user input area configured to receive a user selection of a selected potential sound source or a selected candidate sound source.
  • EEE17A The method of claim EEE16A, wherein the GUI includes at least one user input area configured to receive a user selection of augmented audio capture, wherein the augmented audio capture includes at least one of external audio or synthetic audio for the selected potential sound source or the selected candidate sound source.
  • EEE22A The method of any one of claims EEE1A-EEE21A, further comprising updating, by the control system, an estimate of a current audio scene and causing, by the control system, the GUI to be updated according to updated estimates of the current audio scene.
  • EEE25A The method of any one of claims EEE22A-EEE24A, wherein the updated estimates of the current audio scene include an updated level estimate for one or more audio sources.
  • EEE28A The one or more non-transitory media of claim EEE26A or claim EEE27A, further comprising classifying, by the control system, the audio sources in the inventory of audio sources into two or more audio source categories, wherein the GUI includes a user input area portion corresponding to at least one of the two or more audio source categories.
  • EEE7B The method of any one of claims EEE4B-EEE6B, wherein the GUI includes a user prompt associated with augmentation or replacement of audio data corresponding to the first selected audio source.
  • EEE8B The method of claim EEE7B, wherein the GUI includes a user prompt associated with augmentation of the audio data corresponding to the first selected audio source and wherein the augmentation involves a microphone beamforming process for augmentation of the audio data corresponding to the first selected audio source.
  • EEE9B The method of claim EEE7B or claim EEE8B, wherein the GUI includes a user prompt associated with replacement of the audio data corresponding to the first selected audio source and wherein the replacement involves replacing the audio data corresponding to the first selected audio source with synthetic audio data or with external audio data.
  • EEE10B The method of any one of claims EEE7B-EEE9B, further comprising: receiving, by the control system, user input via the GUI, wherein the received user input indicates augmentation or replacement of audio data corresponding to the first selected audio source; and augmenting or replacing, by the control system, the audio data corresponding to the first selected audio source, to produce augmented audio data or replacement audio data.
  • EEE11B The method of claim EEE10B, further comprising: labeling, by the control system, the augmented audio data or the replacement audio data; and storing a label along with the augmented audio data or the replacement audio data.
  • EEE12B The method of claim EEE11B, wherein the label comprises audio metadata.
  • EEE13B The method of any one of claims EEE10B-EEE12B, further comprising causing, by the control system, the GUI to indicate that the audio data corresponding to the first selected audio source is augmented audio data or replacement audio data.
  • EEE14B The method of claim EEE13B, further comprising causing, by the control system, the GUI to indicate one or more audio sources corresponding to unmodified audio data.
  • EEE15B The method of any one of claims EEE10B-EEE14B, further comprising storing, by the control system, unmodified audio data corresponding to at least the first selected audio source.
  • An apparatus comprising: an interface system; a display system including one or more displays; a memory system; and a control system configured to: receive, via the interface system, audio data from a microphone system; receive, via the interface system, video data from a camera system; create, based at least in part on the audio data, the video data, or both, an inventory of audio sources; select one or more selected audio sources from the inventory of audio sources, wherein the one or more selected audio sources are selected for possible augmentation or replacement; store, in the memory system, audio data and video data received during a capture phase; and control a display of the display system to display images corresponding to the video data and to display a graphical user interface (GUI) overlaid on the images, wherein the GUI indicates the one or more selected audio sources.
  • GUI graphical user interface
  • EEE32B The apparatus of claim EEE31B, wherein the GUI includes one or more displayed user input areas to receive user input.
  • EEE33B The apparatus of claim EEE32B, wherein the GUI includes a user prompt associated with augmentation or replacement of audio data corresponding to the one or more selected audio sources.
  • EEE3C The method of claim EEE1C or claim EEE2C, wherein the post-capture GUI includes at least one user input area configured to receive a user selection of a ratio between the first modified audio data and the first unmodified audio data.
  • EEE4C The method of claim EEE3C, wherein the at least one user input area comprises a slider.
  • EEE5C The method of claim EEE4C, wherein the slider is configured to allow a user to select a ratio from zero percent to 100 percent.
  • EEE9C The method of any one of claims EEE1C-EEE8C, wherein the controlling further comprises adapting the display to present, prior to or during the capture phase, an audio data modification GUI that includes a user prompt associated with augmentation or replacement of audio data corresponding to one or more selected audio sources of the inventory of audio sources.
  • EEE13C The method of claim EEE12C, further comprising: labeling, by the control system, modified audio data; and storing a label along with the modified audio data.
  • EEE17C The method of any one of claims EEE1C-EEE16C, wherein the selecting involves estimating which audio sources in the inventory of audio sources correspond to talkers and wherein the one or more selected audio sources do not include audio sources estimated to be talkers.
  • EEE18C The method of any one of claims EEE1C-EEE17C, further comprising detecting, by the control system and based at least in part on the video data, one or more potential sound sources, wherein at least one of the one or more potential sound sources is not indicated by the audio data and wherein the inventory of audio sources includes the one or more potential sound sources.
  • EEE19C The method of any one of claims EEE1C-EEE18C, further comprising causing, by the control system, the display to display audio source labels in the audio data modification GUI or the post-capture GUI.
  • EEE20C The method of claim EEE19C, wherein at least one of the audio source labels corresponds to an audio source or potential audio source identified by the control system based on the audio data, the video data, or both.
  • EEE22C One or more non-transitory media having instructions stored thereon for controlling one or more devices to perform a method, the method comprising: receiving, by a control system of a device, audio data from a microphone system; receiving, by the control system, video data from a camera system; creating, by the control system and based at least in part on the audio data, the video data, or both, an inventory of audio sources; selecting, by the control system, at least a first selected audio source from the inventory of audio sources, the first selected audio source being selected for augmentation or replacement; augmenting or replacing, by the control system, audio data corresponding to the first selected audio source, to produce first modified audio data, the first modified audio data comprising at least one of first augmented audio data or first replacement audio data; storing, by the control system, the first modified audio data; storing, by the control system, audio data and video data received during a capture phase, the audio data including first unmodified audio data corresponding to at least the first selected audio source; controlling, by the control system, a
  • EEE24C The one or more non-transitory media of claim EEE22C or claim EEE23C, wherein the post-capture GUI includes at least one user input area configured to receive a user selection of a ratio between the first modified audio data and the first unmodified audio data.
  • EEE25C The one or more non-transitory media of claim EEE24C, wherein the at least one user input area comprises a slider.
  • An apparatus comprising: an interface system; a memory system; a display system including at least one display; and a control system configured to: receive audio data from a microphone system; receive video data from a camera system; create, based at least in part on the audio data, the video data, or both, an inventory of audio sources; select at least a first selected audio source from the inventory of audio sources, the first selected audio source being selected for augmentation or replacement; augment or replace audio data corresponding to the first selected audio source, to produce first modified audio data, the first modified audio data comprising at least one of first augmented audio data or first replacement audio data; store the first modified audio data; store audio data and video data received during a capture phase, the audio data including first unmodified audio data corresponding to at least the first selected audio source; control, by the control system, a display of the device to present images corresponding to the video data and to display a post-capture graphical user interface (GUI) overlaid on the images, wherein the post-capture GUI indicates at
  • EEE29C The apparatus of claim EEE28C, wherein the at least one user input area comprises a slider.
  • EEE2D The method of claim EEE1D, further comprising causing audio data corresponding to the revision metadata to be modified according to the revision metadata.
  • EEE3D The method of claim EEE2D, wherein the audio data corresponding to the revision metadata includes unmodified audio data received during the capture phase.
  • EEE4D The method of claim EEE2D or claim EEE3D, wherein the audio data corresponding to the revision metadata includes modified audio data and wherein the modified audio data includes augmented audio data or replacement audio data.
  • EEE5D The method of any one of claims EEE2D-EEE4D, wherein causing the audio data to be modified comprises modifying, by the control system, audio data corresponding to the revision metadata.
  • EEE6D The method of any one of claims EEE2D-EEE4D, wherein causing the audio data received during the capture phase to be modified comprises sending, by the control system, the revision metadata and the audio data received during the capture phase to one or more other devices.
  • EEE7D The method of any one of claims EEE2D, EEE3D, EEE4D or EEE6D, wherein causing the audio data received during the capture phase to be modified comprises sending, by the control system, the revision metadata and the audio data received during the capture phase to one or more servers.
  • EEE8D The method of any one of claims EEE2D-EEE7D, wherein causing the audio data received during the capture phase to be modified comprises applying an audio enhancement tool to audio data corresponding to the revision metadata.
  • EEE9D The method of claim EEE8D, wherein the audio data corresponding to the revision metadata includes speech audio data corresponding to speech from at least one person and wherein the audio enhancement tool comprises a speech enhancement tool.
  • EEE10D The method of claim EEE8D, wherein the audio enhancement tool comprises a sound source separation process.
  • EEE11D The method of any one of claims EEE1D-EEE10D, further comprising: receiving, by the control system and after the capture phase has begun, modification user input via the one or more user input areas; and causing, by the control system, audio data received during the capture phase to be modified according to the modification user input.
  • EEE12D The method of any one of claims EEE1D-EEE11D, wherein the one or more user input areas includes at least one user input area configured for receiving user input regarding a selected level.
  • EEE13D The method of any one of claims EEE1D-EEE12D, wherein the identifying comprises creating, by the control system, an inventory of sound sources.
  • EEE14D The method of claim EEE13D, wherein the inventory of sound sources includes actual sound sources and potential sound sources.
  • EEE15D The method of any one of claims EEE1D-EEE14D, wherein the storing comprises storing modified audio data that has been modified according to the user input.
  • EEE16D The method of any one of claims EEE1D-EEE15D, wherein the identifying comprises performing, by the control system, a first sound source separation process and wherein causing the audio data to be modified comprises performing a second sound source separation process.
  • various blocks shown in the flowcharts may be viewed as method steps, and/or as operations that result from operation of computer program code, and/or as a plurality of coupled logic circuit elements constructed to carry out the associated function(s).
  • embodiments of the present disclosure include a computer program product including a computer program tangibly embodied on a machine readable medium, the computer program containing program codes configured to carry out the methods as described above.
  • a machine readable medium may be any tangible medium that may contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
  • the machine readable medium may be a machine readable signal medium or a machine readable storage medium.
  • a machine readable medium may be non-transitory and may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing.
  • Computer program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These computer program codes may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus that has control circuitry, such that the program codes, when executed by the processor of the computer or other programmable data processing apparatus, cause the functions/operations specified in the flowcharts and/or block diagrams to be implemented.
  • the program code may execute entirely on a computer, partly on the computer, as a stand-alone software package, partly on the computer and partly on a remote computer or entirely on the remote computer or server or distributed over one or more remote computers and/or servers.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Otolaryngology (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

Some disclosed methods involve receiving audio data from a microphone system; receiving video data from a camera system; creating an inventory of audio sources; selecting at least a first selected audio source from the inventory for augmentation or replacement; augmenting or replacing audio data of the first selected audio source, to produce first modified audio data comprising at least one of first augmented audio data or first replacement audio data; storing the first modified audio data; storing audio data and video data from a capture phase including first unmodified audio data of at least the first selected audio source; controlling a display of the device to present images of the video data overlaid by a post-capture GUI; and editing, during a post-capture phase review process, the first modified audio data including at least a portion of the first unmodified audio data based on the user input from the post-capture GUI.

Description

FALLBACK FROM AUGMENTED AUDIO CAPTURE TO REAL-WORLD AUDIO CAPTURE
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims the benefit of priority from U.S. Provisional Application No. 63/621,960, filed on 17 January 2024 and U.S. Provisional Application No. 63/744,455, filed on 13 January 2025, each of which is incorporated by reference herein in its entirety.
TECHNICAL FIELD
[0002] This disclosure relates generally to audio capture and to related user feedback and audio processing.
BACKGROUND
[0003] Unless otherwise indicated herein, the materials described in this section are not prior art to the claims in this application, and are not admitted as prior art by inclusion in this section.
[0004] It is a common situation that during capture of audio-video content on a mobile device users can see a video that is being captured but cannot easily monitor the audio signal being captured (e.g., without using one or more additional devices). While the audio capture processes — also referred to herein as the “audio capture stack” or the “audio stack” — implemented by a mobile device are typically designed to maximize the quality of the captured audio according to some assumed artistic intent, a mobile device user generally does not know whether the technology operating within the audio capture stack performs satisfactorily. Other issues are discussed in the Detailed Description below. Improved methods, devices and systems would be desirable.
SUMMARY
[0005] Techniques are described for audio capture and for related user feedback and audio signal processing. In some example embodiments, the methods may involve receiving, by a control system of a device, audio data from a microphone system and video data from a camera system. In some examples, the audio data may be received from a microphone system of the device and the video data may be received from a camera system of the device. Some methods may involve identifying, by the control system and based at least in part on the audio data and the video data, two or more audio sources in an audio scene. Some methods may involve estimating, by the control system and based on the audio data, at least one audio characteristic of each of the two or more audio sources. Some methods may involve storing, by the control system, audio data and video data received during a capture phase. The storing may, in some examples, involve storing unmodified audio data received during the capture phase.
[0006] Some methods may involve controlling, by the control system, a display of the device to display images corresponding to the video data and to display, prior to and during the capture phase, a graphical user interface (GUI) overlaid on the images. In some examples, the GUI may include an audio source image corresponding to the at least one audio characteristic of each of the two or more audio sources. In some examples, the GUI may include one or more user input areas for receiving user input. Some methods may involve receiving, by the control system and prior to the capture phase, user input via the one or more user input areas. Some methods may involve causing, by the control system, audio data received during the capture phase to be modified according to the user input.
[0007] Some methods may involve receiving, by the control system and after the capture phase has begun, user input via the user input area. Some methods may involve causing, by the control system, audio data received throughout a duration of the capture phase to be modified according to the user input.
[0008] Some methods may involve classifying, by the control system, the two or more audio sources into two or more audio source categories. In some examples, the GUI may include a user input area portion corresponding to each of the two or more audio source categories. In some examples, the two or more audio source categories may include a background category and a foreground category. According to some examples, the one or more user input areas may include at least one user input area configured for receiving user input regarding a selected level, or ratio of levels, for each of the two or more audio source categories.
[0009] Some methods may involve creating, by the control system, an inventory of sound sources. According to some examples, the inventory of sound sources may include actual sound sources and potential sound sources. In some examples, classifying the two or more audio sources into two or more audio source categories may be based, at least in part, on the inventory of sound sources.
[0010] Some methods may involve determining one or more types of actionable feedback regarding the audio scene. In some examples, the GUI may be based, in part, on the one or more types of actionable feedback.
[0011] Some methods may involve storing unmodified audio data received during the capture phase. Some methods may involve storing modified audio data that has been modified according to user input.
[0012] Some methods may involve creating and storing, by the control system, user input metadata corresponding to user input received via the user input area. In some methods, causing the audio data received during the capture phase to be modified according to the user input involves post-capture audio processing based at least in part on the user input metadata. In some examples, the control system may be configured to perform at least a portion of the post-capture audio processing. According to some examples, another control system — such as a control system of a cloud-based service, e.g., a server’s control system — may be configured to perform at least a portion of the post-capture audio processing. In some examples, the identifying may involve performing, by the control system, a first sound source separation process. In some such examples, the post-capture audio processing may involve performing a second sound source separation process.
[0013] In some examples, the identifying may involve detecting, by the control system and based at least in part on the video data, one or more potential sound sources. According to some examples, at least one of the one or more potential sound sources may not be indicated by the audio data.
[0014] Some methods may involve detecting, by the control system, one or more candidate sound sources for augmented audio capture. In some examples, the augmented audio capture may involve replacement of a candidate sound source by external audio or synthetic audio. According to some examples, the GUI may include at least one user input area configured for receiving a user selection of a selected potential sound source or a selected candidate sound source. In some examples, the GUI may include at least one user input area configured for receiving a user selection of augmented audio capture. According to some examples, the augmented audio capture may include at least one of external audio or synthetic audio for the selected potential sound source or the selected candidate sound source. In some examples, the GUI may include at least one user input area configured for receiving a user selection of a ratio between augmented audio capture and real-world audio capture.
[0015] According to some examples, causing the audio data received during the capture phase to be modified according to the user input may involve modifying audio data corresponding to a selected audio source or a selected category of audio sources. In some examples, causing the audio data received during the capture phase to be modified according to the user input may involve a beamforming process corresponding to a selected audio source.
[0016] Still other example embodiments describe an apparatus. In some example embodiments, the apparatus may include an interface system that includes an input/output (I/O) system.
According to some implementations, the apparatus may include a control system including one or more processors. In some example embodiments, the apparatus may include a display system including one or more displays and a touch sensor system proximate at least one of the one or more displays. According to some example embodiments, the control system may be configured to receive, via the interface system, audio data from a microphone system and video data from a camera system. In some example embodiments, the control system may be configured to identify, based at least in part on the audio data and the video data, two or more audio sources in an audio scene. According to some example embodiments, the control system may be configured to estimate at least one audio characteristic of each of the two or more audio sources. In some example embodiments, the control system may be configured to store audio data and video data received during a capture phase. According to some example embodiments, the control system may be configured to control a display of the display system to display images corresponding to the video data and to display, prior to and during the capture phase, a graphical user interface (GUI) overlaid on the images. In some examples, the GUI may include an audio source image corresponding to the at least one audio characteristic of each of the two or more audio sources. In some examples, the GUI may include one or more user input areas for receiving user input via the touch sensor system. According to some example embodiments, the control system may be configured to receive, via the interface system and prior to the capture phase, user input via the one or more user input areas. In some example embodiments, the control system may be configured to cause audio data received during the capture phase to be modified according to the user input.
[0017] In some example embodiments, the control system may be configured to receive, after the capture phase has begun, user input via the user input area and to cause audio data received throughout a duration of the capture phase to be modified according to the user input.
[0018] According to some example embodiments, the control system may be configured to classify the two or more audio sources into two or more audio source categories. In some examples, the GUI may include a user input area portion corresponding to each of the two or more audio source categories. According to some examples, the two or more audio source categories may include a background category and a foreground category. In some examples, the one or more user input areas may include at least one user input area configured for receiving user input regarding a selected level, or ratio of levels, for each of the two or more audio source categories.
[0019] Some or all of the methods described herein may be performed by one or more devices according to instructions (e.g., software) stored on one or more non-transitory and computer-readable media for controlling one or more devices to perform one or more methods. Such non-transitory and computer-readable media may include memory devices such as those described herein, including but not limited to random access memory (RAM) devices, read-only memory (ROM) devices, etc.
[0020] Some such methods may involve receiving, by a control system of a device, audio data from a microphone system and video data from a camera system. In some examples, the audio data may be received from a microphone system of the device and the video data may be received from a camera system of the device. Some methods may involve identifying, by the control system and based at least in part on the audio data and the video data, two or more audio sources in an audio scene. Some methods may involve estimating, by the control system and based on the audio data, at least one audio characteristic of each of the two or more audio sources. Some methods may involve storing, by the control system, audio data and video data received during a capture phase. The storing may, in some examples, involve storing unmodified audio data received during the capture phase.
[0021] Some methods may involve controlling, by the control system, a display of the device to display images corresponding to the video data and to display, prior to and during the capture phase, a graphical user interface (GUI) overlaid on the images. In some examples, the GUI may include an audio source image corresponding to the at least one audio characteristic of each of the two or more audio sources. In some examples, the GUI may include one or more user input areas for receiving user input. Some methods may involve receiving, by the control system and prior to the capture phase, user input via the one or more user input areas. Some methods may involve causing, by the control system, audio data received during the capture phase to be modified according to the user input.
[0022] Some methods may involve receiving, by the control system and after the capture phase has begun, user input via the user input area. Some methods may involve causing, by the control system, audio data received throughout a duration of the capture phase to be modified according to the user input.
[0023] Some methods may involve classifying, by the control system, the two or more audio sources into two or more audio source categories. In some examples, the GUI may include a user input area portion corresponding to each of the two or more audio source categories. In some examples, the two or more audio source categories may include a background category and a foreground category. According to some examples, the one or more user input areas may include at least one user input area configured for receiving user input regarding a selected level, or ratio of levels, for each of the two or more audio source categories.
[0024] Some methods may involve creating, by the control system, an inventory of sound sources. According to some examples, the inventory of sound sources may include actual sound sources and potential sound sources. In some examples, classifying the two or more audio sources into two or more audio source categories may be based, at least in part, on the inventory of sound sources.
[0025] Some methods may involve determining one or more types of actionable feedback regarding the audio scene. In some examples, the GUI may be based, in part, on the one or more types of actionable feedback.
[0026] Some methods may involve storing unmodified audio data received during the capture phase. Some methods may involve storing modified audio data that has been modified according to user input.
[0027] Some methods may involve creating and storing, by the control system, user input metadata corresponding to user input received via the user input area. In some methods, causing the audio data received during the capture phase to be modified according to the user input involves post-capture audio processing based at least in part on the user input metadata. In some examples, the control system may be configured to perform at least a portion of the post-capture audio processing. According to some examples, another control system — such as a control system of a cloud-based service, e.g., a server’s control system — may be configured to perform at least a portion of the post-capture audio processing. In some examples, the identifying may involve performing, by the control system, a first sound source separation process. In some such examples, the post-capture audio processing may involve performing a second sound source separation process.
[0028] In some examples, the identifying may involve detecting, by the control system and based at least in part on the video data, one or more potential sound sources. According to some examples, at least one of the one or more potential sound sources may not be indicated by the audio data.
[0029] Some methods may involve detecting, by the control system, one or more candidate sound sources for augmented audio capture. In some examples, the augmented audio capture may involve replacement of a candidate sound source by external audio or synthetic audio. According to some examples, the GUI may include at least one user input area configured for receiving a user selection of a selected potential sound source or a selected candidate sound source. In some examples, the GUI may include at least one user input area configured for receiving a user selection of augmented audio capture. According to some examples, the augmented audio capture may include at least one of external audio or synthetic audio for the selected potential sound source or the selected candidate sound source. In some examples, the GUI may include at least one user input area configured for receiving a user selection of a ratio between augmented audio capture and real-world audio capture.
[0030] According to some examples, causing the audio data received during the capture phase to be modified according to the user input may involve modifying audio data corresponding to a selected audio source or a selected category of audio sources. In some examples, causing the audio data received during the capture phase to be modified according to the user input may involve a beamforming process corresponding to a selected audio source.
[0031] The embodiments described herein may be generally described as techniques, where the term “technique” may refer to system(s), device(s), method(s), computer-readable instruction(s), module(s), component(s), hardware logic, and/or operation(s) as suggested by the context as applied herein.
[0032] Features and technical benefits other than those explicitly described above will be apparent from a reading of the following Detailed Description and a review of the associated drawings. This Summary is provided to introduce a selection of techniques in a simplified form, and is not intended to identify key or essential features of the claimed subject matter, which are defined by the appended claims.
DESCRIPTION OF DRAWINGS
[0033] Figure 1A illustrates a schematic block diagram of an example device architecture that may be used to implement various aspects of the present disclosure.
[0034] Figure 1B illustrates a schematic block diagram of an example central processing unit (CPU) implemented in the device architecture of Figure 1A that may be used to implement various aspects of the present disclosure.
[0035] Figure 1C shows examples of a time line and of time intervals during which some disclosed processes may occur.
[0036] Figure ID is a flow diagram that outlines various example methods according to some disclosed implementations.
[0037] Figure 2A shows examples of a time line and of time intervals when some disclosed processes may occur.
[0038] Figure 2B is a flow diagram that outlines various example methods according to some disclosed implementations.
[0039] Figure 3A illustrates an example of a GUI that may be presented to indicate one or more audio scene preferences and one or more current audio scene characteristics.
[0040] Figure 3B is a block diagram that illustrates examples of customization layers that may be presented via one or more GUIs.
[0041] Figure 4A illustrates an example of a GUI that may be presented in accordance with the first customization layer of Figure 3B.
[0042] Figure 4B illustrates another example of a GUI that may be presented in accordance with the first customization layer of Figure 3B.
[0043] Figure 4C illustrates an example of a GUI that may be presented in accordance with the second customization layer of Figure 3B.
[0044] Figures 4D and 4E illustrate example elements of a GUI that may be presented in accordance with the third or fourth customization layers of Figure 3B.
[0045] Figure 5A is a flow diagram that outlines various example methods according to some disclosed implementations.
[0046] Figure 5B is a flow diagram that outlines various example methods according to some disclosed implementations.
[0047] Figure 5C shows a table that represents example elements of a video object data structure according to some disclosed implementations.
[0048] Figure 5D shows a table that represents example elements of an audio source inventory data structure according to some disclosed implementations.
[0049] Figure 6 shows examples of modules that may be implemented according to some disclosed examples.
[0050] Figure 7 shows example components of an analysis module that is configured for context hypothesis evaluation according to some disclosed examples.
[0051] Figure 8 is a flow diagram that outlines various example methods according to some disclosed implementations.
[0052] Figure 9 is a flow diagram that outlines various example methods according to some disclosed implementations.
[0053] Figure 10A represents elements of an audio source inventory according to some disclosed implementations.
[0054] Figure 10B is a flow diagram that outlines various example methods according to some disclosed implementations.
[0055] Figure 10C is a flow diagram that outlines additional example methods according to some disclosed implementations.
[0056] Figure 11A shows examples of media assets and a mixing module according to some disclosed examples.
[0057] Figure 11B is a flow diagram that outlines various example methods 1150 according to some disclosed implementations.
[0058] Figure 12A shows examples of media assets and an interpolator according to some disclosed examples.
[0059] Figures 12B, 12C and 12D illustrate example elements of GUIs that may be presented during a post-editing process that includes interpolation.
[0060] Figure 12E is a flow diagram that outlines various example methods according to some disclosed implementations.
[0061] Figure 13 shows examples of media assets and an interpolator according to some disclosed examples.
[0062] Figure 14 is a flow diagram that outlines various example methods according to some disclosed implementations.
[0063] Figure 15 is a block diagram of an example immersive voice and audio services (IVAS) coder/decoder (“codec”) framework for encoding and decoding IVAS bitstreams, according to one or more embodiments.
[0064] In the drawings, specific arrangements or orderings of schematic elements, such as those representing devices, units, instruction blocks and data elements, are shown for ease of description. However, it should be understood by those skilled in the art that the specific ordering or arrangement of the schematic elements in the drawings is not meant to imply that a particular order or sequence of processing, or separation of processes, is required. Further, the inclusion of a schematic element in a drawing is not meant to imply that such element is required in all embodiments or that the features represented by such element may not be included in or combined with other elements in some implementations.
[0065] Further, in the drawings, where connecting elements, such as solid or dashed lines or arrows, are used to illustrate a connection, relationship, or association between or among two or more other schematic elements, the absence of any such connecting elements is not meant to imply that no connection, relationship, or association can exist. In other words, some connections, relationships, or associations between elements are not shown in the drawings so as not to obscure the disclosure. In addition, for ease of illustration, a single connecting element is used to represent multiple connections, relationships or associations between elements. For example, where a connecting element represents a communication of signals, data, or instructions, it should be understood by those skilled in the art that such element represents one or multiple signal paths, as may be needed, to effect the communication.
[0066] The same reference symbol used in various drawings indicates like elements.
DETAILED DESCRIPTION
[0067] In the following detailed description, numerous specific details are set forth to provide a thorough understanding of various described embodiments with reference to the accompanying drawings. The illustrative embodiments in the detailed description, drawings, and claims are not meant to be limiting. Other embodiments may be utilized, and other changes made, without departing from the spirit or scope of the present disclosure. In light of the present disclosure, it will be apparent to one of ordinary skill in the art that the various described features and implementations may be practiced without many of these specific details. In some instances, well-known methods, procedures, components, and circuits have not been described in detail so as not to unnecessarily obscure aspects of the embodiments. Several features are described hereafter that can each be used independently of one another or with any combination of other features. Thus, the features may be arranged, substituted, combined, separated or designed into other configurations, which is contemplated in light of the present disclosure.
Nomenclature
[0068] As used herein, the term “includes” and its variants are to be read as open-ended terms that mean “includes, but is not limited to.” The term “or” is to be read as “and/or” unless the context clearly indicates otherwise. The term “based on” is to be read as “based at least in part on.” The terms “one example implementation” and “an example implementation” are to be read as “at least one example implementation.” The term “another implementation” is to be read as “at least one other implementation.” The terms “determined,” “determines,” or “determining” are to be read as obtaining, receiving, computing, calculating, estimating, predicting or deriving. In addition, in the following description and claims, unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
DEFINITIONS
[0069] An audio bed is a mono or multichannel audio waveform which is associated with a particular channel representation of the content.
[0070] An audio object is a mono or stereo audio waveform which is associated with some positional metadata, which may facilitate rendering of this audio object for an arbitrary configuration of a playout system.
[0071] A microphone feed is a signal captured or being captured by a microphone or a microphone array connected to a mobile device performing the capture. The “microphone feed” may also be processed by enhancement tools. The microphone feed may be created at any layer of the audio capture stack implemented on a mobile device. In other words, we use the term “microphone feed” to refer to any signal (mono, multichannel, sound field, etc.) that is available to the disclosed computational capture system.
[0072] A video feed is a signal captured or being captured by a camera or multiple cameras of the mobile device. The video feed is any video signal that is available to the disclosed computational capture system.
[0073] A standard media container is an interchangeable media format, which is supported by the software ecosystem associated with the mobile device.
[0074] Acronyms
[0075] AR - Augmented Reality
[0076] ASIC - Application-Specific Integrated Circuit
[0077] BS - Bitstream
[0078] CD-ROM - Compact Disc Read-Only Memory
[0079] CNN - Convolutional Neural Network
[0080] 3D-CNN - 3-Dimensional-CNN
[0081] CNN-RNN - CNN-Recurrent Neural Network
[0082] CPU - Central Processing Unit
[0083] DIRAC - Directional Audio Coding
[0084] DSP - Digital Signal Processor
[0085] EPROM - Erasable Programmable Read-Only Memory
[0086] EVS - Enhanced Voice Services
[0087] FOA - First Order Ambisonics
[0088] FPGA - Field-Programmable Gate Array
[0089] GUI - Graphical User Interface
[0090] HOA - Higher Order Ambisonics
[0091] I/O - Input/Output
[0092] IVAS - Immersive Voice and Audio Services
[0093] MD - Metadata
[0094] MP4 - Moving Picture Experts Group (MPEG)-4 Part 14
[0095] MViTv2 - Multiscale Vision Transformers
[0096] PCM - Pulse Code Modulation
[0097] QMF - Quadrature Mirror Filter
[0098] RAM - Random Access Memory
[0099] ROM - Read Only Memory
[0100] SPAR - Spatial Reconstruction
[0101] SNR - Signal-to-Noise Ratio
[0102] VR - Virtual Reality
[0103] VSC - Video Scene Classification
[0104] YAMNet - Yet Another Multi-scale Convolutional Neural Network
[0105] This disclosure describes various audio capture methods, devices and systems configured for operating in the context of audio and video capture. Some embodiments presented in this disclosure describe situations in which a mobile audio and video capture is performed via a mobile device, such as a mobile phone, equipped with a built-in microphone or a microphone array and a video camera.
[0106] As noted above, it is a common situation that during capture of audio-video content on a mobile device users can see a video that is being captured but cannot easily monitor the audio signal being captured (e.g., without using one or more additional devices). Some embodiments presented in this disclosure describe a situation where a mobile audio capture is performed on a mobile phone equipped with a built-in microphone or a microphone array, and the user performing the audio capture is unable to monitor the audio material being recorded, e.g., by means of headphone playout. While some embodiments will be described in this context herein, the present disclosure is not limited to such a field of use and is applicable in broader contexts.
[0107] While the audio capture processes — also referred to herein as the “audio capture stack” or the “audio stack” — implemented by a mobile device are typically designed to maximize the quality of the captured audio according to some assumed artistic intent, a mobile device user generally does not know whether the technology operating within the audio capture stack performs satisfactorily. This can lead to scenarios in which a user may give up on audio capture, for example, in challenging audio conditions, because the user does not know whether any useful audio can be captured in such a situation. Furthermore, a user may not be aware of a solvable audio problem appearing during the audio capture that would require a corrective action from the user. For example, the performance of audio capture may often be improved by moving closer to an audio source, by changing the way the mobile device is held, etc. Furthermore, the audio stack may contain powerful audio enhancement tools, but these tools typically have some limits on their performance (e.g., in terms of the minimum signal-to-noise ratio (SNR)). The performance of these audio enhancement tools typically degrades as audio capture conditions worsen. However, the user generally does not know about the limits of the audio enhancement tools. Also, the user generally does not know how far the current operating point is from these audio enhancement tool limits.
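Purely as an illustrative sketch, and not as part of any claimed embodiment, the relationship between the current operating point and an enhancement tool's SNR limit might be quantified as shown below; the function names, the frame-based SNR estimate and the 5 dB threshold are assumptions introduced here for illustration only.

```python
import numpy as np

def estimate_snr_db(signal_frame: np.ndarray, noise_frame: np.ndarray) -> float:
    """Rough SNR estimate in dB from a signal-dominant frame and a noise-only frame."""
    eps = 1e-12
    signal_power = float(np.mean(signal_frame.astype(np.float64) ** 2))
    noise_power = float(np.mean(noise_frame.astype(np.float64) ** 2))
    return 10.0 * np.log10((signal_power + eps) / (noise_power + eps))

def enhancement_headroom_db(estimated_snr_db: float, tool_min_snr_db: float = 5.0) -> float:
    """Headroom between the current operating point and an assumed tool SNR limit.

    A negative value indicates that capture conditions are below the point at which
    the (hypothetical) enhancement tool is expected to perform acceptably, which is
    precisely what a user cannot judge without feedback.
    """
    return estimated_snr_db - tool_min_snr_db
```

Such a headroom value could, under these assumptions, drive the kind of feedback discussed below.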
[0108] Some disclosed methods involve generating feedback on the composition of the audio scene that adapts to the context, allows a user to address the aforementioned issues, and facilitates a corrective action from the user. In some examples, such feedback may be adaptive in real time, so the user can correlate the feedback with the changes in the acoustic scene perceived by the user. According to some examples, the feedback may be available starting in a pre-capture phase and may continue through the actual capture phase. In some examples, the feedback may include graphical feedback regarding the composition of an audio scene overlaid on top of video that is currently being obtained. Some disclosed examples allow a user to introduce augmentations of the audio capture (before, during or after capture). Some disclosed examples provide means for a fallback from augmented audio capture to real-world audio capture, to interpolate between augmented audio capture and real-world audio capture, or both.
[0109] To implement the intended user experiences that can be provided according to some implementations, various technical problems may need to be solved. For example, the audio-video scene — also referred to herein as the audio scene — may need to be analyzed in order to identify the audio scene’s acoustic components, preferably along with one or more audio characteristics such as acoustic level information. In some instances, both actual and potential acoustic components may be determined. The audio scene analysis may be available at any time after the capture app has been launched, and may be updated during the audio capture. In some examples, the visualization of the audio scene may include not only the presence of actual or potential audio sources, but may also provide estimates about contributions from the respective audio sources to the audio scene. The estimates of audio source contributions to an audio scene may be derived, for example, from estimated levels of the respective audio sources. In some examples, information on the components of the audio scene may be filtered by estimating which audio sources are likely to be most relevant to the artistic intent of the user.
[0110] Some aspects of this disclosure involve providing real-time graphical feedback regarding the composition of an audio scene. In some such examples, the graphical feedback may be overlaid on video images that are being obtained by the mobile device. The graphical feedback may, in some examples, be provided via a graphical user interface (GUI) that includes a representation of one or more audio sources in an audio scene and one or more user input areas for receiving user input. In some examples, audio data received during a capture phase may be modified according to the user input.
[0111] Some examples involve providing or facilitating one or more types of audio data augmentation, which may occur before, during or after the capture process. The audio data augmentation may involve replacement of a candidate sound source by external audio or synthetic audio, a beamforming process, or both. Some disclosed examples may facilitate the insertion of audio signals that are not present in the microphone feed, the removal of one or more components of the audio scene, etc. According to some examples, a control system may select one or more candidate sound sources for modification, for augmented audio capture, or both. In some such examples, a user may select an audio source for modification, augmentation, or both via a GUI.
[0112] Some disclosed examples provide means for a fallback from augmented capture into “real-world” or unmodified audio capture. Some such examples involve storing unmodified audio data received during the capture phase as well as augmented audio data. According to some examples, a GUI — which may be provided during a post-capture editing process — may include at least one user input area configured for receiving a user selection of a ratio between augmented audio capture and real-world audio capture. The user input area may include a virtual slider, a virtual dial, etc.
[0113] Multimodal analysis of video and audio may be used in the context of identification of audio sources by exploiting correlation between objects present in the video feed of the capture system and objects present in the microphone feed. In general, the set of detected audio objects may contain many elements, and its composition could be unstable or ambiguous. Some technical problems solved by the present disclosure involve the generation of context-relevant feedback on the composition of the scene, where only the significant audio sources are selected and their contribution levels are estimated, while other sources may be, for example, bucketed into a background component. In other words, while it would be potentially easy to overwhelm a user with many details related to the composition of an audio scene, some disclosed examples involve filtering the details of the audio scene and presenting information regarding one or more sound sources that are estimated to be the most relevant components of an audio scene.
[0114] In some instances, a user may indicate one or more preferences about audio capture, such as desired level of a specific type of source (e.g., speech, background sounds, a component that is not present in the scene but integrated into capture) by interacting with the graphical feedback (e.g., by interacting with a GUI). For example, the user may focus on a specific sound source, suppress a specific source, or adjust a specific source. At least in part by obtaining and recording such user input, some disclosed examples provide means for estimating and implementing the artistic intent of a user.
[0115] Figure 1A illustrates a schematic block diagram of an example device architecture 101 (in this example, an apparatus 101) that may be used to implement various aspects of the present disclosure. Architecture 101 includes but is not limited to servers and client devices, systems, etc., which may be configured to perform the methods that are described with reference to any or all of the disclosed figures. As shown, the architecture 101 includes central processing unit (CPU) 141, which is capable of performing various processes in accordance with a program stored in, for example, read only memory (ROM) 142 or a program loaded from, for example, storage unit 148 to random access memory (RAM) 143. The CPU 141 may be, for example, an electronic processor 141. The CPU 141 is an instance of what may be referred to herein as a “control system” or as an element of a control system. The control system may, for example, include a general purpose single- or multi-chip processor, one or more digital signal processors (DSPs), one or more application specific integrated circuits (ASICs), one or more field programmable gate arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic, one or more discrete hardware components, or combinations thereof. The ROM 142 and RAM 143 are instances of what may be referred to as a “memory system” or an element of a memory system. In RAM 143, the data required when CPU 141 performs the various processes is also stored, as required. In this example, CPU 141, ROM 142, and RAM 143 are connected to one another via bus 144. Input/output (I/O) interface 145 is also connected to bus 144. The bus 144 and the I/O interface 145 are instances of “interface system” elements as that term is used in this disclosure.
[0116] According to this example, the following components are connected to I/O interface 145: input unit 146, which may include a keyboard, a mouse, or the like; output unit 147, which may include a display system including one or more displays, a loudspeaker system including one or more loudspeakers, etc.; storage unit 148 including a hard disk, or another suitable storage device; and communication unit 149 including a network interface card such as a network card (e.g., wired or wireless). The communication unit 149 may be referred to herein as part of an interface system.
[0117] In some implementations, the input unit 146 may include a microphone system that includes one or more microphones. In some examples, the microphone system may include two, three or more microphones in different positions, enabling capture of audio signals in various formats (e.g., mono, stereo, spatial, immersive, and other suitable formats).
[0118] According to some implementations, the output unit 147 may include systems with various numbers of loudspeakers. Output unit 147 (depending on the capabilities of the host device) may be capable of rendering audio signals in various formats (e.g., mono, stereo, immersive, binaural, and other suitable formats).
[0119] In some embodiments, communication unit 149 is configured to communicate with other devices (e.g., via a network). Drive 150 is also connected to I/O interface 145, as required. Removable medium 151, such as a magnetic disk, an optical disk, a magneto-optical disk, a flash drive or another suitable removable medium is mounted on drive 150, so that a computer program read therefrom is installed into storage unit 148, as required. A person skilled in the art would understand that although apparatus 101 is described as including the above-described components, in real applications, it is possible to add, remove, and/or replace some of these components and all such modifications or alterations fall within the scope of the present disclosure.
[0120] In accordance with example embodiments of the present disclosure, the processes described above may be implemented as computer software programs or on a computer-readable storage medium. For example, embodiments of the present disclosure include a computer program product including a computer program tangibly embodied on a machine readable medium, the computer program including program code for performing methods. In such embodiments, the computer program may be downloaded and mounted from the network via the communication unit 149, and/or installed from the removable medium 151, as shown in Figure 1A.
[0121] Figure 1A illustrates a schematic block diagram of an example device architecture 101 (in this example, an apparatus 101) that may be used to implement various aspects of the present disclosure. Architecture 101 includes but is not limited to servers and client devices, systems, etc., which may be configured to perform the methods that are described with reference to any or all of the disclosed figures. In some examples, the apparatus 101 may be a mobile display device, such as a cell phone. As shown, the architecture 101 includes central processing unit (CPU) 141, which is capable of performing various processes in accordance with a program stored in, for example, read only memory (ROM) 142 or a program loaded from, for example, storage unit 148 to random access memory (RAM) 143. The CPU 141 may be, for example, an electronic processor 141. The CPU 141 is an instance of what may be referred to as a “control system” or an element of a control system. The ROM 142 and RAM 143 are instances of what may be referred to as a “memory system” or an element of a memory system. In RAM 143, the data required when CPU 141 performs the various processes is also stored, as required. In this example, CPU 141, ROM 142, and RAM 143 are connected to one another via bus 144.
Input/output (I/O) interface 145 is also connected to bus 144. The bus 144 and the I/O interface 145 are instances of “interface system” elements as that term is used in this disclosure.
[0122] According to this example, the following components are connected to I/O interface 145: input unit 146, which may include a keyboard, a mouse, or the like; output unit 147, which may include a display system including one or more displays, a loudspeaker system including one or more loudspeakers, etc.; storage unit 148 including a hard disk, or another suitable storage device; and communication unit 149 including a network interface card such as a network card (e.g., wired or wireless). The communication unit 149 may be referred to herein as part of an interface system.
[0123] In some implementations, the input unit 146 may include a microphone system that includes one or more microphones. In some examples, the microphone system may include two, three or more microphones in different positions, enabling capture of audio signals in various formats (e.g., mono, stereo, spatial, immersive, and other suitable formats).
[0124] According to some implementations, the output unit 147 may include systems with various numbers of loudspeakers. Output unit 147 (depending on the capabilities of the host device) may be capable of rendering audio signals in various formats (e.g., mono, stereo, immersive, binaural, and other suitable formats).
[0125] In some embodiments, communication unit 149 is configured to communicate with other devices (e.g., via a network). Drive 150 is also connected to I/O interface 145, as required.
Removable medium 151, such as a magnetic disk, an optical disk, a magneto-optical disk, a flash drive or another suitable removable medium is mounted on drive 150, so that a computer program read therefrom is installed into storage unit 148, as required. A person skilled in the art would understand that although apparatus 101 is described as including the above-described components, in real applications, it is possible to add, remove, and/or replace some of these components and all such modifications or alterations fall within the scope of the present disclosure.
[0126] In accordance with example embodiments of the present disclosure, the processes described above may be implemented as computer software programs or on a computer-readable storage medium. For example, embodiments of the present disclosure include a computer program product including a computer program tangibly embodied on a machine readable medium, the computer program including program code for performing methods. In such embodiments, the computer program may be downloaded and mounted from the network via the communication unit 149, and/or installed from the removable medium 151, as shown in Figure 1A.
[0127] According to some examples, the CPU 141 may be, or may be part of, a control system that is configured to perform some or all of the methods that are disclosed herein. In some examples, the control system may be configured to receive, via an interface system, audio data from a microphone system and video data from a camera system. According to some examples, the control system may be configured to identify, based at least in part on the audio data and the video data, two or more audio sources in an audio scene and to estimate at least one audio characteristic of each of the two or more audio sources. In some examples, the control system may be configured to store audio data and video data received during a capture phase.
According to some examples, the control system may be configured to control a display of a display system to display images corresponding to the video data and to display, prior to and during the capture phase, a graphical user interface (GUI) overlaid on the images. In some examples, the GUI may include an audio source image corresponding to the at least one audio characteristic of each of the two or more audio sources and one or more user input areas for receiving user input via the touch sensor system. According to some examples, the control system may be configured to receive, via the interface system and prior to the capture phase, user input via the one or more user input areas and to cause audio data received during the capture phase to be modified according to the user input.
[0128] Figure 1B illustrates a schematic block diagram of an example CPU 141 implemented in the device architecture 101 of Figure 1A that may be used to implement various aspects of the present disclosure. The CPU 141 includes an electronic processor 160 and a memory 161. The electronic processor 160 is electrically and/or communicatively connected to the memory 161 for bidirectional communication. The memory 161 stores encoding software 162 and decoding software 163. The memory 161 may be, for example, a ROM, a RAM, or another non-transitory computer readable medium. The electronic processor 160 may implement the encoding software 162 stored in the memory 161 to perform one, some or all of the disclosed methods. Additionally, the electronic processor 160 may implement the decoding software 163 stored in the memory 161 to perform one, some or all of the disclosed methods.
[0129] Generally, various example embodiments of the present disclosure may be implemented in hardware or special purpose circuits (e.g., control circuitry), software, logic or any combination thereof. For example, the units discussed above can be executed by control circuitry (e.g., CPU 141 in combination with other components of Figure 1A), thus, the control circuitry may be performing the actions described in this disclosure. Some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device (e.g., control circuitry). While various aspects of the example embodiments of the present disclosure are illustrated and described as block diagrams, flowcharts, or using some other pictorial representation, it will be appreciated that the blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
[0130] Additionally, various blocks shown in the flowcharts may be viewed as method steps, and/or as operations that result from operation of computer program code, and/or as a plurality of coupled logic circuit elements constructed to carry out the associated function(s). For example, embodiments of the present disclosure include a computer program product including a computer program tangibly embodied on a machine readable medium, the computer program containing program codes configured to carry out the methods as described above.
[0131] In the context of the disclosure, a machine-readable medium may be any tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may be non-transitory and may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of the machine-readable storage medium would include an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random-access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
[0132] Computer program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These computer program codes may be provided to a processor of a general-purpose computer, special purpose computer, or other programmable data processing apparatus that has control circuitry, such that the program codes, when executed by the processor of the computer or other programmable data processing apparatus, cause the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on a computer, partly on the computer, as a stand-alone software package, partly on the computer and partly on a remote computer or entirely on the remote computer or server or distributed over one or more remote computers and/or servers.
[0133] Figure 1C shows examples of a time line and of time intervals during which some disclosed processes may occur. The processes may, for example, be performed at least in part by the apparatus 101 of Figure 1A, or by a similar apparatus. In these examples, the time line 102 indicates a start of capture application time 104, a start of capture time 106, an end of capture time 108, a post-capture editing beginning time 110 and a post-capture editing ending time 112. According to these examples, the time line 102 is broken between the end of capture time 108 and the post-capture editing beginning time 110, to indicate that this time interval could be variable.
[0134] The start of capture application time 104 indicates when one or more software applications relating to audio and video capture are initiated, whereas the start of capture time 106 indicates when the actual audio and video capture process begins. In other words, the start of capture time 106 indicates when the actual audio and video recording(s) begin. In these examples, the time interval between the start of capture application time 104 and the start of capture time 106 is referred to as the pre-capture phase 114, the time interval between the start of capture time 106 and the end of capture time 108 is referred to as the capture phase 116, and the time interval between the post-capture editing beginning time 110 and the post-capture editing ending time 112 is referred to as the post-capture phase 118, during which a post-capture editing process 124 may take place. The post-capture editing process 124 may be performed, at least in part, by the device used during the capture phase 116. However, in some examples the post-capture editing process 124 may be performed, at least in part, by one or more other devices, such as by one or more servers of a cloud-based audio processing service.
[0135] According to some examples, analysis of the composition of the audio scene may begin during the pre-capture phase 114. The analysis may involve, for example, identification of audio sources, classifying and labeling of audio sources, associating audio sources with the components of the video scene (e.g., an active talker that is present in the video scene), associating audio sources with components that are not present in the video scene (e.g., a talking person performing the capture), etc. During the analysis of the audio scene, audio sources may be associated with a spatial position of an object present in the video (e.g., the position of an active talker present in the video scene). The analysis may involve estimating the level of each audio source. At least some results of the audio scene analysis may be provided to a component or module configured for generating a user interface, for example a graphical user interface (GUI) that includes graphical feedback regarding the composition of an audio scene. Some examples of such GUIs are provided in this disclosure.
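By way of illustration only, the per-source level estimates mentioned above could be computed as RMS levels of separated source signals; the separation step itself is assumed to exist elsewhere and is not shown, and the names below are hypothetical rather than part of any disclosed embodiment.

```python
import numpy as np

def rms_level_dbfs(source_signal: np.ndarray) -> float:
    """RMS level of a separated source signal in dB relative to full scale (dBFS).

    Assumes floating-point samples normalized to the range [-1.0, 1.0].
    """
    eps = 1e-12
    rms = float(np.sqrt(np.mean(source_signal.astype(np.float64) ** 2)))
    return 20.0 * np.log10(rms + eps)

# Hypothetical usage, where `separated` maps a semantic label to its separated waveform:
# levels = {label: rms_level_dbfs(waveform) for label, waveform in separated.items()}
```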
[0136] Accordingly, during the pre-capture phase 114 and the capture phase 116, a user may interact with a GUI that is providing feedback regarding the current composition of the audio scene. There may be various possible types of user interaction. One type of user interaction involves alterations to the audio scene achieved by interaction with the audio scene itself, for example by approaching or retreating from an audio source. Another type of user interaction involves interactions with the GUI. Examples of user interaction with the GUI may include adjustment of audio levels between acoustic background and acoustic foreground, adjustment of the audio level of one or more talkers (e.g., by touching displayed audio source indicators associated with the talker), etc. In some examples, a user may interact with the GUI in order to start or end an instance of the capture phase 116 — for example by touching a virtual recording button of the GUI that is presented by the capture application — to start or end the capture of a particular video clip and the associated audio data.
[0137] In Figure 1C, the arrow 120 corresponds to a time interval during which a process of registering user input 121 may be performed and the arrow 122 corresponds to a time interval during which a process of providing feedback regarding audio scene composition 123 may be performed. In some examples, the user input for the process of registering user input 121 may be obtained via touch sensor signals corresponding to one or more user input areas of a GUI.
[0138] According to some examples, the process of providing feedback regarding audio scene composition 123 also may involve the GUI. The GUI may, for example, be overlaid on images corresponding to video data and audio data obtained by the apparatus 101, or by a similar apparatus. The images corresponding to the audio data may, for example, include shapes, outlines, etc., corresponding to one or more audio sources. In some examples, the GUI may include an audio source image corresponding to at least one audio characteristic — for example, volume or level — of each of the audio sources. In some such examples, the user input may indicate a user’s desired modification to at least one audio characteristic, e.g., a user’s desired increase or reduction in volume of one or more audio sources. Some disclosed methods may involve causing — e.g., by a control system — audio data received during the capture phase 116 to be modified according to at least some of the user input received via the GUI. The audio data modification may or may not occur during the capture phase 116, depending on the implementation and the type of modification. Some such methods may involve storing the modified audio data. Some disclosed methods may involve creating and storing user input metadata corresponding to user input received via the user input area.
[0139] In the examples shown in Figure 1C, one may observe that the process of registering user input 121 may potentially be performed, and the process of providing feedback regarding audio scene composition 123 may also potentially be provided, during some or all of the pre-capture phase 114, as well as during some or all of the capture phase 116. Accordingly, in some examples the GUI may be displayed prior to the capture phase 116 — during the pre-capture phase 114 — as well as during the capture phase 116. Such examples provide some potential advantages. For example, a user may be able to evaluate information regarding the audio scene — including but not limited to level information corresponding to one or more audio sources — and to provide user input regarding this audio scene information, prior to the commencement of the capture phase 116. During the capture phase 116, the user may be able to devote relatively more attention to other aspects of the capture phase 116, such as aspects of the video capture. Moreover, in some implementations information corresponding to user input obtained prior to, as well as during, the capture phase 116 may be “registered,” e.g., may be stored as user input metadata. In some such examples, post-capture audio processing may be based, at least in part, on the registered user input, for example as described below with reference to process 127.
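One possible, purely hypothetical shape for such registered user input metadata is sketched below; the field names and the JSON serialization are assumptions chosen for illustration, not a format defined by this disclosure.

```python
from dataclasses import dataclass, asdict
from typing import List
import json

@dataclass
class UserInputEvent:
    """One registered user interaction with the GUI, retained for post-capture processing."""
    time_offset_s: float      # time of the interaction relative to the start of capture
    phase: str                # "pre-capture" or "capture"
    source_label: str         # e.g., "human talker #1" or "background"
    requested_gain_db: float  # desired level change for that source

def register_user_input(log: List[UserInputEvent], time_offset_s: float,
                        phase: str, source_label: str, gain_db: float) -> None:
    log.append(UserInputEvent(time_offset_s, phase, source_label, gain_db))

def to_metadata_json(log: List[UserInputEvent]) -> str:
    """Serialize the registered input so it can be appended to the captured content."""
    return json.dumps([asdict(event) for event in log])
```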
[0140] In general, it is not required that the final audio outcome of the capture be created during the capture phase 116. For example, some of the audio manipulation disclosed herein may require operating a high-quality audio source separation algorithm, which could be prohibitive in terms of its computational cost during the capture phase 116. Instead, some disclosed examples involve using a relatively lower-quality audio source separation method during the capture phase 116 and using a relatively higher-quality audio source separation method during a post-capture editing process 124. The relatively lower-quality audio source separation method may have relatively lower computational requirements and may be more suitable for providing real-time feedback during the capture phase 116. In some examples, the capture phase 116 may involve estimating contributions of individual audio components to the audio scene, presenting the estimated contributions of the individual audio components to the user performing capture, and registering the intent of the user according to user input, which may include some manipulation that the user desires to be applied to the sources (e.g., adjustment of level of a source, introducing sources that are not present in the microphone feed, etc.). Information, for example metadata, corresponding to the user input may be appended to the captured content, and the final version of the audio data may be produced during the post-capture editing process 124 and may be based at least in part on the metadata. While the final version of the audio data may be derived during the post-capture editing process 124, according to some examples the alterations made to an audio scene may also be undone during the post-capture editing process 124 should the user wish to do so. For example, a user may preview the automated rendition of a scene based on the user input registered during the pre-capture phase 114 and the capture phase 116. During or after the preview, the user may decide to fall back to default capture or introduce a further adjustment to the composition of the scene.
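A minimal sketch of this two-tier approach, assuming two otherwise unspecified separation back ends, is given below; both classes are placeholders rather than references to any actual separation library.

```python
class LightweightSeparator:
    """Placeholder for a low-complexity separation method suitable for real-time feedback."""
    def separate(self, mixture):
        raise NotImplementedError("real-time separation back end goes here")

class HighQualitySeparator:
    """Placeholder for a higher-quality separation method used during post-capture editing."""
    def separate(self, mixture):
        raise NotImplementedError("offline separation back end goes here")

def make_separator(during_capture: bool):
    # During the capture phase, favor low latency and low computational cost;
    # during the post-capture editing process, favor separation quality.
    return LightweightSeparator() if during_capture else HighQualitySeparator()
```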
[0141] Accordingly, in the examples shown in Figure 1C, the post-capture editing process 124 may involve a process of enabling fallback to real-world capture 125. In some such examples, both modified and unmodified versions of the captured audio data may be stored. According to some examples, the process of enabling fallback to real-world capture 125 may involve allowing a user, for example during the post-capture editing process 124, to select a degree to which the final version of the audio will include modified or unmodified audio corresponding to at least one sound source. In some such examples, the user may be presented with a GUI that includes a virtual slider, a dial, or another such virtual tool with which the user may interact to indicate the degree of audio modification in the final version of the audio.
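As a minimal sketch of such a fallback control, assuming that the augmented and unmodified captures are stored as time-aligned waveforms of equal length, a single slider value could drive a linear crossfade between the two; the interpretation of the slider value as a linear ratio is an assumption made here for illustration.

```python
import numpy as np

def mix_fallback(augmented: np.ndarray, unmodified: np.ndarray, ratio: float) -> np.ndarray:
    """Blend augmented audio capture with real-world (unmodified) audio capture.

    ratio = 1.0 keeps the fully augmented capture, ratio = 0.0 falls back entirely
    to the unmodified capture, and intermediate values interpolate between the two.
    """
    if augmented.shape != unmodified.shape:
        raise ValueError("expected time-aligned waveforms of equal shape")
    ratio = float(np.clip(ratio, 0.0, 1.0))
    return ratio * augmented + (1.0 - ratio) * unmodified
```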
[0142] In the examples shown in Figure 1C, the post-capture editing process 124 may involve a process 127 of enabling post-processing guided by registered user input. According to some examples, the process 127 may involve referring to stored input metadata corresponding to user input received via the user input area during the pre-capture phase 114, during the capture phase, or both. In some examples, the post-capture editing process 124 may involve one or more video editing processes.
[0143] After the post-capture editing process 124 is complete, in this example process 128 involves creating one or more multimedia files that include the final versions of the audio data and the video data. Alternatively, or additionally, process 128 may involve creating and transmitting a bitstream that corresponds to the final versions of the audio data and the video data. According to some examples, the bitstream may be an immersive voice and audio services (IVAS) encoded bitstream. Some examples are provided herein for encoding and decoding IVAS bitstreams.
[0144] Figure 1D is a flow diagram that outlines various example methods 160 according to some disclosed implementations. The example methods 160 may be partitioned into blocks, such as blocks 161, 163, 165, 166, 167, 168, 169, 170, 171, 173 and 175. The various blocks may be described as operations, processes, methods, steps, acts or functions. The blocks of methods 160, like other methods described herein, are not necessarily performed in the order indicated. In some implementations, one or more of the blocks of methods 160 may be performed concurrently. Moreover, some implementations of methods 160 may include more or fewer blocks than shown and/or described. The blocks of methods 160 may be performed by one or more devices, for example, the device that is shown in Figure 1A.
[0145] In these examples, blocks 161, 163, 165 and 167 are performed during a pre-capture phase 114. According to these examples, blocks 168, 169 and 171 are performed during a capture phase 116. In these examples, blocks 173 and 175 are performed during a post-capture phase 118. The pre-capture phase 114, capture phase 116 and post-capture phase 118 may, for example, be as described herein with reference to Figure 1C. The functions of blocks 163, 165, 166, 167, 168, 169 and 171 also may be referred to herein as an “audio interaction framework.” Processing may commence at block 161.
[0146] Block 161 involves “start of capture application.” In some examples, the capture application may be a software application for capturing audio and video data. The capture application may, in some examples, be stored in a memory system of a mobile device, such as a cell phone. According to some examples, block 161 may involve initializing or starting the capture application, for example responsive to user input received by the device. Processing may continue to block 163.
[0147] Block 163 involves “analyze audio scene and generate GUI with customization options.” According to some examples, block 163 may involve receiving and analyzing audio data from a microphone system. In some examples, block 163 may involve receiving and analyzing video data from a camera system. According to some examples, block 163 may involve identifying, based at least in part on the audio data and the video data, one or more audio sources in an audio scene. Audio sources may also be referred to herein as sound sources. In some examples, analyzing the audio scene may involve creating an inventory of sound sources in the audio scene. The inventory of sound sources may include one or more actual sound sources and one or more potential sound sources. The potential sound source(s) may, for example, be identified according to the video data even if audio data from the potential sound source(s) is not detected or is below a threshold level. According to some examples, block 163 may involve classifying two or more audio sources into two or more audio source categories, which may include a background category and a foreground category. In some examples, block 163 may involve presenting a GUI that includes a user input area portion corresponding to each of the two or more audio source categories. According to some examples, block 163 may involve deriving feedback and customization options. In some examples, block 163 may involve presenting at least one feedback option, at least one customization option, or both, on a GUI. In some examples, block 163 may involve overlaying one or more images corresponding to the at least one feedback option, at least one customization option, or both, on displayed images that correspond to video data from the camera system. In some examples, the GUI includes one or more areas for receiving user input. Processing may continue to block 165.
[0148] Block 165 involves determining whether “user input [is] received from [the] GUI.” According to some examples, block 165 may involve determining whether user input is received from an area of a touch sensor system corresponding to one or more areas of the GUI that are configured for receiving user input. Processing may continue to block 163 or to block 166. If it is determined in block 165 that user input is not received from the GUI, processing reverts to block 163 in this example. If it is determined in block 165 that user input is received from the GUI, processing continues to block 166 in this example.
[0149] Block 166 involves determining whether a “start of capture” event has occurred. In some examples, block 166 may involve determining whether user input received via the above-referenced GUI corresponds with the initiation of the capture phase 116, during which time audio data received by the microphone system and video data received by the camera system are stored in memory. In some examples, the methods 160 may involve storing unmodified audio data received during the capture phase, storing modified audio data received during the capture phase — which may have been modified according to user input — or both. Processing may continue to block 168 or to block 167. If it is determined in block 166 that user input from the GUI corresponds to the initiation of the capture phase 116, in this example the process continues to block 168. If it is determined in block 166 that user input from the GUI does not correspond to the initiation of the capture phase 116, in this example the process continues to block 167.
[0150] Block 167 involves “register user input.” In some examples, the user input received via the GUI may be responsive to a customization option provided by the GUI, such as an option to increase the volume of a sound source, to decrease the volume of a sound source, to select a potential sound source, etc. According to some examples, registering the user input may involve creating and storing metadata corresponding to the user input. Registered user input may be used to guide the modification of audio data during the capture phase 116, the post-capture phase 118, or both. Processing may continue to block 163.
[0151] Block 168 involves “analyze audio scene and generate GUI with customization options.” According to some examples, block 168 may simply be a continuation of the “audio interaction framework” that includes block 163 and that continues from block 163 throughout the pre-capture phase 114 and the capture phase 116. Processing may continue to block 169.
[0152] Block 169 involves determining whether “user input is received from [the] GUI.” In this example the GUI is being presented during the capture phase 116. Processing may continue to block 168 or to block 170. If it is determined in block 169 that user input is not received from the GUI, processing reverts to block 168 in this example. If it is determined in block 169 that user input is received from the GUI, processing continues to block 170 in this example.
[0153] Block 170 involves determining whether an “end of capture” event has occurred. In some examples, block 170 may be responsive to user input, which may or may not be received via the above-referenced GUI. In this example, block 170 may involve determining whether user input received via the above-referenced GUI corresponds with the end of the capture phase 116. If it is determined in block 170 that user input from the GUI corresponds to the end of the capture phase 116, in some examples the process continues to block 173. However, as noted elsewhere herein, there may be multiple instances of starting and ending capture prior to block 173. Accordingly, the arrow linking blocks 170 and 173 is broken in Figure 1D. If it is determined in block 170 that user input from the GUI does not correspond to the end of the capture phase 116, in this example the process continues to block 171.
[0154] Block 171 involves “register user input.” In some examples, the user input may be responsive to a customization option provided by the GUI, such as an option to increase the volume of a sound source, to decrease the volume of a sound source, to select a potential sound source, etc. As noted elsewhere herein, registering the user input may involve creating and storing metadata corresponding to the user input, and registered user input may be used to guide the modification of audio data during the capture phase 116, the post-capture phase 118, or both. For example, if user input is received indicating a user’s intention to increase the volume of a selected sound source, a beamforming process may be provided during the capture phase 116 in order to augment the sound from the selected sound source. This is one example of what may be referred to herein as “augmented audio capture.” Processing may continue to block 168.
[0155] Block 173 involves “post-capture editing.” According to some examples, block 173 may be performed by a control system of the device involved with audio and video capture, by one or more remote devices — such as one or more servers — or a combination thereof. In some examples, block 173 may involve causing audio data received during the capture phase to be modified according to registered user input, such as metadata corresponding to user input. According to some examples, block 173 may involve presenting a GUI that includes at least one user input area allowing the selection of a potential sound source or a candidate sound source. In some examples, user input may previously have been received corresponding to the selection of a potential sound source or a candidate sound source. According to some examples, block 173 may involve replacement of a potential sound source or a candidate sound source by external audio or synthetic audio. For example, block 173 may involve the replacement of music detected in the audio data from a microphone system — which is an example of a “candidate sound source” — with downloaded music. In another example, block 173 may involve adding external audio or synthetic audio corresponding to a selected potential sound source, for example adding external audio or synthetic audio corresponding to a ticking clock when the selected potential sound source is a clock detected via the video data.
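The replacement described above could, for example, be realized by subtracting the separated stem of the candidate sound source from the captured mix and adding an external track in its place; everything below, including the availability of a separated stem and the RMS-based level matching, is an assumption made for illustration rather than a description of any particular embodiment.

```python
import numpy as np

def replace_source(mix: np.ndarray, source_stem: np.ndarray,
                   replacement: np.ndarray) -> np.ndarray:
    """Remove a separated candidate sound source from the mix and insert external audio.

    All three signals are assumed to be mono, time-aligned and of equal length. The
    replacement is level-matched to the RMS of the removed stem so that the overall
    balance of the audio scene is roughly preserved.
    """
    eps = 1e-12
    residual = mix - source_stem  # captured mix with the candidate source removed
    stem_rms = float(np.sqrt(np.mean(source_stem ** 2)))
    replacement_rms = float(np.sqrt(np.mean(replacement ** 2)))
    gain = stem_rms / (replacement_rms + eps)
    return residual + gain * replacement
```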
[0156] As noted elsewhere herein, the methods 160 may involve storing unmodified audio data received during the capture phase. According to some examples, block 173 may involve presenting a GUI that includes at least one user input area — such as a virtual slider, a virtual dial, etc. — configured for receiving a user selection of a ratio between modified audio data corresponding to an augmented audio capture and unmodified audio data corresponding to a “real-world” audio capture.
[0157] According to some examples, the pre-capture phase 114, the capture phase 116, or both, may involve performing a first sound source separation process. In some examples, block 173 may involve performing a second sound source separation process, for example by one or more servers. This is potentially advantageous because the second sound source separation process may be more accurate but may be too computationally intensive for the capture device to perform during the pre-capture phase 114 or the capture phase 116. Processing may continue to block 175.
[0158] Block 175 involves preparing one or more media files, storing one or more media files, transmitting one or more media files, or combinations thereof. In this example, the one or more media files include the final version of the audio data resulting from the post-capture editing process of block 173. In some examples, block 175 may involve providing the final version of the audio data and the video data in a standard media container, such as Moving Picture Experts Group (MPEG)-4 Part 14 (MP4). According to some examples, block 175 may involve transmitting a bitstream that includes the final version of the audio data and the video data. According to some examples, the bitstream may be an immersive voice and audio services (IVAS) encoded bitstream. Some examples are provided herein for encoding and decoding IVAS bitstreams.
[0159] Figure 2A shows examples of a time line and of time intervals when some disclosed processes may occur. The processes may, for example, be performed at least in part by the apparatus 101 of Figure 1A, or by a similar apparatus. In these examples, the time line 202 indicates a start of capture application time 204, a start of capture time 206a, an end of capture time 208a, a start of capture time 206b and an end of capture time 208b. According to these examples, a first instance of capture, for the capture of clip #1, occurs during the time interval between the start of capture time 206a and the end of capture time 208a. Here, a second instance of capture, for the capture of clip #2, occurs during the time interval between the start of capture time 206b and the end of capture time 208b. In these examples, the time line 202 is broken between the end of capture time 208a and the start of capture time 206b, to indicate that this time interval could be variable.
[0160] In these examples, the start of capture application time 204 is an instance of the start of capture application time 104 of Figure 1C and may be substantially as described with reference to Figure 1C. Similarly, the start of capture times 206a and 206b are instances of the start of capture time 106 of Figure 1C, and the end of capture times 208a and 208b are instances of the end of capture time 108 of Figure 1C, and may be substantially as described with reference to Figure 1C. As with other disclosed examples, the types, numbers and arrangements of elements shown in Figure 2A are merely provided by way of example. For example, although the capture phase 116 of Figure 2A includes two instances of capture, other examples may include more or fewer instances of capture.
[0161] Alterations to the composition of an audio scene may be performed in several ways, depending on the particular implementation. Some options are illustrated in Figure 2A. According to these examples, user interaction #1 is detected during the pre-capture phase 114 and user interaction #2 is detected during the capture phase 116. User interactions #1 and #2 may, for example, be detected according to a user’s interactions with a GUI such as one of those disclosed herein. For example, the GUI may be overlaid on images corresponding to the video data obtained by and displayed by a capture device, such as a cell phone or other mobile device. The GUI may include one or more audio source images corresponding to at least one audio characteristic and may include one or more user input areas.
[0162] According to some examples, user interaction #1 may correspond to an adjustment of a level of a sound source during the pre-capture phase 114. In some examples, such an adjustment may immediately affect the graphical feedback provided to the user. For example, a virtual level or volume control slider, dial, etc., may change its size, position, etc. In the examples shown in Figure 2A, the effect of user interaction #1 is applied to the audio captured during the subsequent capture phase 116 and is therefore referred to as a “causal interaction.”
[0163] In some examples, user interaction #2 — during the capture phase 116 — may correspond to user input to adjust the ratio between acoustic background and acoustic foreground, to adjust the level of an audio source, etc. In the example shown in Figure 2A, the effect of user interaction #2 is applied in a “retro-causal” way, by applying the effect of user interaction #2 from the beginning of the capture of clip #1 at the start of capture time 206a. The retro-causal effect may, for example, be automatically applied by the capture device during the capture phase 116, at the end of the capture phase 116, or during a post-capture phase 118.
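One way to realize such a retro-causal adjustment, sketched here under the assumption that the audio of the current clip captured so far is retained in a buffer, is to re-apply the newly requested gain from the start of the clip; the buffering strategy and function name are assumptions made for illustration.

```python
import numpy as np

def apply_retrocausal_gain(clip_so_far: np.ndarray, gain_db: float) -> np.ndarray:
    """Apply a gain requested mid-capture to everything recorded since the clip began.

    In a complete system the same gain would also be applied to samples captured
    after the interaction, so the whole clip receives a consistent adjustment.
    """
    gain_linear = 10.0 ** (gain_db / 20.0)
    return clip_so_far * gain_linear
```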
[0164] Figure 2B is a flow diagram that outlines various example methods 250 according to some disclosed implementations. The example methods 250 may be partitioned into blocks, such as blocks 251, 252, 253, 255, 256, 257, 259, 261, 263 and 265. The various blocks may be described as operations, processes, methods, steps, acts or functions. The blocks of methods 250, like other methods described herein, are not necessarily performed in the order indicated. In some implementations, one or more of the blocks of methods 250 may be performed concurrently. Moreover, some implementations of methods 250 may include more or fewer blocks than shown and/or described. The blocks of methods 250 may be performed by one or more devices, for example, the device that is shown in Figure 1A.
[0165] According to these examples, blocks 251, 252, 253, 255, 256, 257 and 259 are performed during a capture phase 116. In these examples, blocks 263 and 265 are performed during a post-capture phase 118. The capture phase 116 and post-capture phase 118 may, for example, be as described herein with reference to Figure 1C. Processing may commence at block 251.
[0166] Block 251 involves the “start of capture application.” In this example, block 251 corresponds with block 161 of Figure 1D. Processing may continue to block 252.

[0167] Block 252 involves “obtaining capture preferences.” In this example, the capture preferences are, or include, audio capture preferences. Such audio capture preferences may be obtained in various ways. According to some examples, one or more audio capture preferences may be obtained in block 252 according to explicit action by a user, for example, when the user explicitly sets the audio capture preferences by interacting with a GUI. Accordingly, block 252 may involve receiving user input from a GUI corresponding to one or more capture preferences.

[0168] In some examples, block 252 may involve obtaining previously-obtained user preference data from a memory. According to some examples, one or more audio capture preferences may be obtained by accessing a stored data structure that includes audio capture preferences. According to some implementations, audio capture preferences may be represented in the form of one or more lookup tables that include a list of audio sources along with a corresponding signal level characteristic for each of the audio sources. Such lookup tables may take the form of a data structure including a list of detected audio sources present in the audio scene. In some such examples, the list of detected audio sources may include foreground audio objects and one or more background components of the audio scene. The background components may include multiple context-dependent sound sources, such as multiple vehicle sound sources, background talkers and other street noise sound sources for a street scene, or background music, background talkers and dining-related and eating-related sounds for a cafe scene. Within the data structure, each audio source may be associated with a semantic label such as “human talker #1,” “human talker #2,” etc., for foreground audio objects, or “street noise,” “babble noise,” “wind noise,” etc., for background components. In some examples, each audio source in the data structure may be associated with a sequence of timestamps and a binary flag indicating the presence or absence of that audio source, a signal level measurement for that audio source and a detection probability for that audio source.
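By way of illustration only, such a per-source data structure might be sketched as follows. This is a minimal sketch in Python; the class name and field names (label, is_foreground, timestamps, present_flags, levels_db, detection_probs) are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AudioSourceEntry:
    """One row of a hypothetical lookup table of detected audio sources."""
    label: str                      # semantic label, e.g. "human talker #1" or "street noise"
    is_foreground: bool             # foreground audio object vs. background component
    timestamps: List[float] = field(default_factory=list)       # seconds from start of capture
    present_flags: List[bool] = field(default_factory=list)     # presence/absence per timestamp
    levels_db: List[float] = field(default_factory=list)        # signal level measurement per timestamp
    detection_probs: List[float] = field(default_factory=list)  # detection probability per timestamp

# A small illustrative table for a street scene
source_table = [
    AudioSourceEntry(label="human talker #1", is_foreground=True),
    AudioSourceEntry(label="street noise", is_foreground=False),
    AudioSourceEntry(label="wind noise", is_foreground=False),
]
```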
[0169] In some examples, the semantic labels for the objects may be generated by an instance of an “audio tagger” that is being implemented by the control system of a device being used for audio capture. Such an audio tagger may be implemented in many ways, depending on the particular implementation. According to some examples, the audio tagger may be implemented as a trained neural network constructed according to the so-called YAMNet (Yet Another Multiscale Convolutional Neural Network) architecture, which may employ a MobileNet architecture operating on a Mel spectrogram of an audio signal. Examples of MobileNets are described in Howard, Andrew G. et al., "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications" (arXiv preprint arXiv:1704.04861 (2017)), which is hereby incorporated by reference. Such an architecture facilitates assigning tags from a large set of tags to segments of a waveform. The assigned tags are examples of semantic labels.
[0170] More generally, an audio tagger may be constructed by using a convolutional neural network (CNN). Such an audio tagger may operate on a spectral representation of an audio signal (e.g., a mel spectrogram) and may update a set of tags along with confidence intervals for these tags with some pre-defined time resolution (e.g., 1 second). The CNN may be trained to provide an output including indications of tags (within a large set of all possible tags) corresponding to audio objects present in that segment along with another output representing a confidence level for the tags. The output representing the confidence level may, for example, be a probability of object detection, such as a value between zero and one, where a value closer to 1.0 would indicate a large confidence and a value close to zero would indicate a low confidence. Such a confidence level may be determined according to the estimated object detection probability, which may be computed by the same CNN.
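As a rough illustration of the kind of CNN-based tagger described above, the following sketch (using PyTorch and torchaudio, purely as an assumed toolchain) maps a one-second mel spectrogram to per-tag detection probabilities. The tag set, network shape and time resolution are placeholders and do not represent the architecture of any particular tagger referenced in the disclosure.

```python
import torch
import torch.nn as nn
import torchaudio

TAGS = ["speech", "music", "traffic", "wind", "babble"]  # illustrative tag set only

class SimpleAudioTagger(nn.Module):
    """Toy CNN tagger: mel spectrogram in, per-tag detection probabilities out."""
    def __init__(self, n_tags: int = len(TAGS)):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, n_tags)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, 1, n_mels, time); output: (batch, n_tags) detection probabilities
        h = self.features(mel).flatten(1)
        return torch.sigmoid(self.classifier(h))

# One-second analysis window at 16 kHz, as an example time resolution
mel_transform = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=64)
waveform = torch.randn(1, 16000)               # placeholder audio segment
mel = mel_transform(waveform).unsqueeze(0)     # shape (1, 1, n_mels, time)
with torch.no_grad():
    probs = SimpleAudioTagger()(mel)           # untrained here, so values are arbitrary
print(dict(zip(TAGS, probs.squeeze(0).tolist())))
```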
[0171] According to some examples, the signal level for an individual audio object may be determined as follows. An instance of what is referred to herein as the “first source separator” may be used to find the distribution of signal level (or signal energy) between all foreground components of the audio scene and the overall signal level of the background audio scene components. Then, knowing the signal level of the foreground audio scene components and the set of foreground components present in a segment of an audio signal, the control system may estimate the weighted signal levels of individual foreground audio scene components having known detection probabilities, by weighting the total signal level of each foreground audio scene component with respect to the detection probability of that foreground audio scene component. The detection probabilities may be provided via an analysis of the video signal associated with the audio signal, for example, by detecting probable active talkers in the audio signal and determining corresponding confidence levels for each probable active talker (e.g., according to mouth movements that correspond with time intervals of speech).
[0172] In some examples, object detection in the video associated with the audio scene may be used to perform an adjustment of the signal energy distribution. For example, if multiple talkers are simultaneously present in the scene, a video analysis can be used to identify the active talkers. The instantaneous signal energy associated with all of the current talkers may be distributed between the active talkers.
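One simple way to realize the kind of probability-weighted distribution described in the two paragraphs above is sketched below. The function name and the normalization heuristic are assumptions made for illustration; they are not a prescribed algorithm.

```python
def distribute_foreground_level(total_fg_energy, detection_probs):
    """
    Split the total foreground signal energy among detected foreground sources,
    weighting each source by its detection probability (illustrative heuristic only).
    detection_probs: dict mapping source label -> probability in [0, 1].
    """
    total_weight = sum(detection_probs.values())
    if total_weight == 0.0:
        return {label: 0.0 for label in detection_probs}
    return {
        label: total_fg_energy * (p / total_weight)
        for label, p in detection_probs.items()
    }

# Example: two talkers, one clearly active according to the video analysis
print(distribute_foreground_level(1.0, {"human talker #1": 0.9, "human talker #2": 0.2}))
```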
[0173] According to some examples, the previously-obtained user preference data may have been obtained during a previous capture phase 116 or during a previous segment of the current capture phase 116 — for example, when a previous clip was being obtained during the current capture phase 116. In some examples, block 252 may involve obtaining one or more audio capture preferences by analyzing an audio clip obtained during a previous capture phase 116 — for example, by using analysis tools of the capture engine — and using the results of the analysis to populate settings of audio capture for the current capture phase 116. The one or more audio capture preferences obtained in block 252 may include a preferred ratio between (a) the total audio signal level of all foreground audio components and (b) the total audio signal level of the background audio component(s). Alternatively, or additionally, a user may have a preference that the audio level associated with the background component(s) in a scene involving human talkers should not exceed a certain level. Accordingly, the user’s preferences regarding foreground and background audio levels may be indicated by a preferred minimum foreground-to-background ratio of audio levels, a preferred maximum background audio level, or both. In some embodiments, obtaining the one or more audio capture preferences may involve receiving user input indicating a class of audio sources to focus on or emphasize. For example, a user may indicate that the capture should emphasize audio signals corresponding to human talkers, audio signals corresponding to musical instruments, etc.
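The capture preferences discussed above (a minimum foreground-to-background ratio, a maximum background level, an emphasized class of audio sources) could be held in a small container such as the following sketch; the class name, field names and default values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AudioCapturePreferences:
    """Hypothetical container for the audio capture preferences discussed above."""
    min_foreground_to_background_db: Optional[float] = 10.0  # preferred minimum level ratio
    max_background_level_db: Optional[float] = -30.0          # preferred background ceiling
    emphasized_class: Optional[str] = None                    # e.g. "human talkers"

prefs = AudioCapturePreferences(emphasized_class="human talkers")
```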
[0174] In some examples, after one or more audio scene preferences are known or set, the audio scene preference(s) may be used to facilitate the comparison of the current composition of the audio scene with respect to these preferences. Accordingly, in some examples block 252 — or another block of the methods 250, such as block 253 — may involve mapping one or more capture preferences to one or more elements of the current audio scene. Processing may continue to block 253.
[0175] Block 253 involves “capturing with feedback.” In this example, block 253 involves obtaining and storing audio data and video data during the capture phase 116 and providing feedback, for example via a GUI. According to some examples, the feedback provided in block 253 may involve signaling deviations in the audio scene being captured from typical scenarios, from audio capture preferences, or both. For example, the feedback provided in block 253 may indicate that a ratio between an acoustic background level and an acoustic foreground level for the current audio scene differs from some preferred level. In some examples, the feedback provided in block 253 may indicate that the level of one or more sound sources — such as blowing wind, traffic noise, etc. — is likely exceeding a desirable level. Processing may continue to block 255.
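As a purely illustrative example of the kind of deviation check that could drive the feedback of block 253, the following sketch compares estimated foreground and background levels against two assumed preference thresholds and returns a message suitable for display. The threshold values, function name and message strings are assumptions, not part of the disclosure.

```python
def foreground_background_feedback(fg_level_db, bg_level_db,
                                   min_ratio_db=10.0, max_bg_db=-30.0):
    """
    Compare the current scene against two illustrative preferences: a minimum
    foreground-to-background ratio and a maximum background level (both in dB).
    Returns a feedback message for the GUI, or None when no deviation is detected.
    """
    if fg_level_db - bg_level_db < min_ratio_db:
        return "Background level is high relative to the foreground"
    if bg_level_db > max_bg_db:
        return "Background noise is likely exceeding a desirable level"
    return None

print(foreground_background_feedback(fg_level_db=-20.0, bg_level_db=-22.0))
```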
[0176] Block 255 involves determining whether an “adjustment [is] made.” According to some examples, block 255 may involve determining whether input has been received from a user — for example by a user’s interaction with the GUI presented in block 253 — during the capture phase 116 indicating the user’s desire to make an adjustment. The interaction may, for example, correspond to adjusting the desired level of an audio source, adjusting the ratio between acoustic background and acoustic foreground, etc. If it is determined in block 255 that no adjustment will currently be made, in this example the process reverts to block 253. In many cases, blocks 253 and 255 involve concurrent operations, because the capture process will generally continue during block 255. However, if it is determined in block 255 that an adjustment will currently be made, in this example the process proceeds to block 257.
[0177] Block 257 involves “applying alterations including feedback.” In this example, block 257 involves processing the received user input indicating the user’s desire to make an adjustment and providing feedback, for example via the GUI from which the input was received, indicating that one or more alterations have been applied or will be applied. In some examples, the alteration(s) may involve some type of augmented capture, such as a microphone beamforming process to enhance audio received from a selected sound source. According to some examples, the augmented audio capture may involve replacement of audio corresponding to a selected potential sound source or a selected candidate sound source by external audio or synthetic audio. In some examples, such an adjustment may be performed in a “retro-causal” way, by applying the adjustment from the beginning of the captured clip. However, in some examples the alteration(s) may be performed in a “causal” way, by applying the adjustment from the time the corresponding user input was received until the end of the captured clip.
Alternatively, or additionally, in some examples the adjustment may be performed, at least in part, after the capture phase 116, for example during a post-capture editing process.
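A minimal sketch of the difference between a causal and a retro-causal adjustment, assuming the adjustment is a simple gain change applied to a mono clip held in memory, is shown below. The function name, sample rate and gain value are illustrative assumptions; a practical implementation would at least smooth the gain transition.

```python
import numpy as np

def apply_gain(audio, sample_rate, gain, interaction_time_s, retro_causal):
    """
    Apply a user-requested level change to a captured clip.
    Causal: from the interaction time until the end of the clip.
    Retro-causal: from the beginning of the clip.
    """
    out = audio.copy()
    start = 0 if retro_causal else int(interaction_time_s * sample_rate)
    out[start:] *= gain
    return out

clip = np.random.randn(48000 * 5).astype(np.float32)   # 5 s placeholder clip at 48 kHz
causal = apply_gain(clip, 48000, 0.5, interaction_time_s=2.0, retro_causal=False)
retro = apply_gain(clip, 48000, 0.5, interaction_time_s=2.0, retro_causal=True)
```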
[0178] Block 259 involves determining whether “capturing [is] complete.” In some examples, block 259 may involve determining whether an “end of capture” event has occurred. In some examples, block 259 may be responsive to user input, which may or may not be received via the above-referenced GUI. In this example, block 259 may involve determining whether user input received via the above-referenced GUI corresponds with the end of the capture phase 116. If it is determined in block 259 that user input from the GUI corresponds to the end of the capture phase 116, in some examples the process continues to block 263. However, as noted elsewhere herein, there may be multiple instances of starting and ending capture prior to block 263. Accordingly, the arrow linking blocks 259 and 263 is broken in Figure 2B. If it is determined in block 259 that user input from the GUI does not correspond to the end of the capture phase 116, in this example the process reverts to block 253.
[0179] Block 263 involves “post-capture editing.” According to some examples, block 263 may be performed by a control system of the device involved with audio and video capture, by one or more remote devices — such as one or more servers — or a combination thereof. In some examples, block 263 may involve causing audio data received during the capture phase to be modified according to registered user input, such as metadata corresponding to user input. In some examples, block 263 may be performed as described above with reference to block 173 of Figure 1D. In some examples, user input may previously have been received — for example, during the capture phase 116 — corresponding to the selection of a potential sound source or a candidate sound source. According to some examples, block 263 may involve replacement of a potential sound source or a candidate sound source by external audio or synthetic audio. For example, block 263 may involve the replacement of music detected in the audio data from a microphone system — which is an example of a “candidate sound source” — with downloaded music. In another example, block 263 may involve adding external audio or synthetic audio corresponding to a selected potential sound source, for example adding external audio or synthetic audio corresponding to a ticking clock when the selected potential sound source is a clock detected via the video data.
[0180] As noted elsewhere herein, the methods 160 may involve storing unmodified audio data received during the capture phase. In some examples, block 263 may involve allowing a user to make an interpolation between modified audio data corresponding to an augmented audio capture and unmodified audio data corresponding to a “real-world” audio capture. According to some examples, block 263 may involve presenting a GUI that includes at least one user input area — such as a virtual slider, a virtual dial, etc. — configured for receiving a user selection of a ratio between modified audio data corresponding to an augmented audio capture and unmodified audio data corresponding to a “real-world” audio capture. In some examples, block 263 may involve allowing a user to edit modified audio data by substituting another instance of external audio or synthetic audio, e.g., by allowing the user to select another type of background music, to select a different audio clip corresponding to a potential sound source — such as a different “ticking clock” audio clip, etc. Processing may continue to block 265.
[0181] Block 265 involves preparing one or more media files, storing one or more media files, transmitting one or more media files, or combinations thereof. In this example, the one or more media files include the final version of the audio data resulting from the post-capture editing process of block 263. In some examples, block 265 may involve providing the final version of the audio data and the video data in a standard media container, such as Moving Picture Experts Group (MPEG)-4 Part 14 (MP4). According to some examples, block 265 may involve transmitting a bitstream that includes the final version of the audio data and the video data. According to some examples, the bitstream may be an immersive voice and audio services (IVAS) encoded bitstream.
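As one possible, purely illustrative way to package the final audio and video into an MP4 container, the sketch below shells out to the ffmpeg command-line tool, which is assumed to be installed; the file names are placeholders and this is not a packaging method prescribed by the disclosure.

```python
import subprocess

def package_mp4(video_path, audio_path, output_path):
    """Mux the final edited audio with the captured video into an MP4 container."""
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-i", video_path,            # captured video
            "-i", audio_path,            # final edited audio
            "-map", "0:v:0", "-map", "1:a:0",
            "-c:v", "copy",              # keep the video stream as-is
            "-c:a", "aac",               # encode the audio for the MP4 container
            output_path,
        ],
        check=True,
    )

# Example usage (paths are placeholders):
# package_mp4("capture.mp4", "edited_audio.wav", "final_clip.mp4")
```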
[0182] In some examples, the visualization of an audio scene may allow for customization of audio capture. According to some such examples, an audio customization interface may be implemented as layers overlaid on a video scene. For example, a GUI may provide multiple layers of customization, in which the basic layer involves feedback on the composition of the audio scene and more advanced customization options may be presented as one or more subsequent or optional layers. In this context, the terms “customization layer” and “layer of customization” refer to one or more graphical features that may be presented on a GUI. In some examples, the one or more graphical features may be presented in a cumulative or additive manner. Some examples of such advanced customizations may involve adjustment of audio source levels between background audio sources and foreground audio sources, adjustment of speech level from multiple talkers to a consistent level, etc.
[0183] Figure 3A illustrates an example of a GUI that may be presented to indicate one or more audio scene preferences and one or more current audio scene characteristics. In this example, Figure 3A shows a GUI 300, an image corresponding to a video frame of a video clip 301, a person 305 in the current video clip 301, an audio source summary area 315 of the GUI 300, audio source information areas 330a and 330b within the audio source summary area 315, audio source level information areas 332a and 332b within the audio source information areas 330a and 330b, respectively, and an audio source level preference indication 333 in the audio source level information area 332b. In this example, the GUI 300 is being overlaid on the video clip 301.
The GUI 300 and the video clip 301 may be presented on an apparatus such as the apparatus 101 of Figure 1A, for example on a display corresponding to the output unit(s) 147 shown in Figure 1A. In this example, the apparatus 101 is a cell phone.
[0184] In this example, audio source level information area 332a provides information about a talker audio source, which corresponds to speech of the person 305 in this instance. Here, the audio source level information area 332b provides information about an audio source category, which is cafeteria background noise in this example, and which may include multiple individual audio sources. In some examples, the apparatus 101 may be configured to determine or estimate the audio source category, for example by implementing an audio source classifier such as disclosed herein to evaluate the background audio that is currently being received by a microphone system of the apparatus 101.
[0185] According to this example, the audio source level information areas 332a and 332b indicate the estimated current audio levels corresponding to the talker and the cafeteria noise, respectively. In this example, the audio source level preference indication 333 has been obtained from a template that corresponds to a particular type of background noise, which is cafeteria noise in this instance. According to some such examples, the audio source level preference indication 333 may correspond to one or more previously-selected audio levels for cafeteria noise, e.g., by a current user of the apparatus 101.
[0186] In this example, the audio source level information areas 332a and 332b each include a stippled portion 337a and a portion 337b with diagonal hash marks. The portions 337b indicate the extent to which a current sound level has been modified according to user input. The modifications indicated by the portions 337b may be an increase or a decrease. For example, the portion 337b in the audio source level information area 332a may indicate the extent to which the current sound level corresponding to the talker has been increased according to user input, whereas the portion 337b in the audio source level information area 332b may indicate the extent to which the current sound level corresponding to the cafeteria noise has been decreased according to user input. In some examples, the user input may have involved touching the audio source level information areas 332a and 332b, e.g., interacting with a virtual slider that is provided by the audio source level information areas 332a and 332b.
[0187] As with other disclosed examples, the types, numbers and arrangements of elements shown in Figure 3A are merely provided by way of example. For example, although the audio source summary area 315 of Figure 3A provides information regarding one individual sound source and one sound source category, other examples may provide information about multiple individual sound sources. Alternatively, or additionally, some alternative examples may provide a different graphical depiction of sound source characteristics. Some such examples are provided in this disclosure.
[0188] Figure 3B is a block diagram that illustrates examples of customization layers that may be presented via one or more GUIs. The one or more GUIs may be presented on an apparatus such as the apparatus 101 of Figure 1A, for example on a display corresponding to the output unit(s) 147 shown in Figure 1A. In some examples, the apparatus may be a mobile device, such as a cell phone. In this example, Figure 3B shows blocks 355, 360, 365 and 370, each of which corresponds to a customization layer that may be presented via one or more GUIs. According to this example, the blocks 355, 360, 365 and 370 are interconnected with two-way arrows labeled “user interaction,” indicating that a user may be able to interact with a GUI corresponding to any one of the blocks 355, 360, 365 and 370 in order to add a customization layer to a GUI that is currently being presented or to revert to a different GUI corresponding to another customization layer.
[0189] In these examples, block 355 corresponds to a first customization layer that corresponds to various sets of possible graphical features for providing a basic level of feedback regarding the composition of an audio scene. For example, a GUI corresponding to the first customization layer may provide a graphical depiction of one or more audio sources, or groups of audio sources, in the current audio scene. According to some examples, a GUI corresponding to the first customization layer may provide estimates of levels of what are estimated to be the most relevant audio sources present in the audio scene. Alternatively, or additionally, a GUI corresponding to the first customization layer may show images corresponding to one or more background audio sources and one or more foreground audio sources. In some such examples, a GUI corresponding to the first customization layer may provide an estimate of the ratio of the level of the background audio source(s) to the level of the foreground audio source(s).

[0190] According to these examples, block 360 corresponds to a second customization layer that corresponds to various sets of possible graphical features to facilitate a user’s basic customization of one or more sound sources of the audio scene. In some examples, one or more additional graphical features corresponding to the second customization layer may include one or more areas for receiving user input for an adjustment of the ratio between audio levels of background and foreground audio components, for the adjustment of the level of one or more foreground components, etc. In some such examples, the audio data corresponding to one or more sound sources may be modified, e.g., suppressed or augmented, according to received user input. According to these examples, a user may be able to interact with a user input area of a GUI corresponding to block 355 in order to add one or more additional graphical features corresponding to the second customization layer of block 360.
[0191] In these examples, block 365 corresponds to a third customization layer that corresponds to various sets of possible graphical features to facilitate a user’s advanced customization of one or more sound sources of the audio scene. According to these examples, a user may be able to interact with a user input area of a GUI corresponding to block 355 or block 360 in order to add one or more additional graphical features corresponding to the third customization layer of block 365. In some examples, one or more additional graphical features corresponding to the third customization layer may include one or more areas for receiving user input for introducing audio components that are not currently present in the audio data being captured by a microphone system of the capture device, e.g., apparatus 101, or for which the acoustic signal is below a threshold level. For example, the video data obtained by the camera system of the capture device may indicate one or more potential sound sources for which the audio data being captured by the microphone system is below a threshold level. The potential sound source(s) may include a distant person or animal, a clock, an audio device that is not currently in use or which is producing sound at a low level, etc. One or more additional graphical features corresponding to the third customization layer may indicate one or more such potential sound sources and one or more user input areas for selecting a potential sound source. After a user selects a potential sound source, the user may be presented with one or more additional graphical features corresponding to options for selecting audio data corresponding to the selected potential sound source.

[0192] According to these examples, block 370 corresponds to a fourth customization layer that corresponds to various sets of possible graphical features to facilitate a user’s post-capture editing of one or more sound sources of a captured audio scene. According to these examples, a user may be able to interact with a user input area of a GUI corresponding to block 355, block 360 or block 365 in order to add one or more additional graphical features corresponding to the fourth customization layer of block 370. In some examples, one or more additional graphical features corresponding to the fourth customization layer may include one or more areas for receiving user input indicating a user’s selection of a ratio between augmented audio capture and real-world audio capture. The one or more additional graphical features may, for example, include a virtual slider, a virtual dial, etc.
[0193] As with other disclosed examples, the types, numbers and arrangements of elements shown in Figure 3B are merely provided by way of example. For example, although Figure 3B shows four examples of customization layers, other implementations may involve more or fewer than four customization layers, one or more different types of customization layers, etc.
[0194] Figure 4A illustrates an example of a GUI that may be presented in accordance with the first customization layer of Figure 3B. In this example, Figure 4A shows a GUI 400a, an image corresponding to a video frame of a video clip 401a, a person 405a in the current video clip 401a, audio source level representations 410a and 410b, an audio source summary area 415a, and audio source information areas 420a and 420b within the audio source summary area 415a. In this example, the GUI 400a is being overlaid on the video clip 401a. The GUI 400a and the video clip 401a may be presented on an apparatus such as the apparatus 101 of Figure 1A, for example on a display corresponding to the output unit(s) 147 shown in Figure 1A. In this example, the apparatus 101 is a cell phone.
[0195] In this example, audio source level information area 410a provides information about a talker audio source, which corresponds to the speech of the person 405a in this instance. Here, the audio source level information area 410b provides information about another talker audio source, which corresponds to the speech of the person operating the apparatus 101. According to this example, the audio source level information areas 410a and 410b indicate, via their sizes, the estimated current audio levels corresponding to the speech of the person 405a and the speech of the person operating the apparatus 101, respectively.

[0196] Here, the audio source information areas 420a and 420b provide information about two audio source categories, which are street noise and speech, respectively, and which may include multiple individual audio sources. In this example, the relative sizes of the audio source information areas 420a and 420b correspond with relative audio levels. In some examples, the total size of the audio source information areas 420a and 420b may correspond to the total instantaneous audio level, which may be scaled with respect to the maximum instantaneous audio level that can be captured by the audio capture system of the apparatus 101. In some such examples, the length or volume of the audio source summary area 415a may correspond to the maximum instantaneous audio level that can be captured by the audio capture system of the apparatus 101. According to this example, the audio source information area 420a provides information about the overall sound levels of various sound sources that the apparatus 101 — or one or more devices in communication with the apparatus 101, such as a device of a cloud service — has determined to be in the audio source category of street noise. In some examples, the apparatus 101 may be configured to determine or estimate the audio source category by implementing an audio source classifier such as disclosed herein to evaluate the background audio that is currently being received by a microphone system of the apparatus 101 as being street noise. The audio source information area 420b may provide information about the overall levels of speech of the person 405a and the person operating the apparatus 101, or only about the levels of speech of the person 405a, depending on the particular implementation.
[0197] Figure 4B illustrates another example of a GUI that may be presented in accordance with the first customization layer of Figure 3B. In this example, Figure 4B shows the GUI 400a at a time subsequent to the time depicted in Figure 4A, while displaying the same video clip 401a. Accordingly, the elements of GUI 400a shown in Figure 4B are the same as those shown in Figure 4A except for the following details: audio source level representation 410a is not shown, because the person 405a is not speaking at this moment; a vehicle 425, which is a bus in this example, is now part of the audio scene; the GUI 400a now includes the audio source level representation 410c, which corresponds to the vehicle 425; and the audio source summary area 415a now includes audio source information area 420c, which also corresponds to the vehicle 425. In these examples, the audio source information areas 420a and 420b (and, when present, 420c) illustrated in Figures 4A and 4B provide graphic depictions of the total instantaneous audio level of the audio scene and the instantaneous distribution of the total audio level among the components of the audio scene.
[0198] In this example, the audio source level information area 410c is larger than the audio source level information areas 410a and 410b shown in Figure 4A or the audio source level information area 410b shown in Figure 4B, indicating that the bus sound is relatively louder than any of the other audio sources in either audio scene and therefore causes a higher received audio signal level. Moreover, the audio source level information area 410c shows an exclamation point inside a triangle, indicating the estimated current audio level corresponding to the bus may be so high as to be problematic.
[0199] For example, a sound source separator implemented by the apparatus 101 will generally have performance limits, beyond which components of an audio signal cannot be reliably estimated. In this example, an instance of a speech separator may fail to separate speech if the level of a background audio signal, such as that corresponding to the bus, is too high. It is therefore potentially advantageous to notify a user performing the capture about this situation, because this may prompt a corrective action by the user, e.g., moving closer to an audio source of interest, pausing an audio clip until the background audio level has decreased (e.g., waiting until the bus has moved on), etc.
[0200] A GUI such as the GUI 400a of Figure 4A or 4B provides various other potential advantages. For example, the apparatus 101 that is presenting the GUI 400a has reduced the number of possible audio sources about which information is provided, as well as the types of information provided, to a manageable number of user feedback elements that are presented in the GUI 400a. Accordingly, the user is not overwhelmed with a large amount of information, but may instead focus on what the apparatus 101 estimates to be the most pertinent categories of information. Therefore, the user is able to focus on the tasks involved with capturing a desired type of video clip, along with the associated audio.
[0201] As with other disclosed examples, the types, numbers and arrangements of elements shown in Figures 4A and 4B are merely provided by way of example. For example, although the audio source summary area 415a of Figures 4A and 4B provides information regarding two sound source categories and the audio source summary area 415a of Figure 4B provides information regarding sounds from an individual vehicle, other examples may provide information about more or fewer individual sound sources or groups of sound sources. Alternatively, or additionally, some alternative examples may provide a different graphical depiction of sound source characteristics. Some such examples are provided in this disclosure.

[0202] Figure 4C illustrates an example of a GUI that may be presented in accordance with the second customization layer of Figure 3B. In this example, Figure 4C shows a GUI 400b that is similar to the GUI 400a that is shown in Figures 4A and 4B. For example, the GUI 400b shows circular audio source level representations 410a, 410b and 410c, corresponding to the person 405b, the person operating the apparatus 101 and a vehicle 425, respectively. The vehicle 425 is a truck in this example. Moreover, the sizes of the audio source level representations 410a, 410b and 410c shown in the GUI 400b correspond to the level of each respective audio source.
[0203] However, the GUI 400b includes two significant differences. One difference is that the audio source summary area 415 includes audio source information areas 420d and 420e, which correspond to background and foreground audio sources, respectively. The “background” audio source information area 420d may, for example, correspond to street noise as well as the noise of the vehicle 425. The “foreground” audio source information area 420e may, in some examples, correspond only to the speech of the person 405b. Another significant difference is that the audio source summary area 415 includes virtual sliders 440a and 440b at the edges of audio source information areas 420d and 420e, respectively. A user may interact with virtual sliders 440a and 440b in order to indicate a desire to increase or decrease the level of the background or foreground audio, respectively. In Figure 4C, a user’s finger 435 is shown interacting with the virtual slider 440b in order to indicate a desired modification to the foreground sound level.
[0204] A GUI such as the GUI 400b provides various potential advantages. For example, the apparatus 101 that is presenting the GUI 400b has reduced the number of possible audio sources about which information is provided, as well as the types of information provided, to a manageable number of user feedback elements. The audio source summary area 415 of GUI 400b is even simpler than that of GUI 400a, because all sound sources are categorized as either foreground or background. The user is able to provide feedback to modify the level of either foreground or background audio, which is convenient but not overly complicated. Accordingly, the user is not overwhelmed with a large amount of information, but may instead focus on the tasks involved with capturing a desired type of video clip, along with the associated audio. As with other disclosed examples, the types, numbers and arrangements of elements shown in Figure 4C are merely provided by way of example.

[0205] Figures 4D and 4E illustrate example elements of a GUI that may be presented in accordance with the third or fourth customization layers of Figure 3B. In other words, the GUI 400c of Figures 4D and 4E may be used for advanced customization, for a post-editing process, or both. The GUI 400c and the video clip 401c are being presented on an instance of the apparatus 101 of Figure 1A, which is a cell phone in these examples. In these examples, the GUI 400c is being overlaid on the video clip 401c, which portrays a restaurant scene. As with other disclosed examples, the types, numbers and arrangements of elements shown in Figures 4D and 4E are merely provided as examples.
[0206] In these examples, Figures 4D and 4E show a GUI 400c at two different times, images corresponding to video frames of a video clip 401c at the two different times, a person 405a in the video clip 401c at the two different times, an audio source summary area 415c, and audio source information areas 430a, 430b, 430c and 430d within the audio source summary area 415c, audio source labels and audio source level information areas 432a, 432b, 432c and 432d within the audio source information areas 430a, 430b, 430c and 430d, respectively, average audio source level indications 433 within the audio source information areas 430a, 430b, 430c and 430d, areas 437a within the audio source level information areas 432a, 432b, 432c and 432d, and areas 437b within at least some of audio source level information areas 432a, 432b, 432c and 432d. Figure 4E also includes a user input area 460 configured for receiving a user selection of augmented audio capture. In this example, the augmented audio capture involves replacing a candidate audio source with external audio.
[0207] According to these examples, the audio source labels for the audio source information areas 430a, 430b, 430c and 430d are “cafeteria noise,” “talking person,” “talking offscreen” and “music playing,” respectively. In these examples, the talking person is the person 405a, the talking offscreen corresponds to speech of the person using the apparatus 101 and the “music playing” corresponds to background music that is currently playing in the restaurant. According to these examples, “cafeteria noise” corresponds to background noise in the restaurant, which an audio classifier implemented by the apparatus 101 has classified as being in the category of cafeteria noise. In these examples, the audio source level information areas 432a, 432b, 432c and 432d indicate the estimated current audio levels corresponding to the background noise in the restaurant, speech of the person 405a, the speech of the person using the apparatus 101 and the background music, respectively. According to these examples, a user may interact with (e.g., touch) the plus symbol or the minus symbol in any of the audio source information areas 430a, 430b, 430c and 430d in order to indicate a desire to increase or decrease, respectively, the relative signal level of the respective audio source with respect to that of the other audio sources. In some examples, the control system of the apparatus 101 may be configured to create and store user input metadata corresponding to user input received via the plus symbol or the minus symbol. The audio source level change may or may not occur during the capture process, depending on the particular implementation. In some examples, the audio source level change may occur during a post-capture editing process and may be based, at least in part, on user input metadata corresponding to user input received via the plus symbol or the minus symbol.
[0208] According to these examples, areas 437a and 437b correspond to modified and unmodified audio levels, respectively, within audio source level information areas 432a, 432b, 432c and 432d. In these examples, the average audio source level indications 433 indicate audio source signal levels for the background noise in the restaurant, speech of the person 405a, the speech of the person using the apparatus 101 and the background music. The average audio source level indications 433 may, for example, indicate the audio source signal for each audio source, or group of audio sources, over a time interval.
[0209] In the example shown in Figure 4D, the textual feedback area 450 states, “here background music was detected.” In the example shown in Figure 4E, the textual feedback area 450 states, “the background music can now be replaced with a streamed version.” According to some examples, indications of this type may be responsive to detecting, for example by the control system of the apparatus 101, a candidate sound source for augmented audio capture involving replacement of the candidate sound source by external audio. For example, the control system may be configured to detect and identify background music, and to propose replacing the background music that is currently being detected with another version of the same music or with other music selectable by the user. In this example, user input area 460 of Figure 4E shows an image corresponding to another version of the same song that is being detected and is configured for receiving a user’s selection of that song for augmented audio capture.
[0210] In some implementations, a GUI such as the GUI 400c may be used for a post-capture editing process. For example, a virtual slider, a virtual dial, etc., may be presented on a GUI such as the GUI 400c to allow a user to select a ratio between augmented audio capture and real-world audio capture. In one such example, a slider may be presented in one or more of the audio source level information areas 432a, 432b and 432c, allowing a user to modify augmented audio to include relatively more or relatively less unmodified audio.
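The ratio selection described above could, for example, drive a simple linear interpolation between the two time-aligned versions of a clip, as in the following sketch. The linear crossfade and the function name are assumptions made for illustration; they are not a mixing rule specified by the disclosure.

```python
import numpy as np

def mix_augmented_and_real(augmented, real, ratio):
    """
    Interpolate between the modified (augmented) and unmodified (real-world)
    versions of a clip. ratio = 1.0 keeps only the augmented capture;
    ratio = 0.0 falls back entirely to the real-world capture.
    Both arrays are assumed to be time-aligned and of equal length.
    """
    ratio = float(np.clip(ratio, 0.0, 1.0))
    return ratio * augmented + (1.0 - ratio) * real

augmented = np.random.randn(48000).astype(np.float32)  # placeholder augmented audio
real = np.random.randn(48000).astype(np.float32)        # placeholder unmodified audio
blended = mix_augmented_and_real(augmented, real, ratio=0.7)  # e.g. slider at 70 %
```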
[0211] A GUI such as the GUI 400c provides various potential advantages. Although the GUI 400c is not as simplified as the GUI 400a or the GUI 400b, the GUI 400c allows for more advanced customization, which may occur before, during or after audio capture.
[0212] Figure 5 A is a flow diagram that outlines various example methods 500 according to some disclosed implementations. The example methods 500 may be partitioned into blocks, such as blocks 505, 510, 515, 520, 525, 530, 535, and 540. The various blocks may be described as operations, processes, methods, steps, acts or functions. The blocks of methods 500, like other methods described herein, are not necessarily performed in the order indicated. In some implementations, one or more of the blocks of methods 500 may be performed concurrently.
Moreover, some implementations of methods 500 may include more or fewer blocks than shown and/or described. The blocks of methods 500 may be performed by one or more devices, for example, the device that is shown in Figure 1A.
[0213] Block 505 involves “receiving, by a control system of a device, audio data from a microphone system.” The control system may be, or may include, the CPU 141 of Figure 1A. The control system may, for example, include a general purpose single- or multi-chip processor, one or more digital signal processors (DSPs), one or more application specific integrated circuits (ASICs), one or more field programmable gate arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic, one or more discrete hardware components, or combinations thereof. In some examples, the microphone system may be a microphone system of a mobile device that includes the control system. Processing may continue to block 510.
[0214] Block 510 involves “receiving, by the control system, video data from a camera system.” In some examples, the camera system may include one or more cameras of a mobile device that includes the control system. Processing may continue to block 515.
[0215] Block 515 involves “identifying, by the control system and based at least in part on the audio data and the video data, two or more audio sources in an audio scene.” Audio sources may also be referred to herein as sound sources. In some examples, the identifying may involve creating an inventory of sound sources in the audio scene. The inventory of sound sources may include one or more actual sound sources and one or more potential sound sources. According to some examples, the identifying may involve performing, by the control system, a first sound source separation process. According to some such examples, post-capture audio processing may involve performing a second sound source separation process. The second sound source separation process may be performed by the control system or by one or more other devices, such as one or more servers, depending on the particular implementation. In some examples, the identifying may involve detecting, by the control system and based at least in part on the video data, one or more potential sound sources. In some such examples, at least one potential sound source may not be indicated by the audio data. Processing may continue to block 520.
[0216] Block 520 involves “estimating, by the control system and based on the audio data, at least one audio characteristic of each of the two or more audio sources.” In some examples, block 520 may involve estimating a level of each of the two or more audio sources. Processing may continue to block 525.
[0217] Block 525 involves “storing, by the control system, audio data and video data received during a capture phase.” Block 525 may, for example, involve storing the audio data and video data in a memory of the device that is receiving the audio data and video data. The capture phase may correspond to the capture phase 116 that is described herein, for example with reference to Figure 1C. In some examples, blocks 505, 510, 515 and 520 may be performed, at least in part, during the capture phase. According to some examples, blocks 505, 510, 515 and 520 may be performed, at least in part, during a pre-capture phase, which may correspond to the pre-capture phase 114 that is described herein, for example with reference to Figure 1C. Processing may continue to block 530.
[0218] Block 530 involves “controlling, by the control system, a display of the device to display images corresponding to the video data and to display, prior to and during the capture phase, a graphical user interface (GUI) overlaid on the images, the GUI including an audio source image corresponding to the at least one audio characteristic of each of the two or more audio sources, the GUI including one or more user input areas for receiving user input.” In some examples, block 530 may involve controlling a display of the apparatus 101 to present a GUI like that shown in Figure 3A or one of the GUIs shown in Figures 4C-4E. The virtual sliders 440a and 440b of Figure 4C are examples of the “one or more user input areas for receiving user input.” Processing may continue to block 535.
[0219] Block 535 involves “receiving, by the control system and prior to the capture phase, user input via the one or more user input areas.” User interaction #1 of Figure 2A provides an example of such user input. Block 535 may, for example, involve receiving user input via one of the virtual sliders 440a and 440b of Figure 4C, e.g., via the user’s finger 435. In other examples, block 535 may involve receiving input via another type of virtual slider, via a virtual knob, a virtual dial, etc. In some alternative examples, user input may be received via one or more voice commands, one or more gestures, etc. Processing may continue to block 540.
[0220] Block 540 involves “causing, by the control system, audio data received during the capture phase to be modified according to the user input.” The “causal” effects that are described with reference to Figure 2A provide examples. Block 540 may, for example, involve modifying audio data corresponding to a selected audio source or a selected category of audio sources. Block 540 may, for example, involve causing audio data received during the capture phase to be modified during the capture phase. In some such examples, block 540 may involve a beamforming process corresponding to a selected audio source. In some examples, methods 500 may involve receiving, by the control system and after the capture phase has begun, user input via the user input area. According to some such examples, block 540 may involve causing audio data received throughout a duration of the capture phase to be modified according to the user input. The “retro-causal” effects that are described with reference to Figure 2A provide examples.
[0221] Alternatively, or additionally, block 540 may involve causing audio data received during the capture phase to be modified during a post-capture editing phase. In some such examples, block 525 may involve storing modified audio data that has been modified according to the user input. According to some examples, methods 500 may involve storing unmodified audio data received during the capture phase.
[0222] In some examples, methods 500 may involve creating and storing, by the control system, user input metadata corresponding to user input received via the user input area. As noted above, block 540 may involve causing audio data received during the capture phase to be modified during a post-capture editing phase. In some examples, causing the audio data received during the capture phase to be modified according to the user input may involve post-capture audio processing that is based, at least in part, on the user input metadata. According to some examples, the control system may be configured to perform at least a portion of the post-capture audio processing. Alternatively, or additionally, another control system — such as a control system of a server — may be configured to perform at least a portion of the post-capture audio processing.
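A minimal sketch of how such user input metadata might be recorded for later post-capture processing is shown below; the record fields, action names and JSON serialization are assumptions made for illustration only.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class UserInputEvent:
    """Hypothetical record of one GUI interaction, stored for post-capture editing."""
    timestamp_s: float       # time within the clip when the input was received
    source_label: str        # which audio source (or category) the input targets
    action: str              # e.g. "level_up", "level_down", "set_fg_bg_ratio"
    value: float             # requested gain change, ratio, etc.

events = [UserInputEvent(timestamp_s=12.4, source_label="cafeteria noise",
                         action="level_down", value=-6.0)]
metadata_json = json.dumps([asdict(e) for e in events])  # could be stored alongside the clip
print(metadata_json)
```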
[0223] According to some examples, methods 500 may involve classifying, by the control system, the two or more audio sources into two or more audio source categories. In some such examples, the GUI may include a user input area portion corresponding to each of the two or more audio source categories. One such category may be “cafeteria noise,” e.g., as described with reference to Figure 3A, or 4D or 4E. Another such category may be “street noise,” e.g., as described with reference to Figures 4A and 4B. In some examples, classifying the two or more audio sources into two or more audio source categories may be based on the inventory of sound sources. According to some examples, the two or more audio source categories may include a background category and a foreground category. In some examples, the one or more user input areas of the GUI may include at least one user input area configured for receiving user input regarding a selected level, or ratio of levels, for each of the two or more audio source categories. The virtual sliders 440a and 440b of Figure 4C are examples.
[0224] In some examples, methods 500 may involve determining one or more types of actionable feedback regarding the audio scene. According to some such examples, the GUI may be based, in part, on the one or more types of actionable feedback. Figure 4B provides one example of actionable feedback: the audio source level information area 410c shows an exclamation point inside a triangle, indicating the estimated current audio level corresponding to the bus may be so high as to be problematic. In this example, an instance of a speech separator may fail to separate speech if the level of a background audio signal, such as that corresponding to the bus, is too high. Notifying a user performing the capture about this situation may prompt corrective action by the user, e.g., moving closer to an audio source of interest, pausing an audio clip until the background audio level has decreased (e.g., waiting until the bus has moved on), etc.
[0225] According to some examples, methods 500 may involve detecting, by the control system, one or more candidate sound sources for augmented audio capture, the augmented audio capture involving replacement of a candidate sound source by external audio or synthetic audio. One such example is described with reference to Figures 4D and 4E, wherein the control system detected background music and indicated that the background music was a candidate sound source for augmented audio capture, which in this example involved replacement of the candidate sound source by external audio. Foley effects are examples of synthetic audio.
[0226] In some examples, the GUI may include at least one user input area configured for receiving a user selection of a selected potential sound source or a selected candidate sound source. For example, the GUI 400c may, in some implementations, allow a user to select another sound source, or category of sound source, such as the cafeteria noise, e.g., by touching the corresponding audio source information area 430a.
[0227] According to some examples, the GUI may include at least one user input area configured for receiving a user selection of augmented audio capture. The user input area 460 of Figure 4E is an example of one such user input area. The augmented audio capture may involve external audio or synthetic audio for a selected potential sound source or a selected candidate sound source. In some alternative examples, the GUI 400c, or another disclosed GUI, may provide a user with a selection of synthetic audio, such as possible Foley effects, for augmented audio capture corresponding to a selected potential sound source or a selected candidate sound source.
[0228] In some examples, the GUI may include at least one user input area configured for receiving a user selection of a ratio between augmented audio capture and real-world audio capture. As noted above, in some examples the GUI 400c, or a similar GUI, may allow a user selection of a ratio between augmented audio capture and real-world audio capture.
[0229] Figure 5B is a flow diagram that outlines various example methods 550 according to some disclosed implementations. The example methods 550 may be partitioned into blocks, such as blocks 555a, 555b, 558a, 558b, 560, 562, 564 and 565. The various blocks may be described as operations, processes, methods, steps, acts or functions. The blocks of methods 550, like other methods described herein, are not necessarily performed in the order indicated. In some implementations, one or more of the blocks of methods 550 may be performed concurrently. Moreover, some implementations of methods 550 may include more or fewer blocks than shown and/or described. The blocks of methods 550 may be performed by one or more devices, for example, the device that is shown in Figure 1A.
[0230] Block 555a involves “receiving audio data.” Block 555a, like block 505 of Figure 5A, may involve receiving, by a control system of a device, audio data from a microphone system. The control system may be, or may include, the CPU 141 of Figure 1A. The control system may, for example, include a general purpose single- or multi-chip processor, one or more digital signal processors (DSPs), one or more application specific integrated circuits (ASICs), one or more field programmable gate arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic, one or more discrete hardware components, or combinations thereof. In some examples, the microphone system may be a microphone system of a mobile device that includes the control system. Processing may continue from block 555a to block 558a.
[0231] Block 555b involves “receiving video data.” Block 555b, like block 510 of Figure 5A, may involve receiving, by a control system of a device, video data from a camera system. In some examples, the camera system may be a camera system of a mobile device that includes the control system. According to some examples, blocks 555a and 555b may be performed concurrently. Processing may continue from block 555b to block 558b.
[0232] Block 558a involves “classifying an audio scene.” In this example, block 558a involves selecting, by the control system and based at least in part on the audio data received in block 555a, the most likely environmental context from a group of possible environmental contexts such as “urban park,” “airport,” “subway station,” “cafe,” “living room,” etc. Block 558a may, for example, involve implementing, by the control system, an audio scene classification (ASC) method. The ASC method may, for example, involve using a trained neural network that is implemented by the control system. The ASC method may, in some instances, be one of the methods described in B. Ding, et al., “Acoustic scene classification: a comprehensive survey,” in “Expert Systems with Applications,” Volume 238, Part B, 121902 (Elsevier, March 15, 2024), which is hereby incorporated by reference. Processing may continue from block 558a to block 564.
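For illustration, the sketch below shows the general shape of such an ASC pipeline: a log-mel spectrogram is computed from the captured audio and passed through a small CNN whose output is a score per environmental context. The scene list, the layer sizes and the use of librosa and PyTorch are illustrative assumptions; an actual implementation would use a network trained on labelled scene recordings, for example one of the architectures surveyed in the reference above.

```python
import librosa
import numpy as np
import torch
import torch.nn as nn

SCENES = ["urban park", "airport", "subway station", "cafe", "living room"]

class SceneCNN(nn.Module):
    """Small CNN over log-mel spectrograms; in practice the weights come from training on labelled scenes."""
    def __init__(self, n_scenes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, n_scenes)

    def forward(self, x):                         # x: [batch, 1, n_mels, frames]
        return self.classifier(self.features(x).flatten(1))

def classify_scene(waveform: np.ndarray, sample_rate: int, model: SceneCNN) -> str:
    """Return the most likely environmental context for a mono waveform."""
    mel = librosa.feature.melspectrogram(y=waveform, sr=sample_rate, n_mels=64)
    log_mel = librosa.power_to_db(mel)
    x = torch.from_numpy(log_mel).float()[None, None]   # add batch and channel dimensions
    scores = torch.softmax(model(x), dim=-1)[0]
    return SCENES[int(scores.argmax())]
```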
[0233] Block 558b involves “classifying a video scene.” In this example, block 558b involves selecting, by the control system and based at least in part on the video data received in block 555b, the most likely environmental context from a group of possible environmental contexts. Block 558b may, for example, involve implementing, by the control system, a video scene classification (VSC) method such as Multiscale Vision Transformers (MViTv2), a 3-Dimensional-CNN (3D-CNN)-based VSC method, a CNN-Recurrent Neural Network (CNN-RNN)-based VSC method, or another applicable VSC method. Processing may continue from block 558b to block 560.

[0234] Block 560 involves “classifying video objects.” In this example, block 560 involves detecting, by the control system and based at least in part on the video data received in block 555b and the video scene classification of block 558b, objects in currently-obtained video images. Block 560 may, for example, involve implementing, by the control system, a video-based object detection method such as one of the methods described in J. Redmon, et al., “You Only Look Once: Unified, Real-Time Object Detection” (arXiv:1506.02640v5 [cs.CV] May 9, 2016), which is hereby incorporated by reference, or another applicable video-based object detection method. Processing may continue from block 560 to block 562.
[0235] Block 562 involves “identifying potential audio sources.” In this example, block 562 involves identifying potential audio sources by the control system and based at least in part on the video object classification of block 560. The operations of block 562 may, in some examples, be based at least in part on the video data received in block 555b. For example, if the video object classification of block 560 identifies a person in the scene, determining whether the person is a potential audio source may be based, at least in part, on one or more activities of the person, such as whether the person’s mouth is moving, whether the person is playing a musical instrument, whether the person is pounding on a table, etc. Some types of objects, such as empty chairs or tables, may not be classified as potential audio sources in some examples. According to some examples, block 562 may involve identifying candidate sound sources for augmented audio capture. In some such examples, the augmented audio capture may involve replacement of a candidate sound source by external audio or synthetic audio. One example of the output of block 562 is shown in Figure 5C and is described below. Processing may continue from block 562 to block 564.
[0236] Block 564 involves “identifying and classifying audio sources, and estimating audio source levels.” In this example, block 564 involves identifying and classifying audio sources, and estimating audio source levels, by the control system and based at least in part on the potential audio sources identified in block 562, the audio scene classification of block 558a and the audio data received in block 555a. The operations of block 564 may, in some examples, involve creating a data structure, such as a lookup table, that includes a list of detected audio sources present in the audio scene. In some examples, the data structure may indicate whether each audio source present in the audio scene is a “foreground” audio source or a “background” audio source, with the foreground audio sources being audio sources that are estimated to be of interest and background audio sources being audio sources that are estimated to be noise, or at least estimated to be less significant than the foreground audio sources. Block 564 may involve selecting a semantic label for each audio source, such as “human talker #1,” “human talker #2,” “musician #1,” etc., for foreground audio sources and “street noise,” “babble noise,” “restaurant noise,” “wind noise,” “traffic noise,” etc., for background audio sources, and including the semantic label in the data structure. In some instances, multiple background audio sources present in the audio scene may be grouped into a single background category, such as “background noise,” in the data structure.
[0237] According to some examples, block 564 may involve generating a sequence of timestamps associated with a binary flag indicating the presence or absence of a particular audio source. In some examples, block 564 may involve a signal level measurement for each audio source, which may also be associated with the sequence of timestamps. According to some examples, block 564 may involve generating a detection probability for that source, which may also be associated with the sequence of timestamps. One example of the output of block 564 is shown in Figure 5D and is described below.
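A minimal sketch of such a data structure is shown below, assuming a Python implementation in which each detected audio source carries a semantic label, a foreground/background flag, and per-timestamp presence flags, signal levels and detection probabilities. The field names are illustrative; a table such as that of Figure 5D could be populated from entries of this kind.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class AudioSourceEntry:
    source_id: int
    label: str                    # e.g. "human talker #1" or "restaurant noise"
    is_foreground: bool
    presence: Dict[float, bool] = field(default_factory=dict)         # timestamp (s) -> present?
    level: Dict[float, float] = field(default_factory=dict)           # timestamp (s) -> estimated level
    detection_prob: Dict[float, float] = field(default_factory=dict)  # timestamp (s) -> probability

# Example inventory with one foreground and one background source.
inventory = {
    1: AudioSourceEntry(1, "human talker #1", is_foreground=True),
    2: AudioSourceEntry(2, "restaurant noise", is_foreground=False),
}
inventory[1].presence[12.0] = True
inventory[1].level[12.0] = 0.7
inventory[1].detection_prob[12.0] = 0.93
```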
[0238] In some implementations of block 564, the semantic labels for audio sources may be generated by an instance of an audio tagger that is implemented by the control system. Such an audio tagger may be implemented in many ways. For example, the audio tagger may be constructed by using a convolutional neural network (CNN). Such an audio tagger may operate on a spectral representation of an audio signal (e.g., a mel spectrogram) and may be configured to update a set of semantic labels or “tags” along with confidence intervals for these tags with a selected or pre-defined time resolution (e.g., 1 second). The CNN may be trained to provide an output comprising indications of tags (within a large set of all possible tags) corresponding to audio objects present in that segment along with another output representing a confidence level for the tags indicating a probability of object detection. For example, a value close to 1.0 may indicate a high level of confidence and a value close to zero may indicate a low level of confidence. Such a confidence level may be determined according to the estimated object detection probability, which may be computed by the same CNN. In some examples, the audio tagger may be based on the so-called YAMNet architecture operating on a mel spectrogram of an audio signal. YAMNet is a deep neural network that can classify over 500 different types of sound sources, including a siren, dog barks, laughter, etc. The YAMNet architecture can be obtained by using the MobileNet-v1 architecture disclosed in Howard, Andrew G., et al., “MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications” (arXiv preprint arXiv:1704.04861 (2017), https://arxiv.org/pdf/1704.04861), which is hereby incorporated by reference. Such an architecture facilitates assigning semantic labels from a large set of semantic labels to segments of a waveform.
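As one concrete, non-limiting possibility, the publicly released YAMNet checkpoint can be loaded from TensorFlow Hub and used as an audio tagger roughly as sketched below. The model URL, the 16 kHz mono input requirement and the class-map CSV are properties of that particular published model, and exact API details may vary between releases.

```python
import csv
import numpy as np
import tensorflow_hub as hub

# Load the published YAMNet model (requires network access to TensorFlow Hub on first use).
yamnet = hub.load("https://tfhub.dev/google/yamnet/1")

def tag_waveform(waveform: np.ndarray, top_k: int = 5):
    """Return (label, confidence) pairs for a mono 16 kHz float32 waveform scaled to [-1, 1]."""
    scores, embeddings, log_mel = yamnet(waveform)
    mean_scores = scores.numpy().mean(axis=0)             # average over the ~0.48 s analysis frames
    with open(yamnet.class_map_path().numpy()) as f:       # class names shipped with the model
        names = [row["display_name"] for row in csv.DictReader(f)]
    top = mean_scores.argsort()[::-1][:top_k]
    return [(names[i], float(mean_scores[i])) for i in top]
```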
[0239] The signal level for individual audio sources, which may be referred to herein as audio objects, may be determined as follows. An instance of an audio source separator, which may be referred to as a “first source separator,” may be implemented by a control system, such as the control system of a device used for audio and video capture. The first source separator may, in some examples, be used to find a distribution of signal level (or signal energy) between all foreground components of the audio scene and the background components. Then, knowing the total signal level of the foreground audio sources and the set of foreground audio sources present in a segment of an audio signal, the control system may estimate the signal levels of the individual foreground audio sources by weighting the total foreground level according to the detection probabilities of those sources. The detection probabilities may be provided by analysis of the video signal associated with the audio signal, for example by detecting which talkers are active according to whose mouths are moving while apparent speech is detected.
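One plausible reading of this weighting step is sketched below: the total foreground level produced by the first source separator is split across the individual foreground sources in proportion to their (video-informed) detection probabilities. The proportional normalization is an assumption; other weightings are possible.

```python
def distribute_foreground_level(total_foreground_level, detection_probs):
    """Split the total foreground level over sources in proportion to detection probability.

    detection_probs: mapping from source label to detection probability in [0, 1].
    """
    total_prob = sum(detection_probs.values())
    if total_prob == 0.0:
        return {source: 0.0 for source in detection_probs}
    return {source: total_foreground_level * p / total_prob
            for source, p in detection_probs.items()}

# Example: two foreground talkers; video analysis suggests talker #1 is the active one.
levels = distribute_foreground_level(0.8, {"human talker #1": 0.9, "human talker #2": 0.3})
# -> {"human talker #1": 0.6, "human talker #2": 0.2}
```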
[0240] In some embodiments, object detection in the video associated with the audio scene may be used to perform an adjustment of the signal energy distribution. For example, if multiple talkers are simultaneously present in the scene, video analysis can be used to identify the active talkers and the instantaneous signal energy associated with the talkers may be distributed between the active talkers. Processing may continue from block 564 to block 565.
[0241] Block 565 involves “outputting an inventory of audio sources.” In this example, the inventory of audio sources corresponds with at least a portion of the data structure generated in block 564. One such inventory of audio sources is described below with reference to Figure 5D. In some examples, block 565 may involve causing, by the control system, at least a portion of the inventory of audio sources to be displayed, such as in a GUI. The audio source summary areas 415a, 415b and 415c of Figures 4A-4D provide examples of such an audio source inventory display. According to some examples, block 565 may involve storing the data structure generated in block 564.

[0242] Figure 5C shows a table that represents example elements of a video object data structure according to some disclosed implementations. In some examples, the table 570 may be generated in block 562 of Figure 5B. In this example, the table 570 includes fields 572, 574, 576 and 578. The number of fields, types of fields and order of fields shown in table 570 are merely examples. For example, in some implementations a video object data structure may include an estimation probability for each estimated video object type, a time stamp, etc. In some examples, a video object data structure may not include a prediction of whether a video object is a background object or a foreground object. Moreover, the field headings (“video object ID,” etc.) are merely presented for the viewer’s convenience: an actual video object data structure would not necessarily include such field headings.
[0243] In this example, the table 570 shows information regarding video objects in a cafe or cafeteria environment, such as that shown in Figures 4D and 4E. In Figure 5C, the video object ID field 572 includes an identification number for each estimated video object. Here, the estimated video object type field 574 indicates an estimated video object type for every identified video object in the table 570. According to some examples, the video scene classification of block 558b may correspond with a set of possible video object types. For example, the video scene classification of block 558b may have indicated that the video scene corresponds to a cafe or cafeteria environment and the set of possible video object types may include objects that are commonly found in a cafe or cafeteria environment.
[0244] According to this example, field 576 indicates the control system’s estimation of whether, at the particular time corresponding to the table 570, each video object type is a potential audio source. In this example, potted plants, empty chairs and empty tables are not considered to be potential audio sources, whereas people and clocks are considered to be potential audio sources.
[0245] In this example, field 578 indicates the control system’s estimation of whether, at the particular time corresponding to the table 570, each actual or potential audio source is likely to be a foreground audio source or a background audio source. Referring to Figure 4D, the people shown in the background of the video clip 401a, near the counter, may be regarded as actual or potential background audio sources, whereas the person 405a near the center of the video clip 401a may be regarded as an actual or potential foreground audio source.

[0246] Figure 5D shows a table that represents example elements of an audio source inventory data structure according to some disclosed implementations. In some examples, the table 580 may be generated in block 564 or block 565 of Figure 5B. In these examples, the table 580 includes fields 582, 584, 586, 588, 590 and 592. The number of fields, types of fields and order of fields shown in table 580 are merely examples. For example, in some implementations a sound source inventory data structure may include an estimation probability for each estimated audio source type, a time stamp, etc. Other examples of the table 580 may not include a field indicating how an actual or potential audio source was detected. Moreover, the field headings (“audio source ID,” etc.) are merely presented for the viewer’s convenience: an actual audio source inventory data structure would not necessarily include such field headings.
[0247] In these examples, the table 580 shows information regarding audio sources in a cafe or cafeteria environment, such as that shown in Figures 4D and 4E. In Figure 5D, the audio source ID field 582 includes an identification number for each estimated audio source. Here, the estimated audio source type field 584 indicates an estimated audio source type for every identified audio source in the table 580. According to some examples, the audio scene classification of block 558a may correspond with a set of possible audio source types. In these examples, the video scene classification of block 558b has indicated that the video scene corresponds to a cafe or cafeteria environment and the set of possible audio source types includes objects that are commonly found in a cafe or cafeteria environment.
[0248] According to these examples, field 586 indicates how each actual or potential audio source was detected. For example, the “cafeteria noise” of audio source 1 and the “background conversations” of audio source 5 were primarily detected according to audio data received in block 555a, but were expected due to the video scene and video object classifications of blocks 558b and 560. In these examples, audio source 3 was only detected in the audio data, whereas audio source 6 was only detected in the video data.
[0249] In these examples, field 588 indicates the control system’s estimation of whether, at the time corresponding to the table 580, each actual or potential audio source is likely to be a foreground audio source or a background audio source. According to these examples, field 590 indicates the control system’s estimation of whether, at the particular time corresponding to the table 580, each actual audio source is likely to be audible to a human listener. In these examples, field 592 indicates the control system’s estimation of the audio signal level of each actual or potential audio source at the time corresponding to the table 580. According to these examples, the audio signal levels are indicated on a scale from zero to ten, with estimated audio values rounded up or down to whole integers. For example, an estimated audio signal level of 0.45 would be rounded down to 0 and an estimated audio signal level of 0.55 would be rounded up to 1. In some examples, a threshold may be applied to the estimated audio signal levels to estimate whether each actual audio source is likely to be audible to a human listener. For example, the threshold may be 0.3, 0.5, 0.8, 1.0, etc. According to some examples, the threshold may vary depending on the level of one or more other audio sources, such as one or more background noise levels. Other examples may involve different ranges of audio signal levels, such as zero to one, zero to one hundred, etc.
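The rounding and audibility test described above can be expressed compactly; the sketch below assumes the 0-10 display scale and a fixed threshold, both of which are examples rather than requirements.

```python
import math

def display_level(raw_level: float) -> int:
    """Round an estimated level on the 0-10 scale half-up, so 0.45 -> 0 and 0.55 -> 1."""
    return int(math.floor(raw_level + 0.5))

def is_audible(raw_level: float, threshold: float = 0.5) -> bool:
    """Simple fixed-threshold audibility estimate; the threshold could instead depend on background levels."""
    return raw_level >= threshold
```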
[0250] Figure 6 shows examples of modules that may be implemented according to some disclosed examples. In this example, the control system portion 606 is configured for implementing a first audio source separation module 605, a video object classifier 608, which includes a video object detection module 610 and a video object confidence estimation module 620, an audio source level estimation module 615, an analysis module 625, a GUI module 630, and a file generation module 635. According to some examples, the inputs and outputs of the modules that are shown in Figure 6 may be provided during a capture phase, such as the capture phase 116 that is described with reference to Figures 1C and 2A. In some such examples, the control system portion 606 may be a portion of the control system of a capture device or of a capture system that includes more than one device.
[0251] In this example, the first audio source separation module 605 is configured to determine and output separated audio source data 607a-607n for each of n separate audio sources based, at least in part, on received audio data 601 from a microphone system and received side information 604. In some examples, n may be an integer of two or more. As noted elsewhere herein, the term “audio source” has the same meaning as the term “sound source” in this disclosure. According to some examples, the first audio source separation module 605 may be configured to perform what is referred to herein as a “first audio source separation process” or a “first sound source separation process.” The first sound source separation process may, in some examples, be implemented by a neural network, for example, a neural network trained to separate a specific type of audio signal from a mix containing this audio signal and some arbitrary background audio. Such a neural network may, in some examples, be trained using a training objective based on regression. The architecture of the first source separator may, in some examples, depend on the category of an audio scene. For a scene involving human talkers, the first source separator may, for example, be an instance of the speech versus background separator that is disclosed in U.S. Patent Publication No. 2023368807A1, which is hereby incorporated by reference.
[0252] More generally, a first sound source separator may be implemented by imposing — by a control system configured to implement the first audio source separation module 605 — a set of typical foreground objects (e.g., human talkers, musical instruments, etc.), and designing for that set specialized foreground-versus-background audio source separators. An instance of such a separator for a specific signal category may include a neural network model operating on a time-frequency representation of an audio signal, trained to generate separated audio from unseparated audio in that time-frequency representation. In some examples, such a network may be implemented using a convolutional neural network architecture that was developed for image segmentation, such as a U-Net autoencoding framework. An instance of such a sound source separator for a specific signal category may be configured for estimating a pair of separated signals, where one signal contains all foreground objects belonging to a category and the other contains all the background. The estimation of audio source levels may, for example, be performed by measuring signal levels of the separated audio components of the mixed signal in the input to the first audio source separation module 605.
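A heavily simplified sketch of such a foreground-versus-background separator is given below: a complex spectrogram is computed, a small convolutional network (standing in for a trained U-Net) predicts a foreground mask in the time-frequency domain, and the masked spectrograms are inverted back to waveforms. The layer sizes, the STFT parameters and the absence of skip connections are all simplifying assumptions; a deployed separator would be trained per signal category with a regression objective, as described above.

```python
import torch
import torch.nn as nn

class MaskSeparator(nn.Module):
    """Toy stand-in for a U-Net-style separator: predicts a foreground mask over a magnitude spectrogram."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid(),  # mask values in [0, 1]
        )

    def forward(self, magnitude):          # magnitude: [batch, 1, freq, frames]
        return self.net(magnitude)

def separate(waveform: torch.Tensor, model: MaskSeparator, n_fft: int = 1024, hop: int = 256):
    """Return (foreground, background) waveform estimates for a mono waveform tensor."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(waveform, n_fft, hop_length=hop, window=window, return_complex=True)
    mask = model(spec.abs()[None, None])[0, 0]          # [freq, frames]
    foreground = torch.istft(spec * mask, n_fft, hop_length=hop, window=window,
                             length=waveform.shape[-1])
    background = torch.istft(spec * (1.0 - mask), n_fft, hop_length=hop, window=window,
                             length=waveform.shape[-1])
    return foreground, background
```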
[0253] According to some examples, the source separation module 605 may be configured to select a particular instance of the sound source separator by using the side information 604 to define one or more sound source separation targets. The side information 604 may be provided by the user or may be provided automatically by the control system based on analyzing the content of input video and audio data.
[0254] In some such examples, a post-capture editing process, such as the post-capture editing process 124 of Figure 1C, may involve performing a second sound source separation process, for example by one or more servers. This is potentially advantageous because the second sound source separation process may be more accurate but may be too computationally intensive for the capture device to perform during the pre-capture phase 114 or the capture phase 116.
[0255] According to some examples, the side information 604 may be, or may include, audio source labels assigned by an audio source classifier, video object source labels assigned by a video object classifier, or both. The audio source classifier and video object classifier may, in some instances, be one or more of those described with reference to Figure 5B. For example, the audio source classifier may, in some instances, be configured to perform block 558a of Figure 5B, block 562 of Figure 5B, block 564 of Figure 5B, or combinations thereof. The video object classifier may, in some instances, be configured to perform block 558b of Figure 5B, block 560 of Figure 5B, or both. In some examples, the separated audio source data 607a-607n may include separated audio components of the received audio data 601, as well as “tags” or other identification data, annotations, etc.
[0256] According to some implementations, the first audio source separation module 605 may be configured to identify foreground audio scene components and background audio scene components. In some such examples, the separated audio source data 607a-607n may include “tags” or other identification data, annotations, etc., indicating which separated audio components of the received audio data 601 are estimated to be foreground audio scene components and which are estimated to be background audio scene components. Accordingly, in some instances the first audio source separation module 605 may be configured to implement what is referred to herein as an “audio tagger.”
[0257] As disclosed elsewhere herein, in some examples an audio tagger may be, or may include, a convolutional neural network (CNN) implemented by a control system. Some such audio taggers may operate on a spectral representation of an audio signal (e.g., a mel spectrogram) and may update a set of tags along with confidence intervals for these tags with some pre-defined time resolution (e.g., 1 second). The CNN may be trained to provide an output including indications of tags (within a large set of all possible tags) corresponding to audio objects present in that segment along with another output representing a confidence level for the tags. The output representing the confidence level may, for example, be a probability of object detection, such as a value between zero and one, where a value closer to 1.0 would indicate a large confidence and a value close to zero would indicate a low confidence. Such a confidence level may be determined according to the estimated object detection probability, which may be computed by the same CNN. According to some examples, the audio tagger may be implemented as a trained neural network constructed according to the so-called YAMNet architecture, as described in more detail elsewhere herein.

[0258] In this example, the audio source level estimation module 615 is configured to estimate an audio signal level for each audio component of the separated audio source data 607a-607n, and to output corresponding audio signal levels 617a-617n to the analysis module 625. In some examples, the separated audio source data 607a-607n also may be provided to the analysis module 625. According to some examples, the audio signal levels 617a-617n may be, or may include, instantaneous audio signal levels. In some examples, the audio signal levels 617a-617n may be, or may include, audio signal levels that are averaged over a time interval, which may be a fixed time interval or a variable time interval. According to some examples, the time interval may be on the order of tens of milliseconds, hundreds of milliseconds, seconds, tens of seconds, etc. According to some implementations, the audio source level estimation module 615 may be configured to find the distribution of signal level (or signal energy) between all foreground components of the audio scene and the overall signal level of the background audio scene components. In some implementations, the first audio source separation module 605 may be configured for both audio source separation and for audio source level estimation.
[0259] According to this example, the video object classifier 608 is configured to detect objects in currently-obtained video data. The video object classifier 608 may, for example, be implemented by a CNN. In some examples, the video object classifier 608 may implement one or more of the methods described in J. Redmon, et al., “You Only Look Once: Unified, Real-Time Object Detection” (arXiv:1506.02640v5 [cs.CV] May 9, 2016), which is hereby incorporated by reference, or another applicable video-based object detection method.
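Any off-the-shelf detector could play the role of the video object classifier 608; the sketch below uses a pretrained Faster R-CNN from torchvision purely as an illustration (the weights argument assumes torchvision 0.13 or later). The returned boxes, labels and scores correspond to the video object data and confidence levels discussed below.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Pretrained COCO detector used only as a stand-in for the video object classifier.
detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

def detect_objects(frame: torch.Tensor, score_threshold: float = 0.5):
    """frame: float tensor [3, H, W] with values in [0, 1]. Returns (label_id, score, box) triples."""
    with torch.no_grad():
        output = detector([frame])[0]   # dict with 'boxes', 'labels' and 'scores'
    keep = output["scores"] >= score_threshold
    return list(zip(output["labels"][keep].tolist(),
                    output["scores"][keep].tolist(),
                    output["boxes"][keep].tolist()))
```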
[0260] In this example, the video object detection module 610 is configured to determine video objects and output corresponding estimated video object data 612a-612v for each of v separate video objects based, at least in part, on received video data 602 from a camera system and received side information 604. In some examples, v may be an integer of two or more. As noted elsewhere herein, in some examples the side information 604 may be, or may include, video object source labels assigned by a video object classifier. The video object classifier may, in some instances, be configured to perform block 558b of Figure 5B, block 560 of Figure 5B, or both.
[0261] According to this example, the video object confidence estimation module 620 is configured to estimate a confidence level for each of the video objects corresponding to the video object data 612a-612v output by the video object detection module 610, and to output corresponding video object confidence levels 622a-622v to the analysis module 625. In some examples, the video object confidence levels 622a-622v may vary from zero to one, with zero being the lowest confidence level and one being the highest confidence level. Other examples may use other numerical ranges. In some examples, the video object data 612a-612v also may be provided to the analysis module 625.
[0262] In this example, the analysis module 625 is configured to output, to the GUI module 630, audio scene analysis information 627 based at least in part on received video object confidence levels 622a-622v and audio signal levels 617a-617n. In some examples, the audio scene analysis information 627 may be based at least in part on side information 604. According to some examples, the audio scene analysis information 627 may be based at least in part on video data 602 received from the camera system, video object data 612a-612v, separated audio source data 607a-607n, or combinations thereof. The analysis module 625 may, for example, be implemented by one or more CNNs.
[0263] The audio scene analysis information 627 may, for example, include audio source inventory data and audio scene analysis metadata. The audio source inventory data may, in some examples, include a data structure similar to the audio source inventory data structure that is described with reference to Figure 5D. The audio scene analysis metadata may, for example, include audio source labels, audio source level data, audio source location data, combinations thereof, etc. In some examples, the audio source level data may include data indicating the audio source signal for each audio source, or group of audio sources, over a time interval (see the description of the average audio source level indications 433 of Figures 4D and 4E). The audio source location data may, for example, include location data relative to the current video scene, such that audio object locations may be associated with one or more corresponding people, objects, etc., in the video scene. The audio scene analysis information 627 may, in some examples, include audio source category information that groups the audio sources in an inventory of audio sources into two or more audio source categories, which may include a foreground category and a background category. Further details of one example of the analysis module 625 are provided below with reference to Figure 7.
[0264] According to this example, the GUI module 630 is configured for audio scene visualization and user input collection via one or more types of GUIs. Accordingly, in this example, the GUI module 630 is configured for controlling one or more displays of a display system to present one or more types of GUIs. In some examples, the one or more types of GUIs may be overlaid on images corresponding to video data 602 that is received during a capture phase or a pre-capture phase, for example as described with reference to Figures 3A-5A. At least one of the GUIs may include an audio source image corresponding to at least one audio characteristic of one or more audio sources indicated by the audio scene analysis information 627.
[0265] In this example, at least one GUI provided by the GUI module 630 includes one or more user input areas for receiving user input. Accordingly, in this example, the GUI module 630 is configured for receiving user input 628, for example via one or more touch screen locations corresponding to the one or more user input areas. In some examples, the user input 628 may indicate a user’s desire to increase or decrease a level of one or more audio sources, such as a level of a talker whose image is being presented concurrently with a GUI. In some instances, the user input 628 may indicate a selected potential sound source or a selected candidate sound source for augmented audio capture. The augmented audio capture may, for example, involve replacement of a selected candidate sound source by external audio or synthetic audio, the addition of external audio or synthetic audio for a selected potential sound source, or combinations thereof. In some examples, the augmented audio capture may involve a beamforming process to enhance the audio of a selected candidate sound source.
[0266] According to this example, the GUI module 630 — or another module that is implemented by the control system portion 606 — is configured for generating and storing metadata corresponding with one or more types of received user input. At least one of the one or more types of GUIs may, in some examples, be similar to one or more of those shown in Figures 3A and 4A-4E.
[0267] According to this example, the file generation module 635 is configured to generate an audio asset stream 637 that includes the audio data 601 and video data 602 that is received during a capture phase. According to some implementations, the file generation module 635 — or another component of the control system portion 606 — may be configured to store an audio asset corresponding to the audio asset stream 637. In some implementations, the audio asset stream 637 may include user input metadata, audio scene analysis metadata, or both. According to some implementations, the audio asset stream 637 may include modified audio data, modified video data, or both, that are modified during a capture phase.

[0268] The context of an audio scene may be associated with one or more assumptions about the intent of a capture device user regarding a particular scene, for example regarding what audio source or sources the user regards as being the most important. Such an intent may be indicated by the user’s interaction with a GUI, such as a GUI provided by the GUI module 630 of Figure 6. For example, the intent of a capture device user regarding a particular scene may be indicated by received user input indicating the user’s intention to focus on human talkers detected in the scene. In some examples, user input may indicate that a user is relatively more interested in another type of sound source or potential sound source, such as a musician, an animal, a vehicle, an aircraft, a fountain or waterfall, etc. In some instances, user input indicating a user’s interest may be received via a GUI, such as input regarding a change in an audio source (for example, input regarding a desired signal level increase). Alternatively, or additionally, user input indicating a user’s interest may be, or may correspond with, a user’s control of a camera system, such as zooming in on a particular video object (such as a person), centering a particular video object in the current video frame, etc.
[0269] In the absence of, or in addition to, user input that indicates a user’s main sound source or potential sound source of interest, in some implementations a control system (such as the control system of a capture device or a capture system) may be configured to select what the control system deems to be the most likely hypothesis regarding the context of the audio scene. Alternatively, or additionally, in some examples even if the control system has received one or more previous indications of the user’s intent, the control system may evaluate the most likely hypothesis regarding the current context of the audio scene and whether the user’s focus of attention may have changed. The control system may provide such a context hypothesis evaluation when, for example, a determined time interval has elapsed since the last clear indication of the user’s intent, when the indication of the user’s intent is ambiguous, etc.
[0270] Figure 7 shows example components of an analysis module that is configured for context hypothesis evaluation according to some disclosed examples. In this example, the analysis module 625 is an instance of the analysis module 625 of Figure 6, is implemented by the control system portion 606 and is configured to generate audio scene analysis information 627. According to this example, the analysis module 625 includes an audio scene analysis module 705, a context hypothesis evaluation module 710 and an audio scene inventory filtering module 715.

[0271] In this example, the audio scene analysis module 705 is configured to generate scene analysis data 707 based at least in part on side information 604, audio signal levels 617a-617n for corresponding audio sources a-n and video object confidence levels 622a-622v for corresponding video objects a-v. In some examples, the audio scene analysis module 705 may be implemented via a CNN. According to some examples, the audio scene analysis module 705 may be configured to generate the audio scene analysis information 627 based at least in part on video data 602 received from the camera system. The side information 604 may, for example, include audio source labels assigned by an audio source classifier, video object source labels assigned by a video object classifier, or both. The scene analysis data 707 may, for example, include a set of audio scene components, which may be referred to as a “first set of audio scene components.” The scene analysis data 707 may, in some examples, include audio source labels, audio source levels, and audio source coordinates (e.g., relative to images in the input video data).

[0272] According to some examples, the audio scene analysis module 705 may be configured to estimate correspondences, or lack thereof, between the audio sources a-n and the video objects a-v. For example, the audio scene analysis module 705 may be configured to detect (for example, based on the video data 602) that the mouths of one or more individuals in the audio scene foreground are currently moving at a time when one or more audio sources corresponding to speech have been detected in the audio data 601 received from a microphone system. The audio scene analysis module 705 may be configured to estimate which current “talker” audio source corresponds to each of the one or more individuals whose mouths are currently moving. Similarly, in some instances there may be two audio scene foreground talkers detected in the audio data 601, but the mouth of only one individual in the audio scene foreground may be currently moving. In some examples, the audio scene analysis module 705 may estimate that the other audio scene foreground talker is an interviewer and/or is a person who is currently operating a capture device. However, in some alternative examples, the audio scene analysis module 705 may simply categorize the unseen foreground talker as a talking person not in the video scene, as shown in Figure 5D. This type of situation corresponds with the “talking offscreen” audio source label example in the GUI 400 of Figures 4D and 4E.
[0273] In some implementations, the audio scene analysis module 705 may be configured to determine whether one or more aspects of the current audio are inconsistent with a video-based analysis. In some such examples, the scene analysis data 707 may indicate whether one or more aspects of the current audio are inconsistent with a video-based analysis. According to some such examples, the audio scene analysis module 705 may be configured to make comparisons between audio source levels and video object confidence levels. For example, if the video object confidence level for a particular object is greater than a video object confidence threshold and a corresponding audio source signal level is less than an audio source signal threshold, the scene analysis data 707 may indicate (e.g., in metadata, by setting a flag, by setting a value, etc.) that there is currently an inconsistency between an audio-based analysis and a video-based analysis of a particular audio source or potential audio source. Such an inconsistency may be an indication that one or more aspects of the audio capture system are operating near or below a desired performance threshold.
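A minimal form of this consistency check is sketched below; the particular threshold values are illustrative assumptions, and a real implementation might also require the condition to hold over several consecutive frames before the inconsistency is flagged.

```python
def flag_av_inconsistency(video_confidence: float, audio_level: float,
                          video_threshold: float = 0.8, audio_threshold: float = 0.1) -> bool:
    """True when video strongly suggests a source that is barely (or not) present in the audio,
    e.g. a clearly visible musician whose instrument is missing from the captured signal."""
    return video_confidence > video_threshold and audio_level < audio_threshold
```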
[0274] According to this example, the context hypothesis evaluation module 710 is configured to estimate a current scene context hypothesis 712 based, at least in part, on the scene analysis data 707 and side information 604. According to some examples, the context hypothesis evaluation module 710 may be configured to access a memory in which a list of context hypotheses is stored.
[0275] The list of context hypotheses may, for example, include one or more contexts involving the capture of audio and video corresponding to human talkers, one or more contexts involving the capture of audio and video corresponding to musical instruments, one or more contexts involving the capture of audio and video corresponding to a nature scene, etc. In some such examples, each hypothesis may be associated with the presence of a set of audio sources (also referred to herein as audio objects or sound sources) that are typically associated with the corresponding context. For example, if the context is “urban sidewalk” and involves the capture of audio and video corresponding to human talkers, the set of audio sources may include foreground talkers, background talkers, automobiles, buses, trucks, dogs, sirens, etc. In some such examples, the context hypothesis evaluation module 710 may be configured to score each hypothesis based on the presence of corresponding audio or video objects and the confidence of their detection.
[0276] In some such examples, the context hypothesis evaluation module 710 may be configured to score each hypothesis using a predefined utility function. According to some such examples, the context hypothesis evaluation module 710 may be configured to select the hypothesis for which the utility function achieves its maximum value. For example, a utility function L_s(X_A, X_S) for a scene s can be evaluated given the audio scene analysis data 707 and the side information 604, where the audio scene analysis data is used to derive a set X_A of detection probabilities of (expected) audio sources (typically) associated with that scene, and the side information data is used to derive a set of (expected) non-audio objects X_S (typically) associated with such a scene (e.g., objects detected in the video feed). In general, the sets X_A and X_S are predefined for possible scene categories (by listing typical objects expected for such a scene). However, the number of objects included in X_A and X_S may differ for different scenes. Let us denote the number of objects in the respective sets by |X_A| and |X_S|, and denote that an object is included in the set X by x_i ∈ X. Then, a utility function may be computed as L_s(X_A, X_S) = (1/|X_A|) Σ_{x_i ∈ X_A} p(x_i) + (1/|X_S|) Σ_{x_i ∈ X_S} p(x_i), where p(x_i) represents the detection probability of object x_i. If object x_i is not detected in a scene, p(x_i) = 0; otherwise the detection probability can be estimated in the audio scene analysis module 705. The context hypothesis evaluation module 710 may contain several predefined functions L_s(X_A, X_S), each of them associated with a different number of objects in the sets X_A and X_S. However, due to normalization with respect to |X_A| and |X_S|, the different functions can be compared, and the context hypothesis evaluation module 710 may classify the scene as the one represented by the utility function achieving the largest value of L_s(X_A, X_S). For example, a scene associated with capture in a restaurant environment may comprise X_A for expected objects such as [human talkers, restaurant noise, background music, etc.], while a scene associated with a street capture may be associated with X_A for expected objects such as [street noise, car horn, human talkers, wind, etc.]. Similarly, for a scene associated with capture in a restaurant environment, the set X_S may comprise objects such as [dining table, person, etc.], and for a scene associated with a street capture, the set X_S may comprise objects such as [car, bus, street lamp, etc.]. Such sets of typical objects may be predefined for the most likely categories of scenes where audio capture will be performed (e.g., “outdoor nature”, “street”, “indoor restaurant”, “sports”, “default”, etc.).
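Under the reading of the utility function given above (a detection-probability average over the expected audio objects plus the same over the expected non-audio objects), the hypothesis scoring might be sketched as follows. The scene dictionaries and the probability values in the example are illustrative only.

```python
SCENE_HYPOTHESES = {
    "indoor restaurant": {
        "audio": ["human talker", "restaurant noise", "background music"],
        "video": ["dining table", "person"],
    },
    "street": {
        "audio": ["street noise", "car horn", "human talker", "wind"],
        "video": ["car", "bus", "street lamp"],
    },
}

def utility(expected, audio_probs, video_probs):
    """L_s(X_A, X_S): mean detection probability over expected audio objects plus the same for video objects."""
    audio = [audio_probs.get(name, 0.0) for name in expected["audio"]]
    video = [video_probs.get(name, 0.0) for name in expected["video"]]
    return sum(audio) / len(audio) + sum(video) / len(video)

def classify_context(audio_probs, video_probs):
    """Pick the scene hypothesis whose utility function achieves the largest value."""
    return max(SCENE_HYPOTHESES,
               key=lambda scene: utility(SCENE_HYPOTHESES[scene], audio_probs, video_probs))

# Example: strong talker and restaurant-noise detections favour the restaurant hypothesis.
best = classify_context({"human talker": 0.9, "restaurant noise": 0.7},
                        {"dining table": 0.8, "person": 0.95})
```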
[0277] In this example, the audio scene inventory filtering module 715 is configured to generate audio scene analysis information 627 based at least in part on side information 604, scene analysis data 707 and a current scene context hypothesis 712. In some examples, the audio scene inventory filtering module 715 may be implemented via a CNN. The audio scene analysis information 627 may be, or may include, what will be referred to herein as a “second set of audio scene components.” The audio scene analysis information 627 may, in some examples, include audio source labels, audio source levels, audio source coordinates (e.g., relative to images in the input video data) regarding the second set of audio scene components.
[0278] The second set of audio scene components may, in some instances, be a subset of the first set of audio scene components of the scene analysis data 707. According to some examples, the audio scene inventory filtering module 715 may be configured to prioritize and/or rank the first set of audio scene components to determine the second set of audio scene components according to current or recent audio scene activity, current or recent video scene activity, or current or recent user indications. The user indications may include user input, a current framing of the video scene, or both. The user indications may, for example, correspond with a person or object at the center of the video scene, a person or object on which the camera system is currently focusing, etc.
[0279] At least some of the second set of audio scene components may, in some instances, be represented in a GUI provided by the GUI module 630. The audio source information areas 430a, 430b, 430c and 430d of Figures 4D and 4E are examples. The second set of audio scene components (or a subset of the second set of audio scene components that is currently being shown in a GUI) may, in some instances, vary over time according to whether an audio source is currently providing audio that is estimated by the control system to be audible, according to whether an audio source has provided audio that is estimated by the control system to be audible during a recent time interval, etc. For example, the GUI 401a of Figure 4B indicates that a bus is one of the audio scene components. If bus sounds are not detected for a time interval, the control system may be configured to remove the audio source information area 420c from the GUI 401a and, in some examples, from the current “second set of audio scene components.”
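One way to realize this time-based pruning is sketched below, assuming each component of the second set records when it was last estimated to be audible; the ten-second timeout is an arbitrary example.

```python
def prune_inactive_sources(components, now_s, timeout_s=10.0):
    """Keep components that are currently audible or were audible within the last `timeout_s` seconds.

    components: list of dicts with at least 'audible' (bool) and 'last_audible_s' (float) keys.
    """
    return [c for c in components
            if c.get("audible", False) or now_s - c.get("last_audible_s", float("-inf")) <= timeout_s]
```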
[0280] Figure 8 is a flow diagram that outlines various example methods 800 according to some disclosed implementations. The example methods 800 may be partitioned into blocks, such as blocks 805, 810, 815 and 820. The various blocks may be described as operations, processes, methods, steps, acts or functions. The blocks of methods 800, like other methods described herein, are not necessarily performed in the order indicated. In some implementations, one or more of the blocks of methods 800 may be performed concurrently. Moreover, some implementations of methods 800 may include more or fewer blocks than shown and/or described. The blocks of methods 800 may be performed by one or more devices, for example, the device that is shown in Figure 1A.
[0281] Block 805 involves “retrieving, by a control system and from a memory system, an audio data asset file saved from a prior capture phase.” The control system may be, or may include, the CPU 141 of Figure 1A. The control system may, for example, include a general purpose single- or multi-chip processor, one or more digital signal processors (DSPs), one or more application specific integrated circuits (ASICs), one or more field programmable gate arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic, one or more discrete hardware components, or combinations thereof. According to some examples, the control system may be a control system of a server. In some examples, the audio data asset file may correspond to an audio asset stream 637 that was produced by the file generation module 635 of Figure 6 and was saved in a memory system. Accordingly, the audio data asset file may include the audio data and video data that is received during a capture phase, user input metadata, audio scene analysis metadata, etc. Processing may continue to block 810.
[0282] Block 810 involves “implementing, by the control system, a second audio source separation process on audio data from the audio data asset file.” In some examples, the second audio source separation process may be relatively more precise than an audio source separation process that was used during the capture phase. According to some examples, the second audio source separation process may be relatively more computationally intensive than an audio source separation process that was used during the capture phase. In some instances, the second audio source separation process may be implemented by a neural network trained by a regression-based process, which may be similar to the regression-based processes of some disclosed instances of the first audio source separation process that are described with reference to Figure 6. In some alternative examples, the second audio source separation process may be implemented by a generative neural network. In some embodiments, the second audio source separation process may be implemented by a control system (e.g., of a server) according to an algorithm that includes a set of specialized generative separators. Each generative separator of the set of specialized generative separators may, in some examples, be optimized for a specific audio signal category. The metadata about audio objects created by the capture system (such as scene analysis metadata, user input metadata, or both) may, in some examples, be used in the second audio source separation process to select an instance of a specialized audio separation algorithm from the set of specialized generative separators. Processing may continue to block 815.
[0283] Block 815 involves “creating, by the control system, a remix of the audio scene.” According to this example, the remix will include the audio sources and related audio data produced by the second audio source separation process of block 810. In some examples, the remixing process of block 815 — or a process prior to the remixing process of block 815 — may be based, at least in part, on user input metadata. For example, the user input metadata may include metadata corresponding to a desired level increase for a particular audio source. This level increase may be implemented in block 815 (or another block of method 800). In some examples, block 815 (or another block of method 800) may involve replacement of a candidate sound source (e.g., as indicated by user input metadata) by external audio or synthetic audio.
Processing may continue to block 820.
[0284] Block 820 involves “storing, by the control system, the remix of the audio scene as an updated audio asset file.” According to this example, block 820 involves storing the actual output of remix block 815 or a modified version of the output of block 815. In some examples, the updated audio asset file may be a backwards-compatible audio asset file that may include the output of remix block 815, context metadata and a copy of the audio data from the audio data asset file received in block 805. According to some examples, the updated audio asset file may be a standard media container, such as an MP4 file, an IVAS file, or a Dolby AC-4 file.
[0285] Figure 9 is a flow diagram that outlines various example methods 900 according to some disclosed implementations. The example methods 900 may be partitioned into blocks, such as blocks 905, 910, 915, 920, 925, 930 and 935. The various blocks may be described as operations, processes, methods, steps, acts or functions. The blocks of methods 900, like other methods described herein, are not necessarily performed in the order indicated. In some implementations, one or more of the blocks of methods 900 may be performed concurrently. Moreover, some implementations of methods 900 may include more or fewer blocks than shown and/or described. The blocks of methods 900 may be performed by one or more devices, for example, the device that is shown in Figure 1A.
[0286] Block 905 involves “receiving, by a control system of a device, audio data from a microphone system.” The control system may be, or may include, the CPU 141 of Figure 1A. The control system may, for example, include a general purpose single- or multi-chip processor, one or more digital signal processors (DSPs), one or more application specific integrated circuits (ASICs), one or more field programmable gate arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic, one or more discrete hardware components, or combinations thereof. In some examples, the microphone system may be a microphone system of a mobile device that includes the control system. Processing may continue to block 910.

[0287] Block 910 involves “receiving, by the control system, video data from a camera system.” In some examples, the camera system may include one or more cameras of a mobile device that includes the control system. Processing may continue to block 915.
[0288] Block 915 involves “creating, by the control system and based at least in part on the audio data, the video data, or both, an inventory of audio sources.” Audio sources may also be referred to herein as sound sources. The inventory of audio sources may, in some examples, be provided via, or stored (at least temporarily) as, a data structure similar to the audio source inventory data structure that is described with reference to Figure 5D. The inventory of audio sources may, in some examples, be provided as part of the audio scene analysis information 627 that is described with reference to Figure 6, or as part of the scene analysis data 707 that is described with reference to Figure 7. The inventory of audio sources may include one or more actual audio sources and one or more potential audio sources. Processing may continue to block 920.
[0289] Block 920 involves “selecting, by the control system, a subset of one or more selected audio sources from the inventory of audio sources.” In some examples, block 920 may involve selecting one or more audio sources for which audio data is currently being received, indicating that the one or more audio sources are currently emitting sound. According to some examples, block 920 may involve selecting one or more audio sources according to user input. In some examples, the selecting may involve estimating, by the control system, which audio sources in the inventory of audio sources are most significant sound sources. The subset of one or more selected audio sources may include audio sources estimated to be the most significant sound sources. According to some examples, the selecting may involve estimating which audio sources in the inventory of audio sources correspond to talkers. The subset of one or more selected audio sources may include audio sources estimated to be talkers. Processing may continue to block 925.

[0290] Block 925 involves “estimating, by the control system and based on the audio data, at least one audio characteristic of at least the one or more selected audio sources.” In some examples, block 925 may involve estimating a current level of each of the one or more selected audio sources, an average level of each of the one or more selected audio sources, one or more other audio characteristics of at least the one or more selected audio sources, or combinations thereof. Processing may continue to block 930.
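By way of example, the current and average levels of block 925 could be computed from the separated source signals with a simple RMS measure, as sketched below; the dB scale and the frame-averaging choice are assumptions rather than requirements.

```python
import numpy as np

def rms_level_db(samples: np.ndarray) -> float:
    """Instantaneous RMS level of one separated audio source, in dB relative to full scale."""
    rms = np.sqrt(np.mean(np.square(samples)) + 1e-12)
    return float(20.0 * np.log10(rms))

def average_level_db(frames) -> float:
    """Average level over a sequence of short frames (e.g. a few seconds of capture)."""
    return float(np.mean([rms_level_db(frame) for frame in frames]))
```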
[0291] Block 930 involves “storing, by the control system, audio data and video data received during a capture phase.” Processing may continue to block 935.
[0292] Block 935 involves “controlling, by the control system, a display of the device to display images corresponding to the video data and to display, prior to and during the capture phase, a graphical user interface (GUI) overlaid on the images, wherein the GUI includes an audio source image corresponding to the at least one audio characteristic of the subset of one or more selected audio sources.” In some examples, block 935 may involve controlling a display of the apparatus 101 to present a GUI like that shown in Figure 3A or one of the GUIs shown in Figures 4C-4E. The audio source level information areas 432a, 432b, 432c and 432d — and the average audio source level indications 433 — that are within the audio source information areas 430a, 430b, 430c and 430d, respectively, are examples of audio source images corresponding to the at least one audio characteristic.
[0293] In some examples, the GUI may include one or more user input areas configured to receive user input. The virtual sliders 440a and 440b of Figure 4C are examples of GUI user input areas configured to receive user input. In other examples, the one or more user input areas may include another type of virtual slider, a virtual knob, a virtual dial, etc.
[0294] According to some examples, method 900 may involve classifying, by the control system, the audio sources in the inventory of audio sources into two or more audio source categories. In some such examples, the GUI may include a user input area portion corresponding to at least one of the two or more audio source categories. In some examples, one of the audio source categories may be a foreground category corresponding to the one or more selected audio sources. According to some examples, one of the audio source categories may be a background category corresponding to one or more audio sources of the inventory of audio sources that were not in the subset of one or more selected audio sources.

[0295] In some examples, method 900 may involve performing, by the control system, a first sound source separation process. In some such examples, method 900 may involve updating, by the control system, an audio scene state based at least in part on the first sound source separation process, and causing, by the control system, the GUI to be updated according to an updated audio scene state. According to some examples, the creating process of block 915 may be based, at least in part, on the first sound source separation process. Some disclosed methods may involve performing post-capture audio processing on the audio data received during the capture phase. In some examples, the post-capture audio processing may involve a second sound source separation process that is more complex than the first sound source separation process.
[0296] According to some examples, method 900 may involve detecting, by the control system and based at least in part on the video data, one or more potential sound sources. In some such examples, at least one of the one or more potential sound sources may not be indicated by the audio data. For example, at least one of the one or more potential sound sources may be indicated by the video data. According to some examples, the inventory of audio sources may include the one or more potential sound sources.
[0297] In some examples, method 900 may involve detecting, by the control system, one or more candidate sound sources for augmented audio capture. In some such examples, the augmented audio capture may involve replacement of a candidate sound source by external audio or synthetic audio. According to some examples, the GUI may include at least one user input area configured to receive a user selection of a selected potential sound source or a selected candidate sound source. In some examples, the GUI may include at least one user input area configured to receive a user selection of augmented audio capture. The augmented audio capture may, for example, include external audio, synthetic audio, or both, for the selected potential sound source or the selected candidate sound source.
[0298] Some disclosed methods involve providing a GUI that includes at least one user input area configured to receive a user selection of a ratio between augmented audio capture and real-world audio capture. In some instances, the GUI may be provided during a post-capture editing process.
[0299] According to some examples, method 900 may involve causing, by the control system, the display to present audio source labels in the GUI. The audio source labels in the audio source information areas 430a, 430b, 430c and 430d of Figures 4D and 4E provide examples. In some examples, at least one of the audio source labels may correspond to an audio source identified by the control system based on the audio data, the video data, or both.
[0300] In some examples, method 900 may involve updating, by the control system, an estimate of a current audio scene and causing, by the control system, the GUI to be updated according to updated estimates of the current audio scene. Updating the estimate of the current audio scene may involve implementing, by the control system, an audio classifier, a video classifier, or both. In some examples, updating the estimate of the current audio scene may involve implementing, by the control system, an audiovisual classifier. According to some examples, the updated estimates of the current audio scene may include an updated level estimate for one or more audio sources.
[0301] Figure 10A represents elements of an audio source inventory according to some disclosed implementations. In some examples, the audio source inventory 1000 may be generated in block 564 or block 565 of Figure 5B, or in block 915 of Figure 9. According to some examples, the audio source inventory 1000 may be provided via, or stored (at least temporarily) as, a data structure similar to the audio source inventory data structure that is described with reference to Figure 5D. The audio source inventory 1000 may, in some examples, be provided as part of the audio scene analysis information 627 that is described with reference to Figure 6, or as part of the scene analysis data 707 that is described with reference to Figure 7. In these examples, the audio source inventory 1000 includes audio sources 1001 that are currently present in the received audio data and potential audio sources 1004 that are not currently present in the received audio data — or which are present but for which the corresponding audio data is below a threshold level — but which have been detected in the video feed. The audio sources 1001 include a subset of selected audio sources 1002, which have been selected for possible augmentation or replacement.
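One possible in-memory representation of such an inventory is sketched below. This is a non-limiting illustration; the field names, types, and the choice of Python dataclasses are assumptions made for the example rather than the data structure described with reference to Figure 5D.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AudioSourceEntry:
    source_id: int
    label: str                          # e.g., "talking person", "background music"
    current_level_db: Optional[float]   # None for sources with no current audio
    selected_for_modification: bool = False
    detected_in_video_only: bool = False

@dataclass
class AudioSourceInventory:
    audio_sources: List[AudioSourceEntry] = field(default_factory=list)           # cf. 1001
    potential_audio_sources: List[AudioSourceEntry] = field(default_factory=list)  # cf. 1004

    def selected_sources(self) -> List[AudioSourceEntry]:
        """Return the subset selected for possible augmentation or replacement (cf. 1002)."""
        return [s for s in self.audio_sources if s.selected_for_modification]
```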
[0302] In some instances, the selected audio sources 1002 may have been selected based, at least in part, on a user’s actions, such as user input via a GUI, a user’s having zoomed in on a particular video object, a user’s framing (e.g., centering) of a particular video object, etc. According to some examples, the selected audio sources 1002 may have been selected by a control system, such as a control system of a capture device. In some such examples, the control system may have selected one or more audio sources for which the corresponding audio data is of low quality (such as background music that is partially masked by background noise), which is estimated to be near or below a level of human audibility, etc. Alternatively, or additionally, the control system may have selected one or more potential or actual audio sources based, at least in part, on an estimated audio or video scene context. For example, if the context is “nature scene” or the like, an animal in the video feed that is not currently producing sound, or is producing sound that is below a threshold, may be selected for possible augmentation or replacement.
[0303] The number of elements, types of elements and order of elements shown in audio source inventory 1000 are merely examples. For example, in some instances the potential audio sources 1004 may include a subset of potential audio sources selected for association with synthetic audio data or with external audio data. The subset of potential audio sources may, for example, be selected according to an estimated audio scene context, user preferences, or combinations thereof.
[0304] Figure 10B is a flow diagram that outlines various example methods 1005 according to some disclosed implementations. The example methods 1005 may be partitioned into blocks, such as blocks 1010, 1015, 1020, 1025, 1030, 1035 and 1040. The various blocks may be described as operations, processes, methods, steps, acts or functions. The blocks of Figure 10B are performed during a capture phase according to some examples. The blocks of methods 1005, like other methods described herein, are not necessarily performed in the order indicated. In some implementations, one or more of the blocks of methods 1005 may be performed concurrently. Moreover, some implementations of methods 1005 may include more or fewer blocks than shown and/or described. The blocks of methods 1005 may be performed by one or more devices, for example, the device that is shown in Figure 1A.
[0305] Block 1010 involves “receiving, by a control system of a device, audio data from a microphone system and video data from a camera system.” The control system may be, or may include, the CPU 141 of Figure 1A. The control system may, for example, include a general purpose single- or multi-chip processor, one or more digital signal processors (DSPs), one or more application specific integrated circuits (ASICs), one or more field programmable gate arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic, one or more discrete hardware components, or combinations thereof. In some examples, the microphone system may be a microphone system of a mobile device that includes the control system. According to some examples, the camera system may include one or more cameras of a mobile device that includes the control system. In other examples, the microphone system, the camera system, or both, may reside in one or more devices other than the device that includes the control system. Processing may continue to block 1015.
[0306] Block 1015 involves “creating, by the control system and based at least in part on the audio data, the video data, or both, an inventory of audio sources.” The inventory of audio sources may, in some examples, correspond with the audio source inventory 1000 of Figure 10A. Accordingly, the inventory of audio sources may, in some examples, be provided via, or stored (at least temporarily) as, a data structure similar to the audio source inventory data structure that is described with reference to Figure 5D, may, in some examples, be provided as part of the audio scene analysis information 627 that is described with reference to Figure 6, or may, in some examples, be provided as part of the scene analysis data 707 that is described with reference to Figure 7. Processing may continue to block 1020.
[0307] Block 1020 involves “controlling, by the control system, a display of the device to provide a graphical user interface (GUI) including a representation of at least some audio sources of the inventory of audio sources.” In some examples, block 1020 may involve controlling a display of the apparatus 101 to present a GUI like that shown in Figure 3A or one of the GUIs shown in Figures 4C-4E. The audio source information areas 430a, 430b, 430c and 430d, respectively, are examples of “a representation of at least some audio sources of the inventory of audio sources.” In some examples, block 1020 — or another block of the methods 1005 — may involve selecting a subset of the audio sources in the inventory of audio sources. According to some such examples, the audio scene inventory filtering module 715 of Figure 7 may select the subset of audio sources. Processing may continue to block 1025.
[0308] Block 1025 involves “receiving, by the control system, and via the GUI, user input regarding augmentation or replacement of audio data corresponding to one or more selected audio sources.” In some examples, the GUI may include one or more user input areas configured to receive user input. According to some examples, block 1025 may involve receiving user input via a touch on the person 405a, on one of the other video objects or on one of the audio source information areas 420a, 420b or 420c shown in Figures 4A and 4B. In some examples, block 1025 may involve receiving user input via a touch on the person 405b, on one of the other video objects or on one of the audio source information areas 420d or 420e shown in Figure 4C. According to some examples, block 1025 may involve receiving user input via a touch on the person 405a, on one of the other video objects or on one of the audio source information areas 430a, 430b, 430c or 430d shown in Figures 4D and 4E. Processing may continue to block 1030. [0309] Block 1030 involves “creating, by the control system, metadata corresponding to the user input.” Processing may continue to block 1035.
[0310] Block 1035 involves “creating, by the control system, a media asset including the metadata, audio data and video data received during a capture phase.” In some examples, block 1035 may involve the operations of the file generation module 635 of Figure 6. Processing may continue to block 1040.
[0311] Block 1040 involves “storing, by the control system, the media asset in a memory.” According to some examples, block 1040 may involve storing the media asset in a memory of the capture device, storing the media asset in a memory of another device (such as a memory device of a cloud-based service), or both.
[0312] Figure 10C is a flow diagram that outlines additional example methods 1070 according to some disclosed implementations. The example methods 1070 may be partitioned into blocks, such as blocks 1075, 1080, 1085, 1090 and 1095. The various blocks may be described as operations, processes, methods, steps, acts or functions. The blocks of Figure 10C are performed during a post-capture editing phase according to some examples. The blocks of methods 1070, like other methods described herein, are not necessarily performed in the order indicated. In some implementations, one or more of the blocks of methods 1070 may be performed concurrently. Moreover, some implementations of methods 1070 may include more or fewer blocks than shown and/or described. The blocks of methods 1070 may be performed by one or more devices, for example, the device that is shown in Figure 1A.
[0313] Block 1075 involves “obtaining, by a control system, a media asset including metadata, audio data and video data received during a capture phase, the metadata corresponding to augmentation or replacement of audio data corresponding to one or more user-selected audio sources.” The control system may be, or may include, the CPU 141 of Figure 1A. The control system may, for example, include a general purpose single- or multi-chip processor, one or more digital signal processors (DSPs), one or more application specific integrated circuits (ASICs), one or more field programmable gate arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic, one or more discrete hardware components, or combinations thereof.
[0314] In some examples, block 1075 may involve obtaining metadata corresponding to a label generated by the control system indicating that one or more audio sources or potential audio sources have been selected for possible augmentation or replacement. In some such examples, the metadata may correspond to user input indicating a user’s selection of one or more audio sources or potential audio sources, whether during a capture phase or a previous post-capture editing phase. According to some examples, the metadata may correspond to user input indicating a user’s desire to alter at least one audio characteristic of an audio source, such as the level of the audio source.
[0315] According to some examples, block 1075 — or another block of the methods 1070 — may involve obtaining unmodified audio data corresponding to at least one audio source. In some such examples, block 1075 may involve obtaining unmodified audio data corresponding to all audio sources for which audio data were received during a capture phase. In some examples, block 1075 may involve obtaining modified audio data corresponding to at least one audio source. In some such examples, block 1075 may involve obtaining modified audio data corresponding to at least one audio source for which the audio data was previously augmented or replaced, at least in part, during a capture phase or a previous post-capture editing phase. Processing may continue to block 1080.
[0316] Block 1080 involves “obtaining, by the control system, synthetic audio data, external audio data, or both, for augmentation or replacement of audio data corresponding to a user-selected audio source.” In some examples, block 1080 may involve obtaining one or more types of replacement audio data from a memory, via downloading or streaming, etc. For example, block 1080 may involve obtaining audio data corresponding to music, audio data corresponding to pre-recorded sound effects (which may be referred to herein as Foley effects), such as ticking clock effects, closing door effects, breaking glass effects, etc. According to some examples, block 1080 may involve generating synthetic audio data, or obtaining generated synthetic audio data. In some such examples, the synthetic audio data may be generated by a neural network. Processing may continue to block 1085.
[0317] Block 1085 involves “augmenting or replacing, by the control system, the audio data corresponding to the selected audio source, to produce modified audio data comprising the augmented audio data, the replacement audio data, or both.” In some examples, block 1085 may involve increasing or decreasing a level of an audio source, for example according to metadata corresponding to a user’s desire to increase or decrease the level of the audio source. According to some examples, block 1085 may involve entirely replacing the audio data corresponding to the selected audio source, for example replacing recorded background music with stored, streamed or downloaded music. In some instances, block 1085 — or another block of the methods 1070 — may involve associating generated, stored, streamed or downloaded audio with a selected potential audio source. In some such examples in which the selected potential audio source is a video object (such as a door, a clock, an animal, a fountain, etc.) the associated audio may be placed in the audio scene such that the apparent location of the associated audio matches the location of the video object. Processing may continue to block 1090.
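A minimal, non-limiting sketch of the augment-or-replace step of block 1085 is shown below; the gain step and the simple sample-for-sample replacement are assumptions for illustration and do not represent the full range of processing contemplated in block 1085.

```python
from typing import Optional
import numpy as np

def modify_source_audio(source_audio: np.ndarray,
                        gain_db: float = 0.0,
                        replacement_audio: Optional[np.ndarray] = None) -> np.ndarray:
    """Return modified audio: a replacement track if one is supplied,
    otherwise a gain-adjusted copy of the original source audio."""
    if replacement_audio is not None:
        # Trim or zero-pad the replacement so it spans the original duration.
        out = np.zeros_like(source_audio)
        n = min(len(source_audio), len(replacement_audio))
        out[:n] = replacement_audio[:n]
        return out
    return source_audio * (10.0 ** (gain_db / 20.0))
```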
[0318] Block 1090 involves “mixing, by the control system, the modified audio data with audio data in the media asset corresponding to other audio sources, to produce a modified media asset.” According to some examples, the mixing process — or another associated process of the methods 1070 — may be an audio source mixing process that is known by those of skill in the art, which may include an equalization process, a balancing process, a compression process, a process of adding or reducing reverberation effects, etc. In some examples, the mixing may be as described with reference to block 815. Processing may continue to block 1095.
[0319] Block 1095 involves “storing, by the control system, the modified media asset in a memory.” According to some examples, block 1095 may involve storing the media asset in a memory of the capture device, storing the media asset in a memory of another device (such as a memory device of a cloud-based service), or both. In some examples, block 1095 may involve storing modified and unmodified audio data. According to some examples, block 1095 may involve storing one or more types of metadata, such as metadata generated during a capture process, metadata corresponding to aspects of a post-capture editing process, etc. Storing such metadata, along with video data and unmodified audio data, may produce what is referred to herein as a “backwards-compatible media asset.” The audio portion of a backwards-compatible media asset may be referred to herein as a “backwards-compatible audio asset.”
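The following sketch illustrates one way a backwards-compatible media asset could be organized in memory. It is an assumption-laden illustration (the field names and the dataclass layout are not specified in this disclosure): the unmodified audio and the video remain available to legacy players, while modified audio and metadata travel alongside them.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class BackwardsCompatibleMediaAsset:
    video_data: bytes
    unmodified_audio: bytes                 # usable by devices that ignore the extras
    modified_mixed_audio: Optional[bytes] = None
    context_metadata: Dict = field(default_factory=dict)
    revision_metadata: List[Dict] = field(default_factory=list)
```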
[0320] Figure 11A shows examples of media assets and a mixing module according to some disclosed examples. In these examples, Figure 11A shows a media asset 1105, a mixing module 1120 and a backwards-compatible media asset 1130. According to these examples, the media asset 1105 includes unmodified audio data 1105, modified unmixed audio data 1110 and context metadata 1115. In these examples, a control system portion 1106 is configured for implementing the mixing module 1120. According to these examples, the backwards-compatible media asset 1130 includes a copy of the unmodified audio data 1105, modified mixed audio data 1125 that is output by the mixing module 1120, and a copy of the context metadata 1115. In this example, the modified mixed audio data 1125 is in a format that is playable on various consumer devices, such as IVAS, Dolby AC-4, MP-4, etc.
[0321] In some examples, the media asset 1105 may have been stored after a capture phase, whereas in other examples the media asset 1105 may have been stored after a post-capture editing phase. Accordingly, the modified mixed audio data 1125 may have been modified — at least in part — during a capture phase, during a post-capture editing phase, or both. The modification may have involved augmentation, replacement, or both.
[0322] According to these examples, the mixing module 1120 is configured to mix the modified unmixed audio data 1110 according to input 1117, which may include user input, default settings, or both. The user input may, for example, correspond to user input received during a post-capture editing process. In some instances, the user input may be received responsive to a user’s preview of one or more audio files in the modified unmixed audio data, a previous mix produced by the mixing module, etc. According to some examples, N audio components may be provided to the mixing module 1120. In some such examples, the mixing module 1120 may output M components obtained by linear transformation of the N input components. Such a linear transformation would generally have N x M degrees of freedom (e.g., corresponding to a rectangular transformation matrix). The coefficients of the transformation matrix may be adjusted based on user input or default values may be used. In some examples, the mixing module 1120 may perform the transformation directly on time samples, such as Pulse Code Modulation (PCM) samples, whereas in other examples the mixing module 1120 may perform the transformation on time-frequency slots provided by a filterbank (e.g., a Quadrature Mirror Filter (QMF)).
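The N-to-M linear transformation described above can be written compactly as a matrix multiply applied to PCM samples. The fragment below is a simplified sketch with illustrative gain values; it omits the QMF-domain variant and any equalization, compression, or reverberation processing.

```python
import numpy as np

def mix_components(components: np.ndarray, mix_matrix: np.ndarray) -> np.ndarray:
    """components: shape (N, num_samples); mix_matrix: shape (M, N) of linear gains.
    Returns M output channels, each a weighted sum of the N input components."""
    return mix_matrix @ components

# Example: mix N = 3 separated components down to an M = 2 (stereo) output.
rng = np.random.default_rng(1)
components = rng.uniform(-0.2, 0.2, size=(3, 48000))   # one second at 48 kHz
mix_matrix = np.array([[1.0, 0.5, 0.7],                 # left-channel gains
                       [0.2, 0.9, 0.7]])                # right-channel gains
stereo_out = mix_components(components, mix_matrix)
```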
[0323] In these examples, the mixing module 1120 is configured to output the modified mixed audio data 1125 and to store the modified mixed audio data 1125 as part of the backwards-compatible media asset 1130.
[0324] In some examples, the mixing module 1120 may be configured to mix the modified unmixed audio data 1110 according to at least a portion of the context metadata 1115. In some examples, at least a portion of the context metadata 1115 may correspond to a scene context hypothesis 712 output by the context hypothesis evaluation module 710 of Figure 7, context metadata produced according to user input, or both.
[0325] According to these examples, the control system portion 1106 is also configured to copy the unmodified audio data 1105 and the context metadata 1115, and to store copies of the unmodified audio data 1105 and the context metadata 1115 as components of the backwards-compatible media asset 1130.
[0326] Figure 11B is a flow diagram that outlines various example methods 1150 according to some disclosed implementations. The example methods 1150 may be partitioned into blocks, such as blocks 1155, 1160, 1165, 1170, 1175 and 1180. The various blocks may be described as operations, processes, methods, steps, acts or functions. The blocks of methods 1150, like other methods described herein, are not necessarily performed in the order indicated. In some implementations, one or more of the blocks of methods 1150 may be performed concurrently. Moreover, some implementations of methods 1150 may include more or fewer blocks than shown and/or described. The blocks of methods 1150 may be performed by one or more devices, for example, the device that is shown in Figure 1A.
[0327] Block 1155 involves “receiving, by a control system of a device, audio data from a microphone system.” The control system may be, or may include, the CPU 141 of Figure 1A. The control system may, for example, include a general purpose single- or multi-chip processor, one or more digital signal processors (DSPs), one or more application specific integrated circuits (ASICs), one or more field programmable gate arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic, one or more discrete hardware components, or combinations thereof. In some examples, the microphone system may be a microphone system of a mobile device that includes the control system. Processing may continue to block 1160.
[0328] Block 1160 involves “receiving, by the control system, video data from a camera system.” In some examples, the camera system may include one or more cameras of a mobile device that includes the control system. In other examples, the microphone system, the camera system, or both, may reside in one or more devices other than the device that includes the control system. Processing may continue to block 1165.
[0329] Block 1165 involves “creating, by the control system and based at least in part on the audio data, the video data, or both, an inventory of audio sources.” The inventory of audio sources may, in some examples, be as described herein with reference to Figure 10A, as described herein with reference to block 1015 of Figure 10B, or both. Accordingly, in some examples the inventory of audio sources may include one or more actual audio sources and one or more potential audio sources. Processing may continue to block 1170.
[0330] Block 1170 involves “selecting, by the control system, one or more selected audio sources from the inventory of audio sources, wherein the one or more selected audio sources are selected for possible augmentation or replacement.” In some examples, the one or more selected audio sources may have been selected based, at least in part, on a user’s actions, such as user input previously received via a GUI, a user’s having zoomed in on a particular video object, a user’s framing (e.g., centering) of a particular video object, etc.
[0331] In some examples, the selecting may involve estimating which audio sources in the inventory of audio sources correspond to talkers. In some such examples, the one or more selected audio sources do not include audio sources estimated to be talkers.
[0332] Alternatively, or additionally, in some examples, at least one audio source may have been selected by a control system, such as a control system of a capture device. In some such examples, the control system may have selected one or more audio sources for which the corresponding audio data is of low quality (such as background music that is partially masked by background noise), which is estimated to be near or below a level of human audibility, etc. According to some examples, the control system may have selected one or more potential or actual audio sources based, at least in part, on an estimated audio or video scene context. In some such examples, at least one audio source may be selected based, at least in part, on the video data, e.g., by selecting a video object that is an actual or potential audio source. According to some such examples, the audio scene inventory filtering module 715 of Figure 7 may be configured to perform block 1170, at least in part. Processing may continue to block 1175.
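One possible control-system selection heuristic of the kind described above is sketched below. The thresholds, dictionary keys, and the talker exclusion are illustrative assumptions only, not requirements of block 1170.

```python
def select_for_modification(inventory: list,
                            audibility_db: float = -60.0,
                            quality_floor: float = 0.4) -> list:
    """Flag sources whose level is near or below audibility or whose estimated
    quality is low, while leaving sources classified as talkers unselected."""
    selected = []
    for source in inventory:
        if source.get("is_talker", False):
            continue  # talkers are not selected in this sketch
        low_level = source.get("level_db", 0.0) <= audibility_db
        low_quality = source.get("quality", 1.0) < quality_floor
        if low_level or low_quality:
            selected.append(source["source_id"])
    return selected

# Example: partially masked background music is flagged, the talker is not.
inventory = [
    {"source_id": 1, "is_talker": True, "level_db": -20.0, "quality": 0.9},
    {"source_id": 2, "is_talker": False, "level_db": -35.0, "quality": 0.3},
]
candidates = select_for_modification(inventory)   # -> [2]
```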
[0333] Block 1175 involves “storing, by the control system, audio data and video data received during a capture phase.” Processing may continue to block 1180.
[0334] Block 1180 involves “controlling, by the control system, a display of the device to display images corresponding to the video data and to display a graphical user interface (GUI) overlaid on the images, wherein the GUI indicates the one or more selected audio sources.” The GUI may be displayed prior to a capture phase, during a capture phase, or both. The GUI may, in some instances, include audio source labels. At least one of the audio source labels may correspond to an audio source or potential audio source identified by the control system based on the audio data, the video data, or both.
[0335] In some examples, block 1180 may involve controlling a display of the apparatus 101 to present a GUI like that shown in Figure 3A or one of the GUIs shown in Figures 4C-4E. The audio source information areas 430a, 430b, 430c and 430d, respectively, are examples in which a GUI indicates “the one or more selected audio sources.” In some examples, the GUI may present a representation of a potential audio source for which audio data is not currently being received — or for which audio data is currently being received at a level that is below a threshold level — but which is nonetheless a video object that has been identified by the control system as a potential audio source. According to some examples, the GUI may represent one or more audio sources that are selected for possible augmentation or replacement differently from other audio sources, e.g., in a different color.
[0336] In some examples, the GUI may include one or more user input areas configured to receive user input. In some such examples, the audio source information areas may be configured to receive user input, e.g., to allow a user to select one or more audio sources for augmentation or replacement. According to some examples, the GUI may represent one or more audio sources that are selected for possible augmentation or replacement with a textual prompt, associated with augmentation or replacement of audio data corresponding to the one or more selected audio sources. For example, the GUI may represent one or more selected audio sources with a prompt such as “modify?” or “augment or replace?”
[0337] In some instances of the methods 1150, the GUI may represent one or more selected audio sources with a prompt regarding augmentation of the audio data corresponding to the selected audio source, for example indicating a proposed type of augmentation. In some such examples, the proposed type of augmentation may involve a microphone beamforming process for augmentation of the audio data corresponding to the selected audio source. According to some examples, the GUI may represent one or more selected audio sources with a prompt regarding replacement of the audio data corresponding to the selected audio source. For example, the prompt may indicate a proposed type of replacement, such as replacing the audio data corresponding to the first selected audio source with synthetic audio data or with external audio data.
[0338] Some examples of the methods 1150 may involve receiving, by the control system, user input via the GUI indicating augmentation or replacement of audio data corresponding to a selected audio source. Some such examples may involve augmenting or replacing, by the control system, the audio data corresponding to the selected audio source, to produce modified audio data. The modified audio data may be augmented audio data or replacement audio data. Some examples may involve labeling, by the control system, the augmented audio data or the replacement audio data and storing a label along with the augmented audio data or the replacement audio data. The label may, for example, be a type of audio metadata.
[0339] Some examples of the methods 1150 may involve causing, by the control system, the GUI to indicate that the audio data corresponding to a selected audio source is augmented audio data or replacement audio data, in other words indicating that the audio data corresponding to the selected audio source has been augmented or replaced. Some examples of the methods 1150 may involve causing, by the control system, the GUI to indicate one or more audio sources corresponding to unmodified audio data.
[0340] Some examples of the methods 1150 may involve storing, by the control system, unmodified audio data corresponding to at least one audio source that has been selected for augmentation or replacement, e.g., after audio data corresponding to a selected audio source has been augmented or replaced.
[0341] Some examples of the methods 1150 may allow a user to interpolate between augmented or replacement audio data and “real world” or unmodified audio data. Some such examples may involve causing, by the control system, a GUI to indicate an audio source having corresponding augmented audio data or replacement audio data. Some such examples may involve causing, by the control system, the GUI to include one or more user input areas for receiving user input for modifying the augmented audio data or replacement audio data according to the unmodified audio data. The modifying may, for example, involve interpolating between the augmented audio data or replacement audio data and the unmodified audio data. In some examples, modifying may involve replacing the augmented audio data or replacement audio data with the unmodified audio data. According to some examples, the GUI may be displayed after the capture phase and during a post-capture review process.
[0342] Figure 12A shows examples of media assets and an interpolator according to some disclosed examples. In these examples, Figure 12A shows a media asset 1208, an interpolator 1220 and a media asset with artistic intent set 1218. According to these examples, the media asset 1208 includes unmodified audio data 1205, modified unmixed audio data 1210 and context metadata 1215. In these examples, a control system portion 1206 is configured for implementing the interpolator 1220. According to these examples, the media asset with artistic intent set 1218 includes a copy of the unmodified audio data 1205, adjusted modified unmixed audio data 1212 that is output by the interpolator 1220, and a copy of the context metadata 1215.
[0343] In some examples, the media asset 1208 may have been stored after a capture phase, whereas in other examples the media asset 1208 may have been stored after a post-capture editing phase. Accordingly, the modified unmixed audio data 1210 may have been modified — at least in part — during a capture phase, during a post-capture editing phase, or both. The modification may have involved augmentation, replacement, or both.
[0344] However, in this example, if a post-capture editing phase has already begun, it is not yet complete. Instead, a user is providing user input 1203 to the interpolator 1220 to adjust the unmodified audio data 1205, the modified unmixed audio data 1210, or both, to the user’s satisfaction. After the user has adjusted the unmodified audio data 1205, the modified unmixed audio data 1210, or both, to satisfactorily represent the user’s artistic intent, the user may terminate the interpolation process and cause adjusted modified unmixed audio data 1212 to be output by the interpolator 1220 and stored as part of the media asset with artistic intent set 1218. [0345] According to some examples, the interpolator 1220 may be configured to interpolate between the modified unmixed audio data 1210 and the unmodified audio data 1205 according to user input 1203. Alternatively, or additionally, the interpolator 1220 may be configured to interpolate between the unmodified audio data 1205 and synthetic or external audio data 1207 according to user input 1203. The synthetic or external audio data 1207 may, for example, include music, sound effects, etc., which may be pre-recorded or generated. In some instances, the interpolator 1220 may be configured to interpolate based, at least in part, on the context metadata 1215. For example, the interpolator 1220 may be configured to propose an initial amount of interpolation based on the context metadata 1215, which could be modified according to user input 1203 if the user so desires.
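A minimal sketch of such an interpolation is given below, assuming a simple linear blend controlled by a single ratio; the function name and the clamping behavior are assumptions for the example, and a real interpolator may operate per time interval and take the context metadata into account.

```python
import numpy as np

def interpolate_audio(unmodified: np.ndarray, modified: np.ndarray,
                      ratio: float) -> np.ndarray:
    """Blend the two versions sample by sample; ratio 0.0 keeps the capture
    unmodified, ratio 1.0 keeps only the modified (augmented or replacement) audio."""
    r = float(np.clip(ratio, 0.0, 1.0))
    n = min(len(unmodified), len(modified))
    return (1.0 - r) * unmodified[:n] + r * modified[:n]
```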
[0346] According to these examples, the control system portion 1206 is also configured to copy the unmodified audio data 1205 and the context metadata 1215, and to store copies of the unmodified audio data 1205 and the context metadata 1215 as components of the media asset with artistic intent set 1218.
[0347] Figures 12B, 12C and 12D illustrate example elements of GUIs that may be presented during a post-capture editing process that includes interpolation. In these examples, the GUIs 1260a, 1260b and 1260c of Figures 12B, 12C and 12D, respectively, are provided on a display 1255 of an apparatus 1251, which is a cell phone and is an instance of the apparatus 101 of Figure 1A in these examples. In these examples, a control system (not shown) of the apparatus 1251 is controlling the display 1255 to present the GUIs 1260a, 1260b and 1260c. As with other disclosed examples, the types, numbers and arrangements of elements shown in Figures 12B-12D are merely provided as examples.
[0348] The GUI 1260a of Figure 12B includes audio source information areas 1230a, 1230b, 1230c, 1230d and 1230e, and textual prompts 1262a and 1262b. In this example, the audio source information areas 1230b, 1230d and 1230e have been selected by the control system as candidates for possible modification, or further modification, and are displayed differently from the audio source information areas 1230a and 1230c. Moreover, the audio source information areas 1230b, 1230d and 1230e have associated textual prompts 1262b, which inquire whether a user would like to modify the corresponding audio source. According to this example, the textual prompt 1262a encourages the user to touch an audio source area (meaning one of the audio source information areas 1230a-1230e) if the user would like to select a corresponding audio source for modification, or for further modification.
[0349] According to this example, the GUI 1260b of Figure 12C has been presented in response to detecting a user’s touch in the audio source information area 1230e of Figure 12B. In this example, the GUI 1260b of Figure 12C includes a textual prompt 1262c and an interpolation control 1264a, which includes a virtual slider 1266 in this example. According to this example, the interpolation control 1264a controls an interpolator — such as the interpolator 1220 of Figure 12A — according to an indicated ratio of modified audio to unmodified audio, ranging from 0/1 — meaning completely unmodified — to 1/1, which means completely modified. In this example, the textual prompt 1262c encourages the user to move the virtual slider 1266 to select a desired ratio of modified audio to unmodified audio. According to some examples, audio corresponding to the selected ratio may be provided by speakers of the apparatus 1251. In some examples, a video may also be presented, such as a video that shows a scene which includes a video object corresponding to the selected audio source. According to some examples, the user may be able to select a different ratio for different parts — e.g., for various time intervals — of the audio corresponding to the audio source.
[0350] In some examples, the alternative GUI 1260c of Figure 12D may be presented in response to detecting a user’s touch in the audio source information area 1230e of Figure 12B. According to this example, the GUI 1260c of Figure 12D includes a textual prompt 1262d and an interpolation control 1264b, which also includes a virtual slider 1266 in this example. According to this example, the interpolation control 1264b controls an interpolator according to an indicated percentage of modified audio to unmodified audio, ranging from 0 percent — meaning completely unmodified — to 100 percent, which means completely modified. In this example, the textual prompt 1262d encourages the user to move the virtual slider 1266 to select a desired percentage of modified audio to unmodified audio. According to some examples, audio corresponding to the selected percentage may be provided by speakers of the apparatus 1251. In some examples, the user may be able to select a different percentage for different time intervals of the audio corresponding to the audio source. According to some examples, a video may also be presented, such as a video that shows a scene which includes a video object corresponding to the selected audio source.
[0351] Figure 12E is a flow diagram that outlines various example methods 1270 according to some disclosed implementations. The example methods 1270 may be partitioned into blocks, such as blocks 1272, 1275, 1277, 1280, 1282, 1285, 1287, 1290 and 1292. The various blocks may be described as operations, processes, methods, steps, acts or functions. The blocks of methods 1270, like other methods described herein, are not necessarily performed in the order indicated. In some implementations, one or more of the blocks of methods 1270 may be performed concurrently. Moreover, some implementations of methods 1270 may include more or fewer blocks than shown and/or described. The blocks of methods 1270 may be performed by one or more devices, for example, the device that is shown in Figure 1A. Some blocks of the methods 1270 may be performed during a capture phase and other blocks of the methods 1270 may be performed during a post-capture editing phase. The post-capture editing phase may or may not be performed on the same device(s) used for the capture phase, depending on the particular implementation.
[0352] Block 1272 involves “receiving, by a control system of a device, audio data from a microphone system.” The control system may be, or may include, the CPU 141 of Figure 1A. The control system may, for example, include a general purpose single- or multi-chip processor, one or more digital signal processors (DSPs), one or more application specific integrated circuits (ASICs), one or more field programmable gate arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic, one or more discrete hardware components, or combinations thereof. In some examples, the microphone system may be a microphone system of a mobile device that includes the control system. Processing may continue to block 1275.
[0353] Block 1275 involves “receiving, by the control system, video data from a camera system.” In some examples, the camera system may include one or more cameras of a mobile device that includes the control system. In other examples, the microphone system, the camera system, or both, may reside in one or more devices other than the device that includes the control system. Processing may continue to block 1277.
[0354] Block 1277 involves “creating, by the control system and based at least in part on the audio data, the video data, or both, an inventory of audio sources.” The inventory of audio sources may, in some examples, be as described herein with reference to Figure 10A, as described herein with reference to block 1015 of Figure 10B, or both. Some examples may involve detecting, by the control system and based at least in part on the video data, one or more potential sound sources. Accordingly, in some examples the inventory of audio sources may include one or more actual audio sources and one or more potential audio sources. In some examples, at least one of the potential sound sources may not be indicated by the audio data. Processing may continue to block 1280.
[0355] Block 1280 involves “selecting, by the control system, at least a first selected audio source from the inventory of audio sources, the first selected audio source being selected for augmentation or replacement.” In the examples shown in Figures 4D and 4E, the selected audio source was detected background music, which was replaced by a streamed version of the same music. According to some examples, the methods 1270 may involve controlling a display to present, prior to or during the capture phase, an audio data modification GUI that includes a user prompt associated with augmentation or replacement of audio data corresponding to one or more selected audio sources of the inventory of audio sources. The audio data modification GUI may, in some examples, include a user prompt associated with replacement of the audio data corresponding to a selected audio source. The replacement may be associated with replacing the audio data corresponding to the selected audio source with synthetic audio data or with external audio data. The textual prompt of Figure 4D provides one example of this type of user prompt. In some examples, the audio data modification GUI may include a user prompt, a virtual control, or combinations thereof, associated with augmentation or diminution of the audio data corresponding to a selected audio source. The plus and minus symbols in the audio source information areas 430a, 430b, 430c and 430d of Figures 4D and 4E are examples of such virtual controls. According to some examples, augmentation may be associated with a microphone beamforming process for augmentation of the audio data corresponding to the selected audio source. For example, if a user touches one of the plus symbols in Figures 4D and 4E, in some implementations the control system may initiate a microphone beamforming process.
[0356] In some examples, one or more selected audio sources may have been selected based, at least in part, on a user’s actions, such as user input previously received via a GUI, a user’s having zoomed in on a particular video object, a user’s framing (e.g., centering) of a particular video object, etc. In some examples, the selecting may involve estimating which audio sources in the inventory of audio sources correspond to talkers. In some such examples, the one or more selected audio sources do not include audio sources estimated to be talkers. Alternatively, or additionally, in some examples, at least one audio source may have been selected by a control system, such as a control system of a capture device. In some such examples, the control system may have selected one or more audio sources for which the corresponding audio data is of low quality (such as background music that is partially masked by background noise), which is estimated to be near or below a level of human audibility, etc. According to some examples, the control system may have selected one or more potential or actual audio sources based, at least in part, on an estimated audio or video scene context. In some such examples, at least one audio source may be selected based, at least in part, on the video data, e.g., by selecting a video object that is an actual or potential audio source. Processing may continue to block 1282.
[0357] Block 1282 involves “augmenting or replacing, by the control system, audio data corresponding to the first selected audio source, to produce first modified audio data, the first modified audio data comprising at least one of first augmented audio data or first replacement audio data.” In the example shown in Figure 4E, the selected audio source was detected background music, which was replaced by a streamed version of the same music. This replacement is an example of block 1282. In some examples, block 1282 may involve augmentation of the audio data, such as a microphone beamforming process for augmentation of the audio data corresponding to a selected audio source. Processing may continue to block 1285.
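As an illustration of how a microphone beamforming process might augment a selected source, the sketch below implements a basic delay-and-sum beamformer for a uniform linear microphone array. The array geometry, sample rate, steering angle, and the integer-sample, circular-shift approximation are all simplifying assumptions; this is not the beamforming process used in any particular implementation.

```python
import numpy as np

def delay_and_sum(mic_signals: np.ndarray, angle_deg: float,
                  spacing_m: float = 0.02, fs: int = 48000,
                  speed_of_sound: float = 343.0) -> np.ndarray:
    """mic_signals: shape (num_mics, num_samples) from a uniform linear array.
    Steers toward angle_deg by time-aligning the microphones and averaging."""
    num_mics, num_samples = mic_signals.shape
    delays_s = (np.arange(num_mics) * spacing_m *
                np.sin(np.deg2rad(angle_deg)) / speed_of_sound)
    delay_samples = np.round(delays_s * fs).astype(int)
    output = np.zeros(num_samples)
    for m in range(num_mics):
        # Circular shift is a simplification; a practical implementation would use
        # fractional-delay filtering and handle edge samples explicitly.
        output += np.roll(mic_signals[m], -delay_samples[m])
    return output / num_mics
```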
[0358] Block 1285 involves “storing, by the control system, the first modified audio data.” The modified unmixed audio data 1210 of Figure 12A is one example of stored modified audio data. Processing may continue to block 1287.
[0359] Block 1287 involves “storing, by the control system, audio data and video data received during a capture phase, the audio data including first unmodified audio data corresponding to at least the first selected audio source.” The unmodified audio data 1205 of Figure 12A is one example of stored unmodified audio data. Processing may continue to block 1290.
[0360] Block 1290 involves “controlling, by the control system, a display of the device to present images corresponding to the video data and to display a post-capture graphical user interface (GUI) overlaid on the images, wherein the post-capture GUI indicates at least the first selected audio source and one or more user input areas to receive user input.” The GUI may be like the GUIs shown in Figures 12B, 12C or 12D, with one or more additional images of video data that includes at least a selected audio source. The GUI may, in some instances, include audio source labels, such as the audio source labels in the audio source information areas 1230a-1230e of Figure 12B. At least one of the audio source labels may correspond to an audio source or potential audio source identified by the control system based on the audio data, the video data, or both. In some examples, the post-capture GUI may include at least one user input area configured to receive a user selection of a ratio between modified and unmodified audio data, such as a ratio between the first modified audio data and the first unmodified audio data. The at least one user input area may be, or may include, a slider. The slider may be configured to allow a user to select a ratio between modified and unmodified audio data, a percentage of modification, etc. In some examples, the slider may be configured to allow a user to select a ratio from zero percent to 100 percent. Processing may continue to block 1292.
[0361] Block 1292 involves “editing, during a post-capture phase review process, the first modified audio data to include at least a portion of the first unmodified audio data based on the user input received by the post-capture GUI.” According to some examples, block 1292 may involve an interpolation process, such as interpolating between the first modified audio data and the first unmodified audio data. In some examples, block 1292 may involve interacting with a GUI like those of Figure 12B, 12C, or 12D.
[0362] Some examples of the methods 1270 may involve receiving, by the control system, audio data modification user input via the audio data modification GUI indicating augmentation or replacement of audio data corresponding to the first selected audio source. The editing may be effective to provide the first modified audio data responsive to the audio data modification user input. Some examples may involve labeling, by the control system, the augmented audio data or the replacement audio data and storing a label along with the augmented audio data or the replacement audio data. The label may, for example, be a type of audio metadata.
[0363] Some examples of the methods 1270 may involve causing, by the control system, the GUI to indicate that the audio data corresponding to a selected audio source is augmented audio data or replacement audio data, in other words indicating that the audio data corresponding to the selected audio source has been augmented or replaced. Some examples of the methods 1270 may involve causing, by the control system, the GUI to indicate one or more audio sources corresponding to unmodified audio data.
[0364] Some examples of the methods 1270 may involve causing, by the control system, the post-capture GUI to indicate that the audio data corresponding to the first selected audio source is modified audio data. Some such examples may involve causing, by the control system, the post-capture GUI to indicate one or more audio sources corresponding to unmodified audio data. [0365] Some examples of the methods 1270 may involve causing, by the control system, the display to display audio source labels in the audio data modification GUI or the post-capture GUI. In some such examples, at least one of the audio source labels corresponds to an audio source or potential audio source identified by the control system based on the audio data, the video data, or both.
[0366] Figure 13 shows examples of media assets and a cloud processing system according to some disclosed examples. In these examples, Figure 13 shows a media asset 1308, a cloud processing system 1325 and a backwards-compatible media asset 1330. According to these examples, the media asset 1308 includes unmodified audio data 1305, modified audio data 1310 and context metadata 1315. In these examples, the cloud processing system 1325 includes toolset selection module 1302, editing toolbox 1304 and processing module 1312. The cloud processing system 1325 may, for example, be implemented by one or more servers. According to these examples, the backwards-compatible media asset 1330 includes a copy of the unmodified audio data 1305, processed and modified audio data 1325, which includes processed audio data 1320 that is output by the cloud processing system 1325, and a copy of the context metadata 1315.
[0367] In some examples, the media asset 1308 may have been stored after a capture phase, whereas in other examples the media asset 1308 may have been stored after a post-capture editing phase. Accordingly, the modified audio data 1310 may have been modified — at least in part — during a capture phase, during a post-capture editing phase, or both. The modification may have involved augmentation, replacement, or both.
[0368] However, in this example, if a post-capture editing phase has already begun, it is not yet complete. Instead, in this example, the unmodified audio data 1305 is being provided to the cloud processing system 1325 for processing of the audio corresponding to one or more audio sources of the unmodified audio data 1305. In some examples, at least some of the modified audio data 1310 may also be provided to the cloud processing system 1325 for processing.
[0369] According to this example, the processing will involve the application of one or more selected audio processing tools 1308, which are selected from the editing toolbox 1304 by the toolset selection module 1302. The audio processing tools 1308 may, for example, be implemented via software stored on one or more non-transitory and computer-readable storage media. The editing toolbox 1304 may include various types of audio processing tools 1308, such as one or more generative tools for specific audio categories, one or more signal generators (such as generators of sound effects, also known as Foley effects), one or more audio morphing tools (such as tools for morphing audio corresponding to human speech), one or more speech enhancement tools, one or more tools for automating the mixing of audio corresponding to different audio sources of an audio scene, etc.
[0370] In some examples, the processing may involve what is referred to herein as a “second audio source separation process,” which may be relatively more accurate than an audio source separation process that was previously applied, e.g., during the capture phase. In some such examples, the unmodified audio data 1305 may include a copy of the original audio data obtained by a microphone system during the capture phase. In some examples, individual audio sources output by the second audio source separation process may be further processed according to one or more selected audio processing tools 1308. [0371] According to some examples, the toolset selection module 1302 may be controlled, at least in part, according to input context metadata 1315. In some examples, the context metadata 1315 may be generated by a control system during the capture process. According to some such examples, the context metadata 1315 may be based, for example, on the estimated presence of human speech in audio captured by the microphone system during the capture phase. At least some of the context metadata 1315 may, for example, correspond to user input (e.g., obtained during the capture phase) regarding one or more audio sources, such as a desire to increase the signal level of a talker’s audio, to decrease the level of street noise, to enhance the audio corresponding to one or more musical performers, etc. If, for example, the user has provided input during the capture phase indicating a desire to increase the signal level of a talker’s audio, the toolset selection module 1302 may be configured to automatically select one or more tools for enhancing the speech of that talker, for boosting the intelligibility of the speech of that talker, etc., without requiring further input during the post-capture editing process. This aspect represents an improvement and a technical advantage over previously-deployed methods.
[0372] In some examples, the cloud processing system 1325 may function automatically, without the need for human input. In some such examples, the cloud processing system 1325 may be configured to determine when the audio processing is complete and to cause the processing module 1312 to store the processed audio data 1320 as at least part of the processed and modified audio data 1325.
[0373] However, in some examples, the processing module 1312 may perform at least some processing according to optional user input 1317. The user input 1317 may control the processing module 1312, the toolset selection module 1302, or both. According to some such examples, after the user is satisfied with the audio processing provided by the cloud processing system 1325, the user may provide user input 1317 causing the processed audio data 1320 to be stored as at least part of the processed and modified audio data 1325.
[0374] According to these examples, the cloud processing system 1325 is also configured to copy the unmodified audio data 1305 and the context metadata 1315, and to store copies of the unmodified audio data 1305 and the context metadata 1315 as components of the backwards-compatible media asset 1330.
[0375] Figure 14 is a flow diagram that outlines various example methods 1400 according to some disclosed implementations. The example methods 1400 may be partitioned into blocks, such as blocks 1402, 1405, 1407, 1410, 1412, 1415, 1417 and 1420. The various blocks may be described as operations, processes, methods, steps, acts or functions. The blocks of methods 1400, like other methods described herein, are not necessarily performed in the order indicated. In some implementations, one or more of the blocks of methods 1400 may be performed concurrently. Moreover, some implementations of methods 1400 may include more or fewer blocks than shown and/or described. The blocks of methods 1400 may be performed by one or more devices, for example, the device that is shown in Figure 1A. Some blocks of the methods 1400 may be performed during a capture phase and other blocks of the methods 1400 may be performed during a post-capture editing phase. The post-capture editing phase may or may not be performed on the same device(s) used for the capture phase, depending on the particular implementation.
[0376] Block 1402 involves “receiving, by a control system of a device, audio data from a microphone system.” The control system may be, or may include, the CPU 141 of Figure 1A. The control system may, for example, include a general purpose single- or multi-chip processor, one or more digital signal processors (DSPs), one or more application specific integrated circuits (ASICs), one or more field programmable gate arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic, one or more discrete hardware components, or combinations thereof. In some examples, the microphone system may be a microphone system of a mobile device that includes the control system. Processing may continue to block 1405.
[0377] Block 1405 involves “receiving, by the control system, video data from a camera system.” In some examples, the camera system may include one or more cameras of a mobile device that includes the control system. In other examples, the microphone system, the camera system, or both, may reside in one or more devices other than the device that includes the control system. Processing may continue to block 1407.
[0378] Block 1407 involves “identifying, by the control system and based at least in part on the audio data and the video data, two or more audio sources in an audio scene.” Block 1407 may involve, or be performed subsequent to, an audio source separation process. According to some examples, block 1407 may involve making an inventory of audio sources in the current audio scene. The inventory of audio sources may, in some examples, be as described herein with reference to Figure 10A, as described herein with reference to block 1015 of Figure 10B, or both. Some examples may involve detecting, by the control system and based at least in part on the video data, one or more potential sound sources. Accordingly, in some examples the inventory of audio sources may include one or more actual audio sources and one or more potential audio sources. In some examples, at least one of the potential sound sources may not be indicated by the audio data. Processing may continue to block 1410.
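As a hedged illustration of how such an inventory could be assembled from separate audio and video detections, the short Python sketch below merges the two detection lists and keeps video-only detections as potential sources. The detector outputs and field names are assumptions made for this example only.

def build_inventory(audio_detections, video_detections):
    """Combine audio and video detections into an inventory of actual and potential sources."""
    inventory = []
    matched_video = set()
    for a in audio_detections:
        entry = {"label": a["label"], "level_db": a["level_db"], "status": "actual"}
        for i, v in enumerate(video_detections):
            if i not in matched_video and v["label"] == a["label"]:
                entry["position"] = v["position"]  # rough direction taken from the video object
                matched_video.add(i)
                break
        inventory.append(entry)
    # Video objects with no matching audio are kept as potential sound sources.
    for i, v in enumerate(video_detections):
        if i not in matched_video:
            inventory.append({"label": v["label"], "position": v["position"], "status": "potential"})
    return inventory

audio = [{"label": "talker", "level_db": -20.0}]
video = [{"label": "talker", "position": (0.4, 0.5)},
         {"label": "guitar", "position": (0.8, 0.6)}]  # visible in the frame but currently silent
print(build_inventory(audio, video))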
[0379] Block 1410 involves “storing, by the control system, audio data and video data received during a capture phase.” The unmodified audio data 1305 of Figure 13 is one example of stored audio data that was received during a capture phase. Processing may continue to block 1412. [0380] Block 1412 involves “controlling, by the control system, a display of the device to display images corresponding to the video data and to display, prior to and during the capture phase, a graphical user interface (GUI) overlaid on the images, wherein the GUI includes an audio source image corresponding to each of the two or more audio sources, and wherein the GUI includes one or more user input areas to receive user input.” In some examples, block 1412 may involve controlling a display of the apparatus 101 to present a GUI like that shown in Figure 3A or one of the GUIs shown in Figures 4C-4E. Figures 4D and 4E, for example, show GUIs that provide information about four audio sources and include user input areas, such as the plus and minus icons for each of the four audio sources in the audio source information areas 430a, 430b, 430c and 430d. Accordingly, in some examples the one or more user input areas may include at least one user input area configured for receiving user input regarding a selected level. Processing may continue to block 1415.
[0381] Block 1415 involves “receiving, by the control system, user input via the one or more user input areas, the user input corresponding to at least one of the two or more audio sources.” The user input may, for example, include a user touch on the plus icon of the “Talking Person” audio source information areas 430b of Figure 4D or 4E, a user touch on the minus icon of the “Cafeteria Noise” audio source information areas 430a, etc. Processing may continue to block 1417.
[0382] Block 1417 involves “generating, by the control system, revision metadata corresponding to the user input.” The revision metadata may include metadata corresponding to the aforementioned user touch on the plus icon of the “Talking Person” audio source information areas 430b of Figure 4D or 4E, the aforementioned user touch on the minus icon of the “Cafeteria Noise” audio source information areas 430a, etc. The revision metadata may, in some examples, be stored as part of what is referred to herein as “context metadata,” such as the context metadata 1315 of Figure 13. In some examples, the control system may be configured to generate at least some context metadata that corresponds to user actions other than GUI input, such as zooming in on a video object that is an actual or potential audio source, centering a video frame on a video object that is an actual or potential audio source, etc. According to some examples, the control system may be configured to generate at least some context metadata that may not directly correspond to user input, such as context metadata indicating the presence of one or more human talkers, context metadata indicating the presence of one or more performing musicians, etc. Processing may continue to block 1420.
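For illustration only, a minimal sketch of turning capture-phase GUI gestures into revision metadata records is shown below. The record schema, field names, and 3 dB step size are assumptions; the disclosure does not prescribe a particular metadata format.

import time

def make_revision_metadata(source_id, action, step_db=3.0):
    """Create one revision metadata record per user gesture, e.g. a tap on a plus or minus icon."""
    gain = step_db if action == "increase" else -step_db
    return {
        "timestamp": time.time(),       # when the gesture occurred
        "source_id": source_id,         # which audio source the gesture targets
        "requested_gain_db": gain,      # desired level change for that source
        "origin": "gui",                # could also record zooming or reframing actions
    }

revision_log = [
    make_revision_metadata(source_id="talking_person", action="increase"),
    make_revision_metadata(source_id="cafeteria_noise", action="decrease"),
]
print(revision_log)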
[0383] Block 1420 involves “storing, by the control system, the revision metadata with at least the audio data received during the capture phase.” Some examples of the methods 1400 may involve storing other types of metadata that are created by the control system during the capture phase, such as other types of context metadata.
[0384] Some examples of the methods 1400 may involve causing audio data corresponding to the revision metadata to be modified according to the revision metadata. As noted elsewhere, in some examples, the context metadata 1315 of Figure 13 may include revision metadata. In some examples, the cloud processing system 1325 may cause audio data corresponding to the revision metadata to be modified based, at least in part, on context metadata 1315 that includes the revision metadata. In other examples, another device or system — such as the capture device — may cause audio data corresponding to the revision metadata to be modified based, at least in part, on the revision metadata. Accordingly, in some examples, causing the audio data to be modified may involve modifying, by the control system, audio data corresponding to the revision metadata. In other examples, causing the audio data received during the capture phase to be modified may involve sending, by the control system, the revision metadata and the audio data received during the capture phase to one or more other devices. For example, causing the audio data received during the capture phase to be modified may involve sending, by the control system, the revision metadata and the audio data received during the capture phase to one or more servers.
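A minimal sketch of the two options described above, modifying the audio locally or handing the unmodified audio and revision metadata to one or more servers, is shown below. The helper functions, payload layout, and server URL are placeholders for illustration only and are not part of this disclosure.

import json

def cause_modification(audio_path, revision_metadata, modify_locally,
                       apply_fn, upload_fn, server_url="https://example.invalid/process"):
    if modify_locally:
        # The capture device's control system applies the revisions itself.
        return apply_fn(audio_path, revision_metadata)
    # Otherwise package the audio reference and revision metadata for remote processing,
    # e.g. by a cloud processing system.
    payload = {"audio": audio_path, "revisions": revision_metadata}
    return upload_fn(server_url, json.dumps(payload))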
[0385] According to some examples, the audio data corresponding to the revision metadata may include unmodified audio data received during the capture phase. The unmodified audio data 1305 of Figure 13 provides one example. [0386] In some examples, the audio data corresponding to the revision metadata may include modified audio data. The modified audio data may, for example, include augmented audio data, replacement audio data, or both. The modified audio data 1310 of Figure 13 provides one example.
[0387] According to some examples, causing the audio data received during the capture phase to be modified may involve applying an audio enhancement tool to audio data corresponding to the revision metadata. In some such examples, the audio data corresponding to the revision metadata may include speech audio data corresponding to speech from at least one person and the audio enhancement tool may be, or may include, a speech enhancement tool. According to some examples, the audio enhancement tool may be, or may include, a sound source separation tool.
[0388] Some examples of the methods 1400 may involve receiving, by the control system and after the capture phase has begun, modification user input via the one or more user input areas. Some such examples may involve causing, by the control system, audio data received during the capture phase to be modified according to the modification user input. In some instances, the audio data received during the capture phase may be modified during the capture phase. For example, audio data received during the capture phase may be modified according to a microphone beamforming process that is performed during the capture phase. In some examples, the storing process of block 1410 — or another aspect of the methods 1400 — may involve storing modified audio data that has been modified according to the user input.
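Microphone beamforming is one example of a capture-phase modification. The following delay-and-sum sketch is a generic textbook formulation, not the specific beamforming process of any implementation described here; the array geometry, sample rate, and steering angle are assumptions.

import numpy as np

def delay_and_sum(mic_signals, mic_positions_m, angle_rad, fs=48000, c=343.0):
    """Steer a linear microphone array toward angle_rad by aligning and averaging channels.

    mic_signals: array of shape (num_mics, num_samples); mic_positions_m: positions along the array axis.
    """
    num_mics, num_samples = mic_signals.shape
    out = np.zeros(num_samples)
    for m in range(num_mics):
        # Far-field time-of-arrival difference for this microphone at the steering angle.
        delay_s = mic_positions_m[m] * np.cos(angle_rad) / c
        delay_samples = int(round(delay_s * fs))
        # np.roll wraps samples around the ends; edge handling is ignored for brevity.
        out += np.roll(mic_signals[m], -delay_samples)
    return out / num_mics

fs = 48000
t = np.arange(fs) / fs
mics = np.stack([np.sin(2 * np.pi * 440 * t) for _ in range(4)])  # toy, identical channels
steered = delay_and_sum(mics, mic_positions_m=[0.0, 0.02, 0.04, 0.06], angle_rad=np.pi / 4, fs=fs)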
[0389] According to some examples, the identifying process of block 1407 — or another aspect of the methods 1400 — may involve performing, by the control system, a first sound source separation process. In some examples, causing the audio data to be modified may involve performing a second sound source separation process, or causing the second sound source separation process to be performed.
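To make the distinction between a lightweight first separation pass and a heavier second pass concrete, the sketch below uses simple spectral gating with coarse settings for the capture-time pass and finer settings for the post-capture pass. Practical systems would more likely use learned separators; the algorithm and parameter values here are assumptions for illustration only.

import numpy as np

def spectral_gate(x, frame=256, hop=128, over_subtract=1.0):
    """Split x into rough 'foreground' and 'background' estimates by gating low-energy bins.

    Overlap-add normalization is omitted for brevity.
    """
    window = np.hanning(frame)
    foreground = np.zeros_like(x)
    noise_floor = None
    for start in range(0, len(x) - frame, hop):
        seg = x[start:start + frame] * window
        spec = np.fft.rfft(seg)
        mag = np.abs(spec)
        # Track a running per-bin minimum as a crude noise-floor estimate.
        noise_floor = mag if noise_floor is None else np.minimum(noise_floor, mag)
        gain = np.clip(1.0 - over_subtract * noise_floor / (mag + 1e-9), 0.0, 1.0)
        foreground[start:start + frame] += np.fft.irfft(spec * gain) * window
    return foreground, x - foreground

def first_separation(x):    # coarse, cheap settings usable during capture
    return spectral_gate(x, frame=256, hop=128)

def second_separation(x):   # finer settings affordable during post-capture processing
    return spectral_gate(x, frame=2048, hop=512)

fg, bg = first_separation(np.random.randn(48000))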
Example IVAS Codec Framework
[0390] Figure 15 is a block diagram of an example immersive voice and audio services (IVAS) coder/decoder (“codec”) framework 1500 for encoding and decoding IVAS bitstreams, according to one or more embodiments. IVAS is expected to support a range of audio service capabilities, including but not limited to mono to stereo upmixing and fully immersive audio encoding, decoding and rendering. IVAS is also intended to be supported by a wide range of devices, endpoints, and network nodes, including but not limited to: mobile and smart phones, electronic tablets, personal computers, conference phones, conference rooms, virtual reality (VR) and augmented reality (AR) devices, home theatre devices, and other suitable devices.
[0391] IVAS codec 1500 includes IVAS encoder 1501 and IVAS decoder 1504. IVAS encoder 1501 includes spatial encoder 1502 that receives N channels of input spatial audio (e.g., FOA, HOA). In some implementations, spatial encoder 1502 implements SPAR and DirAC for analyzing/downmixing N_dmx spatial audio channels, as described in further detail below. The output of spatial encoder 1502 includes a spatial metadata (MD) bitstream (BS) and N_dmx channels of spatial downmix. The spatial MD is quantized and entropy coded. In some implementations, quantization can include fine, moderate, coarse and extra coarse quantization strategies and entropy coding can include Huffman or arithmetic coding. In some implementations, the framework may permit not more than 3 levels of quantization at a given operating mode; however, with decreasing bitrates, in some such implementations the three levels become increasingly coarser overall, to meet bitrate requirements. Core audio encoder 1503 (e.g., based on a mono Enhanced Voice Services (EVS) encoding unit) encodes N_dmx channels (N_dmx = 1-16 channels) of the spatial downmix into an audio bitstream, which is combined with the spatial MD bitstream into an IVAS encoded bitstream transmitted to IVAS decoder 1504. As described below, given bitrate constraints for low bit rate Scene Based Audio (SBA), in some implementations the number of channels will be limited to a single channel.
[0392] IVAS decoder 1504 includes core audio decoder 1505 (e.g., EVS decoder) that decodes the audio bitstream extracted from the IVAS bitstream to recover the N_dmx audio channels. Spatial decoder/renderer 1506 (e.g., SPAR/DirAC) decodes the spatial MD bitstream extracted from the IVAS bitstream to recover the spatial MD, and synthesizes/renders output audio channels using the spatial MD and a spatial upmix for playback on various audio systems with different speaker configurations and capabilities.
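As a structural illustration of the encode/decode flow of Figure 15 (a downmix plus quantized spatial metadata on the encoder side, and a metadata-driven upmix on the decoder side), the following sketch uses placeholder computations. It is not an implementation of SPAR, DirAC, or EVS; the single-channel downmix, the metadata definition, and the uniform quantizer are simplifying assumptions.

import numpy as np

def spatial_encode(channels):
    """Downmix N input channels and derive placeholder per-channel spatial metadata."""
    downmix = channels.mean(axis=0, keepdims=True)        # N_dmx = 1 downmix channel
    energy = (channels ** 2).mean(axis=1)
    metadata = {"energy_ratios": energy / (energy.sum() + 1e-12)}
    return downmix, metadata

def quantize_metadata(metadata, levels=16):
    """Uniform quantization standing in for the fine/moderate/coarse strategies."""
    q = np.round(np.clip(metadata["energy_ratios"], 0.0, 1.0) * (levels - 1)).astype(int)
    return {"energy_ratios_q": q, "levels": levels}

def spatial_decode(downmix, metadata_q, num_out_channels):
    """Upmix the downmix back to the target channel count using the decoded metadata."""
    ratios = metadata_q["energy_ratios_q"] / (metadata_q["levels"] - 1)
    gains = np.sqrt(ratios + 1e-12)
    return gains[:num_out_channels, None] * downmix

foa = np.random.randn(4, 48000)                            # toy first-order ambisonics input
dmx, md = spatial_encode(foa)
rendered = spatial_decode(dmx, quantize_metadata(md), num_out_channels=4)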
[0393] Various features and aspects will be appreciated from the following enumerated example embodiments (“EEEs”):
[0394] EEE1 A. A method, comprising: receiving, by a control system of a device, audio data from a microphone system; receiving, by the control system, video data from a camera system; creating, by the control system and based at least in part on the audio data, the video data, or both, an inventory of audio sources; selecting, by the control system, a subset of one or more selected audio sources from the inventory of audio sources; estimating, by the control system and based on the audio data, at least one audio characteristic of at least the one or more selected audio sources; storing, by the control system, audio data and video data received during a capture phase; and controlling, by the control system, a display of the device to display images corresponding to the video data and to display, prior to and during the capture phase, a graphical user interface (GUI) overlaid on the images, wherein the GUI includes an audio source image corresponding to the at least one audio characteristic of the subset of one or more selected audio sources.
[0395] EEE2A. The method of claim EEE1A, wherein the GUI includes one or more user input areas configured to receive user input.
[0396] EEE3A. The method of claim EEE1A or claim EEE2A, further comprising classifying, by the control system, the audio sources in the inventory of audio sources into two or more audio source categories, wherein the GUI includes a user input area portion corresponding to at least one of the two or more audio source categories.
[0397] EEE4A. The method of claim EEE3A, wherein one of the audio source categories is a foreground category corresponding to the one or more selected audio sources.
[0398] EEE5A. The method of claim EEE3A or claim EEE4A, wherein one of the audio source categories is a background category corresponding to one or more audio sources of the inventory of audio sources that were not in the subset of one or more selected audio sources. [0399] EEE6A. The method of any one of claims EEE1A-EEE5A, wherein the selecting comprises estimating which audio sources in the inventory of audio sources are most significant sound sources, and wherein the subset of one or more selected audio sources includes audio sources estimated to be the most significant sound sources.
[0400] EEE7A. The method of any one of claims EEE1A-EEE6A, wherein the selecting comprises estimating which audio sources in the inventory of audio sources correspond to talkers, and wherein the subset of one or more selected audio sources includes audio sources estimated to be talkers.
[0401] EEE8A. The method of any one of claims EEE1A-EEE7A, further comprising performing, by the control system, a first sound source separation process.
[0402] EEE9A. The method of claim EEE8A, further comprising updating, by the control system, an audio scene state based at least in part on the first sound source separation process, and causing, by the control system, the GUI to be updated according to an updated audio scene state.
[0403] EEE10A. The method of claim EEE8A or claim EEE9A, wherein the creating is based, at least in part, on the first sound source separation process.
[0404] EEE11A. The method of any one of claims EEE8A-EEE10A, further comprising performing post-capture audio processing on the audio data received during the capture phase, wherein the post-capture audio processing comprises a second sound source separation process that is more complex than the first sound source separation process.
[0405] EEE12A. The method of any one of claims EEE1A-EEE11A, further comprising detecting, by the control system and based at least in part on the video data, one or more potential sound sources.
[0406] EEE13A. The method of claim EEE12A, wherein at least one of the one or more potential sound sources is not indicated by the audio data.
[0407] EEE14A. The method of claim EEE12A or claim EEE13A, wherein the inventory of audio sources includes the one or more potential sound sources.
[0408] EEE15A. The method of any one of claims EEE1A- EEE14A, further comprising detecting, by the control system, one or more candidate sound sources for augmented audio capture, wherein the augmented audio capture comprises replacement of a candidate sound source by external audio or synthetic audio.
[0409] EEE16A. The method of claim EEE15A, wherein the GUI includes at least one user input area configured to receive a user selection of a selected potential sound source or a selected candidate sound source.
[0410] EEE17A. The method of claim EEE16A, wherein the GUI includes at least one user input area configured to receive a user selection of augmented audio capture, wherein the augmented audio capture includes at least one of external audio or synthetic audio for the selected potential sound source or the selected candidate sound source.
[0411] EEE18A. The method of claim EEE17A, wherein the GUI includes at least one user input area configured to receive a user selection of a ratio between augmented audio capture and real-world audio capture.
[0412] EEE19A. The method of any one of claims EEE1A- EEE18A, further comprising causing, by the control system, the display to display audio source labels in the GUI. [0413] EEE20A. The method of claim EEE19A, wherein at least one of the audio source labels corresponds to an audio source identified by the control system based on the audio data, the video data, or both.
[0414] EEE21A. The method of any one of claims EEE1A-EEE20A, wherein the audio data are received from the microphone system of the device and the video data are received from the camera system of the device.
[0415] EEE22A. The method of any one of claims EEE1A-EEE21A, further comprising updating, by the control system, an estimate of a current audio scene and causing, by the control system, the GUI to be updated according to updated estimates of the current audio scene.
[0416] EEE23A. The method of claim EEE22A, wherein updating the estimate of the current audio scene comprises implementing, by the control system, an audio classifier and a video classifier.
[0417] EEE24A. The method of claim EEE22A, wherein updating the estimate of the current audio scene comprises implementing, by the control system, an audiovisual classifier.
[0418] EEE25A. The method of any one of claims EEE22A- EEE24A, wherein the updated estimates of the current audio scene include an updated level estimate for one or more audio sources.
[0419] EEE26A. One or more non-transitory media having instructions stored thereon for controlling one or more devices to perform a method, the method comprising: receiving, by a control system of a device, audio data from a microphone system; receiving, by the control system, video data from a camera system; creating, by the control system and based at least in part on the audio data, the video data, or both, an inventory of audio sources; selecting, by the control system, a subset of one or more selected audio sources from the inventory of audio sources; estimating, by the control system and based on the audio data, at least one audio characteristic of at least the one or more selected audio sources; storing, by the control system, audio data and video data received during a capture phase; and controlling, by the control system, a display of the device to display images corresponding to the video data and to display, prior to and during the capture phase, a graphical user interface (GUI) overlaid on the images, wherein the GUI includes an audio source image corresponding to the at least one audio characteristic of the subset of one or more selected audio sources. [0420] EEE27A. The one or more non-transitory media of claim EEE26A, wherein the GUI includes one or more user input areas configured to receive user input.
[0421] EEE28A. The one or more non-transitory media of claim EEE26A or claim EEE27A, further comprising classifying, by the control system, the audio sources in the inventory of audio sources into two or more audio source categories, wherein the GUI includes a user input area portion corresponding to at least one of the two or more audio source categories.
[0422] EEE29A. The one or more non-transitory media of claim EEE28A, wherein one of the audio source categories is a foreground category corresponding to the one or more selected audio sources and one of the audio source categories is a background category corresponding to one or more audio sources of the inventory of audio sources that were not in the subset of one or more selected audio sources.
[0423] EEE30A. An apparatus, comprising: an interface system; a memory system; a display system including at least one display; and a control system configured to: receive, via the interface system, audio data from a microphone system; receive, via the interface system, video data from a camera system; create, based at least in part on the audio data, the video data, or both, an inventory of audio sources; select a subset of one or more selected audio sources from the inventory of audio sources; estimate, based on the audio data, at least one audio characteristic of at least the one or more selected audio sources; store, in the memory system, audio data and video data received during a capture phase; and control, by the control system, a display of the display system to display images corresponding to the video data and to display, prior to and during the capture phase, a graphical user interface (GUI) overlaid on the images, wherein the GUI includes an audio source image corresponding to the at least one audio characteristic of the subset of one or more selected audio sources.
[0424] EEE31A. The apparatus of claim EEE30A, wherein the GUI includes one or more user input areas configured to receive user input.
[0425] EEE32A. The apparatus of claim EEE30A or claim EEE31A, further comprising classifying, by the control system, the audio sources in the inventory of audio sources into two or more audio source categories, wherein the GUI includes a user input area portion corresponding to at least one of the two or more audio source categories.
[0426] EEE33A. The apparatus of claim EEE32A, wherein one of the audio source categories is a foreground category corresponding to the one or more selected audio sources and one of the audio source categories is a background category corresponding to one or more audio sources of the inventory of audio sources that were not in the subset of one or more selected audio sources. [0427] EEE34A. The apparatus of any one of claims EEE30A-EEE33A, wherein the apparatus includes the microphone system and the camera system.
[0428] EEE1B. A method, comprising: receiving, by a control system of a device, audio data from a microphone system; receiving, by the control system, video data from a camera system; creating, by the control system and based at least in part on the audio data, the video data, or both, an inventory of audio sources; selecting, by the control system, one or more selected audio sources from the inventory of audio sources, wherein the one or more selected audio sources are selected for possible augmentation or replacement; storing, by the control system, audio data and video data received during a capture phase; and controlling, by the control system, a display of the device to display images corresponding to the video data and to display a graphical user interface (GUI) overlaid on the images, wherein the GUI indicates the one or more selected audio sources.
[0429] EEE2B. The method of claim EEE1B, wherein the GUI includes one or more displayed user input areas to receive user input.
[0430] EEE3B. The method of claim EEE2B, wherein the GUI includes a user prompt associated with augmentation or replacement of audio data corresponding to the one or more selected audio sources.
[0431] EEE4B. The method of claim EEE2B or claim EEE3B, wherein the selecting is based, at least in part, on previously-received user input.
[0432] EEE5B. The method of any one of claims EEE1B- EEE4B, wherein at least a first selected audio source of the one or more selected audio sources is selected based, at least in part, on the video data.
[0433] EEE6B. The method of claim EEE5B, wherein the audio data corresponding to the first selected audio source is below a threshold level.
[0434] EEE7B. The method of any one of claims EEE4B- EEE6B, wherein the GUI includes a user prompt associated with augmentation or replacement of audio data corresponding to the first selected audio source.
[0435] EEE8B. The method of claim EEE7B, wherein the GUI includes a user prompt associated with augmentation of the audio data corresponding to the first selected audio source and wherein the augmentation involves a microphone beamforming process for augmentation of the audio data corresponding to the first selected audio source.
[0436] EEE9B. The method of claim EEE7B or claim EEE8B, wherein the GUI includes a user prompt associated with replacement of the audio data corresponding to the first selected audio source and wherein the replacement involves replacing the audio data corresponding to the first selected audio source with synthetic audio data or with external audio data.
[0437] EEE10B. The method of any one of claims EEE7B- EEE9B, further comprising: receiving, by the control system, user input via the GUI, wherein the received user input indicates augmentation or replacement of audio data corresponding to the first selected audio source; and augmenting or replacing, by the control system, the audio data corresponding to the first selected audio source, to produce augmented audio data or replacement audio data.
[0438] EEE11B. The method of claim EEE10B, further comprising: labeling, by the control system, the augmented audio data or the replacement audio data; and storing a label along with the augmented audio data or the replacement audio data.
[0439] EEE12B. The method of claim EEE11B, wherein the label comprises audio metadata. [0440] EEE13B. The method of any one of claims EEE10B-EEE12B, further comprising causing, by the control system, the GUI to indicate that the audio data corresponding to the first selected audio source is augmented audio data or replacement audio data.
[0441] EEE14B. The method of claim EEE13B, further comprising causing, by the control system, the GUI to indicate one or more audio sources corresponding to unmodified audio data.
[0442] EEE15B. The method of any one of claims EEE10B- EEE14B, further comprising storing, by the control system, unmodified audio data corresponding to at least the first selected audio source.
[0443] EEE16B. The method of claim EEE15B, further comprising: causing, by the control system, the GUI to indicate an audio source having corresponding augmented audio data or replacement audio data; and causing, by the control system, the GUI to include one or more user input areas for receiving user input for modifying the augmented audio data or replacement audio data according to the unmodified audio data.
[0444] EEE17B. The method of claim EEE16B, wherein the modifying comprises interpolating between the augmented audio data or replacement audio data and the unmodified audio data. [0445] EEE18B. The method of claim EEE16B, wherein the modifying comprises replacing the augmented audio data or replacement audio data with the unmodified audio data.
[0446] EEE19B. The method of any one of claims EEE1B-EEE18B, wherein the selecting involves estimating which audio sources in the inventory of audio sources correspond to talkers and wherein the one or more selected audio sources do not include audio sources estimated to be talkers.
[0447] EEE20B. The method of any one of claims EEE1B-EEE19B, further comprising detecting, by the control system and based at least in part on the video data, one or more potential sound sources, wherein at least one of the one or more potential sound sources is not indicated by the audio data and wherein the inventory of audio sources includes the one or more potential sound sources.
[0448] EEE21B. The method of any one of claims EEE1B-EEE20B, further comprising causing, by the control system, the display to display audio source labels in the GUI.
[0449] EEE22B. The method of claim EEE21B, wherein at least one of the audio source labels corresponds to an audio source or potential audio source identified by the control system based on the audio data, the video data, or both.
[0450] EEE23B. The method of any one of claims EEE1B-EEE22B, wherein the audio data are received from a microphone system of the device and the video data are received from a camera system of the device.
[0451] EEE24B. The method of any one of claims EEE1B-EEE23B, wherein the GUI is displayed prior to the capture phase, during the capture phase, or both.
[0452] EEE25B. The method of any one of claims EEE1B-EEE24B, wherein the GUI is displayed after the capture phase and during a post-capture review process.
[0453] EEE26B. One or more non-transitory media having instructions stored thereon for controlling one or more devices to perform a method, the method comprising: receiving, by a control system of a device, audio data from a microphone system; receiving, by the control system, video data from a camera system; creating, by the control system and based at least in part on the audio data, the video data, or both, an inventory of audio sources; selecting, by the control system, one or more selected audio sources from the inventory of audio sources, wherein the one or more selected audio sources are selected for possible augmentation or replacement; storing, by the control system, audio data and video data received during a capture phase; and controlling, by the control system, a display of the device to display images corresponding to the video data and to display a graphical user interface (GUI) overlaid on the images, wherein the GUI indicates the one or more selected audio sources.
[0454] EEE27B. The one or more non-transitory media of claim EEE26B, wherein the GUI includes one or more displayed user input areas to receive user input.
[0455] EEE28B. The one or more non-transitory media of claim EEE27B, wherein the GUI includes a user prompt associated with augmentation or replacement of audio data corresponding to the one or more selected audio sources.
[0456] EEE29B. The one or more non-transitory media of claim EEE27B or claim EEE28B, wherein the selecting is based, at least in part, on previously-received user input.
[0457] EEE30B. The one or more non-transitory media of any one of claims EEE26B- EEE28B, wherein at least a first selected audio source of the one or more selected audio sources is selected based, at least in part, on the video data.
[0458] EEE31B. An apparatus, comprising: an interface system; a display system including one or more displays; a memory system; and a control system configured to: receive, via the interface system, audio data from a microphone system; receive, via the interface system, video data from a camera system; create, based at least in part on the audio data, the video data, or both, an inventory of audio sources; select one or more selected audio sources from the inventory of audio sources, wherein the one or more selected audio sources are selected for possible augmentation or replacement; store, in the memory system, audio data and video data received during a capture phase; and control a display of the display system to display images corresponding to the video data and to display a graphical user interface (GUI) overlaid on the images, wherein the GUI indicates the one or more selected audio sources.
[0459] EEE32B. The apparatus of claim EEE31B, wherein the GUI includes one or more displayed user input areas to receive user input.
[0460] EEE33B. The apparatus of claim EEE32B, wherein the GUI includes a user prompt associated with augmentation or replacement of audio data corresponding to the one or more selected audio sources.
[0461] EEE34B. The apparatus of claim EEE32B or claim EEE33B, wherein the selecting is based, at least in part, on previously-received user input. [0462] EEE35B. The apparatus of any one of claims EEE31B-EEE34B, wherein at least a first selected audio source of the one or more selected audio sources is selected based, at least in part, on the video data.
[0463] EEE1C. A method, comprising: receiving, by a control system of a device, audio data from a microphone system; receiving, by the control system, video data from a camera system; creating, by the control system and based at least in part on the audio data, the video data, or both, an inventory of audio sources; selecting, by the control system, at least a first selected audio source from the inventory of audio sources, the first selected audio source being selected for augmentation or replacement; augmenting or replacing, by the control system, audio data corresponding to the first selected audio source, to produce first modified audio data, the first modified audio data comprising at least one of first augmented audio data or first replacement audio data; storing, by the control system, the first modified audio data; storing, by the control system, audio data and video data received during a capture phase, the audio data including first unmodified audio data corresponding to at least the first selected audio source; controlling, by the control system, a display of the device to present images corresponding to the video data and to display a post-capture graphical user interface (GUI) overlaid on the images, wherein the post-capture GUI indicates at least the first selected audio source and one or more user input areas to receive user input; and editing, during a post-capture phase review process, the first modified audio data to include at least a portion of the first unmodified audio data based on the user input received by the post-capture GUI.
[0464] EEE2C. The method of claim EEE1C, wherein the editing comprises interpolating between the first modified audio data and the first unmodified audio data.
[0465] EEE3C. The method of claim EEE1C or claim EEE2C, wherein the post-capture GUI includes at least one user input area configured to receive a user selection of a ratio between the first modified audio data and the first unmodified audio data.
[0466] EEE4C. The method of claim EEE3C, wherein the at least one user input area comprises a slider.
[0467] EEE5C. The method of claim EEE4C, wherein the slider is configured to allow a user to select a ratio from zero percent to 100 percent.
[0468] EEE6C. The method of any one of claims EEE1C- EEE5C, wherein the selecting is based, at least in part, on user input. [0469] EEE7C. The method of any one of claims EEE1C- EEE6C, wherein the selecting is based, at least in part, on the video data.
[0470] EEE8C. The method of claim EEE7C, wherein the audio data corresponding to the first selected audio source is below a threshold level.
[0471] EEE9C. The method of any one of claims EEE1C- EEE8C, wherein the controlling further comprises adapting the display to present, prior to or during the capture phase, an audio data modification GUI that includes a user prompt associated with augmentation or replacement of audio data corresponding to one or more selected audio sources of the inventory of audio sources.
[0472] EEE10C. The method of claim EEE9C, wherein the audio data modification GUI includes a user prompt associated with augmentation of the audio data corresponding to a selected audio source and wherein the augmentation is associated with a microphone beamforming process for augmentation of the audio data corresponding to the selected audio source.
[0473] EEE11C. The method of claim EEE9C or claim EEE10C, wherein the audio data modification GUI includes a user prompt associated with replacement of the audio data corresponding to a selected audio source and wherein the replacement is associated with replacing the audio data corresponding to the selected audio source with synthetic audio data or with external audio data.
[0474] EEE12C. The method of any one of claims EEE9C- EEE11C, further comprising receiving, by the control system, audio data modification user input via the audio data modification GUI indicating augmentation or replacement of audio data corresponding to the first selected audio source; and wherein the editing is effective to provide the first modified audio data responsive to the audio data modification user input.
[0475] EEE13C. The method of claim EEE12C, further comprising: labeling, by the control system, modified audio data; and storing a label along with the modified audio data.
[0476] EEE14C. The method of claim EEE13C, wherein the label comprises audio metadata. [0477] EEE15C. The method of any one of claims EEE1C- EEE14C, further comprising causing, by the control system, the post-capture GUI to indicate that the audio data corresponding to the first selected audio source is modified audio data. [0478] EEE16C. The method of claim EEE15C, further comprising causing, by the control system, the post-capture GUI to indicate one or more audio sources corresponding to unmodified audio data.
[0479] EEE17C. The method of any one of claims EEE1C- EEE16C, wherein the selecting involves estimating which audio sources in the inventory of audio sources correspond to talkers and wherein the one or more selected audio sources do not include audio sources estimated to be talkers.
[0480] EEE18C. The method of any one of claims EEE1C- EEE17C, further comprising detecting, by the control system and based at least in part on the video data, one or more potential sound sources, wherein at least one of the one or more potential sound sources is not indicated by the audio data and wherein the inventory of audio sources includes the one or more potential sound sources.
[0481] EEE19C. The method of any one of claims EEE1C- EEE18C, further comprising causing, by the control system, the display to display audio source labels in the audio data modification GUI or the post-capture GUI.
[0482] EEE20C. The method of claim EEE19C, wherein at least one of the audio source labels corresponds to an audio source or potential audio source identified by the control system based on the audio data, the video data, or both.
[0483] EEE21C. The method of any one of claims EEE1C- EEE20C, wherein the audio data are received from a microphone system of the device and the video data are received from a camera system of the device.
[0484] EEE22C. One or more non-transitory media having instructions stored thereon for controlling one or more devices to perform a method, the method comprising: receiving, by a control system of a device, audio data from a microphone system; receiving, by the control system, video data from a camera system; creating, by the control system and based at least in part on the audio data, the video data, or both, an inventory of audio sources; selecting, by the control system, at least a first selected audio source from the inventory of audio sources, the first selected audio source being selected for augmentation or replacement; augmenting or replacing, by the control system, audio data corresponding to the first selected audio source, to produce first modified audio data, the first modified audio data comprising at least one of first augmented audio data or first replacement audio data; storing, by the control system, the first modified audio data; storing, by the control system, audio data and video data received during a capture phase, the audio data including first unmodified audio data corresponding to at least the first selected audio source; controlling, by the control system, a display of the device to present images corresponding to the video data and to display a post-capture graphical user interface (GUI) overlaid on the images, wherein the post-capture GUI indicates at least the first selected audio source and one or more user input areas to receive user input; and editing, during a post-capture phase review process, the first modified audio data to include at least a portion of the first unmodified audio data based on the user input received by the post-capture GUI.
[0485] EEE23C. The one or more non-transitory media of claim EEE22C, wherein the editing comprises interpolating between the first modified audio data and the first unmodified audio data.
[0486] EEE24C. The one or more non-transitory media of claim EEE22C or claim EEE23C, wherein the post-capture GUI includes at least one user input area configured to receive a user selection of a ratio between the first modified audio data and the first unmodified audio data. [0487] EEE25C. The one or more non-transitory media of claim EEE24C, wherein the at least one user input area comprises a slider.
[0488] EEE26C. An apparatus, comprising: an interface system; a memory system; a display system including at least one display; and a control system configured to: receive audio data from a microphone system; receive video data from a camera system; create, based at least in part on the audio data, the video data, or both, an inventory of audio sources; select at least a first selected audio source from the inventory of audio sources, the first selected audio source being selected for augmentation or replacement; augment or replace audio data corresponding to the first selected audio source, to produce first modified audio data, the first modified audio data comprising at least one of first augmented audio data or first replacement audio data; store the first modified audio data; store audio data and video data received during a capture phase, the audio data including first unmodified audio data corresponding to at least the first selected audio source; control, by the control system, a display of the device to present images corresponding to the video data and to display a post-capture graphical user interface (GUI) overlaid on the images, wherein the post-capture GUI indicates at least the first selected audio source and one or more user input areas to receive user input; and edit, during a post-capture phase review process, the first modified audio data to include at least a portion of the first unmodified audio data based on the user input received by the post-capture GUI.
[0489] EEE27C. The apparatus of claim EEE26C, wherein the editing comprises interpolating between the first modified audio data and the first unmodified audio data.
[0490] EEE28C. The apparatus of claim EEE26C or claim EEE27C, wherein the post-capture GUI includes at least one user input area configured to receive a user selection of a ratio between the first modified audio data and the first unmodified audio data.
[0491] EEE29C. The apparatus of claim EEE28C, wherein the at least one user input area comprises a slider.
[0492] EEE1D. A method, comprising: receiving, by a control system of a device, audio data from a microphone system; receiving, by the control system, video data from a camera system; identifying, by the control system and based at least in part on the audio data and the video data, two or more audio sources in an audio scene; storing, by the control system, audio data and video data received during a capture phase; controlling, by the control system, a display of the device to display images corresponding to the video data and to display, prior to and during the capture phase, a graphical user interface (GUI) overlaid on the images, wherein the GUI includes an audio source image corresponding to each of the two or more audio sources, and wherein the GUI includes one or more user input areas to receive user input; receiving, by the control system, user input via the one or more user input areas, the user input corresponding to at least one of the two or more audio sources; generating, by the control system, revision metadata corresponding to the user input; and storing, by the control system, the revision metadata with at least the audio data received during the capture phase.
[0493] EEE2D. The method of claim EEE1D, further comprising causing audio data corresponding to the revision metadata to be modified according to the revision metadata.
[0494] EEE3D. The method of claim EEE2D, wherein the audio data corresponding to the revision metadata includes unmodified audio data received during the capture phase.
[0495] EEE4D. The method of claim EEE2D or claim EEE3D, wherein the audio data corresponding to the revision metadata includes modified audio data and wherein the modified audio data includes augmented audio data or replacement audio data. [0496] EEE5D. The method of any one of claims EEE2D-EEE4D, wherein causing the audio data to be modified comprises modifying, by the control system, audio data corresponding to the revision metadata.
[0497] EEE6D. The method of any one of claims EEE2D-EEE4D, wherein causing the audio data received during the capture phase to be modified comprises sending, by the control system, the revision metadata and the audio data received during the capture phase to one or more other devices.
[0498] EEE7D. The method of any one of claims EEE2D, EEE3D, EEE4D or EEE6D, wherein causing the audio data received during the capture phase to be modified comprises sending, by the control system, the revision metadata and the audio data received during the capture phase to one or more servers.
[0499] EEE8D. The method of any one of claims EEE2D-EEE7D, wherein causing the audio data received during the capture phase to be modified comprises applying an audio enhancement tool to audio data corresponding to the revision metadata.
[0500] EEE9D. The method of claim EEE8D, wherein the audio data corresponding to the revision metadata includes speech audio data corresponding to speech from at least one person and wherein the audio enhancement tool comprises a speech enhancement tool.
[0501] EEE10D. The method of claim EEE8D, wherein the audio enhancement tool comprises a sound source separation process.
[0502] EEE11D. The method of any one of claims EEE1D-EEE10D, further comprising: receiving, by the control system and after the capture phase has begun, modification user input via the one or more user input areas; and causing, by the control system, audio data received during the capture phase to be modified according to the modification user input.
[0503] EEE12D. The method of any one of claims EEE1D-EEE11D, wherein the one or more user input areas includes at least one user input area configured for receiving user input regarding a selected level.
[0504] EEE13D. The method of any one of claims EEE1D-EEE12D, wherein the identifying comprises creating, by the control system, an inventory of sound sources.
[0505] EEE14D. The method of claim EEE13D, wherein the inventory of sound sources includes actual sound sources and potential sound sources. [0506] EEE15D. The method of any one of claims EEE1D-EEE14D, wherein the storing comprises storing modified audio data that has been modified according to the user input.
[0507] EEE16D. The method of any one of claims EEE1D-EEE15D, wherein the identifying comprises performing, by the control system, a first sound source separation process and wherein causing the audio data to be modified comprises performing a second sound source separation process.
[0508] EEE17D. One or more non-transitory media having instructions stored thereon for controlling one or more devices to perform a method, the method comprising: receiving, by a control system of a device, audio data from a microphone system; receiving, by the control system, video data from a camera system; identifying, by the control system and based at least in part on the audio data and the video data, two or more audio sources in an audio scene; storing, by the control system, audio data and video data received during a capture phase; controlling, by the control system, a display of the device to display images corresponding to the video data and to display, prior to and during the capture phase, a graphical user interface (GUI) overlaid on the images, wherein the GUI includes an audio source image corresponding to each of the two or more audio sources, and wherein the GUI includes one or more user input areas to receive user input; receiving, by the control system, user input via the one or more user input areas, the user input corresponding to at least one of the two or more audio sources; generating, by the control system, revision metadata corresponding to the user input; and storing, by the control system, the revision metadata with at least the audio data received during the capture phase.
[0509] EEE18D. The one or more non-transitory media of claim EEE17D, further comprising causing audio data corresponding to the revision metadata to be modified according to the revision metadata.
[0510] EEE19D. The one or more non-transitory media of claim EEE18D, wherein the audio data corresponding to the revision metadata includes unmodified audio data received during the capture phase.
[0511] EEE20D. The one or more non-transitory media of claim EEE18D or claim EEE19D, wherein the audio data corresponding to the revision metadata includes modified audio data and wherein the modified audio data includes augmented audio data or replacement audio data. [0512] EEE21D. The one or more non-transitory media of any one of claims EEE18D- EEE20D, wherein causing the audio data to be modified comprises modifying, by the control system, audio data corresponding to the revision metadata.
[0513] EEE22D. An apparatus, comprising: an interface system; a display system including one or more displays; a memory system; and a control system configured to: receive, via the interface system, audio data from a microphone system; receive, via the interface system, video data from a camera system; identify, based at least in part on the audio data and the video data, two or more audio sources in an audio scene; store, in the memory system, audio data and video data received during a capture phase; control, a display of the display system to display images corresponding to the video data and to display, prior to and during the capture phase, a graphical user interface (GUI) overlaid on the images, wherein the GUI includes an audio source image corresponding to each of the two or more audio sources, and wherein the GUI includes one or more user input areas to receive user input; receive user input via the one or more user input areas, the user input corresponding to at least one of the two or more audio sources; generate revision metadata corresponding to the user input; and store, in the memory system, the revision metadata with at least the audio data received during the capture phase.
[0514] EEE23D. The apparatus of claim EEE22D, further comprising causing audio data corresponding to the revision metadata to be modified according to the revision metadata.
[0515] EEE24D. The apparatus of claim EEE23D, wherein the audio data corresponding to the revision metadata includes unmodified audio data received during the capture phase.
[0516] EEE25D. The apparatus of claim EEE23D or claim EEE24D, wherein the audio data corresponding to the revision metadata includes modified audio data and wherein the modified audio data includes augmented audio data or replacement audio data.
[0517] EEE26D. The apparatus of any one of claims EEE23D-EEE25D, wherein causing the audio data to be modified comprises modifying, by the control system, audio data corresponding to the revision metadata.
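Several of the enumerated examples above (for example EEE17B, EEE2C and EEE3C) refer to interpolating between modified and unmodified audio data according to a user-selected ratio, such as a slider from zero to 100 percent. For illustration only, a minimal linear crossfade consistent with that idea is sketched below; time-aligned, equal-length signals are assumed, and any perceptual weighting is omitted.

import numpy as np

def blend(modified, unmodified, ratio_percent):
    """ratio_percent = 100 keeps only the modified audio; 0 restores the unmodified capture."""
    w = np.clip(ratio_percent / 100.0, 0.0, 1.0)
    return w * np.asarray(modified) + (1.0 - w) * np.asarray(unmodified)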
[0518] In accordance with example embodiments of the present disclosure, the processes disclosed herein may be implemented as computer software programs or on a computer-readable storage medium. For example, embodiments of the present disclosure include a computer program product including a computer program tangibly embodied on a machine readable medium, the computer program including program code for performing methods. In such embodiments, the computer program may be downloaded and mounted from the network via the communication unit 709, and/or installed from a removable medium, such as the removable medium 151 that is shown in Figure 1A.
[0519] Generally, various example embodiments of the present disclosure may be implemented in hardware or special purpose circuits (e.g., control circuitry), software, logic or any combination thereof. For example, the units discussed above can be executed by control circuitry; thus, the control circuitry may perform — or be configured to perform — the actions described in this disclosure. Some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device (e.g., control circuitry). While various aspects of the example embodiments of the present disclosure are illustrated and described as block diagrams, flowcharts, or using some other pictorial representation, it will be appreciated that the blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
[0520] Additionally, various blocks shown in the flowcharts may be viewed as method steps, and/or as operations that result from operation of computer program code, and/or as a plurality of coupled logic circuit elements constructed to carry out the associated function(s). For example, embodiments of the present disclosure include a computer program product including a computer program tangibly embodied on a machine readable medium, the computer program containing program codes configured to carry out the methods as described above.
[0521] In the context of the disclosure, a machine readable medium may be any tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine readable medium may be a machine readable signal medium or a machine readable storage medium. A machine readable medium may be non-transitory and may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of the machine readable storage medium would include an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
[0522] Computer program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These computer program codes may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus that has control circuitry, such that the program codes, when executed by the processor of the computer or other programmable data processing apparatus, cause the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on a computer, partly on the computer, as a stand-alone software package, partly on the computer and partly on a remote computer or entirely on the remote computer or server or distributed over one or more remote computers and/or servers. [0523] While this document contains many specific implementation details, these should not be construed as limitations on the scope of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable sub combination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can, in some cases, be excised from the combination, and the claimed combination may be directed to a sub combination or variation of a sub combination. Logic flows depicted in the figures do not require the particular order shown, or sequential order, to achieve desirable results. In addition, other steps may be provided, or steps may be eliminated, from the described flows, and other components may be added to, or removed from, the described systems. Accordingly, other implementations are within the scope of the following claims.

CLAIMS

What Is Claimed Is:
1. A method, comprising: receiving, by a control system of a device, audio data from a microphone system; receiving, by the control system, video data from a camera system; creating, by the control system and based at least in part on the audio data, the video data, or both, an inventory of audio sources; selecting, by the control system, at least a first selected audio source from the inventory of audio sources, the first selected audio source being selected for augmentation or replacement; augmenting or replacing, by the control system, audio data corresponding to the first selected audio source, to produce first modified audio data, the first modified audio data comprising at least one of first augmented audio data or first replacement audio data; storing, by the control system, the first modified audio data; storing, by the control system, audio data and video data received during a capture phase, the audio data including first unmodified audio data corresponding to at least the first selected audio source; controlling, by the control system, a display of the device to present images corresponding to the video data and to display a post-capture graphical user interface (GUI) overlaid on the images, wherein the post-capture GUI indicates at least the first selected audio source and one or more user input areas to receive user input; and editing, during a post-capture phase review process, the first modified audio data to include at least a portion of the first unmodified audio data based on the user input received by the post-capture GUI.
2. The method of claim 1, wherein the editing comprises interpolating between the first modified audio data and the first unmodified audio data.
3. The method of claim 1 or claim 2, wherein the post-capture GUI includes at least one user input area configured to receive a user selection of a ratio between the first modified audio data and the first unmodified audio data.
4. The method of claim 3, wherein the at least one user input area comprises a slider.
5. The method of claim 4, wherein the slider is configured to allow a user to select a ratio from zero percent to 100 percent.
6. The method of any one of claims 1-5, wherein the selecting is based, at least in part, on user input.
7. The method of any one of claims 1-6, wherein the selecting is based, at least in part, on the video data.
8. The method of claim 7, wherein the audio data corresponding to the first selected audio source is below a threshold level.
9. The method of any one of claims 1-8, wherein the controlling further comprises adapting the display to present, prior to or during the capture phase, an audio data modification GUI that includes a user prompt associated with augmentation or replacement of audio data corresponding to one or more selected audio sources of the inventory of audio sources.
10. The method of claim 9, wherein the audio data modification GUI includes a user prompt associated with augmentation of the audio data corresponding to a selected audio source and wherein the augmentation is associated with a microphone beamforming process for augmentation of the audio data corresponding to the selected audio source.
11. The method of claim 9 or claim 10, wherein the audio data modification GUI includes a user prompt associated with replacement of the audio data corresponding to a selected audio source and wherein the replacement is associated with replacing the audio data corresponding to the selected audio source with synthetic audio data or with external audio data.
12. The method of any one of claims 9-11, further comprising receiving, by the control system, audio data modification user input via the audio data modification GUI indicating augmentation or replacement of audio data corresponding to the first selected audio source; and wherein the editing is effective to provide the first modified audio data responsive to the audio data modification user input.
13. The method of claim 12, further comprising: labeling, by the control system, modified audio data; and storing a label along with the modified audio data.
14. The method of claim 13, wherein the label comprises audio metadata.
15. The method of any one of claims 1-14, further comprising causing, by the control system, the post-capture GUI to indicate that the audio data corresponding to the first selected audio source is modified audio data.
16. The method of claim 15, further comprising causing, by the control system, the post-capture GUI to indicate one or more audio sources corresponding to unmodified audio data.
17. The method of any one of claims 1-16, wherein the selecting involves estimating which audio sources in the inventory of audio sources correspond to talkers and wherein the one or more selected audio sources do not include audio sources estimated to be talkers.
18. The method of any one of claims 1-17, further comprising detecting, by the control system and based at least in part on the video data, one or more potential sound sources, wherein at least one of the one or more potential sound sources is not indicated by the audio data and wherein the inventory of audio sources includes the one or more potential sound sources.
19. The method of any one of claims 1-18, further comprising causing, by the control system, the display to display audio source labels in the audio data modification GUI or the post-capture GUI.
20. The method of claim 19, wherein at least one of the audio source labels corresponds to an audio source or potential audio source identified by the control system based on the audio data, the video data, or both.
21. The method of any one of claims 1-20, wherein the audio data are received from a microphone system of the device and the video data are received from a camera system of the device.
22. One or more non-transitory media having instructions stored thereon for controlling one or more devices to perform a method, the method comprising:
receiving, by a control system of a device, audio data from a microphone system;
receiving, by the control system, video data from a camera system;
creating, by the control system and based at least in part on the audio data, the video data, or both, an inventory of audio sources;
selecting, by the control system, at least a first selected audio source from the inventory of audio sources, the first selected audio source being selected for augmentation or replacement;
augmenting or replacing, by the control system, audio data corresponding to the first selected audio source, to produce first modified audio data, the first modified audio data comprising at least one of first augmented audio data or first replacement audio data;
storing, by the control system, the first modified audio data;
storing, by the control system, audio data and video data received during a capture phase, the audio data including first unmodified audio data corresponding to at least the first selected audio source;
controlling, by the control system, a display of the device to present images corresponding to the video data and to display a post-capture graphical user interface (GUI) overlaid on the images, wherein the post-capture GUI indicates at least the first selected audio source and one or more user input areas to receive user input; and
editing, during a post-capture phase review process, the first modified audio data to include at least a portion of the first unmodified audio data based on the user input received by the post-capture GUI.
23. The one or more non-transitory media of claim 22, wherein the editing comprises interpolating between the first modified audio data and the first unmodified audio data.
24. The one or more non-transitory media of claim 22 or claim 23, wherein the post-capture GUI includes at least one user input area configured to receive a user selection of a ratio between the first modified audio data and the first unmodified audio data.
25. The one or more non-transitory media of claim 24, wherein the at least one user input area comprises a slider.
26. An apparatus, comprising:
an interface system;
a memory system;
a display system including at least one display; and
a control system configured to:
receive audio data from a microphone system;
receive video data from a camera system;
create, based at least in part on the audio data, the video data, or both, an inventory of audio sources;
select at least a first selected audio source from the inventory of audio sources, the first selected audio source being selected for augmentation or replacement;
augment or replace audio data corresponding to the first selected audio source, to produce first modified audio data, the first modified audio data comprising at least one of first augmented audio data or first replacement audio data;
store the first modified audio data;
store audio data and video data received during a capture phase, the audio data including first unmodified audio data corresponding to at least the first selected audio source;
control a display of the device to present images corresponding to the video data and to display a post-capture graphical user interface (GUI) overlaid on the images, wherein the post-capture GUI indicates at least the first selected audio source and one or more user input areas to receive user input; and
edit, during a post-capture phase review process, the first modified audio data to include at least a portion of the first unmodified audio data based on the user input received by the post-capture GUI.
27. The apparatus of claim 26, wherein the editing comprises interpolating between the first modified audio data and the first unmodified audio data.
28. The apparatus of claim 26 or claim 27, wherein the post-capture GUI includes at least one user input area configured to receive a user selection of a ratio between the first modified audio data and the first unmodified audio data.
29. The apparatus of claim 28, wherein the at least one user input area comprises a slider.
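
The sketches below are illustrative only; they are not part of the application text or the claims, and every identifier in them is hypothetical. As a first illustration, the inventory of audio sources recited in claims 1, 18, and 20 can be thought of as a merge of detections from the audio stream and from the video stream, so that a visually detected object that is not indicated by the audio data is still listed as a potential sound source. A minimal Python sketch, assuming stub detectors in place of real classifiers:

```python
# Hypothetical sketch of building an inventory of audio sources (claims 1, 18, 20).
# The detector functions are stubs; the application does not name any particular
# audio event classifier or visual object detector.
from dataclasses import dataclass, field

@dataclass
class AudioSourceEntry:
    label: str                      # e.g. "talker", "dog", "traffic"
    detected_in_audio: bool = False
    detected_in_video: bool = False
    notes: list = field(default_factory=list)

def detect_audio_events(audio_frame) -> list:
    """Stand-in for an acoustic scene / audio event classifier."""
    return ["talker", "traffic"]

def detect_video_objects(video_frame) -> list:
    """Stand-in for a visual object detector."""
    return ["talker", "dog"]

def build_inventory(audio_frame, video_frame) -> dict:
    inventory = {}
    for label in detect_audio_events(audio_frame):
        inventory.setdefault(label, AudioSourceEntry(label)).detected_in_audio = True
    for label in detect_video_objects(video_frame):
        entry = inventory.setdefault(label, AudioSourceEntry(label))
        entry.detected_in_video = True
        if not entry.detected_in_audio:
            entry.notes.append("potential sound source (not indicated by the audio data)")
    return inventory

# "dog" ends up in the inventory as a potential sound source even though it is
# only visible in the video, mirroring claim 18.
inventory = build_inventory(audio_frame=None, video_frame=None)
```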
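
The interpolation recited in claims 2-5 reduces to mixing the unmodified (real-world) audio with the modified (augmented or replacement) audio for the selected source according to a ratio chosen on the post-capture GUI slider, from zero percent to 100 percent. A minimal sketch, assuming linear mixing of two time-aligned signals; the function and parameter names are illustrative assumptions:

```python
# Hypothetical ratio-based interpolation between unmodified and modified audio
# (claims 2-5). A slider value of 0 keeps only the captured real-world audio;
# a value of 100 keeps only the augmented/replacement audio.
import numpy as np

def interpolate_audio(unmodified, modified, ratio_percent):
    """Linearly mix two time-aligned signals of equal length."""
    unmodified = np.asarray(unmodified, dtype=float)
    modified = np.asarray(modified, dtype=float)
    if unmodified.shape != modified.shape:
        raise ValueError("signals must be time-aligned and of equal length")
    w = np.clip(ratio_percent, 0.0, 100.0) / 100.0
    return (1.0 - w) * unmodified + w * modified

# Example: slider at 30 % keeps 70 % of the captured real-world audio.
fs = 48_000
t = np.arange(fs) / fs
captured = 0.1 * np.random.randn(fs)         # stand-in for recorded audio
replacement = np.sin(2 * np.pi * 440.0 * t)  # stand-in for replacement audio
output = interpolate_audio(captured, replacement, ratio_percent=30.0)
```

In practice the mixing could equally be performed per frequency band or with crossfade ramps; sample-wise linear mixing is only the simplest reading of the interpolation in claim 2.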
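
Claim 10 associates augmentation of the selected audio source with a microphone beamforming process, without prescribing a particular beamformer. The sketch below uses a basic delay-and-sum beamformer as one possible illustration; the array geometry, steering direction, and sign convention are assumptions, not requirements of the application:

```python
# Hypothetical delay-and-sum beamformer illustrating microphone-based
# augmentation of a selected source (claim 10).
import numpy as np

SPEED_OF_SOUND = 343.0  # metres per second

def delay_and_sum(mic_signals, mic_positions, direction, fs):
    """Steer a microphone array toward a unit-vector `direction`.

    mic_signals:   array of shape (num_mics, num_samples), time-domain capture
    mic_positions: array of shape (num_mics, 3), microphone coordinates in metres
    direction:     unit vector pointing from the array toward the source
    fs:            sample rate in Hz
    """
    mic_signals = np.asarray(mic_signals, dtype=float)
    num_mics, num_samples = mic_signals.shape
    # Per-microphone delays (in samples) so that a plane wave arriving from
    # `direction` adds coherently across the array after compensation.
    delays = np.asarray(mic_positions) @ np.asarray(direction) / SPEED_OF_SOUND * fs
    delays -= delays.min()
    freqs = np.fft.rfftfreq(num_samples)  # normalised frequency, cycles per sample
    out = np.zeros(num_samples)
    for m in range(num_mics):
        # Apply a fractional delay in the frequency domain, then sum.
        spectrum = np.fft.rfft(mic_signals[m])
        spectrum *= np.exp(-2j * np.pi * freqs * delays[m])
        out += np.fft.irfft(spectrum, n=num_samples)
    return out / num_mics
```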
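
Claims 13 and 14 recite labeling the modified audio data and storing the label, which may comprise audio metadata, along with the modified audio data. One possible realization, sketched below under the assumption that the label is carried as a small JSON record next to the audio payload; the field names and file layout are illustrative, not taken from the application:

```python
# Hypothetical labeling of modified audio data (claims 13-14): the label is
# written as a JSON metadata file alongside the modified audio payload.
import json
from dataclasses import dataclass, asdict

@dataclass
class AudioModificationLabel:
    source_id: str        # which entry of the audio-source inventory was modified
    modification: str     # "augmented" or "replaced"
    method: str           # e.g. "beamforming", "synthetic", "external"
    capture_ratio: float  # fraction of unmodified real-world audio retained (0.0-1.0)

def store_labeled_audio(path, audio_bytes, label):
    """Write the modified audio and its label side by side."""
    with open(path, "wb") as f:
        f.write(audio_bytes)
    with open(path + ".json", "w") as f:
        json.dump(asdict(label), f, indent=2)

label = AudioModificationLabel(source_id="source_01",
                               modification="replaced",
                               method="synthetic",
                               capture_ratio=0.3)
store_labeled_audio("clip.pcm", b"\x00\x00\x00\x00", label)
```

Storing the label as separate metadata in this way would, for example, let the post-capture GUI indicate which sources correspond to modified audio data, as recited in claims 15 and 16.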

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US202463621960P 2024-01-17 2024-01-17
US63/621,960 2024-01-17
US202563744455P 2025-01-13 2025-01-13
US63/744,455 2025-01-13

Publications (1)

Publication Number Publication Date
WO2025153655A1 (en) 2025-07-24

Family

ID=94382824

Family Applications (2)

Application Number Title Priority Date Filing Date
PCT/EP2025/050776 Pending WO2025153481A1 (en) 2024-01-17 2025-01-14 Computational audio engine
PCT/EP2025/051099 Pending WO2025153655A1 (en) 2024-01-17 2025-01-16 Fallback from augmented audio capture to real-world audio capture

Family Applications Before (1)

Application Number Title Priority Date Filing Date
PCT/EP2025/050776 Pending WO2025153481A1 (en) 2024-01-17 2025-01-14 Computational audio engine

Country Status (1)

Country Link
WO (2) WO2025153481A1 (en)


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2516056B (en) * 2013-07-09 2021-06-30 Nokia Technologies Oy Audio processing apparatus
CA3228068A1 (en) * 2021-10-12 2023-04-20 Christopher Charles NIGHMAN Multi-source audio processing systems and methods

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170293461A1 (en) * 2016-04-07 2017-10-12 VideoStitch Inc. Graphical placement of immersive audio sources
US20230300532A1 (en) * 2020-07-28 2023-09-21 Sonical Sound Solutions Fully customizable ear worn devices and associated development platform
US20230368807A1 (en) 2020-10-29 2023-11-16 Dolby Laboratories Licensing Corporation Deep-learning based speech enhancement

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
B. Ding et al., "Acoustic scene classification: a comprehensive survey," Expert Systems with Applications, vol. 238, Elsevier, 15 March 2024, article no. 121902
J. Redmon et al., "You Only Look Once: Unified, Real-Time Object Detection," arXiv:1506.02640v5 [cs.CV], 9 May 2016
J. Redmon et al., "You Only Look Once: Unified, Real-Time Object Detection," arXiv:1506.02640v5 [cs.CV], 9 May 2016

Also Published As

Publication number Publication date
WO2025153481A1 (en) 2025-07-24

Similar Documents

Publication Publication Date Title
US12073850B2 (en) Data driven audio enhancement
JP7453712B2 (en) Audio reproduction method, device, computer readable storage medium and electronic equipment
CN109616142A (en) Apparatus and method for audio classification and processing
US9502042B2 (en) Apparatus for processing an audio signal and method thereof
CN114830233A (en) Adjusting audio and non-audio features based on noise indicator and speech intelligibility indicator
US12417070B2 (en) Video-informed spatial audio expansion
US20230230610A1 (en) Approaches to generating studio-quality recordings through manipulation of noisy audio
CN109314833A (en) Audio processing device and audio processing method and program
CN114631332B (en) Signaling of audio effects metadata in the bitstream
CN114360573A (en) Method for training speaker separation model, speaker separation method, and related device
EP3662470B1 (en) Audio object classification based on location metadata
CN114283782B (en) Speech synthesis method and device, electronic device and storage medium
WO2025153655A1 (en) Fallback from augmented audio capture to real-world audio capture
CN112995530A (en) Video generation method, device and equipment
CN116320144A (en) A kind of audio playing method and electronic equipment
CN117099159A (en) Information processing device, information processing method and program
CN118447817A (en) Vehicle-mounted human-computer interaction system, method and device, readable storage medium and vehicle
WO2024175623A1 (en) Electronic device, method, and computer program
WO2024246583A1 (en) Systems and method for video processing
CN120302124A (en) Video generation method, device, equipment and storage medium
WO2025240782A1 (en) Content enhancement based on extracted, inferred, and/or supplemented audio characteristics
CN115881161A (en) Speech processing method, model training method and electronic device
Abe et al. Multi-modal Voice Activity Detection by Embedding Image Features into Speech Signal
HK1220803A1 (en) Adaptive audio content generation
HK1220803B (en) Adaptive audio content generation

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 25702701

Country of ref document: EP

Kind code of ref document: A1