
WO2025010217A1 - Attention tracking with sensors - Google Patents

Attention tracking with sensors

Info

Publication number
WO2025010217A1
Authority
WO
WIPO (PCT)
Prior art keywords
content
user
attention
ats
examples
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/US2024/036330
Other languages
French (fr)
Inventor
Dylan James HARPER-HARRIS
Jianbo MA
Andrea FANELLI
Davis R. BARCH
Cedric JOGUET-RECCORDON
Daniel Steven TEMPLETON
Jeffrey Ross Baker
Evan David GITTERMAN
Benjamin SOUTHWELL
Yin-Lee HO
Richard J. CARTWRIGHT
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dolby Laboratories Licensing Corp
Original Assignee
Dolby Laboratories Licensing Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dolby Laboratories Licensing Corp
Publication of WO2025010217A1

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/442Monitoring of processes or resources, e.g. detecting the failure of a recording device, monitoring the downstream bandwidth, the number of times a movie has been viewed, the storage space available from the internal hard disk
    • H04N21/44213Monitoring of end-user related data
    • H04N21/44218Detecting physical presence or behaviour of the user, e.g. using sensors to detect if the user is leaving the room or changes his face expression during a TV program
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16Sound input; Sound output
    • G06F3/165Management of the audio stream, e.g. setting of volume, audio stream path
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04HBROADCAST COMMUNICATION
    • H04H20/00Arrangements for broadcast or for distribution combined with broadcast
    • H04H20/38Arrangements for distribution where lower stations, e.g. receivers, interact with the broadcast
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04HBROADCAST COMMUNICATION
    • H04H60/00Arrangements for broadcast applications with a direct linking to broadcast information or broadcast space-time; Broadcast-related systems
    • H04H60/29Arrangements for monitoring broadcast services or broadcast-related services
    • H04H60/33Arrangements for monitoring the users' behaviour or opinions
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/41Structure of client; Structure of client peripherals
    • H04N21/422Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS]
    • H04N21/42201Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS] biosensors, e.g. heat sensor for presence detection, EEG sensors or any limb activity sensors worn by the user
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/41Structure of client; Structure of client peripherals
    • H04N21/422Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS]
    • H04N21/4223Cameras
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/442Monitoring of processes or resources, e.g. detecting the failure of a recording device, monitoring the downstream bandwidth, the number of times a movie has been viewed, the storage space available from the internal hard disk
    • H04N21/44213Monitoring of end-user related data
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/442Monitoring of processes or resources, e.g. detecting the failure of a recording device, monitoring the downstream bandwidth, the number of times a movie has been viewed, the storage space available from the internal hard disk
    • H04N21/44213Monitoring of end-user related data
    • H04N21/44222Analytics of user selections, e.g. selection of programs or purchase activity
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81Monomedia components thereof
    • H04N21/812Monomedia components thereof involving advertisement data
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/84Generation or processing of descriptive data, e.g. content descriptors
    • H04N21/8405Generation or processing of descriptive data, e.g. content descriptors represented by keywords
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845Structuring of content, e.g. decomposing content into time segments
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85Assembly of content; Generation of multimedia applications
    • H04N21/854Content authoring
    • H04N21/8549Creating video summaries, e.g. movie trailer
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3438Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment monitoring of user actions
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F3/013Eye tracking input arrangements
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04HBROADCAST COMMUNICATION
    • H04H60/00Arrangements for broadcast applications with a direct linking to broadcast information or broadcast space-time; Broadcast-related systems
    • H04H60/61Arrangements for services using the result of monitoring, identification or recognition covered by groups H04H60/29-H04H60/54
    • H04H60/65Arrangements for services using the result of monitoring, identification or recognition covered by groups H04H60/29-H04H60/54 for using the result on users' side

Definitions

  • This disclosure pertains to devices, systems and methods for estimating user attention levels and related factors based on signals from one or more sensors, as well as to responses to such estimated user attention levels.
  • Some methods, devices and systems for estimating user attention, such as user attention to advertising content, are known.
  • Previously-implemented approaches to estimating user attention to media content involve assessing a person’s rating of the content after the person has consumed it, such as after the person has finished watching a movie or an episode of a television program, after the user has played an online game, etc.
  • Although existing devices, systems and methods can provide benefits in some contexts, improved devices, systems and methods would be desirable.
  • At least some aspects of the present disclosure may be implemented via one or more methods.
  • the method(s) may be implemented, at least in part, by a control system and/or via instructions (e.g., software) stored on one or more non-transitory media.
  • Some disclosed methods involve obtaining sensor data from a sensor system during a content presentation and estimating user response events based on the sensor data.
  • Some disclosed methods involve producing user attention analytics based at least in part on estimated user response events corresponding with estimated user attention to content intervals of the content presentation.
  • Some disclosed methods involve causing the content presentation to be altered based, at least in part, on the user attention analytics and causing an altered content presentation to be provided.
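  • The following Python sketch is provided for illustration only and is not part of the disclosure: it outlines the high-level flow of the disclosed methods (obtain sensor data during a content presentation, estimate user response events, produce user attention analytics for content intervals, and cause an altered content presentation to be provided). All names in the sketch (ResponseEvent, attention_tracking_pass, and the callables passed to it) are hypothetical.

        from dataclasses import dataclass
        from typing import Callable, Sequence

        @dataclass
        class ResponseEvent:
            timestamp: float   # seconds into the content presentation
            kind: str          # e.g. "laughter", "lean_forward", "gaze_away"
            confidence: float  # 0.0 .. 1.0

        def attention_tracking_pass(read_sensors: Callable[[], dict],
                                    estimate_events: Callable[[dict], Sequence[ResponseEvent]],
                                    produce_analytics: Callable[[Sequence[ResponseEvent]], dict],
                                    alter_presentation: Callable[[dict], None]) -> None:
            sensor_data = read_sensors()            # obtain sensor data during playback
            events = estimate_events(sensor_data)   # estimate user response events
            analytics = produce_analytics(events)   # attention analytics per content interval
            alter_presentation(analytics)           # cause the altered presentation to be provided
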
  • Non-transitory media may include memory devices such as those described herein, including but not limited to random access memory (RAM) devices, read-only memory (ROM) devices, etc. Accordingly, some innovative aspects of the subject matter described in this disclosure can be implemented via one or more non-transitory media having software stored thereon.
  • an apparatus is, or includes, an audio processing system having an interface system and a control system.
  • the control system may include one or more general purpose single- or multi-chip processors, digital signal processors (DSPs), application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs) or other programmable logic devices, discrete gates or transistor logic, discrete hardware components, or combinations thereof.
  • the control system may be configured for implementing some or all of the methods disclosed herein.
  • a system may include a head unit, a loudspeaker system, a sensor system and a control system.
  • the control system may include one or more device analytics engines and a user attention analytics engine.
  • the one or more device analytics engines may be configured to estimate user response events based on sensor data received from the sensor system.
  • the user attention analytics engine may be configured to produce user attention analytics based at least in part on estimated user response events received from the one or more device analytics engines.
  • the user attention analytics may correspond with estimated user attention to content intervals of a content presentation being provided via the head unit and the loudspeaker system.
  • the control system may be configured to cause the content presentation to be altered based, at least in part, on the user attention analytics and to cause an altered content presentation to be provided by the head unit, by the loudspeaker system, or by the head unit and the loudspeaker system.
  • the system also may include an interface system configured for providing communication between the control system and one or more other devices via a network.
  • the altered content presentation may be, or may include, altered content received from the one or more other devices via the network.
  • causing the content presentation to be altered may involve sending, via the interface system, user attention analytics from the user attention analytics engine and receiving the altered content responsive to the user attention analytics.
  • causing the content presentation to be altered may involve causing the content presentation to be personalized or augmented.
  • causing the content presentation to be personalized or augmented may involve altering one or more of audio playback volume, audio rendering location, one or more other audio characteristics, or combinations thereof.
  • the head unit may be, or may include, a television. According to some such examples, causing the content presentation to be personalized or augmented may involve altering one or more television display characteristics. In some examples, the head unit may be, or may include, a digital media adapter.
  • causing the content presentation to be personalized or augmented may involve altering a storyline, adding a character or other story element, altering a time interval during which a character is involved, altering a time interval devoted to another aspect of the content presentation, or combinations thereof.
  • causing the content presentation to be personalized or augmented may involve providing personalized advertising content.
  • providing personalized advertising content may involve providing advertising content corresponding to estimated user attention to one or more content intervals involving one or more products or services.
  • causing the content presentation to be personalized or augmented may involve providing or altering a laugh track.
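  • Purely as a hedged illustration of the kinds of alterations listed above (audio playback volume, a virtual laugh track, personalized advertising), the sketch below maps attention analytics to presentation changes. The dictionary keys ("attention", "laughter", "interval_attention", "volume_db", and so on) and the thresholds are assumptions for the example, not values from the disclosure.

        def personalize_presentation(analytics: dict, presentation: dict) -> dict:
            # 'analytics' is assumed to hold scores in the range 0..1.
            altered = dict(presentation)

            # Lower playback volume slightly when estimated attention is low.
            if analytics.get("attention", 1.0) < 0.3:
                altered["volume_db"] = presentation.get("volume_db", 0.0) - 6.0

            # Trigger a virtual laugh track when a humour response was detected.
            if analytics.get("laughter", 0.0) > 0.5:
                altered["laugh_track"] = True

            # Choose advertising related to the content interval with the highest attention.
            interval_attention = analytics.get("interval_attention", {})
            if interval_attention:
                altered["ad_topic"] = max(interval_attention, key=interval_attention.get)
            return altered

        example = personalize_presentation(
            {"attention": 0.2, "laughter": 0.9,
             "interval_attention": {"sports_car": 0.8, "dialogue": 0.4}},
            {"volume_db": 0.0})
        # example -> {'volume_db': -6.0, 'laugh_track': True, 'ad_topic': 'sports_car'}
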
  • the head unit, the loudspeaker system and the sensor system may be in a first environment.
  • the control system may be further configured to cause the content presentation to be altered based, at least in part, on sensor data, estimated user response events, user attention analytics, or combinations thereof, corresponding to one or more other environments.
  • the control system may be further configured to cause the content presentation to be paused or replayed.
  • the sensor system may include one or more cameras.
  • the sensor data may include camera data.
  • the sensor system may include one or more microphones.
  • the sensor data may include microphone data.
  • the control system may be further configured to implement an echo management system to mitigate effects of audio played back by the loudspeaker system and detected by the one or more microphones.
  • one or more first portions of the control system may be deployed in a first environment and a second portion of the control system may be deployed in a second environment.
  • the one or more first portions of the control system may be configured to implement the one or more device analytics engines and the second portion of the control system may be configured to implement the user attention analytics engine.
  • Figure 1A is a block diagram that shows examples of components of an apparatus capable of implementing various aspects of this disclosure.
  • Figure 1B shows an environment that includes examples of components capable of implementing various aspects of this disclosure.
  • Figure 1C shows another environment that includes examples of components capable of implementing various aspects of this disclosure.
  • Figure 2 shows example components of an Attention Tracking System (ATS).
  • Figure 3A shows components of an ATS residing in a playback environment according to one example.
  • Figure 3B shows components of an ATS residing in the cloud according to one example.
  • Figure 4 shows components of a neural network capable of performing real-time acoustic event detection according to one example.
  • Figure 5 shows components of a device analytics engine configured for performing real-time pose estimation according to one example.
  • Figure 6 shows components of an ATS according to another example.
  • Figures 7, 8, 9 and 10 show examples of providing feedback from an ATS during video conferences.
  • Figure 11 shows an example of presenting feedback from an ATS after a video conference.
  • Figure 12 is a flow diagram that outlines one example of a disclosed method.
  • Consuming media content may involve viewing a television program or a movie, watching or listening to an advertisement, listening to music or a podcast, gaming, video conferencing, participating in an online learning course, etc. Accordingly, movies, online games, video games, video conferences, advertisements, online learning courses, podcasts, streamed music, etc., may be referred to herein as types of media content.
  • Previously-implemented approaches to estimating user attention to media content do not take into account how a person reacts while the person is in the process of consuming the media content. Instead, a person’s impressions may be assessed according to the person’s rating of the content after the user has consumed it, such as after the person has finished watching a movie or an episode of a television program, after the user has played an online game, etc.
  • Current systems often track what content was selected by a user, where a user chose to stop the content, which sections of the content were replayed, etc. These metrics lack granular information about how consumers are engaging with the content whilst it is being consumed. Current systems are generally not aware if the user is even present at any particular time. For this reason, current user attention estimation methods do not generally provide any real-time information. Such real-time information would allow for much more detailed analytics in addition to real-time feedback to the playback of the content, which could improve the experience for the user.
  • Such states may include, or may involve, user attention, cognitive load, interest, etc.
  • Various disclosed examples overcome the limitations of previously-implemented approaches to estimating user attention.
  • Some disclosed techniques and systems utilize available sensors to detect user reactions, or lack thereof, in real time. Some such examples involve using one or more cameras, eye trackers, ambient light sensors, microphones, wearable sensors, or combinations thereof.
  • Some such examples involve measuring a person’s level of engagement, heart rate, cognitive load, attention, interest, etc., while the person is consuming media content by watching a television, playing a game, participating in a telecommunication experience (such as a videoconference, a video seminar, etc.), listening to a podcast, etc.
  • Advances in automatic speech recognition (ASR), emotion recognition and gaze tracking have made an attention tracking system like this possible.
  • smart devices and a variety of human-oriented sensors have become commonplace in our lives. For example, we have microphones in our smart speakers, phones and televisions (TVs), cameras for gaming consoles, and galvanic skin response sensors in our smart watches. Using some or all of these technologies in combination allows for an enhanced user attention tracking system to be achieved.
  • the presence of multiple playback devices can make for a richer playback experience, which may involve audio playback out of satellite smart speakers, haptic feedback from mobile phones, etc. Orchestrating multiple devices to play back audio can make it difficult to detect audio events using microphones due to the echo. Therefore, this disclosure includes descriptions of various approaches to performing echo management.
  • Some disclosed examples involve using real-time sensor data to adapt a content presentation for the user in real time. Some such examples may involve altering one or more aspects of the media content in response to estimated attention, arousal, cognitive load, etc. Some disclosed examples involve aggregating real-time user attention information on both a user and content level to provide insights about future content options for the user and changes to the content for future consumers respectively.
  • an attention tracking system consistent with the proposed techniques may form the following feedback loop:
  • a content presentation is played back out of one or more devices;
  • an attention analytics engine determines the attention level of users in real time; and
  • the content presentation is adapted in real time to improve the experience for the users.
  • user attention analytics may be aggregated over time to determine how people engage with the content presentation and users’ affinity to different aspects of the content presentation. These user attention analytics can still be used in conjunction with traditional attention detection methods.
  • An example application of immediate real-time attention detection is a virtual laugh track.
  • a virtual laugh track could laugh with the user whenever they find the content presentation humorous.
  • Short-term uses of the real-time user feedback could include catered content selection and targeted ads.
  • longer-term analytics may be used for the purpose of adjusting a content presentation so it appeals to a larger audience, for improved targeted user advertising, etc.
  • user response may refer to any form of attention to content, such as an audible reaction, a body pose, a physical gesture, a heart rate, wearing a content-related article of clothing, etc. Attention may take many forms, such as binary (e.g., a user said “Yes”), on a spectrum (e.g., excitement, loudness, leaning forward), or open-ended (e.g., a topic of a discussion, a multidimensional embedding). Attention may imply something in relation to a content presentation or to an object in the content presentation. On the other hand, attention to non-content-related information may correspond to a low level of engagement with a content presentation.
  • attention to be detected may be in a short list (e.g., “Wow,” “Ahh,” “Red,” “Blue,” slouching, leaning forward, left hand up, right hand up) as prescribed by any combination of the user, content, content provider, user device, etc.
  • A short list of possible reactions, if supplied by the content presentation, may arrive through a metadata stream of the content presentation.
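  • As a hypothetical sketch of how such a metadata-supplied short list might be consumed on the playback side, assuming a JSON payload with an invented "expected_reactions" field:

        import json

        def reactions_from_metadata(metadata_json: str) -> list:
            # Extract the prescribed short list of detectable reactions, if present.
            metadata = json.loads(metadata_json)
            return [r.lower() for r in metadata.get("expected_reactions", [])]

        payload = '{"expected_reactions": ["Wow", "Ahh", "Red", "Blue", "left hand up"]}'
        print(reactions_from_metadata(payload))
        # ['wow', 'ahh', 'red', 'blue', 'left hand up']
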
  • Figure 1A is a block diagram that shows examples of components of an apparatus capable of implementing various aspects of this disclosure.
  • the types, numbers and arrangements of elements shown in Figure 1A are merely provided by way of example. Other implementations may include more, fewer and/or different types, numbers and arrangements of elements.
  • the apparatus 150 may be configured for performing at least some of the methods disclosed herein.
  • the apparatus 150 may be, or may include, one or more components of a workstation, one or more components of a home entertainment system, etc.
  • the apparatus 150 may be a laptop computer, a tablet device, a mobile device (such as a cellular telephone), an augmented reality (AR) wearable, a virtual reality (VR) wearable, an automotive subsystem (e.g., infotainment system, driver assistance or safety system, etc.), a game system or console, a smart home hub, a television or another type of device.
  • the apparatus 150 may be, or may include, a server.
  • the apparatus 150 may be, or may include, an encoder.
  • the apparatus 150 may be, or may include, a decoder. Accordingly, in some instances the apparatus 150 may be a device that is configured for use within an environment, such as a home environment, whereas in other instances the apparatus 150 may be a device that is configured for use in “the cloud,” e.g., a server.
  • the interface system 155 may, in some implementations, be configured for receiving, or for providing, a content stream.
  • the content stream may include video data and audio data corresponding to the video data.
  • the audio data may include, but may not be limited to, audio signals.
  • the audio data may include spatial data, such as channel data and/or spatial metadata. Metadata may, for example, have been provided by what may be referred to herein as an “encoder.”
  • the interface system 155 may include one or more interfaces between the control system 160 and a memory system, such as the optional memory system 165 shown in Figure 1A.
  • the control system 160 may include a memory system in some instances.
  • the interface system 155 may, in some implementations, be configured for receiving input from one or more microphones in an environment.
  • the control system 160 may, for example, include a general purpose single- or multichip processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or combinations thereof.
  • a portion of the control system 160 may reside in a device that is implementing a cloud-based service, such as a server, and another portion of the control system 160 may reside in another device that is implementing the cloud-based service, such as another server, a memory device, etc.
  • the interface system 155 also may, in some examples, reside in more than one device.
  • the control system 160 may be configured to perform, at least in part, the methods disclosed herein.
  • the control system 160 may implement one or more device analytics engines configured to estimate user response events based on sensor data received from the sensor system.
  • the control system 160 may implement a user attention analytics engine configured to produce user attention analytics based, at least in part, on estimated user response events received from the one or more device analytics engines.
  • the user attention analytics may correspond with estimated user attention to content intervals of a content presentation.
  • the control system 160 may be configured to cause the content presentation to be altered based, at least in part, on the user attention analytics.
  • the control system 160 may be configured to cause an altered content presentation to be provided by one or more displays, by a loudspeaker system, or by one or more displays and the loudspeaker system.
  • Non-transitory media may include memory devices such as those described herein, including but not limited to random access memory (RAM) devices, read-only memory (ROM) devices, etc.
  • the one or more non-transitory media may, for example, reside in the optional memory system 165 shown in Figure 1A and/or in the control system 160. Accordingly, various innovative aspects of the subject matter described in this disclosure can be implemented in one or more non-transitory media having software stored thereon.
  • the software may, for example, include instructions for controlling at least one device to perform some or all of the methods disclosed herein.
  • the software may, for example, be executable by one or more components of a control system such as the control system 160 of Figure 1A.
  • the apparatus 150 may include the optional microphone system 170 shown in Figure 1A.
  • the optional microphone system 170 may include one or more microphones.
  • the optional microphone system 170 may include an array of microphones.
  • the array of microphones may be configured to determine direction of arrival (DOA) and/or time of arrival (TOA) information, e.g., according to instructions from the control system 160.
  • the array of microphones may, in some instances, be configured for receive-side beamforming, e.g., according to instructions from the control system 160.
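  • The disclosure does not specify how DOA or TOA information is computed; one common approach is to estimate the time difference of arrival between a microphone pair with a GCC-PHAT cross-correlation, sketched below with NumPy as an assumption-laden example rather than a prescribed method.

        import numpy as np
        from typing import Optional

        def gcc_phat_tdoa(sig: np.ndarray, ref: np.ndarray, fs: float,
                          max_tau: Optional[float] = None) -> float:
            # Generalized cross-correlation with phase transform (PHAT) weighting.
            n = sig.size + ref.size
            spec = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
            cc = np.fft.irfft(spec / (np.abs(spec) + 1e-12), n=n)
            max_shift = n // 2
            if max_tau is not None:
                max_shift = min(int(fs * max_tau), max_shift)
            cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
            # Lag of the correlation peak, converted to seconds.
            return float(np.argmax(np.abs(cc)) - max_shift) / fs
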
  • one or more of the microphones may be part of, or associated with, another device, such as a speaker of the speaker system, a smart audio device, etc.
  • the apparatus 150 may not include a microphone system 170. However, in some such implementations the apparatus 150 may nonetheless be configured to receive microphone data for one or more microphones in an environment via the interface system 155. In some such implementations, a cloud-based implementation of the apparatus 150 may be configured to receive microphone data, or data corresponding to the microphone data, from one or more microphones in an environment via the interface system 155.
  • the apparatus 150 may include the optional loudspeaker system 175 shown in Figure 1A.
  • the optional loudspeaker system 175 may include one or more loudspeakers, which also may be referred to herein as “speakers” or, more generally, as “audio reproduction transducers.”
  • the apparatus 150 may not include a loudspeaker system 175.
  • the apparatus 150 may include the optional sensor system 180 shown in Figure 1A.
  • the optional sensor system 180 may include one or more touch sensors, gesture sensors, motion detectors, cameras, eye tracking devices, or combinations thereof.
  • the one or more cameras may include one or more freestanding cameras.
  • one or more cameras, eye trackers, etc., of the optional sensor system 180 may reside in a television, a mobile phone, a smart speaker, a laptop, a game console or system, or combinations thereof.
  • the apparatus 150 may not include a sensor system 180.
  • the apparatus 150 may nonetheless be configured to receive sensor data for one or more sensors (such as cameras, eye trackers, camera-equipped monitors, etc.) residing in or on other devices in an environment via the interface system 155.
  • Although the microphone system 170 and the sensor system 180 are shown as separate components in Figure 1A, the microphone system 170 may be referred to, and may be considered, as part of the sensor system 180.
  • the apparatus 150 may include the optional display system 185 shown in Figure 1A.
  • the optional display system 185 may include one or more displays, such as one or more light-emitting diode (LED) displays.
  • the optional display system 185 may include one or more organic light-emitting diode (OLED) displays.
  • the optional display system 185 may include one or more displays of a television, a laptop, a mobile device, a smart audio device, an automotive subsystem (e.g., infotainment system, driver assistance or safety system, etc.), or another type of device.
  • the sensor system 180 may include a touch sensor system and/or a gesture sensor system proximate one or more displays of the display system 185.
  • the control system 160 may be configured for controlling the display system 185 to present one or more graphical user interfaces (GUIs).
  • the apparatus 150 may be, or may include, a smart audio device, such as a smart speaker.
  • the apparatus 150 may be, or may include, a wakeword detector.
  • the apparatus 150 may be configured to implement (at least in part) a virtual assistant.
  • Figure 1B shows an environment that includes examples of components capable of implementing various aspects of this disclosure.
  • the types, numbers and arrangements of elements shown in Figure 1B are merely provided by way of example. Other implementations may include more, fewer and/or different types, numbers and arrangements of elements.
  • the environment 100A includes a head unit 101, which is a television (TV) in this example.
  • the head unit 101 may be, or may include, a digital media adapter (DMA) such as an Apple TV™ DMA, an Amazon Fire™ DMA or a Roku™ DMA.
  • a content presentation is being provided via the head unit 101 and a loudspeaker system that includes loudspeakers of the TV and the satellite speakers 102a and 102b.
  • the attention levels of one or more of persons 105a, 105b, 105c, 105d and 105e are being detected by the camera 106 on the TV, by microphones of the satellite speakers 102a and 102b, by microphones of the smart couch 104 and by microphones of the smart table 103.
  • the sensors of the environment 100A are primarily used for detecting auditory feedback and visual feedback that may be detected by the camera 106.
  • the sensors of the environment 100A may include additional types of sensors, such as one or more additional cameras, an eye tracker configured to collect gaze and pupil size information, one or more ambient light sensors, one or more heat sensors, one or more sensors configured to measure galvanic skin response, etc.
  • one or more cameras in the environment 100A — which may include the camera 106 — may be configured for eye tracker functionality.
  • The elements of Figure 1B include:
  • 101: A head unit, which is a TV in this example, providing an audiovisual content presentation and detecting user attention with microphones;
  • 102a, 102b: Plurality of satellite speakers playing back content and detecting attention with microphone arrays;
  • 105a, 105b, 105c, 105d, 105e: Plurality of users attending to content in the environment 100A;
  • Figure 1C shows another environment that includes examples of components capable of implementing various aspects of this disclosure.
  • the types, numbers and arrangements of elements shown in Figure 1C are merely provided by way of example. Other implementations may include more, fewer and/or different types, numbers and arrangements of elements.
  • Figure 1C illustrates a scenario in which an Attention Tracking System (ATS) is implemented in an automotive setting.
  • the environment 100B is an automotive environment in this example.
  • the back seat passengers 105h and 105i are attending to content on their respective displays 101c and 101d.
  • the front seat users 105f and 105g are attending to the drive, attending to a display 101b and attending to audio content playing out of the loudspeakers.
  • the audio content may include music, a podcast, navigational directions, etc.
  • An ATS is leveraging all the sensors in the car to determine each user’s degree of attention to content.
  • the elements of Figure 1C include:
  • 101c, 101d: Passenger screens designed to play entertainment content such as movies;
  • 105f, 105g, 105h, 105i: Plurality of users attending to content in the vehicle;
  • 106b: Camera facing the outside of the vehicle, detecting content such as billboards;
  • 106c: Camera facing the interior of the vehicle, detecting user attention;
  • 301d, 301e, 301f, 301g: Plurality of microphones picking up noises in the vehicle, including content playback, audio indications of user attention, etc.; and
  • 304d, 304e, 304f, 304g: Plurality of speakers playing back content.
  • Figure 2 shows example components of an Attention Tracking System (ATS).
  • a content presentation 202 — also referred to as a content playback 202 — is being provided via devices in the environment 100C and users’ reactions are then detected by one or more sensors of the sensor system 180.
  • the Attention Analytics Engine (AAE) 201 is configured to estimate user attention levels based on sensor data from the one or more sensors.
  • analytics about how one or more people 105 attend to the content presentation can be aggregated over time. In this example, the content analytics and user affinity module 203 is configured to estimate individual user affinity and the affinity of groups to different aspects of the content presentation. The groups may correspond to particular demographics or populations.
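  • One hypothetical way the content analytics and user affinity module 203 could aggregate attention analytics over time is a per-interval running mean, sketched below; the interval labels and data layout are invented for the example.

        from collections import defaultdict

        class AffinityAggregator:
            # Accumulates per-interval attention scores to estimate affinity of a
            # user (or a group) to different aspects of a content presentation.
            def __init__(self):
                self._totals = defaultdict(float)
                self._counts = defaultdict(int)

            def add(self, interval_label: str, attention: float) -> None:
                self._totals[interval_label] += attention
                self._counts[interval_label] += 1

            def affinities(self) -> dict:
                return {label: self._totals[label] / self._counts[label]
                        for label in self._totals}

        agg = AffinityAggregator()
        agg.add("car_chase", 0.9)
        agg.add("dialogue", 0.4)
        agg.add("car_chase", 0.7)
        print(agg.affinities())   # car_chase averages 0.8, dialogue 0.4
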
  • the content presentation module 205 is configured to adjust the content presentation in real time according to attention analytics 206 received from the AAE 201.
  • the adjustments made by the content presentation module 205 may include things such as content selection, virtual laughs and cheering, etc.
  • Various additional examples of adjustments that may be made by the content presentation module 205 are disclosed herein.
  • Real-time adjustments to the content made by the content presentation module 205 and corresponding user feedback detected by the sensor system 180 can create a feedback loop, homing in on a better experience for the one or more people 105.
  • The elements of Figure 2 include:
  • 105: One or more people attending to the content playback 202 and having their reactions detected by sensors;
  • 180: A sensor system including one or more sensors configured to detect real-time user responses to the content presentation 202;
  • 201: An Attention Analytics Engine (AAE) configured to estimate user attention levels based on sensor data from the sensor system 180;
  • 202: The playback of the content supplied by the content presentation module 205 through any combination of loudspeakers, displays, lights, etc.;
  • 203: A content analytics and user affinity module, which is configured for the aggregation of attention analytics to provide insight regarding user attention to the content playback 202; and
  • 205: A content presentation module, which is configured to supply the content to be played and is configured to make real-time adjustments to the content presentation based on attention analytics 206 supplied by the AAE 201.
  • the Attention Tracking System (ATS) 200 includes the AAE 201, the content analytics and user affinity module 203, the sensor system 180 and the content presentation module 205.
  • the AAE 201, the content analytics and user affinity module 203 and the content presentation module 205 are implemented by an instance of the control system 160 of Figure 1A.
  • the AAE 201, the content analytics and user affinity module 203 and the content presentation module 205 may be implemented as instructions, such as software, stored on one or more non-transitory and computer-readable media.
  • the ATS includes an Attention Analytics Engine 201, sensors, and one or more Device Analytics Engines (not shown in Figure 2), each corresponding to a sensor or group of sensors, all operating in real time.
  • the ATS is configured for: collecting information from available sensors; passing the sensor information through one or more Device Analytics Engines; and determining user attention by passing Device Analytics Engine results through an Attention Analytics Engine (201).
  • the output of the ATS may vary according to the particular implementation. In some examples, the output of the ATS may be any type of attention analytics-related information, a content presentation corresponding to the attention analytics related information, etc.
  • an ATS may be implemented via an Attention Analytics Engine 201 and a content presentation module 205 residing in a device within a playback environment (for example, implemented by one of the playback devices of the playback environment), via an Attention Analytics Engine 201 and a content presentation module 205 residing within “the cloud” (for example, implemented by one or more servers that are not within the playback environment), etc.
  • Figure 3A shows components of an ATS residing in a playback environment according to one example.
  • the types, numbers and arrangements of elements shown in Figure 3A are merely provided by way of example. Other implementations may include more, fewer and/or different types, numbers and arrangements of elements.
  • the ATS 200 includes an Attention Analytics Engine (AAE) 201 and a content presentation module 205 residing in a head unit 101, which is a television (TV) in this example.
  • a plurality of sensors, which include microphones 301a of the head unit 101, microphones 301b of the satellite speaker 102a and microphones 301c of the satellite speaker 102b, provide sensor data to the AAE 201 corresponding to what is occurring in the environment 100D.
  • Other implementations may include additional types of sensors, such as one or more cameras, one or more eye trackers configured to collect gaze and pupil size information, one or more ambient light sensors, one or more heat sensors, one or more sensors configured to measure galvanic skin response, etc.
  • some implementations may include three or more satellite speakers.
  • the microphones 301a-301c will detect audio of a content presentation.
  • echo management modules 302a, 302b and 302c are configured to suppress audio of the content presentation, allowing more reliable detection of sounds corresponding to the users’ reactions over the content in the signals from the microphones 301a-301c.
  • the content presentation module 205 is configured to send echo reference information 306 to the echo management modules 302a, 302b and 302c.
  • the echo reference information 306 may, for example, contain information about the audio being played back by the loudspeakers 304a, 304b and 304c.
  • local echo paths 307a, 307b and 307c may be cancelled using a local echo reference with a local echo canceller.
  • any type of echo management system could be used here, such as a distributed acoustic echo canceller.
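  • As one hedged illustration of such echo management, a normalized least-mean-squares (NLMS) adaptive filter can subtract an estimate of the played-back audio, derived from the echo reference, from each microphone signal; the disclosure permits any echo management approach, so the sketch below is illustrative only.

        import numpy as np

        def nlms_echo_cancel(mic: np.ndarray, ref: np.ndarray,
                             taps: int = 256, mu: float = 0.3, eps: float = 1e-8) -> np.ndarray:
            # Subtract an adaptively estimated echo of the playback reference 'ref'
            # from the microphone signal 'mic', returning the residual (user sounds).
            w = np.zeros(taps)        # adaptive estimate of the echo path
            buf = np.zeros(taps)      # most recent reference samples, newest first
            out = np.zeros_like(mic)
            for n in range(len(mic)):
                buf = np.roll(buf, 1)
                buf[0] = ref[n] if n < len(ref) else 0.0
                e = mic[n] - w @ buf                      # residual after echo removal
                w += (mu / (buf @ buf + eps)) * e * buf   # NLMS weight update
                out[n] = e
            return out
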
  • the head unit 101 includes Device Analytics Engine (DAE) 303a
  • the satellite speaker 102a includes DAE 303b
  • the satellite speaker 102b includes DAE 303c.
  • DAEs 303a, 303b and 303c are configured to detect user activity from sensor signals, which are microphone signals in these examples.
  • the particular implementation of the DAE 303 may, for example, depend on the sensor type or mode.
  • some implementations of the DAE 303 may be configured to detect user activity from microphone signals
  • other implementations of the DAE 303 may be configured to detect user activity, attention, etc., based on camera signals.
  • DAEs may be multimodal, receiving and interpreting inputs from different sensor types. In some examples, DAEs may share sensor inputs with other DAEs. Outputs of DAEs 303 also may vary according to the particular implementation. DAE output may, for example, include detected phonemes, emotion type estimations, heart rate, body pose, a latent space representation of sensor signals, etc.
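  • A purely illustrative data structure for such DAE output (the field names are assumptions, not identifiers from the disclosure) might look like this:

        from dataclasses import dataclass, field
        from typing import Optional

        @dataclass
        class DAEOutput:
            # Example of what a Device Analytics Engine might report to the AAE.
            device_id: str
            timestamp: float                # seconds into the content presentation
            event_probabilities: dict = field(default_factory=dict)  # e.g. {"laughter": 0.8}
            heart_rate_bpm: Optional[float] = None   # from a wearable sensor, if available
            body_pose: Optional[str] = None          # e.g. "leaning_forward"
            embedding: Optional[list] = None         # latent-space representation

        report = DAEOutput(device_id="satellite_102a", timestamp=312.5,
                           event_probabilities={"laughter": 0.82})
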
  • the outputs 309a, 309b and 309c from the DAEs 303a, 303b and 303c, respectively, are fed into the AAE 201.
  • the AAE 201 is configured to combine the information of the DAEs 303a-303c to produce attention analytics.
  • the AAE 201 may be configured to use various types of data distillation techniques, such as neural networks, algorithms, etc.
  • an AAE 201 may be configured to use natural language processing (NLP) using speech recognition output from one or more DAEs.
  • the analytics produced by AAE 201 allow for real-time adjustments of content presentations by the content presentation module 205.
  • the content presentation 310 is then provided to, and played out of, actuators that include the loudspeakers 304a, 304b and 304c, and the TV display screen 305 in this example.
  • Other examples of actuators that could be used include lights and haptic feedback devices.
  • The elements of Figure 3A include:
  • 301a, 301b, 301c Microphones picking up sounds in the environment 100D, including content playback, audio corresponding to user responses, etc.;
  • 302a, 302b, 302c Echo management modules configured to reduce the level of the content playback that is picked up in the microphones;
  • 303a, 303b, 303c Device analytics engines, which may be configured to convert sensor readings into the probability of certain attention events, such as laughter, gasping or cheering;
  • the AAE 201, the content presentation module 205, the echo management module 302a and the DAE 303a are implemented by an instance 160a of the control system 160 of Figure 1A
  • the echo management module 302b and the DAE 303b are implemented by another instance 160b of the control system 160
  • the echo management module 302c and the DAE 303c are implemented by a third instance 160c of the control system 160.
  • the AAE 201, the content presentation module 205, the echo management modules 302a-302c and the DAEs 303a-303c may be implemented as instructions, such as software, stored on one or more non-transitory and computer-readable media.
  • Figure 3B shows components of an ATS residing in the cloud according to one example.
  • the types, numbers and arrangements of elements shown in Figure 3B are merely provided by way of example. Other implementations may include more, fewer and/or different types, numbers and arrangements of elements.
  • the ATS 200 includes an AAE 201 and a content presentation module 205 that are implemented by a control system instance 160d of the control system 160 of Figure 1A.
  • the control system instance 160d resides in the cloud 308.
  • the cloud 308 may, for example, include one or more servers that reside outside of the environment 100E, but which are configured for communication with the head unit 101 and at least the satellite speakers 102a and 102b via a network.
  • the echo reference information 306 and the local echo paths 307 may be as shown in Figure 3A, but have been removed from Figure 3B in order to lower the diagram's visual complexity.
  • any type of echo management system may be used here. The two key differences between this implementation and the implementation shown in Figure 3A are:
  • the DAE results 309a, 309b and 309c are sent from the DAEs 303a, 303b and 303c, respectively, to the AAE 201 in the cloud 308 via a network;
  • the content presentation module 205 is also implemented in the cloud 308 and is configured to provide the content 310 to the devices in the environment 100E via the network.
  • Figure 4 shows components of a neural network capable of performing real-time acoustic event detection according to one example.
  • the types, numbers and arrangements of elements shown in Figure 4 are merely provided by way of example. Other implementations may include more, fewer and/or different types, numbers and arrangements of elements.
  • the embeddings 414 produced by the transformers are projected — using a fully connected layer 405 — into the number of desired unit scores 406.
  • the unit scores may represent anything related to acoustic events, such as subword units, phonemes, laughter, cheering, gasping, etc.
  • a SoftMax module 407 is configured to normalize the unit scores 406 into unit probabilities 408 representing the posterior probabilities of acoustic events.
  • attention-related acoustic events that could be detected for use in the proposed techniques and systems include: Sounds that could possibly indicate being engaged: laughing, screams, cheering, booing, crying, sniffling, groans, vocalizations of ooo, ahh and shh, talking about the content, cursing, etc.;
  • Sounds that could be used to indicate attention for specific content types, such as:
    o Movies: saying an actor’s name;
    o Sports: naming a player or team;
    o When music is in the content: whistling, applause, foot tapping, finger snapping, singing along with the content, a person making a repetitive noise corresponding with a rhythm of the content;
    o In children’s shows: children making emotive vocalizations, or responding to a “call and response” prompt;
    o Workout-related content: grunting, heavy breathing, groaning, gasping;
  • noises that may help infer attention to the content based on the context of the content, such as silence during a dramatic time interval, naming an object, character or concept in a scene, etc.
  • a Device Analytics Engine configured to detect attention-related events in microphone input data
  • a banding block configured to process time-domain input into banded time-frequency domain information
  • a SoftMax module 407 configured to normalize the unit scores 406 into unit probabilities 408 representing the likelihood of acoustic events
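  • A minimal PyTorch sketch of the Figure 4 pipeline as described (banded time-frequency input, transformer layers producing the embeddings 414, a fully connected layer 405 producing unit scores 406, and a softmax 407 yielding unit probabilities 408) is shown below. The layer sizes, number of units and use of PyTorch are assumptions for illustration, not details taken from the disclosure.

        import torch
        import torch.nn as nn

        class AcousticEventNet(nn.Module):
            def __init__(self, n_bands: int = 64, d_model: int = 128,
                         n_layers: int = 2, n_units: int = 8):
                super().__init__()
                self.input_proj = nn.Linear(n_bands, d_model)
                layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
                self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
                self.fc = nn.Linear(d_model, n_units)   # unit scores, e.g. laughter, cheering

            def forward(self, banded: torch.Tensor) -> torch.Tensor:
                # banded: (batch, frames, n_bands) banded time-frequency features
                emb = self.encoder(self.input_proj(banded))   # per-frame embeddings
                scores = self.fc(emb)                         # unit scores per frame
                return torch.softmax(scores, dim=-1)          # unit probabilities per frame

        probs = AcousticEventNet()(torch.randn(1, 100, 64))   # shape (1, 100, 8)
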
  • Figure 5 shows components of a device analytics engine configured for performing real-time pose estimation according to one example.
  • the types, numbers and arrangements of elements shown in Figure 5 are merely provided by way of example. Other implementations may include more, fewer and/or different types, numbers and arrangements of elements.
  • Figure 5 shows example components of a Device Analytics Engine (DAE) 303e configured to estimate user attention from visual information.
  • the DAE 303e is implemented by control system instance 160f.
  • the DAE 303e may be configured to estimate user attention from visual information via a range of techniques, depending on the particular implementation. Some examples involve applying machine learning methods, using algorithms implemented via software, etc.
  • the DAE 303e includes a skeleton estimation module 501 and a pose classifier 502.
  • the skeleton estimation module 501 is configured to calculate the positions of a person’s primary bones from the camera data 510, which includes a video feed in this example, and to output skeletal information 512.
  • the skeleton estimation module 501 may be implemented with publicly available toolkits such as YOLO-Pose.
  • the pose classifier 502 may be configured to implement any suitable process for mapping skeletal information to pose probabilities, such as a gaussian mixture model or a neural network.
  • the DAE 303e — in this example, the pose classifier 502 — is configured to output the pose probabilities 503.
  • the DAE 303e also may be configured to estimate the distances of one or more parts of the user’s body according to the camera data.
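  • The pose classifier 502 could take many forms; the deliberately simple sketch below maps 2-D skeletal keypoints to pose probabilities from a torso-lean angle, standing in for the Gaussian mixture model or neural network mentioned above. The keypoint names and angle values are invented for the example.

        import numpy as np

        POSES = ("leaning_forward", "neutral", "lying_back")

        def pose_probabilities(keypoints: dict) -> dict:
            # 'keypoints' maps joint names to (x, y) pixel coordinates (hypothetical names).
            shoulder = np.mean([keypoints["l_shoulder"], keypoints["r_shoulder"]], axis=0)
            hip = np.mean([keypoints["l_hip"], keypoints["r_hip"]], axis=0)
            dx, dy = shoulder[0] - hip[0], hip[1] - shoulder[1]   # image y grows downwards
            lean = np.degrees(np.arctan2(dx, dy))                 # torso lean angle in degrees
            # Soft assignment of the lean angle to three coarse pose classes.
            centres = np.array([25.0, 0.0, -25.0])
            scores = -0.5 * ((lean - centres) / 10.0) ** 2
            probs = np.exp(scores) / np.exp(scores).sum()
            return dict(zip(POSES, probs.round(3)))

        print(pose_probabilities({"l_shoulder": (210, 300), "r_shoulder": (290, 300),
                                  "l_hip": (220, 460), "r_hip": (280, 460)}))
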
  • Visual detections can reveal a range of attention information. Some examples include: Visuals that may indicate a person being positively engaged: leaning forward, lying back, moving in response to events in the content, wearing clothes signifying allegiance to something in the content, etc.;
  • Visuals that may indicate a person being negatively engaged: a facial expression indicating disgust, a person’s hand “flipping the bird,” etc.;
  • Visuals that may indicate a person being unengaged: a person is looking at a phone when the phone is not being used to provide the content presentation, the person is holding a phone to their head when the phone is not being used to provide the content presentation, the person is asleep, no person in the room is paying attention, no one is present, etc.
  • a Device Analytics Engine configured to perform real-time pose estimation based on the camera data 510;
  • a skeleton estimation module configured to calculate the positions and rotations of a person’s primary bones from the camera data 510 and to output skeletal information 512;
  • a pose classifier configured to map the skeletal information 512 to pose probabilities 503;
  • This section describes numerous concepts directed at distilling sensor data captured during video consumption (e.g., viewing of videos such as movies, social media posts, etc., and participation in video conferencing) into attention metrics, which are then presented as feedback via a graphical interface.
  • the section includes both techniques for providing live feedback during a conference and offline feedback via reports.
  • Figure 6 shows components of an ATS according to another example. As with other figures provided herein, the types, numbers and arrangements of elements shown in Figure 6 are merely provided by way of example. Other implementations may include more, fewer and/or different types, numbers and arrangements of elements.
  • the ATS 200 has as inputs the independent measurements of various types of sensors in the sensor system 180, and outputs one or more distilled attention metrics.
  • the ATS 200 includes Device Analytics Engines (DAEs) 303f, 303g, 303h, 303i and 303j, as well as AAE 201, all of which are implemented by control system instance 160g.
  • each of the DAEs 303f-303j receives and processes sensor data from a different type of sensor of the sensor system 180: the DAE 303f receives and processes camera data from one or more cameras, the DAE 303g receives and processes motion sensor data 602 from one or more motion sensors, the DAE 303h receives and processes wearable sensor data 604 from one or more wearable sensors, the DAE 303i receives and processes microphone data 410 from one or more microphones and the DAE 303j receives and processes user input 606 from one or more user interfaces.
  • the DAE 303f produces DAE output 309f
  • the DAE 303g produces DAE output 309g
  • the DAE 303h produces DAE output 309h
  • the DAE 303i produces DAE output 309i
  • the DAE 303j produces DAE output 309j.
  • the AAE 201 is configured to implement one or more types of data distillation methods, data combination methods, or both, on the DAE output 309f-309j.
  • data distillation and/or combination may be performed by implementing a fuzzy inference system.
  • Other data distillation and/or combination methods may involve implementing a support vector machine or a neural network.
  • the AAE 201 may be configured to implement an attention analysis model that is tuned or trained against subjectively measured ground truth.
  • the AAE 201 may be configured to implement an attention analysis model that is tuned or trained with research training data gathered in a research lab setting. Such data might not be available in a consumer setting.
  • the research training data may, for example, include sensor data from a wider array of sensors than would be typical in a consumer setting, such as for example the full complement of the sensors disclosed herein.
  • the research training data may include measurements typically not available to consumers, such as for example EEG recordings.
  • the AAE 201 is configured to output an estimated attention score 610 and individual estimated attention metrics 612.
  • the estimated attention score 610 is a combined attention score that includes responses of all people who are currently consuming a content presentation and the individual estimated attention metrics 612 are individual attention scores for each person who is currently consuming the content presentation.
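  • As a concrete illustration of the fusion performed by the AAE 201, the sketch below combines per-DAE scores into individual attention metrics and a combined attention score using a simple weighted average. This is only a stand-in for the fuzzy inference system, support vector machine or neural network mentioned above; the weights, DAE names and score ranges are assumptions for illustration.

```python
# Minimal sketch of late fusion of DAE outputs into per-person metrics and a
# combined attention score. Weights and DAE names are illustrative.
from dataclasses import dataclass, field

@dataclass
class AttentionFusion:
    weights: dict = field(default_factory=lambda: {
        "camera": 0.4, "motion": 0.2, "wearable": 0.2,
        "microphone": 0.15, "ui": 0.05})

    def per_person(self, dae_outputs: dict) -> dict:
        """dae_outputs: {person_id: {dae_name: score in [0, 1]}}"""
        metrics = {}
        for person, scores in dae_outputs.items():
            num = sum(self.weights[d] * s for d, s in scores.items() if d in self.weights)
            den = sum(self.weights[d] for d in scores if d in self.weights)
            metrics[person] = num / den if den else 0.0
        return metrics

    def combined(self, dae_outputs: dict) -> float:
        metrics = self.per_person(dae_outputs)
        return sum(metrics.values()) / len(metrics) if metrics else 0.0

# Example usage with two people currently consuming the presentation.
fusion = AttentionFusion()
outputs = {"alice": {"camera": 0.9, "microphone": 0.7},
           "bob": {"camera": 0.3, "motion": 0.5}}
print(fusion.per_person(outputs), fusion.combined(outputs))
```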
  • Figures 7, 8, 9 and 10 show examples of providing feedback from an ATS during video conferences.
  • Figure 11 shows an example of presenting feedback from an ATS after a video conference.
  • the ATS may be as shown in Figure 6, or a similar ATS.
  • the feedback shown in Figures 7-11 may be provided based on an ATS having fewer sensor types and fewer DAEs providing input to the AAE 201.
  • the types, numbers and arrangements of elements shown in Figures 7-11 are merely provided by way of example. Other implementations may include more, fewer and/or different types, numbers and arrangements of elements.
  • Figure 7 shows an example of providing feedback from an ATS during a one-to-many video conference presentation.
  • Figure 7 shows examples of various windows that may be presented on a presenter’s display during the video conference presentation.
  • Window 705 shows a video of the presenter herself
  • window 710 shows content that is being presented
  • windows 715a, 715b and 715c show videos of the non-presenting participants.
  • window 720 shows a graph indicating three types of ATS feedback: curve 725 indicates estimated excitement, curve 730 indicates estimated attention and curve 735 indicates estimated cognitive load.
  • the three types of ATS feedback are aggregated and are based on data from all of the non-presenting participants.
  • Figure 8 shows an example of providing feedback from an ATS during a many-to-many video conference discussion.
  • Figure 8 shows examples of windows that may be presented on each participant’s display during the video conference discussion.
  • Windows 805 show a video of each of the participants.
  • each of the windows 805 includes an attention score 810, which indicates the estimated attention for that individual participant.
  • an ATS may estimate more than one type of feedback.
  • the attention score 810 may indicate aggregated ATS feedback.
  • multiple attention scores 810 may be presented in each of the windows 805.
  • attention scores 810 may not be presented on each participant’s display during the video conference discussion.
  • attention scores 810 may be presented only on a subset of participants’ displays (for example, on one participant’s display) during the video conference discussion. The subset of participants may, for example, include a supervisor, a moderator, etc.
  • Figure 9 shows another example of providing feedback from an ATS during a many-to-many video conference discussion.
  • Figure 9 shows examples of windows that may be presented on each participant’s display, or on a subset of participants’ displays, during the video conference discussion.
  • Windows 905 show a video of each of the participants.
  • window border 910a is shown with the thickest outline, indicating that the current speaker is being shown in the corresponding window 905a.
  • window borders 910b and 910c are shown with an outline that is less thick than that of the window border 910a, but thicker than that of the other windows 905, indicating that the participants shown in windows 905b and 905c are currently looking at the current speaker.
  • Figure 10 shows another example of providing feedback from an ATS during a many-to-many video conference discussion.
  • Figure 10 shows examples of windows that may be presented on each participant’s display, or on a subset of participants’ displays, during the video conference discussion.
  • Windows 1005 show a video of each of the participants.
  • window border 1010a is shown with the thickest outline, indicating that the current speaker is being shown in the corresponding window 1005a.
  • window borders 1010b and 1010c are shown with an outline that is less thick than that of the window border 1010a, but thicker than that of the other windows 1005, indicating that the ATS is estimating a high level of attention for the participants shown in windows 1005b and 1005c.
  • Figure 11 shows an example of providing feedback from an ATS after a many-to-many video conference discussion.
  • the table shown in Figure 11 provides ATS feedback for each of six video conference participants.
  • row 1105 indicates speaking percentages for each participant and column 1110 indicates individual attention scores for each participant.
  • the interior cells indicate the attention of individual participants to each other and to other stimuli.
  • cell 1115a indicates the attention of User 1 to User 2
  • cell 1115b indicates the attention of User 6 to User 5.
  • the cells in column 1120 indicate the attention of individual participants to stimuli or events other than the other video conference participants.
  • cell 1115c indicates the attention of User 6 to stimuli or events other than the other video conference participants.
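  • The sketch below illustrates one possible data structure for a post-conference report of this kind: per-participant speaking percentages, pairwise attention values, attention to other stimuli and overall attention scores. All values and field names are made up for illustration.

```python
# Illustrative builder for a Figure-11-style attention table.
participants = ["User 1", "User 2", "User 3"]

def build_attention_table(speaking_pct, pairwise, other, overall):
    """pairwise[(a, b)]: estimated attention of participant a to participant b."""
    table = {"speaking_pct": speaking_pct, "overall": overall, "rows": {}}
    for a in participants:
        row = {b: pairwise.get((a, b), 0.0) for b in participants if b != a}
        row["other stimuli"] = other.get(a, 0.0)
        table["rows"][a] = row
    return table

table = build_attention_table(
    speaking_pct={"User 1": 40, "User 2": 35, "User 3": 25},
    pairwise={("User 1", "User 2"): 0.8, ("User 2", "User 1"): 0.6},
    other={"User 3": 0.5},
    overall={"User 1": 0.7, "User 2": 0.6, "User 3": 0.4})
print(table["rows"]["User 1"])
```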
  • Attention Feedback Loop
  • Any decision made based on the information provided by the ATS may have its effectiveness evaluated using the ATS, to detect the users’ responses. In some examples, this may form a closed loop in which the decisions made using ATS information can be improved over time. According to some examples, this closed loop of user attention and responses to user attention can go a step further, and after a decision is known to have a particular effect on a user, it can be passed on to another user to experience.
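  • A minimal sketch of such a feedback loop is shown below, using an epsilon-greedy choice over content-augmentation variants (here, hypothetical laugh-track variants) rewarded by the amount of laughter the ATS detects. The variant names and the reward definition are illustrative assumptions rather than a prescribed implementation.

```python
# Sketch of a closed attention feedback loop: choose an augmentation variant,
# observe the detected user response, and prefer variants that work over time.
import random

class LaughterFeedbackLoop:
    def __init__(self, variants, epsilon=0.1):
        self.epsilon = epsilon
        self.totals = {v: 0.0 for v in variants}  # cumulative reward per variant
        self.counts = {v: 0 for v in variants}

    def choose_variant(self):
        if random.random() < self.epsilon:
            return random.choice(list(self.totals))  # occasionally explore
        return max(self.totals,
                   key=lambda v: self.totals[v] / (self.counts[v] or 1))

    def record_response(self, variant, detected_laughter: float):
        # detected_laughter could be, e.g., seconds of laughter reported by the ATS
        self.totals[variant] += detected_laughter
        self.counts[variant] += 1

loop = LaughterFeedbackLoop(["giggle_track", "belly_laugh_track", "crowd_roar"])
for _ in range(20):
    chosen = loop.choose_variant()
    loop.record_response(chosen, detected_laughter=random.uniform(0, 5))  # stand-in for ATS output
```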
  • One or more users are detected laughing to a comedy. This laughter is detected with the users’ ATS-enabled device(s). Using this information, additional laughter is added to the comedy (for example, by a content presentation module 205 according to instructions from an AAE 201). The ATS detects that this increased the hilarity for the users, and so the system (for example, the content presentation module 205 according to instructions from the AAE 201) continues to add laughter when the ATS detects users laughing to comedy. Over time, the system starts to optimize on which types of added laughs each of the users finds most amusing.
  • One or more users are detected laughing to a comedic ad after their TV show has finished. This laughter is detected with the users’ ATS-enabled device(s).
  • the system (for example, the content presentation module 205 according to instructions from the AAE 201) continues to show comedic ads after TV shows end. Over time, the system starts to optimize on which types of comedic advertising the users find most amusing per individual.
  • an individual user’s affinity (or the affinity of a group) for comedic advertisements after their TV show is finished may be determined over many viewing sessions. In some such examples, such affinities may be determined by the Content Analytics and User Affinity block 203, whereas in other examples such affinities may be determined by the AAE 201.
  • One or more users are detected laughing at a comedy at home. This laughter is detected with the users’ ATS-enabled device(s). Using this information, additional laughter is added to the comedy. It is detected that this increased the hilarity for the users, and so the system also starts to add laughter when it hears users laugh to a comedic podcast in the car. Over time the system starts to optimize on in which types of locations, and with which media types, the users find the laughs most amusing per individual.
  • Modern-day personalization systems for content enjoyment rely on information the user explicitly provides by using control systems for the content, such as play and pause. For instance, current content recommendation systems base their decisions on what content the user selects to consume, which sections they replayed and whether they left a like on a piece of content. This provides little insight into how a user is attending to the content or whether the user is even present during playback. Additionally, slow personalizations may come in the form of fan feedback through explicitly leaving a comment or review about the content. This allows the content producer to cater the next piece of content for their audience, but the feedback loop is slow and only representative of the vocal users who leave comments.
  • Adaptions to the content enjoyment using richer attention information can be made in real time and/or over the long-term by aggregating attention metrics. These detailed attention metrics could also lead to better-informed recommendations and could provide content producers with real-time feedback about how users are attending to the content.
  • the attention tracking system may be utilized again to determine the effect on the user’s or users’ attention.
  • the ongoing use of an attention tracking system can form a closed loop in which many types of personalization can be optimized, or at least improved.
  • content enjoyment experience may refer to anything that could affect a user’s experience whilst enjoying content, such as altering playback (e.g., volume, TV-backlighting level), content (e.g., the selected content, changing the storyline, adding elements) or control systems (e.g., pausing, replaying).
  • the phrase “personalization” as used herein may refer to any content enjoyment experience that has been catered for a user or users.
  • Experience augmentation refers to a change to the content enjoyment experience, including but not limited to a real-time change.
  • “Recommendations” refer to any content suggested for a user. Personalizations, experience augmentations and recommendations may all be made based on attention information. These terms will be described in greater detail in their respective sections.
  • Linear content refers to any form of content that is intended to be played from start to finish without diverging paths or controls to the flow of the content (e.g., an episode of a Netflix™ series, an audio book, a song, podcast, TV show).
  • A specific response, where the user does the exact thing requested. For example, the user says “Yes,” “No” or something that does not match, or the user can be detected raising either the left or right hand or neither.
  • A type of response, where the user’s reaction matches the requested type of response to some degree. For example, a user may be requested to “start moving” and the target attention type is movement from the user. In this case, detecting the user wiggling around would be a strong match. Another example could involve the content saying, “Are you ready?” to which the content is looking for an affirmative response. There may be many valid user responses that suggest affirmation, such as “Yes,” “Absolutely,” “Let’s do this” or a head nod.
  • An emotional response, where the user’s emotion or a subset of their emotions is detected.
  • a content provider wishes to know the sentiment consumers had towards their latest release. They decide to add emotion to the short-list of detectable attention. Users consuming the content start to have a conversation about the content, and only the sentiment of the conversation is derived, as an indication of their emotional reaction.
  • Another example involves a user who only wants to share their emotion level on the dimension of elatedness. When the user is disgusted by the content they are watching, their disgust is not detected. However, a low level of elatedness is reflected in the attention detections.
  • a topic of discussion where the ATS determines what topics arose in response to the content. For example, content producers want to know what questions their movie raises for audiences. After listing the topic of discussion as an attention listed option, they find that people generally talk about how funny the movie is or about global warming.
  • the attention lists may be provided from a range of different providers, such as the device manufacturer, the user, the content producer, etc. If multiple detectable attention lists are available, any suitable way of combining these lists may be used to determine a resulting list, such as using the user’s list only, a union of all lists, the intersection of the user’s list and the content provider’s list, etc.
  • the detectable attention lists may also provide the user with a level of privacy, where the user can provide their own list of what they would like to be detectable and provide their own rules for how their list is combined with external parties. For instance, a user may be provided with staged options (for example, via a graphical user interface (GUI) and/or one or more audio prompts) as to what is detected from them, and they select to have only emotional and specific responses detected. This makes the user feel comfortable about using their ATS- enabled device(s).
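  • The sketch below illustrates how detectable attention lists from several providers might be combined under a user-selected rule, as described above. The policy names and list contents are illustrative assumptions.

```python
# Sketch of combining detectable-attention lists under a user-chosen policy.
user_list = {"emotional", "specific_response"}
content_provider_list = {"specific_response", "topic_of_discussion", "type_of_response"}
device_default_list = {"specific_response", "emotional", "movement"}

def combine_lists(policy: str) -> set:
    if policy == "user_only":
        return set(user_list)
    if policy == "union":
        return user_list | content_provider_list | device_default_list
    if policy == "user_and_provider_intersection":
        return user_list & content_provider_list
    raise ValueError(f"unknown policy: {policy}")

print(combine_lists("user_and_provider_intersection"))  # {'specific_response'}
```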
  • the list of detectable attention indications may arrive to the user’s device in several ways. Two examples include:
  • the list of detectable attention indications is supplied in a metadata stream of the content to the device.
  • the user may be able to select from these detectable attention indications, e.g., as described above.
  • the list of detectable attention indications associated with a segment of content may be learnt from users who have set their ATS-enabled device(s) to detect a larger set of attention indications from them. In this way, content providers can discover how users are attending to their content and then add these attention indication types to the list they wish to detect for users with a more restricted set of detectable attention indications.
  • Personalization may mean a way to cater an experience for a user based on analytics of the user over the long term.
  • Personalization adaptions may have their effectiveness evaluated by testing how the adaptions change the user’s experience using the ATS-enabled device(s). This forms a closed loop where the personalised adaption can continue to improve and track a user’s preferences.
  • personalised adaption may be applied in an open-loop fashion, where the effect of the adaption is not measured. The changes applied to the experience may not always follow the user’s preference, so as to provide natural variety and avoid creating an isolated bubble for the user, such as political isolation.
  • the content-related preferences of each user may be determined over time by aggregating results from the ATS. Preferences that can be tracked may vary widely and could include content types, actors, themes, effects, topics, locations, etc.
  • the user preferences may be determined in the cloud or on the users’ ATS-enabled device(s). In some instances, the terms “user preferences,” “interests” and “affinity” may be used interchangeably.
  • Short-term estimations of what a user is interested in may be established before long-term aggregations of user preferences are available. Such short-term estimations may be made through recent attention information and hypothesis testing using the attention feedback loop.
  • Personalised adaptions to the content may be made based on the users’ preferences.
  • the adaption may be optimized in a way to account for all the users.
  • Some example methods of making personalised adaptions based on the preferences of multiple users include calculating one or more mean attention-related values, determining a maximum attention-related value for the group, determining a learnt combination of the preferences (such as via a neural network), etc.
  • the attention-related values may, for example, correspond to user preferences, including but not limited to predetermined/previously known user preferences.
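  • The following sketch shows the group-combination options mentioned above (mean, maximum, or a learnt combination) applied to per-user preference scores for one candidate adaption; the learnt option is represented by fixed weights as a stand-in for a neural network, and all values are illustrative.

```python
# Sketch of combining several users' preference scores for a candidate adaption.
def combine_group_preferences(values, method="mean", learnt_weights=None):
    """values: per-user preference scores for a candidate adaption, in [0, 1]."""
    if method == "mean":
        return sum(values) / len(values)
    if method == "max":
        return max(values)
    if method == "learnt":
        # fixed weights stand in for a learnt (e.g. neural-network) combination
        weights = learnt_weights or [1.0 / len(values)] * len(values)
        return sum(w * v for w, v in zip(weights, values))
    raise ValueError(method)

group_scores = [0.9, 0.4, 0.7]  # hypothetical per-user scores
print(combine_group_preferences(group_scores, "mean"))
print(combine_group_preferences(group_scores, "max"))
```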
  • Personalised adaptions may be delivered in the content stream as different playback options — for example, as user-selectable playback options — in some examples.
  • a different option is to generate the personalised adaptions, such as via a neural network. The generation of adaptions may take place on the user’s ATS-enabled device(s) or in the cloud, depending on the particular implementation.
  • One example of an extension of “Make personalised adaptions to linear content based on users’ preferences” is where the storyline of content is adapted to meet user preferences.
  • the storyline may be adapted by replacement, addition or removal of sections in the content.
  • the storyline of content may also be adapted to meet multiple users’ preferences. This might be useful when there are many people watching a content presentation together in the same viewing session, for example.
  • the storyline may be optimized to improve the overall enjoyment for the group using a mixture of their preferences.
  • Alice is known to be a Stephen Curry fan as her gaze primarily followed him during a game, as detected by ATS-enabled device(s) with a camera. During the next film she watches, the content is personalised with Stephen Curry making a cameo appearance, using generative techniques.
  • Generative elements refer to media produced by a machine, which may include video, audio, lighting, or combinations thereof.
  • Augmentation may mean changing an aspect (e.g., an object in a scene, an instrument playing a sound, the lighting in a scene, the narrator’s voice) of the content.
  • Some examples include superimposing a generated image to replace an object in the scene (e.g., changing an apple into an orange) and changing a sound from one type to another (e.g., changing strings to horns).
  • One example involves Top Gun: Maverick, which stars Tom Cruise as Captain Pete Mitchell. John is an avid fan of Brad Pitt and would much prefer seeing Mitchell played by Pitt.
  • Brad Pitt’s face is superimposed over Tom Cruise’s face for the entire film.
  • Tom Cruise’s voice is also replaced with generative techniques to make him sound like Brad Pitt.
  • Mitchell is now played by Pitt instead of Cruise.
  • John is watching the Super Bowl on his ATS-enabled device(s) that detect he is wearing a red cap using a camera.
  • the system (for example, the AAE 201) infers he must be rooting for the Chiefs (the red team), and selects the version of the stream that is catered for Chiefs fans.
  • John may be presented with additional replays of scores made by the Chiefs, while fewer replays are shown for the opposing team.
  • John is not wearing a red cap so the ATS does not initially know what team John is rooting for. As the game progresses, John starts engaging by cheering “Go Chiefs,” booing at the opposing team and cheering for scores made by the Chiefs. This information is detected by the ATS, and the appropriate stream is selected for John.
  • John is not wearing a red cap or being vocal about the game. John’s user preferences from previous sessions are drawn upon where he sided with the Chiefs, so the Chiefs version of the stream is selected for John.
  • the version of the stream that is selected also may include the selection of John’s team’s commentators.
  • John and Alice are watching the Super Bowl together on ATS-enabled device(s) with lights and cameras available to the system. The system (for example, the AAE 201) infers that John is a Chiefs fan as he is wearing a red cap, but infers Alice is a Lions fan as she is wearing blue and cheering for the Lions’ scores. John and Alice are provided with a balanced version of the content presentation. In some examples, the lights may turn red on the side of the room where John is sitting and blue on the other side for Alice.
  • Bob is listening to an audiobook played on an ATS device that has detected Bob speaking with a British accent.
  • the audiobook is played with an American accent by default; however, the UK version of the audiobook is selected for Bob, making the audiobook more familiar and natural for him to listen to.
  • the personalised adaptions are pre-loaded experience augmentations. Examples of experience augmentation are described in detail under the next heading.
  • the pre-loaded experience augmentations may be decided based on previous sessions where users with similar preferences enjoyed the same experience augmentation. Such implementations may heighten the experience for the user and also may help to cover cases where the ATS fails to detect an attention response.
  • the phrase “user habits” refers to routine ways in which the user engages with content (positively or negatively) that may contain useful insights into how the playback and experience could be optimized for the user.
  • User habits may be determined by detections such as which seat the user sits in when the user engages with content using an ATS, or what time of night the user starts making annoyed sighs when there is a loud scene in content.
  • user characteristics may mean attributes about the user that may affect how the user experiences content, such as the user’s hearing, eyesight, attention span, etc.
  • User characteristics may be estimated by an ATS based on how the user interacts with content, such as: missing jokes specifically when they are delivered quietly, suggesting poor hearing; and/or having trouble following which object the actor is referring to when they only describe it by colour, suggesting possible colour blindness.
  • User habits and characteristics may also be seen as a form of user preferences.
  • Playback adaptions refer to changes to the content presentation. Playback adaption may include changing the spatial rendering of content, volume adjustments, changing the colour balance on a screen, etc.
  • adaptions to the user’s experience may be intended to ensure the user’s environment is in a desirable state to enjoy content.
  • Some such examples may include features such as automatically pausing content when there are interruptions to the playback of the content, warning the user of something they should evaluate before starting the content, etc.
  • These adaptions to playback and experience may be managed on the user’s playback device(s).
  • the adaptions may be jointly selected to improve the experience for multiple users of the ATS-enabled system simultaneously.
  • the following sections detail specific examples of user habits, user characteristics and resulting playback and experience adaptions.
  • Personalised accessibility changes to content playback based on users’ characteristics
  • One example of playback and experience adaptions based on users’ habits and characteristics is where playback adaptions are made to improve the accessibility of the content based on the user’s characteristics.
  • User characteristics that might be optimized for may include short-sightedness, colour blindness, hearing loss, vertigo, epilepsy, etc.
  • the playback adaptions that could be applied to the mentioned characteristics to improve the content’s accessibility may include increasing font size of subtitles, augmenting the colour or appearance of objects in a scene, applying compression to the stream’s audio, reducing camera shake, skipping scenes with flashing lights, etc.
  • dialogue enhancement may be enabled to improve the audibility of these jokes.
  • an automotive experience is improved based on seat occupancy.
  • the seat occupancy can be determined by sensors indicating where people are located in the vehicle when they engage with content.
  • the ATS system may be configured to detect when there is a different seat occupancy to usual.
  • the improved automotive experience may come in the form of user warnings or notifications. These warnings or notifications could inform the user that they have possibly forgotten a person who usually goes on this drive. This improves the experience of the drive as the user has comfort knowing the common participants of the ride have not been forgotten.
  • the car notifies the user of her absence and a crisis is averted.
  • the ATS may be configured — for example, according to metadata in a content stream — to detect classes of responses (e.g., laughing, cheering, yelling “Yes”).
  • the added elements may augment the experience for the user or users in real time.
  • the elements may be delivered in the metadata stream alongside the content to enable such functionality.
  • a library of elements stored locally (such as within a TV or another playback device) may be applicable to many content streams.
  • the library of elements may be applied more broadly than elements that are specific to a particular content presentation.
  • generative AI technology may be applied to automatically generate elements, for example based on a text description of a content presentation.
  • the intensity (e.g., volume of a played sound, strength of audio effect, size of icon, colour saturation of icon, opacity of icon, strength of visual effect) of the response may be proportional to the strength of the user reaction (e.g., loudness of yell, length of laughter, pitch of singing).
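  • A minimal sketch of the proportional mapping described above follows, assuming a linear mapping with a gain and clamping to a valid range; any monotonic mapping could be substituted.

```python
# Sketch of scaling an added element's intensity with the strength of the
# detected user reaction, clamped to a safe range.
def element_intensity(reaction_strength: float,
                      gain: float = 1.0,
                      min_intensity: float = 0.0,
                      max_intensity: float = 1.0) -> float:
    """reaction_strength: e.g. normalised loudness of a yell or length of laughter."""
    return max(min_intensity, min(max_intensity, gain * reaction_strength))

# e.g. a louder yell produces a more saturated overlay icon or louder sound
print(element_intensity(0.3), element_intensity(1.4))  # 0.3, 1.0
```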
  • Add auditory elements dynamically to content based on reactions
  • Another extension of “Add elements dynamically to content based on reactions” is where the added elements play auditorily.
  • the auditory element may include playing a sound or adding an audio effect. Examples of sounds that could be played are crowd noises, comedic sounds, impact hits, etc.
  • the audio effect may be something such as reverb, distortion, spatial movement etc.
  • the auditory element may, in some examples, be a combination of these elements, or of similar elements.
  • the sound and audio effect may include the use of Dolby ATMOS™ or another type of object-based audio system.
  • An example of a sound using ATMOS™ is playing a virtual laugh track by placing different laughs in different spatial positions.
  • Another example of a spatial audio effect could include hearing a baseball moving around the users after the batter strikes the ball.
  • the virtual laugh track may be created by placing different laughs in different spatial positions.
  • a spatial “whoosh” sound and audio effect is added to the content.
  • the spatial movement of the sound may make it sound as if the ball is flying past the users.
  • the hit of the ball may also be encoded in metadata provided with a content stream.
  • Another extension of “Add auditory elements dynamically to content based on reactions” is where the added auditory elements are sourced from other remote users of one or more other ATS systems. Such examples may allow the users in a local environment to hear how others reacted to content as the users in the local environment consume the content. For example, when a user (User A) reacts to the content, the reaction may be recorded and sent to User A’s friends (including User B). Later, when User B goes to watch the same content, User A's reaction may be sent to User B’s playback device through the content’s metadata stream. User A’s reaction may be played for User B at the same point in the content where User A reacted. The reaction played back for User B may, in some examples, contain the reactions of many users played at the same time. The reactions played back for User B may not be from a friend of User B.
  • Jane is watching a game of football between Liverpool and Manchester. She cheers for all the goals scored for Liverpool and boos at some goals scored by Manchester. Jane’s friend Bob later goes to watch the match between Liverpool and Manchester. Bob gets to experience watching the game with Jane’s cheering and booing as he watches the game.
  • TV shows are often watched by millions of people. As a show is watched by users, their reactions could be recorded then sent to the cloud where these reactions are mixed to form a crowd reaction to the TV show. When another user goes to watch the show, the crowd reaction audio may be played alongside the content.
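  • The sketch below illustrates one way recorded remote reactions could be mixed into a single crowd-reaction track aligned to content time. The sample rate, clip format and peak normalisation are assumptions for illustration.

```python
# Sketch of mixing remote users' recorded reactions into one crowd-reaction track.
import numpy as np

def mix_reactions(reactions, content_duration_s, sample_rate=48000):
    """reactions: list of (start_time_s, mono_clip ndarray) captured by remote ATS devices."""
    out = np.zeros(int(content_duration_s * sample_rate), dtype=np.float32)
    for start_s, clip in reactions:
        start = int(start_s * sample_rate)
        end = min(start + len(clip), len(out))
        out[start:end] += clip[: end - start]
    peak = np.max(np.abs(out))
    return out / peak if peak > 1.0 else out  # simple peak normalisation

# Stand-ins for two recorded cheers occurring around the same content moment.
crowd = mix_reactions(
    [(12.0, 0.2 * np.random.randn(48000).astype(np.float32)),
     (12.5, 0.2 * np.random.randn(96000).astype(np.float32))],
    content_duration_s=20.0)
```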
  • a visual element may be superimposed or composited onto the display with the content.
  • Said visual element may be an icon, animation, emoji, etc.
  • the visual element may be a visual effect that is added to the content (e.g., shaking, changing colour).
  • Jane is watching the Academy Awards telecast on her ATS-enabled TV. While introducing the Best Actor in a Motion Picture category, the host of the show asks the home audience to yell out the name of their favourite actor. Jane is a Tom Cruise fan and enjoyed Top Gun Maverick this year, so she yells out “Tom Cruise.” In response, Jane’s TV displays an overlaid image of a virtual Oscar being held aloft by a cartoon representation of Tom Cruise, whereas her neighbour Ben, who is a Brad Pitt fan, sees an overlaid image of a virtual Oscar held aloft by Brad Pitt.
  • an ATS-enabled TV may be configured to show an animation of an ambulance whenever the user is heard to say “bring on the ambulance.”
  • The Batman TV series circa the 1960s would routinely superimpose words such as “Smash!,” “Bam!” and “Pow!” over the action in a cartoon font whenever a character was struck in a fight scene.
  • Batman 2026, being viewed on an ATS-enabled TV, could display similar words based on user utterances in fight scenes. For example, the Penguin is hit by Robin and the user yells “Pow!” Based on this, an ATS-enabled TV may be configured to overlay the text “Pow!” instead of the other available options “Smash!” and “Bam!”
  • the added elements are changes to the lighting.
  • a change to the lighting may be issued.
  • These lighting changes may include a bulk changing to the lighting or may trigger an animation using the lights.
  • the bulk change may include changing the colour and intensity of all lights.
  • each light may be changed to its own independent colour and intensity.
  • an animation may be issued to play on the lights in response to a user reaction.
  • Some example animations are a wave, strobing and flickering like a flame.
  • the user is watching a horror movie and verbally indicates they are too scared watching the film.
  • the lights are raised in brightness, reducing the fear induced by the content.
  • a family is watching a game of rugby where the team colours are red and blue.
  • the ATS detects the family positively responding to the red team and the lights in the room make a bulk change to red.
  • the content may be adjusted to improve the intelligibility of what might otherwise be missed by the one or more users.
  • the intelligibility may be improved by delaying the next scene until the user reaction has subsided, increasing the volume of the dialogue, turning on captioning so the next scene can still be understood, etc.
  • the content stream may, in some examples, contain extra media to optionally extend the scene to achieve the time delay. Without said extra media, the time delay can be achieved by pausing the content until the user response has decreased in intensity.
  • generative AI could be used to generate extra media to extend the scene for as long as required to prevent the content from progressing, until the users are ready to move on.
  • a volume increase of the dialogue may be proportional to the intensity of the user reaction.
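  • The sketch below illustrates the intelligibility-preserving behaviour described above: hold the next scene while the detected reaction remains above a threshold, enable captions, and raise the dialogue gain in proportion to the reaction intensity. The threshold, gain range and function name are illustrative assumptions.

```python
# Sketch of intelligibility preservation driven by the detected reaction level.
def handle_scene_transition(reaction_intensity: float,
                            hold_threshold: float = 0.6,
                            max_extra_gain_db: float = 6.0) -> dict:
    """reaction_intensity: current reaction level in [0, 1] reported by the ATS."""
    actions = {}
    actions["delay_next_scene"] = reaction_intensity > hold_threshold
    actions["dialogue_gain_db"] = round(max_extra_gain_db * reaction_intensity, 1)
    actions["enable_captions"] = reaction_intensity > hold_threshold
    return actions

print(handle_scene_transition(0.8))  # delay scene, boost dialogue, captions on
print(handle_scene_transition(0.2))  # proceed normally with a small dialogue boost
```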
  • Bob is watching a comedy with Jane. Bob finds the punchline to a scene very funny and has a long, extended laugh. The next scene of the comedy may be delayed until Bob can contain his laughter to a certain level.
  • a family is watching the Super Bowl and they cheer loudly for an extended time for a touchdown that is scored by their team. Closed captions of the commentators are automatically turned on for the period during which the users remain engaged. This allows the users to follow what the commentators are saying.
  • a user is watching stand-up comedy and laughs at a joke told by the comedian for an extended time.
  • the system increases the volume of the comedian’s voice while the user continues to laugh.
  • a user is having trouble understanding the dialogue in a film they are watching.
  • a dialogue enhancement feature may be turned on that improves the audibility of speech in the room.
  • This section shares similarities with the Long-term Personalization and Short-term Experience Augmentation sections, except that the personalizations are performed by a content producer based on the users’ attention metrics as detected by one or more ATS systems.
  • the attention feedback may come in the form of users explicitly responding upon request of the content provider or implicitly attending to the content.
  • Examples of requesting and responding to explicit feedback include:
  • the presenter says, “What do you viewers at home think?” where the viewers may respond with a sentiment such as booing or cheering.
  • the personalization decisions may be made by the content producer or by an automated system implemented by the content producer.
  • personalization may include personalizing the content for the average viewer.
  • personalization may involve producing multiple versions of the personalization, catered to multiple audience categories (such as users in a particular country or region, users known to have similar interests or preferences, etc.).
  • the user attention responses may be sent to the service (e.g., to the same server or data center) that is providing the content or to a different service.
  • Content producers provide content that can be watched by consumers at any time.
  • a content producer may be provided with information regarding how users in ATS-enabled environments have attended to their content. This information may, in some examples, be used by the content producer to provide a personalised spin to the next iteration of the content (e.g., episode, album, story).
  • An influencer receives attention metrics regarding their previous short video and finds that users were highly attentive and want to see more products like the one they had displayed. The influencer decides to present a similar product for their viewers in the next short.
  • a vlogger making a series of videos around a video game asks the viewers, “What character should I play as next time?” at the end of his video.
  • the viewers indicate (e.g., vocally or by pointing) what character they’d like to see him play next time.
  • the vlogger then makes the decision of what to play as next time with these attention results in mind.
  • a TV show ends on an open-ended cliff-hanger. Based on the ATS user responses, there were four primary types of response to the ending. The TV show producers use this information and decide to make four versions of the next episode. The users are shown the version of the episode that is intended for them based on how they reacted to the previous episode.
  • This section describes examples of what can be achieved when the users’ attention information is being sent back to the content producers in real-time.
  • the content producer can decide to adapt the content in real-time based on these analytics.
  • an ATS-enabled system may provide the option for other users to see how an audience is responding.
  • the content producer may provide attention information in the content.
  • real-time attention information may be sent in the metadata stream of the content back to the users. Seeing that other users are responding in a particular manner may increase the likelihood of a particular user responding in the same manner. Similarly, seeing that others are not responding might make a user realize that they have greater sway on the content presentation by responding in a certain way (e.g., cheering for a certain character).
  • Interactive live streamed media based on reactions may be where the content is, or includes, gaming content, whether the gaming content involves an individual user gaming, a gaming competition, etc.
  • Jane is live-streaming herself playing a video game. An option has presented itself in the game where she can go to either the left or right path. She asks her viewers to call out which way she should go, “Left” or “Right.” The attention responses of the viewers are sent back to Jane, saying 70% of people want her to go “Left.” Jane decides to go right to mess around with her viewers.
  • Bob is hosting a live stream playing a trivia video game. He asks his audience which of the multiple-choice options is the correct answer. The most selected answer by his audience turns out to be incorrect. Bob augments the stream for a random portion of his users with a camera shake, saying, “We got the answer wrong, we must suffer the camera shake together!”
  • One example of “Interactive live streamed media based on reactions” is where the media is music-related content such as a band, a disc jockey (DJ), a live music event, one or more musicians talking and playing around with different instruments, etc.
  • Some attention responses that may be used for a live music stream include clapping, singing along, dancing, moving in time with the music, etc. Attention responses such as having the users sing the most memorable line in the song might be explicitly requested of the users by the content producer.
  • a DJ is performing a show that includes participants joining remotely with their ATS- enabled devices.
  • the DJ slowly builds up to more energetic songs.
  • the attention metrics show the DJ that the audience is already dancing more than they would usually at this point in their show. Based on this information the DJ decides to progress to the more energetic songs earlier than the DJ had originally planned.
  • a musician is hosting a live stream of themselves in their studio. Whilst playing a guitar solo, the musician can see a reported low level of attention to the guitar solo.
  • Interactive live streamed user-generated content based on reactions
  • Another extension of “Interactive live streamed media based on reactions” is where the media is user-generated content.
  • a vlogger is hosting a live stream of themselves in a foreign country.
  • the vlogger enters a shop to buy snacks and shows the selection of snacks on the shelf.
  • the vlogger asks viewers what things they think he should buy.
  • the vlogger uses the attention information to decide what to buy to help personalize the experience for the viewers.
  • Another example of “Interactive live streamed media based on reactions” is where the media is a show from a broadcaster (e.g., a reality TV show, an episodic TV show, an awards show, a live movie).
  • a new interactive Jackass™ movie is being filmed and aired live. The producers allow users of an ATS-enabled device to vote for who should wear a cape and jump off a ramp on a quad bike.
  • the producers are interested to know how people are attending to the content in real time.
  • the producers discover that people found a joke made just now by a host was particularly funny for the viewers. From this, the producers encourage the studio audience to laugh louder (for example, via signage visible to the audience but not to TV viewers), and the host is told through his ear piece to continue riffing on the joke with the guest for a bit longer.
  • Interactive live streamed media viewing session based on reactions
  • Another extension of “Interactive live streamed media based on reactions” is where the content presentation is a viewing session for a show, which could be a replay. Multiple participants join through many ATS-enabled device(s) to watch the viewing session of the show, in some examples at the same time.
  • This viewing session may be hosted by a human producer or by an automated system.
  • suitable viewing session shows are “choose your own adventure” shows, betting game shows, trivia game shows, children’s TV shows (such as shows including calls and responses), etc.
  • Mr. Hug Man has a segment in a show where he will hug a celebrity. In a viewing session multiple participants vote for how long they think Mr. Hug Man will hug the celebrity before the celebrity taps to break the hug.
  • a game is hosted by an automated system where viewers must spot where the items on a list are hidden in a scene.
  • Some disclosed examples involve recommending user content based on user preferences and/or a current user state as determined by analytics using an ATS.
  • the recommendations may be computed in the cloud or on the user’s ATS-enabled device(s). Recommendations may be decided based on short-term or long-term information, or a mixture of the two.
  • the recommendation system may also be optimized in the cloud to better grasp what content a larger group of people (other than people in a particular environment, such as a particular home) enjoy in certain circumstances (e.g., user mood, viewing session length, news events).
  • Content may be recommended before, during or after content playback.
  • the user is not responding positively to the content currently being played, so a different piece of content aligned with their preferences is recommended to the user.
  • Users may be recommended content based on their mood as detected by their ATS- enabled device(s).
  • the ATS may detect the user laughing or sighing, the valence and/or arousal level in the user’s speech, the user not reacting to content they would usually perk up to, etc.
  • the ATS system may also leverage a closed loop that may be created as aforementioned. For instance, the ATS system may recommend content based on whether a particular type of content had a positive impact the last time it was consumed when the user was in the user’s current mood. Some examples include exploring whether uplifting or sad music is more helpful to get this user out of a negative mood, or whether watching a game of sport is more enjoyable than a comedy when the user is in a euphoric mood. Recommendations may also come in the form of short-term mood estimations that do not consider the user’s preferences, so as not to form a closed loop of recommendation improvement.
  • Mood-based content recommendations may also be determined using long-term analytics.
  • the ATS is not yet aware of the user’s mood (e.g., the user has only just woken up).
  • the ATS may refer to long-term mood information to infer a likely mood.
  • Such inferences may come in the form of finding patterns (e.g., the user is generally happy in the morning), detecting a trend (e.g., the user’s mood has been improving moving into summer) or by using filtered mood information (e.g., the user has been in a good mood this week, so may presently be in a good mood).
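  • A minimal sketch of such long-term mood inference follows, combining a time-of-day pattern with an exponentially smoothed recent mood. The valence scale, smoothing factor and weighting are illustrative assumptions.

```python
# Sketch of inferring a likely current mood from long-term mood history when no
# fresh ATS observation is available.
from collections import defaultdict

def infer_mood(history, hour_now, alpha=0.3, pattern_weight=0.5):
    """history: list of (hour_of_day, valence in [-1, 1]) observations, oldest first."""
    # exponentially smoothed recent mood (captures trends / filtered mood)
    smoothed = 0.0
    for _, valence in history:
        smoothed = alpha * valence + (1 - alpha) * smoothed
    # average mood previously observed around this hour of day (captures patterns)
    by_hour = defaultdict(list)
    for hour, valence in history:
        by_hour[hour].append(valence)
    same_hour = by_hour[hour_now]
    pattern = sum(same_hour) / len(same_hour) if same_hour else smoothed
    return pattern_weight * pattern + (1 - pattern_weight) * smoothed

history = [(8, 0.6), (20, -0.2), (8, 0.7), (21, -0.4)]  # hypothetical observations
print(infer_mood(history, hour_now=8))  # likely positive morning mood
```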
  • Bob is in a somber mood and this is detected by the ATS as he is not laughing to jokes in the light-hearted comedy that matches his usual preferences. A dark comedy is recommended to Bob as the next piece of content.
  • the ATS detects that Bob’s mood improves a little through light chuckles. This information is stored (possibly on his device, or in the cloud, etc.) for when Bob finds himself in a similar mood again to help inform the recommendation system.
  • a child is watching a TV show and starts to become scared. The child is then recommended to change to a calmer show.
  • views may mean any visual element representing a particular type of content.
  • Some examples of views include thumbnails whilst browsing content options, different screens playing content simultaneously (e.g., a TV playing a show with a tablet playing a game), multiple windows on a screen, etc.
  • the focus to particular views may be determined by the ATS-enabled device or devices, through gaze tracking, via audio direction of arrival information or otherwise.
  • the ATS may estimate which content the user is most likely to be interested in and may recommend more content of a similar type. For instance, knowing how long each thumbnail is looked at is a form of measurable attention.
  • the ATS-enabled device(s) may report what content thumbnails the user is most engaged with. As the user continues to scroll for more options, the recommendations are directed towards what the user has implicitly expressed an interest in and given attention to.
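  • The sketch below shows one way gaze samples over thumbnails could be turned into dwell times and a ranking for the recommendation system; the sample format and thumbnail identifiers are illustrative assumptions.

```python
# Sketch of converting gaze samples into per-thumbnail dwell times and a ranking.
from collections import Counter

def rank_thumbnails(gaze_samples, sample_period_s=0.1):
    """gaze_samples: sequence of thumbnail IDs (or None) the gaze landed on, one per sample."""
    dwell = Counter(s for s in gaze_samples if s is not None)
    return [(thumb, count * sample_period_s)
            for thumb, count in dwell.most_common()]

samples = ["doc_a", "doc_a", None, "comedy_b", "doc_a", "comedy_b"]
print(rank_thumbnails(samples))  # [('doc_a', 0.3...), ('comedy_b', 0.2...)]
```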
  • Music attention may be determined by an ATS-enabled device(s) through detecting singing along, drumming in the air or on an object (e.g., a steering wheel), or yelling the name of the song or artist when it starts playing.
  • the music attention indications may then be used to determine the user’s music preferences.
  • the content recommendation may come in the form of a recommended or automatically curated playlist based on the user’s music attention indications.
  • the recommended content may be similar in terms of genre, era, artist, song structure, etc.
  • Other forms of media may also be recommended based on music preferences. For example, a movie may be suggested based on having a soundtrack that aligns with the user’s musical tastes.
  • Exercise use cases may involve long-term personalization, short-term experience augmentation, personalization through feedback to content producers, or combinations thereof, in some instances at the same time.
  • Exercise content may include a yoga class, a workout class, music that a user likes listening to whilst doing a home workout, etc.
  • Specific sounds that may be listened for to detect exercise-related attention indications on an ATS-enabled device(s) might include:
  • an ATS may supply a human or virtual coach with attention- related information in real time.
  • the attention-related information may correspond with one or more users’ attention to the exercises.
  • the ATS may supply the coach with the users’ track records, performance trends, attention information from previous sessions, etc.
  • Jane uses her ATS-enabled device(s) when she works out every Monday. Her progress is tracked over time and the difficulty of classes is made to match her progress. Jane is also provided with a run-down of her progress on her ATS-enabled device(s), such as a dashboard provided on a phone display.
  • Bob wishes to do an exercise class that is at his level and meets his preferences.
  • Bob watches a pre-recorded exercise class that is personalized for him by having the ‘tracks’ of the video selected for him in order to suit his needs.
  • the ‘tracks’ may refer to different sections of pre-recorded video, so that a user’s preferred exercises may be selected, in the order most desirable for the user.
  • Multiple users decide to do an exercise class together. They have different preferences for exercise movements and different levels of competence in areas such as flexibility, strength and aerobics. They provide input to the ATS indicating that they wish to complete a core workout and the ATS automatically generates an exercise class for them catered to their abilities and preferences.
  • An exercise class is hosted by a virtual coach.
  • the user is detected by an ATS to be struggling to complete the final repetitions due to asthmatic issues.
  • the ATS causes a virtual coach not to encourage the user to complete the repetitions.
  • An exercise class being hosted by a virtual coach has a participant who is not pushing themselves to finish a set of exercises. This is detected by the ATS-enabled device(s) and the virtual coach encourages the participant to work harder and push through.
  • a live exercise class is hosted by a coach with many participants across multiple ATS-enabled devices.
  • the coach is informed that John is not keeping up with their current exercise. However, the coach is also supplied attention information on how John performed in previous sessions.
  • the coach supplies encouragement, e.g., by calling out “come on, John, you did it last week and I know you can do it!”
  • an ATS detects that one of the participants has particularly poor form with the current exercise. Based on feedback from the ATS, the coach decides to speak directly with that participant through the user’s content stream, to provide some tips on how to do that exercise better.
  • Playback optimization in this section refers to optimizing the playback across orchestrated devices, in some examples based on an objective function. This objective function may involve, for example, maximizing attention, maximizing intelligibility, maximizing spatial quality for a listening position, etc.
  • the updates required to achieve the playback optimization may be computed on the user’s ATS-enabled device or in the cloud and sent to the users’ device, possibly in a metadata stream provided with content.
  • Playback control refers to features such as play, pause, rewind, next episode, etc.
  • Alice likes watching TV in the living room adjoining the kitchen whilst Bob makes dinner.
  • Bob enjoys listening to music as he prepares dinner.
  • the playback is jointly optimized for Alice and Bob so that they both get to hear spatial audio centred around their area of the room.
  • the audio from Alice’s content is much quieter in Bob’s zone and vice versa.
  • the ATS detects that there is no one in the room (for example, according to camera data) and the attention level for the content drops significantly.
  • the ATS causes the movie to be paused.
  • An ATS-enabled device(s) detects that Bob is not attending to the content, because his pose shows Bob holding the phone to his head.
  • the ATS causes the movie to be paused for Bob while he takes the call.
  • Bob has displayed on previous occasions that he likes to play the content whilst he is on the phone by resuming the content after the automatic pause.
  • the system notices Bob started a phone call again during a movie and does not pause the content for him, based on his learnt preferences.
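  • A minimal sketch of this auto-pause behaviour with a learnt per-user override follows; the threshold on observed resumes and the class and method names are illustrative assumptions.

```python
# Sketch of auto-pause driven by attention, with a learnt per-user override:
# pause when the user stops attending, unless the user has repeatedly resumed
# playback after such automatic pauses.
class AutoPausePolicy:
    def __init__(self, override_after: int = 3):
        self.resume_counts = {}  # user -> times they resumed after an auto-pause
        self.override_after = override_after

    def should_pause(self, user: str, attending: bool) -> bool:
        if attending:
            return False
        return self.resume_counts.get(user, 0) < self.override_after

    def record_manual_resume(self, user: str):
        self.resume_counts[user] = self.resume_counts.get(user, 0) + 1

policy = AutoPausePolicy()
for _ in range(3):  # Bob keeps resuming during phone calls
    policy.record_manual_resume("bob")
print(policy.should_pause("bob", attending=False))    # False: learnt preference
print(policy.should_pause("alice", attending=False))  # True: default behaviour
```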
  • Previously-deployed advertising systems have limited ways to determine a user’s attention. Some current methods include having a user click an advertisement (“ad”), having a user choose whether or not to skip an ad, and having a trial audience fill out a survey about the ad. According to such methods, the advertiser does not necessarily know if a user is even present when an ad is presented.
  • ATS allows for better-informed advertising.
  • Better-informed advertising may result in improvements to current techniques, such as advertising performance, advertising optimization, audience sentiment analysis, tracking of user interests, informed advertising placement, etc.
  • better-informed advertising allows for new advertising opportunities such as: interactive advertising; personalized advertising; attention driven shopping; and incentivizing attention for enhanced advertising.
  • Rich advertising analytics
  • Some examples of analytics that may be provided by an ATS include: how engaged were users overall; what aspects of the ad were users engaged with; were the users engaged positively or negatively; who was engaged with the ad (e.g., based on demographic, user preferences and interests); when were users engaged with the ad in a specific way (e.g., morning/night, Monday/Tuesday, etc.); whether specific actors in the ad triggered more or less attention; and which section of the content triggers higher or lower attention (ads at the beginning, middle or end of content; or after a cliff-hanger, joke, or romantic scene, etc.).
  • the marketing team of a new 3D-printed basketball wants to know how their latest ad has been performing, to determine whether they are getting value for the money spent.
  • the marketing team launched the ad on ATS-enabled devices and now receive analytics of how the ad is performing.
  • a ranking of the best ads during the event is desired for purposes of public interest and novelty.
  • the Super Bowl was aired across ATS- enabled devices, so the analytics of each ad are collated and compared to determine the rankings.
  • Advertising can be optimized for the desired attention response using the “Advertising analytics based on user attention.” Such optimization may be achieved through iteratively improving the ad after releasing new versions based on the analytics. Alternatively, or additionally, such optimization may be achieved by releasing many versions of the ad at the same time to see how the user response differs.
  • the focus groups watch the content with ATS-enabled devices.
  • a company has many great ideas to market their new product. They decide to make many versions of their ad and release them at the same time. Using the analytics from ATS-enabled devices, they find that one version of the ad was performing significantly better than the rest. They decide to continue advertising using only the best performing ad.
  • Sentiment towards products can be determined using users’ ATS-enabled devices during content and advertising playback. This sentiment may also be evaluated on a user or population basis. For instance, a population may be users within a certain age group, users who share an interest (e.g., cars), users with certain attention characteristics (e.g., users who laugh in the typical frequency range of adult women), a learnt segment of the population (e.g., users who sing in a certain frequency range with a particular timbre), the entirety of all ATS users, etc.
  • Demographics and user interests are a subset of the factors that may be used to select a population with which to evaluate sentiment analysis.
  • Demographics and user interests may be specified by the user, estimated by the ATS, or both.
  • the user interests may, for example, be estimated in the ways specified in the ‘Determining a user’s preferences’ section.
  • the demographics may be estimated from attention indications such as the characteristics of their responses (e.g., frequency range of their laughter, the height of a jump, their visual appearance indicating age), the type of attention responses they use (e.g., the types of words and dance moves they use), etc.
  • Sentiment can be determined through a user implicitly or explicitly reacting to a product. Implicit reactions may include yawning, gazing at the product when it appears on a screen, etc. Explicit reactions may include positive reactions such as saying “I love <product name>” or negative reactions such as saying “<product name> again” then proceeding to leave the room, etc.
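  • The sketch below illustrates population-scoped sentiment analysis of this kind: filter observations with a segment predicate, then aggregate the sentiment of the selected users. The observation fields and the sentiment scale are illustrative assumptions.

```python
# Sketch of sentiment aggregation over a filtered population segment.
def segment_sentiment(observations, predicate):
    """observations: list of dicts like
    {"user": ..., "age": ..., "laugh_hz": ..., "sentiment": float in [-1, 1]}"""
    selected = [obs["sentiment"] for obs in observations if predicate(obs)]
    return sum(selected) / len(selected) if selected else None

observations = [
    {"user": "u1", "age": 45, "laugh_hz": 220, "sentiment": 0.6},
    {"user": "u2", "age": 22, "laugh_hz": 180, "sentiment": -0.2},
    {"user": "u3", "age": 48, "laugh_hz": 240, "sentiment": 0.4},
]
# e.g. sentiment among 40- to 50-year-olds
print(segment_sentiment(observations, lambda obs: 40 <= obs["age"] <= 50))  # 0.5
```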
  • the latest James Bond movie features James wearing an Acme watch. Acme wishes to know how the watch is perceived during this product placement and is supplied sentiment analysis as determined by an ATS of the watch on a population scale.
  • a company is trying to expand the age demographic of their product. They wish to know how 40- to 50-year-olds perceive the product. They determine sentiment on the product for 40- to 50-year-olds with a mixture of attention analytics from users who have listed themselves as being within that group and users who have not provided their age but are estimated by an ATS to be in that demographic. From this data, the company is given an extensive amount of sentiment analysis information from advertising they launched that was available on ATS-enabled devices.
  • a company is trying to target their product to introverted people.
  • a hypothesis is made by the company that introverts respond more quietly compared to other groups of people. They decide to filter the sentiment analysis of their product to people who do not engage loudly and do not react with large movements.
  • Interactive advertising using an ATS: Advertising may be interacted with using ATS-enabled devices.
  • Interactive components of an ad may be sent through the ad’s metadata stream, including information such as what attention types to respond to and the respective actions the ad should respond with.
  • Types of actions the advertising may respond with may include changing the content of the ad, actioning a control mechanism such as skipping the ad, storing a user response (e.g., in the user’s device or in the cloud) for the next time an ad for the same product is played, etc.
  • These actions can help to determine whether users are engaged and allow for gamification of ads. For instance, the user may be rewarded for skipping an ad, as the user has displayed a level of engagement with the device.
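  • As an illustration of the mechanism described above, the following is a minimal sketch (not taken from this disclosure) of how an ad’s metadata stream might declare attention types and the actions the ad responds with, and how a playback device could dispatch on a detected attention event. The field names, attention types and action identifiers are hypothetical.

```python
# Hypothetical interactive-ad metadata: attention types mapped to the actions
# the ad should respond with (names are illustrative, not from the disclosure).
AD_METADATA = {
    "ad_id": "example-ad-001",
    "interactions": [
        {"attention_type": "lean_left", "action": "steer_left"},
        {"attention_type": "lean_right", "action": "steer_right"},
        {"attention_type": "spoken_skip", "action": "skip_ad"},
        {"attention_type": "positive_vocal", "action": "store_response"},
    ],
}

def dispatch(attention_event, metadata):
    """Return the action the ad metadata declares for a detected attention event."""
    for rule in metadata["interactions"]:
        if rule["attention_type"] == attention_event:
            return rule["action"]
    return None  # Event not listed in the metadata: the ad does not respond.

print(dispatch("lean_left", AD_METADATA))  # steer_left
print(dispatch("yawn", AD_METADATA))       # None
```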
  • Company ABCD has launched a marketing campaign for their new car. They decide to place advertising on ATS-enabled devices to allow for an interactive experience. Their ad features a car racing game. The ad allows users to either lean left or right, or call out “left” or “right,” to control the vehicle. Users become more responsive to and aware of the new car than other cars advertised with traditional methods.
  • An insurance company wants to show how little accidents can happen everywhere in life, but little accidents are not so bad if one is insured. To achieve this, the insurance company launches a “choose your own adventure” ad on ATS-enabled device(s). Users can pick a new path for the story every time the ad appears, which always results in the outcome being “you should be insured.”
  • a company is releasing their new game, and with it an ad to get people involved and excited about the release.
  • This advertising is hosted on ATS-enabled device(s) and the campaign spans multiple ad placements to complete a series of ads showcasing the game’s features.
  • Each time an ad of this campaign is played the user is given mechanisms to personalise the ad such as customising the appearance of the main character.
  • a user is consuming content with ad breaks embedded into the content.
  • the user is provided with a reward for not skipping and actively attending to the ad (e.g., strong eye contact, or discussion about the ad).
  • An example reward may include not receiving another ad break for the rest of the thirty minutes of content playback.
  • Advertising may be personalised in a similar fashion to the personalised content adaptions detailed in the Long-Term Personalization section.
  • the adaptions may be shipped in the metadata stream, sent directly as a selected version or created generatively.
  • an ad that simply matches the user’s interests may be selected.
  • a user viewing basketball on their ATS-enabled device(s) is known to be a fan of Stephen Curry as they positively respond to him.
  • An ad for tickets to the next game is personalised by selecting the version where Stephen Curry stars.
  • a camera connected to the ATS allows the system to determine that the user is paying special attention to James Bond’s watch. For this reason, the user is served an advertisement for the watch during the next ad break.
  • Attention driven shopping: Many shopping-related opportunities may be enabled through using ATS-enabled device(s). This could range across many types of content. Some examples include shopping-related channels of content and virtual assistants. New shopping-related opportunities may include:
  • Shopping TV channel utilises attention analytics to determine which aspects (e.g., features, price) of an advertised product contribute to converting a user to buy a product or contribute to a user losing interest in buying. For example, if the user is engaged when the product is presented but loses attention when the price is presented, the price may be wrongly positioned.
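  • As a hedged illustration of the aspect-level analysis described in the shopping TV channel example above, the sketch below compares mean attention across labelled segments of a product pitch; the segment labels and per-second attention scores are hypothetical placeholders.

```python
# Hypothetical per-second attention scores (0.0-1.0) aligned with labelled
# segments of a shopping-channel pitch: (label, start_index, end_index).
attention = [0.8, 0.9, 0.85, 0.9, 0.4, 0.3, 0.35, 0.7, 0.75]
segments = [("features", 0, 4), ("price", 4, 7), ("call_to_action", 7, 9)]

def mean_attention_by_segment(scores, segs):
    """Average attention within each labelled segment of the pitch."""
    return {name: sum(scores[a:b]) / (b - a) for name, a, b in segs}

# A sharp drop relative to the preceding segment (here, during "price")
# may suggest that aspect is contributing to a loss of interest.
for name, value in mean_attention_by_segment(attention, segments).items():
    print(f"{name}: {value:.2f}")
```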
  • a virtual assistant may be interacted with using an ATS.
  • the virtual assistant may have access to update the user’s shopping interactions.
  • the user may say, for example, “Listen Dolby, add that item to my cart,” as they are referring to an item they are attending to on screen and the virtual assistant may automatically add the item to their cart.
  • the item they were attending to may have been in non-advertising related content such as a movie, podcast, etc.
  • Whilst a user is consuming shopping-related content, the model of each product on screen is a version selected to match their preferences. For example, John insists that all of his personal belongings are yellow. He is shown the yellow version of each product if a yellow option is available.
  • a user shows interest in a car that appears on a shopping channel.
  • the playback system responds with, “Would you like me to book a test drive for you?”
  • the user engages by saying “Yes” and the test drive is automatically arranged with an ATS-enabled device.
  • a user shows interest in a vacuum cleaner on a shopping TV channel.
  • the user says, “Wow that looks fantastic, I want it,” to which the ATS-enabled device(s) adds to the playback “Do you want to order it now?”
  • the user responds to the question with, “No, I’ll have a think about it,” and the order is declined with the user reaction.
  • the placement of advertising can be informed using attention information as detected by an ATS.
  • Choices of advertising placement may be decided using long-term trends and optimizations or in real time using information about how the user is engaging at that moment. Moreover, a combination of the two may be used, where real-time decisions of advertising placement may be optimized over the long term. Examples of decisions that may be made using this information include:
  • Learnt decision making of advertising placement could be done at different levels, such as: per user (e.g., preferring advertising at the start of content); per population type (e.g., Gen Z viewers, viewers in France, jazz fans); per scene (e.g., certain users are more likely to engage with ads when a specific person is on screen: a Tag Heuer ad with Ryan Gosling following a scene in which Ryan Gosling is on screen, for users that respond to the scene); per episode (e.g., the least annoying part to receive an ad in this episode is learnt); per series (e.g., maximal attention is generally five minutes before the episode ends); and so on.
  • a mobile phone video game produces revenue through advertising other games during playback.
  • the game studio that produces the video game determines, using ATS-enabled mobile phones, that users are the least engaged after finishing a battle in the game. The game studio wants to minimize the annoyance of ads, and so decides to place the advertising after battles are finished.
  • a TV show production company values the viewing experience of their shows. For this reason, they want to optimize advertising placement so as to maximise the content’s performance.
  • the production company uses the attention level after ad breaks to determine the effect of the ad placement on the show. Some attention types they may look out for include a user being excited that the show has returned, all users having left the room, a user now being more engaged with their phone, etc.
  • the placement is delayed until the user has at least a certain level of attention.
  • advertising might be reduced while the user is engaged with the content.
  • a model may be developed that predicts attention types, attention levels, the best times to place advertising in content, or combinations thereof.
  • the model may, for example, learn with reference to one or more content presentations (e.g., audio, video, text). If the model is given the task of predicting attention levels, the content provider or user may decide where to place advertising using this information.
  • Models trained to predict the best times to place ads may be trained to predict different advertising placements based on user attributes, such as age, user interests, user preferences, location, etc.
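  • One possible, simplified way to act on such predictions is sketched below: given a per-interval attention prediction (assumed to already exist), candidate ad breaks are chosen at predicted attention minima. The threshold and data are illustrative only, not part of this disclosure.

```python
# Hypothetical predicted attention levels, one value per content interval.
predicted_attention = [0.9, 0.8, 0.35, 0.4, 0.85, 0.9, 0.3, 0.6, 0.95]

def candidate_ad_breaks(predictions, threshold=0.5):
    """Return interval indices that are local minima below the threshold,
    i.e. moments where an ad break is predicted to be least disruptive."""
    breaks = []
    for i in range(1, len(predictions) - 1):
        local_min = predictions[i] <= predictions[i - 1] and predictions[i] <= predictions[i + 1]
        if local_min and predictions[i] < threshold:
            breaks.append(i)
    return breaks

print(candidate_ad_breaks(predicted_attention))  # [2, 6] for this example
```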
  • An authoring tool is created that attempts to predict the best time to place advertising.
  • a TV broadcaster uses this tool to decide where to place ads in their 24/7 program automatically.
  • Attention incentivisation: Users of ATS-enabled device(s) may be provided with incentives to engage with content or advertising.
  • the incentives may be provided by the content providers, content producers, the TV manufacturers, etc.
  • An example of a possible incentivised attention type is being expressive (e.g., vocally or with gestures) about how one feels about the content.
  • Example reward types could include something of monetary value, bonus content, etc.
  • a user watches every game of football played by Manchester and will cheer loudly for every goal. They receive a discount for tickets to go see Manchester play live.
  • a group of children watches an animated movie. Their clear enjoyment of the film, based on positive vocal emotion by the group, unlocks bonus content for the film. At the end of the film, they are provided with made-up bloopers and additional “behind the scenes” clips.
  • a user receives an advertisement for a speaker system and responds positively.
  • the user is presented an offer to get a free TV if they buy the speaker system.
  • Having an ATS allows one to determine exactly how a user responds to content as it is playing back.
  • the ATS may be used in end-user devices, making all content consumers a test audience, reducing content assessment costs and eliminating the issue of having a non-representative test audience.
  • analytics produced by an ATS do not require manual labor. Because the analytics are collected automatically in real time, content can be automatically improved by machines. However, optimizing content by hand remains an option.
  • using an ATS during a content improvement process may form a closed loop in which decisions made using the attention information can have their effectiveness tested by utilizing the ATS another time. Examples of how an ATS can be leveraged for content performance assessment and content improvement are detailed in this section.
  • a type of metadata that specifies where certain attention responses are expected from users. For example, laughter may be expected at a timestamp or during a time interval. In some examples, a mood may be expected for an entire scene.
  • a performance analysis system may take in the expected level of reactions to content, as specified by content creators, and/or statistics of reactions detected by an ATS, and then output scores which can act as a content performance metric.
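  • A minimal sketch of such a performance analysis step is given below, assuming expected reactions are specified with timestamps and that per-timestamp reaction rates have already been aggregated from ATS data; the field names, numbers and the 0.3 threshold are hypothetical.

```python
# Expected reactions specified by content creators (timestamps in seconds),
# and the hypothetical fraction of ATS users who produced that reaction there.
expected = [
    {"t": 12.0, "reaction": "laughter"},
    {"t": 45.5, "reaction": "laughter"},
    {"t": 90.0, "reaction": "applause"},
]
detected_rate = {12.0: 0.72, 45.5: 0.15, 90.0: 0.60}

def performance_scores(expected_events, rates, weak_below=0.3):
    """Score each expected event by the observed reaction rate and flag weak ones."""
    return [
        {**event, "score": rates.get(event["t"], 0.0),
         "needs_work": rates.get(event["t"], 0.0) < weak_below}
        for event in expected_events
    ]

for row in performance_scores(expected, detected_rate):
    print(row)
```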
  • An event analyser may take in attention information (such as events, signals, embeddings, etc.) to determine key events in the content that evoked a response from the user(s). For example, the event analyser may perform clustering on reaction embeddings to determine the regions or events in the content where users reacted with similar responses. In some examples, a probe embedding may be used to find times where similar attention indications occurred.
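  • The following sketch illustrates one simple way an event analyser might group reaction embeddings that are close in time and embedding space; it uses a greedy threshold-based grouping rather than any particular clustering algorithm named in this disclosure, and the embeddings shown are hypothetical.

```python
import math

# Hypothetical reaction embeddings: (timestamp_seconds, 2-D embedding vector).
reactions = [
    (10.1, (0.9, 0.1)), (10.4, (0.88, 0.12)), (10.9, (0.91, 0.09)),  # similar reactions
    (55.0, (0.1, 0.95)), (55.6, (0.12, 0.9)),                        # another cluster
    (80.3, (0.5, 0.5)),
]

def cluster_reactions(events, time_gap=2.0, max_dist=0.2):
    """Greedily group reactions that are close in time and embedding space.
    Each resulting group approximates a content event that evoked similar responses."""
    clusters = []
    for t, vec in sorted(events):
        placed = False
        for cluster in clusters:
            last_t, last_vec = cluster[-1]
            if t - last_t <= time_gap and math.dist(vec, last_vec) <= max_dist:
                cluster.append((t, vec))
                placed = True
                break
        if not placed:
            clusters.append([(t, vec)])
    return clusters

for c in cluster_reactions(reactions):
    print(f"event around t={c[0][0]:.1f}s with {len(c)} similar reactions")
```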
  • the ‘Content Performance Assessment Using Attention Feedback Use Cases’ section focuses on the values to the content creator added by the ATS.
  • By implementing an ATS, there are several aspects that may benefit content creators and content providers. These include:
  • the performance of content may be determined using attention metrics coming from users’ ATSs. For example, having users lean forwards whilst looking at a screen that is providing content would demonstrate user interest in the content. However, having a user talk on a topic unrelated to the content could mean they are uninterested.
  • Such information about user attention to the content may be aggregated to gain insights on how users overall are responding.
  • the aggregated insights may be compared to the results from other pieces or sections of content to compare performance. Some examples of pieces or sections of content include episodes, shows, games, levels, etc. Differences in levels of attention may reveal useful content performance insights.
  • the attention information may indicate what users are attending to (e.g., theme, object, effects, etc.). Note that any content performance assessments obtained using an ATS could be used in conjunction with traditional methods of assessment such as surveys.
  • ATS may be configured by metadata in a content stream to detect particular classes of response (e.g., laughing, yelling “Yes,” “oh, my god”).
  • content creators or editors may specify what are the expected responses from audiences. Content creators or editors may also specify when the reactions are expected, for example, a specific timestamp (e.g., at the end of a punchline, during a disappointing visual event such as a cat smoking), during a particular time interval, for a category of event type (e.g., a specific type of joke) or for the entire piece of content.
  • the expected reactions may be delivered in the metadata stream alongside the content.
  • the metadata specifying which attention indications are expected may define the only attention indications that are listened for, subject to the user’s permission, in order to give the user more privacy while providing the content producer and provider with the desired attention analytics.
  • the user reactions to the content — in some examples, aligned with the metadata — may then be collected. Statistics based on those reactions and the metadata may be used to assess the performance of the content.
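  • One possible, hypothetical structure for such expected-reaction metadata is sketched below; the field names and values are illustrative assumptions, not the metadata format of this disclosure.

```python
import json

# Hypothetical metadata delivered alongside the content stream, declaring
# which attention indications the ATS should listen for and when.
expected_reactions_metadata = {
    "content_id": "sitcom-s01e03",
    "expected_reactions": [
        {"type": "laughter", "subtype": "belly_laugh", "at": 312.5},
        {"type": "laughter", "window": [600.0, 615.0]},
        {"type": "verbal", "phrases": ["oh, my god"], "event_category": "plot_twist"},
        {"type": "mood", "value": "tense", "scope": "scene_7"},
    ],
    # Only reaction types listed here (and permitted by the user) are detected.
    "privacy": {"detect_only_listed": True},
}

print(json.dumps(expected_reactions_metadata, indent=2))
```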
  • Example assessments for particular types of content include:
  • Content creators may add metadata specifying the places where a ‘laughter’ response is expected from audiences. Laughter may even be broken down into different types, such as ‘belly laugh,’ ‘chuckle,’ ‘wheezer,’ ‘machine gun,’ etc. Additionally, content creators may choose to detect other verbal reactions, such as someone who repeats the joke or tries to predict the punchline.
  • the metadata may inform the ATS to detect whether a specific type of laughter reaction occurs.
  • Statistics of the responses may then be collected from different audiences.
  • a performance analysis system may then use those statistics to assess the performance of the content, which may serve as useful feedback to content creators. For example, if the statistics show that a particular segment of a joke or skit did not gain many laughter reactions from audiences, this indicates that the segment may need to be improved.
  • Some channels stream debates about different groups, events and policies that may receive a lot of comments and discussions.
  • the metadata informs the ATS to detect whether a supportive or debating reaction is presented. Statistics of user responses may then be collected from different audiences. Such ATS data may help the content creators to analyse the reception of the topics.
  • Content creators may add metadata specifying the places in a content presentation where they expect strong negative responses from audiences, such as “oh, that’s disgusting” or turning their head away. This may be of use in horror movies, user-generated “gross out” content, etc.
  • the metadata may inform the ATS to detect if a provocative reaction occurs.
  • the aggregated data may show that users were not responding with disgust during the scene.
  • the content creators may decide that extra work needs to be done to make the scene more provocative.
  • Content producers may add metadata specifying the places in a content presentation where they expect strong positive responses such as “wow,” “that is so beautiful,” etc., from audiences. This technique may be of use for movies, sport broadcasting, user generated content, etc. For example, in a snowboarding broadcast, slow motions of highlight moments are expected to receive strong positive reactions.
  • Receiving information of user reactions from the ATS and analysis from data aggregation may give insights to content creators. The content creators can determine whether the audience likes the content or not, and then adjust the content accordingly.
  • ATSs may collect responses from the audience. Statistics of the responses may then be fed into an event analyzer in order to help create meaningful metadata for the content.
  • the metadata may, for example, be produced according to one or more particular dimensions (e.g., hilarity, tension).
  • the event analyzer may, in some instances, decide what metadata to add using techniques such as peak detection to determine where an event might be occurring.
  • Authored metadata may already exist for the content but additional learnt metadata may still be generated using such disclosed methods.
  • a live stand-up comedy show has all the jokes marked up in the metadata through information coming from ATS-enabled devices.
  • a show already authored with metadata has additional metadata learnt using ATS- enabled devices.
  • the additional metadata reveals that audiences were laughing at an unintentionally funny moment.
  • the content producers decide to accentuate this moment.
  • a highlight reel is created for content based on the most highly engaged-with segments. For example, during a game broadcast, audiences may react excitedly, perhaps yelling ‘go for it, go for it’, or saying ‘oh no’ when the team they support is losing a battle. These reactions may be detected by an ATS and could be predefined reaction types. Collectively, statistics of ATS data may show the relatively most salient moments in the broadcast. These salient moments can be used to make a highlights reel, either automatically or with some effort from editors. Moreover, a highlights reel may be created for each individual user who watched the game with an ATS-enabled device, based on that user’s attention indications alone. In a similar manner, a highlights reel could be made for a subset of the ATS users, such as a group of friends, the ‘red’ team supporters, etc.
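  • A minimal sketch of the highlight-reel selection described above is given below, assuming per-segment engagement scores have already been aggregated from ATS data; the segment names and scores are hypothetical.

```python
# Hypothetical aggregated excitement scores per broadcast segment:
# (segment_id, start_seconds, end_seconds, aggregated_excitement).
segments = [
    ("kickoff", 0, 60, 0.35),
    ("goal_1", 1250, 1280, 0.95),
    ("near_miss", 2100, 2120, 0.80),
    ("halftime", 2700, 2760, 0.10),
    ("goal_2", 4100, 4130, 0.97),
]

def highlight_reel(scored_segments, top_k=3):
    """Pick the most-reacted-to segments and return them in broadcast order."""
    top = sorted(scored_segments, key=lambda s: s[3], reverse=True)[:top_k]
    return sorted(top, key=lambda s: s[1])

for seg in highlight_reel(segments):
    print(seg)
```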
  • a reel of the top ten best football goals is determined based on responses detected by ATS-enabled devices.
  • the reactions that the ATS looks for may include supportive responses such as applause, cheering, yelling ‘yes!’ etc.
  • a baseball game gets automatic highlights generated from the sequences that users overall got most excited about.
  • the top ten funniest jokes from a stand-up comedian are automatically selected for his promotional video based on user attention. Reactions the ATS may look for may include laughing, applause, cheering, etc.
  • podcasts may obtain real-time audience voting by collecting some predefined words, such as ‘yes’ or ‘no.’ This process can give the audience a greater sense of interactivity and of being part of the stream, rather than passively receiving the content.
  • a musician streamer plays music for the musician’s viewers in real time.
  • the host asks the audience to choose which song to play out of four options. By uttering ‘A’, ‘B’, ‘C’ or ‘D’, or pointing to the option, listeners can choose their options and have their attention detected by their ATS. The responses are then aggregated immediately and the host is informed of the results from the listeners. The results report that 67.5% of viewers prefer A, 20.1% would like to hear C and the other options are below 10%. The host decides to play song A for the audience.
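  • A simple sketch of the vote aggregation behind such a result is shown below, assuming each listener’s choice has already been resolved by their ATS to an option label; the detected votes are hypothetical.

```python
from collections import Counter

# Hypothetical per-listener choices detected by each listener's ATS
# (a spoken 'A'/'B'/'C'/'D' or a pointing gesture resolved to an option).
detected_votes = ["A", "A", "C", "A", "B", "A", "C", "D", "A", "A"]

def tally(votes):
    """Aggregate detected choices into percentages for the host."""
    counts = Counter(votes)
    total = len(votes)
    return {option: 100.0 * n / total for option, n in counts.most_common()}

for option, pct in tally(detected_votes).items():
    print(f"{option}: {pct:.1f}%")
```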
  • Obtained information: This could include sets of content, geographical and detected attention information.
  • a billboard is a form of content
  • the user’s location is a form of geographical information
  • a user’s attention is captured by ATS.
  • Device Platform: Example device platforms include car, TV, mobile, etc.
  • Recommendation: for example, as discussed in the ‘Recommendation’ section.
  • Highlighting: for example, as discussed in the ‘Recommendation’ section.
  • Purchasing, etc.
  • a user’s preferences may be determined on the user’s devices or in the cloud. Their preferences may, in some examples, be synchronized using the metadata stream of content or otherwise. Such preferences may also be stored on the user’s profile being used at the time the ATS is running (e.g., Google profile, FB profile, Dolby ID, etc.)
  • the user preferences, as determined by an ATS, may also be shipped to devices (for example, using a cloud-based profile) which are not ATS-enabled, to allow for personalised experiences where an ATS is not available.
  • a user has been detected to be a fan of an artist. The user is informed when they drive past a location at which the artist has previously performed or at which the artist will soon perform.
  • the user is shown the nearest Omega store when he looks for GPS directions on his phone.
  • New geo-linked advertising strategies may be created, and current ones improved, through the combination of location and user attention information.
  • Examples of current location-based advertising include billboards, advertising on public transport, etc. This strategy could help advertisers to determine if their real-world positioned advertising is being successful, using ATSs. Additionally, this strategy offers new advertising opportunities in which ads may be placed virtually on location.
  • Real-world positioned advertising can have attention tracked through the use of ATS-enabled devices. For example, many current advertising strategies such as billboards are tied with a location, such as the location at which they are positioned. The ATS can further refine which geo-based location receives more attention, as opposed to generalized data such as traffic data. A user could be detected gazing at said billboard and have other user responses associated with the billboard. The interaction with the specific physical ad may be determined based on GPS data, camera data, microphone data, etc. Furthermore, physical ads could encourage attention responses through interactive mechanisms such as trivia. After responding to the trivia, even with an incorrect answer, an action may be taken.
  • Virtual advertising may be achieved in a similar manner to the ideas detailed in “Highlight local interests based on user attention,” except that the local interests are motivated by advertising. New advertising strategies are possible, such as treating every vehicle on the road as an advertisement for its own model: if a user is detected attending to another vehicle, the user is known to potentially have an interest in that vehicle. Alternatively, markers could appear on a user’s map for locations that have geo-linked advertising that aligns with the user’s preferences.
  • the successful connection of clients from ATS-enabled advertising may result in car OEMs being provided with a finder’s reward by the advertising sponsor.
  • a trivia question is put on a geo-linked billboard ad with the question “What was the first movie with Dolby audio” to which the answer is “A Clockwork Orange.”
  • a free ticket to the cinema is offered for a correct answer or a ‘learn more’ link is presented to the user for an incorrect answer.
  • a user walks past a bus with humorous geo-linked advertising on the side.
  • the user is detected laughing using their ATS-enabled mobile device.
  • the next ad that appears on their phone is for the same product that was advertised on the bus.
  • Alice drives past advertising for a live music event. She positively responds to the ad and receives an offer to buy a ticket for the event.
  • a user is driving and sees an ad about a concert on a bus, and says, “I did not realise [name] is coming to [place] for a concert and I'm definitely going there.”
  • the ATS- enabled car captures the reaction to the ad using the car’s current location. This is used to offer ticket purchases to the user.
  • a musical show has paid for a catered virtual geo-linked ad for their upcoming show. If users have interests aligned with musicals or topics addressed by the soon to be played musical, they have the location of the venue pinned on their satnav.
  • Car brands may decide to have their manufactured cars act as geo-linked ads that can be engaged with. A user sees someone else driving a car they like and says, “those look neat, don’t they?” which is detected by their ATS-enabled device(s). The user has advertising personalised to their liking of that model of car.
  • a potential interest in the content of a billboard ad is determined by the user gazing at the billboard each time they drive past it.
  • Dynamic continuous attention tracking optimizations may be possible, such as:
  • Device A detects positive attention towards something of interest to the user and distributes this knowledge to other devices;
  • Device B creates a highlight reel for the thing of interest; and Device C advertises the thing of interest to the user.
  • a user engages with a geo-linked billboard ad for a watch in their car. Following this, they get served advertising on their smart TV for the same watch brand during a commercial break. Eventually the watch brand store gets highlighted on the user’s map when they start looking for a restaurant on their map app for lunch.
  • a user may be recommended to depart a little earlier today, because the traffic and weather are favourable for driving. Moreover, in this instance the user is in a good mood, so the user may be more open to departing sooner.
  • the ATS-informed navigation routing informs the user to take the earlier route with this reasoning supplied.
  • a user is looking to drive to a destination to which there are two different routes that take a similar amount of time.
  • the user is recommended the path that has stores and/or advertising that is aligned with their interests.
  • Jeong is going on a road trip. He wants to take the “scenic” route.
  • the ATS- informed navigation system suggests a route which maximizes driving along the coast because Jeong regularly watches ocean documentaries and surfing videos.
  • Yvonne is going on a similar road trip.
  • Her “scenic” route goes near the highest-rated restaurants and wineries along the way, because she often views cooking shows and often talks about wine in her car.
  • trivia is a form of content.
  • context-relevant trivia may optionally be derived.
  • Trivia questions may also be supplied from advertisers.
  • the answer may be a watch brand favoured by John, such as Acme.
  • a trivia game that is centred entirely around advertising may be supplied to the user. This game could act as entertainment for the user and/or may potentially provide the user with a reward.
  • An example reward after a session of trivia advertising might be that the user doesn’t receive any more advertising across their devices for the rest of the day.
  • mapping may prove to be insightful for geographical decisions.
  • computer programs may post-process such a mapping to help users digest the information. For example, the programs may create summaries, pin locations, etc.
  • Geographical sentiment mapping data may prove to be useful because it could provide insights into areas to which the user has previously positively responded (e.g., laughing, excited talking), as compared to areas to which they have had negative responses (e.g., swearing, grunts).
  • the user may be provided with a list of recommended areas where they might like to buy a house.
  • a couple is looking to buy a home. Each time they inspect a house, they go back to the car to debrief whilst they drive off. The sentiment of each location is tracked by their ATS-enabled vehicle. When the couple is looking to make a final decision, they have the ATS information displayed to them as a dashboard and heatmap, with location pins for the places they visited during their house hunting adventures. Using such an information summary, the couple may be able to make a more informed decision about their feelings regarding possible homes to purchase, without the need for detailed note-taking about their prior impressions.
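  • A minimal sketch of the kind of geographical sentiment aggregation described above is shown below, assuming sentiment scores have already been derived from attention indications and tagged with a location; the locations and scores are hypothetical.

```python
from collections import defaultdict

# Hypothetical attention indications tagged with the location at which they
# were detected: (location_name, sentiment score in [-1, +1]).
observations = [
    ("12 Oak St", +0.8), ("12 Oak St", +0.6),
    ("7 Pine Ave", -0.7), ("7 Pine Ave", -0.4), ("7 Pine Ave", +0.1),
    ("3 Lake Rd", +0.3),
]

def sentiment_by_location(obs):
    """Average detected sentiment per location, e.g. for map pins or a heatmap."""
    totals = defaultdict(list)
    for place, score in obs:
        totals[place].append(score)
    return {place: sum(scores) / len(scores) for place, scores in totals.items()}

for place, score in sorted(sentiment_by_location(observations).items(),
                           key=lambda kv: kv[1], reverse=True):
    print(f"{place}: {score:+.2f}")
```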
  • Attention analytics can be used to enhance the safety of users.
  • Safety systems could be informed from many types of attention information such as mood, attentiveness, positive/negative sentiment, etc.
  • ATS technology could improve the safety and wellbeing of users.
  • An ATS deployed in the car may be used to determine whether a driver is focussed on the task of driving and is in a level-headed frame of mind. This assessment may be done in real time and/or may be improved over time as an ATS-enabled system learns the user’s behaviours. Learning user behaviours may help to determine what is normal for a particular user and to detect changes from the normal. For example:
  • Swearing may indicate the driver is frustrated or suffering from “road rage” and may make impaired decisions.
  • Enhancing Driver Focus By Gamifying Attention: a driver may be asked to engage in a simulated trivia quiz or other game designed to reward attentive driving.
  • An insurance company may offer reduced car insurance premiums to drivers who consistently score well at such a game.
  • a driver may be periodically asked to respond to questions about the road conditions such as:
  • An ATS-enabled system may be configured to detect that the driver is getting more distracted over time, angrier as traffic gets worse, more tired and less focused on the road, etc.
  • such a system may be configured to detect that the driver is getting more tired over time, angry as traffic gets worse, less focused on the road, etc., using the methods specified in ‘User utterances detected by an ATS as proxies for attentiveness.’ Based on this, the system may recommend that the driver take a break for the safety of themselves and the passengers, if any.
  • Such a system may also personalise a recommendation, for example as described in ‘Personalization and Augmentation using Attention Feedback’ with the ‘Long-Term Personalization’ and ‘Recommendations’ sections, so users are shown a preferable rest option.
  • This might include things such as preferred foods, coffee, service stations, etc.
  • the preference may be determined using a user’s attention to driving after taking a rest break at a particular location.
  • An insurance company may offer reduced car insurance premiums to drivers who consistently follow such recommendations and do not drive whilst impaired.
  • an ATS may be configured to detect the sound of yelling, fighting or bored children and automatically switch the content being played from the in-cabin audio system to something more entertaining for children.
  • the ATS may be configured to ensure a certain target attention level (indicated, for example, by laughing at the jokes, singing along or responding during a call-and-response segment of a children’s podcast). If the attention level of the children falls below some level, in some examples the system may automatically try different content until the children are calm, aiding focus for the car’s driver.
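  • A highly simplified sketch of such a content-switching loop is shown below; the catalogue, target level and the placeholder attention estimate are hypothetical stand-ins for ATS-derived measurements.

```python
import random

# Hypothetical catalogue of child-friendly content and a stand-in for the
# ATS attention estimate (a real system would use detected laughter,
# singing along, call-and-response participation, etc.).
CATALOGUE = ["animal_podcast", "singalong_album", "comedy_audiobook"]
TARGET_ATTENTION = 0.6

def estimate_child_attention(content_id):
    """Placeholder for an ATS-derived attention level in [0, 1]."""
    return random.random()

def keep_children_engaged(catalogue, target):
    """Try different content until the estimated attention reaches the target."""
    for content in catalogue:
        level = estimate_child_attention(content)
        print(f"playing {content}: estimated attention {level:.2f}")
        if level >= target:
            return content
    return catalogue[0]  # fall back to the first option if nothing reaches the target

keep_children_engaged(CATALOGUE, TARGET_ATTENTION)
```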
  • a person’s utterances and other responses may indicate something about their mental health when analysed over several hours or days.
  • an ATS configured to listen for laughter may be able to calculate a mean laughs per week metric. It may be possible to convert this to a score that is indicative of a person’s general mental health or overall wellbeing. Deviations may be detected from a user’s previous attention patterns or averages, which could indicate that something has changed in the user’s wellbeing.
  • a monotone speaking style may indicate poor mental health, particularly when compared against a baseline of a person’s regular speaking style.
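  • As a hedged illustration, the sketch below computes a laughs-per-week baseline and flags a strong negative deviation from it; the counts and threshold are hypothetical and are not a clinical measure of any kind.

```python
from statistics import mean, pstdev

# Hypothetical laughs-per-week counts detected by an ATS over recent months.
weekly_laughs = [42, 39, 45, 41, 38, 44, 40, 12]  # the final week is unusually low

def wellbeing_flag(history, z_threshold=2.0):
    """Flag the most recent week if it deviates strongly from the user's baseline."""
    baseline, latest = history[:-1], history[-1]
    mu, sigma = mean(baseline), pstdev(baseline)
    if sigma == 0:
        return False
    z = (latest - mu) / sigma
    return z < -z_threshold  # markedly fewer laughs than usual

print(wellbeing_flag(weekly_laughs))  # True for this hypothetical history
```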
  • Figure 12 is a flow diagram that outlines one example of a disclosed method.
  • Method 1200 may, for example, be performed by the control system 160 of Figure 1 A, by the control system 160 of Figure 2, or by one of the other control system instances disclosed herein.
  • the blocks of method 1200 are not necessarily performed in the order indicated. According to some examples, one or more blocks may be performed in parallel. Moreover, some similar methods may include more or fewer blocks than shown and/or described.
  • the method 1200 may be performed by an apparatus or system that includes the control system 160 of Figure 1A, the control system 160 of Figure 2, or one of the other control system instances disclosed herein.
  • the blocks of method 1200 may be performed by one or more devices within an audio environment, e.g., by an audio system controller (such as what may be referred to herein as a smart home hub) or by another component of an audio system, such as a television, a television control module, a laptop computer, a game console or system, a mobile device (such as a cellular telephone), etc.
  • at least some blocks of the method 1200 may be performed by one or more devices that are configured to implement a cloud-based service, such as one or more servers.
  • block 1205 involves obtaining, by a control system, sensor data from a sensor system during a content presentation.
  • the content presentation may, for example, be a television program, a movie, an advertisement, music, a podcast, a gaming session, a video conferencing session, an online learning course, etc.
  • the control system may obtain sensor data from one or more sensors of the sensor system 180 disclosed herein in block 1205.
  • the sensor data may include sensor data from one or more microphones, one or more cameras, one or more eye trackers configured to collect gaze and pupil size information, one or more ambient light sensors, one or more heat sensors, one or more sensors configured to measure galvanic skin response, etc.
  • block 1210 involves estimating, by the control system, user response events based on the sensor data.
  • block 1210 may be performed by one or more Device Analytics Engines (DAEs).
  • the response events estimated in block 1210 may, for example, include detected phonemes, emotion type estimations, heart rate estimations, body pose estimations, one or more latent space representations of sensor signals, etc.
  • block 1215 involves producing, by the control system, user attention analytics based at least in part on estimated user response events corresponding with estimated user attention to content intervals of the content presentation.
  • block 1215 may be performed by an Attention Analytics Engine.
  • the “content intervals” may correspond to time intervals, which may or may not be of a uniform size. In some examples, the content intervals may correspond to uniform time intervals on the order of 1 second, 2 seconds, 3 seconds, etc. Alternatively, or additionally, the content intervals may correspond to events in the content presentation, such as intervals corresponding to the presence of a particular actor, the presence of a particular object, the presentation of a theme, such as a musical theme, jokes, dramatic events, romantic events, scary events, beautiful events, etc.
  • block 1220 involves causing, by the control system, the content presentation to be altered based, at least in part, on the user attention analytics.
  • block 1220 may be performed by a content presentation module according to input from an Attention Analytics Engine. Causing the content presentation to be altered may, for example, involve adding or altering a laugh track, adding or altering audio corresponding to responses of one or more other people to the content presentation (or to a similar content presentation), extending or contracting the time during which at least a portion of the content is presented, adding, removing or altering at least a portion of a visual content presentation and/or an audio content presentation, etc.
  • altering one or more aspects of audio content may involve adaptively controlling an audio enhancement process, such as a dialogue enhancement process.
  • altering the one or more aspects of audio content may involve altering one or more spatialization properties of the audio content.
  • altering one or more spatialization properties of the audio content may involve rendering at least one audio object at a different location than a location at which the at least one audio object would otherwise have been rendered.
  • the content presentation may be altered based, at least in part, on one or more user preferences or other user characteristics.
  • causing the content presentation to be altered may involve causing the content presentation to be personalized or augmented, for example based at least in part on one or more user preferences or other user characteristics.
  • causing the content presentation to be personalized or augmented may involve altering audio playback volume, one or more other audio characteristics, or combinations thereof.
  • causing the content presentation to be personalized or augmented may involve altering one or more display characteristics, such as brightness, contrast, etc.
  • causing the content presentation to be personalized or augmented may involve altering a storyline, adding a character or other story element, altering a time interval during which a character is involved, altering a time interval devoted to another aspect of the content presentation, or combinations thereof.
  • causing the content presentation to be personalized or augmented may involve providing personalized advertising content.
  • providing personalized advertising content may involve providing advertising content corresponding to estimated user attention to one or more content intervals involving one or more products or services.
  • block 1225 involves causing, by the control system, an altered content presentation to be provided.
  • block 1225 may involve providing the altered content presentation on a television screen, one or more loudspeakers, etc.
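  • The following is a hypothetical, highly simplified skeleton of the flow of blocks 1205 through 1225; the function names and stand-in logic are illustrative assumptions, not the disclosure’s actual interfaces.

```python
# A minimal, hypothetical skeleton of the flow of Figure 12.

def obtain_sensor_data(sensor_system):             # block 1205
    return sensor_system()                         # e.g. microphone/camera frames

def estimate_user_response_events(sensor_data):    # block 1210 (device analytics engines)
    return [{"t": 3.0, "event": "laughter"}] if "laugh" in sensor_data else []

def produce_attention_analytics(events):           # block 1215 (attention analytics engine)
    return {"interval_0_10s": {"attention": 0.9 if events else 0.2}}

def alter_content_presentation(content, analytics):  # block 1220
    if analytics["interval_0_10s"]["attention"] > 0.5:
        return content + " + extended scene"
    return content

def provide(content):                               # block 1225
    print("presenting:", content)

# Example run with stand-in sensor data.
sensor_data = obtain_sensor_data(lambda: "microphone frame containing laughter")
events = estimate_user_response_events(sensor_data)
analytics = produce_attention_analytics(events)
provide(alter_content_presentation("sitcom episode", analytics))
```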
  • Some aspects of the present disclosure include a system or device configured (e.g., programmed) to perform one or more examples of the disclosed methods, and a tangible computer readable medium (e.g., a disc) which stores code for implementing one or more examples of the disclosed methods or steps thereof.
  • some disclosed systems can be or include a programmable general purpose processor, digital signal processor, or microprocessor, programmed with software or firmware and/or otherwise configured to perform any of a variety of operations on data, including an embodiment of disclosed methods or steps thereof.
  • Such a general purpose processor may be or include a computer system including an input device, a memory, and a processing subsystem that is programmed (and/or otherwise configured) to perform one or more examples of the disclosed methods (or steps thereof) in response to data asserted thereto.
  • Some embodiments may be implemented as a configurable (e.g., programmable) digital signal processor (DSP) that is configured (e.g., programmed and otherwise configured) to perform required processing on audio signal(s), including performance of one or more examples of the disclosed methods.
  • embodiments of the disclosed systems may be implemented as a general purpose processor (e.g., a personal computer (PC) or other computer system or microprocessor, which may include an input device and a memory) which is programmed with software or firmware and/or otherwise configured to perform any of a variety of operations including one or more examples of the disclosed methods.
  • elements of some embodiments of the inventive system are implemented as a general purpose processor or DSP configured (e.g., programmed) to perform one or more examples of the disclosed methods, and the system also includes other elements (e.g., one or more loudspeakers and/or one or more microphones).
  • a general purpose processor configured to perform one or more examples of the disclosed methods may be coupled to an input device (e.g., a mouse and/or a keyboard), a memory, and a display device.
  • Another aspect of the present disclosure is a computer readable medium (for example, a disc or other tangible storage medium) which stores code for performing (e.g., code executable to perform) one or more examples of the disclosed methods or steps thereof.
  • Enumerated Example Embodiments (EEEs)
  • EEE 1. A system comprising: a head unit; a loudspeaker system; a sensor system; and a control system, comprising: one or more device analytics engines configured to estimate user response events based on sensor data received from the sensor system; and a user attention analytics engine configured to produce user attention analytics based at least in part on estimated user response events received from the one or more device analytics engines, the user attention analytics corresponding with estimated user attention to content intervals of a content presentation being provided via the head unit and the loudspeaker system, wherein the control system is configured to: cause the content presentation to be altered based, at least in part, on the user attention analytics; and cause an altered content presentation to be provided by the head unit, by the loudspeaker system, or by the head unit and the loudspeaker system.
  • EEE 2. The system of EEE 1, further comprising an interface system configured for providing communication between the control system and one or more other devices via a network, wherein the altered content presentation is, or includes, altered content received from the one or more other devices via the network.
  • EEE 3. The system of EEE 2, wherein causing the content presentation to be altered involves sending, via the interface system, user attention analytics from the user attention analytics engine and receiving the altered content responsive to the user attention analytics.
  • EEE 4 The system of any one of EEEs 1-3, wherein causing the content presentation to be altered involves causing the content presentation to be personalized or augmented.
  • EEE 5 The system of EEE 4, wherein causing the content presentation to be personalized or augmented involves altering one or more of audio playback volume, audio rendering location, one or more other audio characteristics, or combinations thereof.
  • EEE 6 The system of EEE 4 or EEE 5, wherein the head unit is, or includes, a television and wherein causing the content presentation to be personalized or augmented involves altering one or more television display characteristics.
  • EEE 8 The system of any one of EEEs 3-7, wherein causing the content presentation to be personalized or augmented involves providing personalized advertising content.
  • EEE 9 The system of EEE 8, wherein providing personalized advertising content involves providing advertising content corresponding to estimated user attention to one or more content intervals involving one or more products or services.
  • EEE 10 The system of any one of EEEs 3-9, wherein causing the content presentation to be personalized or augmented involves providing or altering a laugh track.
  • EEE 11 The system of any one of EEEs 1-10, wherein the head unit, the loudspeaker system and the sensor system are in a first environment; and the control system is further configured to cause the content presentation to be altered based, at least in part, on sensor data, estimated user response events, user attention analytics, or combinations thereof, corresponding to one or more other environments.
  • EEE 12 The system of any one of EEEs 1-11, wherein the control system is further configured to cause the content presentation to be paused or replayed.
  • EEE 13 The system of any one of EEEs 1-12, wherein the sensor system includes one or more cameras and wherein the sensor data includes camera data.
  • EEE 14 The system of any one of EEEs 1-13, wherein the sensor system includes one or more microphones and wherein the sensor data includes microphone data.
  • EEE 15 The system of EEE 14, wherein the control system is further configured to implement an echo management system to mitigate effects of audio played back by the loudspeaker system and detected by the one or more microphones.
  • EEE 17 The system of EEE 16, wherein the one or more first portions of the control system are configured to implement the one or more device analytics engines and wherein the second portion of the control system is configured to implement the user attention analytics engine.
  • EEE 18 The system of EEE 4 or EEE 5, wherein the head unit is, or includes, a digital media adapter.

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Social Psychology (AREA)
  • Databases & Information Systems (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Theoretical Computer Science (AREA)
  • Chemical & Material Sciences (AREA)
  • Neurosurgery (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Marketing (AREA)
  • Analytical Chemistry (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computer Security & Cryptography (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

Some disclosed methods involve obtaining sensor data from a sensor system during a content presentation and estimating user response events based on the sensor data. Some disclosed methods involve producing user attention analytics based at least in part on estimated user response events corresponding with estimated user attention to content intervals of the content presentation. Some disclosed methods involve causing the content presentation to be altered based, at least in part, on the user attention analytics and causing an altered content presentation to be provided.

Description

ATTENTION TRACKING WITH SENSORS
This application claims the benefit of priority from International Patent Application No. PCT/CN2023/105695 filed July 4, 2023, U.S. Provisional Application No. 63/513,318 filed July 12, 2023 and U.S. Provisional Application No. 63/568,378, filed March 21, 2024, each of which is incorporated by reference herein in its entirety.
TECHNICAL FIELD
This disclosure pertains to devices, systems and methods for estimating user attention levels and related factors based on signals from one or more sensors, as well as to responses to such estimated user attention levels.
BACKGROUND
Some methods, devices and systems for estimating user attention, such as user attention to advertising content, are known. Previously-implemented approaches to estimating user attention to media content involve assessing a person’s rating of the content after the person has consumed it, such as after the person has finished watching a movie or an episode of a television program, after the user has played an online game, etc. Although existing devices, systems and methods can provide benefits in some contexts, improved devices, systems and methods would be desirable.
SUMMARY
At least some aspects of the present disclosure may be implemented via one or more methods. In some instances, the method(s) may be implemented, at least in part, by a control system and/or via instructions (e.g., software) stored on one or more non-transitory media. Some disclosed methods involve obtaining sensor data from a sensor system during a content presentation and estimating user response events based on the sensor data. Some disclosed methods involve producing user attention analytics based at least in part on estimated user response events corresponding with estimated user attention to content intervals of the content presentation. Some disclosed methods involve causing the content presentation to be altered based, at least in part, on the user attention analytics and causing an altered content presentation to be provided.
Some or all of the operations, functions and/or methods described herein may be performed by one or more devices according to instructions (e.g., software) stored on one or more non-transitory media. Such non-transitory media may include memory devices such as those described herein, including but not limited to random access memory (RAM) devices, read-only memory (ROM) devices, etc. Accordingly, some innovative aspects of the subject matter described in this disclosure can be implemented via one or more non-transitory media having software stored thereon.
At least some aspects of the present disclosure may be implemented via one or more devices. For example, one or more devices (e.g., a system that includes one or more devices) may be capable of performing, at least in part, the methods disclosed herein. In some implementations, an apparatus is, or includes, an audio processing system having an interface system and a control system. The control system may include one or more general purpose single- or multi-chip processors, digital signal processors (DSPs), application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs) or other programmable logic devices, discrete gates or transistor logic, discrete hardware components, or combinations thereof. The control system may be configured for implementing some or all of the methods disclosed herein.
In some examples, a system may include a head unit, a loudspeaker system, a sensor system and a control system. The control system may include one or more device analytics engines and a user attention analytics engine. The one or more device analytics engines may be configured to estimate user response events based on sensor data received from the sensor system. The user attention analytics engine may be configured to produce user attention analytics based at least in part on estimated user response events received from the one or more device analytics engines. The user attention analytics may correspond with estimated user attention to content intervals of a content presentation being provided via the head unit and the loudspeaker system.
The control system may be configured to cause the content presentation to be altered based, at least in part, on the user attention analytics and to cause an altered content presentation to be provided by the head unit, by the loudspeaker system, or by the head unit and the loudspeaker system.
In some examples, the system also may include an interface system configured for providing communication between the control system and one or more other devices via a network. In some examples, the altered content presentation may be, or may include, altered content received from the one or more other devices via the network. According to some examples, causing the content presentation to be altered may involve sending, via the interface system, user attention analytics from the user attention analytics engine and receiving the altered content responsive to the user attention analytics.
According to some examples, causing the content presentation to be altered may involve causing the content presentation to be personalized or augmented. In some examples, causing the content presentation to be personalized or augmented may involve altering one or more of audio playback volume, audio rendering location, one or more other audio characteristics, or combinations thereof.
In some examples, the head unit may be, or may include, a television. According to some such examples, causing the content presentation to be personalized or augmented may involve altering one or more television display characteristics. In some examples, the head unit may be, or may include, a digital media adapter.
According to some examples, causing the content presentation to be personalized or augmented may involve altering a storyline, adding a character or other story element, altering a time interval during which a character is involved, altering a time interval devoted to another aspect of the content presentation, or combinations thereof.
In some examples, causing the content presentation to be personalized or augmented may involve providing personalized advertising content. In some such examples, providing personalized advertising content may involve providing advertising content corresponding to estimated user attention to one or more content intervals involving one or more products or services.
According to some examples, causing the content presentation to be personalized or augmented may involve providing or altering a laugh track.
In some examples, the head unit, the loudspeaker system and the sensor system may be in a first environment. According to some such examples, the control system may be further configured to cause the content presentation to be altered based, at least in part, on sensor data, estimated user response events, user attention analytics, or combinations thereof, corresponding to one or more other environments.
According to some examples, the control system may be further configured to cause the content presentation to be paused or replayed.
In some examples, the sensor system may include one or more cameras. According to some such examples, the sensor data may include camera data.
According to some examples, the sensor system may include one or more microphones. According to some such examples, the sensor data may include microphone data. In some such examples, the control system may be further configured to implement an echo management system to mitigate effects of audio played back by the loudspeaker system and detected by the one or more microphones.
In some examples, one or more first portions of the control system may be deployed in a first environment and a second portion of the control system may be deployed in a second environment. In some such examples, the one or more first portions of the control system may be configured to implement the one or more device analytics engines and wherein the second portion of the control system may be configured to implement the user attention analytics engine.
Details of one or more implementations of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages will become apparent from the description, the drawings, and the claims. Note that the relative dimensions of the following figures may not be drawn to scale.
BRIEF DESCRIPTION OF THE DRAWINGS
Like reference numbers and designations in the various drawings indicate like elements.
Figure 1 A is a block diagram that shows examples of components of an apparatus capable of implementing various aspects of this disclosure.
Figure 1B shows an environment that includes examples of components capable of implementing various aspects of this disclosure.
Figure 1C shows another environment that includes examples of components capable of implementing various aspects of this disclosure.
Figure 2 shows example components of an Attention Tracking System (ATS).
Figure 3A shows components of an ATS residing in a playback environment according to one example.
Figure 3B shows components of an ATS residing in the cloud according to one example.
Figure 4 shows components of a neural network capable of performing real-time acoustic event detection according to one example.
Figure 5 shows components of a device analytics engine configured for performing real-time pose estimation according to one example.
Figure 6 shows components of an ATS according to another example.
Figures 7, 8, 9 and 10 show examples of providing feedback from an ATS during video conferences.
Figure 11 shows an example of presenting feedback from an ATS after a video conference.
Figure 12 is a flow diagram that outlines one example of a disclosed method.
DETAILED DESCRIPTION OF EMBODIMENTS
We currently spend a lot of time consuming media content, including but not limited to audiovisual content, interacting with media content, or combinations thereof. (For the sake of brevity and convenience, both consuming and interacting with media content may be referred to herein as “consuming” media content.) Consuming media content may involve viewing a television program or a movie, watching or listening to an advertisement, listening to music or a podcast, gaming, video conferencing, participating in an online learning course, etc. Accordingly, movies, online games, video games, video conferences, advertisements, online learning courses, podcasts, streamed music, etc., may be referred to herein as types of media content.
Previously-implemented approaches to estimating user attention to media content such as movies, television programs, etc., do not take into account how a person reacts while the person is in the process of consuming the media content. Instead, a person’s impressions may be assessed according to the person’s rating of the content after the person has consumed it, such as after the person has finished watching a movie or an episode of a television program, after the person has played an online game, etc. Current systems often track what content was selected by a user, where a user chose to stop the content, which sections of the content were replayed, etc. These metrics lack granular information about how consumers are engaging with the content whilst it is being consumed. Moreover, current systems are generally not aware of whether the user is even present at any particular time. For these reasons, current user attention estimation methods do not generally provide any real-time information. Such real-time information would allow for much more detailed analytics, in addition to real-time feedback to the playback of the content, which could improve the experience for the user.
It would be beneficial to estimate one or more states of a person while the person is in the process of consuming media content. Such states may include, or may involve, user attention, cognitive load, interest, etc. Various disclosed examples overcome the limitations of previously-implemented approaches to estimating user attention. Some disclosed techniques and systems utilize available sensors to detect user reactions, or the lack thereof, in real time. Some such examples involve using one or more cameras, eye trackers, ambient light sensors, microphones, wearable sensors, or combinations thereof. Some such examples involve measuring a person’s level of engagement, heart rate, cognitive load, attention, interest, etc., while the person is consuming media content by watching a television, playing a game, participating in a telecommunication experience (such as a videoconference, a video seminar, etc.), listening to a podcast, etc. Recent advancements in AI, such as automatic speech recognition (ASR), emotion recognition and gaze tracking, have made an attention tracking system like this possible. Additionally, smart devices and a variety of human-oriented sensors have become commonplace in our lives. For example, we have microphones in our smart speakers, phones and televisions (TVs), cameras for gaming consoles, and galvanic skin response sensors in our smart watches. Using some or all of these technologies in combination allows an enhanced user attention tracking system to be achieved.
Moreover, the presence of multiple playback devices can make for a richer playback experience, which may involve audio playback out of satellite smart speakers, haptic feedback from mobile phones, etc. Orchestrating multiple devices to play back audio can make it difficult to detect audio events using microphones due to the echo. Therefore, some disclosed techniques and systems include descriptions of various approaches to perform echo management.
In addition to having a greater amount of data supplied by sensors, advancements with deep neural networks (DNN) have allowed for greater insight into the meaning behind data collected by these sensors. DNNs are now capable of tasks such as ASR, emotion recognition and gaze tracking. The outputs of such models can be indicators for real-time attention metrics.
Some disclosed examples involve using real-time sensor data to adapt a content presentation for the user in real time. Some such examples may involve altering one or more aspects of the media content in response to estimated attention, arousal, cognitive load, etc. Some disclosed examples involve aggregating real-time user attention information on both a user level and a content level, to provide insights about future content options for the user and about changes to the content for future consumers, respectively. According to some examples, an attention tracking system consistent with the proposed techniques may form the following feedback loop (a minimal code sketch follows the list):
A content presentation is played back out of one or more devices;
Users have some level of attention to the content being presented;
Sensors detect indications of user attention;
An attention analytics engine determines the attention level of users in real-time; and
The content presentation is adapted in real-time to improve the experience for the users.
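The following minimal Python sketch summarizes the feedback loop above. It is illustrative only: every function is a stub standing in for device playback, sensor capture, device analytics and attention analytics, and the function names are assumptions rather than interfaces defined in this disclosure.

```python
import random
import time

# Minimal, self-contained sketch of the feedback loop described above.
# All function bodies are illustrative stubs; a real ATS would replace them
# with device playback, sensor capture, DAE inference and AAE analytics.

def play_interval(content, index, adjustments):
    """Stub: render one content interval with any real-time adjustments."""
    print(f"playing interval {index} of {content} with {adjustments}")

def read_sensors():
    """Stub: return raw sensor observations (microphones, cameras, wearables)."""
    return {"audio_level": random.random(), "gaze_on_screen": random.random() > 0.3}

def estimate_response_events(sensor_data):
    """Stub DAE stage: convert raw sensor data into response-event estimates."""
    return {"laughter": sensor_data["audio_level"] > 0.8,
            "looking_at_screen": sensor_data["gaze_on_screen"]}

def estimate_attention(events):
    """Stub AAE stage: distill response events into a 0..1 attention level."""
    return 0.5 * events["looking_at_screen"] + 0.5 * events["laughter"]

def adapt_presentation(attention_level):
    """Stub: choose presentation adjustments from the current attention level."""
    return {"volume_boost": attention_level < 0.3, "add_laugh_track": attention_level > 0.8}

adjustments = {}
for interval in range(5):                        # 1. content is played back
    play_interval("episode_01", interval, adjustments)
    sensor_data = read_sensors()                 # 2-3. users react, sensors detect it
    events = estimate_response_events(sensor_data)
    attention = estimate_attention(events)       # 4. attention analytics in real time
    adjustments = adapt_presentation(attention)  # 5. adapt the next interval
    time.sleep(0.01)
```

In a deployed system, the adjustments computed at the end of one interval would be applied to the presentation of the following interval, which is what closes the loop.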
In some examples, user attention analytics may be aggregated over time to determine how people engage with the content presentation and users’ affinity to different aspects of the content presentation. These user attention analytics can still be used in conjunction with traditional attention detection methods.
An example application of immediate real-time attention detection is a virtual laugh track. A virtual laugh track could laugh with the user whenever the user finds the content presentation humorous. Short-term uses of the real-time user feedback could include catered content selection and targeted ads. In some examples, longer-term analytics may be used for the purpose of adjusting a content presentation so that it appeals to a larger audience, for improved targeted user advertising, etc.
What is meant by attention?
The terms “user attention,” “engagement,” “response” and “reaction” may be used interchangeably throughout this disclosure. In some embodiments of the proposed techniques and systems, user response may refer to any form of attention to content, such as an audible reaction, a body pose, a physical gesture, a heart rate, wearing a content-related article of clothing, etc. Attention may take many forms, such as binary (e.g., a user said “Yes”), on a spectrum (e.g., excitement, loudness, leaning forward), or open-ended (e.g., a topic of a discussion, a multidimensional embedding). Attention may indicate something in relation to a content presentation or to an object in the content presentation. On the other hand, attention to non-content-related information may correspond to a low level of engagement with a content presentation.
According to some examples, attention to be detected may be in a short list (e.g., “Wow,” “Ahh,” “Red,” “Blue,” slouching, leaning forward, left hand up, right hand up) as prescribed by any combination of the user, content, content provider, user device, etc. One will appreciate that such a short list is not required. A short list of possible reactions, if supplied by the content presentation, may arrive through a metadata stream of the content presentation.
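As one illustration of how a prescribed short list of detectable reactions might arrive with the content, the following sketch parses a hypothetical metadata payload; the field names (“detectable_reactions,” “modality,” and so on) are assumptions for illustration and are not a format defined in this disclosure.

```python
import json

# Hypothetical metadata payload accompanying a content presentation; the field
# names ("detectable_reactions", "interval_ms", etc.) are illustrative
# assumptions, not a format defined by this disclosure.
reaction_metadata = json.loads("""
{
  "content_id": "episode_01",
  "detectable_reactions": [
    {"label": "Wow",          "modality": "speech"},
    {"label": "Red",          "modality": "speech"},
    {"label": "Blue",         "modality": "speech"},
    {"label": "lean_forward", "modality": "pose"},
    {"label": "left_hand_up", "modality": "pose"}
  ],
  "interval_ms": 500
}
""")

# A DAE could restrict its classifier outputs to the prescribed labels.
allowed_labels = {r["label"] for r in reaction_metadata["detectable_reactions"]}
print(allowed_labels)
```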
Figure 1A is a block diagram that shows examples of components of an apparatus capable of implementing various aspects of this disclosure. As with other figures provided herein, the types, numbers and arrangements of elements shown in Figure 1A are merely provided by way of example. Other implementations may include more, fewer and/or different types, numbers and arrangements of elements. According to some examples, the apparatus 150 may be configured for performing at least some of the methods disclosed herein. In some implementations, the apparatus 150 may be, or may include, one or more components of a workstation, one or more components of a home entertainment system, etc. For example, the apparatus 150 may be a laptop computer, a tablet device, a mobile device (such as a cellular telephone), an augmented reality (AR) wearable, a virtual reality (VR) wearable, an automotive subsystem (e.g., infotainment system, driver assistance or safety system, etc.), a game system or console, a smart home hub, a television or another type of device.
According to some alternative implementations the apparatus 150 may be, or may include, a server. In some such examples, the apparatus 150 may be, or may include, an encoder. In some examples, the apparatus 150 may be, or may include, a decoder. Accordingly, in some instances the apparatus 150 may be a device that is configured for use within an environment, such as a home environment, whereas in other instances the apparatus 150 may be a device that is configured for use in “the cloud,” e.g., a server.
According to some examples, the apparatus 150 may be, or may include, an orchestrating device that is configured to provide control signals to one or more other devices. In some examples, the control signals may be provided by the orchestrating device in order to coordinate aspects of displayed video content, of audio playback, or combinations thereof. In some examples, the apparatus 150 may be configured to alter one or more aspects of media content that is currently being provided by one or more devices in an environment in response to estimated user engagement, estimated user arousal or estimated user cognitive load. Some examples are disclosed herein.
In this example, the apparatus 150 includes an interface system 155 and a control system 160. The interface system 155 may, in some implementations, be configured for communication with one or more other devices of an environment. The environment may, in some examples, be a home environment. In other examples, the environment may be another type of environment, such as an office environment, an automobile environment, a train environment, a street or sidewalk environment, a park environment, an entertainment environment (e.g., a theatre, a performance venue, a theme park, a VR experience room, an e-games arena), etc. The interface system 155 may, in some implementations, be configured for exchanging control information and associated data with other devices of the environment. The control information and associated data may, in some examples, pertain to one or more software applications that the apparatus 150 is executing.
The interface system 155 may, in some implementations, be configured for receiving, or for providing, a content stream. In some examples, the content stream may include video data and audio data corresponding to the video data. The audio data may include, but may not be limited to, audio signals. In some instances, the audio data may include spatial data, such as channel data and/or spatial metadata. Metadata may, for example, have been provided by what may be referred to herein as an “encoder.”
The interface system 155 may include one or more network interfaces and/or one or more external device interfaces (such as one or more universal serial bus (USB) interfaces). According to some implementations, the interface system 155 may include one or more wireless interfaces. The interface system 155 may include one or more devices for implementing a user interface, such as one or more microphones, one or more speakers, a display system, a touch sensor system, a gesture sensor system, or combinations thereof. Accordingly, while some such devices are represented separately in Figure 1A, such devices may, in some examples, correspond with aspects of the interface system 155.
In some examples, the interface system 155 may include one or more interfaces between the control system 160 and a memory system, such as the optional memory system 165 shown in Figure 1A. Alternatively, or additionally, the control system 160 may include a memory system in some instances. The interface system 155 may, in some implementations, be configured for receiving input from one or more microphones in an environment.
The control system 160 may, for example, include a general purpose single- or multichip processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or combinations thereof.
In some implementations, the control system 160 may reside in more than one device. For example, in some implementations a portion of the control system 160 may reside in a device within one of the environments referred to herein and another portion of the control system 160 may reside in a device that is outside the environment, such as a server, a game console, a mobile device (such as a smartphone or a tablet computer), etc. In other examples, a portion of the control system 160 may reside in a device within one of the environments depicted herein and another portion of the control system 160 may reside in one or more other devices of the environment. For example, control system functionality may be shared by an orchestrating device (such as what may be referred to herein as a smart home hub) and one or more other devices of the environment. In other examples, a portion of the control system 160 may reside in a device that is implementing a cloud-based service, such as a server, and another portion of the control system 160 may reside in another device that is implementing the cloud-based service, such as another server, a memory device, etc. The interface system 155 also may, in some examples, reside in more than one device.
In some implementations, the control system 160 may be configured to perform, at least in part, the methods disclosed herein. According to some examples, the control system 160 may implement one or more device analytics engines configured to estimate user response events based on sensor data received from the sensor system. In some examples, the control system 160 may implement a user attention analytics engine configured to produce user attention analytics based, at least in part, on estimated user response events received from the one or more device analytics engines. The user attention analytics may correspond with estimated user attention to content intervals of a content presentation. In some examples, the control system 160 may be configured to cause the content presentation to be altered based, at least in part, on the user attention analytics. According to some examples, the control system 160 may be configured to cause an altered content presentation to be provided by one or more displays, by a loudspeaker system, or by one or more displays and the loudspeaker system. Some examples of these components and processes are described below.
Some or all of the methods described herein may be performed by one or more devices according to instructions (e.g., software) stored on one or more non-transitory media. Such non-transitory media may include memory devices such as those described herein, including but not limited to random access memory (RAM) devices, read-only memory (ROM) devices, etc. The one or more non-transitory media may, for example, reside in the optional memory system 165 shown in Figure 1A and/or in the control system 160. Accordingly, various innovative aspects of the subject matter described in this disclosure can be implemented in one or more non-transitory media having software stored thereon. The software may, for example, include instructions for controlling at least one device to perform some or all of the methods disclosed herein. The software may, for example, be executable by one or more components of a control system such as the control system 160 of Figure 1A.
In some examples, the apparatus 150 may include the optional microphone system 170 shown in Figure 1A. The optional microphone system 170 may include one or more microphones. According to some examples, the optional microphone system 170 may include an array of microphones. In some examples, the array of microphones may be configured to determine direction of arrival (DOA) and/or time of arrival (TOA) information, e.g., according to instructions from the control system 160. The array of microphones may, in some instances, be configured for receive-side beamforming, e.g., according to instructions from the control system 160. In some implementations, one or more of the microphones may be part of, or associated with, another device, such as a speaker of the speaker system, a smart audio device, etc. In some examples, the apparatus 150 may not include a microphone system 170. However, in some such implementations the apparatus 150 may nonetheless be configured to receive microphone data for one or more microphones in an environment via the interface system 155. In some such implementations, a cloud-based implementation of the apparatus 150 may be configured to receive microphone data, or data corresponding to the microphone data, from one or more microphones in an environment via the interface system 155.
According to some implementations, the apparatus 150 may include the optional loudspeaker system 175 shown in Figure 1A. The optional loudspeaker system 175 may include one or more loudspeakers, which also may be referred to herein as “speakers” or, more generally, as “audio reproduction transducers.” In some examples (e.g., cloud-based implementations), the apparatus 150 may not include a loudspeaker system 175.
In some implementations, the apparatus 150 may include the optional sensor system 180 shown in Figure 1A. The optional sensor system 180 may include one or more touch sensors, gesture sensors, motion detectors, cameras, eye tracking devices, or combinations thereof. In some implementations, the one or more cameras may include one or more freestanding cameras. In some examples, one or more cameras, eye trackers, etc., of the optional sensor system 180 may reside in a television, a mobile phone, a smart speaker, a laptop, a game console or system, or combinations thereof. In some examples, the apparatus 150 may not include a sensor system 180. However, in some such implementations the apparatus 150 may nonetheless be configured to receive sensor data for one or more sensors (such as cameras, eye trackers, camera-equipped monitors, etc.) residing in or on other devices in an environment via the interface system 155. Although the microphone system 170 and the sensor system 180 are shown as separate components in Figure 1A, the microphone system 170 may be referred to, and may be considered as part of, the sensor system 180.
In some implementations, the apparatus 150 may include the optional display system 185 shown in Figure 1A. The optional display system 185 may include one or more displays, such as one or more light-emitting diode (LED) displays. In some instances, the optional display system 185 may include one or more organic light-emitting diode (OLED) displays. In some examples, the optional display system 185 may include one or more displays of a television, a laptop, a mobile device, a smart audio device, an automotive subsystem (e.g., infotainment system, driver assistance or safety system, etc.), or another type of device. In some examples wherein the apparatus 150 includes the display system 185, the sensor system 180 may include a touch sensor system and/or a gesture sensor system proximate one or more displays of the display system 185. According to some such implementations, the control system 160 may be configured for controlling the display system 185 to present one or more graphical user interfaces (GUIs).
According to some such examples the apparatus 150 may be, or may include, a smart audio device, such as a smart speaker. In some such implementations the apparatus 150 may be, or may include, a wakeword detector. For example, the apparatus 150 may be configured to implement (at least in part) a virtual assistant.
Figure 1B shows an environment that includes examples of components capable of implementing various aspects of this disclosure. As with other figures provided herein, the types, numbers and arrangements of elements shown in Figure 1B are merely provided by way of example. Other implementations may include more, fewer and/or different types, numbers and arrangements of elements.
In this example, the environment 100A includes a head unit 101, which is a television (TV) in this example. In some implementations, the head unit 101 may be, or may include, a digital media adapter (DMA) such as an Apple TV™ DMA, an Amazon Fire™ DMA or a Roku™ DMA. According to this example, a content presentation is being provided via the head unit 101 and a loudspeaker system that includes loudspeakers of the TV and the satellite speakers 102a and 102b. In this example, the attention levels of one or more of persons 105a, 105b, 105c, 105d and 105e are being detected by camera 106 on the TV, by microphones of the satellite speakers 102a and 102b, by microphones of the smart couch 104 and by microphones of the smart table 103.
In this example, the sensors of the environment 100A are primarily used for detecting auditory feedback and visual feedback that may be detected by the camera 106. However, in alternative implementations the sensors of the environment 100A may include additional types of sensors, such as one or more additional cameras, an eye tracker configured to collect gaze and pupil size information, one or more ambient light sensors, one or more heat sensors, one or more sensors configured to measure galvanic skin response, etc. According to some implementations, one or more cameras in the environment 100A — which may include the camera 106 — may be configured for eye tracker functionality.
The elements of Figure IB include:
101: A head unit 101, which is a TV in this example, providing an audiovisual content presentation and detecting user attention with microphones;
102a, 102b: Plurality of satellite speakers playing back content and detecting attention with microphone arrays;
103: A smart table detecting attention with a microphone array;
104: A smart couch detecting attention with a microphone array;
105a, 105b, 105c, 105d, 105e: Plurality of users attending to content in the environment 100A; and
106: A camera mounted on the head unit 101.
Figure 1C shows another environment that includes examples of components capable of implementing various aspects of this disclosure. As with other figures provided herein, the types, numbers and arrangements of elements shown in Figure 1C are merely provided by way of example. Other implementations may include more, fewer and/or different types, numbers and arrangements of elements.
In this example, Figure 1C illustrates a scenario in which an Attention Tracking System (ATS) is implemented in an automotive setting. Accordingly, the environment 100B is an automotive environment in this example. The back seat passengers 105h and 105i are attending to content on their respective displays 101c and 101d. The front seat users 105f and 105g are attending to the drive, attending to a display 101b and attending to audio content playing out of the loudspeakers. The audio content may include music, a podcast, navigational directions, etc. An ATS is leveraging all the sensors in the car to determine each user’s degree of attention to content. The elements of Figure 1C include:
101b: The main display screen in the car with satnav, reverse camera, music controls, etc.;
101c, 101d: Passenger screens designed to play entertainment content such as movies;
105f, 105g, 105h, 105i: Plurality of users attending to content in the vehicle;
106b: Camera facing the outside of the vehicle detecting content such as billboards;
106c: Camera facing the interior of the vehicle detecting user attention;
301d, 301e, 301f, 301g: Plurality of microphones picking up noises in the vehicle, including content playback, audio indications of user attention, etc.; and
304d, 304e, 304f, 304g: Plurality of speakers playing back content.
Figure 2 shows example components of an Attention Tracking System (ATS). As with other figures provided herein, the types, numbers and arrangements of elements shown in Figure 2 are merely provided by way of example. Other implementations may include more, fewer and/or different types, numbers and arrangements of elements.
According to this example, a content presentation 202 — also referred to as a content playback 202 — is being provided via devices in the environment 100C and users’ reactions are then detected by one or more sensors of the sensor system 180. In this example, the Attention Analytics Engine (AAE) 201 is configured to estimate user attention levels based on sensor data from the one or more sensors. In this example, analytics about how one or more people 105 attend to the content presentation can be aggregated over time: here, the content analytics and user affinity module 203 is configured to estimate individual user affinity and the affinity of groups to different aspects of the content presentation. The groups may correspond to particular demographics or populations. According to this example, the content presentation module 205 is configured to adjust the content presentation in real time according to attention analytics 206 received from the AAE 201. The adjustments made by the content presentation module 205 may include things such as content selection, virtual laughs and cheering, etc. Various additional examples of adjustments that may be made by the content presentation module 205 are disclosed herein. Real-time adjustments to the content made by the content presentation module 205 and corresponding user feedback detected by the sensor system 180 can create a feedback loop, honing in on a better experience for the one or more people 105.
The elements of Figure 2 include:
105: One or more people attending to the content playback 202 and having their reactions detected by sensors;
180: A sensor system including one or more sensors configured to detect real-time user responses to the content presentation 202;
201: The Attention Analytics Engine (AAE), which analyses user attention information in reference to the current content by taking data from the sensor system 180 (or, in some implementations, results from one or more Device Analytics Engines (DAEs), which are produced using measurements from sensors). Examples of DAEs are provided below;
202: The playback of the content supplied by the content presentation module 205, through any combination of loudspeakers, displays, lights, etc.;
203: A content analytics and user affinity module, which is configured for the aggregation of attention analytics to provide insight regarding user attention to the content playback 202; and
205: A content presentation module, which is configured to supply the content to be played and is configured to make real-time adjustments to the content presentation based on attention analytics 206 supplied by the AAE 201.
In the example shown in Figure 2, the Attention Tracking System (ATS) 200 includes the AAE 201, the content analytics and user affinity module 203, the sensor system 180 and the content presentation module 205. In this example, the AAE 201, the content analytics and user affinity module 203 and the content presentation module 205 are implemented by an instance of the control system 160 of Figure 1A. In some examples, the AAE 201, the content analytics and user affinity module 203 and the content presentation module 205 may be implemented as instructions, such as software, stored on one or more non-transitory and computer-readable media.
According to some examples, the ATS includes an Attention Analytics Engine 201, sensors, and one or more Device Analytics Engines (not shown in Figure 2), each corresponding to a sensor or group of sensors, all operating in real time. In some such examples, the ATS is configured for: collecting information from available sensors; passing the sensor information through one or more Device Analytics Engines; and determining user attention by passing Device Analytics Engine results through an Attention Analytics Engine (201). The output of the ATS may vary according to the particular implementation. In some examples, the output of the ATS may be any type of attention analytics-related information, a content presentation corresponding to the attention analytics-related information, etc. According to some examples, an ATS may be implemented via an Attention Analytics Engine 201 and a content presentation module 205 residing in a device within a playback environment (for example, implemented by one of the playback devices of the playback environment), via an Attention Analytics Engine 201 and a content presentation module 205 residing within “the cloud” (for example, implemented by one or more servers that are not within the playback environment), etc.
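The pipeline described above (sensors feeding one or more Device Analytics Engines, whose results are fused by an Attention Analytics Engine) can be sketched structurally as follows. This is a minimal Python sketch under assumed class and method names; the stub logic inside each engine stands in for the real models discussed elsewhere in this disclosure.

```python
from dataclasses import dataclass
from typing import Dict, List

# Minimal structural sketch of the ATS pipeline described above: one Device
# Analytics Engine (DAE) per sensor or sensor group, and a single Attention
# Analytics Engine (AAE) that fuses the DAE results.

@dataclass
class DAEResult:
    device_id: str
    event_probabilities: Dict[str, float]   # e.g. {"laughter": 0.7}

class DeviceAnalyticsEngine:
    def __init__(self, device_id: str):
        self.device_id = device_id

    def process(self, sensor_frame: Dict[str, float]) -> DAEResult:
        # Stub: a real DAE would run an acoustic-event or pose model here.
        score = min(1.0, sensor_frame.get("energy", 0.0))
        return DAEResult(self.device_id, {"laughter": score})

class AttentionAnalyticsEngine:
    def analyze(self, results: List[DAEResult]) -> float:
        # Stub fusion: average the laughter probability across devices.
        if not results:
            return 0.0
        return sum(r.event_probabilities["laughter"] for r in results) / len(results)

daes = [DeviceAnalyticsEngine("tv_mic"), DeviceAnalyticsEngine("satellite_a")]
aae = AttentionAnalyticsEngine()
frames = {"tv_mic": {"energy": 0.9}, "satellite_a": {"energy": 0.4}}
attention = aae.analyze([dae.process(frames[dae.device_id]) for dae in daes])
print(f"estimated attention: {attention:.2f}")
```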
Figure 3A shows components of an ATS residing in a playback environment according to one example. As with other figures provided herein, the types, numbers and arrangements of elements shown in Figure 3A are merely provided by way of example. Other implementations may include more, fewer and/or different types, numbers and arrangements of elements.
In the example of Figure 3A, the ATS 200 includes an Attention Analytics Engine (AAE) 201 and a content presentation module 205 residing in a head unit 101, which is a television (TV) in this example. A plurality of sensors, which include microphones 301a of the head unit 101, microphones 301b of the satellite speaker 102a and microphones 301c of the satellite speaker 102b, provide sensor data to the AAE 201 corresponding to what is occurring in the environment 100D. Other implementations may include additional types of sensors, such as one or more cameras, one or more eye trackers configured to collect gaze and pupil size information, one or more ambient light sensors, one or more heat sensors, one or more sensors configured to measure galvanic skin response, etc. As suggested by the dots and element number 102c, some implementations may include three or more satellite speakers.
The microphones 301a-301c will detect audio of a content presentation. In this example, echo management modules 302a, 302b and 302c are configured to suppress audio of the content presentation, allowing sounds corresponding to the users’ reactions to be detected more reliably over the content audio in the signals from the microphones 301a-301c. In this example, the content presentation module 205 is configured to send echo reference information 306 to the echo management modules 302a, 302b and 302c. The echo reference information 306 may, for example, contain information about the audio being played back by the loudspeakers 304a, 304b and 304c. As a simple example, local echo paths 307a, 307b and 307c may be cancelled using a local echo reference with a local echo canceller. However, any type of echo management system could be used here, such as a distributed acoustic echo canceller.
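As noted above, any type of echo management system could be used. For concreteness, the following sketch shows one common choice, a normalized least-mean-squares (NLMS) adaptive echo canceller that subtracts an estimate of the local echo path from a microphone signal using the echo reference; the signal names, tap count and step size are illustrative assumptions, not parameters specified in this disclosure.

```python
import numpy as np

# A minimal NLMS echo-canceller sketch: adaptively estimate the echo path from
# the playback reference and subtract the estimated echo from the mic signal.

def nlms_echo_cancel(mic, echo_ref, taps=128, mu=0.5, eps=1e-6):
    """Subtract an adaptive estimate of the echo path from the mic signal."""
    w = np.zeros(taps)                 # adaptive estimate of the echo path
    out = np.zeros_like(mic)
    for n in range(taps, len(mic)):
        x = echo_ref[n - taps:n][::-1] # most recent reference samples
        echo_est = w @ x
        e = mic[n] - echo_est          # residual: user sounds + modelling error
        w += mu * e * x / (x @ x + eps)
        out[n] = e
    return out

# Toy usage: the "echo" is a delayed, attenuated copy of the playback reference.
rng = np.random.default_rng(0)
ref = rng.standard_normal(4000)                       # content audio (echo reference)
echo = 0.6 * np.concatenate([np.zeros(10), ref[:-10]])
user_sound = 0.1 * rng.standard_normal(4000)          # e.g. the user's reaction
mic = echo + user_sound
residual = nlms_echo_cancel(mic, ref)
print("power before/after:",
      round(float(np.mean(mic**2)), 3), round(float(np.mean(residual**2)), 3))
```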
According to the examples shown in Figure 3A, the head unit 101 includes Device Analytics Engine (DAE) 303a, the satellite speaker 102a includes DAE 303b and the satellite speaker 102b includes DAE 303c. Here, the DAEs 303a, 303b and 303c are configured to detect user activity from sensor signals, which are microphone signals in these examples. There may be different implementations of the DAE 303, in some instances even within the same Attention Tracking System. The particular implementation of the DAE 303 may, for example, depend on the sensor type or mode. For example, some implementations of the DAE 303 may be configured to detect user activity from microphone signals, whereas other implementations of the DAE 303 may be configured to detect user activity, attention, etc., based on camera signals. Some DAEs may be multimodal, receiving and interpreting inputs from different sensor types. In some examples, DAEs may share sensor inputs with other DAEs. Outputs of DAEs 303 also may vary according to the particular implementation. DAE output may, for example, include detected phonemes, emotion type estimations, heart rate, body pose, a latent space representation of sensor signals, etc.
In the implementation shown in Figure 3A, the outputs 309a, 309b and 309c from the DAEs 303a, 303b and 303c, respectively, are fed into the AAE 201. Here, the AAE 201 is configured to combine the information of the DAEs 303a-303c to produce attention analytics. The AAE 201 may be configured to use various types of data distillation techniques, such as neural networks, algorithms, etc. For example, an AAE 201 may be configured to use natural language processing (NLP) using speech recognition output from one or more DAEs. In this example, the analytics produced by AAE 201 allow for real-time adjustments of content presentations by the content presentation module 205. The content presentation 310 is then provided to, and played out of, actuators that include the loudspeakers 304a, 304b and 304c, and the TV display screen 305 in this example. Other examples of actuators that could be used include lights and haptic feedback devices.
The elements of Figure 3A include:
301a, 301b, 301c: Microphones picking up sounds in the environment 100D, including content playback, audio corresponding to user responses, etc.;
302a, 302b, 302c: Echo management modules configured to reduce the level of the content playback that is picked up in the microphones;
303a, 303b, 303c: Device analytics engines, which may be configured to convert sensor readings into the probability of certain attention events, such as laughter, gasping or cheering;
304a, 304b, 304c: The loudspeakers in each device playing back the audio from the content presentation module 205;
305: A TV display configured for displaying content from the content presentation module 205;
306a, 306b, 306c: Echo reference information; and
307a, 307b, 307c: The echo paths between a duplex device’s plurality of speakers to its plurality of microphones.
In these examples, the AAE 201, the content presentation module 205, the echo management module 302a and the DAE 303a are implemented by an instance 160a of the control system 160 of Figure 1A, the echo management module 302b and the DAE 303b are implemented by another instance 160b of the control system 160 and the echo management module 302c and the DAE 303c are implemented by a third instance 160c of the control system 160. In some examples, the AAE 201, the content presentation module 205, the echo management modules 302a-302c and the DAEs 303a-303c may be implemented as instructions, such as software, stored on one or more non-transitory and computer-readable media.
Figure 3B shows components of an ATS residing in the cloud according to one example. As with other figures provided herein, the types, numbers and arrangements of elements shown in Figure 3B are merely provided by way of example. Other implementations may include more, fewer and/or different types, numbers and arrangements of elements.
In this example, the ATS 200 includes an AAE 201 and a content presentation module 205 that are implemented by a control system instance 160d of the control system 160 of Figure 1A. The control system instance 160d resides in the cloud 308. The cloud 308 may, for example, include one or more servers that reside outside of the environment 100E, but which are configured for communication with the head unit 101 and at least the satellite speakers 102a and 102b via a network. The echo reference information 306 and the local echo paths 307 may be as shown in Figure 3A, but have been removed from Figure 3B in order to lower the diagram’s visual complexity. One of ordinary skill in the art will appreciate that any type of echo management system may be used here. The two key differences between this implementation and the implementation shown in Figure 3A are:
The DAE results 309a, 309b and 309c are sent from the DAEs 303a, 303b and 303c, respectively, to the AAE 201 in the cloud 308 via a network; and
The content presentation module 205 is also implemented in the cloud 308 and is configured to provide the content 310 to the devices in the environment 100E via the network.
Acoustic Event Detection
Figure 4 shows components of a neural network capable of performing real-time acoustic event detection according to one example. As with other figures provided herein, the types, numbers and arrangements of elements shown in Figure 4 are merely provided by way of example. Other implementations may include more, fewer and/or different types, numbers and arrangements of elements.
Figure 4 shows example components of a neural network 400 that is capable of implementing a Device Analytics Engine 303 for a microphone 301. In this example, the neural network 400 is implemented by control system instance 160e. According to this example, microphone signals 410 from the microphone 301 are passed into the banding block 401 to create a time-frequency representation of the audio in the microphone signals 410. The resulting frequency bands 412 are passed into two-dimensional convolutional layers 402a and 402b, which are configured for feature extraction and down-sampling, respectively. In this example, a positional encoding 403 is stacked onto the features 411 output by the convolutional layers 402a and 402b, so the real-time streaming transformers 404 can consider temporal information. The embeddings 414 produced by the transformers are projected — using a fully connected layer 405 — into the number of desired unit scores 406. The unit scores may represent anything related to acoustic events, such as subword units, phonemes, laughter, cheering, gasping, etc. According to this example, a SoftMax module 407 is configured to normalize the unit scores 406 into unit probabilities 408 representing the posterior probabilities of acoustic events.
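A compact PyTorch sketch loosely following the structure of Figure 4 is shown below. The layer sizes, class count and learned positional encoding are assumptions, and a standard transformer encoder is used in place of the real-time streaming transformers 404, so the sketch illustrates the data flow rather than a deployable real-time detector.

```python
import torch
import torch.nn as nn

# Banded time-frequency input -> 2D convolutions -> positional encoding ->
# transformer layers -> linear projection -> softmax over event classes.
# Hyperparameters are illustrative assumptions.

class AcousticEventDetector(nn.Module):
    def __init__(self, n_bands=64, n_classes=4, d_model=128, n_layers=2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),      # feature extraction
            nn.Conv2d(16, 32, kernel_size=3, stride=(1, 2), padding=1), # down-sampling in frequency
            nn.ReLU(),
        )
        self.proj_in = nn.Linear(32 * (n_bands // 2), d_model)
        self.pos = nn.Parameter(torch.randn(1, 1024, d_model) * 0.01)   # learned positional encoding
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.out = nn.Linear(d_model, n_classes)                        # unit scores

    def forward(self, bands):                # bands: (batch, frames, n_bands)
        x = bands.unsqueeze(1)               # (batch, 1, frames, n_bands)
        x = self.conv(x)                     # (batch, 32, frames, n_bands // 2)
        x = x.permute(0, 2, 1, 3).flatten(2) # (batch, frames, 32 * n_bands // 2)
        x = self.proj_in(x) + self.pos[:, : x.shape[1]]
        x = self.transformer(x)
        return torch.softmax(self.out(x), dim=-1)  # per-frame event probabilities

# Toy usage: 1 clip, 100 frames, 64 frequency bands; the classes might be
# ["laughter", "cheering", "gasping", "other"] (an assumed label set).
probs = AcousticEventDetector()(torch.randn(1, 100, 64))
print(probs.shape)   # torch.Size([1, 100, 4])
```

In a streaming deployment, the transformer layers would typically use causal or block-wise attention so that per-frame probabilities can be emitted with low latency.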
Other examples of attention-related acoustic events that could be detected for use in the proposed techniques and systems include:
Sounds that could possibly indicate being engaged: laughing, screams, cheering, booing, crying, sniffling, groans, vocalizations of ooo, ahh and shh, talking about the content, cursing, etc.;
Sounds that could possibly indicate being unengaged: typing, door creaking, snoring, footsteps, vacuuming, washing dishes, chopping food, talking about something other than the content, different content playing on a device not connected to the attention system;
Sounds that could be used to indicate attention for specific content types, such as:
o Movies: saying an actor’s name;
o Sports: naming a player or team;
o When music is in the content: whistling, applause, foot tapping, finger snapping, singing along with the content, a person making a repetitive noise corresponding with a rhythm of the content;
o In children’s shows: children making emotive vocalizations, or responding to a “call and response” prompt;
o Workout-related content: grunting, heavy breathing, groaning, gasping;
Other noises that may help infer attention to the content based on the context of the content: such as silence during a dramatic time interval, naming an object, character or concept in a scene, etc.
The elements of Figure 4 include:
400: Example neural network architecture of a real-time audio event detector;
303d: A Device Analytics Engine configured to detect attention-related events in microphone input data;
401: A banding block configured to process time-domain input into banded timefrequency domain information;
402a, 402b: Two-dimensional convolution layers;
403: Positional encoding that stacks positional information onto the features 411 output by the convolutional layers 402a and 402b;
404: A plurality (six in this example) of real-time streaming transformer layers;
405: A fully connected linear layer that acts as a projection from the embeddings 414 output by the real-time streaming transformer layers 404 to the unit scores 406;
406: Unit scores representing different audio event classes. Unit scores may represent audio events such as laughter, gasping, cheering, etc.;
407: A SoftMax module 407 configured to normalize the unit scores 406 into unit probabilities 408 representing the likelihood of acoustic events; and
408: The resulting unit probabilities.
Visual Detections
Figure 5 shows components of a device analytics engine configured for performing real-time pose estimation according to one example. As with other figures provided herein, the types, numbers and arrangements of elements shown in Figure 5 are merely provided by way of example. Other implementations may include more, fewer and/or different types, numbers and arrangements of elements.
Figure 5 shows example components of a Device Analytics Engine (DAE) 303e configured to estimate user attention from visual information. In this example, the DAE 303e is implemented by control system instance 160f. The DAE 303e may be configured to estimate user attention from visual information via a range of techniques, depending on the particular implementation. Some examples involve applying machine learning methods, using algorithms implemented via software, etc.
In the example shown in Figure 5, the DAE 303e includes a skeleton estimation module 501 and a pose classifier 502. The skeleton estimation module 501 is configured to calculate the positions of a person’s primary bones from the camera data 510, which includes a video feed in this example, and to output skeletal information 512. The skeleton estimation module 501 may be implemented with publicly available toolkits such as YOLO-Pose. The pose classifier 502 may be configured to implement any suitable process for mapping skeletal information to pose probabilities, such as a Gaussian mixture model or a neural network. According to this example, the DAE 303e — in this example, the pose classifier 502 — is configured to output the pose probabilities 503. In some examples, the DAE 303e also may be configured to estimate the distances to one or more parts of the user’s body according to the camera data.
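The following sketch illustrates only the pose-classifier stage: it assumes that a skeleton estimator (such as the YOLO-Pose toolkit mentioned above) has already produced two-dimensional keypoints, and it substitutes a simple template-matching classifier for the Gaussian mixture model or neural network described above. The keypoint layout, pose labels and templates are illustrative assumptions.

```python
import numpy as np

# Sketch of a pose-classifier stage mapping skeletal keypoints to pose
# probabilities. Three keypoints are assumed: head, shoulder centre, hip centre.

POSES = ["leaning_forward", "lying_back", "looking_away"]

TEMPLATES = np.array([
    [[0.55, 0.30], [0.50, 0.50], [0.50, 0.80]],   # leaning_forward: head ahead of hips
    [[0.40, 0.35], [0.45, 0.55], [0.50, 0.80]],   # lying_back: head behind hips
    [[0.70, 0.30], [0.50, 0.50], [0.50, 0.80]],   # looking_away: head offset strongly
])

def pose_probabilities(keypoints: np.ndarray, temperature: float = 20.0) -> dict:
    """Map skeletal keypoints (keypoints x 2 array) to pose probabilities."""
    dists = np.linalg.norm(TEMPLATES - keypoints[None], axis=(1, 2))
    scores = -temperature * dists                  # closer template -> higher score
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return dict(zip(POSES, probs.round(3)))

observed = np.array([[0.56, 0.31], [0.50, 0.49], [0.50, 0.80]])  # hypothetical skeleton output
print(pose_probabilities(observed))   # highest probability for "leaning_forward"
```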
Visual detections can reveal a range of attention information. Some examples include:
Visuals that may indicate a person being positively engaged: leaning forward, lying back, moving in response to events in the content, wearing clothes signifying allegiance to something in the content, etc.;
Visuals that may indicate a person being negatively engaged: a facial expression indicating disgust, a person’s hand “flipping the bird,” etc.;
Visuals that may indicate a person being unengaged: a person is looking at a phone when the phone is not being used to provide the content presentation, the person is holding a phone to their head when the phone is not being used to provide the content presentation, the person is asleep, no person in the room is paying attention, no one is present, etc.
The elements of Figure 5 include:
303e: A Device Analytics Engine configured to perform real-time pose estimation based on the camera data 510;
501 : A skeleton estimation module configured to calculate the positions and rotations of a person’s primary bones from the camera data 510 and to output skeletal information 512;
502: A pose classifier configured to map the skeletal information 512 to pose probabilities 503; and
503: The resulting pose probabilities.
System and Methods for Measuring and Indicating User Attention During Video Interactions
This section describes numerous concepts directed at distilling sensor data captured during video consumption (e.g., viewing of videos such as movies, social media posts, etc., and participation in video conferencing) into attention metrics, which are then presented as feedback via a graphical interface. The section includes both techniques for providing live feedback during a conference and techniques for providing offline feedback via reports.
Attention Model
This section describes the collection of data via sensors typically included with display systems (accelerometers, IMUs, microphones, etc.) or devices on or around users (e.g., various sensors on smart devices such as a smart watch). All inputs from sensors, user interface interactions, and camera image processing are analyzed using an attention analysis model. Figure 6 shows components of an ATS according to another example. As with other figures provided herein, the types, numbers and arrangements of elements shown in Figure 6 are merely provided by way of example. Other implementations may include more, fewer and/or different types, numbers and arrangements of elements.
According to this example, the ATS 200 has as inputs the independent measurements of various types of sensors in the sensor system 180, and outputs one or more distilled attention metrics. In the example shown in Figure 6, the ATS 200 includes Device Analytics Engines (DAEs) 303f, 303g, 303h, 303i and 303j, as well as AAE 201, all of which are implemented by control system instance 160g. According to this example, each of the DAEs 303f-303j receives and processes sensor data from a different type of sensor of the sensor system 180: the DAE 303f receives and processes camera data from one or more cameras, the DAE 303g receives and processes motion sensor data 602 from one or more motion sensors, the DAE 303h receives and processes wearable sensor data 604 from one or more wearable sensors, the DAE 303i receives and processes microphone data 410 from one or more microphones and the DAE 303j receives and processes user input 606 from one or more user interfaces. In this example, the DAE 303f produces DAE output 309f, the DAE 303g produces DAE output 309g, the DAE 303h produces DAE output 309h, the DAE 303i produces DAE output 309i and the DAE 303j produces DAE output 309j.
In this example, the AAE 201 is configured to implement one or more types of data distillation methods, data combination methods, or both, on the DAE output 309f-309j. In some examples, data distillation and/or combination may be performed by implementing a fuzzy inference system. Other data distillation and/or combination methods may involve implementing a support vector machine or a neural network. According to some examples, the AAE 201 may be configured to implement an attention analysis model that is tuned or trained against subjectively measured ground truth. In some examples, the AAE 201 may be configured to implement an attention analysis model that is tuned or trained with research training data gathered in a research lab setting. Such data might not be available in a consumer setting. The research training data may, for example, include sensor data from a wider array of sensors than would be typical in a consumer setting, such as for example the full complement of the sensors disclosed herein. In some examples, the research training data may include measurements typically not available to consumers, such as for example EEG recordings. According to this example, the AAE 201 is configured to output an estimated attention score 610 and individual estimated attention metrics 612. In this example, the estimated attention score 610 is a combined attention score that includes responses of all people who are currently consuming a content presentation and the individual estimated attention metrics 612 are individual attention scores for each person who is currently consuming the content presentation.
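The fusion performed by the AAE 201 can take many forms, as noted above (fuzzy inference systems, support vector machines, neural networks). The sketch below uses a fixed weighted sum purely to keep the example short; the feature names and weights are assumptions, and in practice the weights, or a richer model, would be tuned or trained against measured ground truth.

```python
import numpy as np

# Sketch of an AAE fusion step producing per-person attention metrics and a
# combined score from heterogeneous DAE outputs, using a fixed weighted sum.

WEIGHTS = {"gaze_on_screen": 0.4, "facing_display": 0.2,
           "vocal_reaction": 0.25, "heart_rate_elevation": 0.15}

def person_attention(dae_features: dict) -> float:
    """Combine one person's DAE-derived features into a 0..1 attention metric."""
    score = sum(WEIGHTS[k] * float(np.clip(v, 0.0, 1.0))
                for k, v in dae_features.items() if k in WEIGHTS)
    return round(score, 3)

def combined_attention(per_person: dict) -> float:
    """Aggregate individual metrics into the room-level attention score."""
    return round(float(np.mean(list(per_person.values()))), 3) if per_person else 0.0

dae_outputs = {
    "person_a": {"gaze_on_screen": 0.9, "facing_display": 1.0,
                 "vocal_reaction": 0.2, "heart_rate_elevation": 0.1},
    "person_b": {"gaze_on_screen": 0.1, "facing_display": 0.0,
                 "vocal_reaction": 0.0, "heart_rate_elevation": 0.0},
}
individual = {p: person_attention(f) for p, f in dae_outputs.items()}
print(individual, combined_attention(individual))
```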
Figures 7, 8, 9 and 10 show examples of providing feedback from an ATS during video conferences. Figure 11 shows an example of presenting feedback from an ATS after a video conference. In some instances, the ATS may be as shown in Figure 6, or a similar ATS. According to some examples, the feedback shown in Figures 7-11 may be provided based on an ATS having fewer sensor types and fewer DAEs providing input to the AAE 201. As with other figures provided herein, the types, numbers and arrangements of elements shown in Figures 7-11 are merely provided by way of example. Other implementations may include more, fewer and/or different types, numbers and arrangements of elements.
Figure 7 shows an example of providing feedback from an ATS during a one-to-many video conference presentation. Figure 7 shows examples of various windows that may be presented on a presenter’s display during the video conference presentation. Window 705 shows a video of the presenter herself, window 710 shows content that is being presented and windows 715a, 715b and 715c show videos of the non-presenting participants. In this example, window 720 shows a graph indicating three types of ATS feedback: curve 725 indicates estimated excitement, curve 730 indicates estimated attention and curve 735 indicates estimated cognitive load. According to this example, the three types of ATS feedback are aggregated and are based on data from all of the non-presenting participants.
Figure 8 shows an example of providing feedback from an ATS during a many-to-many video conference discussion. Figure 8 shows examples of windows that may be presented on each participant’s display during the video conference discussion. Windows 805 show a video of each of the participants. In this example, each of the windows 805 includes an attention score 810, which indicates the estimated attention for that individual participant. In some examples, an ATS may estimate more than one type of feedback. In some such examples, the attention score 810 may indicate aggregated ATS feedback. In other examples, multiple attention scores 810 may be presented in each of the windows 805. In some alternative examples, attention scores 810 may not be presented on each participant’s display during the video conference discussion. In some such examples, attention scores 810 may be presented only on a subset of participants’ displays (for example, on one participant’s display) during the video conference discussion. The subset of participants may, for example, include a supervisor, a moderator, etc.
Figure 9 shows another example of providing feedback from an ATS during a many-to-many video conference discussion. Figure 9 shows examples of windows that may be presented on each participant’s display, or on a subset of participants’ displays, during the video conference discussion. Windows 905 show a video of each of the participants. In this example, window border 910a is shown with the thickest outline, indicating that the current speaker is being shown in the corresponding window 905a. According to this example, window borders 910b and 910c are shown with an outline that is less thick than that of the window border 910a, but thicker than that of the other windows 905, indicating that the participants shown in windows 905b and 905c are currently looking at the current speaker.
Figure 10 shows another example of providing feedback from an ATS during a many-to-many video conference discussion. Figure 10 shows examples of windows that may be presented on each participant’s display, or on a subset of participants’ displays, during the video conference discussion. Windows 1005 show a video of each of the participants. In this example, window border 1010a is shown with the thickest outline, indicating that the current speaker is being shown in the corresponding window 1005a. According to this example, window borders 1010b and 1010c are shown with an outline that is less thick than that of the window border 1010a, but thicker than that of the other windows 1005, indicating that the ATS is estimating a high level of attention for the participants shown in windows 1005b and 1005c.
Figure 11 shows an example of providing feedback from an ATS after a many-to-many video conference discussion. The table shown in Figure 11 shows ATS feedback for each of six video conference participants. In this example, row 1105 indicates speaking percentages for each participant and column 1110 indicates individual attention scores for each participant. The interior cells indicate the attention of individual participants to each other and to other stimuli. For example, cell 1115a indicates the attention of User 1 to User 2 and cell 1115b indicates the attention of User 6 to User 5. The cells in column 1120 indicate the attention of individual participants to stimuli or events other than the other video conference participants. For example, cell 1115c indicates the attention of User 6 to stimuli or events other than the other video conference participants.
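One way a table like that of Figure 11 could be produced (an assumption, not a method mandated by this disclosure) is by aggregating per-frame estimates of each participant’s gaze target over the duration of the call, as in the following sketch; the frame data shown are hypothetical.

```python
from collections import Counter

# Aggregate per-frame gaze-target estimates into a per-participant attention
# table. A real system would take these estimates from camera DAEs during the
# call; the "other" category collects non-participant stimuli.

frames = [
    # (participant, estimated gaze target), sampled once per frame
    ("user1", "user2"), ("user1", "user2"), ("user1", "other"),
    ("user2", "user1"), ("user2", "user1"), ("user2", "user1"),
]

def attention_table(frames):
    """Return {participant: {target: fraction of frames spent attending}}."""
    table = {}
    for person in {p for p, _ in frames}:
        targets = Counter(t for p, t in frames if p == person)
        total = sum(targets.values())
        table[person] = {t: round(c / total, 2) for t, c in targets.items()}
    return table

print(attention_table(frames))
# e.g. {'user1': {'user2': 0.67, 'other': 0.33}, 'user2': {'user1': 1.0}}
```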
Attention Feedback Loop
Any decision made based on the information provided by the ATS may have its effectiveness evaluated using the ATS, to detect the users’ responses. In some examples, this may form a closed loop in which the decisions made using ATS information can be improved over time. According to some examples, this closed loop of user attention and responses to user attention can go a step further: after a decision is known to have a particular effect on a user, it can be passed on to another user to experience.
Specific examples:
One or more users are detected laughing at a comedy. This laughter is detected with the users’ ATS-enabled device(s). Using this information, additional laughter is added to the comedy (for example, by a content presentation module 205 according to instructions from an AAE 201). The ATS detects that this increased the hilarity for the users, and so the system (for example, the content presentation module 205 according to instructions from the AAE 201) continues to add laughter when the ATS detects users laughing at comedy. Over time, the system starts to optimize which types of added laughs each of the users finds most amusing.
One or more users are detected laughing at a comedic ad after their TV show has finished. This laughter is detected with the users’ ATS-enabled device(s). Using this information, it is hypothesized (for example, by an AAE 201) that the users engage more with comedic advertising after their show has finished. It is detected that this continues to be true for the users, and so the system (for example, the content presentation module 205 according to instructions from the AAE 201) continues to show comedic ads after TV shows end. Over time, the system starts to optimize which types of comedic advertising each individual user finds most amusing. In some examples, an individual user’s affinity for — or the affinity of a group for — comedic advertisements after their TV show is finished may be determined over many viewing sessions. In some such examples, such affinities may be determined by the Content Analytics and User Affinity block 203, whereas in other examples such affinities may be determined by the AAE 201.
It is detected that most audiences find an unintentionally funny moment in the content more humorous than the rest of the comedy. This amusement is detected with the users’ ATS-enabled device(s). Using this information, additional humorous moments like this are added to the same comedy. It is detected that this increased the hilarity for users in general. Over time, the system starts to optimize how many of these types of gags should be placed in the piece of content.
One or more users are detected laughing at a comedy at home. This laughter is detected with the users’ ATS-enabled device(s). Using this information, additional laughter is added to the comedy. It is detected that this increased the hilarity for the users, and so the system also starts to add laughter when it hears users laugh at a comedic podcast in the car. Over time, the system starts to optimize, for each individual, in which types of locations and with which media types the users find the added laughs most amusing.
The option to evaluate the effectiveness of decisions made and optimize decisions based on attention responses applies to all use cases listed in this disclosure.
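One way to realize the optimization described in the examples above (an assumption rather than a requirement of this disclosure) is to treat the choice of added laugh variant as a simple epsilon-greedy bandit problem, with the ATS-detected laughter level serving as the reward. The variant names and the simulated user affinities below are illustrative.

```python
import random

# Epsilon-greedy selection of laugh-track variants, updated from detected
# laughter. A real system would read the reward from the AAE rather than
# from the simulated affinities used here.

VARIANTS = ["big_audience_laugh", "chuckle", "giggle_track"]

class LaughTrackOptimizer:
    def __init__(self, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = {v: 0 for v in VARIANTS}
        self.mean_reward = {v: 0.0 for v in VARIANTS}

    def choose(self):
        if random.random() < self.epsilon:            # explore occasionally
            return random.choice(VARIANTS)
        return max(VARIANTS, key=lambda v: self.mean_reward[v])

    def update(self, variant, detected_laughter):
        self.counts[variant] += 1
        n = self.counts[variant]
        self.mean_reward[variant] += (detected_laughter - self.mean_reward[variant]) / n

random.seed(0)
optimizer = LaughTrackOptimizer()
true_affinity = {"big_audience_laugh": 0.3, "chuckle": 0.7, "giggle_track": 0.5}  # hidden
for _ in range(200):                                  # each time the user laughs at the comedy
    v = optimizer.choose()
    reward = min(1.0, max(0.0, random.gauss(true_affinity[v], 0.1)))
    optimizer.update(v, reward)                       # ATS-detected response closes the loop
print(max(VARIANTS, key=lambda v: optimizer.mean_reward[v]))   # likely "chuckle"
```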
Personalization and Augmentation using Attention Feedback Use Cases
Modern-day personalization systems for content enjoyment rely on information the user explicitly provides by using control systems for the content, such as play and pause. For instance, current content recommendation systems base their decisions on what content the user selects to consume, which sections they replay and whether they leave a like on a piece of content. This provides little insight into how a user is attending to the content or whether the user is even present during playback. Additionally, slower personalization may come in the form of fan feedback, through explicitly leaving a comment or review about the content. This allows the content producer to cater the next piece of content for their audience, but the feedback loop is slow and only representative of the vocal users who leave comments.
The possibilities for personalization systems expand greatly with richer user attention metrics derived in real time from sensors such as microphones, cameras and accelerometers. Having such an attention tracking system could make it possible to determine what about the content the user is interested in. For example, whilst a user is watching a secret agent movie, the user might show particular interest in the suit, watch, car, location or action. Insights into details like these will allow for improved personalization systems.
Adaptions to the content enjoyment using richer attention information can be made in real time and/or over the long term by aggregating attention metrics. These detailed attention metrics could also lead to better-informed recommendations and could provide content producers with real-time feedback about how users are attending to the content. When any personalization decisions are made using the attention information, the attention tracking system may be utilized again to determine the effect on the user’s or users’ attention. The ongoing use of an attention tracking system can form a closed loop in which many types of personalization can be optimized, or at least improved. In this section, we detail some possibilities and use cases for personalization systems that use richer, real-time attention metrics derived from sensors.
The phrase “content enjoyment experience” as used herein may refer to anything that could affect a user’s experience whilst enjoying content, such as altering playback (e.g., volume, TV backlighting level), content (e.g., the selected content, changing the storyline, adding elements) or control systems (e.g., pausing, replaying).
The phrase “personalization” as used herein may refer to any content enjoyment experience that has been catered for a user or users. “Experience augmentation” refers to a change to the content enjoyment experience, typically but not necessarily made in real time. “Recommendations” refers to any content suggested for a user. Personalizations, experience augmentations and recommendations may all be made based on attention information. These terms are described in greater detail in their respective sections.
“Linear content” refers to any form of content that is intended to be played from start to finish without diverging paths or controls to the flow of the content (e.g., an episode of a Netflix™ series, an audio book, a song, podcast, TV show).
Detectable Attentions Lists and Metadata
The potential short list of detectable attention as described briefly in the introduction is detailed further in this section. The range of detectable attention may be specified in a list. Examples of the types of attention that may appear in this list could be in one of the following forms:
A specific response, where the user produces an exact, predefined reaction. For example, the user says “Yes,” “No” or something that matches neither, or the user is detected raising either the left hand, the right hand or neither.
A type of response, where the reaction has some level of a match to the type of response. For example, a user may be requested to “start moving” and the target attention type is movement from the user. In this case, detecting the user wiggling around would be a strong match. Another example could involve the content saying, “Are you ready?” to which the content is looking for an affirmative response. There may be many valid user responses that suggest affirmation, such as “Yes,” “Absolutely,” “Let’s do this” or a head nod.
An emotional response, where the user’s emotion, or a subset of their emotions, is detected. For example, a content provider wishes to know the sentiment consumers had towards their latest release. They decide to add emotion to the short list of detectable attention. Users consuming the content start to have a conversation about the content and only the sentiment of that conversation is derived, as an attribute of their emotional reaction. Another example involves a user who only wants to share emotion level on the dimension of elatedness. When the user is disgusted by the content they are watching, their disgust is not detected. However, a low level of elatedness is reflected in the attention detections.
A topic of discussion, where the ATS determines what topics arose in response to the content. For example, content producers want to know what questions their movie raises for audiences. After listing topic of discussion as a detectable attention option, they find that people generally talk about how funny the movie is or about global warming.
There may be more attention types too. The attention lists may be provided from a range of different providers, such as the device manufacturer, the user, the content producer, etc. If many detectable attention lists are available, any way of combining these lists may be used to determine a resulting list, such as using the user’s list only, a union of all lists, the intersection of the user’s list and the content provider’s list, etc.
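As a minimal sketch, assuming each party’s detectable attention list is represented as a set of attention-type labels, the combination rules mentioned above (user list only, union of all lists, intersection of the user’s and the content provider’s lists) could be expressed in Python as follows; the function name, rule names and labels are illustrative assumptions only.

    # Hypothetical combination of detectable attention lists from several providers.
    def combine_attention_lists(user_list: set[str],
                                provider_lists: dict[str, set[str]],
                                rule: str = "intersection") -> set[str]:
        """Return the resulting list of detectable attention types under a combination rule."""
        if rule == "user_only":
            return set(user_list)
        if rule == "union":
            combined = set(user_list)
            for entries in provider_lists.values():
                combined |= entries
            return combined
        if rule == "intersection":
            # Only attention types permitted by both the user and the content producer.
            return set(user_list) & provider_lists.get("content_producer", set())
        raise ValueError(f"Unknown combination rule: {rule}")

    # Example: the user permits only emotional and specific responses.
    user = {"specific_response", "emotional_response"}
    providers = {"content_producer": {"emotional_response", "topic_of_discussion"}}
    print(combine_attention_lists(user, providers))  # {'emotional_response'}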
The detectable attention lists may also provide the user with a level of privacy, where the user can provide their own list of what they would like to be detectable and provide their own rules for how their list is combined with external parties. For instance, a user may be provided with staged options (for example, via a graphical user interface (GUI) and/or one or more audio prompts) as to what is detected from them, and they select to have only emotional and specific responses detected. This makes the user feel comfortable about using their ATS-enabled device(s).
The list of detectable attention indications may arrive at the user’s device in several ways. Two examples include:
The list of detectable attention indications is supplied in a metadata stream of the content to the device.
There is a list of detectable attention indications pre-installed in the user’s ATS-enabled device(s) which may be applicable to a wide range of content and user attention indications. In some examples, the user may be able to select from these detectable attention indications, e.g., as described above.
The list of detectable attention indications associated with a segment of content may be learnt from users who have enabled their ATS-enabled device(s) to detect a larger set of attention indications from them. In this way, content providers can discover how users are attending to their content and then add these attention indication types to the list they wish to detect for users with a more restricted set of detectable attention indications. In some examples, there may be an upstream connection alongside the content stream that allows this learnt metadata to be sent to the cloud to be aggregated. This is discussed briefly with reference to the Content Analytics and User Affinity block 203 of Figure 2.
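As one hypothetical illustration of the first delivery option above, the detectable attention list could travel as a small metadata record alongside a content segment; the field names and values in this Python sketch are assumptions for illustration, not a defined schema.

    import json

    # Hypothetical metadata record accompanying a content segment; field names are illustrative.
    segment_metadata = {
        "content_id": "episode_042",
        "segment_start_s": 310.0,
        "segment_end_s": 325.5,
        "detectable_attention": [
            {"type": "specific_response", "targets": ["Yes", "No"]},
            {"type": "emotional_response", "dimensions": ["elatedness"]},
        ],
    }

    # Serialised for delivery in the content's metadata stream.
    print(json.dumps(segment_metadata))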
The option to have lists of detectable attention indications is applicable to all the use cases listed in the ‘List of Use Cases’ section.
Example Use Cases
Long-term Personalization
We use “personalization” to mean a way to cater an experience for a user based on analytics of the user over the long term. Personalization adaptions may have their effectiveness evaluated by measuring, using the ATS-enabled device(s), how the adaptions change the user’s experience. This forms a closed loop in which the personalised adaption can continue to improve and track a user’s preferences. In some alternative examples, a personalised adaption may be applied in an open-loop fashion, where the effect of the adaption is not measured. The changes applied to the experience may not always follow the user’s preferences, so as to provide natural variety and avoid creating an isolated bubble for the user, such as political isolation.
Determining a user’s preferences
Suppose that one or more users consume a range of content on a playback device with an ATS. In some examples, the content-related preferences of each user may be determined over time by aggregating results from the ATS. Preferences that can be tracked may vary widely and could include content types, actors, themes, effects, topics, locations, etc. The user preferences may be determined in the cloud or on the users’ ATS-enabled device(s). In some instances, the terms “user preferences,” “interests” and “affinity” may be used interchangeably.
Short-term estimations of what a user is interested in may be established before long-term aggregations of user preferences are available. Such short-term estimations may be made through recent attention information and hypothesis testing using the attention feedback loop.
Make personalised adaptions to content based on users’ preferences
Personalised adaptions to the content may be made based on the users’ preferences. In some examples, when the personalised adaptions are informed by preferences of multiple users, the adaption may be optimized in a way to account for all the users. Some example methods of making personalised adaptions based on the preferences of multiple users include calculating one or more mean attention-related values, determining a maximum attention-related value for the group, determining a learnt combination of the preferences (such as via a neural network), etc. The attention-related values may, for example, correspond to user preferences, including but not limited to predetermined/previously known user preferences.
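A minimal Python sketch of the first two combination methods mentioned above (the mean and the maximum of attention-related values across a group) follows; the learnt combination via a neural network is omitted, and the data layout and labels are assumptions for illustration.

    # Hypothetical combination of attention-related preference values for a group.
    # Each user maps preference labels (e.g., "comedy", "action") to a score in [0, 1].
    def combine_group_preferences(user_prefs: list[dict[str, float]],
                                  method: str = "mean") -> dict[str, float]:
        labels = {label for prefs in user_prefs for label in prefs}
        combined = {}
        for label in labels:
            values = [prefs.get(label, 0.0) for prefs in user_prefs]
            combined[label] = max(values) if method == "max" else sum(values) / len(values)
        return combined

    group = [{"comedy": 0.9, "action": 0.2}, {"comedy": 0.4, "action": 0.8}]
    print(combine_group_preferences(group, "mean"))  # comedy -> 0.65, action -> 0.5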
Personalised adaptions (e.g., an alternate scene) may be delivered in the content stream as different playback options — for example, as user-selectable playback options — in some examples. Alternatively, only the personalised version of the content may be streamed to the user. A different option is to generate the personalised adaptions, such as via a neural network. The generation of adaptions may take place on the user’s ATS-enabled device(s) or in the cloud, depending on the particular implementation.
Different forms of personalised adaption are detailed in the following sections.
Changing the storyline of linear content based on users’ preferences
One example of an extension of “Make personalised adaptions to linear content based on users’ preferences” is where the storyline of content is adapted to meet user preferences. The storyline may be adapted by replacement, addition or removal of sections in the content. The storyline of content may also be adapted to meet multiple users’ preferences. This might be useful when there are many people watching a content presentation together in the same viewing session, for example. In some such examples, the storyline may be optimized to improve the overall enjoyment for the group using a mixture of their preferences.
Specific examples:
Bob does not enjoy gory content and is prone to becoming squeamish. He watches a movie taking place on a battlefield, where there is a scene of amputating a soldier’s leg. The gory visuals of the scene are replaced for Bob with a shorter scene only showing the face of the person performing the amputation.
Alice is known to be a Stephen Curry fan as her gaze primarily followed him during a game, as detected by ATS-enabled device(s) with a camera. During the next film she watches, the content is personalised with Stephen Curry making a cameo appearance, using generative techniques.
John has a short attention span and so prefers shorter content. This is detected by his ATS-enabled device(s), and in the next film he watches, long drawn-out scenes are cut into short sequences.
Two parents and a child are watching a movie together. The child gets scared easily, so a jump scare is removed from the movie. The parents enjoy movies without happy endings, so the normally happy ending to the movie is left unresolved.
Generative augmentation to content based on users’ preferences
Another example of an extension of “Make personalised adaptions to content based on users’ preferences” is where the content is augmented with generative elements. Generative elements refer to media produced by a machine, which may include video, audio, lighting, or combinations thereof. We use “augmentation” to mean changing an aspect of the content (e.g., an object in a scene, an instrument playing a sound, the lighting in a scene, the narrator’s voice). Some examples include superimposing a generated image to replace an object in the scene (e.g., changing an apple into an orange) or changing a sound from one type to another (e.g., changing strings to horns).
Specific example:
John is watching a new film, Top Gun: Maverick, which stars Tom Cruise as Captain Pete Mitchell. John is an avid fan of Brad Pitt and would much prefer seeing Mitchell played by Pitt. Using generative machine learning, Brad Pitt’s face is superimposed over Tom Cruise’s face for the entire film. Tom Cruise’s voice is also replaced with generative techniques to make him sound like Brad Pitt. From John’s point of view, Mitchell is now played by Pitt instead of Cruise.
Selection of personalised options of linear content based on users’ preferences
Another extension of “Make personalised adaptions to linear content based on users’ preferences” is where the adaption is the selection of personalised options for the content stream. Personalised options could include things such as selecting a preferred commentator, or a stream catered specifically to one team in a two-team sport.
Specific examples:
John is watching the Super Bowl on his ATS-enabled device(s), which detect, using a camera, that he is wearing a red cap. The system (for example, the AAE 201) infers he must be rooting for the Chiefs (the red team), and selects the version of the stream that is catered for Chiefs fans. In some examples, John may be presented with additional replays of touchdowns scored by the Chiefs and fewer replays are shown for the opposing team.
In the same scenario as above, John is not wearing a red cap so the ATS does not initially know what team John is rooting for. As the game progresses, John starts engaging by cheering “Go Chiefs,” booing at the opposing team and cheering for scores made by the Chiefs. This information is detected by the ATS, and the appropriate stream is selected for John.
In the same scenario as above, John is not wearing a red cap or being vocal about the game. John’s user preferences from previous sessions are drawn upon where he sided with the Chiefs, so the Chiefs version of the stream is selected for John.
In the same scenarios as listed above, the version of the stream that is selected also may include the selection of John’s team’s commentators.
John and Alice are watching the Super Bowl together on ATS-enabled device(s) with lights and cameras available to the system. The system (for example, the AAE 201) infers that John is a Chiefs fan as he is wearing a red cap, but infers Alice is a Lions fan as she is wearing blue and cheering for the Lions’ scores. John and Alice are provided with a balanced version of the content presentation. In some examples, the lights may turn red on the side of the room where John is sitting and blue on the other side for Alice.
Personalization of accent appearing in content based on users’ preferences
Another extension of “Make personalised adaptions to content based on users’ preferences” is where the accent of voices in the content is changed to meet the preferences of the users.
Specific examples:
Bob is listening to an audiobook played on an ATS device that has detected Bob speaking with a British accent. The audiobook is played with an American accent by default; however, the UK version of the audiobook is selected for Bob, making the audiobook more familiar and natural for him to listen to.
Jane is watching a movie starring an Australian. She finds the accent hard to understand, and the ATS has detected that Jane has a French English accent. The Australian’s voice is generatively swapped out for a French English accent.
In the same scenario as above, Jane’s preference for French English accents is determined and so her ATS-enabled smart device’s voice assistant changes to a French English accent.
Add personalised experience augmentation based on users’ preferences
Another extension of “Make personalised adaptions to content based on users’ preferences” is where the personalised adaptions are pre-loaded experience augmentations. Examples of experience augmentation are described in detail under the next heading. In some examples, the pre-loaded experience augmentations may be decided based on previous sessions where users with similar preferences enjoyed the same experience augmentation. Such implementations may heighten the experience for the user and also may help to cover cases where the ATS fails to detect an attention response.
Specific example:
Alice is watching a comedy with her ATS-enabled device(s). A joke is made in the comedy that Alice did not laugh at. However, users with similar preferences to Alice’s usually laugh at this joke. Virtual laughs are played out of the system for this joke, as they received a positive response from users with similar preferences to Alice’s.
Playback and experience adaptions based on users’ habits and characteristics
Here, the phrase “user habits” refers to routine ways in which the user engages with content (positively or negatively) that may contain useful insights into how the playback and experience could be optimized for the user. User habits may be determined by detections such as which seat the user sits in when the user engages with content using an ATS, or what time of night the user starts making annoyed sighs when there is a loud scene in content.
We use the phrase “user characteristics” to mean attributes about the user that may affect how the user experiences content, such as the user’s hearing, eyesight, attention span, etc. User characteristics may be estimated by an ATS based on how the user interacts with content, such as: missing jokes specifically when they are delivered quietly, suggesting poor hearing; and/or having trouble following which object the actor is referring to when they only describe it by colour, suggesting possible colour blindness.
User habits and characteristics may also be seen as a form of user preferences.
The phrase “playback adaptions” refers to changing the content presentation. Playback adaption may include changing the spatial rendering of content, volume adjustments, changing the colour balance on a screen, etc.
According to some examples, adaptions to the user’s experience may be intended to ensure the user’s environment is in a desirable state to enjoy content. Some such examples may include features such as automatically pausing content when there are interruptions to the playback of the content, warning the user of something they should evaluate before starting the content, etc. These adaptions to playback and experience may be managed on the user’s playback device(s).
In some examples, the adaptions may be jointly selected to improve the experience for multiple users of the ATS-enabled system simultaneously. The following sections detail specific examples of user habits, user characteristics and resulting playback and experience adaptions.
Personalised accessibility changes to content playback based on users’ characteristics
Another extension of “Playback and experience adaptions based on users’ habits and characteristics” is where the playback adaptions are made to improve the accessibility of the content based on the user’s characteristics. User characteristics that might be optimized for may include short-sightedness, colour blindness, hearing loss, vertigo, epilepsy, etc. The playback adaptions that could be applied to the mentioned characteristics to improve the content’s accessibility may include increasing font size of subtitles, augmenting the colour or appearance of objects in a scene, applying compression to the stream’s audio, reducing camera shake, skipping scenes with flashing lights, etc.
Specific example:
Jake watches his favourite comedian on his ATS-enabled device(s) and laughs at almost any joke delivered by the comedian. The jokes Jake does not laugh at are most frequently the ones that are delivered quietly. In some examples, dialogue enhancement may be enabled to improve the audibility of these jokes.
Personalised automotive notifications based on users’ habits
Another extension of “Playback and experience adaptions based on users’ habits and characteristics” is where an automotive experience is improved based on seat occupancy. The seat occupancy can be determined by sensors indicating where people are located in the vehicle when they engage with content. In some examples, the ATS system may be configured to detect when there is a different seat occupancy to usual. The improved automotive experience may come in the form of user warnings or notifications. These warnings or notifications could inform the user that they have possibly forgotten a person who usually goes on this drive. This improves the experience of the drive, as the user has the comfort of knowing that the usual participants in the ride have not been forgotten.
Specific example:
Every weekend a user drives his parents-in-law to dinner. The user starts reversing out of the parents-in-law’s driveway without his mother-in-law in the car. The car notifies the user of her absence and a crisis is averted.
Short-term Experience Augmentation
Add elements dynamically to content based on reactions
Suppose that one or more users are consuming content on a playback device with an ATS. The ATS may be configured — for example, according to metadata in a content stream — to detect classes of responses (e.g., laughing, cheering, yelling “Yes”). The added elements may augment the experience for the user or users in real time.
According to some examples, there may be a fixed set of user sound events (e.g., cough, laugh, applause) or keywords (e.g., “Slam,” “Ooof,” “Higher,” “Lower”) specified in metadata along with details of the appropriate element with which to respond. In some such examples, there may also be a set of emotions associated with the appropriate actions (e.g., excitement, fear). According to some examples, the elements also may be delivered in the metadata stream alongside the content to enable such functionality.
In some examples, there may also be a library of elements stored locally, such as within a TV or another playback device, that may be applicable to many content streams. In such examples, the library of elements may be applied more broadly than elements that are specific to a particular content presentation. Alternatively, or additionally, generative AI technology may be applied to automatically generate elements, for example based on a text description of a content presentation.
According to some examples, the intensity (e.g., volume of a played sound, strength of audio effect, size of icon, colour saturation of icon, opacity of icon, strength of visual effect) of the response may be proportional to the strength of the user reaction (e.g., loudness of yell, length of laughter, pitch of singing).
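The proportionality described above could be realised, for example, by linearly mapping a measured reaction strength onto an element intensity within fixed bounds, as in the following Python sketch; the units, ranges and default values are assumptions for illustration.

    # Hypothetical mapping from reaction strength to the intensity of an added element.
    def element_intensity(reaction_strength: float,
                          min_strength: float = 0.0,
                          max_strength: float = 1.0,
                          min_intensity: float = 0.2,
                          max_intensity: float = 1.0) -> float:
        """Linearly map a normalised reaction strength onto an element intensity."""
        clamped = max(min_strength, min(reaction_strength, max_strength))
        span = (clamped - min_strength) / (max_strength - min_strength)
        return min_intensity + span * (max_intensity - min_intensity)

    # A long, loud laugh (strength 0.8) yields a louder virtual laugh track than a brief chuckle (0.2).
    print(element_intensity(0.8), element_intensity(0.2))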
The following sections highlight examples of the types of elements that may be added.
Add auditory elements dynamically to content based on reactions
Another extension of “Add elements dynamically to content based on reactions” is where the added elements play auditorily. According to some examples, whenever one of the users produces a specific response, a corresponding auditory element is added to the content. The auditory element may include playing a sound or adding an audio effect. Examples of sounds that could be played are crowd noises, comedic sounds, impact hits, etc. The audio effect may be something such as reverb, distortion, spatial movement, etc. The auditory element may, in some examples, be a combination of these elements, or of similar elements. The sound and audio effect may include the use of Dolby ATMOS™ or another type of object-based audio system. An example of a sound using ATMOS™ is playing a virtual laugh track by placing different laughs in different spatial positions. Another example of a spatial audio effect could include hearing a baseball moving around the users after the batter strikes the ball.
Specific examples:
Multiple users are watching a stream of cat videos. When a user’s laugh is detected, a virtual laugh track is played through the speakers. The virtual laugh track may be created by placing different laughs in different spatial positions.
During a game of baseball, when the users’ team hits the ball a spatial “whoosh” sound and audio effect is added to the content. The spatial movement of the sound may make it sound as if the ball is flying past the users. The hit of the ball may also be encoded in metadata provided with a content stream.
Add auditory elements of other users dynamically to content based on reactions
Another extension of “Add auditory elements dynamically to content based on reactions” is where the added auditory elements are sourced from other remote users of one or more other ATS systems. Such examples may allow the users in a local environment to hear how others reacted to content as the users in the local environment consume the content. For example, when a user (User A) reacts to the content, the reaction may be recorded and sent to User A’s friends (including User B). Later, when User B goes to watch the same content, User A's reaction may be sent to User B’s playback device through the content’s metadata stream. User A’s reaction may be played for User B at the same point in the content where User A reacted. The reaction played back for User B may, in some examples, contain the reactions of many users played at the same time. The reactions played back for User B may not be from a friend of User B.
Specific Examples:
Jane is watching a game of football between Liverpool and Manchester. She cheers for all the goals scored for Liverpool and boos at some goals scored by Manchester. Jane’s friend Bob later goes to watch the match between Liverpool and Manchester. Bob gets to experience watching the game with Jane’s cheering and booing as he watches the game.
TV shows are often watched by millions of people. As a show is watched by users, their reactions could be recorded then sent to the cloud where these reactions are mixed to form a crowd reaction to the TV show. When another user goes to watch the show, the crowd reaction audio may be played alongside the content.
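One hypothetical way to realise the playback of remote reactions is to store each reaction with the content timestamp at which it occurred and to mix in any reactions whose timestamps fall within the interval currently being played; the record layout and window size in this Python sketch are assumptions for illustration.

    from dataclasses import dataclass

    # Hypothetical record of a remote user's reaction, timestamped against the content timeline.
    @dataclass
    class RemoteReaction:
        user_id: str
        content_time_s: float   # position in the content at which the reaction occurred
        clip_id: str            # identifier of the recorded reaction audio

    def reactions_to_play(reactions: list[RemoteReaction],
                          playhead_s: float,
                          window_s: float = 0.5) -> list[RemoteReaction]:
        """Return reactions whose timestamps fall inside the current playback window."""
        return [r for r in reactions
                if playhead_s <= r.content_time_s < playhead_s + window_s]

    stored = [RemoteReaction("user_a", 125.3, "cheer_01"),
              RemoteReaction("user_a", 311.0, "boo_02")]
    print(reactions_to_play(stored, playhead_s=125.0))  # only the cheer at 125.3 s is due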
Add visual elements dynamically to content based on reactions
Another extension of “Add elements dynamically to content based on reactions” is where the added elements appear visually. Whenever one of the users produces a selected type of response, a visual element may be superimposed or composited onto the display with the content. Said visual element may be an icon, animation, emoji, etc. Alternatively, or additionally, the visual element may be a visual effect that is added to the content (e.g., shaking, changing colour).
Specific examples:
Jane is watching the Academy Awards telecast on her ATS-enabled TV. While introducing the Best Actor in a Motion Picture category, the host of the show asks the home audience to yell out the name of their favourite actor. Jane is a Tom Cruise fan and enjoyed Top Gun Maverick this year, so she yells out “Tom Cruise.” In response, Jane’s TV displays an overlaid image of a virtual Oscar being held aloft by a cartoon representation of Tom Cruise, whereas her neighbour Ben, who is a Brad Pitt fan, sees an overlaid image of a virtual Oscar held aloft by Brad Pitt.
Injuries in ice hockey matches are common occurrences. When watching an ice hockey match an ATS-enabled TV may be configured to show an animation of an ambulance whenever the user is heard to say “bring on the ambulance.”
Batman circa 1960 would routinely superimpose words such as “Smash!,” “Bam!,” and “Pow!” over the action in a cartoon font whenever a character was struck in a fight scene. Batman 2026, being viewed on an ATS-enabled TV, could display similar words, based on user utterances in fight scenes. For example, the Penguin is hit by Robin and the user yells “Pow!” Based on this an ATS-enabled TV may be configured to overlay the text “Pow!” instead of the other available options “Smash!” and “Bam!.”
Add lighting elements dynamically to content based on reactions
Another extension of “Add elements dynamically to content based on reactions” is where the added elements are changes to the lighting. In some examples, whenever a user engages with the content in a predetermined manner, a change to the lighting may be issued. These lighting changes may include a bulk change to the lighting or may trigger an animation using the lights. The bulk change may include changing the colour and intensity of all lights. In some such examples, each light may be changed to its own independent colour and intensity. In some examples, an animation may be issued to play on the lights in response to a user reaction. Some example animations are a wave, strobing and flickering like a flame.
Specific examples:
The user is watching a horror movie and verbally indicates they are too scared watching the film. In response to this, the lights are raised in brightness, reducing the fear induced by the content.
A family is watching a game of rugby where the team colours are red and blue. The ATS detects the family positively responding to the red team and the lights in the room make a bulk change to red.
Augment content to improve intelligibility based on reactions
Suppose one or more users are watching content on a playback device associated with an ATS. Whenever one or more users engage loudly with the content (e.g., laughing, cheering), in some examples the content may be adjusted to improve the intelligibility of what might otherwise be missed by the one or more users. The intelligibility may be improved by delaying the next scene until the user reaction has subsided, increasing the volume of the dialogue, turning on captioning so the next scene can still be understood, etc. The content stream may, in some examples, contain extra media to optionally extend the scene to achieve the time delay. Without said extra media, the time delay can be achieved by pausing the content until the user response has decreased in intensity. Alternatively, generative AI could be used to generate extra media to extend the scene for as long as required to prevent the content from progressing, until the users are ready to move on. In some examples, a volume increase of the dialogue may be proportional to the intensity of the user reaction.
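A minimal Python sketch of two of the adjustments described above (holding the next scene while the reaction is still loud, and scaling a dialogue-volume boost with reaction intensity) follows; the thresholds, intensity scale and gain range are assumptions for illustration.

    # Hypothetical intelligibility adjustments driven by the intensity of a user reaction (0..1).
    def should_hold_next_scene(reaction_intensity: float,
                               resume_threshold: float = 0.3) -> bool:
        """Delay the next scene while the reaction (e.g., laughter) remains above a threshold."""
        return reaction_intensity > resume_threshold

    def dialogue_gain_db(reaction_intensity: float, max_boost_db: float = 6.0) -> float:
        """Boost dialogue volume in proportion to the reaction intensity."""
        clamped = max(0.0, min(reaction_intensity, 1.0))
        return clamped * max_boost_db

    print(should_hold_next_scene(0.7))   # True: keep extending or pausing the current scene
    print(dialogue_gain_db(0.5))         # 3.0 dB of dialogue boost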
Specific examples:
Bob is watching a comedy with Jane. Bob finds the punchline to a scene hilarious and has a long-extended laugh. The next scene of the comedy may be delayed until Bob can contain his laughter to a certain level.
A family is watching the Super Bowl and they cheer loudly for an extended time for a touchdown that is scored by their team. Closed captions of the commentators are automatically turned on for the period that the users engage for. This allows users to follow what the commentators are saying.
A user is watching stand-up comedy and laughs at a joke told by the comedian for an extended time. The system increases the volume of the comedian’s voice while the user continues to laugh.
A user is having trouble understanding the dialogue in a film they are watching. A dialogue enhancement feature may be turned on that improves the audibility of speech in the room.
In the same scenario as above, instead of dialogue enhancement, closed captions are turned on.
Personalization through Feedback to Content Producers
This section shares similarities with the Long-term Personalization and Short-term Experience Augmentation sections, except that here the personalizations are performed by a content producer based on the users’ attention metrics as detected by one or more ATS systems. The attention feedback may come in the form of users explicitly responding upon request of the content provider or implicitly attending to the content.
Examples of requesting and responding to explicit feedback include:
The presenter says, “Make some noise!” and the viewers either make noise or do not.
The presenter says, “What do you viewers at home think?” where the viewers may respond with a sentiment such as booing or cheering.
The presenter says, “Should we pick the red or the blue object?” where the viewers respond with “Red,” “Blue” or a non-matching response.
Examples of users implicitly attending to content may include all attention types mentioned in this disclosure that are not responsive to an express request in the content.
The personalization decisions may be made by the content producer or by an automated system implemented by the content producer. In some examples, personalization may include personalizing the content for the average viewer. In some examples, personalization may involve producing multiple versions of personalization, catered to multiple audience categories (such as users in a particular country or region, users known to have similar interests or preferences, etc.). In some examples, the user attention responses may be sent to the service (e.g., to the same server or data center) that is providing the content or to a different service.
Next-iteration personalization of content by producers based on reactions
Content producers provide content that can be watched by consumers at any time. A content producer may be provided with information regarding how users in ATS-enabled environments have attended to their content. This information may, in some examples, be used by the content producer to provide a personalised spin to the next iteration of the content (e.g., episode, album, story).
Specific examples:
An influencer receives attention metrics regarding their previous short video and finds that users were highly attentive and want to see more products like the one they had displayed. The influencer decides to present a similar product for their viewers in the next short.
A vlogger making a series of videos around a video game asks the viewers, “What character should I play as next time?” at the end of his video. The viewers indicate (e.g., vocally, by pointing) what character they’d like to see him play next time. The vlogger then decides what to play as next time with these attention results in mind.
A TV show ends on an open-ended cliff-hanger. Based on the ATS user responses, there were four primary types of response to the ending. The TV show producers use this information and decide to make four versions of the next episode. The users are shown the version of the episode that is intended for them based on how they reacted to the previous episode.
Interactive live streamed media based on reactions
This section describes examples of what can be achieved when the users’ attention information is being sent back to the content producers in real-time. The content producer can decide to adapt the content in real-time based on these analytics.
In some examples, an ATS-enabled system may provide the option for other users to see how an audience is responding. For example, the content producer may provide attention information in the content. Alternatively, real-time attention information may be sent in the metadata stream of the content back to the users. Seeing that other users are responding in a particular manner may increase the likelihood of a particular user responding in the same manner. Similarly, seeing that others are not responding might make a user realize that they have greater sway on the content presentation by responding in a certain way (e.g., cheering for a certain character).
Interactive live streamed gaming based on reactions
Other extensions of “Interactive live streamed media based on reactions” may be where the content is, or includes, gaming content, whether the gaming content involves an individual user gaming, a gaming competition, etc.
Specific examples:
Jane is live-streaming herself playing a video game. An option has presented itself in the game where she can go to either the left or right path. She asks her viewers to call out which way she should go, “Left” or “Right.” The attention responses of the viewers are sent back to Jane, saying 70% of people want her to go “Left.” Jane decides to go right to mess around with her viewers.
Bob is hosting a live stream playing a trivia video game. He asks his audience what out of the multiple choice is the correct answer. The most selected answer by his audience turns out to be incorrect. Bob augments the stream for a random portion of his users with a camera shake saying, “We got the answer wrong, we must suffer the camera shake together!”
Interactive live streamed music related content based on reactions
Another extension of “Interactive live streamed media based on reactions” is where the media is music related content such as a band, disc jockey (DJ), live music event, one or more musicians talking and playing around with different instruments, etc. Some attention responses that may be used for a live music stream include clapping, singing along, dancing, moving in time with the music, etc. Attention responses such as having the users sing the most memorable line in the song might be explicitly requested of the users by the content producer.
Specific examples:
A DJ is performing a show that includes participants joining remotely with their ATS- enabled devices. The DJ slowly builds up to more energetic songs. The attention metrics show the DJ that the audience is already dancing more than they would usually at this point in their show. Based on this information the DJ decides to progress to the more energetic songs earlier than the DJ had originally planned.
A musician is hosting a live stream of themselves in their studio. Whilst playing a guitar solo, the musician can see a reported low level of attention to the guitar solo.
In addition, some viewers were responding by saying, “bongo.” The musician decides to start playing bongo drums instead.
Interactive live streamed user-generated content based on reactions
Another extension of “Interactive live streamed media based on reactions” is where the media is user-generated content.
Specific example:
A vlogger is hosting a live stream of themselves in a foreign country. The vlogger enters a shop to buy snacks and shows the selection of snacks on the shelf. The vlogger asks viewers what things they think he should buy. The vlogger uses the attention information to decide what to buy to help personalize the experience for the viewers.
Interactive live-streamed show based on reactions
Another extension of “Interactive live streamed media based on reactions” is where the media is a show from a broadcaster (e.g., reality TV show, episodic TV show, awards show, live movie).
Specific examples:
During the People’s Choice Awards show, voting is allowed for the best film through users’ ATS-enabled devices. The presenters say, “What is the film you think deserves best picture?” and the users yell the name of the film they want to see win. The presenters then announce the winner based on the voting that took place using the attention metrics.
A new interactive Jackass™ movie is being filmed and aired live. They allow users of an ATS-enabled device to vote for who should wear a cape and jump off a ramp on a quad bike.
In the finale of Eurovision™, the winner is decided based on real-time attention indications from ATS-enabled devices.
During a presentation of Saturday Night Live™, the producers are interested to know how people are attending to the content in real time. The producers discover that a joke just made by the host was particularly funny for the viewers. From this, the producers encourage the studio audience to laugh louder (for example, via signage visible to the audience but not to TV viewers), and the host is told through his earpiece to continue riffing on the joke with the guest for a bit longer.
Interactive live streamed media viewing session based on reactions
Another extension of “Interactive live streamed media based on reactions” is where the content presentation is a viewing session for a show, which could be a replay. Multiple participants join through many ATS-enabled devices to watch the viewing session of the show, in some examples at the same time. This viewing session may be hosted by a human producer or by an automated system. Some examples of suitable viewing session shows are “choose your own adventure” shows, betting game shows, trivia game shows, children’s TV shows (such as shows including calls and responses), etc.
Specific examples:
Mr. Hug Man has a segment in a show where he will hug a celebrity. In a viewing session multiple participants vote for how long they think Mr. Hug Man will hug the celebrity before the celebrity taps to break the hug.
A game is hosted by an automated system where viewers must spot where a list of items are hidden in a scene.
Recommendations
Some disclosed examples involve recommending user content based on user preferences and/or a current user state as determined by analytics using an ATS. The recommendations may be computed in the cloud or on the user’s ATS-enabled device(s). Recommendations may be decided based on short-term or long-term information, or a mixture of the two. The recommendation system may also be optimized in the cloud to better grasp what content a larger group of people (other than people in a particular environment, such as a particular home) enjoy in certain circumstances (e.g., user mood, viewing session length, news events).
Content may be recommended before, during or after content playback. In one example of recommending content during playback, the user is not responding positively to the content currently being played, so a different piece of content aligned with their preferences is recommended to the user.
Content recommendations based on mood
Users may be recommended content based on their mood as detected by their ATS-enabled device(s). The ATS may detect the user laughing, sighing, the valence and/or arousal level in speech, the user not reacting to content they would usually perk up to, etc. In some examples, the ATS system may also leverage a closed loop that may be created as aforementioned. For instance, the ATS system may recommend content based on whether a particular type of content had a positive impact the last time it was consumed when the user was in the user’s current mood. Some examples include exploring whether uplifting or sad music is more helpful to get this user out of a negative mood, or whether watching a game of sport is more enjoyable than a comedy when the user is in a euphoric mood. Recommendations may also come in the form of short-term mood estimations that do not consider the user’s preferences, so as not to form a closed loop of recommendation improvement.
Mood-based content recommendations may also be determined using long-term analytics. Suppose the ATS is not yet aware of the user’s mood (e.g., the user has only just woken up). In some examples, the ATS may refer to long-term mood information to infer a likely mood. Such inferences may come in the form of finding patterns (e.g., the user is generally happy in the morning), detecting a trend (e.g., the user’s mood has been improving moving into summer) or by using filtered mood information (e.g., the user has been in a good mood this week, so may presently be in a good mood).
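As one illustration of the “filtered mood information” mentioned above, a simple exponential moving average of recent mood estimates could be used to infer a likely current mood before new observations are available; the valence scale and smoothing factor in this Python sketch are assumptions for illustration.

    # Hypothetical filtered mood estimate: exponential moving average of valence scores
    # from previous sessions (-1.0 = negative mood, +1.0 = positive mood).
    def filtered_mood(valence_history: list[float], alpha: float = 0.3) -> float:
        """Return a smoothed mood estimate that weights recent sessions more heavily."""
        if not valence_history:
            return 0.0                      # no information: assume a neutral mood
        estimate = valence_history[0]
        for valence in valence_history[1:]:
            estimate = alpha * valence + (1.0 - alpha) * estimate
        return estimate

    # The user has been in a good mood this week, so a positive mood is inferred.
    print(filtered_mood([0.2, 0.5, 0.6, 0.7]))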
Specific examples:
Bob is in a somber mood, and this is detected by the ATS as he is not laughing at jokes in the light-hearted comedy that matches his usual preferences. A dark comedy is recommended to Bob as the next piece of content. The ATS detects that Bob’s mood improves a little through light chuckles. This information is stored (possibly on his device, or in the cloud, etc.) for when Bob finds himself in a similar mood again, to help inform the recommendation system.
A child is watching a TV show and starts to become scared. The child is then recommended to change to a calmer show.
Content recommendation based on user’s focus on particular views
The term “views” may mean any visual element representing a particular type of content. Some examples of views include thumbnails whilst browsing content options, different screens playing content simultaneously (e.g., a TV playing a show with a tablet playing a game), multiple windows on a screen, etc. The focus on particular views may be determined by the ATS-enabled device or devices, through gaze tracking, via audio direction-of-arrival information or otherwise. Using such information and knowledge of playback device locations, in some examples the ATS may estimate which content the user is most likely to be interested in and may recommend more content of a similar type. For instance, knowing how long each thumbnail is looked at is a form of measurable attention.
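A minimal Python sketch of turning gaze dwell time on thumbnails into a ranking of the content types a user appears most interested in follows; the dwell-time measurements, labels and function name are placeholders for illustration.

    from collections import defaultdict

    # Hypothetical ranking of content types by accumulated gaze dwell time on their thumbnails.
    def rank_views_by_dwell(dwell_events: list[tuple[str, float]]) -> list[tuple[str, float]]:
        """dwell_events: (content type, seconds the gaze rested on a thumbnail of that type)."""
        totals: dict[str, float] = defaultdict(float)
        for content_type, seconds in dwell_events:
            totals[content_type] += seconds
        return sorted(totals.items(), key=lambda item: item[1], reverse=True)

    events = [("nature_documentary", 4.2), ("sitcom", 1.1), ("nature_documentary", 3.0)]
    print(rank_views_by_dwell(events))  # nature documentaries ranked first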
Specific examples:
When a user is scrolling through a list of content options, the ATS-enabled device(s) may report what content thumbnails the user is most engaged with. As the user continues to scroll for more options, the recommendations are directed towards what the user has implicitly expressed an interest in and given attention to.
Content recommendation based on music attention
Music attention may be determined by ATS-enabled device(s) through detecting singing along, drumming in the air or on an object (e.g., a steering wheel), or yelling the name of the song or artist when it starts playing. The music attention indications may then be used to determine the user’s music preferences.
The content recommendation may come in the form of a recommended or automatically curated playlist based on the user’s music attention indications. The recommended content may be similar in terms of genre, era, artist, song structure, etc. Other forms of media may also be recommended based on music preferences. For example, a movie may be suggested based on having a soundtrack that aligns with the user’s musical tastes.
Specific examples:
Alice is on a road trip and has a habit of tapping along to songs she likes on the steering wheel. Her music preferences are determined using this information, and a curated road trip playlist for the rest of her journey is recommended.
Exercise Use Cases
Exercise use cases may involve long-term personalization, short-term experience augmentation, personalization through feedback to content producers, or combinations thereof, in some instances at the same time. Exercise content may include a yoga class, a workout class, music that a user likes listening to whilst doing a home workout, etc. Specific sounds that may be listened for to detect exercise-related attention indications on an ATS-enabled device(s) might include:
- puffing; deep breaths; vocally counting repetitions; increased breath frequency; shaking of body; visually detecting exercise movements; visually detecting the users’ technique quality; emotion recognition and so on.
In some examples, an ATS may supply a human or virtual coach with attention-related information in real time. The attention-related information may correspond with one or more users’ attention to the exercises. Alternatively, or additionally, the ATS may supply the coach with the users’ track records, performance trends, attention information from previous sessions, etc.
Specific examples:
Jane uses her ATS-enabled device(s) when she works out every Monday. Her progress is tracked over time and the difficulty of classes is made to match her progress. Jane is also provided with a run-down of her progress on her ATS-enabled device(s), such as a dashboard provided on a phone display.
Bob wishes to do an exercise class that is at his level and meets his preferences. Bob watches a pre-recorded exercise class that is personalized for him by having the ‘tracks’ of the video selected for him in order to suit his needs. The ‘tracks’ may refer to different sections of pre-recorded video, so that a user’s preferred exercises may be selected, in the order most desirable for the user.
Multiple users, for example two, decide to do an exercise class together. They have different preferences for exercise movements and different levels of competence in areas such as flexibility, strength and aerobics. They provide input to the ATS indicating that they wish to complete a core workout and the ATS automatically generates an exercise class for them catered to their abilities and preferences.
An exercise class is hosted by a virtual coach. The user is detected by an ATS to be struggling to complete the final repetitions due to asthmatic issues. The ATS causes the virtual coach not to encourage the user to complete the repetitions.
An exercise class being hosted by a virtual coach has a participant who is not pushing themselves to finish a set of exercises. This is detected by the ATS-enabled device(s) and the virtual coach encourages the participant to work harder and push through.
A live exercise class is hosted by a coach with many participants across multiple ATS-enabled devices. The coach is informed that John is not keeping up with their current exercise. However, the coach is also supplied attention information on how John performed in previous sessions. The coach supplies encouragement, e.g., by calling out “come on, John, you did it last week and I know you can do it!”
In another live exercise class hosted by a coach, an ATS detects that one of the participants has particularly poor form with the current exercise. Based on feedback from the ATS, the coach decides to speak directly with that participant through the user’s content stream, to provide some tips on how to do that exercise better.
Playback Optimization and Control Use Cases
This section provides examples of long-term personalization and short-term experience augmentation. “Playback optimization” in this section refers to optimizing the playback across orchestrated devices, in some examples based on an objective function. This objective function may involve, for example, maximizing attention, maximizing intelligibility, maximizing spatial quality for a listening position, etc. The updates required to achieve the playback optimization may be computed on the user’s ATS-enabled device or in the cloud and sent to the users’ device, possibly in a metadata stream provided with content. “Playback control” refers to features such as play, pause, rewind, next episode, etc.
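A minimal Python sketch of optimizing playback against an objective function across orchestrated devices follows, here a brute-force search over candidate volume levels that trades off predicted attention against predicted intelligibility; the objective weights, the toy predictor functions and the candidate levels are all assumptions for illustration.

    from itertools import product

    # Hypothetical objective: weighted sum of predicted attention and intelligibility
    # for a candidate assignment of volume levels to orchestrated devices.
    def objective(volumes, predict_attention, predict_intelligibility,
                  w_attention: float = 0.7, w_intelligibility: float = 0.3) -> float:
        return (w_attention * predict_attention(volumes)
                + w_intelligibility * predict_intelligibility(volumes))

    def best_volume_assignment(devices, candidate_levels,
                               predict_attention, predict_intelligibility):
        """Brute-force search over candidate volume levels for each device."""
        best, best_score = None, float("-inf")
        for levels in product(candidate_levels, repeat=len(devices)):
            volumes = dict(zip(devices, levels))
            score = objective(volumes, predict_attention, predict_intelligibility)
            if score > best_score:
                best, best_score = volumes, score
        return best

    # Toy predictors standing in for ATS-derived models of the listeners in each zone.
    attention_model = lambda v: -abs(v["front"] - 0.6) - abs(v["rear"] - 0.2)
    intelligibility_model = lambda v: -abs(v["front"] - 0.5)
    print(best_volume_assignment(["front", "rear"], [0.2, 0.4, 0.6, 0.8],
                                 attention_model, intelligibility_model))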
Specific examples:
Six users are in an ATS-enabled car with music playing. There is a conversation between the two users in the back row of seats. They are not attending to the music content, but instead are attending to their discussion. The music playback is attenuated in the back row of the car but kept louder for the other users who are enjoying the song.
Alice likes watching TV in the living room adjoining the kitchen whilst Bob makes dinner. Bob enjoys listening to music as he prepares dinner. The playback is jointly optimized for Alice and Bob so that they both get to hear spatial audio centred around their area of the room. Moreover, the audio from Alice’s content is much quieter in Bob’s zone and vice versa.
Many users are watching a movie in a room, but then all the users leave the room to greet a visitor. The ATS detects that there is no one in the room (for example, according to camera data) and the attention level for the content drops significantly. The ATS causes the movie to be paused.
Bob is watching a movie when he receives a telephone call. An ATS-enabled device(s) detects that Bob is not attending to the content, because his pose shows Bob holding the phone to his head. The ATS causes the movie to be paused for Bob while he takes the call.
In the same scenario as above, Bob has displayed on previous occasions that he likes to play the content whilst he is on the phone by resuming the content after the automatic pause. The system notices Bob started a phone call again during a movie and does not pause the content for him, based on his learnt preferences.
Advertising using Attention Feedback Use Cases
Previously-deployed advertising systems have limited ways to determine a user’s attention. Some current methods include having a user click an advertisement (“ad”), having a user choose whether or not to skip an ad, and having a trial audience fill out a survey about the ad. According to such methods, the advertiser does not necessarily know if a user is even present when an ad is presented.
Utilizing an ATS allows for better-informed advertising. Better-informed advertising may result in improvements to current techniques, such as advertising performance, advertising optimization, audience sentiment analysis, tracking of user interests, informed advertising placement, etc. Moreover, better-informed advertising allows for new advertising opportunities such as: interactive advertising; personalized advertising; attention-driven shopping; and incentivizing attention for enhanced advertising.
These ideas will be explored in detail throughout this section with lists of use cases.
Example Use Cases
Advertising analytics based on user attention
When users consume content on their ATS-enabled devices, advertising played on the system may also leverage the ATS. Rich advertising analytics can be derived from an ATS. Some examples of analytics that may be provided by an ATS include: how engaged were users overall; what aspects of the ad were users engaged with; were the users engaged positively or negatively; who was engaged with the ad (e.g., based on demographic, user preferences and interests); when were users engaged with the ad in a specific way (e.g., morning/night, Monday/Tuesday, etc.); whether specific actors in the ad triggered more or less attention; and which section of the content triggers higher or lower attention (ads at the beginning, middle or end of content; or after a cliff-hanger, joke, or romantic scenes, etc.).
Specific examples:
The marketing team of a new 3D printed basketball want to know how their latest ad has been performing to determine if they are getting value for money spent. The marketing team launched the ad on ATS-enabled devices and now receive analytics of how the ad is performing.
Following the Super Bowl of 2023, rankings of the best ads during the event are desired for public novelty interest purposes. The Super Bowl was aired across ATS-enabled devices, so the analytics of each ad are collated and compared to determine the rankings.
Optimize advertising based on user attention
Advertising can be optimized for the desired attention response using the “Advertising analytics based on user attention.” Such optimization may be achieved through iteratively improving the ad after releasing new versions based on the analytics. Alternatively, or additionally, such optimization may be achieved by releasing many versions of the ad at the same time to see how the user response differs.
Specific examples:
A company wants to have their ad optimized for maximal positive user attention before performing a full release of their marketing campaign. The company decides to do a soft release of the ad to 1000 ATS users. Using the analytics from this soft release, the company creates another version of the ad and performs another soft release, iterating until they are satisfied with the ad. Finally, they release the advertising campaign with confidence in their ad.
In the same scenario as above, instead of soft releases of the content, there are iterations of focus group testing. The focus groups watch the content with ATS-enabled devices.
A company has many great ideas to market their new product. They decide to make many versions of their ad and release them at the same time. Using the analytics from ATS-enabled devices, they find that one version of the ad was performing significantly better than the rest. They decide to continue advertising using only the best performing ad.
Sentiment analysis of product based on user attention
Sentiment towards products can be determined using users’ ATS-enabled devices during content and advertising playback. This sentiment may also be evaluated on a user or population basis. For instance, a population may be users within a certain age group, users who share an interest (e.g., cars), users with certain attention characteristics (e.g., users who laugh in the typical frequency range of adult women), a learnt segment of the population (e.g., users who sing in a certain frequency range with a particular timbre), the entirety of all ATS users, etc.
Demographics and user interests are a subset of the factors that may be used to select a population with which to evaluate sentiment analysis. Demographics and user interests may be specified by the user, estimated by the ATS, or both. The user interests may, for example, be estimated in the ways specified in the ‘Determining a user’s preferences’ section. The demographics may be estimated from attention indications such as the characteristics of their responses (e.g., frequency range of their laughter, the height of a jump, their visual appearance indicating age), the type of attention responses they use (e.g., the types of words and dance moves they use), etc.
Sentiment can be determined through a user implicitly or explicitly reacting to a product. Implicit reactions may include yawning, gazing at the product when it appears on a screen, etc. Explicit reactions may include positive reactions such as saying “I love <product name>” or negative reactions such as saying “<product name> again” then proceeding to leave the room, etc.
Specific examples:
The latest James Bond movie features James wearing an Acme watch. Acme wishes to know how the watch is perceived during this product placement and is supplied with population-scale sentiment analysis of the watch as determined by an ATS.
Acme decides to enter into the same type of product placement agreement for the next James Bond film.
Extending the last example, prior to release of the film, Acme was already running an advertising campaign on TV. The sentiment of the watch in TV ads is captured using ATS-enabled devices after the James Bond film is released. As the film gains more viewers, Acme finds higher attention and sentiment towards their watch in their TV ads.
A company is trying to expand the age demographic of their product. They wish to know how 40- to 50-year-olds perceive the product. They determine sentiment on the product for 40- to 50-year-olds using a mixture of attention analytics from users who have listed themselves as being within that group and from users who have not provided their age but are estimated by an ATS to be in that demographic. From this data, the company is given an extensive amount of sentiment analysis information from the advertising it launched on ATS-enabled devices.
A company is trying to target their product to introverted people. A hypothesis is made by the company that introverts respond more quietly compared to other groups of people. They decide to filter the sentiment analysis of their product to people who do not engage loudly and do not react with large movements.
Interactive advertising using an ATS
Advertising may be interacted with using ATS-enabled devices. Interactive components of an ad may be sent through the ad’s metadata stream, including information such as what attention types to respond to and the respective actions the ad should respond with. Types of actions the advertising may respond with may include changing the content of the ad, actioning a control mechanism such as skipping the ad, storing a user response (e.g., in the user’s device or in the cloud) for the next time an ad for the same product is played, etc. These actions can help to determine whether users are engaged and allow for gamification of ads. For instance, the user may be rewarded for skipping an ad, as the user has displayed a level of engagement with the device.
Specific examples:
Company ABCD has launched a marketing campaign for their new car. They decide to place advertising on ATS-enabled devices to allow for an interactive experience. Their ad features a car racing game. The ad allows users to either lean left or right, or call out “left” or “right,” to control the vehicle. Users become more responsive to and aware of the new car than other cars advertised with traditional methods.
An insurance company wants to show how little accidents can happen everywhere in life, but little accidents are not so bad if one is insured. To achieve this, the insurance company launches a “choose your own adventure” ad on ATS-enabled device(s). Users can pick a new path for the story every time the ad appears, which always results in the outcome being “you should be insured.”
A company is releasing their new game, and with it an ad to get people involved and excited about the release. This advertising is hosted on ATS-enabled device(s) and the campaign takes multiple ad placements to finish the series of ads resembling the game’s features. Each time an ad of this campaign is played, the user is given mechanisms to personalise the ad, such as customising the appearance of the main character. The next time an ad from this campaign plays, the character is given the customised appearance, and the story progresses with any other interactions the user previously made.
A user is consuming content with ad breaks embedded into the content. The user is provided with a reward for not skipping and actively attending to the ad (e.g., strong eye contact, or discussion about the ad). An example reward may include not receiving another ad break for the rest of the thirty minutes of content playback.
Personalised advertising based on user preferences
Advertising may be personalised in a similar fashion to the personalised content adaptions detailed in the Long-Term Personalization section. As mentioned in the Long-Term Personalization section, the adaptions may be shipped in the metadata stream, sent directly as a selected version or created generatively. In some examples, an ad that simply matches the user’s interests may be selected.
Specific Examples:
Two users watching TV together are known to enjoy comedy as detected by ATS-enabled devices, so they are shown a funny version of an ad. Conversely, users of another household prefer content that is to the point and are shown a more serious version.
A user viewing basketball on their ATS-enabled device(s) is known to be a fan of Stephen Curry as they positively respond to him. An ad for tickets to the next game is personalised by selecting the version where Stephen Curry stars.
In the same scenario as above, the user never speaks of Stephen Curry. However, the ATS is aware that Stephen Curry is the user’s favourite player because the user’s gaze always follows him around the court.
Jane and Bob are watching a show about travelling the world. They discuss how they’d love to go visit Greece and this topic is detected by their ATS-enabled device(s) playing the show. In the next ad break of the show, Jane and Bob are shown advertising for a resort in Greece.
Whilst Bob was watching the latest James Bond film, he mentioned “I really like his suit” and an interest in suits with relation to James Bond is determined. During the next ad break, Bob is shown advertising for a similar-looking suit. As mentioned above, this example can be enhanced by listening for pre-populated words identified with the content (“suit” or “ABCD”).
In a similar scenario to the one above, the user does not mention liking the suit.
Instead, a camera connected to the ATS allows the system to determine that the user is paying special attention to James Bond’s watch. For this reason, the user is served an advertisement for the watch during the next ad break.
Attention driven shopping
Many shopping-related opportunities may be enabled through using ATS-enabled device(s). These could range across many types of content. Some examples include shopping-related channels of content and virtual assistants. New shopping-related opportunities may include:
Sentiment analysis of products similar to what was detailed in the “Advertising analytics based on user attention” but specifically for shopping related content (e.g., online shop, shopping TV channel);
Direct interaction with shopping material through ATS-enabled devices;
Use of user preferences and interests to optimize their shopping experience and the products, deals, etc., shown;
More attention-driven shopping opportunities also may be possible.
Specific examples:
Whilst a user is browsing an online shop, gaze tracking or the user verbally saying “Ooo I like that” indicates that the user is interested. The item is automatically added to their cart.
A directly interactive ATS-enabled shopping experience explicitly tells the audience, “Clap if you’d like this product today!” A subset of the ATS-enabled viewing audience that engaged with the product is shipped the product for free. The remaining users who were not given the product have the item added to their cart and are sent updates to the pricing of the product.
A shopping TV channel utilises attention analytics to determine which product users are most interested in. Users’ devices are also aware of their owners’ sentiment towards the product.
A shopping TV channel utilises attention analytics to determine which aspects (e.g., features, price) of an advertised product contribute to converting a user to buy the product or contribute to a user losing interest in buying. For example, if a user is engaged when the product is presented but loses attention when the price is presented, the price is likely positioned incorrectly.
A virtual assistant may be interacted with using an ATS. The virtual assistant may have access to update the user’s shopping interactions. The user may say, for example, “Listen Dolby, add that item to my cart,” as they are referring to an item they are attending to on screen and the virtual assistant may automatically add the item to their cart. The item they were attending to may have been in non-advertising related content such as a movie, podcast, etc.
Whilst a user is consuming shopping-related content, the model of each product on screen is a version selected to match their preferences. For example, John insists that all of his personal belongings are yellow. He is shown a yellow version of each product if a yellow option is available.
An ATS has detected that Alice responds better to shopping content when items are ordered with the prices ascending. The next time Alice enters an online shop, this consumption preference is automatically selected for her.
A user shows interest in a car that appears on a shopping channel. The playback system responds with, “Would you like me to book a test drive for you?” The user engages by saying “Yes” and the test drive is automatically arranged with an ATS-enabled device.
A user shows interest in a vacuum cleaner on a shopping TV channel. The user says, “Wow that looks fantastic, I want it,” to which the ATS-enabled device(s) adds to the playback “Do you want to order it now?” The user responds to the question with, “No, I’ll have a think about it,” and the order is declined with the user reaction.
Informed advertising placement based on user attention
The placement of advertising can be informed using attention information as detected by an ATS. We use “advertising placement” to mean when to place advertising and what advertising to place. Choices of advertising placement may be decided using long-term trends and optimizations or in real time using information about how the user is engaging at that moment. Moreover, a combination of the two may be used, where real-time decisions of advertising placement may be optimized over the long term. Examples of decisions that may be made using this information include:
Placing advertising where users are least engaged with the content, so as to minimise the annoyance of the ads.
Placing advertising where users are most engaged with the content, so as to maximise the attention to the ads.
Placing advertising about a topic when the topic is present in the content.
Placing advertising about a product when users with interests in that product type are engaged with the content.
Optimize advertising placement based on the effect it has on the ad’s or content’s performance, using a closed loop enabled by the ATS.
Combinations of any of the above.
Learnt decision making of advertising placement could be done at different levels, such as: per user (e.g., preferring advertising at the start of content); per population type (e.g., Gen Z viewers, viewers in France, jazz fans); per scene (e.g., certain users are more likely to engage with ads when a specific actor is on screen: a Tag Heuer ad with Ryan Gosling following a scene in which Ryan Gosling is on screen, for users that respond to the scene); per episode (e.g., the least annoying point to receive an ad in this episode is learnt); per series (e.g., maximal attention is generally five minutes before the episode ends); and so on.
Specific examples:
An original equipment manufacturer (OEM) that produces ATS-enabled device components sells attention information to broadcasters. The broadcaster then sells advertising spaces based on the level of attention for the space.
A mobile phone video game produces revenue through advertising other games during playback. The game studio that produces the video game determines, using ATS-enabled mobile phones, that users are the least engaged after finishing a battle in the game. The game studio wants to minimize the annoyance of ads, and so decides to place the advertising after battles are finished.
A TV show production company values the viewing experience of their shows. For this reason, they want to optimize advertising placement so as to maximise the content’s performance. The production company uses the attention level after ad breaks to determine the effect of the ad placement on the show. Some attention types they may look out for include a user being excited that the show has returned, all users having left the room, a user now being more engaged with their phone, etc.
During the viewing of a TV show, ad placement is delayed until the user has at least a certain level of attention. Alternatively, advertising might be reduced while the user is engaged with the content.
Predictive advertising placement based on user attention
Drawing on the previous “Informed advertising placement based on user attention” section, in some examples a model may be developed that predicts attention types, attention levels, the best times to place advertising in content, or combinations thereof. When a model is trained, there may not be the need for any additional attention information to decide where to place ads. The model may, for example, learn with reference to one or more content presentations (e.g., audio, video, text). If the model is given the task of predicting attention levels, the content provider or user may decide where to place advertising using this information. Models trained to predict the best times to place ads may be trained to predict different advertising placements based on user attributes, such as age, user interests, user preferences, location, etc.
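One possible (and deliberately simplified) realization of such a predictive model is sketched below in Python: a linear predictor is fitted to per-interval content features and previously observed attention levels, and the ad is then placed where predicted attention is lowest. The feature names, training data and placement rule are illustrative assumptions only.

```python
import numpy as np

# Hypothetical per-interval content features (e.g., loudness, motion, dialogue
# density) and the attention levels previously measured by ATS-enabled devices.
X_train = np.array([[0.2, 0.1, 0.9],
                    [0.8, 0.7, 0.2],
                    [0.5, 0.4, 0.6],
                    [0.9, 0.9, 0.1]])
y_train = np.array([0.3, 0.8, 0.55, 0.9])   # observed attention levels

# Fit a simple linear predictor with least squares (bias term appended).
A = np.hstack([X_train, np.ones((len(X_train), 1))])
coeffs, *_ = np.linalg.lstsq(A, y_train, rcond=None)

def predict_attention(features):
    return float(np.dot(np.append(features, 1.0), coeffs))

# For new content, predict attention per interval and place the ad where the
# predicted attention is lowest (to minimise annoyance) -- or highest, if the
# goal is maximal ad exposure.
new_intervals = [[0.3, 0.2, 0.8], [0.7, 0.6, 0.3], [0.4, 0.5, 0.5]]
predictions = [predict_attention(f) for f in new_intervals]
least_intrusive = int(np.argmin(predictions))
print("Place ad after interval", least_intrusive, "predictions:", predictions)
```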
Specific example:
An authoring tool is created that attempts to predict the best time to place advertising. A TV broadcaster uses this tool to decide where to place ads in their 24/7 program automatically.
Attention incentivisation
Users of ATS-enabled device(s) may be provided with incentives to engage with content or advertising. The incentives may be provided by content providers, content producers, TV manufacturers, etc. An example of a possible incentivised attention type is being expressive (e.g., vocally or with gestures) about how one feels about the content. Example reward types could include something of monetary value, bonus content, etc.
Specific examples:
Jane listens to lots of music by her favourite artist, and an ATS determines that Jane consistently sings along to music by that artist. Based on the ATS data, the label of the artist provides Jane with free tickets to the artist’s show for being one of their biggest fans.
A user watches every game of football played by Manchester and will cheer loudly for every goal. They receive a discount for tickets to go see Manchester play live.
A group of children watches an animated movie. Their clear enjoyment of the film, based on positive vocal emotion by the group, unlocks bonus content for the film. At the end of the film, they are provided with made-up bloopers and additional “behind the scenes” clips.
A user receives an advertisement for a speaker system and responds positively. The user is presented an offer to get a free TV if they buy the speaker system.
Content Performance Assessment Using Attention Feedback Use Cases
Current content performance assessment methods generally involve having a test audience preview content. Obtaining metrics through test audiences has many drawbacks, such as requiring manual labor (e.g., reviewing surveys), being non-representative of the final viewing audience and possibly being expensive. In this section we will detail how these issues can be overcome through the use of an Attention Tracking System (ATS).
Having an ATS allows one to determine exactly how a user responds to content as it is playing back. The ATS may be used in end-user devices, making all content consumers a test audience, reducing content assessment costs and eliminating the issue of having a non-representative test audience. Additionally, analytics produced by an ATS do not require manual labor. Because the analytics are collected automatically in real time, content can be automatically improved by machines, although optimizing content by hand remains an option. Furthermore, using an ATS during a content improvement process may form a closed loop in which decisions made using the attention information can have their effectiveness tested by utilizing the ATS another time. Examples of how an ATS can be leveraged for content performance assessment and content improvement are detailed in this section.
In this section of the disclosure, we refer to a type of metadata that specifies where certain attention responses are expected from users. For example, laughter may be expected at a timestamp or during a time interval. In some examples, a mood may be expected for an entire scene.
In some implementations, a performance analysis system may take in the expected level of reactions to content, as specified by content creators and/or statistics of reactions detected by ATS, to then output scores which can act as a content performance metric.
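By way of illustration only, a performance analysis system of this kind might compare creator-specified expectations against aggregated reaction statistics as in the following Python sketch; the data structures and the capped-ratio scoring rule are assumptions, not a prescribed implementation.

```python
def score_content_performance(expected_events, detected_stats):
    """Compare creator-specified expectations with aggregated ATS statistics.

    expected_events: list of dicts such as
        {"t": 125.0, "reaction": "laughter", "expected_rate": 0.6}
      where expected_rate is the fraction of the audience expected to react.
    detected_stats: mapping (timestamp, reaction) -> observed reaction rate.
    Returns per-event scores in [0, 1] plus an overall content score.
    """
    per_event = []
    for ev in expected_events:
        observed = detected_stats.get((ev["t"], ev["reaction"]), 0.0)
        # Score is the observed rate relative to expectation, capped at 1.
        per_event.append(min(observed / ev["expected_rate"], 1.0))
    overall = sum(per_event) / len(per_event) if per_event else 0.0
    return per_event, overall

expected = [{"t": 125.0, "reaction": "laughter", "expected_rate": 0.6},
            {"t": 300.0, "reaction": "gasp", "expected_rate": 0.4}]
detected = {(125.0, "laughter"): 0.45, (300.0, "gasp"): 0.5}
print(score_content_performance(expected, detected))
```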
An event analyser may take in attention information (such as events, signals, embeddings, etc.) to determine key events in the content that evoked a response from the user(s). For example, the event analyser may perform clustering on reaction embeddings to determine the regions or events in the content where users reacted with similar responses. In some examples, a probe embedding may be used to find times where similar attention indications occurred.
Example Use Cases
Added values for content creators
The ‘Content Performance Assessment Using Attention Feedback Use Cases’ section focuses on the value added for content creators by the ATS. Implementing an ATS may benefit content creators and content providers in several ways. These include:
Content performance assessment;
Content improvement; and
Real-time content improvement
Content Performance Assessment
Assess content performance based on user attention
The performance of content may be determined using attention metrics coming from users’ ATSs. For example, users leaning forwards whilst looking at a screen that is providing content would demonstrate user interest in the content. However, a user talking about a topic unrelated to the content could mean they are uninterested. Such information about user attention to the content may be aggregated to gain insights on how users overall are responding. The aggregated insights may be compared to the results from other pieces or sections of content to compare performance. Some examples of pieces or sections of content include episodes, shows, games, levels, etc. Differences in levels of attention may reveal useful content performance insights. Moreover, the attention information may indicate what users are attending to (e.g., theme, object, effects, etc.). Note that any content performance assessments obtained using an ATS could be used in conjunction with traditional methods of assessment, such as surveys.
Assess content performance using authored metadata
Another extension of “Assess content performance based on user attention” is where the potential user responses are listed in the metadata. Suppose that one or more users are watching content (e.g., an episode of a Netflix series) on a playback device with an associated ATS. The ATS may be configured by metadata in a content stream to detect particular classes of response (e.g., laughing, yelling “Yes,” “oh, my god”).
In some examples, content creators or editors may specify what are the expected responses from audiences. Content creators or editors may also specify when the reactions are expected, for example, a specific timestamp (e.g., at the end of a punchline, during a hilarious visual event such as a cat smoking), during a particular time interval, for a category of event type (e.g., a specific type of joke) or for the entire piece of content.
According to some examples, the expected reactions may be delivered in the metadata stream alongside the content. There may also be a library of user response types — for example, stored within the user’s device, in another local device or in the cloud — that may be applicable to many content streams, which can be applied more broadly. The metadata of what attention indications are expected may be the attention indications that are exclusively listened for and are permitted by the user, in order to give the user more privacy and provide the content producer and provider with the desired attention analytics.
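As a minimal illustration of how such metadata could simultaneously act as a privacy filter, the following Python sketch forwards only those attention indications that are both requested by the content metadata and permitted by the user; the field names and event format are hypothetical.

```python
def filter_attention_events(events, content_metadata, user_permissions):
    """Keep only attention indications that the metadata asks for and the
    user has allowed; everything else is discarded locally."""
    allowed = set(content_metadata["expected_reactions"]) & set(user_permissions)
    return [ev for ev in events if ev["reaction"] in allowed]

metadata = {"expected_reactions": ["laughter", "gasp", "yes_exclamation"]}
permissions = ["laughter", "gasp"]          # user opted out of speech detection
raw_events = [{"t": 12.3, "reaction": "laughter"},
              {"t": 40.1, "reaction": "yes_exclamation"},
              {"t": 55.0, "reaction": "snoring"}]
print(filter_attention_events(raw_events, metadata, permissions))
# -> only the laughter event survives and would be reported for analytics
```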
The user reactions to the content — in some examples, aligned with the metadata — may then be collected. Statistics based on those reactions and metadata may be used to assess the performance of the content. Example assessments for particular types of content include:
Jokes
Content creators add metadata specifying the places where a ‘laughter’ response is expected from audiences. Laughter may even be broken down into different types, such as ‘belly laugh,’ ‘chuckle,’ ‘wheezer,’ ‘machine gun,’ etc. Additionally, content creators may choose to detect other verbal reactions, such as someone who repeats the joke or tries to predict the punchline.
During content streaming, in some examples the metadata may inform the ATS to detect whether a specific type of laughter reaction occurs. Statistics of the responses may then be collected from different audiences. A performance analysis system may then use those statistics to assess the performance of the content, which may serve as useful feedback to content creators. For example, if the statistics show that a particular segment of a joke or skit did not gain many laughter reactions from audiences, this may indicate that the performance of the segment needs to be improved.
Scares
During a horror movie, responses such as ‘oh my god’, a visible jump or a verbal gasp may be expected from audiences. An analysis of ATS information gathered from this authored metadata may reveal a particular segment of a scary scene gained little frightened response. This may indicate the need to improve that portion of the scary scene.
Controversial topics
Some channels stream debates about different groups, events and policies that may receive a lot of comments and discussions. During content streaming, the metadata informs the ATS to detect whether a supportive or debating reaction is presented. Statistics of user responses may then be collected from different audiences. Such ATS data may help the content creators to analyse the reception of the topics.
Provocative scenes
Content creators may add metadata specifying the places in a content presentation where they expect strong negative responses from audiences, such as “oh, that’s disgusting” or turning their head away. This may be of use in horror movies, user-generated “gross out” content, etc. During content streaming, such as a video containing a person eating a spider, in some examples the metadata may inform the ATS to detect if a provocative reaction occurs. The aggregated data may show that users were not responding with disgust during the scene. The content creators may decide that extra work needs to be done to make the scene more provocative.
Magnificent scenes
Content producers may add metadata specifying the places in a content presentation where they expect strong positive responses such as “wow,” “that is so beautiful,” etc., from audiences. This technique may be of use for movies, sport broadcasting, user-generated content, etc. For example, in a snowboarding broadcast, slow motions of highlight moments are expected to receive strong positive reactions. Receiving information about user reactions from the ATS and analysis from data aggregation may give insights to content creators. The content creators can determine whether the audience likes the content or not, and then adjust the content accordingly.
Assess content performance using learnt metadata
Another extension of “Assess content performance using authored metadata” is where the metadata is learnt from user attention information. During content playback, ATSs may collect responses from the audience. Statistics of the responses may then be fed into an event analyzer in order to help create meaningful metadata for the content. The metadata may, for example, be produced according to one or more particular dimensions (e.g., hilarity, tension). The event analyzer may, in some instances, decide what metadata to add using techniques such as peak detection to determine where an event might be occurring. Authored metadata may already exist for the content but additional learnt metadata may still be generated using such disclosed methods.
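A minimal sketch of how learnt metadata might be derived from aggregated reactions is shown below in Python, using simple peak detection over a per-second reaction count; the threshold, window and marker format are illustrative assumptions.

```python
def learn_reaction_metadata(reaction_counts, window=5, threshold=3.0):
    """Turn a per-second count of detected reactions (aggregated over many
    ATS users) into learnt metadata markers via simple peak detection.

    A second is marked as an event if its count exceeds `threshold` and is a
    local maximum within +/- `window` seconds.
    """
    markers = []
    for t, count in enumerate(reaction_counts):
        lo, hi = max(0, t - window), min(len(reaction_counts), t + window + 1)
        if count >= threshold and count == max(reaction_counts[lo:hi]):
            markers.append({"t": t, "type": "laughter_peak", "strength": count})
    return markers

# Synthetic aggregate laughter counts for a 20-second clip.
counts = [0, 0, 1, 4, 9, 3, 0, 0, 0, 2, 6, 2, 0, 0, 0, 0, 1, 0, 0, 0]
print(learn_reaction_metadata(counts))
# -> markers near t=4 and t=10, which an editor could review and keep
```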
Specific examples:
A live stand-up comedy show has all the jokes marked up in the metadata through information coming from ATS-enabled devices.
A show already authored with metadata has additional metadata learnt using ATS-enabled devices. The additional metadata reveals that audiences were laughing at an unintentionally funny moment. The content producers decide to accentuate this moment.
Highlight reel creation based on audience reactions
Another extension of “Assess content performance using learnt metadata” is where a highlight reel is created for content based on the most highly engaged with segments. For example, during a game broadcast, audiences may react excitedly, perhaps yelling ‘go for it, go for it’, or saying ‘oh no’ when their supported team is losing a battle. These reactions may be detected by an ATS and could be predefined reaction types. Collectively, statistics of ATS data may show the relatively most salient moments in the broadcast. These salient moments can be used to make a highlights reel, either automatically or with some effort from editors. Moreover, a highlights reel may be created for each individual user who watched the game with an ATS-enabled device, based on that user’s attention indications alone. In a similar manner, a highlights reel could be made for a subset of the ATS users, such as a group of friends, the ‘red’ team supporters, etc.
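The following Python sketch illustrates one simple way such a highlights reel could be assembled: non-overlapping fixed-length windows with the highest aggregate excitement are selected and returned in chronological order. The window length, scoring and data are assumptions for illustration only.

```python
def select_highlights(excitement, num_segments=3, segment_len=10):
    """Pick the `num_segments` non-overlapping windows of `segment_len`
    seconds with the highest total excitement, then return them in
    chronological order so they can be stitched into a reel."""
    scores = []
    for start in range(0, len(excitement) - segment_len + 1):
        scores.append((sum(excitement[start:start + segment_len]), start))
    scores.sort(reverse=True)

    chosen = []
    for _, start in scores:
        if all(abs(start - c) >= segment_len for c in chosen):
            chosen.append(start)
        if len(chosen) == num_segments:
            break
    return [(s, s + segment_len) for s in sorted(chosen)]

# Synthetic per-second excitement derived from cheering/applause detections.
excitement = [0]*30 + [5, 8, 9, 7, 3] + [0]*40 + [4, 6, 7, 5] + [0]*20
print(select_highlights(excitement, num_segments=2))
```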
Users’ enjoyment may be enhanced if they can hear how other users reacted to the segments highlighted in the reel. This could be done in the same manner as detailed in the ‘Add auditory elements of other users dynamically to content based on reactions’ section of Personalization and Augmentation using Attention Feedback, or in a similar manner.
Top occurrences of event based on user attention
Another extension of “Highlight reel creation based on audience reactions” is where the highlights come from a set of specified events (e.g., jokes, tackles, catches). The events may occur across different pieces of content, for example, episodes, games, books, songs, etc.
Specific examples:
A reel of the top ten best football goals is determined based on responses detected by ATS-enabled devices. The reactions that the ATS looks for may include supportive responses such as applause, cheering, yelling ‘yes!’ etc.
A baseball game gets automatic highlights generated from the sequences that users overall got most excited about.
The top ten funniest jokes from a stand-up comedian are automatically selected for his promotional video based on user attention. Reactions the ATS may look for may include laughing, applause, cheering, etc.
Real-time Content Improvement
Integrating voting in podcasts
In some examples, podcasts may obtain real-time audience voting by collecting some predefined words, such as ‘yes’ or ‘no.’ This process can give the audience more of a sense of interactivity and of being part of the stream, rather than passively receiving the content.
One example:
A musician streamer plays music for the musician’s viewers in real time. The host asks the audience to choose which song to play out of four options. By uttering ‘A’, ‘B’, ‘C’ or ‘D’, or pointing to the option, listeners can choose their options and have their attention detected by their ATS. The responses are then aggregated immediately. The host is informed of the results from the listeners. The results report that 67.5% of viewers prefer A, 20.1% of them would like to hear C, and the other options are below 10%. The host decides to play song A for the audience.
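A minimal sketch of the vote aggregation step in such an example is given below in Python; the response format and option labels are assumptions, and in practice the responses would come from the ATS's utterance or gesture detectors.

```python
from collections import Counter

def tally_votes(detected_responses, options=("A", "B", "C", "D")):
    """Aggregate utterances or pointing gestures detected by listeners' ATSs
    into vote percentages for the host."""
    counts = Counter(r for r in detected_responses if r in options)
    total = sum(counts.values())
    if total == 0:
        return {opt: 0.0 for opt in options}
    return {opt: round(100.0 * counts[opt] / total, 1) for opt in options}

# Each element is the option inferred for one listener (speech or gesture).
print(tally_votes(["A", "A", "C", "A", "B", "A", "D", "A", "C", "A"]))
# -> {'A': 60.0, 'B': 10.0, 'C': 20.0, 'D': 10.0}
```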
Managing User Attention Across Media and Locations
Obtaining real-time user attention metrics opens up many possibilities. In this section we will list several use cases of what may be achieved with such attention information when combined with contextual information, and/or by sharing the attention information across devices, media types, etc. We will primarily focus on location as a form of contextual information. However, other contextual information types are applicable as well, such as weather, local holidays and events, etc. Many examples covered in this section are centered around the idea of continually customizing experiences across devices, media types and locations using attention information as detected by an ATS.
This section covers use cases of user attention across media and locations. We use “content” to mean any form of consumable media such as TV shows, radio, bus-stop advertisements, billboards, satnav, etc. In addition to traditional forms of media, we also consider driving or riding in a vehicle to be a form of content, where users can have a level of attention. We discuss what can be achieved when sharing attention information across different media types. When we discuss using attention information across media types, we are referring to the idea that user reactions may be considered in conjunction with their reactions as seen in other media. This conjunction may help to determine user preferences and interests more robustly. Furthermore, such methods may allow for an ATS-based system to use previously determined user preferences and interests to inform decisions in a different medium. Geographical information is an invaluable resource and provides an extra dimension for geo-sensitive actions, such as recommendations for stores, restaurants and events. In some examples, information collected using an ATS may be combined with location information to enrich the decisions made and/or to create new applications that would not be possible otherwise.
There are at least three elements of the system that can assist in the use cases we’ve outlined below. They are:
- Obtained information: This could include sets of content, geographical and detected attention information. For instance, a billboard is a form of content, the user’s location is a form of geographical information, and a user’s attention is captured by an ATS.
- Device platform: Example device platforms include car, TV, mobile, etc.
- Actions: Recommendation (for example, as discussed in the ‘Recommendation’ section), highlighting, purchasing, etc.
Based on these three elements, different scenarios and use cases are listed below.
Example Use Cases
Tracking user preferences across device and media
As detailed in ‘Determining a user’s preferences’ section, individual users can have their interests and preferences tracked using ATS-enabled devices. This can be expanded where a user’s interests can be determined across multiple ATS-enabled devices and across media types. Furthermore, when users’ preferences are being tracked across devices and modes, their learnt interests can be used to inform decisions on a new device and/or medium. A user’s preferences may be determined on the user’s devices or in the cloud. Their preferences may, in some examples, be synchronized using the metadata stream of content or otherwise. Such preferences may also be stored on the user’s profile being used at the time the ATS is running (e.g., Google profile, FB profile, Dolby ID, etc.)
The user preferences, as determined by an ATS, may also be shipped to devices (for example, using a cloud-based profile) which are not ATS-enabled, to allow for personalised experiences where an ATS is not available.
Specific examples:
Jane watches a movie featuring a music sequence. She sings along with the song that is played, which is detected as positive attention to the music on her ATS-enabled device(s). Her preferences are updated accordingly. The next day when she is in the car, an auto-generated playlist takes into consideration songs similar to the one to which she responded positively on the previous day.
In a virtual-reality (VR) game, Bob decides to make his character have a medieval appearance. The VR system is ATS-enabled and detects Bob enjoying the game more whilst having the medieval appearance. This suggests that Bob might have an interest in medieval themes. When Bob is looking for his next audio book to consume, some books with medieval themes are added to the list of recommendations.
While Steve listens to a professional basketball game in his car, he seems very unhappy every time the Lakers score. When he gets home and turns on the TV, it automatically switches to the TV channel playing the game, and automatically selects a commentary stream designed to appeal to Warriors fans because Steve does not seem to like the Lakers.
Highlight local interests based on user attention
A user’s interests and preferences, as determined by an ATS, can be used to provide the user with highlights (e.g., notifications, a pin on a map) of nearby interests (e.g., a landmark, store or venue). The user’s interests and preferences may be determined using the methods described in the ‘Determining a user’s preferences’ section or in the ‘Tracking user preferences across device and media’ section. Combining these user preferences with location information may allow for a recommendation-like system. Some examples of such recommendations include highlighting a location on a satnav, informing the user they are driving past a location aligned with user interests, etc.
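One hypothetical way such highlighting could be computed is sketched below in Python: points of interest within a radius of the user are matched against the user's learnt interest scores, and sufficiently strong matches are returned for pinning on a map. The interest tags, thresholds and distance approximation are illustrative assumptions.

```python
import math

def nearby_highlights(user_interests, points_of_interest, user_lat, user_lon,
                      radius_km=5.0, min_affinity=0.5):
    """Return points of interest within `radius_km` whose tags match the
    user's learnt interests strongly enough to be pinned on a map."""
    def distance_km(lat1, lon1, lat2, lon2):
        # Equirectangular approximation; adequate for short distances.
        x = math.radians(lon2 - lon1) * math.cos(math.radians((lat1 + lat2) / 2))
        y = math.radians(lat2 - lat1)
        return 6371.0 * math.hypot(x, y)

    highlights = []
    for poi in points_of_interest:
        affinity = max(user_interests.get(tag, 0.0) for tag in poi["tags"])
        if (affinity >= min_affinity and
                distance_km(user_lat, user_lon, poi["lat"], poi["lon"]) <= radius_km):
            highlights.append((poi["name"], affinity))
    return sorted(highlights, key=lambda item: -item[1])

interests = {"four_wheel_driving": 0.9, "watches": 0.2}   # learnt by the ATS
pois = [{"name": "4WD dealership", "tags": ["four_wheel_driving"],
         "lat": -33.87, "lon": 151.21},
        {"name": "Watch boutique", "tags": ["watches"],
         "lat": -33.88, "lon": 151.20}]
print(nearby_highlights(interests, pois, -33.87, 151.21))
```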
Specific examples:
A user has been detected to be a fan of an artist. The user is informed when they drive past a location at which the artist has previously performed or at which the artist will soon perform.
Whilst a user watches a four-wheel driving show, the user reacts strongly to one brand of vehicle in particular. When the user drives past a retailer for the car, the shop is highlighted on the map. The car’s voice assistant also offers to set up a test drive appointment.
A user has watched many history shows about Pompeii with interest. When the user drives past an exhibit on Pompeii, the vehicle notifies the user, and the satnav highlights a museum being near the route.
A user has indicated an interest in Omega watches whilst watching multiple movies and TV shows in which Omega watches were visible. The user is shown the nearest Omega store when he looks for GPS directions on his phone.
Location linked advertising paired with user attention
New geo-linked advertising strategies may be created, and current ones improved, through the combination of location and user attention information. Examples of current location-based advertising include billboards, advertising on public transport, etc. This strategy could help advertisers to determine if their real-world positioned advertising is being successful, using ATSs. Additionally, this strategy offers new advertising opportunities in which ads may be placed virtually on location.
Real-world positioned advertising can have attention tracked through the use of ATS-enabled devices. For example, many current advertising strategies such as billboards are tied to a location, such as the location at which they are positioned. The ATS can further refine which geo-based location receives more attention, as opposed to generalized data such as traffic data. A user could be detected gazing at said billboard and have other user responses associated with the billboard. The interaction with the specific physical ad may be determined based on GPS data, camera data, microphone data, etc. Furthermore, physical ads could encourage attention responses through interactive mechanisms such as trivia. After responding to the trivia, even with an incorrect answer, an action may be taken.
Virtual advertising may be achieved in a similar manner to the ideas detailed in “Highlight local interests based on user attention,” except that the local interests are motivated by advertising. New advertising strategies are possible, such as every vehicle on the road acting as an advertisement for its own model: if a user is detected attending to another vehicle, the user is known to potentially have an interest in that vehicle. Alternatively, markers could appear on a user’s map for locations that have geo-linked advertising that aligns with the user’s preferences.
The user may be connected with the advertiser in circumstances such as: when the user engages with one of the geo-linked ads; and when the user is nearby a geo-linked ad the user shares an interest with, as determined based on previous attention indications.
In some examples, the successful connection of clients from ATS-enabled advertising may result in car OEMs being provided with a finder’s reward by the advertising sponsor.
Specific examples:
A trivia question is put on a geo-linked billboard ad with the question “What was the first movie with Dolby audio?” to which the answer is “A Clockwork Orange.” A free ticket to the cinema is offered for a correct answer, or a ‘learn more’ link is presented to the user for an incorrect answer.
A user walks past a bus with humorous geo-linked advertising on the side. The user is detected laughing using their ATS-enabled mobile device. The next ad that appears on their phone is for the same product that was advertised on the bus.
Users are detected to be engaged with a billboard advertisement in their car as they start to have a discussion around the topic presented in the ad. Statistics of the billboard’s effectiveness are sent back to the advertiser or the billboard manager to price the billboard differently based on audience attention.
Alice drives past advertising for a live music event. She positively responds to the ad and receives an offer to buy a ticket for the event.
In the same example as above, instead of receiving an offer to buy a ticket, Alice is offered playback of some of the band’s new music in the car.
A user is driving and sees an ad about a concert on a bus, and says, “I did not realise [name] is coming to [place] for a concert and I'm definitely going there.” The ATS-enabled car captures the reaction to the ad using the car’s current location. This is used to offer ticket purchases to the user.
Whilst a user is driving past a bus-stop advertisement for Acme watches, the user says, "Acme watches are nice". An Acme store is then highlighted on the map for the user the next time they are nearby one.
A musical show has paid for a catered virtual geo-linked ad for their upcoming show. If users have interests aligned with musicals or topics addressed by the soon to be played musical, they have the location of the venue pinned on their satnav.
Car brands may decide to have their manufactured cars act as geo-linked ads that can be engaged with. A user sees someone else driving a car they like and says, “those look neat, don’t they?” which is detected by their ATS-enabled device(s). The user then has advertising personalised to their liking of that model of car.
A potential interest in the content of a billboard ad is determined by the user gazing at the billboard each time they drive past it.
Continuous attention and customization across devices and location
This section provides examples of what could be made possible using a combination of the following sections:
Tracking user preferences across device and media;
Highlight local interests based on user attention; and
Location linked advertising paired with user attention.
Dynamic continuous attention tracking optimizations may be possible, such as:
Device A detects positive attention towards something of interest to the user and distributes this knowledge to other devices;
Device B creates a highlight reel for the thing of interest; and
Device C advertises the thing of interest to the user.
Specific example:
A user engages with a geo-linked billboard ad for a watch in their car. Following this, they get served advertising on their smart TV for the same watch brand during a commercial break. Eventually the watch brand store gets highlighted on the user’s map when they start looking for a restaurant on their map app for lunch.
Navigational routing based on user attention
Routes determined by navigation systems may use attention analytics to help inform decisions. This could happen on both a short-term basis (e.g., a user is feeling down and is not in a rush, so take them down the scenic route) and a long-term basis (e.g., optimize citywide traffic based on what holistically results in the highest positive attention level towards driving). Routes may also be personalised to take users on paths that better suit their interests or preferences. User interests and preferences may, for example, be determined by methods described in the ‘Determining a user’s preferences’ section or in the ‘Tracking user preferences across device and media’ section. The recommended routing may also suggest a departure time with the user’s emotional state in consideration, as determined by ATSs.
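As one deliberately simplified illustration, the following Python sketch chooses between candidate routes using an ATS-derived mood score and the user's spare time; the route attributes and decision rule are assumptions rather than a specified algorithm.

```python
def choose_route(routes, mood_score, time_margin_min):
    """Pick a route given the driver's current mood (as estimated by an ATS,
    in [-1, 1]) and how much spare time they have.

    Each route is assumed to carry a travel time and a 'pleasantness' score
    learnt from the user's past attention/enjoyment on similar roads.
    """
    fastest = min(routes, key=lambda r: r["minutes"])
    candidates = [r for r in routes
                  if r["minutes"] - fastest["minutes"] <= time_margin_min]
    if mood_score < 0:
        # Driver seems down or frustrated: prefer the most pleasant option
        # among the routes that still fit the schedule.
        return max(candidates, key=lambda r: r["pleasantness"])
    return fastest

routes = [{"name": "motorway", "minutes": 42, "pleasantness": 0.2},
          {"name": "coast road", "minutes": 55, "pleasantness": 0.9}]
print(choose_route(routes, mood_score=-0.4, time_margin_min=20))
```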
Specific examples:
Peak or “rush” hour in a suburb is particularly problematic. A routing is produced per user, based on the preferences of all the ATS users in the suburb who drive during peak hour. An ATS-enabled navigational system may, in some examples, suggest that some people leave earlier in the morning because they are known to often be up early in the morning, and vice-versa. The navigational system may also suggest that users take different routes based on their enjoyment of certain roads and the proximity of their location and destination. In these ways, the routing of traffic in a suburb may be optimized in a holistic manner.
A user may be recommended to depart a little earlier today, because the traffic and weather are favourable for driving. Moreover, in this instance the user is in a good mood, so the user may be more open to departing sooner. The ATS-informed navigation routing informs the user to take the earlier route with this reasoning supplied.
A user is looking to drive to a destination to which there are two different routes that take a similar amount of time. The user is recommended the path that has stores and/or advertising that is aligned with their interests.
Jeong is going on a road trip. He wants to take the “scenic” route. The ATS-informed navigation system suggests a route which maximizes driving along the coast because Jeong regularly watches ocean documentaries and surfing videos.
Yvonne is going on a similar road trip. Her “scenic” route goes near the highest-rated restaurants and wineries along the way, because she often views cooking shows and often talks about wine in her car.
Trivia utilising an ATS
Through the use of an ATS, many trivia experiences are possible. Note that trivia is a form of content. By taking advantage of location information, context-relevant trivia may optionally be derived. Trivia questions may also be supplied from advertisers.
Specific examples:
Whilst John is driving around town in his ATS-enabled vehicle, trivia is automatically generated for him in an eye-spy-like game. Nearby landmarks and towns may be the basis for trivia questions, such as “When was this town established?”
In the same scenario as above, a light advertising sell is thrown into the mix of the trivia questions that John receives. For example, “What is the only watch brand that …”
The answer may be a watch brand favoured by John, such as Acme.
A trivia game that is centred entirely around advertising may be supplied to the user. This game could act as entertainment for the user and/or may potentially provide the user with a reward. An example reward after a session of trivia advertising might be that the user doesn’t receive any more advertising across their devices for the rest of the day.
Geographical sentiment mapping based on user attention
Using attention information in conjunction with location information could allow for a mapping of areas to sentiment per user. This type of mapping may prove to be insightful for geographical decisions. In some examples, computer programs may post-process such a mapping to help users digest the information. For example, the programs may create summaries, pin locations, etc.
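A minimal sketch of such a mapping is given below in Python, bucketing per-user sentiment observations into coarse grid cells so they can be rendered as a heatmap; the cell size and observation format are illustrative assumptions.

```python
from collections import defaultdict

def sentiment_by_area(observations, cell_size=0.01):
    """Bucket sentiment observations (lat, lon, sentiment in [-1, 1]) into a
    coarse grid so they can be rendered as a per-user sentiment heatmap."""
    cells = defaultdict(list)
    for lat, lon, sentiment in observations:
        key = (round(lat / cell_size), round(lon / cell_size))
        cells[key].append(sentiment)
    return {key: sum(vals) / len(vals) for key, vals in cells.items()}

# Synthetic observations from an ATS-enabled vehicle during house hunting.
obs = [(-33.871, 151.212, 0.8), (-33.872, 151.213, 0.6),   # liked this area
       (-33.905, 151.180, -0.7)]                            # disliked this one
print(sentiment_by_area(obs))
```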
Specific examples:
A user seeking to buy a new home is trying to determine an area that might be suitable for them. Geographical sentiment mapping data may prove to be useful because it could provide insights into areas to which the user has previously positively responded (e.g., laughing, excited talking), as compared to areas to which they have had negative responses (e.g., swearing, grunts). The user may be provided with a list of recommended areas where they might like to buy a house.
A couple is looking to buy a home. Each time they inspect a house, they go back to the car to debrief whilst they drive off. The sentiment of each location is tracked by their ATS-enabled vehicle. When the couple is looking to make a final decision, they have the ATS information displayed to them as a dashboard, a heatmap and location pins for the places they visited during their house hunting adventures. Using such an information summary, the couple may be able to make a more informed decision about their feelings regarding possible homes to purchase, without the need for detailed note-taking about their prior impressions.
Enhanced Safety and Wellbeing Through Attention Feedback
Products targeted around wellbeing have become increasingly popular in recent years. Applications intended to monitor and enhance a user’s wellbeing would greatly benefit from information strongly reflective of the user’s current state to infer wellbeing. This type of information could be collected from an Attention Tracking System (ATS).
Attention analytics can be used to enhance the safety of users. Safety systems could be informed by many types of attention information such as mood, attentiveness, positive/negative sentiment, etc. In this section we list examples of how ATS technology could improve the safety and wellbeing of users.
Example Use Cases
User utterances detected by an ATS as proxies for attentiveness
An ATS deployed in the car may be used to determine whether a driver is focussed on the task of driving and is in a level-headed frame of mind. This assessment may be done in real time and/or may be improved over time as an ATS-enabled system learns the user’s behaviours. Learning user behaviours may help to determine what is normal for a particular user and to detect changes from the normal. For example:
• Singing along with music being played in the cabin or laughing at jokes in a podcast may indicate the driver is awake and alert.
• Swearing may indicate the driver is frustrated or suffering from “road rage” and may make impaired decisions.
• A user is known to curse all the time. When this user is detected swearing, there may be less attribution to “road rage.”
• Snoring may indicate the driver has fallen asleep!
• Eyes blinking or head nodding may indicate the driver will soon fall asleep.
Adaptive Acoustic Driver Mindfulness Application
Several smartphone applications exist in the prior art which target mindfulness, meditation, sleep, calm and mental wellbeing. Some disclosed examples provide mindfulness applications that are configured to enhance focus on the task of driving a car. In the proposed techniques and systems, feedback from an ATS is used to inform playback of a variety of acoustic responses, for example from an in-cabin audio system.
Here is an example generated voice response that a driver may hear from such a system:
• “I’ve noticed you’ve sworn at other drivers a lot more than usual this morning. It sounds like you’re a bit frustrated and not focused on driving. Here is some soothing whale song to help you focus.”
Enhancing Driver Focus By Gamifying Attention
In this example a driver may be asked to engage in a simulated trivia quiz or other game designed to reward attentive driving. An insurance company may offer reduced car insurance premiums to drivers who consistently score well at such a game.
For example, a driver may be periodically asked to respond to questions about the road conditions such as:
• “What was the color of the car that just merged from the right lane?”
• “What is the current speed limit?”
• “In how many metres will the left lane end?”
These examples may be extended to additional use cases such as operating heavy machinery, air traffic control etc.
Automatically Recommending Rest
An ATS-enabled system may be configured to detect that the driver is getting more distracted over time, angrier as traffic gets worse, more tired and less focused on the road, etc. In some examples, such a system may be configured to detect these changes using the methods specified in ‘User utterances detected by an ATS as proxies for attentiveness.’ Based on this, the system may recommend that the driver take a break for their own safety and that of any passengers.
Such a system may also personalise a recommendation, for example as described in ‘Personalization and Augmentation using Attention Feedback’ with the ‘Long-Term Personalization’ and ‘Recommendations’ sections, so users are shown a preferable rest option. This might include things such as preferred foods, coffee, service stations, etc. The preference may be determined using a user’s attention to driving after taking a rest break at a particular location. An insurance company may offer reduced car insurance premiums to drivers who consistently follow such recommendations and do not drive whilst impaired.
Keeping the Children Calm in the Back Seat
Children yelling or fighting in the back seat during a car trip may be distracting for a driver. In some examples, an ATS may be configured to detect the sound of yelling, fighting or bored children and automatically switch the content being played from the in-cabin audio system to something more entertaining for children. In some examples the ATS may be configured to ensure a certain target attention level (indicated, for example, by laughing at the jokes, singing along or responding during a call-and-response segment of a children’s podcast). If the attention level of the children falls below some level, in some examples the system may automatically try different content until the children are calm, aiding focus for the car’s driver.
Measuring Mental Health Using Utterances As Proxies
A person’s utterances and other responses may indicate something about their mental health when analysed over several hours or days. For example, an ATS configured to listen for laughter may be able to calculate a mean laughs-per-week metric. It may be possible to convert this to a score that is indicative of a person’s general mental health or overall wellbeing. Deviations may be detected from a user’s previous attention patterns or averages, which could indicate that something has changed in the user’s wellbeing.
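As a simple illustration of such a deviation check, the following Python sketch compares the current week's laugh count against the user's own historical baseline; the data, the choice of a z-score and the interpretation are assumptions for illustration only.

```python
from statistics import mean, pstdev

def wellbeing_deviation(weekly_laughs, current_week):
    """Compare this week's laugh count against the user's own historical
    baseline and return a z-score; large negative values may suggest a
    change worth surfacing (gently) to the user."""
    baseline_mean = mean(weekly_laughs)
    baseline_std = pstdev(weekly_laughs) or 1.0   # avoid division by zero
    return (current_week - baseline_mean) / baseline_std

history = [42, 38, 45, 40, 41, 39]   # laughs per week detected by the ATS
print(round(wellbeing_deviation(history, current_week=12), 2))
# -> a strongly negative z-score, flagging a deviation from this user's norm
```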
Other examples include:
• Frequent swearing may indicate poor mental health.
• A monotone speaking style may indicate poor mental health, particularly when compared against a baseline of a person’s regular speaking style.
• The frequency with which somebody smiles may be indicative of wellbeing.
• The likelihood that a person sings along with an uplifting song may be indicative of positive mental health.
Figure 12 is a flow diagram that outlines one example of a disclosed method. Method 1200 may, for example, be performed by the control system 160 of Figure 1 A, by the control system 160 of Figure 2, or by one of the other control system instances disclosed herein. The blocks of method 1200, like other methods described herein, are not necessarily performed in the order indicated. According to some examples, one or more blocks may be performed in parallel. Moreover, some similar methods may include more or fewer blocks than shown and/or described.
The method 1200 may be performed by an apparatus or system that includes the control system 160 of Figure 1A, the control system 160 of Figure 2, or one of the other control system instances disclosed herein. In some examples, the blocks of method 1200 may be performed by one or more devices within an audio environment, e.g., by an audio system controller (such as what may be referred to herein as a smart home hub) or by another component of an audio system, such as a television, a television control module, a laptop computer, a game console or system, a mobile device (such as a cellular telephone), etc. However, in some implementations at least some blocks of the method 1200 may be performed by one or more devices that are configured to implement a cloud-based service, such as one or more servers.
In this example, block 1205 involves obtaining, by a control system, sensor data from a sensor system during a content presentation. The content presentation may, for example, be a television program, a movie, an advertisement, music, a podcast, a gaming session, a video conferencing session, an online learning course, etc. In some examples, the control system may obtain sensor data from one or more sensors of the sensor system 180 disclosed herein in block 1205. The sensor data may include sensor data from one or more microphones, one or more cameras, one or more eye trackers configured to collect gaze and pupil size information, one or more ambient light sensors, one or more heat sensors, one or more sensors configured to measure galvanic skin response, etc.
According to this example, block 1210 involves estimating, by the control system, user response events based on the sensor data. In some examples, block 1210 may be performed by one or more Device Analytics Engines (DAEs). The response events estimated in block 1210 may, for example, include detected phonemes, emotion type estimations, heart rate estimations, body pose estimations, one or more latent space representations of sensor signals, etc.
In this example, block 1215 involves producing, by the control system, user attention analytics based at least in part on estimated user response events corresponding with estimated user attention to content intervals of the content presentation. In some examples, block 1215 may be performed by an Attention Analytics Engine. The “content intervals” may correspond to time intervals, which may or may not be of a uniform size. In some examples, the content intervals may correspond to uniform time intervals on the order of 1 second, 2 seconds, 3 seconds, etc. Alternatively, or additionally, the content intervals may correspond to events in the content presentation, such as intervals corresponding to the presence of a particular actor, the presence of a particular object, the presentation of a theme, such as a musical theme, jokes, dramatic events, romantic events, scary events, etc.
According to this example, block 1220 involves causing, by the control system, the content presentation to be altered based, at least in part, on the user attention analytics. In some examples, block 1220 may be performed by a content presentation module according to input from an Attention Analytics Engine. Causing the content presentation to be altered may, for example, involve adding or altering a laugh track, adding or altering audio corresponding to responses of one or more other people to the content presentation (or to a similar content presentation), extending or contracting the time during which at least a portion of the content is presented, adding, removing or altering at least a portion of a visual content presentation and/or an audio content presentation, etc. In some examples, altering one or more aspects of audio content may involve adaptively controlling an audio enhancement process, such as a dialogue enhancement process. According to some examples, altering the one or more aspects of audio content may involve altering one or more spatialization properties of the audio content. In some examples, altering one or more spatialization properties of the audio content may involve rendering at least one audio object at a different location than a location at which the at least one audio object would otherwise have been rendered.
In some instances, the content presentation may be altered based, at least in part, on one or more user preferences or other user characteristics. According to some examples, causing the content presentation to be altered may involve causing the content presentation to be personalized or augmented, for example based at least in part on one or more user preferences or other user characteristics. In some examples, causing the content presentation to be personalized or augmented may involve altering audio playback volume, one or more other audio characteristics, or combinations thereof. According to some examples, causing the content presentation to be personalized or augmented may involve altering one or more display characteristics, such as brightness, contrast, etc. In some examples, causing the content presentation to be personalized or augmented may involve altering a storyline, adding a character or other story element, altering a time interval during which a character is involved, altering a time interval devoted to another aspect of the content presentation, or combinations thereof.
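For illustration only, the following sketch overlays explicit user preferences on default playback settings and, optionally, enables dialogue enhancement when estimated attention is low. The setting names and default values are hypothetical; an actual implementation would personalize whichever audio and display characteristics the playback devices expose.

```python
DEFAULT_SETTINGS = {
    "volume_db": -20.0,            # audio playback volume
    "brightness": 0.5,             # display brightness, 0.0 to 1.0
    "contrast": 0.5,               # display contrast, 0.0 to 1.0
    "dialogue_enhancement": False,
}

def personalize_settings(user_preferences, attention_score=None):
    """Apply known user preferences to the default settings and enable
    dialogue enhancement when the current attention score is low."""
    settings = dict(DEFAULT_SETTINGS)
    settings.update({k: v for k, v in user_preferences.items() if k in settings})
    if attention_score is not None and attention_score < 0.3:
        settings["dialogue_enhancement"] = True
    return settings

# Example: a user who prefers a brighter picture and quieter playback, currently inattentive.
settings = personalize_settings({"brightness": 0.8, "volume_db": -30.0}, attention_score=0.2)
```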
According to some examples, causing the content presentation to be personalized or augmented may involve providing personalized advertising content. In some such examples, providing personalized advertising content may involve providing advertising content corresponding to estimated user attention to one or more content intervals involving one or more products or services.
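For illustration only, the sketch below selects advertising content from per-interval attention analytics under the assumption that content intervals have been tagged with the products or services they feature. The tagging scheme and the summed-attention scoring rule are illustrative assumptions, not requirements of the disclosure.

```python
def pick_advertised_product(attention, interval_products):
    """Sum attention over the intervals in which each product or service appears
    and return the best-attended one, or None if no interval was tagged."""
    totals = {}
    for idx, products in interval_products.items():
        for product in products:
            totals[product] = totals.get(product, 0.0) + attention.get(idx, 0.0)
    return max(totals, key=totals.get) if totals else None

# Example: the user attended most to intervals featuring "headphones",
# so headphone advertising would be selected.
scores = {0: 0.2, 1: 0.8, 2: 0.6}
tags = {1: ["headphones"], 2: ["headphones", "soft drink"]}
chosen = pick_advertised_product(scores, tags)   # -> "headphones"
```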
In this example, block 1225 involves causing, by the control system, an altered content presentation to be provided. In some examples, block 1225 may involve providing the altered content presentation on a television screen, one or more loudspeakers, etc.
Some aspects of the present disclosure include a system or device configured (e.g., programmed) to perform one or more examples of the disclosed methods, and a tangible computer readable medium (e.g., a disc) which stores code for implementing one or more examples of the disclosed methods or steps thereof. For example, some disclosed systems can be or include a programmable general purpose processor, digital signal processor, or microprocessor, programmed with software or firmware and/or otherwise configured to perform any of a variety of operations on data, including an embodiment of the disclosed methods or steps thereof. Such a general purpose processor may be or include a computer system including an input device, a memory, and a processing subsystem that is programmed (and/or otherwise configured) to perform one or more examples of the disclosed methods (or steps thereof) in response to data asserted thereto.
Some embodiments may be implemented as a configurable (e.g., programmable) digital signal processor (DSP) that is configured (e.g., programmed and/or otherwise configured) to perform required processing on audio signal(s), including performance of one or more examples of the disclosed methods. Alternatively, embodiments of the disclosed systems (or elements thereof) may be implemented as a general purpose processor (e.g., a personal computer (PC) or other computer system or microprocessor, which may include an input device and a memory) which is programmed with software or firmware and/or otherwise configured to perform any of a variety of operations including one or more examples of the disclosed methods. Alternatively, elements of some embodiments of the inventive system are implemented as a general purpose processor or DSP configured (e.g., programmed) to perform one or more examples of the disclosed methods, and the system also includes other elements (e.g., one or more loudspeakers and/or one or more microphones). A general purpose processor configured to perform one or more examples of the disclosed methods may be coupled to an input device (e.g., a mouse and/or a keyboard), a memory, and a display device. Another aspect of the present disclosure is a computer readable medium (for example, a disc or other tangible storage medium) which stores code for performing (e.g., code executable to perform) one or more examples of the disclosed methods or steps thereof.
While specific embodiments of the present disclosure and applications of the disclosure have been described herein, it will be apparent to those of ordinary skill in the art that many variations on the embodiments and applications described herein are possible without departing from the scope of the disclosure described and claimed herein. It should be understood that while certain forms of the disclosure have been shown and described, the disclosure is not to be limited to the specific embodiments described and shown or the specific methods described.
Various aspects of the present disclosure may be appreciated from the following Enumerated Example Embodiments (EEEs):
EEE 1. A system, comprising: a head unit; a loudspeaker system; a sensor system; and a control system, comprising: one or more device analytics engines configured to estimate user response events based on sensor data received from the sensor system; and a user attention analytics engine configured to produce user attention analytics based at least in part on estimated user response events received from the one or more device analytics engines, the user attention analytics corresponding with estimated user attention to content intervals of a content presentation being provided via the head unit and the loudspeaker system, wherein the control system is configured to: cause the content presentation to be altered based, at least in part, on the user attention analytics; and cause an altered content presentation to be provided by the head unit, by the loudspeaker system, or by the head unit and the loudspeaker system.
EEE 2. The system of EEE 1, further comprising an interface system configured for providing communication between the control system and one or more other devices via a network, wherein the altered content presentation is, or includes, altered content received from the one or more other devices via the network.

EEE 3. The system of EEE 2, wherein causing the content presentation to be altered involves sending, via the interface system, user attention analytics from the user attention analytics engine and receiving the altered content responsive to the user attention analytics.
EEE 4. The system of any one of EEEs 1-3, wherein causing the content presentation to be altered involves causing the content presentation to be personalized or augmented.
EEE 5. The system of EEE 4, wherein causing the content presentation to be personalized or augmented involves altering one or more of audio playback volume, audio rendering location, one or more other audio characteristics, or combinations thereof.
EEE 6. The system of EEE 4 or EEE 5, wherein the head unit is, or includes, a television and wherein causing the content presentation to be personalized or augmented involves altering one or more television display characteristics.
EEE 7. The system of any one of EEEs 3-6, wherein causing the content presentation to be personalized or augmented involves altering a storyline, adding a character or other story element, altering a time interval during which a character is involved, altering a time interval devoted to another aspect of the content presentation, or combinations thereof.
EEE 8. The system of any one of EEEs 3-7, wherein causing the content presentation to be personalized or augmented involves providing personalized advertising content.
EEE 9. The system of EEE 8, wherein providing personalized advertising content involves providing advertising content corresponding to estimated user attention to one or more content intervals involving one or more products or services.
EEE 10. The system of any one of EEEs 3-9, wherein causing the content presentation to be personalized or augmented involves providing or altering a laugh track.
EEE 11. The system of any one of EEEs 1-10, wherein the head unit, the loudspeaker system and the sensor system are in a first environment; and the control system is further configured to cause the content presentation to be altered based, at least in part, on sensor data, estimated user response events, user attention analytics, or combinations thereof, corresponding to one or more other environments.
EEE 12. The system of any one of EEEs 1-11, wherein the control system is further configured to cause the content presentation to be paused or replayed.

EEE 13. The system of any one of EEEs 1-12, wherein the sensor system includes one or more cameras and wherein the sensor data includes camera data.
EEE 14. The system of any one of EEEs 1-13, wherein the sensor system includes one or more microphones and wherein the sensor data includes microphone data.

EEE 15. The system of EEE 14, wherein the control system is further configured to implement an echo management system to mitigate effects of audio played back by the loudspeaker system and detected by the one or more microphones.
EEE 16. The system of any one of EEEs 1-15, wherein one or more first portions of the control system are deployed in a first environment and a second portion of the control system is deployed in a second environment.
EEE 17. The system of EEE 16, wherein the one or more first portions of the control system are configured to implement the one or more device analytics engines and wherein the second portion of the control system is configured to implement the user attention analytics engine.

EEE 18. The system of EEE 4 or EEE 5, wherein the head unit is, or includes, a digital media adapter.

Claims

What Is Claimed Is:
1. A system, comprising: a head unit; a loudspeaker system; a sensor system; and a control system, comprising: one or more device analytics engines configured to estimate user response events based on sensor data received from the sensor system; and a user attention analytics engine configured to produce user attention analytics based at least in part on estimated user response events received from the one or more device analytics engines, the user attention analytics corresponding with estimated user attention to content intervals of a content presentation being provided via the head unit and the loudspeaker system, wherein the control system is configured to: cause the content presentation to be altered based, at least in part, on the user attention analytics; and cause an altered content presentation to be provided by the head unit, by the loudspeaker system, or by the head unit and the loudspeaker system.
2. The system of claim 1, further comprising an interface system configured for providing communication between the control system and one or more other devices via a network, wherein the altered content presentation is, or includes, altered content received from the one or more other devices via the network.
3. The system of claim 2, wherein causing the content presentation to be altered involves sending, via the interface system, user attention analytics from the user attention analytics engine and receiving the altered content responsive to the user attention analytics.
4. The system of any one of claims 1-3, wherein causing the content presentation to be altered involves causing the content presentation to be personalized or augmented.
5. The system of claim 4, wherein causing the content presentation to be personalized or augmented involves altering one or more of audio playback volume, audio rendering location, one or more other audio characteristics, or combinations thereof.
6. The system of claim 4 or claim 5, wherein the head unit is, or includes, a television and wherein causing the content presentation to be personalized or augmented involves altering one or more television display characteristics.
7. The system of any one of claims 3-6, wherein causing the content presentation to be personalized or augmented involves altering a storyline, adding a character or other story element, altering a time interval during which a character is involved, altering a time interval devoted to another aspect of the content presentation, or combinations thereof.
8. The system of any one of claims 3-7, wherein causing the content presentation to be personalized or augmented involves providing personalized advertising content.
9. The system of claim 8, wherein providing personalized advertising content involves providing advertising content corresponding to estimated user attention to one or more content intervals involving one or more products or services.
10. The system of any one of claims 3-9, wherein causing the content presentation to be personalized or augmented involves providing or altering a laugh track.
11. The system of any one of claims 1-10, wherein the head unit, the loudspeaker system and the sensor system are in a first environment; and the control system is further configured to cause the content presentation to be altered based, at least in part, on sensor data, estimated user response events, user attention analytics, or combinations thereof, corresponding to one or more other environments.
12. The system of any one of claims 1-11, wherein the control system is further configured to cause the content presentation to be paused or replayed.
13. The system of any one of claims 1-12, wherein the sensor system includes one or more cameras and wherein the sensor data includes camera data.
14. The system of any one of claims 1-13, wherein the sensor system includes one or more microphones and wherein the sensor data includes microphone data.
15. The system of claim 14, wherein the control system is further configured to implement an echo management system to mitigate effects of audio played back by the loudspeaker system and detected by the one or more microphones.
16. The system of any one of claims 1-15, wherein one or more first portions of the control system are deployed in a first environment and a second portion of the control system is deployed in a second environment.
17. The system of claim 16, wherein the one or more first portions of the control system are configured to implement the one or more device analytics engines and wherein the second portion of the control system is configured to implement the user attention analytics engine.
18. The system of claim 4 or claim 5, wherein the head unit is, or includes, a digital media adapter.
PCT/US2024/036330 2023-07-04 2024-07-01 Attention tracking with sensors Pending WO2025010217A1 (en)

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
CN2023105695 2023-07-04
CNPCT/CN2023/105695 2023-07-04
US202363513318P 2023-07-12 2023-07-12
US63/513,318 2023-07-12
US202463568378P 2024-03-21 2024-03-21
US63/568,378 2024-03-21

Publications (1)

Publication Number Publication Date
WO2025010217A1 true WO2025010217A1 (en) 2025-01-09

Family

ID=91953765

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2024/036330 Pending WO2025010217A1 (en) 2023-07-04 2024-07-01 Attention tracking with sensors

Country Status (1)

Country Link
WO (1) WO2025010217A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120324494A1 (en) * 2011-06-17 2012-12-20 Microsoft Corporation Selection of advertisements via viewer feedback
US20150020086A1 (en) * 2013-07-11 2015-01-15 Samsung Electronics Co., Ltd. Systems and methods for obtaining user feedback to media content
US20230009878A1 (en) * 2019-12-09 2023-01-12 Dolby Laboratories Licensing Corporation Adjusting audio and non-audio features based on noise metrics and speech intelligibility metrics
US20220167052A1 (en) * 2020-11-20 2022-05-26 At&T Intellectual Property I, L.P. Dynamic, user-specific content adaptation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
AES, 60 EAST 42ND STREET, ROOM 2520 NEW YORK 10165-2520, USA, 31 October 2004 (2004-10-31), XP040372628 *

Similar Documents

Publication Publication Date Title
US11687976B2 (en) Computerized method and system for providing customized entertainment content
US11792485B2 (en) Systems and methods for annotating video media with shared, time-synchronized, personal reactions
US10390064B2 (en) Participant rewards in a spectating system
US10632372B2 (en) Game content interface in a spectating system
US10484439B2 (en) Spectating data service for a spectating system
O'donnell Television criticism
EP3712832A1 (en) Computerized method and system for providing customized entertainment content
US20130268955A1 (en) Highlighting or augmenting a media program
US20170001111A1 (en) Joining games from a spectating system
US20170001122A1 (en) Integrating games systems with a spectating system
US20170003740A1 (en) Spectator interactions with games in a specatating system
US20170003784A1 (en) Game effects from spectating community inputs
Tussey The procrastination economy: The big business of downtime
KR20150005574A (en) Controlling the presentation of a media program based on passively sensed audience reaction
EP3316980A1 (en) Integrating games systems with a spectating system
CN105339969A (en) Linked advertisements
US20110145041A1 (en) System for communication between users and global media-communication network
US20140325540A1 (en) Media synchronized advertising overlay
Iversen Audience response to mediated authenticity appeals
US12464204B2 (en) Systems and methods of providing content segments with transition elements
WO2025010217A1 (en) Attention tracking with sensors
Wellink Slow Listening: Streaming Services, the Attention Economy, and Conscious Music Consumption
Parke The Podcast Handbook: Create It, Market It, Make It Great
US20250086674A1 (en) Engagement measurement and learning as a service

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 24743621

Country of ref document: EP

Kind code of ref document: A1