
WO2025196197A1 - Information processing system and method - Google Patents

Information processing system and method

Info

Publication number
WO2025196197A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
content
information
data
user reaction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/EP2025/057656
Other languages
French (fr)
Inventor
Justinas Miseikis
David GONZALEZ HIDALGO
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sony Europe Bv
Sony Group Corp
Original Assignee
Sony Europe Bv
Sony Group Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sony Europe Bv, Sony Group Corp filed Critical Sony Europe Bv
Publication of WO2025196197A1 publication Critical patent/WO2025196197A1/en
Pending legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43Querying
    • G06F16/435Filtering based on additional data, e.g. user or group profiles
    • G06F16/436Filtering based on additional data, e.g. user or group profiles using biological or physiological data of a human being, e.g. blood pressure, facial expression, gestures

Definitions

  • the present disclosure generally pertains to an information processing system and method for providing content insight information regarding media content.
  • the present disclosure is directed to provide the content insight information based on crowdsourced feedback.
  • Large Language Models (LLMs)
  • LLMs are a type of artificial intelligence (AI) that can perform multiple natural language processing (NLP) tasks.
  • LLMs are built based on deep learning and are trained on vast datasets containing text from the Internet. LLMs can generate human-like text, answer questions, translate languages, summarize documents, and engage in contextually relevant conversations. Further, LLMs can generate computer code which can be used by other programs to visualise and generate various multimodal results. Some examples include using the generated code for graph plotting, providing a calculator, generating Python code (including all the libraries), speech generation, image generation, and the like. From content generation and recommendation systems to virtual assistants and customer service solutions, LLMs have the potential to impact numerous aspects of daily life.
  • the present invention is related to an aggregation of multiple users' reactions, comments, and attention while consuming media content to provide information (content insight information) on the engagement and interest in a crowdsourced manner.
  • An information processing system for providing content insight information regarding media content
  • the information processing system comprises circuitry configured to: acquire content data related to media content; acquire user data which is indicative of a user behaviour of the user consuming the media content; input the content data and the user data to a user reaction neural network to generate user reaction information of the user, wherein the user reaction information is indicative of a user reaction related to the media content; and aggregate user reaction information of a plurality of users to generate the content insight information.
  • the present disclosure provides a method for providing content insight information regarding media content, the method comprising: acquiring content data related to media content; acquiring user data which is indicative of a user behaviour of the user consuming the media content; inputting the content data and the user data to a user reaction neural network to generate user reaction information of the user, wherein the user reaction information is indicative of a user reaction related to the media content; and aggregating user reaction information of a plurality of users to generate the content insight information.
  • Fig. 1a shows a media device including a user reaction engine for generating user reaction information
  • Fig. 1b shows a system including a plurality of media devices and a server
  • Fig. 2 is a block diagram of an exemplary configuration of the user reaction engine
  • Fig. 3 is a block diagram of an exemplary configuration of the server
  • Fig. 4 shows a method for generating the user reaction information
  • Fig. 5 shows a method for generating the content insight information
  • an information processing system for providing content insight information regarding media content comprises circuitry configured to: acquire content data related to media content; acquire user data which is indicative of a user behaviour of the user consuming the media content; input the content data and the user data to a user reaction neural network to generate user reaction information of the user, wherein the user reaction information is indicative of a user reaction related to the media content; and aggregate user reaction information of a plurality of users to generate the content insight information.
  • the media content can be video media content such as a film, a TV show, a video, or the like. Further, the media content can be audio media content such as a podcast, an audio book, or the like.
  • the media content can be output on a media device.
  • the media device can be a smart TV, a tablet, a mobile phone, AR/VR glasses, smartwatch, projector, or any other device configured to output the media content.
  • the media device may comprise an interface to access the Internet.
  • the media device may stream media content provided on streaming portals or download the media content and store it on a storage of the media device.
  • the media device (and/or a later described server) can be configured to access, via APIs, third party applications.
  • the media content is indicated by the content data. That is, the content data represents or is related to the media content.
  • the content data comprises at least one of content image data, content text data, and content audio data.
  • a video media content may usually comprise image data and audio data.
  • a podcast may usually comprise audio data.
  • “Acquiring the content data” means that the content data is either streamed or broadcasted to the media device. Alternatively, the content data is downloaded to the media device. In other words, the media device is configured to obtain or receive the content data.
  • the user data is indicative of the user behaviour. That is, the user behaviour can be derived from the user data.
  • a user behaviour may comprise an action, a reaction, or an intention of the user.
  • the user behaviour may be a gesture, a gaze, a position, a speech (e.g. a comment, a question, an inquiry, etc.), a mood, and the like of the user.
  • the user data comprises at least one of user image data, user text data, and user audio data.
  • “Acquiring the user data” includes that the media device is configured to capture the user data.
  • the media device may comprise a camera to capture an image.
  • the camera then outputs the image user data corresponding to the captured image.
  • the camera may be configured as an RGB camera or an event camera (also “event-based camera”).
  • a presence sensor such as a mmWave sensor which uses short-wavelength electromagnetic waves may be used to capture user data.
  • the media device may comprise a microphone to capture a speech of the user and the microphone then outputs the audio user data corresponding to the speech.
  • the media device may comprise input means capturing text data entered by the user in the input means.
  • the input means may be wired to or wirelessly coupled with the media device.
  • the input means can be a remote controller (such as a conventional TV remote controller), gaming pad, touchpad on a screen of the media device, etc.
  • the content data and the user data are input to the user reaction neural network to generate the user reaction information.
  • the user reaction neural network is trained to understand how the user is reacting to the media content.
  • the user reaction neural network has multimodal capabilities in that it may process audio data, text data and image data to generate the user reaction information.
  • the user reaction neural network may be trained to perform NLP, object segmentation and action recognition.
  • the neural network may be an LLM.
  • the LLM may be based on a transformer-based neural network architecture as described in A. Vaswani et al., “Attention is all you need”, NIPS 2017.
  • the user’s reaction can be mapped to the content data.
  • the captured user data is mapped to the context of the content shown, captured, and encoded using AI methods. This can be done by adding an encoding layer or by capturing certain parameters in the latent space of the AI or deep learning models.
  • the user reaction information is indicative of the user reaction related to the media content. Therefore, the user reaction information includes at least one reaction of the user related to the media content.
  • User reactions may include visual user reactions (e.g., gesture, facial expression, etc.), audible user reactions (e.g., speech, noises, etc.) and text input user reactions (e.g., text messages input via input means).
  • the user reaction is mapped to the media content, e.g. by mapping the user reaction to a point in time (or timestamp) of the media content. If the user reaction information comprises a plurality of user reactions, each of the user reactions is mapped to a respective timestamp of the media content.
  • the user reaction information of a plurality of users is aggregated to generate the content insight information.
  • the aggregation of the user reaction information of the plurality of users effectuates a crowdsourced feedback regarding the same media content.
  • the media content is (pre-)analyzed on the media device to provide metadata of what is happening in the media content.
  • the metadata include specific people shown, a game or sports match including current statistics, advertisements (either direct, or background posters, or placed advertising in the media content), specific events, objects, etc.
  • the reactions may include speech or phrases, pausing or starting the media content (where applicable), body language and facial expression.
  • Such user reaction information is collected from multiple media devices. Once transmitted to the server, the user reaction information of a plurality of users is aggregated by mapping user reactions from many users about the same portions of the same media content, typically mapped to specific timestamps of the media content. When aggregating the user reactions, general insights are derived on which portions of the media content cause certain reactions, which parts were not interesting, some highlights of the content and even the most and least favourite moments. Similarly, for advertising purposes, user engagement could be evaluated.
  • the information processing system is configured to collect (acquire) and aggregate crowdsourced feedback of the viewers’ actions and reactions to the content being played (user reaction information of a plurality of users regarding the same media content).
  • the user reaction is mapped directly to the content and context of the video, thus giving direct natural reactions of people watching the video. This provides statistically significant data and insights into the state of viewers consuming certain media content from all around the world in form of the content insight information.
  • a system that allows the collection, aggregation, and anonymization of natural reactions of the users regarding the media content that they are consuming on their media devices, where the context would be known or analysed, for further engagement and interaction activities as well as insights for producers and copyright holders to understand the reaction of the audience to their media content.
  • the collection of user reaction information and aggregation thereof could be done either in real time or with a certain delay, depending on the application.
  • aggregating the user reaction information of the plurality of users comprises: retrieving the user reaction information of a plurality of users from a user reaction database as context information; and inputting the context information to an insight neural network to generate the content insight information.
  • the user reaction database comprises a plurality of user reaction information of a plurality of users regarding a plurality of media contents. Each of the plurality of user reaction information may be captured by a respective media device used by at least one user. The plurality of media contents comprises different video contents and audio contents.
  • the user reaction database may be configured as a vector database which stores the plurality of user reaction information as embedded data (embeddings) which can be used in LLMs.
  • the retrieval of the user reaction information of a plurality of users is based on a content insight query which queries for providing a content insight information regarding a particular media content.
  • the query is passed to the user reaction database and, as a result, the user reaction information of a plurality of users is retrieved and then used as context information for the insight neural network.
  • the insight neural network is configured to provide the content insight information for the media content.
  • the insight neural network receives the context information (comprising the user reaction information of a plurality of users regarding the media content) and the content insight query and generates the content insight information.
  • the insight neural network can be an LLM, such as a transformer-based Large Language Model (LLM). Further, the above description of the user reaction neural network applies correspondingly to the insight neural network.
  • Large Language Model (LLM)
  • the retrieval of the user reaction information of a plurality of users and generating the content insight information based on the retrieved information corresponds to a retrieval augmented generation (RAG)-based LLM application. It is noted that other forms of LLM applications are also possible to generate the content insight information based on the user reaction information of a plurality of users.
  • the content insight information may include the user reaction information which is time-indexed regarding the media content.
  • each user reaction included in the user reaction information is associated with a point in time (or timestamp) of the media content.
  • each user reaction can be associated with a respective portion of the media content.
  • sad user reactions like crying can be associated with a portion of a video content which depicts a sad scene
  • happy user reactions like smiling can be associated with another portion of the video content which depicts a funny scene, wherein these scenes can be indicated with corresponding timestamps. That is, the user reaction is time-indexed with respect to a portion of the media content.
  • the user reaction neural network is a transformer-based LLM.
  • the insight neural network can be a transformer-based LLM.
  • the circuitry is further configured to input the content data as media content embeddings and the user data as user embeddings to the neural network.
  • the media content embeddings include media content vision embeddings and media content text embeddings.
  • the user embeddings include user vision embeddings and user text embeddings.
  • text embeddings are a numerical representation of text, used for measuring semantic similarities and aiding in context-based AI interaction.
  • text embeddings are vectors or arrays of numbers which represent the meaning of text data. The text embeddings are then used by the neural network.
  • vision embeddings are numerical representations of images that encode the semantics of the image data.
  • vision embeddings are vectors or arrays of numbers which represent the meaning and the context of the image data.
  • the media content embeddings are generated based on the content data. More specifically, the media content vision embeddings are generated based on the content image data and the media content text embeddings are generated based on the content text data and/or on the content audio data.
  • the user vision embeddings are generated based on the user image data and user text embeddings are generated based on the user text data and/or on the user audio data.
  • the media device may comprise a vision encoder which is configured to generate the vision embeddings (i.e. the media content vision embeddings and the user vision embeddings) based on the image data (i.e. the content image data and the user image data, respectively).
  • the media device may comprise a text embedder which is configured to generate the text embeddings (i.e. the media content text embeddings and the user text embeddings) based on the text data (i.e. the content text data and the user text data, respectively).
  • the media device may comprise a speech-to-text encoder which is configured to convert speech included in audio data into text data which can be subsequently fed to the text embedder to generate text embeddings. More specifically, speech included in the content audio data may be converted into content text data by the speech-to-text encoder. The converted content text data can then be input to the text embedder which generates media content text embeddings based on the converted content text data. That is, the content text embeddings are generated based on the content audio data.
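  • As a purely illustrative aside on what such embeddings are in practice (the vectors below are made up and not taken from the disclosure), text and vision embeddings are fixed-length numerical vectors whose cosine similarity approximates semantic similarity:

```python
# Illustration only: toy embedding vectors and cosine similarity.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

emb_goal   = np.array([0.9, 0.1, 0.30])  # "the player scores a goal"
emb_score  = np.array([0.8, 0.2, 0.35])  # "a goal is scored"
emb_recipe = np.array([0.1, 0.9, 0.20])  # "how to bake bread"

print(cosine_similarity(emb_goal, emb_score))   # high value -> similar meaning
print(cosine_similarity(emb_goal, emb_recipe))  # low value  -> unrelated meaning
```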
  • the media content is a video content or an audio content.
  • the user reaction includes at least one of a speech, a gesture, a text input, and a facial expression of the user.
  • the circuitry is further configured to acquire the user reaction information of a plurality of users from multiple media devices.
  • Each of multiple media devices is used by a corresponding user/group of users.
  • user reaction information of different users can be obtained.
  • the user data is anonymised. For example, when obtaining the user image data and the user audio data, any personalised features (such as identity, face, voice timbre, age, etc. of the user) are removed or filtered before inputting the user data to the user reaction neural network. For example, in a case where the user points to a certain object shown on the screen of the media device, only the information indicating which object the user’s attention is directed to and a transcription of the speech are fed to the user reaction neural network. Thus, no personal user information (like direct voice inputs, or images of the user as captured by the media device) is used as input.
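  • A minimal sketch of such an anonymisation step is shown below; detect_attended_object() and transcribe() are hypothetical helpers standing in for whatever detectors the media device uses, and only derived, non-identifying signals are forwarded:

```python
# Hedged sketch: forward only derived signals, discard raw frames and raw audio.
def anonymise_user_data(raw_frame, raw_audio, detect_attended_object, transcribe):
    anonymised = {
        "attended_object": detect_attended_object(raw_frame),  # e.g. "poster, left side"
        "speech_text": transcribe(raw_audio),                  # text only, no voice sample
    }
    # raw_frame and raw_audio are intentionally not returned, stored or transmitted
    return anonymised
```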
  • the information processing system comprises the media device outputting the media content and a server, wherein the media device is configured to acquire the content data and the user data and to generate the user reaction information.
  • the server is configured to aggregate the user reaction information of the plurality of users and to generate the content insight information.
  • the media device generates the user reaction information and transmits the user reaction information to the server for further processing.
  • the method for providing content insight information regarding media content comprises: acquiring content data related to media content; acquiring user data which is indicative of a user behaviour of the user consuming the media content; inputting the content data and the user data to a user reaction neural network to generate user reaction information of the user, wherein the user reaction information is indicative of a user reaction related to the media content; and aggregating user reaction information of a plurality of users to generate the content insight information.
  • the method may be carried out with the information processing system comprising the circuitry discussed herein.
  • aggregating the user reaction information of the plurality of users comprises retrieving the user reaction information of a plurality of users from a user reaction database as context information, and inputting the context information to an insight neural network to generate the content insight information.
  • the content insight information includes the user reaction information which is time-indexed regarding the media content.
  • the user reaction neural network is a transformer-based Large Language Model, LLM.
  • the media content is a video content or an audio content.
  • the user reaction includes at least one of a speech, a gesture, a text input, and a facial expression.
  • the method further comprises acquiring the user reaction information from multiple media devices.
  • the user data is anonymised.
  • the user reaction information is generated based on content data indicative of the media content and user data indicative of a user behaviour.
  • acquiring the content data and the user data and generating the user reaction information are performed by a media device, and aggregating the user reaction information of the plurality of users and generating the content insight information are performed by a server.
  • the methods as described herein are also implemented in some embodiments as a computer program causing a computer and/or a processor to perform the method, when being carried out on the computer and/or processor.
  • a non-transitory computer-readable recording medium is provided that stores therein a computer program product, which, when executed by a processor, such as the processor described above, causes the methods described herein to be performed.
  • a media device 100 which includes a user reaction engine 200, a screen 101 and/or a speaker 103 to output media content, a camera 105, a microphone 107, input means 109, and a communication module 111. Further, a user 110 is shown who consumes media content output by the media device 100. In some examples, there can also be multiple users 110 consuming the media content output by the media device 100.
  • the camera 105 and the microphone 107 are configured to capture user data 203 regarding the behaviour of the user 110.
  • the camera 105 is configured to capture image data regarding the user 110 (user image data 203a).
  • the camera 105 is configured to capture a user’s gesture and to provide user image data indicative of the user’s gesture.
  • the user’s gesture may be a hand gesture (e.g. pointing to a position or object on the screen of the media device 100), a head gesture (e.g. nodding) and the like.
  • the camera 105 is further configured to capture a gaze or a gaze direction of the user 110 and based on the captured image data indicative of the gaze or the gaze direction, it can be determined to which portion of the screen of the media device 100 the user’s gaze is directed.
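  • As a rough illustration of this gaze mapping (the helper names and the 3x3 screen grid are assumptions, not taken from the disclosure), a gaze point in normalised screen coordinates could be mapped to a coarse screen region and matched against the content metadata of what is shown there:

```python
# Illustration only: map a normalised gaze point to a coarse screen region.
def gaze_to_region(gaze_x, gaze_y, cols=3, rows=3):
    col = min(int(gaze_x * cols), cols - 1)  # gaze_x, gaze_y are in [0, 1]
    row = min(int(gaze_y * rows), rows - 1)
    return row, col

# Hypothetical mapping from screen regions to objects, derived from content metadata.
region_objects = {(0, 2): "scoreboard", (1, 1): "main actor"}
print(region_objects.get(gaze_to_region(0.55, 0.48), "unknown"))  # -> "main actor"
```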
  • the microphone 107 is configured to capture audio data regarding the user 110 (user audio data 203c) and content audio data output by the media device 100.
  • the microphone 107 may capture, as user audio data, speech.
  • the media device 100 may comprise input means 109 configured to receive input by the user 110.
  • the input means 109 can be integrated into the media device 100 such as touch pad.
  • the input means 109 can be remotely arranged from but coupled to the media device 100, such as a remote controller, keyboard and mouse, or gaming controller.
  • the user 110 may enter user text data 203b via the input means 109.
  • the media device 100 and the user reaction engine 200 are connected or have access to the Internet 120.
  • media content can be streamed on or downloaded to the media device 100.
  • the Internet connection enables the media device 100 and the user reaction engine 200 to access databases, use cloud storages, use search engines (e.g. Google, Bing, etc.), etc.
  • the user reaction engine 200 is configured to generate user reaction information 205 related to the media content output on the media device 100 based on the content data 201 and the user data 203.
  • the user reaction information 205 is indicative of a reaction of the user 110 related to the output media content.
  • the user reaction information 205 can then be transmitted via the Internet 120 to the server 300 as shown in Fig. lb.
  • the server 300 is connected via the Internet 120 with a plurality of media devices 100a, 100b, 100c, 100n, each of which is configured as the media device 100 as described with respect to Fig. 1a.
  • Each of the plurality of media devices 100a, 100b, 100c, 100n can transmit their respectively generated user reaction information to the server 300.
  • the server 300 is configured to generate content insight information 315 as will be described later with respect to Figs. 3 and 5.
  • Fig. 2 shows the configuration of the user reaction engine 200.
  • the user reaction engine 200 is configured to receive the content data 201 and the user data 203 as input data and processes the input data to generate the user reaction information 205. That is, the user reaction engine 200 analyses the media content (indicated by the content data 201) and the behaviour of the user 110 (indicated by the user data 203) to generate the user reaction information 205.
  • the content data 201 is indicative of the media content output on the media device 100.
  • the content data 201 may comprise at least one of content image data 201a, content text data 201b, and content audio data 201c. Additionally, the content data 201 may comprise metadata regarding the media content (content metadata).
  • the user reaction engine 200 may be able to understand what is output on the media device 100 and may provide context regarding the media content. That is, the user reaction engine 200 is configured to analyse the content data 201 to obtain descriptive text information regarding the media content.
  • the user data 203 is indicative of a feedback of the user 110.
  • the user data 203 may comprise the user image data 203a and the user audio data 203c captured with the camera 105 and the microphone 107, respectively. Further, the user data 203 may also comprise the text input by the user 110 (user text data 203b) through the input means 109.
  • the user reaction engine 200 comprises a speech-to-text encoder 210, a tokenizer 220, a text embedder 230, a vision encoder 240, a conversion component 250 and an LLM 260 (user reaction neural network).
  • the user reaction engine 200 is configured to process the content data 201 and the user data 203.
  • processing of the content data 201 is explained with respect to Fig. 2.
  • the processing of the user data 203 is performed correspondingly.
  • the speech-to-text encoder 210 is configured to convert spoken language into text form. That is, the speech-to-text encoder 210 is configured to receive the content audio data 201c indicative of spoken language and to convert the content audio data 201c into text data expressed as a sequence of text. Configurations of the speech-to-text encoder 210 are known to the skilled person and may comprise, for example, an attention-based encoder-decoder framework or a transducer framework.
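  • As a hedged example of such a speech-to-text encoder, the open-source Whisper model (an attention-based encoder-decoder) could be used; the disclosure does not name a specific model, so this is purely an illustrative stand-in for the speech-to-text encoder 210:

```python
# Minimal speech-to-text sketch using the open-source "openai-whisper" package.
# The audio file path is a placeholder.
import whisper

model = whisper.load_model("base")                 # small pretrained model
result = model.transcribe("content_clip.wav")      # convert spoken language into text
content_text_data = result["text"]                 # text passed on to the tokenizer 220
print(content_text_data)
```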
  • the tokenizer 220 is configured to receive, as input, the content text data 201b. Further, the input to the tokenizer 220 may also be the converted text data output by speech-to-text encoder 210.
  • the tokenizer 220 is configured to convert the content text data 201b into tokens. That is, the tokenizer 220 is configured to break a text sequence comprised in the content text data 201b into smaller parts, i.e. tokens. For example, a sentence “One ring to rule them all” comprised in the content text data 201b is tokenized into the individual words “one”, “ring”, “to”, “rule”, “them”, “all”.
  • the output of the tokenizer 220 is a sequence of tokens that represents the text included in the content text data 201b.
  • the tokens can additionally comprise subwords, signs, and other small units, which collectively represent the sentence.
  • the text embedder 230 is configured to receive, as input, the tokens generated by the tokenizer 220 based on the content text data 201b and to generate (output) text embeddings indicative of the content text data 201b. The text embeddings are then used by the LLM 260 as described later.
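  • A minimal sketch of the tokenizer 220 and text embedder 230 is shown below, using a generic pretrained transformer from the Hugging Face transformers library as an illustrative stand-in (the disclosure does not specify a model); it reproduces the example sentence above:

```python
# Illustrative tokenizer/text-embedder stand-ins; BERT is an assumption, not the
# model named by the disclosure.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
embedder = AutoModel.from_pretrained("bert-base-uncased")

sentence = "One ring to rule them all"
print(tokenizer.tokenize(sentence))        # ['one', 'ring', 'to', 'rule', 'them', 'all']

inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    hidden = embedder(**inputs).last_hidden_state  # (1, seq_len, 768)
text_embedding = hidden.mean(dim=1)                # simple mean pooling -> (1, 768)
```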
  • the vision encoder 240 is configured to receive the content image data 201a.
  • the vision encoder 240 is configured to generate (output) vision embeddings based on the received content image data 201a.
  • the vision embeddings are then used by the LLM 260 as described later.
  • the vision embeddings include temporal information for time indexing the vision embeddings.
  • the conversion component 250 is configured to receive, as input, the vision embeddings generated by the vision encoder 240, and to convert the vision embeddings into a format that the LLM 260 can understand and process. Thereby, the LLM 260 can jointly process the (converted) vision embeddings and the text embeddings.
  • the conversion component 250 is configured to transform the vision embeddings, i.e. convert the output of the vision encoder 240 into the format that the LLM 260 can understand, to align modalities such that the vision embeddings and text embeddings are compatible for joint processing by the LLM 260.
  • the conversion component 250 is configured to resample the vision embeddings to ensure that vision embeddings have a common fixed length, making them suitable for further interactions with the LLM 260 (ensuring so called “Fixed-Length Representations”).
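  • A possible sketch of such a conversion component is shown below; the embedding dimensions, the adaptive pooling and the linear projection are assumptions chosen for illustration only:

```python
# Sketch of a conversion component 250: resample a variable number of vision
# embeddings to a fixed length and project them into the LLM's embedding space.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConversionComponent(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096, num_tokens=32):
        super().__init__()
        self.num_tokens = num_tokens
        self.proj = nn.Linear(vision_dim, llm_dim)      # align modalities

    def forward(self, vision_embeddings):               # (T, vision_dim), T varies
        x = vision_embeddings.T.unsqueeze(0)             # (1, vision_dim, T)
        x = F.adaptive_avg_pool1d(x, self.num_tokens)    # fixed-length representation
        x = x.squeeze(0).T                               # (num_tokens, vision_dim)
        return self.proj(x)                              # (num_tokens, llm_dim)

converted = ConversionComponent()(torch.randn(57, 1024))  # -> shape (32, 4096)
```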
  • the LLM 260 is configured to receive, as input, the text embeddings generated by the text embedder 230 and the (converted) vision embeddings output by the vision encoder 240 in order to generate the user reaction information 205.
  • the LLM 260 can also be configured as any other transformer-based neural network.
  • the LLM 260 is configured to encode the embeddings by using artificial intelligence methods.
  • encoding can be done by adding an encoding layer to the LLM 260.
  • the encoding may comprise a latent space encoding as known in the art, which captures certain parameters in the latent space of the LLM 260 (or, generally, in the latent space of the AI or the deep learning model).
  • the LLM 260 is configured to generate the user reaction information 205.
  • the user reaction information 205 comprises the user reaction with respect to the media content.
  • the user reaction is associated with the portion of the media content. The association is based on the temporal information included in the text embeddings and vision embeddings. In other words, the user reaction is associated with a timestamp of the media content.
  • the user reaction information 205 may comprise a plurality of user reactions, each of which is associated with a corresponding one of a plurality of portions of the media content. That is, a first user reaction is associated with a first timestamp of the media content, a second user reaction is associated with a second timestamp of the media content, and so on.
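  • One possible (assumed) schema for the user reaction information 205 transmitted to the server is sketched below; the disclosure does not fix a concrete data format:

```python
# Hypothetical record layout for user reaction information 205.
from dataclasses import dataclass, field

@dataclass
class UserReaction:
    timestamp_s: float        # point in time of the media content
    reaction_type: str        # e.g. "smile", "speech", "pause", "gesture"
    detail: str = ""          # e.g. transcribed (anonymised) comment

@dataclass
class UserReactionInformation:
    media_content_id: str
    device_id: str            # pseudonymous device identifier, no personal data
    reactions: list[UserReaction] = field(default_factory=list)

info = UserReactionInformation(
    media_content_id="movie-XY",
    device_id="device-0420",
    reactions=[
        UserReaction(95.0, "smile", "funny scene"),
        UserReaction(1322.5, "speech", "who is that actor?"),
    ],
)
```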
  • the communication module 301 is configured to receive a plurality of user reaction information 205a, 205b, 205c, 205n generated by and transmitted from the plurality of media devices 100a, 100b, 100c, 100n.
  • Each of the plurality of user reaction information 205a, 205b, 205c, 205n comprises the type of information as described with respect to the user reaction information 205.
  • the management module 303 is configured to store the plurality of user reaction information 205a, 205b, 205c, 205n in the user reaction database 305. Further, the management module 303 is configured to pass a content insight query 304 to the retriever 311 and the insight LLM 313.
  • the user reaction database 305 is configured as a vector store which stores the user reaction information 205a, 205b, 205c, 205n as embedded data for use in the insight LLM 313.
  • the content insight engine 310 is configured to generate the content insight information 315 and comprises a retriever 311 and an insight LLM 313.
  • the retriever 311 is configured to retrieve, from the user reaction database 305, context information 312 based on the content insight query 304.
  • the context information 312 comprises the plurality of user reaction information 205a, 205b, 205c, 205n of a plurality of users, wherein the user reaction information 205a, 205b, 205c, 205n is related to the same media content.
  • the content insight query 304 indicates which user reaction information is to be retrieved.
  • the content insight query 304 is formulated as a query text, for example “Please provide content insight for the movie XY”.
  • the retriever 311 is configured to retrieve, from the user reaction database 305, user reaction information which originates from different users/media devices and are related to the same media content based on the content insight query 304.
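  • A minimal sketch of such a retriever is shown below; the in-memory store and the embed_text() helper are assumptions, and a real deployment would query a vector database instead:

```python
# Sketch of the retriever 311: embed the content insight query, then return stored
# user reaction entries for the same media content ranked by cosine similarity.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def retrieve(query_text, media_content_id, store, embed_text, top_k=50):
    # embed_text() is assumed to return a numpy vector; store is a list of dicts
    # with "media_content_id" and "embedding" keys (see the storage sketch later).
    q = embed_text(query_text)
    candidates = [e for e in store if e["media_content_id"] == media_content_id]
    candidates.sort(key=lambda e: cosine(q, e["embedding"]), reverse=True)
    return candidates[:top_k]   # used as context information 312
```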
  • the insight LLM 313 is configured to generate the content insight information 315 related to the media content based on the content insight query 304 and the context information 312. To this end, the content insight query 304 and the context information 312 are passed to the insight LLM 313 which generates a response to the content insight query 304, i.e. the content insight information 315.
  • the content insight information 315 comprises information regarding feedback of a plurality of users 110 regarding the same media content.
  • the content insight information 315 comprises all user reactions comprised in the user reaction information 205a, 205b, 205c, 205n and the corresponding timestamps of the media content at which the user reactions occur.
  • the content insight information 315 is indicative of which portions of the media content cause certain reactions, of which portions are not interesting, of some highlights of the content and of most and least favourite moments.
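  • As an illustration of how the content insight query 304 and the context information 312 could be combined for the insight LLM 313 in a RAG-style manner, the following hedged sketch builds a prompt; the record keys follow the assumed schema above, and generate() stands in for whatever LLM backend is used and is purely hypothetical:

```python
# Sketch of prompt assembly for the insight LLM 313 (RAG-style).
def build_insight_prompt(content_insight_query, context_entries):
    context_lines = [
        f"- t={e['timestamp_s']:.0f}s: {e['reaction_type']} ({e['detail']})"
        for e in context_entries
    ]
    return (
        "You are given aggregated, anonymised viewer reactions to one media content.\n"
        "Reactions:\n" + "\n".join(context_lines) + "\n\n"
        "Task: " + content_insight_query + "\n"
        "Summarise which portions caused strong reactions, which were least "
        "engaging, and list highlight timestamps."
    )

# content_insight_information = generate(build_insight_prompt(query, context))
```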
  • Fig. 4 shows a method 400 for processing the content data 201 and the user data 203 for generating the user reaction information 205.
  • the method 400 is carried out by the media device 100.
  • the content data 201 included in the media content is obtained.
  • the content data 201 includes at least one of the content image data 201a, the content text data 201b and the content audio data 201c.
  • In step 403, content vision embeddings are generated based on the content image data 201a.
  • the content image data 201a is input to the vision encoder 240 which generates vision embeddings based on image data.
  • In step 403a, the content vision embeddings are converted to a format which the LLM 260 can process.
  • the content vision embeddings are input into the conversion component 250 which converts the content vision embeddings to obtain converted content vision embeddings.
  • the presence of step 403a depends on the capability of the LLM 260. If the LLM 260 is able to process text embeddings and vision embeddings without any conversion of the vision embeddings, step 403a is omitted.
  • In step 405, content text embeddings are generated based on the content text data 201b and/or the content audio data 201c.
  • the content text data 201b is input into the tokenizer 220 which generates tokens.
  • the tokens are then input into the text embedder 230 which generates text embeddings based on the tokens.
  • the content audio data 201c is input to the speech-to-text encoder 210 which detects speech included in the content audio data 201c and outputs the detected speech as text data.
  • the text data output by the speech-to-text encoder 210 is then processed by the tokenizer 220 and the text embedder 230 to generate text embeddings as described with respect to the content text data 201b.
  • the content embeddings are input to the LLM 260 as described later with step 417.
  • the user data 203 indicative of a user’s behaviour is obtained.
  • the user data 203 includes at least one of the user image data 203a, the user text data 203b and the user audio data 203c.
  • In step 411a, user-specific features are removed from the user data 203 in order to anonymise the user data 203.
  • In step 413, user vision embeddings are generated based on the user image data 203a.
  • the user image data 203a is input to the vision encoder 240 as described in step 403 with respect to the content vision embeddings.
  • the user vision embeddings are converted by the conversion component 250 as described in step 403a with respect to the content vision embeddings.
  • In step 415, user text embeddings are generated based on the user text data 203b and/or the user audio data 203c as described in step 405 with respect to the content text data 201b and the content audio data 201c.
  • the user data 203 is input to the user reaction engine 200 as described in steps 403, 403a, 405 with respect to the content data 201. Therefore, the above-described processing of the content data 201 applies correspondingly to the processing of the user data 203.
  • In step 417, the (optionally converted) user vision embeddings and the user text embeddings (which are user input embeddings) are input to the LLM 260.
  • the LLM 260 is configured to analyse the user data 203.
  • the LLM 260 is configured to perform at least one of language understanding, object segmentation, action recognition and LLM-based language processing on the user data 203.
  • In step 419, the media device 100 transmits the user reaction information 205 to the server 300 for further processing.
  • Fig. 5 shows a method 500 for providing the content insight information 315.
  • the method 500 is performed by the server 300.
  • the communication module 301 receives the plurality of user reaction information 205a, 205b, 205c, 205n of the plurality of media devices 100a, 100b, 100c, 100n and relating to the same media content.
  • the management module 303 inputs the content insight query 304 to the retriever 311 for retrieving the context information 312 from the user reaction database 305 based on the content insight query 304.
  • the content insight query 304 corresponds to a query for providing content insight information regarding a media content.
  • the content insight query 304 may be issued by an administrator or any other user who is interested in the content insight information.
  • the retriever 311 retrieves the context information 312, i.e. the plurality of user reaction information 205a, 205b, 205c, 205n related to the same media content.
  • the context information 312 can aggregate the plurality of user reaction information 205a, 205b, 205c, 205n into a single data file.
  • In step 507, the management module 303 inputs the content insight query 304 and the context information 312 to the insight LLM 313.
  • the insight LLM 313 generates the content insight information 315 regarding the media content as queried in the content insight query 304.
  • a method for providing user reaction information and a method for providing content insight information are described above with reference to Figs. 4 and 5, respectively.
  • the methods can also be implemented as a computer program causing a computer and/or a processor, to perform the methods, when being carried out on the computer and/or processor.
  • a non-transitory computer-readable recording medium is provided that stores therein a computer program product, which, when executed by a processor, such as the processor described above, causes the method described to be performed.
  • An information processing system for providing content insight information regarding media content
  • the information processing system comprises circuitry configured to: acquire content data related to media content; acquire user data which is indicative of a user behaviour of the user consuming the media content; input the content data and the user data to a user reaction neural network to generate user reaction information of the user, wherein the user reaction information is indicative of a user reaction related to the media content; and aggregate user reaction information of a plurality of users to generate the content insight information regarding the media content.
  • the circuitry is further configured to: retrieve the user reaction information of a plurality of users from a user reaction database; and input the user reaction information of the plurality of users to an insight neural network to generate the content insight information.
  • circuitry is further configured to acquire the user reaction information of a plurality of users from multiple media devices.
  • the information processing system comprises a media device outputting the media content and a server, wherein the media device is configured to acquire the content data and the user data and to generate the user reaction information, and the server is configured to aggregate the user reaction information of the plurality of users and to generate the content insight information.
  • a method for providing content insight information regarding media content comprising: acquiring content data related to media content; acquiring user data which is indicative of a user behaviour of the user consuming the media content; inputting the content data and the user data to a user reaction neural network to generate user reaction information of the user, wherein the user reaction information is indicative of a user reaction related to the media content; and aggregating user reaction information of a plurality of users to generate the content insight information.
  • (21) A computer program comprising program code causing a computer to perform the method according to any one of (11) to (20), when being carried out on a computer.
  • (22) A non-transitory computer-readable recording medium that stores therein a computer program product, which, when executed by a processor, causes the method according to any one of (11) to (20) to be performed.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physiology (AREA)
  • Multimedia (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The present invention is directed to an information processing system for providing content insight information regarding media content, wherein the information processing system comprises circuitry configured to acquire content data related to media content, to acquire user data which is indicative of a user behaviour of the user consuming the media content and to input the content data and the user data to a user reaction neural network to generate user reaction information of the user, wherein the user reaction information is indicative of a user reaction related to the media content, and to aggregate user reaction information of a plurality of users to generate the content insight information regarding the media content. Further, the present invention is directed to a method for providing the content insight information.

Description

INFORMATION PROCESSING SYSTEM AND METHOD
TECHNICAL FIELD
The present disclosure generally pertains to an information processing system and method for providing content insight information regarding media content. In particular, the present disclosure is directed to provide the content insight information based on crowdsourced feedback.
TECHNICAL BACKGROUND
User interaction with media devices such as (smart) TVs, tablets, smartphones, etc. has evolved significantly over the years, from passive viewing to active engagement. Viewing has progressed from the beginnings of black-and-white TVs with manual dials, where viewers were passive consumers of content displayed on the TV screen, to the sleek, technology-enabled smart TVs of today, which offer a multifaceted and interactive viewing experience and allow new forms of interaction beyond the traditional remote control, such as voice commands, gesture commands, facial expression commands, etc.
As technology advanced, so did the expectations of users. With the rise of smart TVs, the Internet, and streaming platforms, users have gained unprecedented control over what, when, and how they consume media content. Users can now access online media content and services like streaming platforms, digital newspapers, podcasts, social media, e-commerce, and the like. Technology has made media devices more interactive, and the introduction of Large Language Models (LLMs) should be no exception.
LLMs are a type of artificial intelligence (AI) that can perform multiple natural language processing (NLP) tasks. LLMs are built based on deep learning and are trained on vast datasets containing text from the Internet. LLMs can generate human-like text, answer questions, translate languages, summarize documents, and engage in contextually relevant conversations. Further, LLMs can generate computer code which can be used by other programs to visualise and generate various multimodal results. Some examples include using the generated code for graph plotting, providing a calculator, generating Python code (including all the libraries), speech generation, image generation, and the like. From content generation and recommendation systems to virtual assistants and customer service solutions, LLMs have the potential to impact numerous aspects of daily life.
It is generally desirable to use the above-described technologies to gain insight regarding user reactions to media content output on a media device. This insight could be useful to content creators, rightsholders, and producers. It could be utilized for creating more engaging content, which could be adjusted according to the viewers' feedback or impressions. With generative AI, even the storyline of certain TV episodes or movies can be adjusted, or voting or betting processes could be integrated to influence the outcome of certain actions or episodes. For example, all the viewers worldwide could provide feedback or “vote” on which character should succeed in some duel. In reality TV shows or sports matches with live broadcasts, viewers could influence some future events in the show or game.
Thus, the present invention is related to an aggregation of multiple users' reactions, comments, and attention while consuming media content to provide information (content insight information) on the engagement and interest in a crowdsourced manner.
SUMMARY
According to a first aspect, the present disclosure provides an information processing system for providing content insight information regarding media content, wherein the information processing system comprises circuitry configured to: acquire content data related to media content; acquire user data which is indicative of a user behaviour of the user consuming the media content; input the content data and the user data to a user reaction neural network to generate user reaction information of the user, wherein the user reaction information is indicative of a user reaction related to the media content; and aggregate user reaction information of a plurality of users to generate the content insight information.
According to a second aspect, the present disclosure provides a method for providing content insight information regarding media content, the method comprising: acquiring content data related to media content; acquiring user data which is indicative of a user behaviour of the user consuming the media content; inputting the content data and the user data to a user reaction neural network to generate user reaction information of the user, wherein the user reaction information is indicative of a user reaction related to the media content; and aggregating user reaction information of a plurality of users to generate the content insight information.
Further aspects are set forth in the dependent claims, the drawings and the following description.
BRIEF DESCRIPTION OF THE DRAWINGS
Embodiments are explained by way of example with respect to the accompanying drawings, in which:
Fig. 1a shows a media device including a user reaction engine for generating user reaction information;
Fig. 1b shows a system including a plurality of media devices and a server;
Fig. 2 is a block diagram of an exemplary configuration of the user reaction engine;
Fig. 3 is a block diagram of an exemplary configuration of the server;
Fig. 4 shows a method for generating the user reaction information; and
Fig. 5 shows a method for generating the content insight information.
DETAILED DESCRIPTION OF EMBODIMENTS
Before a detailed description of the embodiments under reference of Fig. 1 is given, general explanations are made.
In some embodiments, an information processing system for providing content insight information regarding media content comprises circuitry configured to: acquire content data related to media content; acquire user data which is indicative of a user behaviour of the user consuming the media content; input the content data and the user data to a user reaction neural network to generate user reaction information of the user, wherein the user reaction information is indicative of a user reaction related to the media content; and aggregate user reaction information of a plurality of users to generate the content insight information. The media content can be video media content such as a film, a TV show, a video, or the like. Further, the media content can be audio media content such as a podcast, an audio book, or the like.
The media content can be output on a media device. The media device can be a smart TV, a tablet, a mobile phone, AR/VR glasses, smartwatch, projector, or any other device configured to output the media content. Further, the media device may comprise an interface to access the Internet. By way of the Internet connection, the media device may stream media content provided on streaming portals or download the media content and store it on a storage of the media device. Further, the media device (and/or a later described server) can be configured to access, via APIs, third party applications.
The media content is indicated by the content data. That is, the content data represents or is related to the media content.
The content data comprises at least one of content image data, content text data, and content audio data. For example, a video media content may usually comprise image data and audio data. A podcast may usually comprise audio data.
“Acquiring the content data” means that the content data is either streamed or broadcasted to the media device. Alternatively, the content data is downloaded to the media device. In other words, the media device is configured to obtain or receive the content data.
The user data is indicative of the user behaviour. That is, the user behaviour can be derived from the user data. A user behaviour may comprise an action, a reaction, or an intention of the user. For example, the user behaviour may be a gesture, a gaze, a position, a speech (e.g. a comment, a question, an inquiry, etc.), a mood, and the like of the user. The user data comprises at least one of user image data, user text data, and user audio data.
“Acquiring the user data” includes that the media device is configured to capture the user data. For example, the media device may comprise a camera to capture an image. The camera then outputs the user image data corresponding to the captured image. The camera may be configured as an RGB camera or an event camera (also “event-based camera”). Further, a presence sensor (such as a mmWave sensor which uses short-wavelength electromagnetic waves) may be used to capture user data.
Further, the media device may comprise a microphone to capture a speech of the user and the microphone then outputs the audio user data corresponding to the speech. Furthermore, the media device may comprise input means capturing text data entered by the user in the input means. The input means may be wired to or wirelessly coupled with the media device. The input means can be a remote controller (such as a conventional TV remote controller), gaming pad, touchpad on a screen of the media device, etc.
The content data and the user data are input to the user reaction neural network to generate the user reaction information. The user reaction neural network is trained to understand how the user is reacting to the media content. The user reaction neural network has multimodal capabilities in that it may process audio data, text data and image data to generate the user reaction information. The user reaction neural network may be trained to perform NLP, object segmentation and action recognition. In some examples, the neural network may be an LLM. In further examples, the LLM may be based on a transformer-based neural network architecture as described in A. Vaswani et al., “Attention is all you need,” in 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 2017. Such exemplary neural networks known in the art include BERT, ChatGPT, and the like.
By inputting the content data and the user data to the user reaction neural network, the user’s reaction can be mapped to the content data. In other words, the captured user data is mapped to the context of the content shown, captured, and encoded using AI methods. This can be done by adding an encoding layer or by capturing certain parameters in the latent space of the AI or deep learning models.
The user reaction information is indicative of the user reaction related to the media content. Therefore, the user reaction information includes at least one reaction of the user related to the media content. User reactions may include visual user reactions (e.g., gesture, facial expression, etc.), audible user reactions (e.g., speech, noises, etc.) and text input user reactions (e.g., text messages input via input means).
In the user reaction information, the user reaction is mapped to the media content, e.g. by mapping the user reaction to a point in time (or timestamp) of the media content. If the user reaction information comprises a plurality of user reactions, each of the user reactions is mapped to a respective timestamp of the media content.
The user reaction information of a plurality of users is aggregated to generate the content insight information. This means that there is a plurality of media devices, each of which is configured to generate the user reaction information of at least one user consuming the media content output on the media device. Thus, it is possible to collect a plurality of user reaction information related to the same media content but collected from different media devices. The aggregation of the user reaction information of the plurality of users effectuates a crowdsourced feedback regarding the same media content.
By way of the information processing system, the media content is (pre-)analyzed on the media device to provide metadata of what is happening in the media content. The metadata include specific people shown, a game or sports match including current statistics, advertisements (either direct, background posters, or placed advertising in the media content), specific events, objects, etc. Given the metadata on the shown content, the context of what is happening on the screen or what is output by the speaker is derived and understood by the information processing system.
Further, the user and his reactions to the content are observed. The reactions may include speech or phrases, pausing or starting the media content (where applicable), body language and facial expression.
Such user reaction information is collected from multiple media devices. Once transmitted to the server, the user reaction information of a plurality of users is aggregated by mapping user reactions from many users about the same portions of the same media content, typically mapped to specific timestamps of the media content. When aggregating the user reactions, general insights are derived on which portions of the media content cause certain reactions, which parts were not interesting, the highlights of the content and even the most and least favourite moments. Similarly, for advertising purposes, user engagement could be evaluated.
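The following minimal Python sketch illustrates one possible way of binning crowdsourced reactions per timestamp so that highlights and less engaging portions become visible; the dictionary layout, field names and bin size are assumptions made for this example only.

```python
# Illustrative aggregation: count reactions from many users per time bin of the
# same media content (30-second bins are an arbitrary choice for this sketch).
from collections import Counter, defaultdict

def aggregate(reaction_infos, bin_s=30):
    per_bin = defaultdict(Counter)
    for info in reaction_infos:                     # one entry per user/media device
        for r in info["reactions"]:
            per_bin[int(r["timestamp_s"] // bin_s)][r["description"]] += 1
    return per_bin

users = [
    {"reactions": [{"timestamp_s": 65, "description": "laughing"},
                   {"timestamp_s": 610, "description": "paused playback"}]},
    {"reactions": [{"timestamp_s": 70, "description": "laughing"}]},
]
print(dict(aggregate(users)))  # {2: Counter({'laughing': 2}), 20: Counter({'paused playback': 1})}
```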
Thus, the information processing system is configured to collect (acquire) and aggregate crowdsourced feedback of the viewers’ actions and reactions to the content being played (user reaction information of a plurality of users regarding the same media content). The user reaction is mapped directly to the content and context of the video, thus giving direct natural reactions of people watching the video. This provides statistically significant data and insights into the state of viewers consuming certain media content from all around the world in the form of the content insight information.
Thereby, a system is provided that allows the collection, aggregation, and anonymization of natural reactions of the users regarding the media content that they are consuming on their media devices, where the context would be known or analysed, for further engagement and interaction activities as well as insights for producers and copyright holders to understand the reaction of the audience to their media content. The collection of user reaction information and the aggregation thereof could be done either in real time or with a certain delay, depending on the application.
In some embodiments, aggregating the user reaction information of the plurality of users comprises: retrieving the user reaction information of a plurality of users from a user reaction database as context information; and inputting the context information to an insight neural network to generate the content insight information.
The user reaction database comprises a plurality of user reaction information of a plurality of users regarding a plurality of media contents. Each of the plurality of user reaction information may be captured by a respective media device used by at least one user. The plurality of media contents comprises different video contents and audio contents. The user reaction database may be configured as a vector database which stores the plurality of user reaction information as embedded data (embeddings) which can be used in LLMs.
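As a rough illustration of such a vector database, the following in-memory Python sketch stores each user reaction information together with an embedding and retrieves entries by cosine similarity; a deployed system would typically use a dedicated vector database, and all class and method names here are illustrative assumptions.

```python
# Minimal in-memory stand-in for the user reaction database as a vector store.
import numpy as np

class UserReactionVectorStore:
    def __init__(self):
        self.vectors = []   # one embedding per stored user reaction information
        self.payloads = []  # the corresponding user reaction information (e.g. dict/JSON)

    def add(self, embedding: np.ndarray, payload: dict) -> None:
        v = embedding / (np.linalg.norm(embedding) + 1e-12)  # normalise for cosine search
        self.vectors.append(v)
        self.payloads.append(payload)

    def search(self, query_embedding: np.ndarray, top_k: int = 5) -> list:
        q = query_embedding / (np.linalg.norm(query_embedding) + 1e-12)
        scores = np.stack(self.vectors) @ q                  # cosine similarities
        best = np.argsort(scores)[::-1][:top_k]
        return [self.payloads[i] for i in best]
```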
The retrieval of the user reaction information of a plurality of users is based on a content insight query which queries for providing a content insight information regarding a particular media content. The query is passed to the user reaction database and, as a result, the user reaction information of a plurality of users is retrieved and then used as context information for the insight neural network.
The insight neural network is configured to provide the content insight information for the media content. To this end, the insight neural network receives the context information (comprising the user reaction information of a plurality of users regarding the media content) and the content insight query and generates the content insight information.
The insight neural network can be an LLM, such as a transformer-based LLM. Further, the above description of the user reaction neural network applies correspondingly to the insight neural network.
The retrieval of the user reaction information of a plurality of users and the generation of the content insight information based on the retrieved information correspond to a retrieval augmented generation (RAG)-based LLM application. It is noted that other forms of LLM applications are also possible for generating the content insight information based on the user reaction information of a plurality of users. In some embodiments, the content insight information may include the user reaction information which is time-indexed regarding the media content.
That is, each user reaction included in the user reaction information is associated with a point in time (or timestamp) of the media content. By this, each user reaction can be associated with a respective portion of the media content. For example, sad user reactions like crying can be associated with a portion of a video content which depicts a sad scene and happy user reactions like smiling can be associated with another portion of the video content which depicts a funny scene, wherein these scenes can be indicated with corresponding timestamps. That is, the user reaction is time-indexed with respect to a portion of the media content.
In some embodiments, the user reaction neural network is a transformer-based LLM. Further, the insight neural network can be a transformer-based LLM.
In some embodiments, the circuitry is further configured to input the content data as media content embeddings and the user data as user embeddings to the neural network.
The media content embeddings include media content vision embeddings and media content text embeddings. Correspondingly, the user embeddings include user vision embeddings and user text embeddings.
Generally, text embeddings are a numerical representation of text, used for measuring semantic similarities and aiding in context-based AI interaction. For example, text embeddings are vectors or arrays of numbers which represent the meaning of text data. The text embeddings are then used by the neural network.
Correspondingly, vision embeddings are numerical representations of images that encode the semantics of image data. For example, vision embeddings are vectors or arrays of numbers which represent the meaning and the context of the image data.
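To make this concrete, the short sketch below treats embeddings as plain vectors and compares them by cosine similarity; the numbers are invented solely to illustrate that semantically related items yield a higher score than unrelated ones.

```python
# Illustrative only: embeddings as vectors, compared by cosine similarity.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Hypothetical embeddings (in practice produced by a text embedder or vision encoder)
e_goal   = np.array([0.90, 0.10, 0.30])
e_score  = np.array([0.85, 0.15, 0.35])
e_recipe = np.array([0.10, 0.90, 0.20])

print(cosine_similarity(e_goal, e_score))   # high: semantically related
print(cosine_similarity(e_goal, e_recipe))  # low: unrelated
```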
The media content embeddings are generated based on the content data. More specifically, the media content vision embeddings are generated based on the content image data and the media content text embeddings are generated based on the content text data and/or on the content audio data.
Correspondingly, the user vision embeddings are generated based on the user image data and user text embeddings are generated based on the user text data and/or on the user audio data.
The media device may comprise a vision encoder which is configured to generate the vision embeddings (i.e. the media content vision embeddings and the user vision embeddings) based on the image data (i.e. the content image data and the user image data, respectively). The media device may comprise a text embedder which is configured to generate the text embeddings (i.e. the media content text embeddings and the user text embeddings) based on the text data (i.e. the content text data and the user text data, respectively).
Further, the media device may comprise a speech-to-text encoder which is configured to convert speech included in audio data into text data which can be subsequently fed to the text embedder to generate text embeddings. More specifically, speech included in the content audio data may be converted into content text data by the speech-to-text encoder. The converted content text data can then be input to the text embedder which generates media content text embeddings based on the converted content text data. That is, the media content text embeddings are generated based on the content audio data.
Correspondingly, the user text embeddings can also be generated based on the user audio data as described with respect to the processing of the content audio data.
In some embodiments, the media content is a video content or an audio content.
In some embodiments, the user reaction includes at least one of a speech, a gesture, a text input, and a facial expression of the user.
In some embodiments, the circuitry is further configured to acquire the user reaction information of a plurality of users from multiple media devices. Each of the multiple media devices is used by a corresponding user or group of users. Thus, by acquiring the user reaction information from multiple media devices, user reaction information of different users can be obtained.
In some embodiments, the user data is anonymised. For example, when obtaining the user image data and the user audio data, any personalised features (such as the identity, face, voice timbre, age, etc. of the user) are removed or filtered before inputting the user data to the user reaction neural network. For example, in a case where the user points to a certain object shown on the screen of the media device, only the information indicating which object the user’s attention is directed to and a transcription of the speech are fed to the user reaction neural network. Thus, no personal user information (like direct voice inputs or images of the user as captured by the media device) is used as input.
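A simplified Python sketch of such anonymisation is given below: only behavioural signals (attended object, gesture, speech transcription, text input) are kept, while identity-revealing raw images and audio are dropped; the field names are hypothetical and chosen only for illustration.

```python
# Sketch: strip personal identifiers from the user data before it is passed on.
def anonymise_user_data(raw: dict) -> dict:
    # Keep only behavioural signals; drop identity-revealing fields such as raw
    # images, raw audio recordings or voice characteristics.
    return {
        "attended_object": raw.get("attended_object"),    # e.g. "poster on the left"
        "gesture": raw.get("gesture"),                    # e.g. "pointing at screen"
        "speech_transcript": raw.get("speech_transcript"),
        "text_input": raw.get("text_input"),
    }

raw_user_data = {
    "face_image": b"...",                 # never forwarded
    "voice_recording": b"...",            # never forwarded
    "attended_object": "advertised car",
    "speech_transcript": "that car looks great",
}
print(anonymise_user_data(raw_user_data))
```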
In some embodiments, the information processing system comprises the media device outputting the media content and a server, wherein the media device is configured to acquire the content data and the user data and to generate the user reaction information. The server is configured to aggregate the user reaction information of the plurality of users and to generate the content insight information. The media device generates the user reaction information and transmits the user reaction information to the server for further processing. In some embodiments, the method for providing content insight information regarding media content comprises: acquiring content data related to media content; acquiring user data which is indicative of a user behaviour of the user consuming the media content; inputting the content data and the user data to a user reaction neural network to generate user reaction information of the user, wherein the user reaction information is indicative of a user reaction related to the media content; and aggregating user reaction information of a plurality of users to generate the content insight information.
The method may be carried out with the information processing system comprising the circuitry discussed herein.
In some embodiments, aggregating the user reaction information of the plurality of users comprises retrieving the user reaction information of a plurality of users from a user reaction database as context information, and inputting the context information to an insight neural network to generate the content insight information. In some embodiments the content insight information includes the user reaction information which is time-indexed regarding the media content. In some embodiments, the user reaction neural network is a transformer-based Large Language Model, LLM. In some embodiments, the media content is a video content or an audio content. In some embodiments, the user reaction includes at least one of a speech, a gesture, a text input, and a facial expression. In some embodiments, the method further comprises acquiring the user reaction information from multiple media devices. In some embodiments, the user data is anonymised. In some embodiments, the user reaction information is generated based on content data indicative of the media content and user data indicative of a user behaviour. In some embodiments, acquiring the content data and the user data and generating the user reaction information are performed by a media device, and aggregating the user reaction information of the plurality of users and generating the content insight information are performed by a server.
The methods as described herein are also implemented in some embodiments as a computer program causing a computer and/or a processor to perform the method, when being carried out on the computer and/or processor. In some embodiments, also a non-transitory computer-readable recording medium is provided that stores therein a computer program product, which, when executed by a processor, such as the processor described above, causes the methods described herein to be performed.
Returning to Fig. 1a, a media device 100 is shown which includes a user reaction engine 200, a screen 101 and/or a speaker 103 to output media content, a camera 105, a microphone 107, input means 109, and a communication module 111. Further, a user 110 is shown who consumes media content output by the media device 100. In some examples, there can also be multiple users 110 consuming the media content output by the media device 100.
The camera 105 and the microphone 107 are configured to capture user data 203 regarding the behaviour of the user 110.
The camera 105 is configured to capture image data regarding the user 110 (user image data 203a). For example, the camera 105 is configured to capture a user’s gesture and to provide user image data indicative of the user’s gesture. The user’s gesture may be a hand gesture (e.g. pointing to a position or object on the screen of the media device 100), a head gesture (e.g. nodding) and the like. The camera 105 is further configured to capture a gaze or a gaze direction of the user 110 and, based on the captured image data indicative of the gaze or the gaze direction, it can be determined to which portion of the screen of the media device 100 the user’s gaze is directed.
The microphone 107 is configured to capture audio data regarding the user 110 (user audio data 203c) and content audio data output by the media device 100. For example, the microphone 107 may capture, as user audio data, speech.
Further, the media device 100 may comprise input means 109 configured to receive input by the user 110. The input means 109 can be integrated into the media device 100, such as a touchpad. In another example, the input means 109 can be remotely arranged from but coupled to the media device 100, such as a remote controller, a keyboard and mouse, or a gaming controller. The user 110 may enter user text data 203b via the input means 109.
By way of the communication module 111, the media device 100 and the user reaction engine 200 are connected or have access to the Internet 120. By this, media content can be streamed on or downloaded to the media device 100. Further, the Internet connection enables the media device 100 and the user reaction engine 200 to access databases, use cloud storage, use search engines (e.g. Google, Bing, etc.), etc.
As will be described later, the user reaction engine 200 is configured to generate user reaction information 205 related to the media content output on the media device 100 based on the content data 201 and the user data 203. The user reaction information 205 is indicative of a reaction of the user 110 related to the output media content.
The user reaction information 205 can then be transmitted via the Internet 120 to the server 300 as shown in Fig. 1b.
As shown in Fig. 1b, the server 300 is connected via the Internet 120 with a plurality of media devices 100a, 100b, 100c, 100n, each of which is configured as the media device 100 as described with respect to Fig. 1a. Each of the plurality of media devices 100a, 100b, 100c, 100n can transmit their respectively generated user reaction information to the server 300. The server 300 is configured to generate content insight information 315 as will be described later with respect to Figs. 3 and 5.
Fig. 2 shows the configuration of the user reaction engine 200. In order to provide the user reaction information 205, the user reaction engine 200 is configured to receive the content data 201 and the user data 203 as input data and processes the input data to generate the user reaction information 205. That is, the user reaction engine 200 analyses the media content (indicated by the content data 201) and the behaviour of the user 110 (indicated by the user data 203) to generate the user reaction information 205.
The content data 201 is indicative of the media content output on the media device 100. The content data 201 may comprise at least one of content image data 201a, content text data 201b, and content audio data 201c. Additionally, the content data 201 may comprise metadata regarding the media content (content metadata).
With the content data 201, the user reaction engine 200 may be able to understand what is output on the media device 100 and may provide context regarding the media content. That is, the user reaction engine 200 is configured to analyse the content data 201 to obtain descriptive text information regarding the media content.
The user data 203 is indicative of a feedback of the user 110. As described above, the user data 203 may comprise the user image data 203a and the user audio data 203c captured with the camera 105 and the microphone 107, respectively. Further, the user data 203 may also comprise the text input by the user 110 (user text data 203b) through the input means 109.
The user reaction engine 200 comprises a speech-to-text encoder 210, a tokenizer 220, a text embedder 230, a vision encoder 240, a conversion component 250 and an LLM 260 (user reaction neural network). The user reaction engine 200 is configured to process the content data 201 and the user data 203. By way of example, the processing of the content data 201 is explained with respect to Fig. 2. The processing of the user data 203 is performed correspondingly.
The speech-to-text encoder 210 is configured to convert spoken language into text form. That is, the speech-to-text encoder 210 is configured to receive the content audio data 201c indicative of spoken language and to convert the content audio data 201c into text data expressed as a sequence of text. Configurations of the speech-to-text encoder 210 are known to the skilled person and may comprise, for example, an attention-based encoder-decoder framework or a transducer framework.
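As one hedged example, an off-the-shelf speech recognition model could fill the role of the speech-to-text encoder 210; the snippet below uses the openai-whisper package (assumed installed) and a hypothetical audio file name, and is not meant to tie the disclosure to this particular library.

```python
# Illustrative speech-to-text step with an off-the-shelf model (assumption: the
# openai-whisper package is installed and "content_audio.wav" exists).
import whisper

model = whisper.load_model("base")
result = model.transcribe("content_audio.wav")
content_text_from_audio = result["text"]   # subsequently fed to the tokenizer 220
print(content_text_from_audio)
```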
The tokenizer 220 is configured to receive, as input, the content text data 201b. Further, the input to the tokenizer 220 may also be the converted text data output by speech-to-text encoder 210.
The tokenizer 220 is configured to convert the content text data 201b into tokens. That is, the tokenizer 220 is configured to break a text sequence comprised in the content text data 201b into smaller parts, i.e. tokens. For example, a sentence “One ring to rule them all” comprised in the content text data 201b is tokenized into the individual words “one”, “ring”, “to”, “rule”, “them”, “all”. The output of the tokenizer 220 is a sequence of tokens that represents the text included in the content text data 201b. In other examples, the tokens can additionally comprise sub words, signs, and other small units, which collectively represents the sentence.
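For illustration, the same example sentence can be run through an off-the-shelf subword tokenizer from the Hugging Face transformers library (assumed installed); the tokenizer 220 of the disclosure is not limited to this library or vocabulary.

```python
# Illustrative tokenisation with a pretrained tokenizer (assumption: the
# transformers package is installed and the model files can be downloaded).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("One ring to rule them all")
print(tokens)                                   # e.g. ['one', 'ring', 'to', 'rule', 'them', 'all']
ids = tokenizer.convert_tokens_to_ids(tokens)   # integer IDs consumed by the text embedder
print(ids)
```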
The text embedder 230 is configured to receive, as input, the tokens generated by the tokenizer 220 based on the content text data 201b and to generate (output) text embeddings indicative of the content text data 201b. The text embeddings are then used by the LLM 260 as described later.
Further, the text embeddings include temporal information for time indexing the text embeddings.
The vision encoder 240 is configured to receive the content image data 201a. The vision encoder 240 is configured to generate (output) vision embeddings based on the received content image data 201a. The vision embeddings are then used by the LLM 260 as described later.
Further, in case of video media content (sequential content image data), the vision embeddings include temporal information for time indexing the vision embeddings.
The conversion component 250 is configured to receive, as input, the vision embeddings generated by the vision encoder 240, and to convert the vision embeddings into a format that the LLM 260 can understand and process. Thereby, the LLM 260 can jointly process the (converted) vision embeddings and the text embeddings. For example, the conversion component 250 is configured to transform the vision embeddings, i.e. convert the output of the vision encoder 240 into the format that the LLM 260 can understand, to align the modalities such that the vision embeddings and text embeddings are compatible for joint processing by the LLM 260. In some examples, the conversion component 250 is configured to resample the vision embeddings to ensure that the vision embeddings have a common fixed length, making them suitable for further interactions with the LLM 260 (ensuring so-called “fixed-length representations”).
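A minimal PyTorch sketch of such a conversion component is given below: a linear projection aligns the vision embeddings with the LLM embedding dimension and adaptive pooling resamples a variable number of patch embeddings to a fixed number of tokens; all dimensions and the module name are assumptions for this example.

```python
# Sketch of a vision-to-LLM adapter: modality alignment plus fixed-length resampling.
import torch
import torch.nn as nn

class VisionToLLMAdapter(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096, num_fixed_tokens=64):
        super().__init__()
        self.project = nn.Linear(vision_dim, llm_dim)            # align modalities
        self.resample = nn.AdaptiveAvgPool1d(num_fixed_tokens)   # fixed-length representation

    def forward(self, vision_embeddings: torch.Tensor) -> torch.Tensor:
        # vision_embeddings: (batch, num_patches, vision_dim); num_patches may vary
        x = self.project(vision_embeddings)       # (batch, num_patches, llm_dim)
        x = self.resample(x.transpose(1, 2))      # (batch, llm_dim, num_fixed_tokens)
        return x.transpose(1, 2)                  # (batch, num_fixed_tokens, llm_dim)

adapter = VisionToLLMAdapter()
dummy = torch.randn(1, 257, 1024)                 # e.g. ViT-style patch embeddings
print(adapter(dummy).shape)                       # torch.Size([1, 64, 4096])
```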
The LLM 260 is configured to receive, as input, the text embeddings generated by the text embedder 230 and the (converted) vision embeddings output by the vision encoder 240 in order to generate the user reaction information 205. In other examples, the LLM 260 can also be configured as any other transformer-based neural network.
Additionally, the LLM 260 is configured to encode the embeddings by using artificial intelligence methods. For example, the encoding can be done by adding an encoding layer to the LLM 260. In some examples, the encoding may comprise a latent space encoding as known in the art, which captures certain parameters in the latent space of the LLM 260 (or, generally, in the latent space of the AI or the deep learning model).
The LLM 260 is configured to generate the user reaction information 205. The user reaction information 205 comprises the user reaction with respect to the media content. In the user reaction information 205, the user reaction is associated with the portion of the media content. The association is based on the temporal information included in the text embeddings and vision embeddings. In other words, the user reaction is associated with a timestamp of the media content.
In some examples, the user reaction information 205 may comprise a plurality of user reactions, each of which is associated with a corresponding one of a plurality of portions of the media content. That is, a first user reaction is associated with a first timestamp of the media content, a second user reaction is associated with a second timestamp of the media content, and so on.
Fig. 3 shows the configuration of the server 300, which comprises a communication module 301, a management module 303, a user reaction database 305 and a content insight engine 310. The server 300 is configured to perform a Retrieval Augmented Generation (RAG)-based LLM application.
The communication module 301 is configured to receive a plurality of user reaction information 205a, 205b, 205c, 205n generated by and transmitted from the plurality of media devices 100a, 100b, 100c, 100n. Each of the plurality of user reaction information 205a, 205b, 205c, 205n comprises the type of information as described with respect to the user reaction information 205. The management module 303 is configured to store the plurality of user reaction information 205a, 205b, 205c, 205n in the user reaction database 305. Further, the management module 303 is configured to pass a content insight query 304 to the retriever 311 and the insight LLM 313.
The user reaction database 305 is configured as a vector store which stores the plurality of user reaction information 205a, 205b, 205c, 205n as embedded data for use in an insight LLM 313.
The content insight engine 310 is configured to generate the content insight information 315 and comprises a retriever 311 and an insight LLM 313.
The retriever 311 is configured to retrieve, from the user reaction database 305, context information 312 based on the content insight query 304. Here, the context information 312 comprises the plurality of user reaction information 205a, 205b, 205c, 205n of a plurality of users, wherein the user reaction information 205a, 205b, 205c, 205n is related to the same media content. The content insight query 304 indicates which user reaction information is to be retrieved. The content insight query 304 is formulated as a query text, for example “Please provide content insight for the movie XY”. Generally speaking, the retriever 311 is configured to retrieve, from the user reaction database 305, user reaction information which originates from different users/media devices and are related to the same media content based on the content insight query 304.
The insight LLM 313 is configured to generate the content insight information 315 related to the media content based on the content insight query 304 and the context information 312. To this end, the content insight query 304 and the context information 312 are passed to the insight LLM 313 which generates a response to the content insight query 304, i.e. the content insight information 315.
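The overall RAG flow of the content insight engine 310 could be sketched as follows; embed, vector_store and llm_generate are placeholders standing in for the embedder, the user reaction database 305 and the insight LLM 313, so this is a schematic outline under those assumptions rather than the actual implementation.

```python
# Schematic retrieval-augmented generation flow for generating content insight information.
def generate_content_insight(content_insight_query: str, vector_store, embed, llm_generate) -> str:
    # 1. Retriever 311: fetch user reaction information related to the queried media content
    query_embedding = embed(content_insight_query)
    context_information = vector_store.search(query_embedding, top_k=20)

    # 2. Build a prompt that combines the query with the retrieved context information
    context_text = "\n".join(str(item) for item in context_information)
    prompt = (
        "You aggregate crowdsourced viewer reactions.\n"
        f"Context (user reaction information):\n{context_text}\n\n"
        f"Question: {content_insight_query}\n"
        "Summarise which portions of the media content caused which reactions."
    )

    # 3. Insight LLM 313: generate the content insight information 315
    return llm_generate(prompt)
```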
The content insight information 315 comprises information regarding the feedback of a plurality of users 110 regarding the same media content. For example, the content insight information 315 comprises all user reactions comprised in the user reaction information 205a, 205b, 205c, 205n and the corresponding timestamps of the media content where the user reactions occur. Generally, the content insight information 315 is indicative of which portions of the media content cause certain reactions, of which portions are not interesting, of some highlights of the content and of the most and least favourite moments.
Fig. 4 shows a method 400 for processing the content data 201 and the user data 203 for generating the user reaction information 205. The method 400 is carried out by the media device 100. In step 401, the content data 201 included in the media content is obtained. The content data 201 includes at least one of the content image data 201a, the content text data 201b and the content audio data 201c.
In step 403, content vision embeddings are generated based on the content image data 201a. To this end, the content image data 201a is input to the vision encoder 240 which generates vision embeddings based on the image data.
In optional step 403a, the content vision embeddings are converted to a format which the LLM 260 can process. To this end, the content vision embeddings are input into the conversion component 250 which converts the content vision embeddings to obtain converted content vision embeddings. The presence of step 403a depends on the capability of the LLM 260. If the LLM 260 is able to process text embeddings and vision embeddings without any conversion of the vision embeddings, step 403a is omitted.
In step 405, content text embeddings are generated based on the content text data 201b and/or the content audio data 201c. To generate text embeddings based on the content text data 201b, the content text data 201b is input into the tokenizer 220 which generates tokens. The tokens are then input into the text embedder 230 which generates text embeddings based on the tokens. To generate text embeddings based on the content audio data 201c, the content audio data 201c is input to the speech-to-text encoder 210 which detects speech included in the content audio data 201c and outputs the detected speech as text data. The text data output by the speech-to-text encoder 210 is then processed by the tokenizer 220 and the text embedder 230 to generate text embeddings as described with respect to the content text data 201b. The content embeddings are input to the LLM 260 as described later with step 417.
In step 411, the user data 203 indicative of a user’s behaviour is obtained. The user data 203 includes at least one of the user image data 203a, the user text data 203b and the user audio data 203c.
In optional step 411a, user specific features are removed from the user data 203 in order to anonymise the user data 203.
In step 413, user vision embeddings are generated based on the user image data 203a. To this end, the user image data 203a is input to the vision encoder 240 as described in step 403 with respect to the content vision embeddings. In optional step 413a, the user vision embeddings are converted by the conversion component 250 as described in step 403a with respect to the content vision embeddings. In step 415, user text embeddings are generated based on the user text data 203b and/or the user audio data 203c as described in step 405 with respect to the content text data 201b and the content audio data 201c.
In short, the user data 203 is processed by the user reaction engine 200 in the same manner as described in steps 403, 403a and 405 with respect to the content data 201. Therefore, the above-described processing of the content data 201 applies correspondingly to the processing of the user data 203.
In step 417, the (optionally converted) content vision embeddings and the content text embeddings are input to the LLM 260 to generate the user reaction information 205. The LLM 260 may analyse the media content. For example, the LLM 260 may be configured to perform at least one of language understanding, object segmentation, action recognition and LLM-based language processing on the content data 201 for understanding the media content.
Further, in step 417, the (optionally converted) user vision embeddings and the user text embeddings (which are user input embeddings) are input to the LLM 260. By this, the LLM 260 is configured to analyse the user data 203. To this end, the LLM 260 is configured to perform at least one of language understanding, object segmentation, action recognition and LLM-based language processing on the user data 203.
Further, in step 417, the LLM 260 is configured to merge the content data 201 (more specifically, the content vision and text embeddings) and the user data 203 (more specifically, the user vision and text embeddings) to understand the user’s behaviour with respect to the media content and to generate the user reaction information 205. Therefore, the LLM 260 generates the user reaction information 205 based on the content data 201 and the user data 203.
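As a rough illustration of this merging step, the content embeddings and the user embeddings can simply be concatenated along the token dimension before being fed to the LLM 260; the tensor shapes in the PyTorch snippet below are made-up assumptions.

```python
# Sketch: concatenating time-aligned content and user token embeddings into one LLM input.
import torch

content_tokens = torch.randn(1, 64, 4096)   # converted content vision + text embeddings
user_tokens = torch.randn(1, 16, 4096)      # converted user vision + text embeddings

llm_input = torch.cat([content_tokens, user_tokens], dim=1)
print(llm_input.shape)                      # torch.Size([1, 80, 4096])
```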
In step 419, the media device 100 transmits the user reaction information 205 to the server 300 for further processing.
Fig. 5 shows a method 500 for providing the content insight information 315. The method 500 is performed by the server 300.
In step 501, the communication module 301 receives the plurality of user reaction information 205a, 205b, 205c, 205n of the plurality of media devices 100a, 100b, 100c, 100n and relating to the same media content.
In step 503, the management module 303 inputs the content insight query 304 to the retriever 311 for retrieving the context information 312 from the user reaction database 305 based on the content insight query 304. The content insight query 304 corresponds to a query for providing content insight information regarding a media content. For example, the content insight query 304 may be issued by an administrator or any other user who is interested in the content insight information.
In step 505, the retriever 311 retrieves the context information 312, i.e. the plurality of user reaction information 205a, 205b, 205c, 205n related to the same media content. In some examples, the context information 312 can aggregate the plurality of user reaction information 205a, 205b, 205c, 205n into a single data file.
In step 507, the management module 303 inputs the content insight query 304 and the context information 312 to the insight LLM 313.
In step 509, the insight LLM 313 generates the content insight information 315 regarding the media content as queried in the content insight query 304.
It should be recognized that the embodiments describe methods with an exemplary ordering of method steps. The specific ordering of method steps is however given for illustrative purposes only and should not be construed as binding. Changes of the ordering of method steps may be apparent to the skilled person.
All units and entities described in this specification and defined in the appended claims, if not stated otherwise, can be implemented as integrated circuit logic, for example on a chip, and functionality provided by such units and entities can, if not stated otherwise, be implemented by software.
In so far as the embodiments of the disclosure described above are implemented, at least in part, using software-controlled data processing apparatus, it will be appreciated that a computer program providing such software control and a transmission, storage or other medium by which such a computer program is provided are envisaged as aspects of the present disclosure.
A method for providing user reaction information and a method for providing content insight information are described above and under reference of Fig. 4 and 5, respectively. The methods can also be implemented as a computer program causing a computer and/or a processor, to perform the methods, when being carried out on the computer and/or processor. In some embodiments, also a non-transitory computer-readable recording medium is provided that stores therein a computer program product, which, when executed by a processor, such as the processor described above, causes the method described to be performed. All units and entities described in this specification and claimed in the appended claims can, if not stated otherwise, be implemented as integrated circuit logic, for example on a chip, and functionality provided by such units and entities can, if not stated otherwise, be implemented by software.
Note that the present technology can also be configured as described below.
(1) An information processing system for providing content insight information regarding media content, wherein the information processing system comprises circuitry configured to: acquire content data related to media content; acquire user data which is indicative of a user behaviour of the user consuming the media content; input the content data and the user data to a user reaction neural network to generate user reaction information of the user, wherein the user reaction information is indicative of a user reaction related to the media content; and aggregate user reaction information of a plurality of users to generate the content insight information regarding the media content.
(2) The information processing system according to (1), wherein, for aggregating the user reaction information, the circuitry is further configured to: retrieve the user reaction information of a plurality of users from a user reaction database; and input the user reaction information of the plurality of users to an insight neural network to generate the content insight information.
(3) The information processing system according to (1) or (2), wherein the content insight information includes the user reaction information which is time-indexed regarding the media content.
(4) The information processing system according to any one of (1) to (3), wherein the user reaction neural network is a transformer-based Large Language Model, LLM.
(5) The information processing system according to any one of (1) to (4), wherein the media content is a video content or an audio content.
(6) The information processing system according to any one of (1) to (5), wherein the user reaction includes at least one of a speech, a gesture, a text input, and a facial expression.
(7) The information processing system according to any one of (1) to (6), wherein the circuitry is further configured to acquire the user reaction information of a plurality of users from multiple media devices.
(8) The information processing system according to any one of (1) to (7), wherein the user data is anonymised.
(9) The information processing system according to any one of (1) to (8), wherein the user reaction information is generated based on content data indicative of the media content and user data indicative of a user behaviour.
(10) The information processing system according to any one of (1) to (9), wherein the information processing system comprises a media device outputting the media content and a server, wherein the media device is configured to acquire the content data and the user data and to generate the user reaction information, and the server is configured to aggregate the user reaction information of the plurality of users and to generate the content insight information.
(11) A method for providing content insight information regarding media content, comprising: acquiring content data related to media content; acquiring user data which is indicative of a user behaviour of the user consuming the media content; inputting the content data and the user data to a user reaction neural network to generate user reaction information of the user, wherein the user reaction information is indicative of a user reaction related to the media content; and aggregating user reaction information of a plurality of users to generate the content insight information.
(12) The method according to (11), wherein aggregating the user reaction information of the plurality of users comprises: retrieving the user reaction information of a plurality of users from a user reaction database; and inputting the user reaction information of a plurality of users to an insight neural network to generate the content insight information.
(13) The method according to (11) or (12), wherein the content insight information includes the user reaction information which is time-indexed regarding the media content.
(14) The method according to any one of (11) to (13), wherein the user reaction neural network is a transformer-based Large Language Model, LLM.
(15) The method according to any one of (11) to (14), wherein the media content is a video content or an audio content.
(16) The method according to any one of (11) to (15), wherein the user reaction includes at least one of a speech, a gesture, a text input, and a facial expression.
(17) The method according to any one of (11) to (16), further comprising: acquiring the user reaction information from multiple media devices.
(18) The method according to any one of (11) to (17), wherein the user data is anonymised.
(19) The method according to any one of (11) to (18), wherein the user reaction information is generated based on content data indicative of the media content and user data indicative of a user behaviour.
(20) The method according to any one of (11) to (19), wherein acquiring the content data and the user data and generating the user reaction information are performed by a media device, and aggregating the user reaction information of the plurality of users and generating the content insight information are performed by a server.
(21) A computer program comprising program code causing a computer to perform the method according to any one of (11) to (20), when being carried out on a computer.
(22) A non-transitory computer-readable recording medium that stores therein a computer program product, which, when executed by a processor, causes the method according to any one of (11) to (20) to be performed.

Claims

1. An information processing system for providing content insight information regarding media content, wherein the information processing system comprises circuitry configured to: acquire content data related to media content; acquire user data which is indicative of a user behaviour of the user consuming the media content; input the content data and the user data to a user reaction neural network to generate user reaction information of the user, wherein the user reaction information is indicative of a user reaction related to the media content; and aggregate user reaction information of a plurality of users to generate the content insight information regarding the media content.
2. The information processing system according to claim 1, wherein, for aggregating the user reaction information, the circuitry is further configured to: retrieve the user reaction information of a plurality of users from a user reaction database; and input the user reaction information of the plurality of users to an insight neural network to generate the content insight information.
3. The information processing system according to claim 1, wherein the content insight information includes the user reaction information which is time-indexed regarding the media content.
4. The information processing system according to claim 1, wherein the user reaction neural network is a transformer-based Large Language Model, LLM.
5. The information processing system according to claim 1, wherein the media content is a video content or an audio content.
6. The information processing system according to claim 1, wherein the user reaction includes at least one of a speech, a gesture, a text input, and a facial expression.
7. The information processing system according to claim 1, wherein the circuitry is further configured to acquire the user reaction information of a plurality of users from multiple media devices.
8. The information processing system according to claim 1, wherein the user data is anonymised.
9. The information processing system according to claim 1, wherein the user reaction information is generated based on content data indicative of the media content and user data indicative of a user behaviour.
10. The information processing system according to claim 1, wherein the information processing system comprises a media device outputting the media content and a server, wherein the media device is configured to acquire the content data and the user data and to generate the user reaction information, and the server is configured to aggregate the user reaction information of the plurality of users and to generate the content insight information.
11. A method for providing content insight information regarding media content, comprising: acquiring content data related to media content; acquiring user data which is indicative of a user behaviour of the user consuming the media content; inputting the content data and the user data to a user reaction neural network to generate user reaction information of the user, wherein the user reaction information is indicative of a user reaction related to the media content; and aggregating user reaction information of a plurality of users to generate the content insight information.
12. The method according to claim 11, wherein aggregating the user reaction information of the plurality of users comprises: retrieving the user reaction information of a plurality of users from a user reaction database; and inputting the user reaction information of a plurality of users to an insight neural network to generate the content insight information.
13. The method according to claim 11, wherein the content insight information includes the user reaction information which is time-indexed regarding the media content.
14. The method according to claim 11, wherein the user reaction neural network is a transformer-based Large Language Model, LLM.
15. The method according to claim 11, wherein the media content is a video content or an audio content.
16. The method according to claim 11, wherein the user reaction includes at least one of a speech, a gesture, a text input, and a facial expression.
17. The method according to claim 11, further comprising: acquiring the user reaction information from multiple media devices.
18. The method according to claim 11, wherein the user data is anonymised.
19. The method according to claim 11, wherein the user reaction information is generated based on content data indicative of the media content and user data indicative of a user behaviour.
20. The method according to claim 11, wherein acquiring the content data and the user data and generating the user reaction information are performed by a media device, and aggregating the user reaction information of the plurality of users and generating the content insight information are performed by a server.
PCT/EP2025/057656 2024-03-20 2025-03-20 Information processing system and method Pending WO2025196197A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP24164944 2024-03-20
EP24164944.1 2024-03-20

Publications (1)

Publication Number Publication Date
WO2025196197A1 true WO2025196197A1 (en) 2025-09-25

Family

ID=90436294

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2025/057656 Pending WO2025196197A1 (en) 2024-03-20 2025-03-20 Information processing system and method

Country Status (1)

Country Link
WO (1) WO2025196197A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150347903A1 (en) * 2014-05-30 2015-12-03 Adobe Systems Incorporated Method and apparatus for performing sentiment analysis based on user reactions to displayable content
US20200273485A1 (en) * 2019-02-22 2020-08-27 Synaptics Incorporated User engagement detection
US20220303619A1 (en) * 2021-03-17 2022-09-22 Daniel L. Coffing Automated customization of media content based on insights about a consumer of the media content

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A. VASWANI ET AL.: "Attention is all you need", 31ST CONFERENCE ON NEURAL INFORMATION PROCESSING SYSTEMS, 2017

Similar Documents

Publication Publication Date Title
JP6967059B2 (en) Methods, devices, servers, computer-readable storage media and computer programs for producing video
CN112449253B (en) Interactive video generation
US10504039B2 (en) Short message classification for video delivery service and normalization
US20210280181A1 (en) Information processing apparatus, information processing method, and program
US20140255003A1 (en) Surfacing information about items mentioned or presented in a film in association with viewing the film
CN111586469B (en) Bullet screen display method and device and electronic equipment
US20230326369A1 (en) Method and apparatus for generating sign language video, computer device, and storage medium
JP5611155B2 (en) Content tagging program, server and terminal
CN115221354A A video playback method, apparatus, device and medium
WO2025196197A1 (en) Information processing system and method
US12289481B2 (en) System, method and computer-readable medium for data search
US12288570B1 (en) Conversational AI-encoded language for video navigation
WO2025196196A1 (en) Media device and method for providing feedback to a user regarding media content
CN116052709A (en) Sign language generation method, device, electronic device and storage medium
KR20230163045A (en) A method and recording medium for providing video content production service using resource conversion matching of multimedia collected in metaverse environment
WO2025196094A1 (en) Information processing apparatus and method
CN104363515B (en) Intelligent multimedia information system applied to interactive television and method thereof
Verma et al. Gain Movie Insights Using Generative Artificial Intelligence
HK40091472B (en) Method, apparatus, computer device, and storage medium for generating sign language video
CN119646449A (en) Material processing method and device, computer program product and electronic device
CN119415712A (en) IVVR-based integrated media content sharing methods, equipment, media and products
KR20230163046A (en) An apparatus for providing video content production service using resource conversion matching of multimedia collected in metaverse environment
CN120568141A (en) Media asset recommendation method and display device
HK40027985A (en) Barrage display method and apparatus, and electronic device
HK40027985B (en) Barrage display method and apparatus, and electronic device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 25712627

Country of ref document: EP

Kind code of ref document: A1