
WO2025168923A1 - Data processing apparatus and method - Google Patents

Data processing apparatus and method

Info

Publication number
WO2025168923A1
Authority
WO
WIPO (PCT)
Prior art keywords
subtitles
spoken
suitability
video
content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/GB2025/050202
Other languages
French (fr)
Inventor
Nigel Stuart Moore
Raheel MALIK
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sony Europe Bv
Sony Group Corp
Original Assignee
Sony Europe Bv
Sony Group Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sony Europe Bv and Sony Group Corp
Publication of WO2025168923A1

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/45Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
    • H04N21/4508Management of client data or end-user data
    • H04N21/4532Management of client data or end-user data involving end-user characteristics, e.g. viewer profile, preferences
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/434Disassembling of a multiplex stream, e.g. demultiplexing audio and video streams, extraction of additional data from a video stream; Remultiplexing of multiplex streams; Extraction or processing of SI; Disassembling of packetised elementary stream
    • H04N21/4348Demultiplexing of additional data and video streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/475End-user interface for inputting end-user data, e.g. personal identification number [PIN], preference data
    • H04N21/4755End-user interface for inputting end-user data, e.g. personal identification number [PIN], preference data for defining user preferences, e.g. favourite actors or genre
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/488Data services, e.g. news ticker
    • H04N21/4884Data services, e.g. news ticker for displaying subtitles

Definitions

  • the present disclosure relates to a data processing apparatus and method.
  • Video content may be provided with subtitles to improve accessibility of the video content. This allows, for example, improved accessibility for people who are hard of hearing or people who cannot speak the original language of the video content.
  • spoken subtitles have become increasingly used. These are delivered as audio output where the subtitles are read out. This is useful, for example, for people who do not speak the original language of the video content and who may also find it difficult to read the subtitles on-screen as the video content is played.
  • Subtitles may be provided as images (e.g. bitmap images) which are overlaid on the video content. To obtain the spoken subtitles, these images may be subjected to optical character recognition and the resulting text is then subject to text-to-speech (TTS) processing.
  • TTS text-to-speech
  • Figs. 1A and 1B schematically show example data processing apparatuses
  • Fig. 2 schematically shows a first example of subtitles
  • Fig. 3 schematically shows an example text-to-speech engine
  • Fig. 4 schematically shows a second example of subtitles
  • Figs. 7A and 7B schematically show example screens displayed to a user
  • Fig. 8 shows a first example method
  • Fig. 1A shows a data processing apparatus / device (content providing apparatus / device) 100 for providing video content (e.g. via radio transmission or via a communications network such as the internet).
  • Fig. 1B shows a data processing apparatus / device (content receiving apparatus / device) 111 for receiving the video content (e.g. via radio transmission or via a communications network such as the internet).
  • Each of the processor 101, memory 102, storage medium 103, communication interface 104 and user interface 105 is implemented using appropriate circuitry, for example.
  • the processor 101 controls the operation of each of the memory 102, storage medium 103, communication interface 104 and user interface 105.
  • the content receiving apparatus 111 comprises a processor 106 for executing electronic instructions, a memory 107 for storing the electronic instructions to be executed and electronic input and output information associated with the electronic instructions, a storage medium 108 (e.g. a hard disk drive or solid state drive) for long term storage of digital information, a communication interface 109 and a user interface 110 (e.g. a touch screen, a non-touch screen, buttons, a keyboard and/or a mouse) for receiving commands from and/or outputting information to a user.
  • the communication interface 109 is for sending information to and/or receiving information from one or more other apparatuses.
  • the communication interface 109 is configured to receive video data representing the video content.
  • the communication interface 109 thus comprises one or more of a radio receiver and a network communication interface (e.g. an Ethernet interface).
  • Each of the processor 106, memory 107, storage medium 108, communication interface 109 and user interface 110 is implemented using appropriate circuitry, for example.
  • the processor 106 controls the operation of each of the memory 107, storage medium 108, communication interface 109 and user interface 110.
  • Fig. 2 shows an example use of TTML.
  • the example has been simplified for ease of explanation (and thus shows only a simplified portion of the TTML code).
  • the content receiving apparatus 111 is (or is comprised within) a television (TV) 111 which comprises an electronic display 201 (e.g. a liquid crystal display (LCD) or organic light emitting diode (OLED) display) and a sound output device 204.
  • the sound output device 204 is a loudspeaker integrated as part of the TV 111.
  • the sound output device 204 may take any other suitable form, such as a sound bar, surround sound system, headphones or assistive technology such as a hearing aid connected to the TV 111.
  • the display 201 and sound output device 204 are part of the user interface 110.
  • Fig. 3 illustrates example processing of each instance of subtitle text to enable the subtitle text to be spoken (that is, read aloud to the user via the sound output device 204).
  • Each instance of subtitle text is a text string 301 input to a text-to-speech (TTS) engine 302.
  • the TTS engine 302 is executed by the processor 106, for example, and may be any suitable TTS engine known in the art.
  • the output of the TTS engine 302 is a TTS output 303.
  • the TTS output 303 is a digital audio signal which is reproduced by the sound output device 204.
  • subtitle text “I'm talking to you!” is displayed between time 002 and 004.
  • a second instance of subtitle text “I'm interrupting you back!” is displayed between time 003 and 006.
  • a third instance of subtitle text “STOP ARGUING!!” is displayed between time 004 and 005.
  • a fourth instance of subtitle text “##Crashing Noise##” is displayed between time 004 and 006.
  • the fourth instance of subtitle text represents a sound effect (rather than dialogue) and is usually found in subtitles designed for people who are hard of hearing. In this example, the fact it is a sound effect is denoted by the use of the “#” symbols before and after the text “Crashing Noise” describing the sound effect.
  • Fig. 5 shows a table reproduced from DVB TTML subtitles specification (EN 303 560). This table is carried in the “TTML_subtitling_descriptor” and shows possible values of the parameter “TTS_suitability”.
  • the TTS_suitability parameter is signalled to the content receiving device 111 by the content providing device 100 and indicates whether the TTML subtitles provided with the video content are suitable for use as spoken subtitles via TTS.
  • TTS_suitability can take a value of 0x0 (indicating the suitability for use as spoken subtitles is unknown), 0x1 (indicating the subtitles are suitable for use as spoken subtitles) or 0x2 (indicating the subtitles are not suitable for use as spoken subtitles).
  • the value 0x3 is reserved for future use.
  • TTS_suitability 0x0.
  • this is exemplified in Figs. 6A and 6B. Fig. 6A shows a first example in which existing metadata provided with video content indicating the genre of the video content is used to determine the suitability of TTML subtitles being used as spoken subtitles.
  • the content receiving device 111 is able to determine a likelihood of the TTML subtitles provided with the video content being suitable for use as spoken subtitles. If a user has selected the use of spoken subtitles for all video content where suitable (e.g. through a general interactive settings menu provided via user interface 110), the content receiving device 111 may thus output spoken subtitles for “News” or “Documentary” video content but not for “Drama” or “Sport” video content. Spoken subtitles are thus provided only when appropriate to do so, even if the content provider has not explicitly indicated this using the TTS_suitability value.
  • the method proceeds to step 706.
  • the alternative language speech only subtitles are machine-translated to the language of the spoken dialogue of the video content.
  • the machine-translated speech only subtitles are subject to TTS to generate the spoken subtitles.
  • the TTS engine 302 comprises a machine translation component (not shown) and applies TTS to the output of the machine translation component.
  • the machine translation component uses any suitable known machine translation technique (e.g. neural machine translation using a transformer), for example. Spoken subtitles are thus provided in the language of the dialogue of the video content without the unnecessary description of sound effects (e.g. “##Crashing Noise##”), even though speech only subtitles in that language were not originally available.
  • the translated subtitles may also be displayed in a visual form on the screen. The method then ends at step 708.
  • If such alternative language speech only subtitles are not available at step 705, the method proceeds to step 707.
  • available subtitles which do not satisfy the test of steps 703 or 705 (e.g. HoH subtitles)
  • these subtitles are then subject to TTS to generate the spoken subtitles.
  • the method then ends at step 708.
  • the present technique thus allows speech only subtitles in the original language of the spoken dialogue of the video content to be obtained from alternative language speech only subtitles if these are available. Only if such alternative language speech only subtitles are unavailable are subtitles such as HoH subtitles used. This helps provide an improved spoken subtitles experience for the user.
  • the processor 106 may control the TTS engine 302 to ignore all text including and between such symbols when performing TTS. This helps prevent the undesirable and unnecessary output of sound effects as part of the spoken subtitle output.
  • any other suitable technique for distinguishing subtitle text corresponding to dialogue and subtitle text corresponding to sound effects (e.g. any suitable machine learning text analysis technique) may be used.
  • Fig. 9 shows an example method according to the present technique. The method is executed by the processor 106 of the content receiving device 111, for example.
  • the method starts at step 801.
  • at step 802, content data comprising video data representing a video and subtitle data representing subtitles for the video is received.
  • the content data is received by communication interface 109 and transmitted by communication interface 104, for example.
  • a data processing apparatus configured to: receive feedback data indicating user feedback on the suitability of the subtitles for use as spoken subtitles when the subtitles are used as spoken subtitles; update the suitability value of the metadata of the video data in the lookup table based on the feedback data.
  • a data processing apparatus wherein, when it is determined the subtitles are suitable for use as spoken subtitles, the circuitry is configured to cause output of the spoken subtitles with reproduction of the video.
  • the circuitry is configured to prevent output of the spoken subtitles with reproduction of the video.
  • a data processing apparatus according to clause 11, wherein: the subtitle data is provided in a Timed Text Markup Language, TTML, format and the predetermined indicator is a value of a parameter TTS_suitability; the first value of the predetermined indicator is 0x1; the second value of the predetermined indicator is 0x2; and the third value of the predetermined indicator is 0x0.
  • a television comprising a data processing apparatus according to any preceding clause.
  • a computer-implemented data processing method comprising: receiving content data comprising video data representing a video and subtitle data representing subtitles for the video; determining whether the content data comprises a predetermined indicator indicating a suitability of the subtitles for use as spoken subtitles; and if the content data does not comprise the predetermined indicator indicating the suitability of the subtitles for use as spoken subtitles: determining, based on one or more characteristics associated with content of the video, a suitability of the subtitles for use as spoken subtitles; and controlling output of spoken subtitles obtained using the subtitle data according to the determined suitability of the subtitles for use as spoken subtitles.
  • a machine-readable medium (in particular, a non-transitory machine-readable medium) carrying software, such as an optical disk, a magnetic disk, semiconductor memory or the like
  • the present disclosure should be understood to include a non-transitory storage medium comprising code components which cause a computer to perform any of the disclosed method(s).
  • Described embodiments may be implemented in any suitable form including hardware, software, firmware or any combination of these. Described embodiments may optionally be implemented at least partly as computer software running on one or more computer processors (e.g. data processors and/or digital signal processors).
  • the elements and components of any embodiment may be physically, functionally and logically implemented in any suitable way. Indeed, the functionality may be implemented in a single unit, in a plurality of units or as part of other functional units. As such, the disclosed embodiments may be implemented in a single unit or may be physically and functionally distributed between different units, circuitry and/or processors.
  • although the present disclosure has been described in connection with some embodiments, it is not intended to be limited to these embodiments. Additionally, although a feature may appear to be described in connection with particular embodiments, one skilled in the art would recognize that various features of the described embodiments may be combined in any manner suitable to implement the present disclosure.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Databases & Information Systems (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

A data processing apparatus comprising circuitry configured to: receive content data comprising video data representing a video and subtitle data representing subtitles for the video; determine whether the content data comprises a predetermined indicator indicating a suitability of the subtitles for use as spoken subtitles; and if the content data does not comprise the predetermined indicator indicating the suitability of the subtitles for use as spoken subtitles: determine, based on one or more characteristics associated with content of the video, a suitability of the subtitles for use as spoken subtitles; and control output of spoken subtitles obtained using the subtitle data according to the determined suitability of the subtitles for use as spoken subtitles.

Description

DATA PROCESSING APPARATUS AND METHOD
BACKGROUND
Field of the Disclosure
The present disclosure relates to a data processing apparatus and method.
Description of the Related Art
The “background” description provided is for the purpose of generally presenting the context of the disclosure. Work of the presently named inventors, to the extent it is described in the background section, as well as aspects of the description which may not otherwise qualify as prior art at the time of filing, are neither expressly nor impliedly admitted as prior art against the present disclosure.
Video content may be provided with subtitles to improve accessibility of the video content. This allows, for example, improved accessibility for people who are hard of hearing or people who cannot speak the original language of the video content.
In recent years, spoken subtitles have become increasingly used. These are delivered as audio output where the subtitles are read out. This is useful, for example, for people who do not speak the original language of the video content and who may also find it difficult to read the subtitles on-screen as the video content is played.
Subtitles may be provided as images (e.g. bitmap images) which are overlaid on the video content. To obtain the spoken subtitles, these images may be subjected to optical character recognition and the resulting text is then subject to text-to-speech (TTS) processing.
In recent years, however, an alternative method of providing subtitles as text has been developed. This is in the form of, for example, Timed Text Markup Language (TTML). This is part of the Digital Video Broadcasting (DVB) standard(s) (e.g. DVB TTML subtitles specification (EN 303 560)) and defines subtitles as text strings associated with respective sets of presentation timestamps. Each text string is then displayed at a time during playback of the video content in accordance with its associated set of presentation timestamps. In particular, display of the text string is started at a first timestamp (“begin” time stamp) and display of the text string is stopped at a second, later, timestamp (an “end” time stamp).
TTS may be performed on each text string during the time it is displayed. However, this may not always be appropriate. For example, if successive text strings are displayed very quickly to reflect very fast dialogue between characters (with characters interrupting each other or the like), it may not be possible for each text string to be spoken at a speed which is useful to the listener. Spoken subtitles may thus not be useful for all types of content.
There is therefore a desire for technology which is able to determine whether or not spoken subtitles are appropriate for given video content and to provide (or not provide) such spoken subtitles accordingly.
SUMMARY
The present disclosure is defined by the claims.
BRIEF DESCRIPTION OF THE DRAWINGS
Non-limiting embodiments and advantages of the present disclosure are explained with reference to the following detailed description taken in conjunction with the accompanying drawings, wherein:
Figs. 1A and 1B schematically show example data processing apparatuses;
Fig. 2 schematically shows a first example of subtitles;
Fig. 3 schematically shows an example text-to-speech engine;
Fig. 4 schematically shows a second example of subtitles;
Fig. 5 shows a table reproduced from a known subtitles specification;
Figs. 6A and 6B show example lookup tables;
Figs. 7A and 7B schematically show example screens displayed to a user;
Fig. 8 shows a first example method; and
Fig. 9 shows a second example method.
Like reference numerals designate identical or corresponding parts throughout the drawings.
DETAILED DESCRIPTION OF THE EMBODIMENTS
Fig. 1A shows a data processing apparatus / device (content providing apparatus / device) 100 for providing video content (e.g. via radio transmission or via a communications network such as the internet). Fig. 1B shows a data processing apparatus / device (content receiving apparatus / device) 111 for receiving the video content (e.g. via radio transmission or via a communications network such as the internet).
The content providing apparatus 100 comprises a processor 101 for executing electronic instructions, a memory 102 for storing the electronic instructions to be executed and electronic input and output information associated with the electronic instructions, a storage medium 103 (e.g. a hard disk drive or solid state drive) for long term storage of digital information, a communication interface 104 and a user interface 105 (e.g. a touch screen, a non-touch screen, buttons, a keyboard and/or a mouse) for receiving commands from and/or outputting information to a user. The communication interface 104 is for sending information to and/or receiving information from one or more other apparatuses. In particular, the communication interface 104 is configured to transmit video data representing the video content. The communication interface 104 thus comprises one or more of a radio transmitter and a network communication interface (e.g. an Ethernet interface), for example.
Each of the processor 101, memory 102, storage medium 103, communication interface 104 and user interface 105 is implemented using appropriate circuitry, for example. The processor 101 controls the operation of each of the memory 102, storage medium 103, communication interface 104 and user interface 105.
The content receiving apparatus 111 comprises a processor 106 for executing electronic instructions, a memory 107 for storing the electronic instructions to be executed and electronic input and output information associated with the electronic instructions, a storage medium 108 (e.g. a hard disk drive or solid state drive) for long term storage of digital information, a communication interface 109 and a user interface 110 (e.g. a touch screen, a non-touch screen, buttons, a keyboard and/or a mouse) for receiving commands from and/or outputting information to a user. The communication interface 109 is for sending information to and/or receiving information from one or more other apparatuses. In particular, the communication interface 109 is configured to receive video data representing the video content. The communication interface 109 thus comprises one or more of a radio receiver and a network communication interface (e.g. an Ethernet interface).
Each of the processor 106, memory 107, storage medium 108, communication interface 109 and user interface 110 is implemented using appropriate circuitry, for example. The processor 106 controls the operation of each of the memory 107, storage medium 108, communication interface 109 and user interface 110.
Fig. 2 shows an example use of TTML. The example has been simplified for ease of explanation (and thus shows only a simplified portion of the TTML code).
In this example, the content receiving apparatus 111 is (or is comprised within) a television (TV) 111 which comprises an electronic display 201 (e.g. a liquid crystal display (LCD) or organic light emitting diode (OLED) display) and a sound output device 204. In this example, the sound output device 204 is a loudspeaker integrated as part of the TV 111. However, the sound output device 204 may take any other suitable form, such as a sound bar, surround sound system, headphones or assistive technology such as a hearing aid connected to the TV 111. The display 201 and sound output device 204 are part of the user interface 110.
Video content is displayed on the display 201 with subtitles 202 overlaid on the video image. Subtitles are defined by subtitle data. For example, subtitle data represents each instance of subtitle text as a text string and, for each instance of subtitle text, indicates when it is to be displayed during reproduction of the video content. For instance, each text string in the subtitle data is associated with a respective first time (“begin” time) and second time (“end” time). The subtitle text defined by that text string is then displayed between the associated first and second times during reproduction of the video content.
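By way of illustration only, a much-simplified TTML-style fragment and a sketch of how the begin and end times might be read out of it are shown below. The exact element names, namespaces and timing attributes of a real DVB TTML document are not reproduced here; this is an editorial sketch, not the format mandated by the specification.

```python
import xml.etree.ElementTree as ET

# Much-simplified, illustrative TTML-style fragment: each <p> element carries a
# "begin" and an "end" timestamp and the subtitle text displayed between them.
SIMPLIFIED_TTML = """
<tt>
  <body>
    <div>
      <p begin="001" end="004">Construction of the wall began in 1961.</p>
      <p begin="005" end="007">Until it fell in 1989,</p>
    </div>
  </body>
</tt>
"""

def extract_subtitle_instances(ttml_text):
    """Return a list of (begin, end, text) tuples, one per subtitle instance."""
    root = ET.fromstring(ttml_text)
    return [(p.get("begin"), p.get("end"), (p.text or "").strip())
            for p in root.iter("p")]

for begin, end, text in extract_subtitle_instances(SIMPLIFIED_TTML):
    print(begin, end, text)
```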
In particular, looking at the TTML code 203, a first instance of subtitle text “Construction of the wall began in 1961.” is displayed between time 001 and 004 (the time counter incrementing in seconds, for example). This instance of subtitle text is the one shown displayed on the display 201. A second instance of subtitle text “Until it fell in 1989,” is displayed between time 005 and 007. A third instance of subtitle text “the only way through was via various” is displayed between time 008 and 011. A fourth instance of subtitle text “official checkpoints, most notably” is displayed between time 012 and 015.
The video content in this example is a history documentary and, as apparent from the TTML code 203, each instance of subtitle text is displayed over several seconds and there is no temporal overlap between the display of successive instances of subtitle text. This is because the subtitle text is representing the dialogue of a single narrator. Such subtitles are more likely to be appropriate for spoken output. This is because the display of each instance of subtitle text over several seconds means there is sufficient time for the words to be spoken at an understandable speed. Furthermore, since there is no temporal overlap of text, each instance of subtitle text is spoken to completion before the next instance of subtitle text begins.
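These two properties (sufficient display time and no temporal overlap) can be checked mechanically. The following is a minimal sketch assuming subtitle instances have already been extracted as (begin, end, text) tuples with times in seconds; the two-second minimum is an illustrative figure, not a value taken from the disclosure or any standard.

```python
def subtitles_look_speakable(instances, min_display_seconds=2.0):
    """Heuristic sketch: subtitles are a better candidate for spoken output if
    each instance is displayed long enough to be read aloud at a useful pace
    and successive instances do not overlap in time."""
    ordered = sorted(instances, key=lambda inst: inst[0])
    for i, (begin, end, _text) in enumerate(ordered):
        if end - begin < min_display_seconds:
            return False  # too little time to speak the words understandably
        if i + 1 < len(ordered) and ordered[i + 1][0] < end:
            return False  # the next instance starts before this one has ended
    return True

# Fig. 2 style subtitles (single narrator, no overlap, several seconds each):
documentary = [(1, 4, "Construction of the wall began in 1961."),
               (5, 7, "Until it fell in 1989,")]
print(subtitles_look_speakable(documentary))  # True
```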
Fig. 3 illustrates example processing of each instance of subtitle text to enable the subtitle text to be spoken (that is, read aloud to the user via the sound output device 204). Each instance of subtitle text is a text string 301 input to a text-to-speech (TTS) engine 302. The TTS engine 302 is executed by the processor 106, for example, and may be any suitable TTS engine known in the art. The output of the TTS engine 302 is a TTS output 303. The TTS output 303 is a digital audio signal which is reproduced by the sound output device 204.
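In code, the processing of Fig. 3 amounts to a simple pipeline. The sketch below is illustrative only: tts_engine and sound_output stand in for whatever TTS engine and audio device the content receiving apparatus actually uses, and the method names are assumptions rather than a real API.

```python
def speak_subtitle(text_string, tts_engine, sound_output):
    """Fig. 3 pipeline sketch: text string -> TTS engine -> digital audio signal
    reproduced by the sound output device."""
    tts_output = tts_engine.synthesise(text_string)  # hypothetical engine call
    sound_output.play(tts_output)                    # hypothetical playback call
```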
Fig. 4 shows another example use of TTML. The example has again been simplified for ease of explanation (and thus shows only a simplified portion of the TTML code).
This time, looking at the TTML code 402, a first instance of subtitle text “I'm talking to you!” is displayed between time 002 and 004. A second instance of subtitle text “I'm interrupting you back!” is displayed between time 003 and 006. A third instance of subtitle text “STOP ARGUING!!” is displayed between time 004 and 005. A fourth instance of subtitle text “##Crashing Noise##” is displayed between time 004 and 006. The fourth instance of subtitle text represents a sound effect (rather than dialogue) and is usually found in subtitles designed for people who are hard of hearing. In this example, the fact it is a sound effect is denoted by the use of the “#” symbols before and after the text “Crashing Noise” describing the sound effect. The video content in this example is a drama (such as a soap opera) and, as apparent from the TTML code 402, there is a temporal overlap between the display of successive instances of subtitle text. This is because the subtitle text represents dialogue between a plurality of characters who are talking quickly and interrupting each other during an argument. Because of this overlap, several instances of subtitle text 401 must be displayed simultaneously on the display 201 during these overlaps (e.g. at time 004 in Fig. 4). An extra parameter “col” is provided in the TTML code 402 indicating a displayed colour of each text string (in this example, yellow for “I'm talking to you!”, blue for “I'm interrupting you back!”, green for “STOP ARGUING!!” and orange for “##Crashing Noise##”). This helps to distinguish the dialogue associated with different characters. In an example, it is envisaged that different colours may be associated with different characters in the TTML code so the same colour is used for the same character throughout the video content.
Such subtitles are less likely to be appropriate for spoken output. This is because the display time of at least some of the instances of subtitle text is much shorter (e.g. “STOP ARGUING!!” is only displayed for one second), meaning there may not be sufficient time for the words to be output by the TTS engine 302 at an understandable speed (taking into account that the original dialogue may be, for example, too quick or accented for the listener to understand, which is why they have chosen to use the spoken subtitle functionality) before the next instance of subtitle text is started. Together with the temporal overlap of text, this means the speech output of the TTS engine 302 may fall behind and lose temporal alignment with the dialogue of the characters. For instance, the four instances of subtitle text shown in Fig. 4 are displayed for 4 seconds (from time 002 to 006). If, at a reasonable talking speed, it takes more than 4 seconds for the TTS engine 302 to output speech representing these four instances of subtitle text, temporal alignment will be lost.
There may also be a problem with the subtitle text representing the sound effect “##Crashing Noise##”. Although the displayed subtitle “##Crashing Noise##” may be useful to a person who is hard of hearing (since they may have difficulty hearing the noise to which this relates in the audio of the video content itself), the spoken version of this text may be less useful. For example, although a person who chooses to use spoken subtitles may do so because they find it useful for dialogue (e.g. because the original content is in a language they do not understand), they will still be able to hear and understand the sound effects. Having a TTS output of “##Crashing Noise##” is thus likely to be more disruptive than helpful.
In general, symbols such as “#” (or other symbols such as a musical quaver symbol or the like) may be used to represent sound effects or music in subtitles. In such scenarios, having a TTS output of the subtitle wording provided with such symbols may not be appropriate. For example, as well as a TTS output of sound effect subtitles potentially being disruptive to the user’s experience (as mentioned above), many users may not find it helpful for subtitles corresponding to lyrics of a song (denoted using a quaver symbol or the like, for example) to be subject to TTS output, since such TTS output will be undesirably heard over the voice of the singer of the song. A TTS output of the symbols themselves (e.g. an audio output of “quaver symbol” or “hash symbol”) is also likely to be undesirable.
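One minimal way of filtering such text before it reaches the TTS engine is sketched below, assuming sound effects are delimited by “#” characters and lyrics by a quaver symbol as described above; real subtitle streams may mark these differently, so the patterns are illustrative only.

```python
import re

# Remove spans delimited by '#' (sound effects) or '♪' (lyrics/music), together
# with the delimiter symbols themselves, so neither is passed to the TTS engine.
_NON_DIALOGUE = re.compile(r"#+[^#]*#+|♪[^♪]*♪?")

def dialogue_only(subtitle_text):
    return _NON_DIALOGUE.sub("", subtitle_text).strip()

print(repr(dialogue_only("##Crashing Noise##")))   # '' -> nothing is spoken
print(repr(dialogue_only("I'm talking to you!")))  # dialogue passes through unchanged
```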
Figs. 2 and 4 thus demonstrate different types of video content where, for the video content of Fig. 2 (history documentary), spoken subtitles may be appropriate, whereas for the video content of Fig. 4 (soap opera), spoken subtitles may not be appropriate.
It is desirable to be able to determine whether or not spoken subtitles are appropriate in advance (that is, before a user has attempted to use spoken subtitles when they are not appropriate and has hence had a negative experience). The DVB standard(s) (e.g. DVB TTML subtitles specification (EN 303 560)) provide one way of doing this, as exemplified in Fig. 5.
Fig. 5 shows a table reproduced from the DVB TTML subtitles specification (EN 303 560). This table is carried in the “TTML_subtitling_descriptor” and shows possible values of the parameter “TTS_suitability”. The TTS_suitability parameter is signalled to the content receiving device 111 by the content providing device 100 and indicates whether the TTML subtitles provided with the video content are suitable for use as spoken subtitles via TTS. In particular, TTS_suitability can take a value of 0x0 (indicating the suitability for use as spoken subtitles is unknown), 0x1 (indicating the subtitles are suitable for use as spoken subtitles) or 0x2 (indicating the subtitles are not suitable for use as spoken subtitles). The value 0x3 is reserved for future use.
Thus, if TTS_suitability = 0x1 or 0x2, the content receiving device 111 knows whether or not the TTML subtitles are suitable for use as spoken subtitles. For instance, the subtitles of Fig. 2 may be associated with TTS_suitability = 0x1 and the subtitles of Fig. 4 may be associated with TTS_suitability = 0x2.
A problem, however, is that this relies on the provider of the content indicating, specifically, a value 0x1 or 0x2 based on an assessment of the suitability of the TTML subtitles for use as spoken subtitles. If the provider has not done this (and, instead, simply sets TTS_suitability = 0x0), there is thus no way for the content receiving device 111 to determine whether or not the subtitles are suitable for use as spoken subtitles. The decision is thus left up to the user and the user experience may be negatively affected if spoken subtitle functionality of the content receiving device 111 is activated but the subtitles are not suitable for this (e.g. if the subtitles are like those of Fig. 4 rather than those of Fig. 2).
In an example, the present technique helps address this problem through an override mechanism of the indicated TTS_suitability value when the received value is TTS_suitability = 0x0. This is exemplified in Figs. 6A and 6B. Fig. 6A shows a first example in which existing metadata provided with video content indicating the genre of the video content is used to determine the suitability of TTML subtitles being used as spoken subtitles.
In this case, when the indicated TTS_suitability value = 0x0, the genre (which takes one of a plurality of predetermined values, for example, “News”, “Documentary”, “Drama”, “Sport”, etc.) indicated in the metadata of the video content is looked up in a lookup table to find a corresponding TTS_suitability override value associated with that genre.
Fig. 6A shows an example of such a lookup table. Here, the genres “News” and “Documentary” (which, typically, are expected to have only a single person talking at a time at a steady pace and are thus more likely to have TTML subtitles suitable for use as spoken subtitles) have a TTS_suitability override value = 0x1. On the other hand, the genres “Drama” and “Sport” (which, typically, are expected to have multiple people talking and/or interrupting each other and talking at a faster pace and are thus less likely to have TTML subtitles suitable for use as spoken subtitles) have a TTS_suitability override value = 0x2.
Thus, even if the TTS_suitability value provided for the video content is 0x0, based on other information about the video content, the content receiving device 111 is able to determine a likelihood of the TTML subtitles provided with the video content being suitable for use as spoken subtitles. If a user has selected the use of spoken subtitles for all video content where suitable (e.g. through a general interactive settings menu provided via user interface 110), the content receiving device 111 may thus output spoken subtitles for “News” or “Documentary” video content but not for “Drama” or “Sport” video content. Spoken subtitles are thus provided only when appropriate to do so, even if the content provider has not explicitly indicated this using the TTS_suitability value.
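A sketch of this genre-based fallback is given below. The genre strings and override values mirror the example lookup table of Fig. 6A, but the constant and function names are editorial assumptions rather than anything defined in the DVB specification.

```python
TTS_SUITABLE, TTS_UNSUITABLE, TTS_UNKNOWN = 0x1, 0x2, 0x0

# Fig. 6A style lookup table: genre -> TTS_suitability override value.
GENRE_OVERRIDES = {
    "News": TTS_SUITABLE,
    "Documentary": TTS_SUITABLE,
    "Drama": TTS_UNSUITABLE,
    "Sport": TTS_UNSUITABLE,
}

def genre_override(signalled_value, genre):
    """Use the signalled TTS_suitability when it is decisive (0x1 or 0x2);
    otherwise fall back to the genre-based override, defaulting to 'unknown'
    for genres that are not listed."""
    if signalled_value in (TTS_SUITABLE, TTS_UNSUITABLE):
        return signalled_value
    return GENRE_OVERRIDES.get(genre, TTS_UNKNOWN)

print(hex(genre_override(TTS_UNKNOWN, "Documentary")))  # 0x1 -> spoken subtitles allowed
```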
Fig. 6B shows a second example which builds on the first example of Fig. 6A. Here, as well as metadata indicating the genre of the video content being used to determine the TTS_suitability override value, additional metadata “Program Name” indicating a title of the video content is also used. This allows further granularity in determining the suitability of spoken subtitles for the video content.
Fig. 6B shows an example of a lookup table which enables this. In this specific example, the genres “News” and “Documentary” are always expected to be suitable for use with spoken subtitles, and thus there is no further granulation of TTS_suitability override values. On the other hand, it is expected that different types of video content within the genre “Drama” or within the genre “Sport” may differ in their suitability for use with spoken subtitles.
For example, here, two types of video content with the genre “Drama” are shown. The first one has title “Nordic Noir” and is, for example, a crime-based drama with slow and considered dialogue. It is thus determined that such a drama, unlike a soap opera or the like (with multiple characters talking quickly and potentially interrupting each other during arguments), is suitable for use with spoken subtitles. The TTS_suitability override value is thus set to 0x1. On the other hand, the second “Drama” content has the title “Conversation Street” and is, indeed, a soap opera. It is thus determined that such a drama is not suitable for use with spoken subtitles. The TTS_suitability override value is thus set to 0x2.
Furthermore, in this example, two types of video content with the genre “Sport” are shown. The first one has title “Snooker”. This is a slow-paced game with correspondingly slow-paced commentary. It is thus determined that such commentary is suitable for use with spoken subtitles. The TTS_suitability override value is thus set to 0x1. On the other hand, the second “Sport” content has the title “Horse Racing”. This is a fast-paced sport with correspondingly fast-paced commentary. It is thus determined that such commentary is not suitable for use with spoken subtitles. The TTS_suitability override value is thus set to 0x2.
The example of Fig. 6B thus provides further granularity in the determination of whether or not a particular instance of video content has TTML subtitles which are suitable for use as spoken subtitles. This helps improve the likelihood of spoken subtitles being provided to the user when they are more likely to be appropriate and reduce the likelihood of spoken subtitles being provided to the user when they are less likely to be appropriate.
In an example, the lookup tables of Figs. 6A and 6B are stored in advance in the storage medium 108 of the content receiving device 111. Alternatively, the lookup tables may be stored on a remote server (not shown) and accessible to the content receiving device 111 via the communication interface 109 (e.g. over a network such as the internet). In either case, the lookup tables (in particular, the lookup table of Fig. 6B) may be updatable. For instance, when a new drama (with a new “Program Name”) is introduced by the content provider, this new drama may be added to the lookup table of Fig. 6B to indicate whether or not the new drama is suitable for use with spoken subtitles.
The generation and update of the lookup table of Fig. 6B may be carried out by any suitable party (e.g. manufacturer of the content receiving device 111 or even a user of the content receiving device 111) independently of the content provider (who provides the original TTS_suitability value with the content). This allows, for example, the content receiving device 111 to determine the spoken subtitle suitability of received video content even when the content provider itself has simply set TTS_suitability = 0x0.
In an example, multiple users of multiple content receiving devices 111 are provided with the ability to update the lookup table of Fig. 6B (e.g. when stored centrally on a remote server (not shown) and accessible to all content receiving devices). This allows information and experience from many users to be used to determine the suitability of content with a particular “Program Name” for use with spoken subtitles, thereby helping to ensure the lookup table contains the most appropriate TTS_suitability override value for such content. In an example, if a user selects (e.g. using a settings menu or the like of content receiving apparatus 111) the output of spoken subtitles for content with a particular “Program Name” then, after viewing of the content has ended (e.g. because the broadcast of the content has reached its end time, because the user has stopped or paused streaming of the content or because the user has switched to viewing different content), a screen like that of Fig. 7A is shown to the user.
The screen includes a textual message 601 asking about the quality of the spoken subtitles for the content, a first virtual button 602 indicating the quality was acceptable and a second virtual button 603 indicating the quality was not acceptable. If the user selects the first virtual button 602, the TTS_suitability override value for the content is recorded as 0x1. On the other hand, if the user selects the second virtual button 603, the TTS_suitability override value for the content is recorded as 0x2. The recorded values are examples of feedback data provided by the user.
This may be implemented for video content with a “Program Name” not yet included in the lookup table (in which case, a new row is added to the lookup table). This allows the TTS_suitability override values in the lookup table to be built up over time as new video content is produced. It may also be implemented for video content with a “Program Name” already included in the lookup table. This allows existing TTS_suitability override values in the lookup table to be updated as the nature of content changes over time, thus making sure the TTS_suitability override values remain up-to-date and appropriate.
Thus, for example, if a first season of the drama “Nordic Noir” initially only has slow-paced dialogue then, based on user feedback, a new entry for “Nordic Noir” may be added to the lookup table with TTS_suitability override = 0x1 (i.e. suitable for spoken subtitles). On the other hand, if the second season contains much faster dialogue with multiple characters interrupting each other then, based on user feedback, the existing entry for “Nordic Noir” may be updated to have TTS_suitability override = 0x2 (i.e. not suitable for spoken subtitles).
In an example, if the lookup table is stored centrally on a remote server (not shown) and is accessible to all users (this may be referred to as a global or general lookup table), the TTS_suitability override value for a given row of the lookup table at any given time may be determined as an average value (e.g. the modal value) based on all users who provide feedback for that value. For example, if 5 users have provided feedback with 3 users selecting the first virtual button 602 (indicating the TTS_suitability override value should be 0x1) and 2 users selecting the second virtual button 603 (indicating the TTS_suitability override value should be 0x2), the decision follows the majority (so the TTS_suitability override value is set to 0x1, in this case). This is then continuously updated as more users provide feedback. It will be appreciated that using the average is only an example and any suitable statistical analysis may be used to determine when to add or update the TTS_suitability override value of particular video content in the lookup table. In another example, the lookup table may be user specific. For example, each content receiving device 111 may locally store a lookup table or the remote server (not shown) may store a plurality of lookup tables each associated with a respective user (e.g. stored as part of an online profile of that user). This takes into account that different users may have different preferences regarding the suitability of spoken subtitles.
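Taking the modal value over the feedback received so far might be implemented roughly as follows; this is a sketch under the "majority wins" rule described above, and a real implementation might, for instance, require a minimum number of responses before trusting the result.

```python
from collections import Counter

def aggregate_override(feedback_values):
    """Return the most common TTS_suitability override value reported by users.
    Ties simply favour the value seen first; any other statistical rule could
    be substituted, as noted above."""
    return Counter(feedback_values).most_common(1)[0][0]

# 3 users pressed button 602 (0x1, acceptable), 2 pressed button 603 (0x2):
print(hex(aggregate_override([0x1, 0x1, 0x2, 0x1, 0x2])))  # 0x1
```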
For instance, a first user may find spoken subtitles which become delayed during fast-paced and/or multi-person dialogue very stressful and may thus prefer spoken subtitles not to be activated for content such as “Conversation Street” and “Horse Racing”. In this case, based on feedback from the first user, the TTS_suitability override values for this content in their user-specific lookup table are set as 0x2. On the other hand, another user may not find the delay stressful and find they still benefit from the spoken subtitles (e.g. since they provide dialogue at a slower, more understandable pace than the original dialogue). In this case, based on feedback from the second user, the TTS_suitability override values for this content in their user-specific lookup table are set as 0x1.
These examples may be combined so that a user-specific lookup table is initially a copy of a generic lookup table made available to all users but, based on individual user feedback, that copy (only accessible to the specific user it is associated with) is updated accordingly. Thus, for instance, for the first user mentioned above, “Conversation Street” and “Horse Racing” may initially both be associated with TTS_suitability override values of 0x2, since these are the values provided in the generic lookup table. However, if the first user then watches, say, “Horse Racing”, turns on spoken subtitles and provides feedback indicating the spoken subtitles were acceptable (e.g. by selecting first virtual button 602), their user-specific lookup table is updated so the TTS_suitability override value for “Horse Racing” is changed to 0x1. The TTS_suitability override value for “Horse Racing” in the generic lookup table, however, remains as 0x2 (although the first user’s feedback may still be used with feedback from other users to update the generic lookup table if this is statistically acceptable).
This allows all users to initially have access to all TTS_suitability override values in the generic lookup table (so, for example, spoken subtitles are only provided if appropriate based on the preferences of the majority of users). Those values are then adapted over time to reflect each specific user’s preferences. This provides a bespoke solution for spoken subtitles based on user preferences, thereby helping provide an enhanced user experience.
In an example, the user-specific lookup table for a particular user (stored locally on a content receiving apparatus 111 owned by the user and/or on a remote server (not shown)) may be provided as part of data defining a user profile of the user. This user profile data may indicate other information. For example, it may indicate one or more preferred characteristics of voice(s) used for the spoken subtitle output (implemented using Speech Synthesis Markup Language (SSML), for example). Such characteristics could include, for example, whether a voice is male or female, whether the voice has a particular accent and/or any shift in frequency of the voice (e.g. to assist a user who is not able to hear some frequencies in a standard frequency range of the voice).
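Where SSML is used for the spoken output, such profile preferences could be applied along the following lines. The profile keys, default voice name and the choice of SSML tags are illustrative assumptions; they are not taken from the disclosure or from any particular TTS engine.

```python
def wrap_in_ssml(text, profile):
    """Wrap subtitle text in SSML reflecting user-profile voice preferences:
    a preferred voice name and an optional pitch shift (e.g. "+10%")."""
    voice = profile.get("voice_name", "en-GB-standard-voice")  # assumed profile key
    pitch = profile.get("pitch_shift", "+0%")                  # assumed profile key
    return (f'<speak><voice name="{voice}">'
            f'<prosody pitch="{pitch}">{text}</prosody>'
            "</voice></speak>")

print(wrap_in_ssml("Construction of the wall began in 1961.",
                   {"voice_name": "en-GB-male-2", "pitch_shift": "+10%"}))
```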
Characteristics of the voice(s) used for spoken subtitle output may also be adjusted according to other information. For instance, different colour codes in TTML may be linked to respective male or female voices. Alternatively, based on a determination of whether a character within the vicinity of the location on the screen of an output instance of subtitle text is male or female (this being determined based on video metadata and/or a suitable image classification technique, for example), a male or female voice may be output.
The user profile data may also indicate a preferred set of voices (each with its own respective set of characteristic(s)) to use for respective speakers when the spoken subtitles include dialogue between two or more characters. If the sound output device 204 is configured to provide spatial sound effects, a preferred apparent source location of such voices in the room may also be indicated. For instance, a user may prefer all voices to appear to come from the same apparent location in the room (e.g. the centre of the room) or for the voices of multiple people having a conversation to appear to come from different respective locations in the room.
The user profile data may also indicate, for example, information on a hearing profile of a user. For instance, if a user is hard of hearing in their left ear, the user profile data may indicate that sound (including spoken subtitles) should be biased towards the user’s right ear (e.g. by adjusting the balance of the sound output towards the right (rather than left) channel for stereo output). More complex aural characteristics of the user (e.g. ear shape and position and/or an audiogram of the user) may also be indicated by the user profile data and used (e.g. with any suitable existing methods) to tailor the spoken subtitle output to the requirements of the user.
The lookup table may include further information (one or more further columns indicating respective further types of metadata provided with the video content) to further granulate the TTS_suitability override values for a given content genre (e.g. “Drama”) and title (e.g. “Nordic Noir”). For example, different season / series numbers, different episode numbers and even different parts (as indicated by start and end times) of a single episode of a TV drama may each have a respective TTS_suitability override value.
The required spoken word output rate may also be considered. For example, based on the “begin” and “end” time of each instance of subtitle text and the number of words in that instance of subtitle text, the rate at which words must be spoken for that instance of subtitle text may be calculated (by dividing the number of words by the time period between the “begin” and “end” times). This can be averaged over all instances of subtitle text in the video (or a part of the video) to determine an average required spoken word output rate. If the average required spoken word output rate exceeds a predetermined threshold (determined according to the maximum acceptable spoken word output rate of TTS subtitles, e.g. 100 words per minute), it is determined that the subtitles are not suitable for use as spoken subtitles (and the TTS_suitability override value = 0x2). On the other hand, if the average required spoken word output rate is less than or equal to the predetermined threshold, it is determined that the subtitles are suitable for use as spoken subtitles (and the TTS_suitability override value = 0x1).
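The calculation described in this paragraph is straightforward to express in code. The sketch below assumes subtitle instances as (begin, end, text) tuples with times in seconds and uses the 100 words-per-minute figure mentioned above purely as an example threshold.

```python
def average_required_word_rate(instances):
    """Average words-per-minute rate needed to speak each subtitle instance
    within its own display window; 'instances' are (begin_s, end_s, text)."""
    rates = [len(text.split()) / ((end - begin) / 60.0)
             for begin, end, text in instances if end > begin]
    return sum(rates) / len(rates)

def word_rate_override(instances, max_words_per_minute=100):
    """Return 0x1 (suitable) if the average required rate is within the
    threshold, otherwise 0x2 (not suitable)."""
    return 0x1 if average_required_word_rate(instances) <= max_words_per_minute else 0x2

# Fig. 4 style fast dialogue: roughly 107 words per minute on average -> 0x2.
fast_dialogue = [(2, 4, "I'm talking to you!"),
                 (3, 6, "I'm interrupting you back!"),
                 (4, 5, "STOP ARGUING!!")]
print(hex(word_rate_override(fast_dialogue)))  # 0x2
```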
In an example, the TTS functionality of the content receiving device may also provide a user setting for controlling the speed of spoken subtitle output (for example Very Slow, Slow, Medium, Fast and Very Fast). In embodiments, the setting of the predetermined threshold may take that ability into consideration. For example, for user control, the predetermined threshold will be set to a predetermined lowest value for the selectable TTS output speed “Very Slow” and to a predetermined highest value for the selectable TTS output speed “Very Fast”. The predetermined threshold may also be adjusted automatically, for example, being higher for denser subtitles (that is, a higher number of words in a single subtitle instance) and lower for less dense subtitles (that is, a lower number of words in a single subtitle instance).
The present technique thus provides significant flexibility in how TTS_suitability override values may be defined.
Thus, in examples of the present technique, if the TTS_suitability value provided with the video content by the video content provider is 0x1 or 0x2, spoken subtitles are determined to be suitable or not suitable for the content. On the other hand, if the TTS_suitability value = 0x0, then other metadata (e.g. “Program Name”) provided with the video content by the video content provider is looked up in a lookup table to determine a TTS_suitability override value. The TTS_suitability override value is either 0x1 or 0x2 and spoken subtitles are thus either determined to be suitable (in the case of 0x1) or not suitable (in the case of 0x2) accordingly.
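Putting these rules together, the decision could be sketched as below, using a Fig. 6B style lookup keyed on genre and programme name, with a genre-only fallback. The table contents echo the examples above; the function and variable names are illustrative only.

```python
# Fig. 6B style lookup: (genre, "Program Name") -> TTS_suitability override value.
PROGRAMME_OVERRIDES = {
    ("Drama", "Nordic Noir"): 0x1,
    ("Drama", "Conversation Street"): 0x2,
    ("Sport", "Snooker"): 0x1,
    ("Sport", "Horse Racing"): 0x2,
}
GENRE_ONLY_OVERRIDES = {"News": 0x1, "Documentary": 0x1}

def effective_suitability(tts_suitability, genre, programme_name):
    """Return the value that governs spoken-subtitle output: the signalled
    TTS_suitability if it is decisive (0x1 or 0x2), otherwise an override from
    the lookup tables, otherwise 0x0 (unknown)."""
    if tts_suitability in (0x1, 0x2):
        return tts_suitability
    if (genre, programme_name) in PROGRAMME_OVERRIDES:
        return PROGRAMME_OVERRIDES[(genre, programme_name)]
    return GENRE_ONLY_OVERRIDES.get(genre, 0x0)

print(hex(effective_suitability(0x0, "Drama", "Nordic Noir")))  # 0x1 -> output spoken subtitles
```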
In an example, a user may control the content receiving device 111 (e.g. using commands issued via the user interface 110, e.g. via a remote control or voice commands) to output spoken subtitles for all content automatically or control the content receiving device 111 to output spoken subtitles each time they view content for which this is desired. In either case, for a given piece of video content, if the content is deemed suitable for use with spoken subtitles (because the TTS_suitability value or TTS_suitability override value = 0x1), then the spoken subtitles are output. On the other hand, if the content is not deemed suitable for use with spoken subtitles (because the TTS_suitability value or TTS_suitability override value = 0x2), then the spoken subtitles are either not output or are only output after a warning (e.g. an on-screen visual warning and/or audio warning) has been provided to the user. This helps ensure that spoken subtitles are only provided when they are likely to be appropriate and helpful to a user or that, at least, the user is made aware if this is not the case. A user may also be provided with a warning if spoken subtitles are to be output but this is based on the TTS_suitability override value being set as 0x1 rather than the TTS_suitability value being set as 0x1. In this case, the content provider itself has not indicated the content is suitable for use with spoken subtitles. Rather, this has been determined based on, for example, other metadata of the content and user feedback (e.g. as exemplified in the lookup table of Fig. 6B). There may thus remain a risk that, for a particular user, spoken subtitles may not be appropriate or helpful to the user. This risk is higher when, for instance, a TTS_suitability override value for new video content has been added to the lookup table by only a single user.
Fig. 7B shows an example of such a warning. This is an on-screen warning displayed on the display 201 (but may, instead or in addition, take the form of an audio warning or the like). The screen includes a textual message 604 warning the user that the content provider has not indicated this particular video content is suitable for spoken subtitles and asking the user whether they wish to proceed, a first virtual button 605 indicating the user does wish to proceed with spoken subtitles, and a second virtual button 606 indicating the user does not wish to proceed with spoken subtitles. If the user selects the first virtual button 605, the content is reproduced with spoken subtitles. On the other hand, if the user selects the second virtual button 606, the content is reproduced without spoken subtitles. This helps maintain transparency and flexibility for the user regarding the suitability and output of spoken subtitles.
A similar warning to that shown in Fig. 7B may be displayed if it is determined the subtitles are not suitable for use as spoken subtitles (e.g. if TTS_suitability or TTS_suitability override = 0x2). In this case, the message 604 may read “It has been indicated this content may not be suitable for spoken subtitles. Proceed?”. The user may then select either virtual button 605 to proceed with spoken subtitles (in which case, spoken subtitles are provided with reproduction of the video content) or virtual button 606 to proceed without spoken subtitles (in which case, the video content is reproduced without the spoken subtitles). This provides increased flexibility to the user. Alternatively, if it is determined the subtitles are not suitable for use as spoken subtitles, the video content may simply be reproduced without the spoken subtitles, even if the user has previously indicated they desire spoken subtitles. In this case, when the user indicates (via a settings menu or the like) they wish spoken subtitles to be provided (e.g. in general for all content), the user may be provided with information indicating spoken subtitles will only be provided when the content receiving device determines the use of spoken subtitles is appropriate (e.g. when TTS_suitability or TTS_suitability override = 0x1).
Although the messages 601 and 604 of Figs. 7A and 7B are exemplified as textual messages, the present technology is not limited to this. For example, the messages 601 and 604 may be provided as spoken audio output. Furthermore, to assist users who are not familiar with the user interface (e.g. with selecting from multiple virtual buttons 602 and 603 or 605 and 606), a single virtual “OK” button (or reference to a physical “OK” button on a remote control (not shown) or the like) may be provided and the spoken audio message will instruct the user to press “OK” if, for example, the spoken subtitles worked well (thereby replacing the virtual buttons 602 and 603 in Fig. 7A) or if they wish to proceed with using spoken subtitles even though the content provider has not indicated spoken subtitles are suitable for use with the content (thereby replacing the virtual buttons 605 and 606 in Fig. 7B). In an example, if the user selects “OK” within a predetermined time period (e.g. 5 or 10 seconds), the answer is determined to be in the affirmative (e.g. equivalent to selecting virtual button 602 or 605). On the other hand, if the user does not select “OK” within the predetermined time period, the answer is determined to be in the negative (e.g. equivalent to selecting virtual button 603 or 606).
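Purely as an illustration of the timed confirmation, the following sketch treats the absence of an “OK” press within the predetermined period as a negative answer; the event source representing the remote control is hypothetical:

import threading

def confirm_with_ok(ok_pressed: threading.Event, timeout_seconds: float = 10.0) -> bool:
    # ok_pressed is assumed to be set by the user-interface layer when "OK" is pressed;
    # the answer is affirmative only if it is set within the predetermined time period
    return ok_pressed.wait(timeout=timeout_seconds)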
In the above examples, the term “subtitles” has been used. This is intended to cover all methods of representing dialogue (e.g. spoken dialogue or dialogue expressed through sign language or the like) and/or sounds (e.g. sound effects or music) through words visually overlaid on video content. Subtitles can be adapted for various purposes.
One purpose is to provide translation of dialogue so people who don’t understand the language of spoken dialogue in the video content can read a translation of that dialogue in a language they do understand. Subtitles adapted for this purpose may not describe non-dialogue sounds. For example, the subtitle “##Crashing Noise##” of Fig. 4 may not be included in subtitles intended to facilitate translation, since viewers who cannot understand the spoken language of the dialogue will nonetheless still be able to hear audible sound effects of the video content. Subtitles adapted for this purpose may be referred to as “speech only” subtitles.
Another purpose is to improve accessibility for users who are hard of hearing and thus cannot hear the spoken dialogue of the video content. Unlike subtitles intended for translation, subtitles adapted for this purpose may describe non-dialogue sounds (such as the subtitle “##Crashing Noise##” of Fig. 4) to benefit those users. Subtitles adapted for this purpose may be referred to as “captions” or “hard of hearing (HoH)” subtitles.
As previously explained, subtitles including descriptions of non-dialogue sounds may not be appropriate for use as spoken subtitles. Sometimes, however, when video content is provided with several selectable sets of subtitles (for example, “English”, “French” and “German”), it is assumed that the set of subtitles provided in the same language as that of the spoken dialogue of the video content (e.g. the English subtitles) is for the benefit of users who are hard of hearing. This set of subtitles thus contains descriptions of non-dialogue sounds. On the other hand, it is assumed that the sets of subtitles provided in one or more different languages (e.g. the French or German subtitles) are for the benefit of users who are not hard of hearing but who cannot understand the language of the spoken dialogue of the video content. These sets of subtitles thus do not contain descriptions of non-dialogue sounds. Thus, while the alternative language (e.g. French and German) subtitles of the video content may be appropriate for use as spoken subtitles, the spoken dialogue language (e.g. English) subtitles may not be. In an example, the present technique helps address this problem as illustrated by the method shown in Fig. 8. The method is executed by the processor 106 of the content receiving device 111, for example.
The method starts at step 701.
At step 702, it is determined whether the user has requested the output of spoken subtitles for the current video content. If the user has not requested spoken subtitles, the method ends at step 709. If the user has requested spoken subtitles, the method proceeds to step 703.
At step 703, it is determined whether speech only subtitles in the language of the spoken dialogue of the video content (e.g. English) are available.
If such speech only subtitles are available at step 703, the method proceeds to step 704. At step 704, the speech only subtitles are obtained. At step 708, the speech only subtitles are subject to TTS (by TTS engine 302) to generate the spoken subtitles. The method then ends at step 709.
If such speech only subtitles are not available at step 703 (e.g. if only HoH subtitles are available in the language of the spoken dialogue of the video content), the method proceeds to step 705. At step 705, it is determined whether speech only subtitles are available in an alternative language (e.g. French).
If such alternative language speech only subtitles are available at step 705, the method proceeds to step 706. At step 706, the alternative language speech only subtitles are machine-translated to the language of the spoken dialogue of the video content. At step 708, the machine-translated speech only subtitles are subject to TTS to generate the spoken subtitles. In an example, the TTS engine 302 comprises a machine translation component (not shown) and applies TTS to the output of the machine translation component. The machine translation component uses any suitable known machine translation technique (e.g. neural machine translation using a transformer), for example. Spoken subtitles are thus provided in the language of the dialogue of the video content without the unnecessary description of sound effects (e.g. “##Crashing Noise##”), even though speech only subtitles in that language were not originally available. The translated subtitles may also be displayed in a visual form on the screen. The method then ends at step 709.
If such alternative language speech only subtitles are not available at step 705, the method proceeds to step 707. At step 707, available subtitles which do not satisfy the test of steps 703 or 705 (e.g. HoH subtitles) are obtained. At step 708, these subtitles are then subject to TTS to generate the spoken subtitles. The method then ends at step 709. The present technique thus allows speech only subtitles in the original language of the spoken dialogue of the video content to be obtained from alternative language speech only subtitles if these are available. Only if such alternative language speech only subtitles are unavailable are subtitles such as HoH subtitles used. This helps provide an improved spoken subtitles experience for the user.
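An illustrative sketch of this selection logic (corresponding to steps 703 to 708 of Fig. 8) is given below; the helper functions find_subtitles, machine_translate and tts are hypothetical placeholders for the subtitle store, the machine translation component and the TTS engine 302:

def spoken_subtitles_for(content, dialogue_language, find_subtitles, machine_translate, tts):
    # Steps 703/704: prefer speech only subtitles in the language of the spoken dialogue
    subs = find_subtitles(content, language=dialogue_language, speech_only=True)
    if subs is not None:
        return tts(subs)
    # Steps 705/706: fall back to speech only subtitles in any other language,
    # machine-translated into the dialogue language before TTS
    alt = find_subtitles(content, language=None, speech_only=True)  # None: any language
    if alt is not None:
        return tts(machine_translate(alt, target_language=dialogue_language))
    # Step 707: otherwise use whatever subtitles are available (e.g. HoH subtitles)
    fallback = find_subtitles(content, language=dialogue_language, speech_only=False)
    return tts(fallback)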
In an example, if no speech only subtitles are available and thus, for example, HoH subtitles are obtained at step 707 and used to generate spoken subtitles, further processing may be executed on the obtained HoH subtitles to try to reduce the impact of non-dialogue text included in the HoH subtitles. For example, if the beginning and end of non-dialogue text strings are marked with a predetermined symbol (e.g. “##” in the example of “##Crashing Noise##”), the processor 106 may control the TTS engine 302 to ignore all text including and between such symbols when performing TTS. This helps prevent the undesirable and unnecessary output of sound effects as part of the spoken subtitle output. This is only an example and any other suitable technique for distinguishing subtitle text corresponding to dialogue from subtitle text corresponding to sound effects (e.g. any suitable machine learning text analysis technique) may be used.
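For example, where the predetermined symbol is “##”, the non-dialogue text could be stripped with a simple regular expression, as in the following sketch (the delimiter convention is only an example):

import re

NON_DIALOGUE = re.compile(r"##.*?##")  # assumed delimiter convention

def strip_non_dialogue(subtitle_text):
    # Remove text including and between the predetermined symbols before TTS
    return NON_DIALOGUE.sub("", subtitle_text).strip()

# e.g. strip_non_dialogue("##Crashing Noise## Look out!") returns "Look out!"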
Fig. 9 shows an example method according to the present technique. The method is executed by the processor 106 of the content receiving device 111 , for example.
The method starts at step 801.
At step 802, content data comprising video data representing a video and subtitle data representing subtitles for the video is received. The content data is received by communication interface 109 and transmitted by communication interface 104, for example.
At step 803, it is determined whether the content data comprises a predetermined indicator indicating a suitability of the subtitles for use as spoken subtitles.
In an example, the predetermined indicator comprised in the content data may take one of at least a first value, second value and third value. The first value indicates the subtitles are suitable for use as spoken subtitles. The second value indicates the subtitles are not suitable for use as spoken subtitles. The third value indicates the suitability of the subtitles for use as spoken subtitles is unknown. It is determined that the predetermined indicator indicates the suitability of the subtitles for use as spoken subtitles when the predetermined indicator takes the first or second value.
Thus, for instance, for TTML subtitles (although the present technology is not limited to TTML subtitles), the predetermined indicator is a value of TTS_suitability. The first value is 0x1 (indicating the subtitles are suitable for use as spoken subtitles), the second value is 0x2 (indicating the subtitles are not suitable for use as spoken subtitles) and the third value is 0x0 (indicating the suitability of the subtitles for use as spoken subtitles is unknown).
If a predetermined indicator indicating a suitability of the subtitles for use as spoken subtitles is present in the content data (e.g. because the predetermined indicator takes the first or second value), the method proceeds to step 805. On the other hand, if no predetermined indicator indicating a suitability of the subtitles for use as spoken subtitles is present in the content data (e.g. because the predetermined indicator takes the third value or because the predetermined indicator is simply not present (e.g. no value of TTS_suitability is provided in the content data)), the method proceeds to step 804.
At step 804, based on one or more characteristics associated with content of the video, a suitability of the subtitles for use as spoken subtitles is determined. “Genre” and/or “Program Name” of the video (as indicated in metadata included in the received content data) are examples of such characteristics associated with the content of the video. More generally, such a characteristic may be any content descriptor (that is, data providing any information about the content of the video) which implies the suitability of the subtitles of the video for use as spoken subtitles. As well as “Genre” and “Program Name” being example content descriptors, content descriptors may also include, for instance, the season / series number, the episode number, the current part (as indicated by a start and end time) of an episode and/or the average required spoken word output rate (as previously described).
At step 805, output of spoken subtitles obtained using the subtitle data is controlled according to the determined suitability of the subtitles for use as spoken subtitles.
For example, if it is determined that the subtitles are suitable for use as spoken subtitles (e.g. because the predetermined indicator TTS_suitability = 0x1 is found to be present in the content data at step 803 or because the subtitles are found to be suitable based on the “Genre” or “Program Name” of the video being associated with a TTS_suitability override value of 0x1 at step 804), spoken subtitles obtained from the subtitle data are output.
On the other hand, if it is determined that the subtitles are not suitable for use as spoken subtitles (e.g. because the predetermined indicator TTS_suitability = 0x2 is found to be present in the content data at step 803 or because the subtitles are found to not be suitable based on the “Genre” or “Program Name” of the video being associated with a TTS_suitability override value of 0x2 at step 804), spoken subtitles obtained from the subtitle data are not output (or are output only after a warning is provided to the user).
The method then ends.
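Bringing the above together, a minimal sketch of the method of Fig. 9 might look as follows; the dictionary-style content_data structure and the helper functions passed in are assumptions made for the example:

SUITABLE, NOT_SUITABLE, UNKNOWN = 0x1, 0x2, 0x0

def handle_content(content_data, override_table, generate_spoken_subtitles, output, warn_user):
    # Step 803: is a definitive indicator present in the content data?
    indicator = content_data.get("TTS_suitability", UNKNOWN)
    if indicator in (SUITABLE, NOT_SUITABLE):
        suitability = indicator
    else:
        # Step 804: derive suitability from a characteristic of the content,
        # e.g. the "Program Name" metadata looked up in a (user-specific) table
        suitability = override_table.get(content_data.get("Program Name"), NOT_SUITABLE)
    # Step 805: control output of the spoken subtitles accordingly
    if suitability == SUITABLE:
        output(generate_spoken_subtitles(content_data["subtitle_data"]))
    else:
        warn_user("It has been indicated this content may not be suitable for spoken subtitles.")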
Embodiment(s) of the present disclosure are defined by the following numbered clauses:

1. A data processing apparatus comprising circuitry configured to: receive content data comprising video data representing a video and subtitle data representing subtitles for the video; determine whether the content data comprises a predetermined indicator indicating a suitability of the subtitles for use as spoken subtitles; and if the content data does not comprise the predetermined indicator indicating the suitability of the subtitles for use as spoken subtitles: determine, based on one or more characteristics associated with content of the video, a suitability of the subtitles for use as spoken subtitles; and control output of spoken subtitles obtained using the subtitle data according to the determined suitability of the subtitles for use as spoken subtitles.
2. A data processing apparatus according to clause 1 , wherein the one or more characteristics associated with the content of the video include a genre of the video.
3. A data processing apparatus according to clause 1 or 2, wherein the one or more characteristics associated with the content of the video include a title of the video.
4. A data processing apparatus according to any preceding clause, wherein the one or more characteristics associated with the content of the video are indicated in metadata of the video data.
5. A data processing apparatus according to clause 4, wherein the circuitry is configured to: access a lookup table associating values of the metadata with respective suitability values, each suitability value indicating whether or not the subtitles are suitable for use as spoken subtitles; lookup a value of the metadata of the video data in the lookup table; and determine the suitability of the subtitles for use as spoken subtitles according to the suitability value associated with the value of the metadata of the video data.
6. A data processing apparatus according to clause 5, wherein the lookup table is a user-specific lookup table.
7. A data processing apparatus according to any one of clauses 5 or 6, wherein the circuitry is configured to: receive feedback data indicating user feedback on the suitability of the subtitles for use as spoken subtitles when the subtitles are used as spoken subtitles; update the suitability value of the metadata of the video data in the lookup table based on the feedback data.
8. A data processing apparatus according to any preceding clause, wherein, when it is determined the subtitles are suitable for use as spoken subtitles, the circuitry is configured to cause output of the spoken subtitles with reproduction of the video.

9. A data processing apparatus according to any preceding clause, wherein, when it is determined the subtitles are not suitable for use as spoken subtitles, the circuitry is configured to prevent output of the spoken subtitles with reproduction of the video.
10. A data processing apparatus according to any one of clauses 1 to 8, wherein, when it is determined the subtitles are not suitable for use as spoken subtitles, the circuitry is configured to: cause output of a warning message indicating the subtitles are not suitable for use as spoken subtitles; cause output of the spoken subtitles with reproduction of the video after output of the warning message.
11. A data processing apparatus according to any preceding clause, wherein the predetermined indicator can take one of at least: a first value indicating the subtitles are suitable for use as spoken subtitles; a second value indicating the subtitles are not suitable for use as spoken subtitles; and a third value indicating the suitability of the subtitles for use as spoken subtitles is unknown; and the circuitry is configured to determine that the predetermined indicator indicates the suitability of the subtitles for use as spoken subtitles when the predetermined indicator takes the first or second value.
12. A data processing apparatus according to clause 11 , wherein: the subtitle data is provided in a Timed Text Markup Language, TTML, format and the predetermined indicator is a value of a parameter TTS_suitability; the first value of the predetermined indicator is 0x1 ; the second value of the predetermined indicator is 0x2; and the third value of the predetermined indicator is 0x0.
13. A data processing apparatus according to clause 12, wherein: if it is determined, based on the one or more characteristics associated with content of the video, that the subtitles are suitable for use as spoken subtitles, the value of TTS_suitability is overridden so TTS_suitability = 0x1 ; and if it is determined, based on the one or more characteristics associated with content of the video, that the subtitles are not suitable for use as spoken subtitles, the value of TTS_suitability is overridden so TTS_suitability = 0x2.
14. A television comprising a data processing apparatus according to any preceding clause.
15. A computer-implemented data processing method comprising: receiving content data comprising video data representing a video and subtitle data representing subtitles for the video; determining whether the content data comprises a predetermined indicator indicating a suitability of the subtitles for use as spoken subtitles; and if the content data does not comprise the predetermined indicator indicating the suitability of the subtitles for use as spoken subtitles: determining, based on one or more characteristics associated with content of the video, a suitability of the subtitles for use as spoken subtitles; and controlling output of spoken subtitles obtained using the subtitle data according to the determined suitability of the subtitles for use as spoken subtitles.
16. A program for controlling a computer to perform a method according to clause 15.
17. A computer-readable storage medium storing a program according to clause 16.
Numerous modifications and variations of the present disclosure are possible in light of the above teachings. It is therefore to be understood that, within the scope of the claims, the disclosure may be practiced otherwise than as specifically described herein.
In so far as embodiments of the disclosure have been described as being implemented, at least in part, by one or more software-controlled information processing apparatuses, it will be appreciated that a machine-readable medium (in particular, a non-transitory machine-readable medium) carrying such software, such as an optical disk, a magnetic disk, semiconductor memory or the like, is also considered to represent an embodiment of the present disclosure. In particular, the present disclosure should be understood to include a non-transitory storage medium comprising code components which cause a computer to perform any of the disclosed method(s).
It will be appreciated that the above description for clarity has described embodiments with reference to different functional units, circuitry and/or processors. However, it will be apparent that any suitable distribution of functionality between different functional units, circuitry and/or processors may be used without detracting from the embodiments.
Described embodiments may be implemented in any suitable form including hardware, software, firmware or any combination of these. Described embodiments may optionally be implemented at least partly as computer software running on one or more computer processors (e.g. data processors and/or digital signal processors). The elements and components of any embodiment may be physically, functionally and logically implemented in any suitable way. Indeed, the functionality may be implemented in a single unit, in a plurality of units or as part of other functional units. As such, the disclosed embodiments may be implemented in a single unit or may be physically and functionally distributed between different units, circuitry and/or processors. Although the present disclosure has been described in connection with some embodiments, it is not intended to be limited to these embodiments. Additionally, although a feature may appear to be described in connection with particular embodiments, one skilled in the art would recognize that various features of the described embodiments may be combined in any manner suitable to implement the present disclosure.
REFERENCES [1] DVB TTML subtitles specification (EN 303 560)

Claims

1 . A data processing apparatus comprising circuitry configured to: receive content data comprising video data representing a video and subtitle data representing subtitles for the video; determine whether the content data comprises a predetermined indicator indicating a suitability of the subtitles for use as spoken subtitles; and if the content data does not comprise the predetermined indicator indicating the suitability of the subtitles for use as spoken subtitles: determine, based on one or more characteristics associated with content of the video, a suitability of the subtitles for use as spoken subtitles; and control output of spoken subtitles obtained using the subtitle data according to the determined suitability of the subtitles for use as spoken subtitles.
2. A data processing apparatus according to claim 1 , wherein the one or more characteristics associated with the content of the video include a genre of the video.
3. A data processing apparatus according to claim 1 , wherein the one or more characteristics associated with the content of the video include a title of the video.
4. A data processing apparatus according to claim 1 , wherein the one or more characteristics associated with the content of the video are indicated in metadata of the video data.
5. A data processing apparatus according to claim 4, wherein the circuitry is configured to: access a lookup table associating values of the metadata with respective suitability values, each suitability value indicating whether or not the subtitles are suitable for use as spoken subtitles; lookup a value of the metadata of the video data in the lookup table; and determine the suitability of the subtitles for use as spoken subtitles according to the suitability value associated with the value of the metadata of the video data.
6. A data processing apparatus according to claim 5, wherein the lookup table is a user-specific lookup table.
7. A data processing apparatus according to claim 5, wherein the circuitry is configured to: receive feedback data indicating user feedback on the suitability of the subtitles for use as spoken subtitles when the subtitles are used as spoken subtitles; update the suitability value of the metadata of the video data in the lookup table based on the feedback data.
8. A data processing apparatus according to claim 1 , wherein, when it is determined the subtitles are suitable for use as spoken subtitles, the circuitry is configured to cause output of the spoken subtitles with reproduction of the video.
9. A data processing apparatus according to claim 1 , wherein, when it is determined the subtitles are not suitable for use as spoken subtitles, the circuitry is configured to prevent output of the spoken subtitles with reproduction of the video.
10. A data processing apparatus according to claim 1 , wherein, when it is determined the subtitles are not suitable for use as spoken subtitles, the circuitry is configured to: cause output of a warning message indicating the subtitles are not suitable for use as spoken subtitles; cause output of the spoken subtitles with reproduction of the video after output of the warning message.
11. A data processing apparatus according to claim 1 , wherein the predetermined indicator can take one of at least: a first value indicating the subtitles are suitable for use as spoken subtitles; a second value indicating the subtitles are not suitable for use as spoken subtitles; and a third value indicating the suitability of the subtitles for use as spoken subtitles is unknown; and the circuitry is configured to determine that the predetermined indicator indicates the suitability of the subtitles for use as spoken subtitles when the predetermined indicator takes the first or second value.
12. A data processing apparatus according to claim 11 , wherein: the subtitle data is provided in a Timed Text Markup Language, TTML, format and the predetermined indicator is a value of a parameter TTS_suitability; the first value of the predetermined indicator is 0x1 ; the second value of the predetermined indicator is 0x2; and the third value of the predetermined indicator is 0x0.
13. A data processing apparatus according to claim 12, wherein: if it is determined, based on the one or more characteristics associated with content of the video, that the subtitles are suitable for use as spoken subtitles, the value of TTS_suitability is overridden so TTS_suitability = 0x1 ; and if it is determined, based on the one or more characteristics associated with content of the video, that the subtitles are not suitable for use as spoken subtitles, the value of TTS_suitability is overridden so TTS_suitability = 0x2.
14. A television comprising a data processing apparatus according to claim 1 .
15. A computer-implemented data processing method comprising: receiving content data comprising video data representing a video and subtitle data representing subtitles for the video; determining whether the content data comprises a predetermined indicator indicating a suitability of the subtitles for use as spoken subtitles; and if the content data does not comprise the predetermined indicator indicating the suitability of the subtitles for use as spoken subtitles: determining, based on one or more characteristics associated with content of the video, a suitability of the subtitles for use as spoken subtitles; and controlling output of spoken subtitles obtained using the subtitle data according to the determined suitability of the subtitles for use as spoken subtitles.
16. A program for controlling a computer to perform a method according to claim 15.
17. A computer-readable storage medium storing a program according to claim 16.
PCT/GB2025/050202 2024-02-09 2025-02-03 Data processing apparatus and method Pending WO2025168923A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB2401802.0A GB2637985A (en) 2024-02-09 2024-02-09 Data processing apparatus and method
GB2401802.0 2024-02-09

Publications (1)

Publication Number Publication Date
WO2025168923A1 true WO2025168923A1 (en) 2025-08-14

Family

ID=90354786

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2025/050202 Pending WO2025168923A1 (en) 2024-02-09 2025-02-03 Data processing apparatus and method

Country Status (2)

Country Link
GB (1) GB2637985A (en)
WO (1) WO2025168923A1 (en)


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10237591B2 (en) * 2015-09-09 2019-03-19 Lg Electronics Inc. Broadcast signal transmission device, broadcast signal reception device, broadcast signal transmission method, and broadcast signal reception method
WO2020261805A1 (en) * 2019-06-28 2020-12-30 ソニー株式会社 Information processing device, information processing method, and program

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230010466A1 (en) * 2019-12-09 2023-01-12 Dolby Laboratories Licensing Corporation Adjusting audio and non-audio features based on noise metrics and speech intelligibility metrics

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Digital Video Broadcasting (DVB); TTML subtitling systems", vol. JTC BROADCAS EBU/CENELEC/ETSI on Broadcasting, no. V1.1.0, 17 August 2017 (2017-08-17), pages 1 - 39, XP014303218, Retrieved from the Internet <URL:docbox.etsi.org/Broadcast/Broadcast/70-Drafts/00DVB-375/JTC-DVB-375v110.docx> [retrieved on 20170817] *
JAN OUTTERS: "EN300 468 update for subtitles", 12 September 2016 (2016-09-12), XP017852291, Retrieved from the Internet <URL:https://www.dvb.org/resources/restricted/members/documents/TM-SUB/TM-SUB0147_EN300-468-update-for-subtitles.docx> [retrieved on 20160912] *

Also Published As

Publication number Publication date
GB202401802D0 (en) 2024-03-27
GB2637985A (en) 2025-08-13

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 25705609

Country of ref document: EP

Kind code of ref document: A1