EP4506924A1

EP4506924A1 - Method for receiving and processing audio data and triggering of associated sound effects according to prosody and/or movement

Info

Publication number: EP4506924A1
Application number: EP24193437.1A
Authority: EP
Inventors: Fernando Daniel Goncalves
Original assignee: Poetie
Current assignee: Poetie
Priority date: 2023-08-08
Filing date: 2024-08-07
Publication date: 2025-02-12
Also published as: FR3152077A1; FR3152077B1

Abstract

The invention relates to a method for receiving and processing audio data comprising words corresponding to the real-time reading of a source text, for triggering sound effects synchronized with said reading of the text, characterized in that it comprises a step (110) of determining a position index (12) of the speaker in the source text, by detecting a correspondence between the received audio data and the source text; a step (112) of receiving at least one item of data representative of prosody data (10), and/or of receiving at least one item of movement data; a step (140) of determining the sound effect to be triggered as a function of the position index (12), and as a function of the prosody data and/or the movement data; and a step (150) of triggering the determined sound effect.

Description

Technical field of the invention

The invention relates to a method for receiving and processing audio data, and to a system for receiving and processing audio data. In particular, the invention relates to a method for triggering sound effects associated with a source text while the source text is read by a human speaker, taking into account the prosody of the reading and/or movement during the reading.

Technological background

The invention lies in the field of reading and proposes to accompany the reading of a book by triggering sound effects linked to the source text being read.

Various interactive reading techniques have already been proposed to allow sound effects to be triggered during reading.

Some techniques, for example, estimate the reading speed so that sound effects can be preloaded and triggered at what appears to be the appropriate moment. These techniques are simple, but the risk of triggering a sound effect at the wrong time is high.

Other techniques rely on recognizing predetermined keywords of the text as they are read, in order to trigger the associated sound effect. However, these word-detection techniques are more prone to recognition errors and to errors in tracking the reading.

Furthermore, current techniques generally offer only one sound effect for each passage of the text and do not allow the reading experience to vary with each new reading aloud of the source text. Nor do they allow the sound effect to be adapted to the speaker's emotions and manner of reading, or sound effects to be triggered when the speaker moves the book in order to interact with it.

In particular, the sound effects used are based on a principle of "linear music", which uses sounds played in a loop and a transition from one loop to the next by cross-fading at the end of predetermined sequences.

The inventors have thus sought to provide a method that overcomes these drawbacks by integrating a method similar to the principle of adaptive music into the reading of a text, in particular a book.

Objectives of the invention

The invention aims to provide a method, a computer program product and a system for receiving and processing audio data comprising words corresponding to the real-time reading of a source text by a speaker, for triggering sound effects synchronized with said reading of the text.

The invention aims in particular to provide, in at least one embodiment, a method, a computer program product and a system for receiving and processing audio data that take into account interactions and variations in emotion during the reading of the text, so as to better renew the reading experience, in particular by adapting the sound effect to the emotion that the speaker wishes to convey.

The invention also aims to provide, in at least one embodiment, a method, a computer program product and a system for receiving and processing audio data that allow better tracking of the reading of the source text, so that sound effects are triggered appropriately and at the right time.

The invention also aims to provide, in at least one embodiment, a method, a computer program product and a system for receiving and processing audio data that can be embedded in a small portable device and can operate without an Internet connection.

Disclosure of the invention

To this end, the invention relates to a method for receiving and processing, in a receiving and processing system, audio data comprising words corresponding to the real-time reading of a source text by at least one speaker, for triggering sound effects synchronized with said reading of the text,
characterized in that it comprises:

a step of determining a position index of the speaker in the source text, by detecting a correspondence between the received audio data and the source text stored in a storage module of the receiving and processing system,
a step of receiving at least one item of data representative of a prosody value of the audio data, called prosody data, and/or of receiving at least one item of data representative of the presence or absence of a movement of a movement recognition device during the reading of the text and, where a movement is present, representative of its characteristics, called movement data,
a step of determining the sound effect to be triggered as a function of the position index, and as a function of the prosody data and/or the movement data,
a step of triggering the determined sound effect.
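
By way of illustration only, the determination step can be pictured as a lookup over rules keyed by trigger position and refined by prosody and movement. The following Python sketch is not taken from the application: the `EffectRule` structure, the labels ("excited", "calm", "shake") and the effect names are hypothetical, and a real implementation might rank or combine effects rather than return the first match.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EffectRule:
    """Hypothetical rule tying an effect to a trigger position and optional conditions."""
    trigger_index: int               # position index at which the effect may fire
    effect: str                      # name of the sound effect to play
    prosody: Optional[str] = None    # required prosody label, or None for "any"
    movement: Optional[str] = None   # required movement label, or None for "any"


def determine_effect(rules, position_index, prosody=None, movement=None):
    """Return the first effect whose position and conditions all match, else None."""
    for rule in rules:
        if rule.trigger_index != position_index:
            continue
        if rule.prosody is not None and rule.prosody != prosody:
            continue
        if rule.movement is not None and rule.movement != movement:
            continue
        return rule.effect
    return None


# Assumed example data: four candidate effects at the same trigger position.
rules = [
    EffectRule(trigger_index=42, effect="thunder_loud", prosody="excited"),
    EffectRule(trigger_index=42, effect="thunder_soft", prosody="calm"),
    EffectRule(trigger_index=42, effect="rattle", movement="shake"),
    EffectRule(trigger_index=42, effect="rain"),  # unconditional fallback
]
```

Ordering the rules from most to least specific lets the same trigger position yield a different effect on each reading, depending on how the text is read and whether the device is moved.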

A receiving and processing method according to the invention therefore allows prosody and/or movement data to be taken into account in order to determine whether a sound effect should be triggered when the position index corresponds to a triggering position of the sound effect. Taking this additional data into account renews the reading experience and makes it more immersive and personalized, since a sound effect can be triggered or not, selected, or modified or not, depending on variations in the reading. The objective is to adapt the sound effect to the emotion that the speaker reading the source text wants to convey.

The triggering of sound effects is thus adaptive, and can evolve in real time at each position index as a function of the movements and/or the prosody, which are representative of actions and emotions respectively. The operation is thus similar to the principle of adaptive music, used in particular in video games, where the music changes dynamically depending on the player's activity. The sound effects are divided into fragments and into several layers, which can be selected and possibly combined depending on the movements and the prosody.

In particular, taking the prosody of the reading into account makes it possible to recognize the speaker's emotions, for example by adapting the triggered sound effects to the loudness of the reading, the reading speed, the tone of the reading, etc. A combination of these characteristics of the words constituting the audio data constitutes the prosody. More generally, prosody is defined as the set of non-verbal characteristics of an utterance, in particular variations in speech rate, in pitch (tone and intonation) and in duration (stress and rhythm), which can be representative of an emotion conveyed in the words when the text is read. The same text can convey different emotions depending on the prosody of the speech during its reading by a speaker, irrespective of the meaning of the words and sentences pronounced. Prosody can also characterize a sociolinguistic accent such as a regional accent, and can add context and meaning to a reading.

Prosody is detected and recognized using at least one preconfigured dedicated algorithm, for example one implementing at least one learning model pre-trained with annotated prosody data, or one using predetermined prosody thresholds adjusted by means of audio data specific to the speaker. The algorithm or algorithms take into account the combination of prosody characteristics of the audio data to determine a prosody of the audio data, which may be associated with an emotion conveyed when the text is read. The method may also implement a learning model selected from a group of learning models according to the speaker's gender and/or estimated age and/or the language spoken by the speaker, each learning model of the group being trained with at least partly different data associated with a particular gender, age group and/or language.
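
The threshold-based variant mentioned above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the algorithm of the application: the feature names (`loudness`, `rate`), the linear units, the `ratio` parameter and the output labels are all hypothetical, and the speaker-specific adjustment is reduced to a simple mean over calibration frames.

```python
from statistics import mean


def calibrate(samples):
    """Compute a per-speaker baseline (mean of each feature) from calibration frames."""
    keys = samples[0].keys()
    return {k: mean(s[k] for s in samples) for k in keys}


def classify_prosody(features, baseline, ratio=1.25):
    """Label a frame by comparing its loudness and speech rate with the baseline.

    The thresholds are expressed relative to the speaker's own baseline, so the
    same absolute loudness can be "neutral" for one speaker and "emphatic" for another.
    """
    loud = features["loudness"] > ratio * baseline["loudness"]
    fast = features["rate"] > ratio * baseline["rate"]
    quiet = features["loudness"] < baseline["loudness"] / ratio
    if loud and fast:
        return "excited"
    if loud:
        return "emphatic"
    if quiet:
        return "whisper"
    return "neutral"


# Assumed calibration data: loudness in arbitrary linear units, rate in syllables/s.
baseline = calibrate([
    {"loudness": 60.0, "rate": 4.0},
    {"loudness": 64.0, "rate": 4.0},
])
```

A rule set of this kind needs no network access and very little computation, which is in keeping with the stated aim of running on a small embedded device.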

The algorithm is preferably fast to execute and self-contained; in particular, it requires few computing resources and no connection to the Internet or to any other external database.

Taking movement into account makes it possible to associate the reading with physical interactions on the movement recognition device during the reading of the text, and to link both the reading and the movement to the triggering of a sound effect. The movement data may be data indicating the absence of movement. The movement data may also include information representative of the speed and/or acceleration of the movement.

The movement is detected and recognized using at least one preconfigured dedicated algorithm, for example one implementing at least one learning model pre-trained with annotated movement data, or one using predetermined movement models adjusted by means of movement data specific to the speaker and/or relative to the initial position of the movement recognition device. The algorithm or algorithms take into account the combination of movement characteristics detected by the movement recognition device to determine the recognized movement.

Advantageously and according to the invention, the source text is stored in the processing device in the form of a list of phonemes, and the processing device is configured to detect in the audio data the presence of phonemes corresponding to the source text.

According to this aspect of the invention, decomposing the text into phonemes improves both the accuracy of the position index and the calculation speed, and thus improves the triggering conditions. Phoneme detection gives a better result than word detection and copes better with halting reading, with going back in the text, with pauses in the reading, etc.

Furthermore, phoneme detection requires fewer resources than word detection and can therefore be implemented by a portable and/or embedded device, even in the absence of an Internet connection. In particular, the number of different phonemes to be detected is much lower than the number of words that may have to be detected in methods using word detection, and phoneme detection does not require the next word to be inferred. Phoneme detection thus takes of the order of 200 ms.

Advantageously and according to the invention, the position index of the speaker corresponds to a phoneme of the list of phonemes of the source text, and the step of determining the position index of the speaker in the source text comprises:

a sub-step of receiving the current position of the index in the list of phonemes of the source text,
a sub-step of comparing at least one phoneme detected in the audio data with at least one expected phoneme among the following phonemes in the list of phonemes of the source text,
if no phoneme detected in the audio data matches the following phonemes, a sub-step of receiving a plurality of phonemes detected in the audio data and a sub-step of searching, in at least a part of the source text, for a sequence of phonemes in the list of phonemes of the source text corresponding to said plurality of detected phonemes,
if the plurality of phonemes detected in the audio data does not correspond to any sequence of phonemes in that part of the list of phonemes of the source text, a sub-step of searching the entire source text for a sequence of phonemes in the list of phonemes of the source text corresponding to said plurality of detected phonemes,
if a correspondence is detected between a detected phoneme or a plurality of detected phonemes and the list of phonemes of the source text, a sub-step of updating the index with said corresponding phoneme or with the last phoneme of the corresponding sequence of phonemes.
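
The sub-steps listed above can be sketched in Python as a three-stage fallback. This is only an illustration under assumptions: exact matching is used instead of a similarity index, and the `lookahead` and `window` parameters, like the function names, are hypothetical.

```python
def find_sequence(haystack, needle, start, end):
    """Return the index of the last phoneme of the first match of `needle`
    inside haystack[start:end], or -1 if there is no match."""
    n = len(needle)
    for i in range(max(start, 0), min(end, len(haystack)) - n + 1):
        if haystack[i:i + n] == needle:
            return i + n - 1
    return -1


def update_index(source, index, detected, lookahead=3, window=20):
    """Return the new position index, or the current one when nothing matches.

    source   -- list of phonemes of the source text
    index    -- current position index in `source`
    detected -- list of the most recently detected phonemes, oldest first
    """
    if not detected:
        return index
    # 1. Compare the latest detected phoneme with the next expected phonemes.
    latest = detected[-1]
    for i in range(index + 1, min(index + 1 + lookahead, len(source))):
        if source[i] == latest:
            return i
    # 2. Search for the detected sequence in a part of the text around the index.
    hit = find_sequence(source, detected, index - window, index + window)
    if hit >= 0:
        return hit
    # 3. Fall back to a search over the entire source text.
    hit = find_sequence(source, detected, 0, len(source))
    return hit if hit >= 0 else index
```

For example, with a source list of 100 distinct phonemes and the index at position 5, a reader who goes back a few phonemes is recovered by stage 2, while a jump to a distant page is recovered only by stage 3.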

According to this aspect of the invention, tracking the correspondence of phonemes as described allows the position index to be determined efficiently for triggering the sound effects.

The sub-step of searching at least a part of the source text for a sequence of phonemes may be repeated several times, gradually enlarging the part of the source text in which the plurality of detected phonemes is searched for.
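
A gradually widening search of this kind might look like the following Python sketch. The doubling strategy and the `initial` and `factor` parameters are assumptions made for the example; the application does not specify how the searched part is enlarged.

```python
def expanding_search(source, detected, index, initial=10, factor=2):
    """Search for `detected` in a window centred on `index`, widening the window
    until a match is found or the whole source text has been covered.

    Returns the index of the last phoneme of the match, or -1 if none is found.
    Assumes factor > 1 so that the window actually grows.
    """
    n = len(detected)
    if n == 0:
        return -1
    radius = max(1, initial)
    while True:
        lo = max(0, index - radius)
        hi = min(len(source), index + radius + 1)
        for i in range(lo, hi - n + 1):
            if source[i:i + n] == detected:
                return i + n - 1
        if lo == 0 and hi == len(source):
            return -1  # the window already covers the entire text
        radius *= factor
```

Starting with a small window keeps the common case cheap, since a reader who loses the thread usually resumes close to the last known position.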

The sub-step of searching the entire source text for a sequence of phonemes can also be performed continuously and in parallel with the other sub-steps, so as to provide a result quickly if the previous sub-steps have been inconclusive and/or if the confidence index of the correspondence between the sequence of phonemes and the plurality of detected phonemes is high.

Advantageously, if no phoneme detected in the audio data matches the following phonemes, the step of determining the speaker's position index in the source text also includes a step of advancing the index to the next phoneme. With this step, the method can keep advancing through the list of phonemes so as not to lose the correspondence between the audio data and the list of phonemes.

Advantageously and according to the invention, the comparison between the audio data and the list of phonemes comprises determining an index representative of the resemblance between the phoneme detected in the audio data and the phoneme of the list of phonemes. This resemblance index is also representative of the distance between these phonemes: for example, a value equal to zero indicates a perfect match, a value close to zero indicates a resemblance, and a higher positive value indicates a distance between the phonemes. For example, in French, the phonemes beginning with the letters "p" and "b" are considered similar, and therefore separated by a small distance, as are those beginning with the letters "t" and "d". If the value of the index is lower than a predetermined threshold, the phonemes can be considered to match and the position index can be advanced. In the variants of the invention in which a confidence index as described below is implemented, this confidence index can be lowered if the match between the phonemes is not perfect (resemblance index non-zero but lower than the predetermined threshold).

Thus, advantageously and according to the invention, the sub-step of comparing at least one phoneme detected in the audio data with at least one expected phoneme among the following phonemes in the list of phonemes of the source text comprises determining an index representative of the resemblance between the phoneme detected in the audio data and the phoneme of the list of phonemes, and, if the index representative of the resemblance is lower than a predetermined threshold, a sub-step of updating the position index with said detected phoneme and a sub-step of reducing the confidence index.
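
As a rough illustration of the resemblance index, the following Python sketch uses a small hand-written distance table. The numeric distances, the `threshold` and the `penalty` are invented for the example; only the "p"/"b" and "t"/"d" pairs come from the description.

```python
# Hypothetical pairwise distances: 0.0 means identical, larger means less alike.
SIMILAR = {
    frozenset(("p", "b")): 0.2,
    frozenset(("t", "d")): 0.2,
}


def phoneme_distance(a, b):
    """Return the resemblance index between two phonemes (0.0 = perfect match)."""
    if a == b:
        return 0.0
    return SIMILAR.get(frozenset((a, b)), 1.0)


def compare(detected, expected, threshold=0.5, confidence=1.0, penalty=0.1):
    """Return (matched, new_confidence).

    A distance below the threshold counts as a match; an imperfect match
    (non-zero distance) lowers the confidence index by `penalty`.
    """
    d = phoneme_distance(detected, expected)
    if d >= threshold:
        return False, confidence
    if d > 0.0:
        confidence = max(0.0, confidence - penalty)
    return True, confidence
```

Here a detected "b" where "p" was expected still advances the position index but at a reduced confidence, whereas an unrelated phoneme is rejected and leaves the confidence unchanged at this stage.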

Each of these sub-steps has its own error tolerance: for example, it may or may not allow a phoneme to be substituted with a similar phoneme, or it may apply a threshold on the percentage of correspondence between a list of detected phonemes and a list of phonemes of the source text.

Avantageusement et selon l'invention, l'étape de déclenchement de l'effet sonore comprend une sous-étape de vérification de la valeur d'un indice de confiance et en ce que l'effet sonore est déclenché uniquement si l'indice de confiance est supérieur à un seuil prédéterminé.Advantageously and according to the invention, the step of triggering the sound effect comprises a sub-step of verifying the value of a confidence index and in that the sound effect is triggered only if the confidence index is greater than a predetermined threshold.

Selon cet aspect de l'invention, l'indice de confiance permet ainsi d'éviter le déclenchement d'effet sonore en cas de manque de confiance dans le suivi de position. En particulier, cela permet d'éviter le déclenchement d'effet sonore à un moment inopportun du fait d'une mauvaise détermination de l'index de position.According to this aspect of the invention, the confidence index thus makes it possible to avoid triggering a sound effect in the event of a lack of confidence in the position tracking. In particular, this makes it possible to avoid triggering a sound effect at an inopportune moment due to an incorrect determination of the position index.

Chaque effet sonore peut être lié à son propre seuil prédéterminé.Each sound effect can be tied to its own predetermined threshold.

Avantageusement et selon l'invention, le procédé comprend une étape d'augmentation de la valeur de l'indice de confiance en cas de détection d'une correspondance d'un phonème ou d'une pluralité de phonème détectés avec la liste de phonèmes du texte source et comprend une étape de diminution de la valeur de l'indice de confiance si aucun phonème détecté dans les données audios ne correspond avec les phonèmes suivants.Advantageously and according to the invention, the method comprises a step of increasing the value of the confidence index in the event of detection of a correspondence of a phoneme or of a plurality of phonemes detected with the list of phonemes of the source text and comprises a step of decreasing the value of the confidence index if no phoneme detected in the audio data corresponds with the following phonemes.

Selon cet aspect de l'invention, l'augmentation et la diminution de l'indice de confiance sont liées à la détection de correspondance de phonèmes, en particulier l'indice de confiance permet ainsi d'éviter le déclenchement d'effet sonore si la détection de correspondance de phonème n'est pas réalisée avec suffisamment de confiance.According to this aspect of the invention, the increase and decrease of the confidence index are linked to the detection of phoneme correspondence, in particular the confidence index thus makes it possible to avoid the triggering of sound effect if the detection of phoneme correspondence is not carried out with sufficient confidence.
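A minimal sketch of such a confidence index, given purely as an illustration, is shown below. The class name, the step sizes and the clamping of the value to the range 0 to 1 are assumptions of this sketch; the description only states that the index rises on a match, falls when no match is found, and is compared with a threshold that may be specific to each sound effect.

```python
from dataclasses import dataclass


@dataclass
class ConfidenceTracker:
    """Hypothetical holder for the confidence index (values clamped to [0, 1])."""
    value: float = 0.5
    step_up: float = 0.1    # assumed increase when a phoneme match is detected
    step_down: float = 0.2  # assumed decrease when no phoneme matches

    def update(self, matched: bool) -> float:
        """Raise the index on a match, lower it otherwise."""
        delta = self.step_up if matched else -self.step_down
        self.value = min(1.0, max(0.0, self.value + delta))
        return self.value

    def allows(self, effect_threshold: float) -> bool:
        """A sound effect is triggered only above its own threshold."""
        return self.value > effect_threshold
```

In this sketch, an effect with a high threshold is withheld after a few unmatched phonemes, while an effect with a low threshold can still be triggered.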

Advantageously and according to the invention, all of the sound effects associated with the source text are distributed in a plurality of time sequences associated with portions of the source text, each time sequence being associated with a sound effect or with a group of sound effects comprising several sound effects, and when the position index corresponds to a time sequence associated with a group of sound effects, the step of determining the sound effect to be triggered comprises:

a sub-step of retrieving, as a function of the position index, the group of sound effects associated with the time sequence in which the position index is located,
a sub-step of determining, as a function of the prosody data and/or the movement data, the sound effect to be triggered among the sound effects of the group of sound effects.

According to this aspect of the invention, the time sequences are each associated with one or more sound effects, and the triggered sound effects are adapted to the determined prosody and/or movement data. The objective is to offer, for at least some time sequences, sound effect variants to be triggered for the same position index.
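The two sub-steps above (retrieving the group from the position index, then choosing a variant from the prosody and/or movement data) could be sketched as follows. The data structures, the effect names and the intensity threshold of 0.7 are hypothetical and serve only to illustrate the principle.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Variant:
    effect: str                        # hypothetical effect name
    condition: Callable[[dict], bool]  # predicate on prosody/movement data


@dataclass
class TimeSequence:
    start: int      # first position index covered (inclusive)
    end: int        # last position index covered (inclusive)
    variants: list  # one Variant for a single effect, several for a group


def select_effect(sequences, position, data) -> Optional[str]:
    """Retrieve the group for the position index, then pick the first
    variant whose condition holds for the prosody/movement data."""
    for seq in sequences:
        if seq.start <= position <= seq.end:
            for variant in seq.variants:
                if variant.condition(data):
                    return variant.effect
            return None
    return None


# Assumed example: one single-effect sequence and one group with two variants.
SEQUENCES = [
    TimeSequence(1, 5, [Variant("rain", lambda d: True)]),
    TimeSequence(6, 19, [
        Variant("thunder_loud", lambda d: d.get("intensity", 0.0) > 0.7),
        Variant("thunder_soft", lambda d: True),
    ]),
]
```

For the same position index, for example 8, a loud reading selects "thunder_loud" and a quiet one selects "thunder_soft".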

Advantageously and according to the invention, the sound effects that can be triggered in each sound effect group can also depend on parameters predefined in advance by the speaker, or can vary depending on the prosody data and, in the associated variants, on the confidence index.

For example, the speaker can select a "day" mode or a "night" mode, which prevents certain sound effects in the sound effect group from being triggered. Taking the confidence index into account also limits the risk of a sound effect being triggered at the wrong moment, for example by delaying the triggering and/or by not triggering a specific sound effect.

The speaker can also enable or disable movement or prosody detection for the reading.

Advantageously and according to the invention, the prosody data comprises data, or a combination of data, from the different types of data in the following list:

data on the intensity of the speech in the audio data,
data on the frequency and/or fundamental frequency of the speech in the audio data,
data on the intensity and/or frequency and/or fundamental frequency of the consonants pronounced in the audio data,
data on the intensity and/or frequency and/or fundamental frequency of the vowels pronounced in the audio data,
data on the length of the vowels and/or of the words pronounced in the audio data,
data on the length of the consonants pronounced in the audio data,
data on the speech rate in the audio data.

According to this aspect of the invention, the prosody depends on several factors that make it possible to represent the emotion of the speaker of the source text while the text is read aloud. The prosody data can be based on an absolute value of each item of data in the list or on a temporal variation of these values, for example a gradual increase or decrease in sound intensity.
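As a hedged illustration of a prosody datum based on temporal variation, the sketch below classifies a short window of intensity values as rising, falling or steady using the slope of a least-squares line. The use of a linear fit, the function name and the slope threshold are assumptions of this sketch; the description does not specify how the variation is measured.

```python
def intensity_trend(samples, min_slope=0.05):
    """Classify a window of intensity values as 'rising', 'falling' or 'steady'
    from the slope of a least-squares line fitted over the sample indices."""
    n = len(samples)
    if n < 2:
        return "steady"
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    den = sum((i - mean_x) ** 2 for i in range(n))
    slope = num / den
    if slope > min_slope:
        return "rising"
    if slope < -min_slope:
        return "falling"
    return "steady"
```

A window such as `[0.1, 0.2, 0.3, 0.4]` is classified as rising, which could for example select a crescendo variant of a sound effect.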

Advantageously and according to the invention, the method comprises a calibration step making it possible to determine or estimate, from the audio data, the gender of the speaker and/or the age of the speaker and/or the language spoken by the speaker.

According to this aspect of the invention, determining or estimating the age or gender of the speaker, as well as the language spoken, can allow the prosody variation detection thresholds to be adjusted so as to better adapt the triggering of sound effects. In particular, a man and a woman, an adult and a child, or two speakers of different languages generally have different average prosody variations.

Advantageously and according to the invention, the movement data comprises data relating to a displacement and/or a rotation in space and/or to the speed of displacement of the motion recognition device, and/or data relating to the correspondence of a movement, or of a combination of movements, detected by the motion recognition device with a predetermined movement from a list of predetermined movements recorded in a storage module of the processing system.

According to this aspect of the invention, different types of movement may be taken into account for the selection of the sound effect to be triggered. In particular, the method may comprise detecting a predetermined movement, for example if the user draws a circle with the motion recognition device.

Movement in space consists, in a known manner, of one or a combination of the following movements:

vertical translation,
lateral translation,
longitudinal translation,
pitch,
yaw,
roll,
shaking movement,
short shocks.

The speed and/or acceleration of the movement can also be integrated into the movement data.
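As an illustration of how one of the listed movements might be recognised, the sketch below flags a shaking movement from acceleration samples along a single axis. The sign-reversal heuristic, the amplitude threshold and the minimum number of reversals are assumptions chosen for this sketch; the description does not specify a detection algorithm.

```python
def is_shake(accel_x, min_amplitude=1.5, min_reversals=3):
    """Heuristic sketch: a shake shows several strong alternations in the
    sign of the acceleration along one axis (values in arbitrary units)."""
    reversals = 0
    last_sign = 0
    for a in accel_x:
        if abs(a) < min_amplitude:
            continue  # ignore weak samples, treated here as noise
        sign = 1 if a > 0 else -1
        if last_sign and sign != last_sign:
            reversals += 1
        last_sign = sign
    return reversals >= min_reversals
```

The boolean result could then form part of the movement data used to select a sound effect.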

Advantageously and according to the invention, the sound effect triggering step is executed when the position index is at or after a predetermined trigger index, or within a predetermined trigger window, and the sound effect can be emitted immediately, as soon as the position index reaches or exceeds the trigger index or reaches the trigger window, or emitted in a deferred manner after the end of a sound effect currently being emitted.

According to this aspect of the invention, the triggering of the sound effect may be immediate or deferred; for example, the determined sound effect may be triggered at the end of a time sequence. The timing of the triggering depends in particular on the trigger position or trigger window in which the position index is located.

The sound effect can also be triggered at an absolute point in time (for example at a time t) or at a relative point in time (for example in x seconds, or depending on the tempo, for example after the last note of a bar of a piece of music). The sound effect can also be programmed to be triggered several times, for example once immediately and once x seconds later.
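The triggering conditions described above (trigger index or trigger window, immediate or deferred emission) could be sketched as follows. The `Trigger` structure, the field names and the use of timestamps in seconds are assumptions of this sketch, not elements of the description.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Trigger:
    effect: str
    trigger_index: Optional[int] = None       # fire at or after this index
    window: Optional[Tuple[int, int]] = None  # or inside this inclusive window
    deferred: bool = False                    # wait for the current effect to end


def should_fire(trigger: Trigger, position: int) -> bool:
    """True when the position index satisfies the trigger condition."""
    if trigger.window is not None:
        low, high = trigger.window
        return low <= position <= high
    if trigger.trigger_index is not None:
        return position >= trigger.trigger_index
    return False


def schedule(trigger: Trigger, position: int,
             playing_until: Optional[float], now: float) -> Optional[float]:
    """Return the time at which to emit the effect, or None if not triggered.
    A deferred effect waits until the effect currently playing has ended."""
    if not should_fire(trigger, position):
        return None
    if trigger.deferred and playing_until is not None and playing_until > now:
        return playing_until
    return now
```

A relative delay ("in x seconds") or a repeated emission could be layered on top by adding an offset, or a list of offsets, to the returned time.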

The invention also relates to a computer program product for receiving and processing audio data corresponding to the real-time reading of a source text by at least one speaker, for triggering sound effects synchronized with said reading of the text, said computer program product comprising program code instructions for executing the steps of the method according to the invention when said computer program product is executed on a computing device.

The computer program product is advantageously stored in the processing device, in particular in a portable computing device, preferably a smartphone, a digital tablet or a smartwatch, for example in the form of an application.

The invention also relates to a system for receiving and processing audio data corresponding to the real-time reading of a source text by at least one speaker, for triggering sound effects synchronized with said reading of the text, characterized in that it comprises:

a source text storage module,
an audio data reception module,
an audio data processing module, configured to determine a position index of the speaker in the source text, by detecting a correspondence between the received audio data and the source text recorded in the storage module,
a motion recognition device configured to provide at least one piece of data representative of the presence or absence of a movement of a motion recognition device during the reading of the text and representative of characteristics of a movement that is present, called movement data, and/or a module for analyzing the variation in the prosody of the audio data, configured to provide at least one piece of data representative of a variation in the prosody of the audio data,
a module for determining the sound effect to be triggered as a function of the position index, and as a function of the prosody data and/or the movement data,
a sound emission device configured to emit the determined sound effect.

Advantageously, the reception and processing system according to the invention is configured to implement the reception and processing method according to the invention.

Advantageously and according to the invention, the reception and processing method according to the invention is configured to be implemented by a reception and processing system according to the invention.

Advantageously and according to the invention, the system comprises a portable computing device, preferably a smartphone, a digital tablet or a smartwatch, comprising the source text storage module, the audio data reception module, the audio data processing module, the motion recognition device and/or the prosody variation analysis module, the sound effect determination module and the sound emission device.

According to this aspect of the invention, most or all of the components of the system may be integrated into a portable computing device such as a smartphone, so as to combine the functionalities in a single system. The portable computing device may also display the source text or, preferably, be attached to a physical medium of the source text, in particular one on which the source text is printed, such as a book.

The computing device is configured to implement the reception and processing method according to the invention, in particular by means of a combination of one or more computing components such as a processor (CPU, Central Processing Unit), and/or a graphics processor (GPU, Graphics Processing Unit), and/or a digital signal processor (DSP, Digital Signal Processor), and/or one or more memories, and/or an analog-to-digital converter, a microphone, an accelerometer, a gyroscope, etc.

Advantageously and according to the invention, the system comprises a motion recognition device and a device for attaching the motion recognition device, configured to allow the motion recognition device to be attached to a medium on which the source text is printed, so that the motion recognition device is mechanically secured to said medium while the medium is being read.

According to this aspect of the invention, the motion recognition device can be directly associated with the physical medium on which the source text is printed, such that a movement of the physical medium causes a movement of the motion recognition device. Thus, a movement of the physical medium is detected and is taken into account in the movement data.

When the motion recognition device is integrated into a portable computing device, the attachment device is configured to attach the portable computing device to the physical medium.

Advantageously and according to the invention, the attachment device is formed from one element or a combination of elements from the following list:

elastic elements connected to the medium on which the source text is printed, configured to hold the motion recognition device in position,
a pocket arranged in the cover of said medium, configured to accommodate the motion recognition device,
a compartment arranged in the cover of said medium, configured to accommodate the motion recognition device,
a permanent magnet arranged on said medium and configured for magnetic coupling with a magnetic element arranged on the motion recognition device, or a permanent magnet arranged on the motion recognition device and configured for magnetic coupling with a magnetic element arranged on said medium,
a pouch clipped onto said medium, configured to accommodate the motion recognition device.

According to this aspect of the invention, these different attachment variants allow compatibility with different types of motion recognition device, in particular when the latter is integrated into a portable computing device.

The invention also relates to a reception and processing method, a computer program product and a reception and processing system characterized in combination by all or some of the features mentioned above or below.

List of figures

Other aims, features and advantages of the invention will become apparent on reading the following description, which is given solely by way of non-limiting example and which refers to the attached figures, in which:

[Fig. 1] is a schematic view of a method for receiving and processing audio data according to one embodiment of the invention,
[Fig. 2] is a schematic view of a source text and a list of phonemes as recorded in a storage module of a reception and processing system according to one embodiment of the invention,
[Fig. 3] is a graph schematically illustrating a variation in prosody of audio data processed by a reception and processing method according to one embodiment of the invention,
[Fig. 4] is a graph schematically illustrating three audio tracks comprising sound effects that can be triggered during the execution of a reception and processing method according to one embodiment of the invention,
[Fig. 5] is a schematic view of a system for receiving and processing audio data according to one embodiment of the invention, shown in front and back views of a book forming the medium of the source text.

Description détaillée d'un mode de réalisation de l'inventionDetailed description of an embodiment of the invention

Sur les figures, les échelles et les proportions ne sont pas strictement respectées et ce, à des fins d'illustration et de clarté.In the figures, scales and proportions are not strictly respected, for the purposes of illustration and clarity.

En outre, les éléments identiques, similaires ou analogues sont désignés par les mêmes références dans toutes les figures.Furthermore, identical, similar or analogous elements are designated by the same references throughout the figures.

La figure 1 représente schématiquement un procédé 100 de réception et de traitement de données audios selon un mode de réalisation de l'invention. Le procédé permet la réception et le traitement de données audios comprenant des paroles correspondant à la lecture en temps réel d'un texte source par un locuteur, pour le déclenchement d'effets sonores synchronisé avec ladite lecture du texte. Le procédé est ici mis en oeuvre dans un système de réception et de traitement de données audio, dont un mode de réalisation est décrit plus bas en référence avec la figure 5.There figure 1 schematically represents a method 100 for receiving and processing audio data according to an embodiment of the invention. The method allows the reception and processing of audio data comprising words corresponding to the real-time reading of a source text by a speaker, for the triggering of sound effects synchronized with said reading of the text. The method is here implemented in a system for receiving and processing audio data, an embodiment of which is described below with reference to the figure 5 .

Le procédé comprend une étape 110 de détermination d'un index de position du locuteur dans le texte source, par détection d'une correspondance entre les données audios réceptionnées et le texte source enregistré dans un module de stockage du système de réception et de traitement.The method comprises a step 110 of determining a position index of the speaker in the source text, by detecting a correspondence between the received audio data and the source text recorded in a module of storage of the receiving and processing system.

The source text is recorded in the processing device in the form of a list of phonemes, and the processing device is configured to detect in the audio data the presence of phonemes corresponding to the source text. The list of phonemes is created beforehand, automatically, manually or semi-automatically (for example with manual correction of an automatic pre-processing). The list of phonemes is generally created outside the system implementing the method according to the invention.

Figure 2 schematically represents an example 200 of a sentence in French from the source text as recorded in the storage module, "En un éclair, le chat grimpa dans l'arbre." ("In a flash, the cat climbed up the tree."), its transposition into phonemes "εn Λn é k le le t∫ æt gr im pa d æ nz la r b re", and the index assigned to each phoneme of this example sentence, the index here ranging from 1 to 19. Other index numberings may be used; for example, the first index may be zero.

The speaker position index corresponds to the index of the corresponding phoneme in the list of phonemes of the source text, and step 110 of determining the speaker position index in the source text comprises:

- a sub-step 112 of receiving the current position of the index in the list of phonemes of the source text. Initially, the position index corresponds to the index of the first phoneme, for example index 1 in the example sentence of Figure 2.
- a sub-step 114 of comparing at least one phoneme detected in the audio data with at least one phoneme expected among the following phonemes in the list of phonemes of the source text. The following phonemes in the list of phonemes of the example sentence are the phonemes "Λn", "é", "k" and "le".
- if no phoneme detected in the audio data matches the following phonemes, a sub-step 116 of receiving a plurality of phonemes detected in the audio data and a sub-step 118 of searching, in at least a part of the source text, for a sequence of phonemes in the list of phonemes of the source text corresponding to said plurality of detected phonemes. For example, in the example sentence of Figure 2, the phonemes "le t∫ æt gr im pa" are detected in the audio data and the position index is then placed at the end of this sequence of phonemes in the source text, that is to say at index 12, which corresponds to the last phoneme of the sequence.
- if the plurality of phonemes detected in the audio data does not correspond to any sequence of phonemes in the searched part of the list of phonemes of the source text, a sub-step 120 of searching the entire source text for a sequence of phonemes in the list of phonemes of the source text corresponding to said plurality of detected phonemes. This sub-step allows a sequence of phonemes to be searched for in the whole of the source text, for example in the sentences following or preceding the example sentence, not visible in the figures.
- if a detected phoneme or a plurality of detected phonemes is found to correspond with the list of phonemes of the source text, a sub-step 122 of updating the index with said corresponding phoneme or with the last phoneme of the corresponding sequence of phonemes. This final step of the cycle updates the index according to the correspondence established in the preceding sub-steps. The updated index is retrieved during a new execution of sub-step 112 of receiving the current position of the index.
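
The cycle of sub-steps 112 to 122 can be sketched as follows. This is a minimal illustration and not the implementation of the invention: the phoneme list `SOURCE`, the function names, the look-ahead of four phonemes, the local search window of twenty phonemes, the use of exact equality between phonemes and the choice of comparing only the most recently detected phoneme in sub-step 114 are all assumptions made for the example.

```python
from typing import Optional, Sequence

# Hypothetical phoneme list (not the one of Figure 2); indexes are 1-based.
SOURCE = ["dh", "ax", "k", "ae", "t", "s", "ae", "t", "d", "aw", "n"]


def find_sequence(source: Sequence[str], detected: Sequence[str],
                  start: int, stop: int) -> Optional[int]:
    """Return the 1-based index of the last phoneme of the first occurrence of
    `detected` inside source[start:stop] (0-based bounds), or None."""
    n = len(detected)
    if n == 0:
        return None
    stop = min(stop, len(source))
    for i in range(max(start, 0), stop - n + 1):
        if list(source[i:i + n]) == list(detected):
            return i + n
    return None


def update_position(source: Sequence[str], position: int,
                    detected: Sequence[str],
                    lookahead: int = 4, window: int = 20) -> int:
    """Run one tracking cycle and return the new 1-based position index.

    `position` is the current index (sub-step 112) and `detected` holds the
    phonemes recognised in the audio data, most recent last."""
    if not detected:
        return position
    # Sub-step 114: compare the latest phoneme with the next expected ones.
    # source[position] is the phoneme whose 1-based index is position + 1.
    expected = list(source[position:position + lookahead])
    if detected[-1] in expected:
        return position + 1 + expected.index(detected[-1])
    # Sub-steps 116 and 118: search the detected sequence in part of the text.
    hit = find_sequence(source, detected, position, position + window)
    if hit is None:
        # Sub-step 120: search the entire source text.
        hit = find_sequence(source, detected, 0, len(source))
    # Sub-step 122: update the index on a match, otherwise keep it.
    return position if hit is None else hit
```

With this sketch, a reader who skips ahead or goes back in the text is relocated by the sequence search, while an unrecognised phoneme leaves the index unchanged.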

The method may also include other sub-steps, not described, for matching the phonemes detected in the audio data with the phonemes recorded for the source text.

The method then comprises a step 130 of receiving at least one piece of data representative of a prosody value of the audio data, called prosody data 10, and/or of receiving at least one piece of data representative of the presence or absence of a movement of a movement recognition device during the reading of the text and representative of characteristics of a movement that is present, called movement data 12.

Prosody is representative of the emotion conveyed by the speaker of the source text when reading it. Taking movement into account, for its part, allows for interactivity during the reading of the source text.

Prosody and movement are taken into account independently or in combination in the rest of the method. This choice can be made by a user of the method, for example through an interface allowing them to select whether prosody and/or movement are taken into account in future readings.

The prosody data comprises data or a combination of data from among different types of data in the following list:

A combination of these data makes it possible to better characterize the emotion and to adapt the triggering of sound effects accordingly.

Figure 3 is an illustrative graph 300 showing an example of prosody as a function of time, as detected from the audio data. For simplicity and for illustrative purposes only, the prosody in graph 300 takes into account only the intensity and the fundamental frequency of the audio data. The value assigned to the prosody data here varies in two dimensions, but in practice the value of the prosody data is characterized by the variation of several data items, for example gathered in a vector or a matrix.

During a first time interval, called calibration interval 310, the intensity and the fundamental frequency are measured to obtain a value that will be considered as the average for the speaker of the audio data. The calibration interval 310 may be delimited by triggers, for example starting at a first predefined position index and ending at a second predefined position index.
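
As an illustration of the calibration interval, the sketch below averages the intensity and the fundamental frequency of the audio frames whose position index lies between two predefined indexes. The frame format `(position_index, intensity, f0)`, the `Baseline` type and the function name `calibrate` are assumptions made for the example; the description does not specify how the measurements are represented.

```python
from dataclasses import dataclass
from statistics import fmean
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Baseline:
    intensity: float  # mean intensity over the calibration interval
    f0: float         # mean fundamental frequency over the interval, in Hz


def calibrate(frames: Iterable[Tuple[int, float, float]],
              start_index: int, end_index: int) -> Baseline:
    """Average the frames whose position index is in [start_index, end_index).

    Each frame is a hypothetical (position_index, intensity, f0) tuple."""
    selected = [(inten, f0) for pos, inten, f0 in frames
                if start_index <= pos < end_index]
    if not selected:
        raise ValueError("no frames inside the calibration interval")
    return Baseline(intensity=fmean(i for i, _ in selected),
                    f0=fmean(f for _, f in selected))
```

The resulting baseline can then serve as the reference against which later prosody values are compared.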

The average prosody value determined during the calibration interval 310 is assigned to a zone called the average zone 312. If the prosody value varies over time, it may fall within other zones: a high zone 314 defined by the values above a high threshold 315, and a low zone 316 defined by the values below a low threshold 317. The variation in the value is, for example, representative of a percentage variation in the fundamental frequency and a percentage variation in the intensity, these two values possibly being weighted differently to give the total value. More zones may be defined depending on the data taken into account in the prosody, for example four zones, five zones or more. The zones described here are mainly illustrative, and the interpretation of the prosody values may rely on other classification methods, for example decision trees.
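
One possible reading of the weighted variation and of the three zones is sketched below. The equal weights, the thresholds of plus and minus 15 percent and the function names are assumptions chosen for the example; the description leaves the weighting and the threshold values open.

```python
def prosody_value(intensity: float, f0: float,
                  base_intensity: float, base_f0: float,
                  w_intensity: float = 0.5, w_f0: float = 0.5) -> float:
    """Weighted sum of the percentage variations of intensity and fundamental
    frequency relative to the calibrated averages (weights are assumptions)."""
    d_intensity = 100.0 * (intensity - base_intensity) / base_intensity
    d_f0 = 100.0 * (f0 - base_f0) / base_f0
    return w_intensity * d_intensity + w_f0 * d_f0


def classify_zone(value: float, high_threshold: float = 15.0,
                  low_threshold: float = -15.0) -> str:
    """Map a prosody value to the high, low or average zone."""
    if value > high_threshold:
        return "high"
    if value < low_threshold:
        return "low"
    return "average"
```

For instance, a speaker whose intensity and fundamental frequency both rise by 25 percent relative to the calibrated averages obtains a value of 25 and falls in the high zone under these assumed thresholds.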

The threshold values defining each zone may also depend on calibration data obtained in a calibration step making it possible to determine or estimate, from the audio data, the gender of the speaker and/or the age of the speaker and/or the language spoken by the speaker. These calibration data may also make it possible to select a machine learning model to be used for determining the prosody data, from a group of machine learning models embedded in the system implementing the method, each machine learning model being trained at least in part on data related to the gender of the speaker and/or the age of the speaker and/or the language spoken by the speaker, and therefore suited to the analysis of prosody for that gender, age or language.

Each zone then corresponds to an audio track comprising the sound effects to be triggered. When a sound effect is to be triggered, for example at times 318a, 318b, 318c, 318d represented by circles, the sound effect of the audio track associated with the zone in which the prosody value is located can be triggered. Here, times 318a, 318c and 318d are linked to the high zone and time 318b is linked to the low zone. The prosody data can also be used in real time to vary the sound effect in progress, for example by increasing or decreasing its volume, pitch, tempo, echo, etc.

The triggering of the sound effect can of course be subject to other conditions, in particular to the movement data.

The movement data comprises data relating to a displacement and/or a rotation in space and/or to the speed of displacement of the movement recognition device, and/or data relating to the correspondence of a movement or a combination of movements detected by the movement recognition device with a predetermined movement from a list of predetermined movements recorded in a storage module of the processing system. The absence of movement is also movement data that can be interpreted by the method.

Further details on the movement data are described below with reference to Figure 5, which illustrates a system for receiving and processing audio data.

Returning to Figure 1, the method then comprises a step 140 of determining the sound effect to be triggered as a function of the position index, and as a function of the prosody data and/or the movement data. The sound effect triggered is thus adapted to the reading context provided by the prosody and/or the movements.

In particular, in one embodiment of the invention, all of the sound effects associated with the source text are distributed over a plurality of time sequences associated with portions of the source text, each time sequence being associated with a sound effect or with a group of sound effects comprising several sound effects, and, when the position index corresponds to a time sequence associated with a group of sound effects, step 140 of determining the sound effect to be triggered comprises:

- a sub-step 142 of retrieving, as a function of the position index, the group of sound effects associated with the time sequence in which the position index is located,
- a sub-step 144 of determining, as a function of the prosody data and/or the movement data, the sound effect to be triggered among the sound effects of the group of sound effects.
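
Sub-steps 142 and 144 can be sketched as a lookup followed by a selection. The layout below is hypothetical: the start indexes in `SEQUENCE_STARTS`, the effect file names in `GROUPS` and the fallback to the effect of the average zone when a group has no effect for the current zone are assumptions made for the example, and only the prosody zone (not the movement data) is used for the selection.

```python
from bisect import bisect_right
from typing import Dict, List, Optional

# Hypothetical layout: time sequence k starts at position index SEQUENCE_STARTS[k].
SEQUENCE_STARTS: List[int] = [1, 20, 45]
# One group of sound effects per time sequence, keyed by prosody zone.
GROUPS: List[Dict[str, str]] = [
    {"average": "wind.ogg"},
    {"average": "rain.ogg", "high": "storm.ogg"},
    {"average": "steps.ogg", "high": "run.ogg", "low": "creep.ogg"},
]


def retrieve_group(position_index: int) -> Optional[Dict[str, str]]:
    """Sub-step 142: find the time sequence containing the position index."""
    k = bisect_right(SEQUENCE_STARTS, position_index) - 1
    return GROUPS[k] if k >= 0 else None


def determine_effect(position_index: int, zone: str) -> Optional[str]:
    """Sub-step 144: pick the effect for the zone, else the average-zone effect."""
    group = retrieve_group(position_index)
    if group is None:
        return None
    return group.get(zone, group.get("average"))
```

Keeping the sequence boundaries sorted lets the group be found with a binary search, which stays cheap even for a long text with many time sequences.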

Figure 4 schematically represents a time graph 400 comprising three audio tracks, an average track 412, a high track 414 and a low track 416, corresponding respectively to the sound effects to be triggered according to the average, high and low zones described above. As described above, these tracks are mainly illustrative, and in practice the triggering of sound effects may not be associated with particular tracks.

The audio tracks are read from left to right as the position index advances and the reading progresses. The sound effects of the audio tracks are grouped into groups of one to three sound effects associated with time sequences 421, 422, 423, 424, 425, 426, 427, 428, 429.

The first time sequence 421 is associated with a group that comprises, for example, a single sound effect 421a, which is applied regardless of the zone reached by the prosody value. The second time sequence 422 is associated with a group that likewise comprises a single sound effect 422a, in particular because these first two time sequences correspond to the calibration interval described above.

The following two sequences 423 and 424 are associated with groups each comprising two sound effects, 423a and 423b on the one hand and 424a and 424b on the other. Thus, if the prosody value is in the high zone, the sound effects 423b and 424b will be played, whereas if the prosody value is in the average or low zone, the sound effects 423a and 424a will be played.

Finally, the following sequences 425 to 429 are associated with groups comprising a sound effect for each zone, referenced respectively 425a, 425b and 425c, 426a, 426b and 426c, 427a, 427b and 427c, 428a, 428b and 428c, 429a, 429b and 429c. Thus, a sound effect is available for each of the high zone, the average zone and the low zone reached by the prosody value.

The sound effect triggering step 150 is executed when the position index is at or after a predetermined trigger index, or within a predetermined trigger window. The sound effect may be emitted immediately, as soon as the position index reaches or passes the trigger index or enters the trigger window, or emitted in a deferred manner, after the end of a sound effect currently being emitted. For example, if the position index reaches mark 430 and the prosody value has changed since the start of time sequence 424, the sound effect played may be changed immediately. Alternatively, if the position index reaches mark 431 and the prosody value has changed since the start of time sequence 426, the sound effect played may be changed only at the start of time sequence 427.
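
The difference between immediate and deferred emission can be illustrated with a small state holder. The `Trigger` and `EffectPlayer` names and their fields are hypothetical; the sketch covers only the trigger index case (not the trigger window) and models playback as a label rather than actual audio output.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trigger:
    effect: str
    trigger_index: int      # fire once the position index is at or after this
    deferred: bool = False  # True: wait until the current effect has ended
    fired: bool = False     # each trigger fires at most once


class EffectPlayer:
    def __init__(self) -> None:
        self.playing: Optional[str] = None  # effect currently being emitted
        self.pending: Optional[str] = None  # effect queued for deferred start

    def on_position(self, trigger: Trigger, position_index: int) -> None:
        """Evaluate a trigger against the current position index."""
        if trigger.fired or position_index < trigger.trigger_index:
            return
        trigger.fired = True
        if trigger.deferred and self.playing is not None:
            self.pending = trigger.effect  # deferred emission
        else:
            self.playing = trigger.effect  # immediate emission

    def on_effect_finished(self) -> None:
        """Called when the current effect ends; start any queued effect."""
        self.playing, self.pending = self.pending, None
```

In this sketch a deferred trigger never interrupts the effect in progress, which corresponds to the case where the change is applied only at the start of the next time sequence.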

According to an embodiment not shown, these audio tracks correspond to a background sound, and additional sound effects can be added during the reading as a function of the prosody data and/or the movement data and/or the position index. Furthermore, several sound effects can be combined, depending on the prosody and/or the movement. For example, for the reading of a text, a first sound effect comprising the sound of a single musical instrument can be played and, if the sound intensity detected when determining the prosody increases, one or more other sound effects adding musical instruments can be combined with the first sound effect.
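
The layering of instruments described in this example could be expressed as a list of intensity thresholds. The instrument names, the threshold values and the use of the ratio between the measured intensity and the calibrated average intensity are assumptions made purely for illustration.

```python
from typing import List, Sequence, Tuple

# Hypothetical layers: (minimum intensity ratio, sound effect), in order.
LAYERS: Sequence[Tuple[float, str]] = [
    (0.0, "piano"),    # first sound effect, always played
    (1.2, "strings"),  # added when intensity exceeds the average by 20 %
    (1.5, "drums"),    # added when intensity exceeds the average by 50 %
]


def active_layers(intensity_ratio: float,
                  layers: Sequence[Tuple[float, str]] = LAYERS) -> List[str]:
    """Return the sound effects to combine for a given ratio of measured
    intensity to calibrated average intensity."""
    return [name for threshold, name in layers if intensity_ratio >= threshold]
```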

In addition, the number of audio tracks and of zones may be greater than shown.

Finally, the method comprises a step 150 of triggering the determined sound effect. The sound effect is in particular emitted by a sound emission device of the receiving and processing system. This sound emission device is, for example, a loudspeaker or headphones.

Emitting the sound effect thus increases immersion during the reading of the source text.

Step 150 of triggering the sound effect comprises a sub-step 152 of checking the value of a confidence index, and the sound effect is triggered only if the confidence index is greater than a predetermined threshold.

The confidence index is managed by a sub-process 160 for managing the confidence index. This sub-process comprises a step 162 of increasing the value of the confidence index when a detected phoneme or a plurality of detected phonemes is found to correspond with the list of phonemes of the source text, and a step 164 of decreasing the value of the confidence index if no phoneme detected in the audio data matches the following phonemes, or if the detected phoneme is close enough to the expected phoneme to allow the position index to advance but is not an exact match. The confidence index thus prevents the sound effect from being triggered when the uncertainty on the accuracy of the phoneme tracking is too great. The confidence index may also be affected by other parameters.
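
A minimal sketch of sub-process 160 and of the check of sub-step 152 is given below. The integer scale from 0 to 100, the initial value, the threshold and the increments are assumptions made for the example; the description does not give numerical values.

```python
class ConfidenceIndex:
    """Confidence in the phoneme tracking, on an assumed scale of 0 to 100."""

    def __init__(self, value: int = 50, threshold: int = 60, gain: int = 10,
                 miss_penalty: int = 20, near_penalty: int = 5) -> None:
        self.value = value
        self.threshold = threshold
        self.gain = gain
        self.miss_penalty = miss_penalty
        self.near_penalty = near_penalty

    def on_match(self, exact: bool = True) -> None:
        if exact:
            # Step 162: a correspondence was detected, increase the index.
            self.value = min(100, self.value + self.gain)
        else:
            # Step 164: close enough to advance the position, but not exact.
            self.value = max(0, self.value - self.near_penalty)

    def on_miss(self) -> None:
        # Step 164: no detected phoneme matches the following phonemes.
        self.value = max(0, self.value - self.miss_penalty)

    def allows_trigger(self) -> bool:
        # Sub-step 152: trigger only above the predetermined threshold.
        return self.value > self.threshold
```

Under these assumed values, two exact matches in a row are needed before a sound effect may be triggered, and a single miss is enough to suspend triggering again.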

In addition, a confidence index may also be applied independently to the prosody data.

Figure 5 schematically represents a system 500 for receiving and processing audio data according to an embodiment of the invention. The system allows the reception and processing of audio data corresponding to the real-time reading of a source text by at least one speaker, for triggering sound effects synchronized with said reading of the text, by implementing the steps of the receiving and processing method described above.

The receiving system 500 here comprises a portable computing device in the form of a smartphone 510 comprising a source text storage module, an audio data reception module, an audio data processing module, a movement recognition device, a prosody variation analysis module, a sound effect determination module and a sound emission device, adapted to implement the associated steps of the method for receiving and processing audio data. The movement recognition device is, for example, an accelerometer, a gyrometer, a magnetometer or a combination thereof. The various modules are managed, for example, by the processor and the storage memory or memories of the smartphone 510.

The reception system 500 also comprises a device for attaching the motion recognition device, more particularly in this embodiment a device for attaching the smartphone 510 that contains this motion recognition device, configured to allow the motion recognition device to be attached to a medium on which the source text is printed, so that the motion recognition device is mechanically secured to said medium while the medium is being read. The medium on which the source text is printed is here a book, of which figure 5 shows a view 590a of the inside and a view 590b of the outside.

In particular, the attachment device here comprises an elastic band 512 passing through the back cover of the book via notches 514 of the attachment device. The smartphone 510 may, for example, be arranged at the last endpaper 592a of the book (also called the inside back cover). Alternatively, in a variant not shown, the smartphone 510 may be arranged with the same attachment device on the outside back cover 592b (also called the fourth cover).

Other types of attachment device can be implemented:

elastic elements connected to the medium on which the source text is printed, configured to hold the motion recognition device in position,
a pocket arranged in the cover of said medium, configured to accommodate the motion recognition device,
a compartment arranged in the cover of said medium, configured to accommodate the motion recognition device,
a permanent magnet arranged on said medium and configured to attract a magnetic element arranged on the motion recognition device, or a permanent magnet arranged on the motion recognition device and configured to attract a magnetic element arranged on said medium,
a pouch clipped onto said medium, configured to accommodate the motion recognition device,
etc.

The link between the motion recognition device and the medium makes it possible to detect a movement of the medium, here the book, during reading and to trigger an associated sound effect. A movement can for example consist of:

rotating the book like a vehicle steering wheel, which triggers the associated sound effects if the reading position index is associated with such a trigger,
tapping the book to mimic knocking on a door or playing a percussion instrument,
shaking the book to mimic shaking maracas or the movement of a fan,
lifting the book,
pitching the book left/right to mimic rocking an animal or a baby, or the movement of a rain stick,
rolling the book forward/backward to speed up or slow down the music,
etc.

The movement may also have to be performed a minimum number of times or with a minimum rotation angle to validate the triggering of the sound effect.
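The validation condition above can be illustrated with a minimal sketch. It assumes that an upstream gesture detector already reports how many times a gesture occurred and its rotation angle; the names `MotionRule` and `gesture_validated`, and the numeric thresholds, are hypothetical and do not come from the description.

```python
from dataclasses import dataclass


@dataclass
class MotionRule:
    """Hypothetical validation rule for one gesture (illustrative names)."""
    min_occurrences: int = 1    # e.g. number of taps or shakes required
    min_angle_deg: float = 0.0  # e.g. minimum rotation of the book


def gesture_validated(rule: MotionRule, occurrences: int, angle_deg: float) -> bool:
    """Return True when the detected gesture meets every minimum set by the rule."""
    return (occurrences >= rule.min_occurrences
            and abs(angle_deg) >= rule.min_angle_deg)


# Assumed example rules: three taps for a door knock,
# a 30 degree turn for the steering-wheel gesture.
knock = MotionRule(min_occurrences=3)
steering = MotionRule(min_angle_deg=30.0)
```

A rule left at its defaults accepts any detected gesture, so the two minimums can be used separately or together.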

The invention is not limited to the embodiments described. In particular:

the sound effects can be predetermined or generated in real time, for example via generative artificial intelligence,
the motion recognition device need not be attached to the medium carrying the source text if it moves at the same time as the source text (for example, if it is integrated into a smart watch worn by the speaker),
detection of certain words, in addition to phonemes, can be implemented to handle special cases, in particular to detect personalized words replacing or completing a portion of the source text,
several speakers can be detected during the reading of the source text and their respective prosody taken into account.

Claims

Method for receiving and processing, in a reception and processing system, audio data comprising words corresponding to the real-time reading of a source text by at least one speaker, for triggering sound effects synchronized with said reading of the text, characterized in that it comprises:

- a step (110) of determining an index (12) of the speaker's position in the source text, by detecting a correspondence between the audio data received and the source text recorded in a storage module of the reception and processing system,

- a step (112) of receiving at least one piece of data representative of a prosody value of the audio data, called prosody data (10), and/or of receiving at least one piece of data (12) representative of the presence or absence of a movement of a movement recognition device during the reading of the text and representative of characteristics of a present movement, called movement data,

- a step (140) of determining the sound effect to be triggered as a function of the position index (12), and as a function of the prosody data and/or the movement data,

- a step (150) of triggering the determined sound effect.

Reception and processing method according to claim 1, characterized in that the source text is recorded in the processing device in the form of a list of phonemes, and the processing device is configured to detect in the audio data the presence of a phoneme corresponding to the source text.

Reception and processing method according to claim 2, characterized in that the speaker position index (12) corresponds to a phoneme from the list of phonemes of the source text, and in that the step (110) of determining the speaker position index in the source text includes:

- a sub-step (112) of receiving the current position of the index (12) in the list of phonemes of the source text,

- a sub-step (114) of comparing at least one phoneme detected in the audio data with at least one phoneme expected among the following phonemes in the list of phonemes of the source text,

- if no phoneme detected in the audio data corresponds with the following phonemes, a sub-step (116) of receiving a plurality of phonemes detected in the audio data and a sub-step (118) of searching in at least part of the source text for a sequence of phonemes in the list of phonemes of the source text corresponding to said plurality of detected phonemes,

- if the plurality of phonemes detected in the audio data do not correspond to any sequence of phonemes in the list of phonemes of the source text, a sub-step (120) of searching in the entire source text for a sequence of phonemes in the list of phonemes of the source text corresponding to said plurality of detected phonemes,

- in case of detection of a correspondence of a phoneme or of a plurality of detected phonemes with the list of phonemes of the source text, a sub-step (122) of updating the index with said corresponding phoneme or the last phoneme of the corresponding sequence of phonemes.
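The sub-steps of this claim can be read as a three-level search. The sketch below is one possible reading and not the claimed implementation: it assumes phonemes are plain strings compared by exact equality, and the function names and the `lookahead` and `local_window` parameters are hypothetical.

```python
from typing import List, Optional


def find_sequence(phonemes: List[str], detected: List[str],
                  start: int, end: int) -> Optional[int]:
    """Return the index of the last phoneme of the first occurrence of
    `detected` lying entirely inside phonemes[start:end], or None."""
    n = len(detected)
    if n == 0:
        return None
    for i in range(start, min(end, len(phonemes)) - n + 1):
        if phonemes[i:i + n] == detected:
            return i + n - 1
    return None


def update_index(phonemes: List[str], index: int, detected: List[str],
                 lookahead: int = 3, local_window: int = 50) -> int:
    """Return the new position index, or the unchanged index if nothing matches."""
    if not detected:
        return index
    # Sub-step 114: compare the latest detected phoneme with the next expected ones.
    latest = detected[-1]
    for i in range(index + 1, min(index + 1 + lookahead, len(phonemes))):
        if phonemes[i] == latest:
            return i  # sub-step 122: index moves to the matching phoneme
    # Sub-steps 116/118: search part of the text for the detected sequence.
    start = max(0, index - local_window)
    hit = find_sequence(phonemes, detected, start, index + local_window)
    if hit is None:
        # Sub-step 120: fall back to a search over the entire source text.
        hit = find_sequence(phonemes, detected, 0, len(phonemes))
    # Sub-step 122: index moves to the last phoneme of the matched sequence.
    return index if hit is None else hit


text = list("abcdefghij")  # toy text, one letter standing in for one phoneme
```

In this toy text, a reader who jumps ahead and says "e f g" while the index is at 0 is relocated by the whole-text search once the local window fails.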

Reception and processing method according to one of claims 1 to 3, characterized in that the step of triggering the sound effect comprises a sub-step (152) of verifying the value of a confidence index and in that the sound effect is triggered only if the confidence index is greater than a predetermined threshold.

Reception and processing method according to a combination of claims 3 and 4, characterized in that the comparison sub-step (114) of at least one phoneme detected in the audio data with at least one phoneme expected among the following phonemes in the list of phonemes of the source text comprises a determination of an index representative of the resemblance between the phoneme detected in the audio data and the phoneme in the list of phonemes, and in that, if the index representative of the resemblance is less than a predetermined threshold, the method comprises a sub-step (122) of updating the index with said detected phoneme and a sub-step of decreasing the confidence index.

Reception and processing method according to a combination of claims 2 and 4, characterized in that it comprises a step (162) of increasing the value of the confidence index in the event of detection of a correspondence of a phoneme or of a plurality of phonemes detected with the list of phonemes of the source text and in that it comprises a step (164) of decreasing the value of the confidence index if no phoneme detected in the audio data corresponds with the following phonemes.
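The confidence index of claims 4 and 6 can be pictured as a bounded counter. The following is a minimal sketch under assumptions: the 0 to 100 integer scale, the step size and the threshold value are illustrative choices, and the class name `ConfidenceTracker` is hypothetical.

```python
class ConfidenceTracker:
    """Bounded confidence index, here an integer between 0 and 100 (assumed scale)."""

    def __init__(self, value: int = 50, step: int = 10, threshold: int = 50) -> None:
        self.value = value
        self.step = step
        self.threshold = threshold

    def on_match(self) -> None:
        """Step 162: a detected phoneme or sequence matched the source text."""
        self.value = min(100, self.value + self.step)

    def on_miss(self) -> None:
        """Step 164: no detected phoneme matched the following phonemes."""
        self.value = max(0, self.value - self.step)

    def may_trigger(self) -> bool:
        """Sub-step 152: trigger only if confidence is strictly above the threshold."""
        return self.value > self.threshold


tracker = ConfidenceTracker()
tracker.on_match()  # confidence rises from 50 to 60
```

Using integers avoids floating-point rounding when the value sits exactly on the threshold.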

Reception and processing method according to one of claims 1 to 6, characterized in that all of the sound effects associated with the source text are distributed in a plurality of time sequences associated with portions of the source text, each time sequence being associated with a sound effect or a group of sound effects comprising several sound effects, and in that when the position index corresponds to a time sequence associated with a group of sound effects, the step of determining the sound effect to be triggered comprises:

- a sub-step (142) of recovering, as a function of the position index, the group of sound effects associated with the time sequence in which the position index is located,

- a sub-step (144) of determining, as a function of the prosody data and/or the movement data, the sound effect to be triggered among the sound effects of the group of sound effects.
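The two sub-steps of this claim amount to a lookup followed by a selection. The sketch below is illustrative only: the `TimeSequence` structure, the variant labels ("soft", "loud", "shake"), the file names and the loudness threshold are all assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class TimeSequence:
    start: int               # first phoneme index covered (inclusive)
    end: int                 # last phoneme index covered (inclusive)
    effects: Dict[str, str]  # hypothetical variant label -> sound file


def group_for_index(sequences: List[TimeSequence], index: int) -> Optional[TimeSequence]:
    """Sub-step 142: recover the time sequence containing the position index."""
    for seq in sequences:
        if seq.start <= index <= seq.end:
            return seq
    return None


def choose_effect(seq: TimeSequence, intensity_db: float,
                  shaken: bool = False, loud_db: float = -20.0) -> Optional[str]:
    """Sub-step 144: pick one effect of the group from prosody and/or motion data."""
    if len(seq.effects) == 1:
        return next(iter(seq.effects.values()))
    if shaken and "shake" in seq.effects:
        return seq.effects["shake"]
    label = "loud" if intensity_db >= loud_db else "soft"
    return seq.effects.get(label)


sequences = [
    TimeSequence(0, 39, {"only": "birds.wav"}),
    TimeSequence(40, 79, {"soft": "breeze.wav", "loud": "storm.wav",
                          "shake": "thunder.wav"}),
]
```

Here the motion data takes priority over the speech intensity when both are available, which is a design choice of the sketch rather than something the claim prescribes.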

Reception and processing method according to one of claims 1 to 7, characterized in that the prosody data comprises data or a combination of data from different types of data in the following list:

- data on the intensity of speech in the audio data,

- data on the frequency and/or fundamental frequency of the speech of the audio data,

- data on the intensity and/or frequency and/or fundamental frequency of consonants pronounced in the audio data,

- data on the intensity and/or frequency and/or fundamental frequency of the vowels pronounced in the audio data,

- data on the length of vowels and/or spoken words in the audio data,

- data on the length of consonants pronounced in the audio data,

- data on the speech rate of the audio data.

Reception and processing method according to one of claims 1 to 8, characterized in that the movement data comprises data relating to a movement and/or a rotation in space and/or to the speed of movement of the movement recognition device, and/or data relating to the correspondence of a movement or a combination of movements detected by the movement recognition device with a predetermined movement from a list of predetermined movements recorded in a storage module of the processing system.

Reception and processing method according to one of claims 1 to 9, characterized in that the sound effect triggering step (150) is executed when the position index (12) is at a position equal to or after a predetermined trigger index, or within a predetermined trigger window, and in that the sound effect can be emitted immediately, as soon as the position index reaches or exceeds the trigger index or reaches the trigger window, or emitted in a delayed manner after the end of a sound effect currently being emitted.
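The trigger condition and the two emission modes of this claim can be sketched as follows. This is a simplified illustration under assumptions: positions are phoneme indices, times are in seconds, and the names `Trigger`, `is_due` and `emission_time` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trigger:
    start: int                 # trigger index, or first index of the trigger window
    end: Optional[int] = None  # None: any position at or after `start` qualifies
    delayed: bool = False      # True: wait for the effect currently playing to end


def is_due(trigger: Trigger, position_index: int) -> bool:
    """True when the position index has reached the trigger index or lies in the window."""
    if position_index < trigger.start:
        return False
    return trigger.end is None or position_index <= trigger.end


def emission_time(trigger: Trigger, now: float,
                  current_effect_ends_at: Optional[float]) -> float:
    """Return when to emit: immediately, or after the effect currently playing."""
    if (trigger.delayed and current_effect_ends_at is not None
            and current_effect_ends_at > now):
        return current_effect_ends_at
    return now


door = Trigger(start=120)                          # fires at or after index 120
thunder = Trigger(start=200, end=230, delayed=True)  # fires only inside the window
```

A window bounds how late an effect may still fire, which avoids playing it after the reader has skipped far past the relevant passage.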

Computer program product for receiving and processing audio data corresponding to the real-time reading of a source text by at least one speaker, for triggering sound effects synchronized with said reading of the text, said computer program product comprising program code instructions for executing, when said computer program product is executed on a computing device, the steps of the method according to one of claims 1 to 10.

System for receiving and processing audio data corresponding to the real-time reading of a source text by at least one speaker, for triggering sound effects synchronized with said reading of the text, characterized in that it comprises:

- a source text storage module,

- an audio data reception module,

- an audio data processing module, configured to determine a speaker position index in the source text, by detecting a correspondence between the received audio data and the source text recorded in the storage module,

- a motion recognition device configured to provide at least one piece of data representative of the presence or absence of a movement of the motion recognition device during the reading of the text and representative of characteristics of a present movement, called motion data, and/or a module for analyzing the variation of the prosody of the audio data configured to provide at least one piece of data representative of a variation of prosody of the audio data,

- a module for determining the sound effect to be triggered based on the position index, and based on the prosody data and/or the motion data,

- a sound emitting device configured to emit the determined sound effect.

System according to claim 12, characterized in that it comprises a portable computer device (510), preferably a smartphone, a digital tablet or a smart watch, comprising the source text storage module, the audio data reception module, the audio data processing module, the motion recognition device and/or the prosody variation analysis module, the sound effect determination module and the sound emission device.

System according to one of claims 12 or 13, characterized in that it comprises a motion recognition device and a device (512) for attaching the motion recognition device, configured to allow the motion recognition device to be attached to a medium on which the source text is printed, so that the motion recognition device is mechanically secured to said medium when the medium is read.

System according to claim 14, characterized in that the attachment device is formed of an element or a combination of elements from the following list:

- elastic elements (512) connected to the medium on which the source text is printed, configured to hold the motion recognition device in position,

- a pocket arranged in the cover of said medium, configured to accommodate the motion recognition device,

- a compartment arranged in the cover of said medium, configured to accommodate the motion recognition device,

- a permanent magnet arranged on said medium and configured to attract a magnetic element arranged on the motion recognition device, or a permanent magnet arranged on the motion recognition device and configured to attract a magnetic element arranged on said medium,

- a pouch clipped onto said medium, configured to accommodate the motion recognition device.