Disclosure of Invention
In view of the defects of the prior art, the invention provides a brand-new vocal emotion multi-modal magic model and method, thereby achieving the objects and intents of the invention.
The objects and intents of the invention are realized by the following technical solution and working steps:
1. VEM-Token magic model basic scheme implementation step
The invention provides a construction method of a VEM-Token vocal emotion multi-modal magic model, comprising the following steps:
ST100, collecting a sample file and a user file, capturing and segmenting beats with the VEM-Token model to obtain a VEM-Token1 sequence of the sample file and a VEM-Token2 sequence of the user file respectively, aligning the beats of all VEM-Token2 sequences against the VEM-Token1 sequence, and generating the aligned VEM-Token2 sequence.
ST200, identifying the VEM parameters of the VEM-Token1 sequence according to the VEM parameters included in the VEM-Token model, having the user determine a magic scheme, processing the VEM parameters of the VEM-Token2 sequence according to that scheme, and generating the VEM-Token2 sequence of a result file that conforms to the style of the VEM-Token1 sequence.
2. Sample file and user file beat capture and alignment steps
On the basis of the foregoing basic scheme, the present invention provides methods or steps for beat capture and alignment of the sample file and user file, including, but not limited to, one or more of the following in combination:
ST110, for a sample file that does not include loop segments, beat capture is used to mark the start and end points of each beat, generating the complete VEM-Token1 sequence.
ST120, for a sample file comprising loop segments, after beat capture, a start-point fine-tuning step and an end-point fine-tuning step are executed from the second loop segment onward until all loop segments are finished, realizing beat alignment inside the loop segments and generating all VEM-Token1 sequences.
ST130, according to the VEM-Token1 sequences, the beat capture and beat alignment steps, including the start-point and end-point fine-tuning steps, are executed on all VEM-Token2 sequences to process the corresponding beats and generate the aligned VEM-Token2 sequences.
ST140, the user files specifically comprise user files generated by the user singing in imitation of the sample file, user files generated by collecting the user's voice characteristics and fully cloning the sample file, and result files generated by mixing locally sung parts with locally cloned parts.
3. VEM-Token model
Based on the foregoing, the present invention provides steps or methods of the VEM-Token model, including, but not limited to, one or more of the following vocal emotion multi-modal models in combination:
ST210, the VEM-Token model further comprises a VEM library and a VEM processor.
ST211, the VEM parameters include the VEM classification, the VEM function, and the established VEM coordinate system recording the multiple modalities of vocal emotion.
ST212, the modalities include one or a combination of lyrics, singing voice, accompaniment, illustration, vocal style, music, emotional basis, accompaniment instruments, video, images, and ambient sound.
ST213, the emotional basis includes one or a combination of happiness, sadness, anger, fear, disgust, surprise, calm, pensiveness, expectancy, trust, love, hate, affection, and hostility.
ST214, the vocal style comprises one or a combination of folk, pop, rock, bel canto, art song, and opera singing styles.
ST215, the voice style comprises the fundamental frequency and overtones produced by the vocal cords and resonant cavities during natural singing, speaking, and reciting, which are innate features that distinguish the user from other people.
ST216, collecting various vocal files by VEM classification, having human vocal-music experts or a learning algorithm perform emotion judgment on the singing and accompaniment of the vocal files, training the VEM function by supervised learning and deep learning to obtain VEM parameters, and adding the VEM parameters to the VEM library.
ST220, the VEM processor provides an operation interface for human-machine interaction, through which the specific implementation of the magic scheme is completed.
4. Magic scheme model
Based on the foregoing, the present invention provides steps or methods of a magic scheme model, including, but not limited to, one or more of the following in combination:
The magic scheme comprises singing magic, accompaniment magic, emotion overtone magic, and emotion fluctuation magic, specifically comprising:
ST310, the magic scheme comprises singing magic, specifically comprising:
ST311, preprocessing the sample file and the user file by using a VEM processor, setting a singing filter, converting the sample file and the user file into a frequency spectrum format file, and separating a VEM-token1.1 sequence of singing voice of the sample file and a VEM-token2.1 sequence of singing voice of the user file.
ST312, capturing beat starting points and beat end points of all the VEM-Token1.1 sequences and all the VEM-Token2.1 sequences respectively according to beat capturing and beat aligning models, and aligning the beat starting points and the beat end points of the VEM-Token2.1 sequences with the beat starting points and the beat end points of the VEM-Token1.1 sequences by adopting a starting point fine tuning model and an end point fine tuning model.
ST320, the magic scheme further comprises accompaniment magic, specifically comprising:
ST321, preprocessing the sample file and the user file by using a VEM processor, setting an accompaniment filter, converting the sample file and the user file into a frequency spectrum format file, separating a VEM-token1.2 sequence of accompaniment sounds of the sample file and a VEM-token2.2 sequence of accompaniment sounds of the user file.
ST322, capturing beat starting points and beat end points of all the VEM-Token1.2 sequences and all the VEM-Token2.2 sequences respectively according to beat capturing and beat aligning models, and aligning the beat starting points and the beat end points of the VEM-Token2.2 sequences with the VEM-Token1.2 sequences by adopting a starting point fine tuning model and an end point fine tuning model.
ST330, the magic scheme further comprises emotion overtone magic, specifically comprising:
ST331, preprocessing the sample file and the user file by using a VEM processor, setting an emotion overtone filter, converting the sample file and the user file into a frequency spectrum format file, and separating a VEM-token1.3 sequence of emotion overtones of the sample file and a VEM-token2.3 sequence of emotion overtones of the user file.
ST332, capturing beat starting points and beat end points of all the VEM-Token1.3 sequences and all the VEM-Token2.3 sequences respectively according to beat capturing and beat aligning models, and aligning the beat starting points and the beat end points of the VEM-Token2.3 sequences with the VEM-Token1.3 sequences by adopting a starting point fine tuning model and an end point fine tuning model.
ST340, the magic scheme further comprises emotion fluctuation magic, specifically comprising:
ST341, preprocessing the sample file and the user file by using a VEM processor, setting an emotion fluctuation filter, converting the sample file and the user file into a frequency spectrum format file, and separating a VEM-token1.4 sequence for forming emotion fluctuation of the sample file and a VEM-token2.4 sequence for forming emotion fluctuation of the user file.
ST342, capturing beat starting points and beat end points of all the VEM-Token1.4 sequences and all the VEM-Token2.4 sequences respectively according to beat capturing and beat aligning models, and aligning the beat starting points and the beat end points of the VEM-Token2.4 sequences with the VEM-Token1.4 sequences by adopting a starting point fine tuning model and an end point fine tuning model.
5. Learning-to-sing magic
Based on the foregoing, the present invention provides methods or steps of a learning-to-sing magic model for a user learning a sample song file, including, but not limited to, one or more of the following steps:
The magic scheme further comprises a learning-to-sing magic step, specifically comprising:
ST410, the learning-to-sing magic comprises simple learning singing, specifically comprising:
ST411, using the VEM processor, for all or part of the sample file, taking a bar formed by combining several beats as the unit, the user learns to sing it one or more times; each take is recorded and converted into a group of singing-voice VEM-Token2.1 sequences of the user file, and the user then selects, from these groups, the group that best matches the corresponding singing-voice VEM-Token1.1 sequence of the sample file as the selected singing-voice VEM-Token2.1 sequence.
ST412, using steps ST310, ST311, and ST312, generating a beat-captured and beat-aligned VEM-Token2.1 sequence from the selected singing-voice VEM-Token2.1 sequence.
ST420, the learning-to-sing magic further comprises mixed learning singing, specifically comprising:
ST421, setting weights A and B for the VEM-Token1.1 sequence and the VEM-Token2.1 sequence, and computing the mixed-singing VEM-Token2.1 sequence as A × VEM-Token1.1 + B × VEM-Token2.1, where A < 0.3 and A + B = 1.0.
6. Vocal cloning
Based on the foregoing, the present invention provides methods or steps of a vocal cloning magic model, including, but not limited to, one or more of the following in combination:
The magic scheme further comprises a vocal cloning magic step, specifically comprising:
ST430, the magic scheme further comprises vocal cloning magic, specifically comprising:
ST431, querying the VEM library; if VEM parameters of the user's voice style exist, obtaining the VEM parameters of the user's voice style.
If no VEM parameters of the user's voice style exist, or the existing VEM parameters of the user's voice style do not meet the user's requirements, collecting a recording of the user singing a practice piece and a recording of the user speaking and reciting, converting the recordings by the VEM processor through the singing filter, the emotion overtone filter, and the emotion fluctuation filter into spectrum-format files carrying the user's voice style, obtaining the VEM parameters of the user's voice style according to the VEM classification, and storing them in the VEM library.
ST432, according to the singing-voice VEM-Token1.1 sequence of the sample file, recognizing lyric spectrum 1 of the sample file by the voice recognition included in the VEM-Token model, wherein lyric spectrum 1 comprises the lyrics and the positions of the start and end points of the beats where the lyrics are located.
ST433, copying lyric spectrum 1 of the sample file to become lyric spectrum 2 of the user file, and, using a voice synthesizer with the VEM parameters of the user's voice style, completing the user's vocal cloning magic word by word according to lyric spectrum 2 of the user file and the start and end positions of the beats, yielding the user's singing-voice VEM-Token2.1 sequence.
7. Lyric magic
Based on the foregoing, the present invention provides steps or methods of a lyric magic model, including, but not limited to, one or more of the following in combination:
The magic scheme further comprises a lyric magic step, specifically comprising:
ST500, when lyric spectrum 2 is inconsistent with lyric spectrum 1, or when the user needs to modify the lyrics, executing lyric magic, specifically comprising:
ST510, decomposing lyric spectrum 2 and lyric spectrum 1 according to semantic grammar into lyric sentence 2 and lyric sentence 1, and executing the following steps:
ST511, where the word count of lyric sentence 2 is the same as that of lyric sentence 1, copying lyric spectrum 1 beat by beat as lyric spectrum 2, filling each word of the lyrics of lyric sentence 2 one by one into the corresponding beat position, or letting the user modify words manually, to form the magic-modified lyric spectrum 2.
ST512, where the word count of lyric sentence 2 differs from that of lyric sentence 1, the user manually modifies the words of lyric sentence 2, and the lyrics are filled one by one into the corresponding beat positions to form the magic-modified lyric spectrum 2.
ST513, using a voice synthesizer to complete the user's vocal cloning magic word by word according to the VEM parameters of the user's voice style, lyric spectrum 2, and the start and end points of the beats, yielding the user's singing-voice VEM-Token2.1 sequence.
8. Pitch calibration magic, decorative sound magic, and other magic steps
On the basis of the foregoing, the present invention also includes, but is not limited to, steps or methods of one or more of the following in combination:
The magic scheme further comprises pitch calibration magic, decorative sound magic, beat length magic, rhythm speed magic, beat strength magic, timbre magic, emotion magic, video magic, free magic, multi-sample magic, and audition with real-time monitoring, specifically comprising:
ST520, pitch calibration magic: adjusting the fundamental frequency of a sound that falls between two adjacent note frequencies of twelve-tone equal temperament up or down to the note frequency.
ST530, front decorative sound magic: within one beat, when the beat start point of the singing voice VEM-Token2.1 of the user file falls within 1/2 beat of the beat start point of the singing voice VEM-Token1.1 of the corresponding sample file on the time axis, a decorative sound is inserted before the beat start point of VEM-Token2.1 to align the start points, the decorative sound including a tremolo, a glide, a sustained tone, or a breathing sound.
ST540, rear decorative sound magic: within one beat, when the beat end point of the singing voice VEM-Token2.1 of the user file comes earlier than the beat end point of the singing voice VEM-Token1.1 of the corresponding sample file within 1/2 beat on the time axis, a decorative sound or a rest is appended after the beat end point of VEM-Token2.1 to align the end points.
ST550, beat length magic: when the length of a beat of the singing voice VEM-Token2.1 of the user file is inconsistent with that of the corresponding beat of the singing voice VEM-Token1.1 of the sample file, a time-stretching algorithm step or a decorative sound magic step is used to compress or expand the beat of VEM-Token2.1 so as to align it with the corresponding beat of VEM-Token1.1.
ST560, rhythm speed magic: when the overall rhythm of the user file needs to be sped up or slowed down, a time-stretching algorithm step is used to synchronously compress or extend the rhythm of the singing and accompaniment of the user file.
ST570, beat strength magic: for the singing voice VEM-Token2.1 and/or the accompaniment VEM-Token2.2 of the user file, applying different processing based on a base layer and an emotion layer according to the user's needs, wherein:
ST571, the base layer comprises volume dynamics processing and envelope shaping steps, applied with threshold adjustments to VEM-Token2.1 and VEM-Token2.2.
ST572, the emotion layer comprises intelligent dynamic control based on AI/machine learning, specifically training a model to identify the beats and instruments in the audio, automatically generating dynamic processing parameters according to preset emotion labels or target loudness curves, and storing the training results in the VEM library.
ST580, when the beat length of VEM-Token2.1 is changed, synchronously checking the beat length of VEM-Token2.2, and when the lengths are inconsistent, capturing by time stretching and aligning VEM-Token2.1 with VEM-Token2.2.
ST590, timbre magic, specifically comprising:
ST591, for the singing-voice VEM-Token1.1 sequence of the sample file and the singing-voice VEM-Token2.1 sequence of the user file, applying a filter covering multiple groups of higher harmonics of the fundamental frequency, and decomposing the harmonic frequencies F1, F2, F3, …, Fn and their amplitudes A1, A2, A3, …, An, where n is the number of harmonics and n < 50.
ST592, using the volume dynamics processing and envelope shaping steps to respectively amplify or reduce the amplitude of one or more designated harmonic components among A1, A2, A3, …, An in the VEM-Token2.1 sequence, so as to change the timbre of the user file.
ST593, querying the AI/machine-learned timbre dynamic processing parameters in the VEM library, and adjusting the dynamic processing parameters to change the timbre of the user file.
ST5A0, emotion magic: querying the VEM library according to the user's emotion magic demands and, taking the VEM-Token2.1 sequence as the independent variable, respectively adjusting the VEM parameters including emotion, vocal style, and voice style to obtain the emotion magic result.
ST5B0, video magic: when the user file needs to be matched with a video, adjusting the content and rhythm of the video pictures according to the VEM parameters and the rhythm to fit the user file.
ST5C0, free magic: the user freely changes the content and rhythm of VEM-Token2.1, VEM-Token2.2, and the video pictures according to the VEM parameters and one or more modalities, to fit the needs of the user file.
ST5D0, multi-sample magic: the user selects more than one sample file, takes some of the VEM parameters from some sample files and the remaining parameters from other sample files, and generates by magic the content and rhythm of VEM-Token2.1, VEM-Token2.2, and the video pictures to fit the needs of the user file.
ST5E0, audition with real-time monitoring: the user auditions, judges, and scores the magic result in real time; the magic process and magic result generate the VEM-Token2 sequence of the result file, which is submitted and stored in the VEM library.
9. Member management
Based on the foregoing, the present invention further includes, but is not limited to, member management for users, specifically including one or more of the following steps or methods in combination:
The magic model also comprises member management, and specifically comprises the following steps:
ST600, applying for membership in the magic model according to the user's needs, establishing a member file, and storing the member file in the VEM library.
ST610, the member file comprises user information, sample file information, user file information, VEM parameters, and voiceprint encryption and decryption, wherein the encryption and decryption keys comprise the member's signature, image, video, and VEM parameters.
ST620, member management includes the operation steps of stepping forward, stepping back, rolling back, adding, deleting, querying, modifying, storing, and maintaining during the magic process.
ST630, member management further comprises instant modification, instant monitoring, real-time scoring, supervised learning, reinforcement learning, and reward and punishment of user files, with the results stored in the VEM library.
10. Others
Based on the foregoing, the magic model of the present invention further includes, but is not limited to, the following steps or methods:
ST700, the magic model further comprises application systems for mobile terminals and PC terminals, as well as a cloud-mode application system and a blockchain application system.
ST800, the magic model further comprises a supporting hardware system, including a communication interface, a recording module, a tuning module, a playback module, an encryption module, and a decryption module, as well as a Douyin (TikTok) system interface, a WeChat Video Channels interface, and an AI karaoke support system.
ST900, providing a synchronization signal for subsequent large-model applications, and accessing AI systems including DeepSeek, Kimi.AI, and ChatGPT to form an AI Agent.
STA00, the magic model further comprises interface protocols, providing the hardware- and network-based MIDI protocol, MSC extension protocol, and OSC network protocol, the transport-layer AES3/SPDIF and MADI protocols, and network audio transport protocols such as the Dante, AVB/TSN, and AES67 protocols.
11. Object and intent of the invention
The invention relates to a method for constructing a VEM-Token vocal emotion multi-modal magic model, whose objects and intents are to:
Realize "understanding of sound": recognition and understanding based on vocal music and musical emotion.
Create an automatic, intelligent vocal AI magic model for emotion recognition and understanding of music files, introducing VEM parameters.
Create automated magic methods and steps for a user learning from a sample music file.
Greatly increase the proportion of automatic parameter adjustment in the magic process, realizing standardization and automation.
Innovate a model oriented to music/vocal music and connect it to large models to form an intelligent agent for music AI applications or a dedicated application system, providing powerful support for broad AI applications.
Realize the modeling of vocal emotion and the capability of quantitatively characterizing vocal emotion across multiple modalities.
12. Advantageous effects of the invention
(1) "Phonological" is achieved, enabling AI to recognize and understand music/sound moods.
(2) The modeling of the vocal emotion is solved, and the quantitative characterization capability of the vocal emotion in a multi-mode is established.
(3) Realizing the automatic and intelligent vocal music modification based on emotion understanding.
(4) The high-efficiency magic method is realized by adopting automation, semi-automation and artificial intelligence to replace the artificial parameter adjustment.
(5) The "illusion" of AI operation is greatly reduced.
Detailed Description
The invention builds on the granted Chinese invention patent "VEM-Token vocal emotion multi-modal Token-based singing and accompaniment deep learning method", CN120126506, and the pending invention patent "VEM-Token beat capture and alignment model construction method", application No. 202511249168.0, as a patent-pool patent; it focuses on the way a user imitates and magic-modifies a sample song when learning it, and makes further basic innovations.
The objects and intents of the present invention can be achieved by the following specific examples. It should be noted that the specific embodiments have specific applications and industrial applicability. Therefore, the embodiments neither include all of the features and steps of the present invention nor limit the present invention; the claims summarize the invention.
This embodiment is one example of the present invention.
Specific embodiments of the present invention are exemplified as follows:
Novel construction method of a VEM-Token vocal multi-modal magic model: an AI learning-to-sing intelligent agent and AI karaoke system
Description of the drawings
The contents of this embodiment mainly include, but are not limited to, the following main schematic drawings, FIGS. 1 to 3.
Description of the implementation procedure
The method steps of this embodiment mainly comprise parts 1 to 10, each of which comprises a number of sub-steps. Unless specifically stated otherwise, these sub-steps are not all required, nor is their ordering fixed; they are optimized and further selected by the practitioner according to the needs of specific tasks.
The specific working steps are as follows:
1. VEM-Token magic model basic scheme implementation step
The invention provides a construction method of a VEM-Token vocal emotion multi-modal magic model, comprising the following steps:
ST100, collecting a sample file and a user file, capturing and segmenting beats with the VEM-Token model to obtain a VEM-Token1 sequence of the sample file and a VEM-Token2 sequence of the user file respectively, aligning the beats of all VEM-Token2 sequences against the VEM-Token1 sequence, and generating the aligned VEM-Token2 sequence.
ST200, identifying the VEM parameters of the VEM-Token1 sequence according to the VEM parameters included in the VEM-Token model, having the user determine a magic scheme, processing the VEM parameters of the VEM-Token2 sequence according to that scheme, and generating the VEM-Token2 sequence of a result file that conforms to the style of the VEM-Token1 sequence.
Here, the VEM-Token vocal emotion multi-modal model refers to the model in the "VEM-Token vocal emotion multi-modal Token singing and accompaniment deep learning method", CN120126506, and specifically comprises steps (1) to (4):
(1) Recording emotion in one or more modalities, denoting vocal emotion multi-modality as VEM, and constructing the VEM classification, VEM coordinate system, VEM function, and VEM library, wherein the vocal emotion comprises one or a combination of happiness, sadness, anger, fear, disgust, surprise, calm, expectancy, trust, love, hate, affection, and hostility; the multi-modality comprises one or a combination of lyrics, singing voice, accompaniment, vocal style, music, emotional basis, accompaniment instruments, video, and images; and the VEM coordinate system comprises a system of coordinate axes established from independent emotions, opposite emotion pairs, and associated opposite emotion groups (an illustrative sketch follows this list).
(2) Collecting vocal samples according to the VEM classification, having human vocal-music experts perform emotion judgment on the singing and accompaniment, training the VEM function by supervised learning and deep learning to obtain VEM parameters, and adding the VEM parameters to the VEM library.
(3) Using a VEM processor to perform beat calibration on the vocal files, separating singing-voice streams and accompaniment streams, performing VEM-Token segmentation of the vocal files by beat, converting the singing-voice streams into VEM-Token1 sequences and the accompaniment streams into VEM-Token2 sequences, and adding them to a preprocessing library.
(4) Using deep learning to respectively generate the lyric spectrum, the VEM-Token singing-voice spectrum, the VEM-Token accompaniment spectrum, and the VEM-Token music score.
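As an illustrative aid (an assumption of this description, not the patent's normative schema), the VEM coordinate system of step (1) can be sketched in code: opposite emotion pairs share one signed axis, while independent emotions receive their own axes. The axis names and pairings below are illustrative only.

```python
# Hypothetical sketch of a VEM coordinate system: opposite emotion pairs
# share one signed axis; independent emotions get their own axis.
from dataclasses import dataclass, field

OPPOSITE_PAIRS = [("happiness", "sadness"), ("trust", "disgust"),
                  ("expectancy", "surprise"), ("love", "hostility")]
INDEPENDENT = ["anger", "fear", "calm"]

@dataclass
class VEMCoordinate:
    values: dict = field(default_factory=dict)  # axis name -> value in [-1, 1]

    @classmethod
    def from_labels(cls, labels: dict) -> "VEMCoordinate":
        """Fold raw per-emotion intensities into the signed axis system."""
        v = {}
        for pos, neg in OPPOSITE_PAIRS:
            v[f"{pos}/{neg}"] = labels.get(pos, 0.0) - labels.get(neg, 0.0)
        for name in INDEPENDENT:
            v[name] = labels.get(name, 0.0)
        return cls(v)

# Example: a beat judged mostly happy with mild surprise.
coord = VEMCoordinate.from_labels({"happiness": 0.8, "surprise": 0.2})
print(coord.values)
```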
The VEM-Token beat capture and beat alignment model refers to the model in the "VEM-Token beat capture and alignment model construction method", application No. 202511249168.0, and specifically comprises steps (5) to (7) (a code sketch follows these steps):
(5) For a vocal file, setting a beat model comprising beat capture and beat alignment according to the VEM-Token vocal emotion multi-modal model, capturing the beats of the vocal file, segmenting the vocal file into VEM-Token sequences by beat, and marking the positions of the start and end points of the beat in each VEM-Token;
(6) Setting a start-point alignment model, comprising:
dividing the sample file included in the vocal file and the user file generated by the user singing in imitation of the sample file into a VEM-Token1 sequence and a VEM-Token2 sequence respectively, and, according to the start point of each VEM-Token1, using a start-point fine-tuning step to adjust the start points of the VEM-Token2 at the corresponding positions one by one so as to align them with the start points of the VEM-Token1 at the corresponding positions;
for each of the loop segments included in the vocal file, taking the first segment as the reference and starting from the second segment, using the start-point fine-tuning step to adjust the start point of each VEM-Token of each segment one by one so that it aligns with the start point of the VEM-Token at the corresponding position of the first segment, until all loop segments are finished;
(7) Setting an end-point alignment model, comprising:
according to the end point of each VEM-Token1, using an end-point fine-tuning step to adjust the end points of the VEM-Token2 at the corresponding positions one by one so as to align them with the end points of the VEM-Token1 at the corresponding positions;
and for each of the loop segments, taking the first segment as the reference and starting from the second segment, using the end-point fine-tuning step to adjust the end point of each VEM-Token of each segment one by one so that it aligns with the end point of the VEM-Token at the corresponding position of the first segment, until all loop segments are finished.
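To make steps (5) to (7) concrete, the following is a minimal sketch, not the patented algorithm itself: librosa's beat tracker stands in for the beat capture step, and start-point fine tuning is reduced to snapping each user beat to the corresponding sample beat within an assumed tolerance. File names are placeholders.

```python
# Minimal sketch of beat capture and start-point fine tuning.
import librosa
import numpy as np

def capture_beats(path: str) -> np.ndarray:
    """Return beat boundary times (seconds) for one audio file."""
    y, sr = librosa.load(path, sr=None, mono=True)
    _, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
    return librosa.frames_to_time(beat_frames, sr=sr)

def fine_tune_starts(sample_beats, user_beats, tolerance=0.25):
    """Snap each user beat start to the corresponding sample beat start
    when the offset is within `tolerance` seconds (an assumed threshold)."""
    n = min(len(sample_beats), len(user_beats))
    tuned = np.array(user_beats[:n], dtype=float)
    for i in range(n):
        if abs(tuned[i] - sample_beats[i]) <= tolerance:
            tuned[i] = sample_beats[i]  # align this start point
    return tuned

sample = capture_beats("sample_song.wav")  # VEM-Token1 beat boundaries
user = capture_beats("user_take.wav")      # VEM-Token2 beat boundaries
aligned = fine_tune_starts(sample, user)
```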
In the present application, the magic model is an algorithm by which a user file is modeled on a sample file.
It should be noted that more than one sample file may be involved.
If the user only needs to imitate a single sample file, the sample file is that one song, and only the VEM parameters of that song are needed. For example, if the user only wants to imitate Dao Lang's rendition of a song, the sample file is the song file of the Dao Lang version, and all VEM parameters of that song are selected.
If the user needs to select different VEM parameters from two sample files, then there are two sample files. For example, if the user likes part of Dao Lang's rendition of a song and part of another singer's rendition of the same song, the renditions of both singers and their respective VEM parameters need to be selected.
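A minimal sketch of this multi-sample parameter selection follows; the parameter names and values are hypothetical placeholders, not data from any real recording.

```python
# Hypothetical sketch: assembling one magic scheme from the VEM
# parameters of two sample files, per the user's choice.
version_a = {"vocal_style": "folk", "emotion": "sadness",
             "tempo_bpm": 62, "timbre_harmonics": [1.0, 0.7, 0.4]}
version_b = {"vocal_style": "pop", "emotion": "calm",
             "tempo_bpm": 66, "timbre_harmonics": [1.0, 0.5, 0.2]}

# The user takes vocal style and timbre from version A,
# and tempo and emotion from version B.
chosen = {
    "vocal_style": version_a["vocal_style"],
    "timbre_harmonics": version_a["timbre_harmonics"],
    "tempo_bpm": version_b["tempo_bpm"],
    "emotion": version_b["emotion"],
}
```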
FIG. 1 is a schematic diagram of the VEM-Token magic model.
In FIG. 1, one or more sample files and user files are input to the VEM processor and sent via path 101 to a VEM library query. If the VEM library already stores the sample files, and possibly the user files, the VEM parameters of those files and/or their VEM-Token1/VEM-Token2 sequences are sent via path 103 to the beat capture and alignment module. According to the user's needs, the VEM parameters and the VEM-Token1 sequence of the sample file can also be sent via path 102 to the VEM-Token magic matrix. If the sample file is not in the VEM library, the VEM parameters of the sample file and/or the user file are parsed by the VEM processor.
In the beat capture and alignment module, the data from the VEM processor includes at least the VEM-Token2 sequence of the user file, together with either the VEM-Token1 sequence of the sample file from path 103, or the VEM-Token1 sequence of the sample file and the VEM-Token2 sequence of the user file from the VEM processor; the beat capture and beat alignment steps are then executed. The results are sent to the multi-modal learning analysis module and further decomposed into the singing voices and accompaniments of the sample file and the user file, i.e., the VEM-Token1.1, VEM-Token1.2, VEM-Token2.1, and VEM-Token2.2 sequences. In addition, according to the user's needs, the emotion overtones and emotion fluctuations are further decomposed into the sample emotion overtone VEM-Token1.3 and sample emotion fluctuation VEM-Token1.4, and the user file's emotion overtone VEM-Token2.3 and emotion fluctuation VEM-Token2.4.
In FIG. 1, for convenience of illustration, a VEM-Token magic matrix is provided, and all the magic items in the present invention are described as incorporated into it. It should be understood that this is only an illustrative description; it is not necessarily a module or hardware, nor a limitation of the invention.
The information sources of the VEM-Token magic matrix comprise two paths:
All information originates from the multi-modal learning analysis module, i.e., VEM-Token1.1, VEM-Token1.2, VEM-Token1.3, and VEM-Token1.4 of the sample file, and VEM-Token2.1, VEM-Token2.2, VEM-Token2.3, and VEM-Token2.4 of the user file; in this case no content of the sample file is stored in the VEM library.
Part of the information originates from the multi-modal learning analysis module, namely the content of the user file, and part from the VEM library via path 104, namely the content of the sample file, because the content of the sample file is already stored in the VEM library.
The information output of the VEM-Token magic matrix comprises the magic result of the user file, which is stored in the VEM library according to the user's wishes.
2. Sample file and user file beat capture and alignment steps
On the basis of the foregoing basic scheme, the present invention provides methods or steps for beat capture and alignment of the sample file and user file, including, but not limited to, one or more of the following in combination:
ST110, for a sample file that does not include loop segments, beat capture is used to mark the start and end points of each beat, generating the complete VEM-Token1 sequence.
ST120, for a sample file comprising loop segments, after beat capture, a start-point fine-tuning step and an end-point fine-tuning step are executed from the second loop segment onward until all loop segments are finished, realizing beat alignment inside the loop segments and generating all VEM-Token1 sequences.
ST130, according to the VEM-Token1 sequences, the beat capture and beat alignment steps, including the start-point and end-point fine-tuning steps, are executed on all VEM-Token2 sequences to process the corresponding beats and generate the aligned VEM-Token2 sequences.
ST140, the user files specifically comprise user files generated by the user singing in imitation of the sample file, user files generated by collecting the user's voice characteristics and fully cloning the sample file, and result files generated by mixing locally sung parts with locally cloned parts.
It should be emphasized that loop segments are frequent in a song, for example a song with three looped verses. In the typical case the beats within loop segments are aligned, so the start-point and end-point fine adjustments can be performed. However, with the singer's emotional treatment or with different song styles, the beats of individual bars are not absolutely aligned across loop iterations. Here, the practitioner needs to apply different treatments depending on the song.
3. VEM-Token model
Based on the foregoing, the present invention provides steps or methods of the VEM-Token model, including, but not limited to, one or more of the following vocal emotion multi-modal models in combination:
ST210, the VEM-Token model further comprises a VEM library and a VEM processor.
ST211, the VEM parameters include the VEM classification, the VEM function, and the established VEM coordinate system recording the multiple modalities of vocal emotion.
ST212, the modalities include one or a combination of lyrics, singing voice, accompaniment, illustration, vocal style, music, emotional basis, accompaniment instruments, video, images, and ambient sound.
ST213, the emotional basis includes one or a combination of happiness, sadness, anger, fear, disgust, surprise, calm, pensiveness, expectancy, trust, love, hate, affection, and hostility.
ST214, the vocal style comprises one or a combination of folk, pop, rock, bel canto, art song, and opera singing styles.
ST215, the voice style comprises the fundamental frequency and overtones produced by the vocal cords and resonant cavities during natural singing, speaking, and reciting, which are innate features that distinguish the user from other people.
ST216, collecting various vocal files by VEM classification, having human vocal-music experts or a learning algorithm perform emotion judgment on the singing and accompaniment of the vocal files, training the VEM function by supervised learning and deep learning to obtain VEM parameters, and adding the VEM parameters to the VEM library.
ST220, the VEM processor provides an operation interface for human-machine interaction, through which the specific implementation of the magic scheme is completed.
It should be noted that the VEM-Token model is a model whose lexical units are vocal-emotion multi-modal tokens, different from the tokens of existing NLP, which are based on the direct interpretation of words and Chinese characters. The VEM-Token is a brand-new interpretation that takes the musical beat as the lexical unit; it is also linked to the multi-modal VEM parameters of vocal emotion, which contain vector data for various modes of expression, such as love, hate, affection, and hostility, together with the direction and scale of the VEM parameters. In addition, the model includes a VEM library built by supervised learning of typical songs by human vocal-music experts. Because the accuracy of the VEM parameters of the songs in the VEM library is therefore high, the reliability of the computed results is high, and the probability of "hallucination", as in NLP-Token models, is low.
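As an illustration of the beat-as-lexical-unit idea (a data layout assumed for this description, not the patent's schema), one VEM-Token can be pictured as a beat-sized record carrying its boundaries, its spectrum payload, and an emotion vector with direction and scale:

```python
# Illustrative data structure for one VEM-Token.
from dataclasses import dataclass
import numpy as np

@dataclass
class VEMToken:
    start_s: float             # beat start point on the time axis
    end_s: float               # beat end point
    spectrum: np.ndarray       # spectrum-format payload for this beat
    emotion: dict              # e.g. {"love": 0.6, "hostility": -0.1}
    modality: str = "singing"  # singing / accompaniment / overtone / ...

    @property
    def length_s(self) -> float:
        return self.end_s - self.start_s

# A list of such records is what ST100 calls a VEM-Token1/2 sequence.
token = VEMToken(12.5, 13.0, np.zeros(1025), {"happiness": 0.8})
print(token.length_s)  # 0.5-second beat
```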
4. Magic scheme model
Based on the foregoing, the present invention provides steps or methods of a magic scheme model, including, but not limited to, one or more of the following in combination:
The magic scheme comprises singing magic, accompaniment magic, emotion overtone magic, and emotion fluctuation magic, specifically comprising:
ST310, the magic scheme comprises singing magic, specifically comprising:
ST311, preprocessing the sample file and the user file by using a VEM processor, setting a singing filter, converting the sample file and the user file into a frequency spectrum format file, and separating a VEM-token1.1 sequence of singing voice of the sample file and a VEM-token2.1 sequence of singing voice of the user file.
ST312, capturing beat starting points and beat end points of all the VEM-Token1.1 sequences and all the VEM-Token2.1 sequences respectively according to beat capturing and beat aligning models, and aligning the beat starting points and the beat end points of the VEM-Token2.1 sequences with the beat starting points and the beat end points of the VEM-Token1.1 sequences by adopting a starting point fine tuning model and an end point fine tuning model.
ST320, the magic scheme further comprises accompaniment magic, specifically comprising:
ST321, preprocessing the sample file and the user file by using a VEM processor, setting an accompaniment filter, converting the sample file and the user file into a frequency spectrum format file, separating a VEM-token1.2 sequence of accompaniment sounds of the sample file and a VEM-token2.2 sequence of accompaniment sounds of the user file.
ST322, capturing beat starting points and beat end points of all the VEM-Token1.2 sequences and all the VEM-Token2.2 sequences respectively according to beat capturing and beat aligning models, and aligning the beat starting points and the beat end points of the VEM-Token2.2 sequences with the VEM-Token1.2 sequences by adopting a starting point fine tuning model and an end point fine tuning model.
ST330, the magic scheme further comprises emotion overtone magic, specifically comprising:
ST331, preprocessing the sample file and the user file by using a VEM processor, setting an emotion overtone filter, converting the sample file and the user file into a frequency spectrum format file, and separating a VEM-token1.3 sequence of emotion overtones of the sample file and a VEM-token2.3 sequence of emotion overtones of the user file.
ST332, capturing beat starting points and beat end points of all the VEM-Token1.3 sequences and all the VEM-Token2.3 sequences respectively according to beat capturing and beat aligning models, and aligning the beat starting points and the beat end points of the VEM-Token2.3 sequences with the VEM-Token1.3 sequences by adopting a starting point fine tuning model and an end point fine tuning model.
ST340, the magic scheme further comprises emotion fluctuation magic, specifically comprising:
ST341, preprocessing the sample file and the user file by using a VEM processor, setting an emotion fluctuation filter, converting the sample file and the user file into a frequency spectrum format file, and separating a VEM-token1.4 sequence for forming emotion fluctuation of the sample file and a VEM-token2.4 sequence for forming emotion fluctuation of the user file.
ST342, capturing beat starting points and beat end points of all the VEM-Token1.4 sequences and all the VEM-Token2.4 sequences respectively according to beat capturing and beat aligning models, and aligning the beat starting points and the beat end points of the VEM-Token2.4 sequences with the VEM-Token1.4 sequences by adopting a starting point fine tuning model and an end point fine tuning model.
The four magic schemes of singing magic, accompaniment magic, emotion overtone magic, and emotion fluctuation magic are the basic magic schemes for a user learning to sing against a sample song file; one key point is capturing and aligning the beats of the two. The four schemes can also be combined; for example, in many cases the accompaniment can be taken directly from the accompaniment track of the sample file without magic. It is particularly emphasized that, in beat fine tuning, each start-point adjustment should be followed by an end-point adjustment to ensure that the beat length stays fixed.
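The preprocessing common to ST311/ST321 can be sketched as follows. This is a minimal stand-in, not the patent's singing and accompaniment filters: librosa's harmonic/percussive separation is used as a crude two-way split, and the file names are placeholders.

```python
# Sketch: convert a file to spectrum format and split it into two streams.
import librosa

def split_streams(path: str):
    y, sr = librosa.load(path, sr=None, mono=True)
    spec = librosa.stft(y)                     # spectrum-format file
    harm, perc = librosa.decompose.hpss(spec)  # rough two-way split
    singing_like = harm   # stands in for the VEM-Token*.1 material
    accomp_like = perc    # stands in for the VEM-Token*.2 material
    return singing_like, accomp_like, sr

sample_sing, sample_acc, sr = split_streams("sample_song.wav")
user_sing, user_acc, _ = split_streams("user_take.wav")
```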
5. Learning-to-sing magic
Based on the foregoing, the present invention provides methods or steps of a learning-to-sing magic model for a user learning a sample song file, including, but not limited to, one or more of the following steps:
The magic scheme further comprises a learning-to-sing magic step, specifically comprising:
ST410, the learning-to-sing magic comprises simple learning singing, specifically comprising:
ST411, using the VEM processor, for all or part of the sample file, taking a bar formed by combining several beats as the unit, the user learns to sing it one or more times; each take is recorded and converted into a group of singing-voice VEM-Token2.1 sequences of the user file, and the user then selects, from these groups, the group that best matches the corresponding singing-voice VEM-Token1.1 sequence of the sample file as the selected singing-voice VEM-Token2.1 sequence.
ST412, using steps ST310, ST311, and ST312, generating a beat-captured and beat-aligned VEM-Token2.1 sequence from the selected singing-voice VEM-Token2.1 sequence.
ST420, the learning-to-sing magic further comprises mixed learning singing, specifically comprising:
ST421, setting weights A and B for the VEM-Token1.1 sequence and the VEM-Token2.1 sequence, and computing the mixed-singing VEM-Token2.1 sequence as A × VEM-Token1.1 + B × VEM-Token2.1, where A < 0.3 and A + B = 1.0.
The learning-to-sing magic is the first step in which a user imitates a sample file. In general, each lyric sentence and each beat needs to be learned and recorded repeatedly, and the most satisfactory take is then selected in the learning-to-sing magic. Next, for the selected recording, the beat start and end points of the VEM-Token2.1 sequence are aligned with those of the VEM-Token1.1 sequence.
For mixed learning singing, after beat capture and alignment, the data of the VEM-Token1.1 sequence of the corresponding beats of the sample file are weighted at, for example, 5%, the data of the user's learning-singing VEM-Token2.1 sequence at 95%, and the two are synthesized as the final user learning-singing file. The purpose of this step is to blend a small portion of the sample song file into the user's singing result so that the magic result comes closer to the sample song. Note in particular that whether 5% or more is used should be determined by monitoring; the ratio must not be too large, otherwise the difference between the sample and the user voices degrades the result.
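A minimal sketch of this weighted blend follows, assuming the token payloads are beat-aligned spectrum arrays of equal shape:

```python
# Sketch of mixed learning singing (ST421): A + B = 1.0, A < 0.3.
import numpy as np

def mix_tokens(token1_1: np.ndarray, token2_1: np.ndarray,
               a: float = 0.05) -> np.ndarray:
    assert a < 0.3, "sample weight must stay small (A < 0.3)"
    b = 1.0 - a                        # A + B = 1.0
    return a * token1_1 + b * token2_1

sample_beat = np.random.rand(1025)  # placeholder VEM-Token1.1 payload
user_beat = np.random.rand(1025)    # placeholder VEM-Token2.1 payload
mixed = mix_tokens(sample_beat, user_beat)  # 5% sample + 95% user
```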
6. Vocal cloning
Based on the foregoing, the present invention provides methods or steps of a vocal cloning magic model, including, but not limited to, one or more of the following in combination:
The magic scheme further comprises a vocal cloning magic step, specifically comprising:
ST430, the magic scheme further comprises vocal cloning magic, specifically comprising:
ST431, querying the VEM library; if VEM parameters of the user's voice style exist, obtaining the VEM parameters of the user's voice style.
If no VEM parameters of the user's voice style exist, or the existing VEM parameters of the user's voice style do not meet the user's requirements, collecting a recording of the user singing a practice piece and a recording of the user speaking and reciting, converting the recordings by the VEM processor through the singing filter, the emotion overtone filter, and the emotion fluctuation filter into spectrum-format files carrying the user's voice style, obtaining the VEM parameters of the user's voice style according to the VEM classification, and storing them in the VEM library.
ST432, according to the singing-voice VEM-Token1.1 sequence of the sample file, recognizing lyric spectrum 1 of the sample file by the voice recognition included in the VEM-Token model, wherein lyric spectrum 1 comprises the lyrics and the positions of the start and end points of the beats where the lyrics are located.
ST433, copying lyric spectrum 1 of the sample file to become lyric spectrum 2 of the user file, and, using a voice synthesizer with the VEM parameters of the user's voice style, completing the user's vocal cloning magic word by word according to lyric spectrum 2 of the user file and the start and end positions of the beats, yielding the user's singing-voice VEM-Token2.1 sequence.
The human voice, like a fingerprint, has personalized characteristics, mainly embodied in the structural features of the person's vocal tract. Acoustically, these appear as the fundamental frequency, the overtone pattern, and their respective amplitudes in the spectrum-format file, which are included in the VEM parameters; analyzing them reveals the features of the voice.
In principle, as long as the user's voice VEM parameters are in the VEM library, a voice synthesizer can construct the user's singing-voice VEM-Token2.1 sequence from the singing-voice VEM-Token1.1 sequence of the sample file together with the user's lyric spectrum 2, thereby realizing the user's vocal cloning magic. Even if the VEM library does not hold the user's VEM parameters, the VEM processor can collect the user's voice characteristics and convert them into VEM parameters, so the vocal cloning magic scheme can still be realized.
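A minimal sketch of extracting such voice-style features follows. It is an illustrative feature extractor, not the patent's VEM classifier: librosa's pYIN estimates the fundamental frequency, and harmonic amplitudes are read from the nearest STFT bins; the file name is a placeholder.

```python
# Sketch: fundamental frequency plus amplitudes of harmonics F1..Fn.
import librosa
import numpy as np

y, sr = librosa.load("user_recitation.wav", sr=None, mono=True)
f0, voiced_flag, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                                  fmax=librosa.note_to_hz("C6"), sr=sr)
S = np.abs(librosa.stft(y))             # magnitude spectrum
freqs = librosa.fft_frequencies(sr=sr)  # STFT bin center frequencies

def harmonic_amplitudes(frame: int, fundamental: float, n: int = 8):
    """Amplitudes A1..An at the bins nearest k * F0, k = 1..n."""
    return [float(S[np.argmin(np.abs(freqs - k * fundamental)), frame])
            for k in range(1, n + 1)]

voiced = np.flatnonzero(voiced_flag)
if voiced.size:
    t = min(int(voiced[0]), S.shape[1] - 1)  # guard frame-count mismatch
    profile = harmonic_amplitudes(t, float(f0[voiced[0]]))
```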
7. Lyric magic
Based on the foregoing, the present invention provides steps or methods of a lyric magic model, including, but not limited to, one or more of the following in combination:
The magic scheme further comprises a lyric magic step, specifically comprising:
ST500, when lyric spectrum 2 is inconsistent with lyric spectrum 1, or when the user needs to modify the lyrics, executing lyric magic, specifically comprising:
ST510, decomposing lyric spectrum 2 and lyric spectrum 1 according to semantic grammar into lyric sentence 2 and lyric sentence 1, and executing the following steps:
ST511, where the word count of lyric sentence 2 is the same as that of lyric sentence 1, copying lyric spectrum 1 beat by beat as lyric spectrum 2, filling each word of the lyrics of lyric sentence 2 one by one into the corresponding beat position, or letting the user modify words manually, to form the magic-modified lyric spectrum 2.
ST512, where the word count of lyric sentence 2 differs from that of lyric sentence 1, the user manually modifies the words of lyric sentence 2, and the lyrics are filled one by one into the corresponding beat positions to form the magic-modified lyric spectrum 2.
ST513, using a voice synthesizer to complete the user's vocal cloning magic word by word according to the VEM parameters of the user's voice style, lyric spectrum 2, and the start and end points of the beats, yielding the user's singing-voice VEM-Token2.1 sequence.
Lyric magic is a commonly used magic item in the magic model and a common way for users to modify a sample song. In most cases the modification of lyric spectrum 2 is only a local change to the original text, made with due consideration of the meaning of the lyrics and the rhyme of the filled-in words, rather than an impulsive rewrite.
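A minimal sketch of the beat-filling in ST511/ST512 follows; the data layout of the lyric spectrum is an assumption of this description:

```python
# Sketch: copy the beat grid of lyric spectrum 1 and fill modified words.
# Lyric spectrum layout: (word, beat_start_s, beat_end_s) per entry.
lyric_spectrum_1 = [("moon", 0.0, 0.5), ("light", 0.5, 1.0),
                    ("so", 1.0, 1.5), ("bright", 1.5, 2.0)]

def fill_lyrics(spectrum_1, new_words):
    """Fill modified words into the beat grid copied from spectrum 1."""
    if len(new_words) != len(spectrum_1):
        raise ValueError("word count must match the beat positions")
    return [(w, s, e) for w, (_, s, e) in zip(new_words, spectrum_1)]

# The user rewrites the sentence with the same word count (ST511).
lyric_spectrum_2 = fill_lyrics(lyric_spectrum_1,
                               ["star", "shine", "so", "far"])
```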
8. Pitch calibration magic, decorative sound magic, and other magic steps
On the basis of the foregoing, the present invention also includes, but is not limited to, steps or methods of one or more of the following in combination:
The magic scheme further comprises pitch calibration magic, decorative sound magic, beat length magic, rhythm speed magic, beat strength magic, timbre magic, emotion magic, video magic, free magic, multi-sample magic, and audition with real-time monitoring, specifically comprising:
ST520, pitch calibration magic: according to the twelve-tone equal temperament of music, the fundamental frequency of a sound must fall on a note frequency of the scale, and a sound whose frequency lies between two adjacent note frequencies must be adjusted up or down to the note frequency.
Pitch calibration is important for a non-professional singer. In music theory, pitch is the frequency of a note, and singing on pitch is also called intonation; professional singers, through long-term training, mostly have no problem with it. The pitch calibration magic here may include two aspects: calibration of the automatic overall pitch level, and pronunciation calibration within beats selected by the user.
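The snapping rule of ST520 reduces to simple arithmetic; a minimal sketch (assuming the A4 = 440 Hz reference) follows:

```python
# Sketch: snap a fundamental frequency to the nearest 12-TET note.
import math

def snap_to_equal_temperament(freq_hz: float, a4: float = 440.0) -> float:
    """Round to the nearest note frequency f = A4 * 2**(n / 12)."""
    n = round(12 * math.log2(freq_hz / a4))  # whole semitones from A4
    return a4 * 2 ** (n / 12)

print(snap_to_equal_temperament(452.0))  # -> 440.0 (A4)
print(snap_to_equal_temperament(460.0))  # -> 466.16... (A#4)
```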
ST530, front decorative sound magic: within one beat, when the beat start point of the singing voice VEM-Token2.1 of the user file falls within 1/2 beat of the beat start point of the singing voice VEM-Token1.1 of the corresponding sample file on the time axis, a decorative sound is inserted before the beat start point of VEM-Token2.1 to align the start points, the decorative sound including a tremolo, a glide, a sustained tone, or a breathing sound.
ST540, rear decorative sound magic: within one beat, when the beat end point of the singing voice VEM-Token2.1 of the user file comes earlier than the beat end point of the singing voice VEM-Token1.1 of the corresponding sample file within 1/2 beat on the time axis, a decorative sound or a rest is appended after the beat end point of VEM-Token2.1 to align the end points.
The choice of front or rear decorative sound type generally depends on the specific conditions, such as the length of the time gap to be compensated and the emotion VEM parameters of the beat, and is made by the user according to their understanding of the song.
ST550, beat length magic: when the length of a beat of the singing voice VEM-Token2.1 of the user file is inconsistent with that of the corresponding beat of the singing voice VEM-Token1.1 of the sample file, a time-stretching algorithm step or a decorative sound magic step is used to compress or expand the beat of VEM-Token2.1 so as to align it with the corresponding beat of VEM-Token1.1.
ST560, rhythm speed magic: when the overall rhythm of the user file needs to be sped up or slowed down, a time-stretching algorithm step is used to synchronously compress or extend the rhythm of the singing and accompaniment of the user file.
ST570, beat strength magic: for the singing voice VEM-Token2.1 and/or the accompaniment VEM-Token2.2 of the user file, applying different processing based on a base layer and an emotion layer according to the user's needs, wherein:
ST571, the base layer comprises volume dynamics processing and envelope shaping steps, applied with threshold adjustments to VEM-Token2.1 and VEM-Token2.2 (a sketch follows ST580 below).
ST572, the emotion layer comprises intelligent dynamic control based on AI/machine learning, specifically training a model to identify the beats and instruments in the audio, automatically generating dynamic processing parameters according to preset emotion labels or target loudness curves, and storing the training results in the VEM library.
ST580, when the beat length of VEM-Token2.1 is changed, synchronously checking the beat length of VEM-Token2.2, and when the lengths are inconsistent, capturing by time stretching and aligning VEM-Token2.1 with VEM-Token2.2.
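The base-layer threshold dynamics of ST571 can be sketched as a simple downward compressor over one beat's samples; the threshold and ratio values are illustrative assumptions:

```python
# Sketch: threshold-based volume dynamics for one beat.
import numpy as np

def compress(beat: np.ndarray, threshold: float = 0.5,
             ratio: float = 4.0) -> np.ndarray:
    """Attenuate the part of the envelope above `threshold` by `ratio`."""
    mag = np.abs(beat)
    over = mag > threshold
    gain = np.ones_like(mag)
    gain[over] = (threshold + (mag[over] - threshold) / ratio) / mag[over]
    return beat * gain

beat = np.random.uniform(-1.0, 1.0, 22050)  # one placeholder beat
softer = compress(beat)                     # tamed peaks, same length
```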
Beats and rhythm are basic physical quantities in music and among the most basic elements of music invented by human beings: music originated in the rhythms of collective human labor and developed through synchronized dance steps expressing joy. Magic applied to beat and rhythm is therefore very important.
It should be noted that a time-stretching algorithm is a computation that changes only the length of the audio, without changing the fundamental pitch frequency or the structure of the harmonic components of the sound. Time-stretching algorithms specifically include, but are not limited to, the following three families (a minimal code sketch follows the list):
(1) Time-domain algorithms: SOLA (Synchronous Overlap-Add) and its variants (e.g., WSOLA, SOLA-FS).
The basic principle is to divide the input signal into short, overlapping segments, time-scale each segment by compressing or expanding the overlap region between segments, find the best splice point based on waveform similarity to minimize distortion when splicing, and overlap-add the processed segments into the output signal.
The basic characteristics are high computation speed and suitability for real-time, low-resource applications (such as old-tape simulators and simple pitch shifters); however, transients (such as drum hits and note onsets) are handled poorly, clicks and a reverberant feel are easily produced, and the sound quality for music is unsatisfactory. Representative applications include early telephone voice-message speed control and some simple audio plug-ins.
(2) Frequency-domain algorithms (parametric / phase vocoder), currently the most popular and widely used class of methods, based on the Short-Time Fourier Transform (STFT). They mainly include, but are not limited to:
The phase vocoder (Phase Vocoder): its sound quality is much better than that of time-domain methods, especially for harmonically rich audio (such as human voice and piano). Its disadvantage is that percussive audio (e.g., a snare drum) is still handled imperfectly, producing the characteristic "robot voice" or "reverberant" artifacts.
Enhanced phase vocoders with transient handling: the processing quality for drums, bass and similar material is greatly improved. Representative implementations include iZotope Radius, Serato Pitch 'n Time, and the "Complex" and "Complex Pro" warp modes of Ableton Live.
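A basic phase-vocoder stretch can be sketched with librosa's built-in implementation, making the STFT pipeline explicit; the file names and the rate are assumptions:

    import librosa
    import soundfile as sf

    y, sr = librosa.load("user_take.wav", sr=None, mono=True)  # hypothetical input
    hop = 512
    D = librosa.stft(y, n_fft=2048, hop_length=hop)
    # rate > 1 shortens the audio, rate < 1 lengthens it; pitch is unchanged
    D_warp = librosa.phase_vocoder(D, rate=0.8, hop_length=hop)
    y_warp = librosa.istft(D_warp, hop_length=hop)
    sf.write("user_take_warped.wav", y_warp, sr)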
(3) Data-driven algorithms based on machine learning / deep learning, the current frontier research direction, which aim to produce more natural time-warping results by learning from large amounts of data. They mainly include, but are not limited to:
Generative-model-based methods: the model learns the latent distribution of audio signals. Given an input audio and a target duration, the model generates the best-sounding, best-matching output audio. The advantage is great potential: in theory the most natural results with the fewest artifacts, even "imagining" plausible detail to fill the stretched time. The disadvantages are the need for massive training data, huge consumption of computing resources, difficulty running in real time, and the possibility of the model producing uncontrollable hallucinations (artifacts).
Neural-vocoder-based methods: the waveform is not processed directly; instead, intermediate characterizations of the audio (such as the mel spectrogram, F0 pitch, and harmonic information) are extracted first, stretched, and then resynthesized. The advantage is sound quality generally far better than a traditional phase vocoder, because the neural vocoder is trained on large amounts of high-quality audio and can reconstruct more natural sound textures. The disadvantage is dependence on the quality of the neural vocoder, and the computation is also generally heavy.
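As a sketch of the intermediate characterizations such a pipeline extracts, again using librosa (the file name, pitch range and mel-band count are assumptions):

    import librosa
    import numpy as np

    y, sr = librosa.load("vocal.wav", sr=22050, mono=True)   # hypothetical file

    # Intermediate characterizations a neural vocoder typically consumes:
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)   # mel spectrogram
    f0, voiced, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                                 fmax=librosa.note_to_hz("C6"), sr=sr)  # F0 track
    log_mel = np.log(mel + 1e-6)
    # A neural vocoder would then reconstruct the waveform from
    # (log_mel, f0) after the time axis of these features is stretched.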
ST590, tone color magic change, specifically comprising:
ST591, for the singing voice VEM-Token1.1 sequence of the sample file and the singing voice VEM-Token2.1 sequence of the user file, a filter bank covering multiple higher harmonics of the fundamental frequency is adopted to decompose the frequencies of those harmonics, namely F1, F2, F3, ..., Fn, and their overtone amplitudes, namely A1, A2, A3, ..., An, where n is the number of harmonics and n < 50.
ST592, using the volume dynamic processing and envelope shaping steps, the overtone amplitude of one or more designated harmonic components among A1, A2, A3, ..., An in the VEM-Token2.1 sequence is respectively amplified or attenuated, thereby changing the tone color of the user file.
ST593, querying the AI/machine-learning tone color dynamic processing parameters in the VEM library and adjusting them to change the tone color of the user file.
Tone color is chiefly a personalized characteristic of the user's own pronunciation, analogous to a voice fingerprint. It is individual, deriving mainly from the structure of the resonance cavities of the human voice; acoustically it appears in the spectrum as the fundamental frequency, the pattern of overtones, and their respective amplitudes, all of which are included in the VEM parameters, and the voice characteristics can be identified by analyzing them.
In principle, as long as the user's VEM parameters are in the VEM library, the user's singing voice VEM-Token2.1 sequence can be constructed according to the singing voice VEM-Token1.1 sequence of the sample file, realizing the user's tone color magic change. Even if the user's VEM parameters are not in the VEM library, the VEM processor can collect the user's voice characteristics and convert them into VEM parameters, so that the tone color magic scheme can still be realized, as sketched below.
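A minimal sketch of the ST591/ST592 harmonic adjustment, using a plain FFT in place of the filter bank; the fundamental frequency, gains and bandwidth are illustrative assumptions, and n stays below the stated n < 50 bound:

    import numpy as np

    def adjust_harmonics(x, sr, f0, gains, n_harmonics=16, bw_hz=30.0):
        # Scale the amplitudes of selected harmonics F1..Fn of fundamental f0.
        # `gains` maps a 1-based harmonic index k to a linear gain,
        # e.g. {3: 1.5} boosts A3 by 50%. Narrow spectral bands around
        # k * f0 stand in for the multi-band filters of ST591.
        X = np.fft.rfft(x)
        freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
        for k in range(1, n_harmonics + 1):
            band = np.abs(freqs - k * f0) < bw_hz / 2.0
            X[band] *= gains.get(k, 1.0)
        return np.fft.irfft(X, n=len(x))

For example, adjust_harmonics(beat, sr, f0=220.0, gains={2: 0.7, 3: 1.4}) would attenuate A2 and boost A3 of a 220 Hz fundamental, shifting the perceived tone color.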
ST5A0, emotion magic change, which specifically comprises querying the VEM library according to the user's emotion magic change demands and, taking the VEM-Token2.1 sequence as the independent variable, respectively adjusting the VEM parameters concerned, including the emotion basis and vocal style demands, to obtain the emotion magic change result.
Emotion magic change is the key point of the invention. Based on the VEM-Token model, emotions have multiple classifications, namely the VEM classification; the degree of each emotion basis is counted into the VEM parameters, and emotion points are recorded at the corresponding coordinates.
For example, the "love-and-remount, emotion-enemy and happy-and-apprehension" emotion elements which are opposite and related are respectively represented by a X, Y, Z orthogonal three-dimensional spherical coordinate system, the coordinate intervals are respectively X-axis "love-and-remount" -100% to 100%, Y-axis "happy-and-apprehension" -100% to 100%, and Z-axis "emotion-enemy" -100% to 100%, and the three coordinate axes are mutually related in an orthogonal manner, as shown in fig. 2. For example, the emotional parameters recorded at the emotional points E (X, Y, Z) of a beat are (80%, -20%, 50%). By adopting the method, emotion is accurately recorded.
Users of the present invention are specifically reminded here that the VEM coordinate system includes, but is not limited to:
Independent emotions: independent, mutually uncorrelated emotions, for which a unidirectional one-dimensional coordinate axis is constructed, with the lowest point of the emotion as the coordinate 0 point and the highest point of the emotion as the coordinate maximum.
Opposite emotion pairs: two mutually opposite emotion components, for which a bidirectional one-dimensional coordinate axis is constructed, with the midpoint of the opposite pair as the coordinate 0 point, the highest point of the positive emotion as the positive coordinate maximum, and the highest point of the negative emotion as the negative coordinate maximum.
Associated opposite emotion groups: one or more mutually associated opposite emotion pairs whose coordinate 0 points are aligned in superposition; their bidirectional one-dimensional coordinate axes are made mutually orthogonal and delimited by a hyperplane, with the positive emotions on one side of the hyperplane and the negative emotions on the opposite side, constructing a super-orthogonal coordinate system in which the highest point of each positive emotion is the positive maximum of the super-orthogonal coordinate and the highest point of each negative emotion is the negative maximum.
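The constructions above can be sketched as simple data structures; the axis names reuse the emotion pairs from the example, and all field names are assumptions:

    from dataclasses import dataclass

    @dataclass
    class IndependentEmotion:
        # Unidirectional axis: 0 at the lowest point, +100 at the highest.
        name: str
        value: float = 0.0

        def set(self, v: float) -> None:
            self.value = max(0.0, min(100.0, v))

    @dataclass
    class OppositeEmotionPair:
        # Bidirectional axis: midpoint 0, positive pole +100, negative -100.
        positive: str
        negative: str
        value: float = 0.0

        def set(self, v: float) -> None:
            self.value = max(-100.0, min(100.0, v))

    # Associated opposite emotion group: pairs share a 0 point and are
    # treated as orthogonal axes; the emotion point E from the example:
    E = (
        OppositeEmotionPair("love", "remount", 80.0),
        OppositeEmotionPair("happy", "worry", -20.0),
        OppositeEmotionPair("emotion", "enemy", 50.0),
    )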
It should be noted that independent emotions, opposite emotion pairs, and associated opposite emotion groups are sometimes relative: for vocal works of different styles and different cultural backgrounds, the classification of the same modality is not always the same. Thus, in extreme cases, independent emotions, opposite emotion pairs, and associated opposite emotion groups are all specific to a particular vocal work.
It should be emphasized that by "super-orthogonal coordinate system" we mean rectangular coordinate axes of more than 3 dimensions arranged in the same system, e.g., 4, 5 or more dimensions; since these cannot be drawn directly on paper, we describe and record them by means of mathematical hyperspace. Moreover, these hyper-dimensions need not be strictly orthogonal: they may intersect mathematically without being orthogonal, and may even be based on the curvilinear coordinates of Riemannian geometry.
FIG. 3 is a schematic diagram of part of the user operation interface. In the figure, 3 color bands represent the adjustment-section interfaces of 3 groups of emotion pairs; the green ring on each color band is the sliding cursor for adjusting the emotion VEM parameter, and dragging the green ring left or right along the band adjusts the emotion values of the 3 emotion pairs "love-remount", "happy-worry" and "emotion-enemy". The human-machine interaction interface designed includes, but is not limited to, one-dimensional and two-dimensional screens, three-dimensional space, and even multi-dimensional space interaction interfaces; VEM parameter editing includes, but is not limited to, dragging-cursor, numerical-input and color-editing modes, as well as a "what you see is what you hear" real-time monitoring mode. The editing results and editing modes are stored centrally in the VEM library.
ST5B0, when the user file needs to be matched with video, the content and rhythm of the video pictures are adjusted according to the VEM parameters and the rhythm, to meet the requirements of the user file.
Because the invention is based on a VEM-Token model, this modality can be understood as a "sound-to-text" modality, so a subsequent "text-to-image" model can be connected: the text-information Tokens generated by "sound-to-text" can drive "text-to-image" to generate static images and dynamic videos, and the user can also connect videos manually to produce video magic changes.
ST5C0, free magic change: the user changes the content and rhythm of VEM-Token2.1, VEM-Token2.2 and the video pictures according to the VEM parameters and one or more modalities, to meet the requirements of the user file.
Free magic change comprises modifying the user's singing voice, the user's accompaniment and the video pictures according to the user's own intent and intention, in manual or semi-automatic modes.
ST5D0, multi-sample magic change: the user selects more than one sample file, assigning some sample files to some of the VEM parameters and other sample files to the remaining parameters, and magic change generates the content and rhythm of VEM-Token2.1, VEM-Token2.2 and the video pictures, to meet the requirements of the user file.
Multi-sample magic change is realized on the basis of more than one sample file. For example, the singing voice of a user's song may reference both a folk-song-style sample file and a more elegant tenor-style sample file, while the accompaniment references a sample file of national-instrument accompaniment; 3 sample files are thus referenced, realizing multi-sample magic change (see the sketch below).
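A minimal sketch of such a routing table in Python; the parameter group names and file names are illustrative assumptions:

    # Hypothetical parameter-to-sample routing for ST5D0: each group of VEM
    # parameters is magic-changed against a different sample file.
    routing = {
        "vocal_style_folk":  "sample_folk.wav",         # singing-voice reference 1
        "vocal_style_tenor": "sample_tenor.wav",        # singing-voice reference 2
        "accompaniment":     "sample_national_instruments.wav",
    }

    def reference_for(param_group: str) -> str:
        # Return the sample file whose VEM-Token1 sequence guides this group.
        return routing[param_group]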
ST5E0, "what you hear is what you get" real-time listening, specifically comprising the steps of real-time listening, judgment and scoring of the magic change result by the user, generation of the VEM-Token2 sequence of the result file from the magic change process and result, and submission and storage of that sequence in the VEM library.
Real-time monitoring is an important function that helps improve the user's magic change efficiency: while the user performs the magic change, the result is monitored audibly and visually in real time, achieving the effect of hearing the change as it is made.
Users of the present invention are particularly reminded that the human-machine interaction interfaces and VEM parameter editing modes described above for FIG. 3, including the "what you see is what you hear" real-time listening mode, apply here as well; the editing results and editing modes are likewise stored centrally in the VEM library.
9. Member management
Based on the foregoing, the present invention further includes, but is not limited to, member management for users, specifically including one or more of the following steps or methods in combination:
The magic model also comprises member management, and specifically comprises the following steps:
ST600, the user applies for membership in the magic model as needed; a member archive is established and stored in the VEM library.
ST610, the member archive comprises user information, sample file information, user file information, VEM parameters, and voiceprint encryption and decryption, wherein the encryption and decryption keys comprise the member signature, member image, member video and member VEM parameters.
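The ST610 archive can be sketched as a record type; every field name below is an assumption drawn from the list above:

    from dataclasses import dataclass, field

    @dataclass
    class MemberArchive:
        # Sketch of the ST610 member archive record.
        user_info: dict
        sample_file_info: list = field(default_factory=list)
        user_file_info: list = field(default_factory=list)
        vem_parameters: dict = field(default_factory=dict)
        # Voiceprint encryption/decryption key material, which per ST610 may
        # combine the member signature, image, video and member VEM parameters.
        voiceprint_key: bytes = b""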
ST620, member management includes the operation steps of advancing, undoing, rolling back, adding, deleting, querying, modifying, storing and maintaining during the magic change process.
ST630, member management further comprises instant modification, instant monitoring, real-time scoring, supervised learning, reinforcement learning, and rewards and penalties for user files, with the results stored in the VEM library.
This system is conducive to realizing a cloud mode plus fixed or mobile terminals, in which the VEM library and VEM parameters can be stored in the cloud center and at the terminals, or stored at blockchain nodes in a blockchain mode. Users are thus managed in a membership manner, which provides them convenience in intellectual property management.
10. Others
Based on the foregoing solution, the magic model of the present invention further includes, but is not limited to, the following expansion steps or methods:
ST700, the magic model further comprises an application system for the mobile terminal and an application system for the PC terminal, as well as a cloud-mode application system and a blockchain application system.
ST800, the magic model further comprises a supporting hardware system, comprising a communication interface, a recording module, a tuning module, a playback module, an encryption module and a decryption module, and comprising a Douyin (TikTok) system interface, a WeChat Video Account interface and an AI karaoke supporting system.
ST900, according to the synchronization signals provided for subsequent large-model applications, accessing AI systems including DeepSeek, Kimi.AI and ChatGPT to form an AI Agent.
STA00, the magic model further includes interface protocols: it provides the hardware- and network-based MIDI protocol, the MSC extension protocol and the OSC network protocol; provides the transport-layer AES3/SPDIF and MADI protocols; and provides network audio transport protocols such as the Dante protocol, the AVB/TSN protocol and the AES67 protocol.
Here MIDI is the Musical Instrument Digital Interface, MSC is MIDI Show Control, OSC is Open Sound Control, AES3/SPDIF are point-to-point digital audio transmission standards, MADI is the Multichannel Audio Digital Interface, Dante is an auto-discovery, low-latency, high-channel-count networked audio protocol developed by Audinate, AVB/TSN is a low-latency audio-video transmission protocol, and AES67 is an interoperability standard protocol based on existing network technology.
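As one illustration of the OSC network protocol named above, the following Python sketch (using the python-osc library) pushes an emotion-pair adjustment to a listener; the host, port, address pattern and argument layout are assumptions for illustration, not a protocol defined by the invention:

    from pythonosc.udp_client import SimpleUDPClient

    client = SimpleUDPClient("127.0.0.1", 9000)   # hypothetical host and port
    # The three floats mirror the example emotion point E = (80%, -20%, 50%).
    client.send_message("/vem/emotion", [0.80, -0.20, 0.50])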
It is specifically stated that, in this specification, for a variable name (e.g., VEM-Token with a numerical suffix), the content of the same variable (e.g., VEM-Token1.1) changes as the steps proceed; in accordance with the conventions of the computer arts, the same name may therefore hold different values at different steps, as will be understood by those skilled in the art.