
CN120853611A - The construction method of VEM-Token vocal emotion multimodal magic modification model - Google Patents

The construction method of VEM-Token vocal emotion multimodal magic modification model

Info

Publication number
CN120853611A
CN120853611A (application CN202511340091.8A)
Authority
CN
China
Prior art keywords
vem
beat
user
modification
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202511340091.8A
Other languages
Chinese (zh)
Other versions
CN120853611B (en)
Inventor
丁贤根
丁远彤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbour Star Health Biology Shenzhen Co ltd
Original Assignee
Harbour Star Health Biology Shenzhen Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbour Star Health Biology Shenzhen Co ltd filed Critical Harbour Star Health Biology Shenzhen Co ltd
Priority to CN202511340091.8A
Priority claimed from CN202511340091.8A (CN120853611B)
Publication of CN120853611A
Application granted
Publication of CN120853611B
Status: Active


Landscapes

  • Electrophonic Musical Instruments (AREA)

Abstract


The VEM-Token multimodal vocal emotion modification model, unlike natural language processing (NLP-Token) models that interpret music through text, is an innovative "sound-to-text" model that inherently carries multimodal vocal emotion information. The model captures and aligns sample songs and songs the user is learning to sing on a per-beat basis and identifies multiple modalities of vocal emotion. It divides a file into tokens according to beats, uses supervised and reinforcement learning to obtain VEM parameters, and decomposes a song into vocals, accompaniment, and emotion. The modification model offers multimodal methods for imitating a sample song, including modification of vocals, accompaniment, emotional overtones, emotional fluctuation, learned singing, voice cloning, lyrics, pitch calibration, decorative sounds, beat length, tempo, and beat strength, as well as free and multi-sample modification. The model also provides membership management, mobile and PC application systems, dedicated supporting hardware, and communication protocols including MIDI, facilitating integration with popular AI large models, reducing model hallucinations, and forming AI vocal agents and AI karaoke.

Description

Construction method of the VEM-Token vocal emotion multimodal modification model
Technical Field
The invention relates to the field of artificial intelligence, in particular to model construction and processing for AI Agents, AI music, and speech recitation, and especially to the sub-field of modification models for a "sound-to-text" model that divides tokens by the multimodal method of music beats and vocal emotion. Specifically, when people imitate a sample vocal performance or a sample recitation, beat capture and beat alignment are carried out to realize and optimize the imitation effect, and on this basis a construction method of the VEM-Token vocal emotion multimodal modification model is constructed.
Background
Currently, in the field of artificial intelligence, the decomposition of information is still based on the natural-language token segmentation of NLP-Token (Natural Language Processing Token). For text-based information modalities the NLP-Token has a natural advantage: a large model has learned and memorized essentially all of humanity's text from books and web pages. For non-textual information modalities, such as singing voice, emotion, and music style, current large models still search their memory for textual descriptions taken from books and web pages learned in the past, and obtain an "explanation" and "understanding" through those descriptions; that is, they still use a text-based information modality such as NLP-Token. Since we can neither control where the large model obtains its corpus in such cases nor guarantee the correctness of that corpus, large-model "hallucinations" are unavoidable.
Specifically, the prior art includes:
(1) Tools based on manual editing and Digital Audio Workstations (DAWs) rely heavily on manual operation. Representative techniques and software include Melodyne (Celemony), whose "note blob" technique allows the user to manually adjust pitch, duration, volume, vibrato, and even formants for each note; Auto-Tune (Antares Audio Technologies), originally designed as a real-time pitch correction effect, whose graphical mode can also be used to manually fine-tune pitch curves; Waves Tune, iZotope Nectar, and similar plug-ins providing comparable pitch and voice editing functions; and DAW built-in tools in Adobe Audition, Audacity, Logic Pro, Cubase, etc., providing basic functions such as compression, equalization (EQ), reverberation, and volume-envelope editing.
(2) Automated methods based on rules and Digital Signal Processing (DSP) attempt to automatically complete part of the correction work through preset algorithmic rules. Representative techniques and software include one-click repair plug-ins such as Auto-Tune's "Auto Mode" and Waves Tune Real-Time; tempo-alignment algorithms, such as the warp-marker technique in Ableton Live, which can automatically analyze and stretch audio to align it to the grid; and automatic pitch correction, where most DAWs and plug-ins provide a "one-click repair" function.
(3) Data-driven and Machine Learning (ML) approaches: most of these research efforts focus on local applications such as "timbre conversion" and "singing voice synthesis".
It can be seen that model-level understanding of multimodal vocal emotion is not solved by the current, traditional NLP-Token approach.
The inventors' group has for the first time put forward a brand-new model design, comprising the granted Chinese patent "VEM-Token vocal emotion multimodal Token-based singing voice and accompaniment deep learning method", CN120126506 (hereinafter the "VEM-Token vocal multimodal model"), and the pending invention patent "VEM-Token beat capture and alignment model construction method", 202511249168.0 (hereinafter the "VEM-Token beat model"). These successfully take the music beat as the information token, i.e. the VEM-Token (Vocal-Emotion-Multimodal Token, the token of multimodal vocal emotion), to divide a music file, and interpret and understand the meaning of music through the multimodal vocal-emotion VEM parameters; this is what is meant by "sound-to-text", i.e. generating tokens from vocal music/music. In those two patents, a certain number of songs with standard styles and classifications are collected by human music experts; the vocal files are converted to spectra, beats are detected, and the spectral vocal files are divided into VEM-Token sequences according to the vocal beats. A VEM coordinate system, VEM functions and a VEM library are established according to multiple modalities such as lyrics, singing voice, accompaniment, singer emotion, accompaniment emotion, videos and images; VEM-Token identification is performed; song-voice streams and accompaniment streams are separated; the vocal samples are given multimodal emotion scores by vocal experts; VEM parameters are obtained by supervised learning and deep learning algorithms; and the multimodal emotion of the vocal samples is learned. For other vocal works, multimodal vocal emotion can then be identified, and the lyric spectrum, VEM-Token song-voice spectrum, VEM-Token accompaniment spectrum, and VEM-Token score can be output. The present invention can serve as a patent-pool patent together with CN120126506 and 202511249168.0, can be connected to an AI system or an independently developed application system, and can be developed into a vocal intelligent Agent that can recognize the score of a song by listening. Supervised learning is performed to obtain the VEM library.
Because such a specialized and accurate VEM library is generated through supervised learning by human experts, the possibility of a downstream large model producing "hallucinations" in application is almost eliminated.
The VEM-Token concepts referred to in this application are based on the basic concepts and steps defined in the CN120126506 and 202511249168.0 patents, unless otherwise emphasized or specifically defined in this application. References herein to "vocal files", "music files", "songs", and "musical compositions" mean the same thing unless specifically stated otherwise.
The invention is intended to be applied to the development and application of current large models, such as OpenAI, DeepSeek, Google Gemini, Kimi, Doubao, and other large models, which form various Agents once front ends and back ends are connected for particular applications. The invention aims to access these large models and realize two-way communication with them, forming artificial-intelligence applications based on music/vocal music, and even music-AI agents, providing powerful innovation and support for expanding AI applications.
Shortcomings of the prior art methods
(1) Without vocal emotion modeling, the quantitative characterization capability of vocal emotion multimodality is lacking.
(2) Recognition and understanding based on vocal music and music emotion cannot be achieved.
(3) Automatic and intelligent vocal music modification based on emotion recognition and understanding cannot be realized.
(4) Modification methods are based on manual operation: parameters can only be adjusted by hand, with low efficiency and no standardization or automation.
Disclosure of Invention
In view of the defects of the prior art, the invention provides a brand-new way of thinking about, and method for, multimodal vocal-emotion modification, achieving the objects and intents of the invention.
The aim and the intention of the invention are realized by adopting the following technical proposal and working steps:
1. VEM-Token magic model basic scheme implementation step
The invention is used as a construction method of a VEM-Token vocal emotion multi-mode magic model, which comprises the following steps:
ST100, collecting a sample file and a user file, capturing and dividing beats with the VEM-Token model to obtain the VEM-Token1 sequence of the sample file and the VEM-Token2 sequence of the user file respectively, aligning the beats of all VEM-Token2 sequences against the VEM-Token1 sequence, and generating the aligned VEM-Token2 sequence.
ST200, identifying the VEM parameters of the VEM-Token1 sequence according to the VEM parameters included in the VEM-Token model, letting the user determine a modification scheme, processing the VEM parameters of the VEM-Token2 sequence, and generating, through the modification, the VEM-Token2 sequence of a result file that conforms to the style of the VEM-Token1 sequence under the chosen modification scheme.
2. Sample file and user file beat capture alignment step
In the foregoing basic aspect, the present invention provides a method or step of beat capture and alignment in terms of sample files and user file models, including, but not limited to, one or more of the following combinations:
ST110 for a sample file that does not include loop segments, beat capture is used to mark the start and end points of the beat, generating the complete VEM-Token1 sequence.
ST120, for a sample file comprising loop segments, starting from the second loop segment until all loop segments are finished, after beat capture, executing a starting point fine tuning step and an end point fine tuning step, realizing beat alignment inside the loop segments, and generating all VEM-Token1 sequences.
ST130 is to execute beat capturing and beat aligning steps including a starting point fine tuning step and an end point fine tuning step in all VEM-Token2 sequences according to the VEM-Token1 sequences to process corresponding beats to generate the VEM-Token2 sequences.
ST140, the user files specifically comprise user files generated by the user singing in imitation of the sample file, user files generated by collecting the user's voice characteristics and fully cloning the sample file, and result files generated by mixing partly-sung and partly-cloned user files.
3. VEM-Token model
Based on the foregoing, the present invention provides steps or methods in terms of a VEM-Token model, including, but not limited to, a combination of one or more of the following vocal emotion multimodal models:
The ST210 is that the VEM-Token model also comprises a VEM library and a VEM processor.
The VEM parameters include VEM classification, VEM function and established VEM coordinate system recording multiple modalities in vocal emotion.
ST212 the modalities include one or a combination of lyrics, singing voice, accompaniment, illustration, vocal style, music, emotional basis, accompaniment instrument, video, image, and ambient sound.
ST213 the emotional basis includes one or a combination of happiness, sadness, anger, fear, aversion, surprise, calm, longing, expectancy, trust, love, hate, affection, and hostility.
ST214 the vocal style comprises one or a combination of folk, pop, rock, Western, popular, composed-song and opera singing styles.
ST215 the voice style includes the fundamental frequency and overtones produced by the user's vocal cords and resonant cavities during natural singing, speaking, and recitation; these are innate features that distinguish the user from other people.
ST216, collecting various vocal files by VEM classification, performing emotion judgment on singing and accompaniment on the vocal files by a vocal expert or a learning algorithm of a human, training a VEM function by adopting supervised learning and deep learning to obtain VEM parameters, and adding the VEM parameters to a VEM library.
ST220 is that the VEM processor provides an operation interface for the interaction of the user and the machine, and the specific implementation of the magic scheme is completed.
4. Magic scheme model
Based on the foregoing, the present invention provides a step or method in a magic solution model, including, but not limited to, one or more of the following combinations:
the magic scheme comprises the steps of singing magic, accompaniment magic, emotion overtone magic, and emotion fluctuation magic, and specifically comprises the following steps:
ST 310. The magic scheme comprises singing magic, and specifically comprises:
ST311, preprocessing the sample file and the user file by using a VEM processor, setting a singing filter, converting the sample file and the user file into a frequency spectrum format file, and separating a VEM-token1.1 sequence of singing voice of the sample file and a VEM-token2.1 sequence of singing voice of the user file.
ST312, capturing beat starting points and beat end points of all the VEM-Token1.1 sequences and all the VEM-Token2.1 sequences respectively according to beat capturing and beat aligning models, and aligning the beat starting points and the beat end points of the VEM-Token2.1 sequences with the beat starting points and the beat end points of the VEM-Token1.1 sequences by adopting a starting point fine tuning model and an end point fine tuning model.
ST 320. The magic scheme also comprises accompaniment sound magic, and specifically comprises:
ST321, preprocessing the sample file and the user file by using a VEM processor, setting an accompaniment filter, converting the sample file and the user file into a frequency spectrum format file, separating a VEM-token1.2 sequence of accompaniment sounds of the sample file and a VEM-token2.2 sequence of accompaniment sounds of the user file.
ST322, capturing beat starting points and beat end points of all the VEM-Token1.2 sequences and all the VEM-Token2.2 sequences respectively according to beat capturing and beat aligning models, and aligning the beat starting points and the beat end points of the VEM-Token2.2 sequences with the VEM-Token1.2 sequences by adopting a starting point fine tuning model and an end point fine tuning model.
ST330 the magic scheme also comprises emotion overtone magic, which comprises the following steps:
ST331, preprocessing the sample file and the user file by using a VEM processor, setting an emotion overtone filter, converting the sample file and the user file into a frequency spectrum format file, and separating a VEM-token1.3 sequence of emotion overtones of the sample file and a VEM-token2.3 sequence of emotion overtones of the user file.
ST332, capturing beat starting points and beat end points of all the VEM-Token1.3 sequences and all the VEM-Token2.3 sequences respectively according to beat capturing and beat aligning models, and aligning the beat starting points and the beat end points of the VEM-Token2.3 sequences with the VEM-Token1.3 sequences by adopting a starting point fine tuning model and an end point fine tuning model.
ST340 the magic scheme also comprises emotion fluctuation magic, which specifically comprises:
ST341, preprocessing the sample file and the user file by using a VEM processor, setting an emotion fluctuation filter, converting the sample file and the user file into a frequency spectrum format file, and separating a VEM-token1.4 sequence for forming emotion fluctuation of the sample file and a VEM-token2.4 sequence for forming emotion fluctuation of the user file.
ST342, capturing beat starting points and beat end points of all the VEM-Token1.4 sequences and all the VEM-Token2.4 sequences respectively according to beat capturing and beat aligning models, and aligning the beat starting points and the beat end points of the VEM-Token2.4 sequences with the VEM-Token1.4 sequences by adopting a starting point fine tuning model and an end point fine tuning model.
5. Learning and singing magic improvement
Based on the foregoing, the present invention provides steps or methods of a modification model for a user learning to sing a sample song file, including, but not limited to, one or more of the following in combination:
The magic scheme also comprises a step of learning to sing the magic, and specifically comprises the following steps:
ST410 the learning magic singing comprises simple learning singing, and specifically comprises:
ST411, using the VEM processor and taking a bar (a combination of several beats) as the unit, the user sings all or part of the sample file more than once; the recordings are converted into several groups of user-file singing-voice VEM-Token2.1 sequences, and the user then selects, from these groups, the one he or she prefers most relative to the corresponding sample-file singing-voice VEM-Token1.1 sequence as the selected singing-voice VEM-Token2.1 sequence.
ST412, using ST310, ST311, ST312 steps, generating a beat captured and aligned VEM-token2.1 sequence from the selected singing voice VEM-token2.1 sequence.
ST420 is that the learning magic singing improvement also comprises the mixed learning singing, which comprises the following steps:
ST421, setting weights A and B for the VEM-Token1.1 sequence and the VEM-Token2.1 sequence, and computing the mixed singing result as the sequence A × VEM-Token1.1 + B × VEM-Token2.1, where A < 0.3 and A + B = 1.0.
6. Vocal cloning
Based on the foregoing, the present invention provides steps or methods of a voice-cloning modification model, including, but not limited to, one or more of the following in combination:
the magic scheme also comprises a voice cloning magic step, and specifically comprises the following steps:
ST430 the magic scheme also comprises voice clone magic, which comprises the following steps:
ST431, the VEM library is queried; if VEM parameters for the user's voice style exist, they are obtained.
If no VEM parameters for the user's voice style exist, or the existing parameters do not meet the user's needs, a recording of the user singing a practice piece and a recording of the user speaking and reciting are collected; the VEM processor converts these recordings, through the singing filter, the emotional-overtone filter, and the emotional-fluctuation filter, into a spectrum-format file carrying the user's voice style; the VEM parameters of the user's voice style are then obtained according to the VEM classification and stored in the VEM library.
ST432, recognizing a lyric spectrum 1 in the sample file by adopting voice recognition included in a VEM-Token model according to the singing voice VEM-Token1.1 sequence of the sample file, wherein the lyric spectrum 1 comprises lyrics and positions of a start point and an end point of a beat where the lyrics are positioned.
ST433 copies the lyrics spectrum 1 of the sample file to become the lyrics spectrum 2 of the user file, adopts a voice synthesizer according to the VEM parameters of the voice style of the user, and finishes the voice cloning magic change of the user word by word according to the lyrics spectrum 2 of the user file and the starting point and the end point position of the beat, thus becoming the singing voice VEM-token2.1 sequence of the user.
7. Lyric magic
Based on the foregoing, the present invention provides a model step or method on a lyric magic model, including, but not limited to, one or more of the following in combination:
the magic scheme also comprises a lyric magic step, and specifically comprises the following steps:
ST500, when lyric spectrum 2 is inconsistent with lyric spectrum 1, or the user needs to modify the lyrics, lyric modification is executed, specifically including:
ST510, decomposing lyric spectrum 2 and lyric spectrum 1 according to semantic grammar into lyric sentence 2 and lyric sentence 1, and executing the following steps:
ST511, where lyric sentence 2 and lyric sentence 1 have the same number of words, lyric spectrum 1 is copied beat by beat as lyric spectrum 2, each word of lyric sentence 2 is filled into the corresponding beat position one by one, or the words are manually modified by the user, forming the modified lyric spectrum 2.
ST512, where lyric sentence 2 and lyric sentence 1 have different numbers of words, the words of lyric sentence 2 are manually modified by the user and filled into the corresponding beat positions one by one, forming the modified lyric spectrum 2.
ST513, using a voice synthesizer and the VEM parameters of the user's voice style, the user's voice-cloning modification is completed word by word according to lyric spectrum 2 and the beat start and end points, producing the user's singing-voice VEM-Token2.1 sequence.
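As a concrete illustration of the lyric-spectrum handling in ST500–ST513 above, the following is a minimal sketch (not part of the patented method itself) of one possible representation of a lyric spectrum as (word, beat start, beat end) entries and of copying lyric spectrum 1 into a modified lyric spectrum 2; the names LyricEntry and build_lyric_spectrum_2 are illustrative assumptions, not terms from the patent.

```python
from dataclasses import dataclass, replace
from typing import List

@dataclass
class LyricEntry:
    word: str          # one lyric word/syllable
    beat_start: float  # beat start point on the time axis, in seconds
    beat_end: float    # beat end point on the time axis, in seconds

def build_lyric_spectrum_2(lyric_spectrum_1: List[LyricEntry],
                           new_words: List[str]) -> List[LyricEntry]:
    """Copy lyric spectrum 1 beat by beat and fill the (possibly user-modified)
    words of lyric sentence 2 into the corresponding beat positions."""
    if len(new_words) != len(lyric_spectrum_1):
        raise ValueError("lyric sentence 2 must provide one word per beat slot")
    return [replace(entry, word=w) for entry, w in zip(lyric_spectrum_1, new_words)]
```

In this representation, ST511/ST512 only change the word field while every beat's start and end point is preserved, which is exactly the timing information the synthesizer in ST513 needs to place each cloned word.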
8. Pitch calibration, decorative sounds, and other modifications
On the basis of the foregoing, the present invention also includes, but is not limited to, steps or methods of one or more combinations of the following:
the modification scheme also comprises the steps of pitch calibration, decorative-sound modification, beat-length modification, tempo modification, beat-strength modification, timbre modification, emotion modification, video modification, free modification, multi-sample modification, and real-time audition monitoring, specifically including:
ST520, pitch calibration: a fundamental frequency that falls between two adjacent node frequencies of the twelve-tone equal-temperament scale is adjusted up or down to the nearer node frequency (a minimal sketch is given after this list).
ST530, leading decorative-sound modification: within one beat, when the beat start point of the user file's singing voice VEM-Token2.1 and the beat start point of the corresponding sample file's singing voice VEM-Token1.1 fall within 1/2 beat of each other on the time axis, a decorative sound is inserted before the start point of VEM-Token2.1 to compensate and align the start points; the decorative sounds include a vibrato, a slide, a sustained tone, and a breath sound.
ST540, trailing decorative-sound modification: within one beat, when the beat end point of the user file's singing voice VEM-Token2.1 precedes the beat end point of the corresponding sample file's singing voice VEM-Token1.1 by less than 1/2 beat on the time axis, a decorative sound or a rest is appended after the end point of VEM-Token2.1 to align the end points.
ST550, beat-length modification: when the length of a beat of the user file's singing voice VEM-Token2.1 does not match the length of the corresponding beat of the sample file's singing voice VEM-Token1.1, a time-warping step or a decorative-sound step is used to compress or stretch the beat of VEM-Token2.1 so that it aligns with the corresponding beat of VEM-Token1.1.
ST560, tempo modification: when the overall tempo of the user file needs to be sped up or slowed down, a time-stretching step is used to synchronously compress or stretch the singing voice and accompaniment of the user file (see the time-stretch sketch after this list).
ST570, beat strength magic, namely aiming at singing voice VEM-token2.1 of a user file and/or accompaniment voice VEM-token2.2 of the user file, adopting different treatments based on a base layer and an emotion layer according to the needs of a user, wherein:
ST571, the base layer comprises volume dynamic processing and envelope shaping steps, applied with thresholds to VEM-Token2.1 and VEM-Token2.2.
ST 572. The emotion layer comprises intelligent dynamic control based on AI/machine learning, specifically comprises training a model to intelligently identify beats and musical instruments in audio, automatically generating dynamic processing parameters according to preset emotion labels or target loudness curves, and storing training results in a VEM library.
ST580, when the beat length of VEM-Token2.1 is changed, the beat length of VEM-Token2.2 is checked synchronously; when the lengths are inconsistent, time stretching is used to re-capture and align VEM-Token2.1 and VEM-Token2.2.
ST590, tone color magic change, specifically comprising:
ST591, for the singing-voice VEM-Token1.1 sequence of the sample file and the singing-voice VEM-Token2.1 sequence of the user file, a filter covering several higher harmonics of the fundamental frequency is used to decompose the harmonic frequencies F1, F2, F3, ..., Fn and their amplitudes A1, A2, A3, ..., An, where n is the number of harmonics and n < 50.
ST592, volume dynamic processing and envelope shaping steps are used to amplify or attenuate the amplitude of one or more specified harmonic components among A1, A2, A3, ..., An in the VEM-Token2.1 sequence, so as to change the timbre of the user file (a harmonic-scaling sketch is given after this list).
ST593, inquiring the tone color dynamic processing parameters of AI/machine learning in the VEM library, and adjusting the dynamic processing parameters to change the tone color of the user file.
ST5A0, emotion modification: according to the user's emotion-modification demands, the VEM library is queried and, with the VEM-Token2.1 sequence as the independent variable, the VEM parameters for the required emotion, vocal style, and voice style are adjusted respectively to obtain the emotion-modification result.
ST5B0, when the user file needs to be matched with the video, the content and the rhythm of the video picture are adjusted according to the VEM parameters and the rhythm so as to adapt to the requirement of the user file.
ST5C0, free magic, the user changes the content and rhythm of the VEM-token2.1, VEM-token2.2 and video pictures according to the VEM parameters and more than one mode, so as to adapt to the requirements of user files.
ST5D0, multiple sample magic, wherein the user selects more than one sample file, selects part of sample files corresponding to part of parameters in the VEM parameters respectively, selects other part of sample files corresponding to the other part of parameters, and generates the content and rhythm of the VEM-Token2.1, VEM-Token2.2 and video pictures by magic so as to adapt to the requirements of the user files.
ST5E0, real-time audition monitoring: the user auditions, judges, and scores the modification result in real time; the modification process and result are generated as the VEM-Token2 sequence of the result file and submitted for storage in the VEM library.
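The three sketches below illustrate, in order, the pitch calibration of ST520, the time-warping step of ST550/ST560, and the harmonic decomposition and scaling of ST591/ST592. They are minimal illustrations only: the A4 = 440 Hz reference, the function names, and the use of librosa's phase-vocoder time stretch are assumptions of these sketches, not requirements of the patent.

```python
import numpy as np

A4 = 440.0  # reference pitch in Hz; the patent does not fix a reference, this is an assumption

def snap_to_equal_temperament(f0_hz: float) -> float:
    """ST520: move a fundamental frequency lying between two adjacent
    twelve-tone equal-temperament node frequencies to the nearer node."""
    if f0_hz <= 0:
        return f0_hz                               # unvoiced frame, leave untouched
    semitones = 12.0 * np.log2(f0_hz / A4)         # signed distance from A4 in semitones
    return A4 * 2.0 ** (round(semitones) / 12.0)   # nearest node frequency

# snap_to_equal_temperament(452.0) -> 440.0 (A4)
```

A time-stretch step of the kind named in ST550/ST560 could be sketched as:

```python
import numpy as np
import librosa

def stretch_to_length(y: np.ndarray, sr: int, target_len_s: float) -> np.ndarray:
    """ST550/ST560: compress or stretch a beat (or a whole file) so that its
    duration matches a target length, using a phase-vocoder time stretch."""
    current_len_s = len(y) / sr
    rate = current_len_s / target_len_s            # rate > 1 shortens, rate < 1 lengthens
    return librosa.effects.time_stretch(y, rate=rate)
```

And the harmonic decomposition and scaling of ST591/ST592 could be sketched as:

```python
import numpy as np

def scale_harmonics(frame: np.ndarray, sr: int, f0: float,
                    gains: dict, n_harmonics: int = 20) -> np.ndarray:
    """ST591/ST592: decompose a short frame into harmonics F1..Fn of the
    fundamental f0 and scale the amplitudes of selected harmonics;
    `gains` maps a 1-based harmonic index to a linear gain, e.g. {3: 1.5, 5: 0.7}."""
    spectrum = np.fft.rfft(frame)
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    for k in range(1, n_harmonics + 1):             # patent: n < 50
        if k in gains:
            band = np.abs(freqs - k * f0) < f0 / 4  # narrow band around the k-th harmonic
            spectrum[band] *= gains[k]
    return np.fft.irfft(spectrum, n=len(frame))
```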
9. Member management
Based on the foregoing, the present invention further includes, but is not limited to, member management for users, specifically including one or more of the following steps or methods in combination:
The magic model also comprises member management, and specifically comprises the following steps:
ST600, applying for members to the magic model according to the needs of the user, establishing a member file, and storing the member file in a VEM library.
ST610, the member archive comprises user information, sample file information, user file information, VEM parameters, voiceprint encryption and voiceprint decryption, wherein the encryption and decryption keys comprise member signatures, member images, member videos and member VEM parameters.
ST620, member management includes advancing, backing, rolling back, adding, deleting, inquiring, modifying, storing and maintaining operation steps in the magic process.
ST630, member management also comprises instant modification, instant monitoring, real-time scoring, supervised learning, reinforcement learning, rewarding and punishment of user files, and the results are stored in a VEM library.
10. Others
Based on the foregoing, the magic model of the present invention further includes, but is not limited to, the following steps or methods:
ST700 is that the magic model also comprises an application system of a mobile terminal and an application system of a PC terminal, and also comprises an application system of a cloud mode and a blockchain application system.
ST800, the modification model also comprises a supporting hardware system, including a communication interface, a recording module, a tuning module, a playback module, an encryption module and a decryption module, as well as a Douyin (TikTok) system interface, a video-account (WeChat Channels) interface, and an AI karaoke supporting system.
ST900, synchronization signals are provided for subsequent large-model applications, and AI systems including DeepSeek, Kimi and ChatGPT are accessed to form an AI Agent.
STA00, the modification model also includes interface protocols, providing the hardware- and network-based MIDI protocol, MSC extension protocol, and OSC network protocol, providing the transport-layer AES3/SPDIF and MADI protocols, and providing network audio transport protocols such as the Dante protocol, AVB/TSN protocol, and AES67 protocol.
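As one illustration of how such an interface could carry per-beat synchronization data for ST900/STA00, the sketch below sends a beat message over the OSC network protocol using the python-osc package; the "/vem/beat" address and the payload layout are assumptions of this sketch, not part of the patent.

```python
# pip install python-osc
from pythonosc.udp_client import SimpleUDPClient

def send_beat_sync(host: str, port: int, beat_index: int,
                   beat_start_s: float, beat_end_s: float) -> None:
    """Send one per-beat synchronization message so that an external system,
    e.g. an AI karaoke front end or a large-model Agent, can follow the beats."""
    client = SimpleUDPClient(host, port)
    client.send_message("/vem/beat", [beat_index, beat_start_s, beat_end_s])

# example: send_beat_sync("127.0.0.1", 9000, 17, 24.53, 25.01)
```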
11. Object and intent of the invention
The invention relates to a method for constructing a VEM-Token vocal emotion multi-mode magic model, which aims at and aims at:
Realizing "sound-to-text", i.e. recognition and understanding based on vocal music and musical emotion.
Creating an automatic and intelligent vocal-AI modification model based on emotion recognition and understanding of music files, and introducing VEM parameters.
Creating automated modification methods and steps for a user learning to sing from a sample music file.
Greatly increasing the degree of automatic parameter adjustment in the modification process, and realizing standardization and automation.
Innovating a model oriented to music/vocal music and connecting it to large models to form intelligent Agents for music-AI applications or dedicated application systems, providing powerful support for broad AI applications.
Realizing the modeling of vocal emotion and the capability of quantitatively characterizing multimodal vocal emotion.
12. Advantageous effects of the invention
(1) "Phonological" is achieved, enabling AI to recognize and understand music/sound moods.
(2) The modeling of the vocal emotion is solved, and the quantitative characterization capability of the vocal emotion in a multi-mode is established.
(3) Realizing the automatic and intelligent vocal music modification based on emotion understanding.
(4) An efficient modification method is realized by using automation, semi-automation and artificial intelligence in place of manual parameter adjustment.
(5) The "illusion" of AI operation is greatly reduced.
Drawings
List of drawings:
FIG. 1 is a schematic diagram of the VEM-Token modification model
FIG. 2 is a schematic diagram of associated emotions in three-dimensional space
FIG. 3 is a schematic diagram of part of the user operation interface
Detailed description of the drawings:
see the specific examples for details.
Detailed Description
The invention is applied as a patent-pool patent together with the granted Chinese invention patent "VEM-Token vocal emotion multimodal Token-based singing voice and accompaniment deep learning method", CN120126506, and the pending invention patent "VEM-Token beat capture and alignment model construction method", 202511249168.0. It focuses on the imitation and modification path taken when a user learns a sample song, and makes further basic innovations.
The objects and intentions of the present invention can be achieved by the following specific examples. It should be noted herein that the specific embodiments have specific application and industrial applicability. Therefore, the embodiments do not include all of the features and steps of the present invention, nor are they intended to be limiting of the present invention. The description of the claims of the present invention is a summary of the invention.
This example is one example of the present invention.
Specific embodiments of the present invention are exemplified as follows:
A construction method of the VEM-Token vocal emotion multimodal modification model: an AI singing-learning agent and AI karaoke system
Description of the drawings
The contents of this embodiment mainly include, but are not limited to, those composed of the following main schematic drawings, which are fig. 1 to 3.
Description of the implementation procedure
The method steps of the present embodiment mainly include steps 1 to 10. Wherein each of the 10 parts comprises a number of sub-steps. Unless specifically stated otherwise, these sub-steps are not all required, nor are their sequencing required unless otherwise stated, but are optimized and further selected by the patent practitioner according to the needs of some specific tasks.
The specific working steps are as follows:
1. VEM-Token magic model basic scheme implementation step
The invention is used as a construction method of a VEM-Token vocal emotion multi-mode magic model, which comprises the following steps:
ST100, collecting a sample file and a user file, capturing and dividing beats with the VEM-Token model to obtain the VEM-Token1 sequence of the sample file and the VEM-Token2 sequence of the user file respectively, aligning the beats of all VEM-Token2 sequences against the VEM-Token1 sequence, and generating the aligned VEM-Token2 sequence.
ST200, identifying the VEM parameters of the VEM-Token1 sequence according to the VEM parameters included in the VEM-Token model, letting the user determine a modification scheme, processing the VEM parameters of the VEM-Token2 sequence, and generating, through the modification, the VEM-Token2 sequence of a result file that conforms to the style of the VEM-Token1 sequence under the chosen modification scheme.
Wherein, the model of the VEM-Token vocal emotion multi-mode refers to the model in the VEM-Token vocal emotion multi-mode Token singing and accompaniment deep learning method, CN120126506, which specifically comprises the steps of 1-4:
(1) Emotion is recorded using more than one modality, multimodal vocal emotion is denoted VEM, and a VEM classification, VEM coordinate system, VEM function and VEM library are constructed, wherein the vocal emotion comprises one or a combination of happiness, sadness, anger, fear, aversion, surprise, calm, expectancy, trust, love, hate, affection, and hostility; the modalities comprise one or a combination of lyrics, singing voice, accompaniment, vocal style, music, emotional basis, accompaniment instrument, video and image; and the VEM coordinate system comprises a coordinate-axis system established according to independent emotions, opposite emotion pairs, and associated opposite emotion groups.
(2) Collecting vocal samples according to VEM classification, performing emotion judgment on the vocal sounds and accompaniments by a vocal music expert of a human being, training a VEM function by adopting supervised learning and deep learning to obtain VEM parameters, and adding the VEM parameters to a VEM library.
(3) And (3) carrying out beat calibration on the vocal files by adopting a VEM processor, separating out song sound streams and accompaniment streams, carrying out VEM-Token segmentation on the vocal files according to beats, converting the song sound streams into VEM-Token1 sequences, converting the accompaniment streams into VEM-Token2 sequences, and adding the VEM-Token2 sequences into a preprocessing library.
(4) Deep learning is used to generate, respectively, the lyric spectrum, the VEM-Token song-voice spectrum, the VEM-Token accompaniment spectrum, and the VEM-Token score.
The VEM-Token beat capture and beat alignment model refers to the model in the "VEM-Token beat capture and alignment model construction method", 202511249168.0, and specifically comprises steps (5)-(7):
(5) For a vocal file, setting a beat model comprising beat capturing and beat alignment according to a VEM-Token vocal emotion multi-mode model, capturing the beat of the vocal file, dividing the vocal file into VEM-Token sequences according to the beat, and marking the positions of the start point and the end point of the beat in each VEM-Token;
(6) Setting a starting point alignment model, comprising:
dividing a sample file included in the vocal music file and a user file generated by singing the user simulated sample file into a VEM-Token1 sequence and a VEM-Token2 sequence respectively, adopting a starting point fine tuning step according to the starting point of each VEM-Token1, and adjusting the starting points of the VEM-tokens 2 at corresponding positions one by one so as to align with the starting points of the VEM-tokens 1 at the corresponding positions;
For each segment of the circulating segments included in the vocal music file, starting from the second segment by taking the first segment as a reference, adopting a starting point fine adjustment step to adjust the starting point of each VEM-Token of each segment one by one, so that the starting point of each VEM-Token at the corresponding position of the first segment is aligned until all circulating segments are ended;
(7) Setting an endpoint alignment model, comprising:
According to the end point of each VEM-Token1, adopting an end point fine tuning step to adjust the end points of the VEM-Token2 at the corresponding positions one by one so as to align with the end points of the VEM-Token1 at the corresponding positions;
And for each segment of the circulating segments, taking the first segment as a reference, starting from the second segment, adopting an end point fine tuning step to adjust the end point of each VEM-Token of each segment one by one, so that the end points of the VEM-tokens at the positions corresponding to the first segment are aligned until all circulating segments are ended.
In the present application, the modification model is the algorithm by which a user file is modeled on a sample file.
It should be noted that there may be more than one sample file.
If the user only needs to imitate a single sample file, the sample file is that one song, and only the VEM parameters of that song are needed. For example, if the user only wants to imitate one singer's version of a particular song, the sample file is that version's song file, and all VEM parameters of that song are selected.
If the user needs to select different VEM parameters from two sample files, then there are two sample files. For example, if the user likes one part of a song as sung by one singer and another part as sung by a different singer, the two singers' versions and their VEM parameters both need to be selected.
FIG. 1 is a schematic diagram of the VEM-Token modification model.
In fig. 1, one or more sample files and user files are input to a VEM processor, and sent to a VEM library query by a path 101, and if the VEM library finds that the sample files and even the user files are stored, the VEM parameters of the files and/or the VEM-Token1 sequence/VEM-Token 2 sequence are sent to a beat capture and alignment module by a path 103. According to the needs of the user, the VEM parameters and the VEM-Token1 sequence of the sample file can be sent to the VEM-Token magic matrix through a 102 path. If no sample file exists in the VEM parameter library, the VEM parameters of the sample file and/or the user file are parsed by the VEM processor.
In the beat capture and alignment module, the data from the VEM processor includes at least the VEM-Token2 sequence of the user file, plus the VEM-Token1 sequence of the sample file from path 103, or the VEM-Token1 sequence of the sample file and the VEM-Token2 sequence of the user file both from the VEM processor; the beat capture and beat alignment steps are then executed. The result is sent to the multimodal learning analysis module to be further decomposed into the singing voice and accompaniment of the sample file and the user file: the VEM-Token1.1, VEM-Token1.2, VEM-Token2.1 and VEM-Token2.2 sequences. In addition, according to the user's needs, the emotional overtones and emotional fluctuation are further decomposed into the sample emotional-overtone VEM-Token1.3 and sample emotional-fluctuation VEM-Token1.4 sequences, and the user file's emotional-overtone VEM-Token2.3 and emotional-fluctuation VEM-Token2.4 sequences.
In fig. 1, for convenience of illustration, a VEM-Token magic matrix is provided, and all the magic items in the present invention are incorporated into the VEM-Token magic matrix for description. It should be understood by the user of this patent that this is only an illustrative description and is not necessarily a module or hardware nor a limitation of the invention.
The information sources of the VEM-Token magic matrix comprise 2 paths, which are respectively:
Either all the information originates from the multimodal learning analysis module, i.e. VEM-Token1.1, VEM-Token1.2, VEM-Token1.3 and VEM-Token1.4 of the sample file, together with VEM-Token2.1, VEM-Token2.2, VEM-Token2.3 and VEM-Token2.4 of the user file; this is the case when no content of the sample file is yet stored in the VEM library.
Or part of the information originates from the multimodal learning analysis module, namely the content of the user file, and part comes from the VEM library through path 104, namely the content of the sample file; this is the case when the content of the sample file is already stored in the VEM library.
The information output of the VEM-Token modification matrix comprises the modification result of the user file, which is stored in the VEM library at the user's discretion.
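Read as software, the flow of FIG. 1 can be summarized by the control-flow sketch below; every object and method name here (vem_library, vem_processor, lookup, analyze, align_beats, decompose, modify, store) is an illustrative placeholder, not an interface defined by the patent.

```python
def run_modification_pipeline(sample_path, user_path, vem_library, vem_processor):
    """Control-flow sketch of FIG. 1: VEM library query (paths 101/103/104),
    beat capture and alignment, multimodal decomposition, and the
    VEM-Token modification matrix."""
    sample = vem_library.lookup(sample_path)           # path 101: query the VEM library
    if sample is None:                                 # sample not stored yet
        sample = vem_processor.analyze(sample_path)    # parse VEM parameters / VEM-Token1
    user = vem_processor.analyze(user_path)            # VEM-Token2 of the user file

    aligned_user = vem_processor.align_beats(sample.tokens, user.tokens)

    # decompose into singing voice / accompaniment / emotional overtone / fluctuation
    sample_parts = vem_processor.decompose(sample.tokens)
    user_parts = vem_processor.decompose(aligned_user)

    result = vem_processor.modify(sample_parts, user_parts, sample.vem_params)
    vem_library.store(result)                          # at the user's discretion
    return result
```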
2. Sample file and user file beat capture alignment step
In the foregoing basic aspect, the present invention provides a method or step of beat capture and alignment in terms of sample files and user file models, including, but not limited to, one or more of the following combinations:
ST110 for a sample file that does not include loop segments, beat capture is used to mark the start and end points of the beat, generating the complete VEM-Token1 sequence.
ST120, for a sample file comprising loop segments, starting from the second loop segment until all loop segments are finished, after beat capture, executing a starting point fine tuning step and an end point fine tuning step, realizing beat alignment inside the loop segments, and generating all VEM-Token1 sequences.
ST130 is to execute beat capturing and beat aligning steps including a starting point fine tuning step and an end point fine tuning step in all VEM-Token2 sequences according to the VEM-Token1 sequences to process corresponding beats to generate the VEM-Token2 sequences.
ST140, the user files specifically comprise user files generated by the user singing in imitation of the sample file, user files generated by collecting the user's voice characteristics and fully cloning the sample file, and result files generated by mixing partly-sung and partly-cloned user files.
It should be emphasized that loop segments are frequent in songs, for example a three-verse loop. In the typical case the beats of the loop segments are aligned, which is why the start-point and end-point fine adjustments are performed. However, because of a singer's emotional treatment or differing song styles, the beats of individual bars are not absolutely aligned across loop iterations. The user of this patent therefore needs to handle this differently from song to song.
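A minimal sketch of the beat capture and start/end fine-tuning described in ST110–ST130 is given below; it uses librosa's beat tracker as a stand-in for the patent's beat-capture model and simply snaps the user token boundaries to the sample's, both of which are assumptions of this sketch.

```python
import librosa

def capture_beats(path: str):
    """ST110: detect beat times and return one (start, end) pair per beat,
    delimiting each VEM-Token on the time axis."""
    y, sr = librosa.load(path, sr=None)
    _, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
    times = librosa.frames_to_time(beat_frames, sr=sr)
    return y, sr, list(zip(times[:-1], times[1:]))

def fine_tune(user_beats, sample_beats):
    """ST130: start-point and end-point fine tuning, here reduced to snapping
    each user beat's boundaries to the corresponding sample beat's boundaries."""
    return [(s_start, s_end)
            for (_, _), (s_start, s_end) in zip(user_beats, sample_beats)]
```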
3. VEM-Token model
Based on the foregoing, the present invention provides steps or methods in terms of a VEM-Token model, including, but not limited to, a combination of one or more of the following vocal emotion multimodal models:
The ST210 is that the VEM-Token model also comprises a VEM library and a VEM processor.
The VEM parameters include VEM classification, VEM function and established VEM coordinate system recording multiple modalities in vocal emotion.
ST212 the modalities include one or a combination of lyrics, singing voice, accompaniment, illustration, vocal style, music, emotional basis, accompaniment instrument, video, image, and ambient sound.
ST213 the emotional basis includes one or a combination of happiness, sadness, anger, fear, aversion, surprise, calm, longing, expectancy, trust, love, hate, affection, and hostility.
ST214 the vocal style comprises one or a combination of folk, pop, rock, Western, popular, composed-song and opera singing styles.
ST215 the voice style includes the fundamental frequency and overtones produced by the user's vocal cords and resonant cavities during natural singing, speaking, and recitation; these are innate features that distinguish the user from other people.
ST216, collecting various vocal files by VEM classification, performing emotion judgment on singing and accompaniment on the vocal files by a vocal expert or a learning algorithm of a human, training a VEM function by adopting supervised learning and deep learning to obtain VEM parameters, and adding the VEM parameters to a VEM library.
ST220 is that the VEM processor provides an operation interface for the interaction of the user and the machine, and the specific implementation of the magic scheme is completed.
It should be noted here that the VEM-Token model is a model of multimodal vocal-emotion tokens and differs from the tokens processed by the natural language processing of the existing NLP-Token, which is based on the direct interpretation of words and Chinese characters. The VEM-Token is a brand-new interpretation that takes the music beat as the token; at the same time, it is associated with the VEM parameters of multimodal vocal emotion, which contain vector data for the various expression modalities, such as love, hate, affection, and hostility, together with the directions and scales of the VEM parameters. In addition, the model includes a VEM library built by supervised learning of typical songs by human vocal experts. In this case the accuracy of the VEM parameters of the songs in the VEM library is high, so the reliability of the computed result is high, and the probability of "hallucinations" of the kind produced by the NLP-Token model is low.
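To make the notion of a VEM-Token carrying its own emotion parameters concrete, the sketch below shows one possible in-memory layout of a VEM-Token and a VEM library entry; every field name is an assumption of this sketch rather than a definition from the patent.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class VEMToken:
    beat_start: float                     # beat start point, seconds
    beat_end: float                       # beat end point, seconds
    modality: str = "singing_voice"       # singing voice / accompaniment / overtone / fluctuation
    emotion: Dict[str, float] = field(default_factory=dict)   # e.g. {"happiness": 0.7, "calm": 0.2}
    vocal_style: str = ""                 # e.g. "pop", "folk", "rock"

@dataclass
class VEMLibraryEntry:
    song_id: str
    tokens: List[VEMToken]                                         # the song's VEM-Token sequence
    expert_scores: Dict[str, float] = field(default_factory=dict)  # human-expert emotion scores
```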
4. Magic scheme model
Based on the foregoing, the present invention provides a step or method in a magic solution model, including, but not limited to, one or more of the following combinations:
the magic scheme comprises the steps of singing magic, accompaniment magic, emotion overtone magic, and emotion fluctuation magic, and specifically comprises the following steps:
ST 310. The magic scheme comprises singing magic, and specifically comprises:
ST311, preprocessing the sample file and the user file by using a VEM processor, setting a singing filter, converting the sample file and the user file into a frequency spectrum format file, and separating a VEM-token1.1 sequence of singing voice of the sample file and a VEM-token2.1 sequence of singing voice of the user file.
ST312, capturing beat starting points and beat end points of all the VEM-Token1.1 sequences and all the VEM-Token2.1 sequences respectively according to beat capturing and beat aligning models, and aligning the beat starting points and the beat end points of the VEM-Token2.1 sequences with the beat starting points and the beat end points of the VEM-Token1.1 sequences by adopting a starting point fine tuning model and an end point fine tuning model.
ST 320. The magic scheme also comprises accompaniment sound magic, and specifically comprises:
ST321, preprocessing the sample file and the user file by using a VEM processor, setting an accompaniment filter, converting the sample file and the user file into a frequency spectrum format file, separating a VEM-token1.2 sequence of accompaniment sounds of the sample file and a VEM-token2.2 sequence of accompaniment sounds of the user file.
ST322, capturing beat starting points and beat end points of all the VEM-Token1.2 sequences and all the VEM-Token2.2 sequences respectively according to beat capturing and beat aligning models, and aligning the beat starting points and the beat end points of the VEM-Token2.2 sequences with the VEM-Token1.2 sequences by adopting a starting point fine tuning model and an end point fine tuning model.
ST330 the magic scheme also comprises emotion overtone magic, which comprises the following steps:
ST331, preprocessing the sample file and the user file by using a VEM processor, setting an emotion overtone filter, converting the sample file and the user file into a frequency spectrum format file, and separating a VEM-token1.3 sequence of emotion overtones of the sample file and a VEM-token2.3 sequence of emotion overtones of the user file.
ST332, capturing beat starting points and beat end points of all the VEM-Token1.3 sequences and all the VEM-Token2.3 sequences respectively according to beat capturing and beat aligning models, and aligning the beat starting points and the beat end points of the VEM-Token2.3 sequences with the VEM-Token1.3 sequences by adopting a starting point fine tuning model and an end point fine tuning model.
ST340 the magic scheme also comprises emotion fluctuation magic, which specifically comprises:
ST341, preprocessing the sample file and the user file by using a VEM processor, setting an emotion fluctuation filter, converting the sample file and the user file into a frequency spectrum format file, and separating a VEM-token1.4 sequence for forming emotion fluctuation of the sample file and a VEM-token2.4 sequence for forming emotion fluctuation of the user file.
ST342, capturing beat starting points and beat end points of all the VEM-Token1.4 sequences and all the VEM-Token2.4 sequences respectively according to beat capturing and beat aligning models, and aligning the beat starting points and the beat end points of the VEM-Token2.4 sequences with the VEM-Token1.4 sequences by adopting a starting point fine tuning model and an end point fine tuning model.
The four basic modification schemes above (singing-voice, accompaniment, emotional-overtone and emotional-fluctuation modification) are the basic schemes for a user's learned-singing file relative to a sample song file. One key point is capturing and aligning the beats of the two. Second, the four schemes can be combined; for example, in many cases the accompaniment can be taken directly from the accompaniment track of the sample file without modification. It is particularly emphasized that in beat fine-tuning, an end-point adjustment is recommended after each start-point adjustment, so that the length of the beat stays fixed.
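The following sketch illustrates the shape of ST311/ST321: separating a file into a vocal-like and an accompaniment-like part and slicing each into per-beat spectrum tokens. Harmonic/percussive separation (librosa's HPSS) is used here purely as a stand-in for the patent's singing filter and accompaniment filter, which are not published algorithms.

```python
import numpy as np
import librosa

def separate_and_tokenize(path: str, beat_bounds):
    """Separate a file into a vocal-like part and an accompaniment-like part
    and slice each into per-beat spectrum tokens (cf. ST311/ST321)."""
    y, sr = librosa.load(path, sr=None)
    vocal_like, accomp_like = librosa.effects.hpss(y)   # stand-in for the patent's filters

    def tokens(signal):
        out = []
        for start, end in beat_bounds:                  # beat boundaries in seconds
            seg = signal[int(start * sr):int(end * sr)]
            out.append(np.abs(librosa.stft(seg)))       # spectrum-format token for this beat
        return out

    return tokens(vocal_like), tokens(accomp_like)
```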
5. Learning and singing magic improvement
Based on the foregoing, the present invention provides steps or methods of a modification model for a user learning to sing a sample song file, including, but not limited to, one or more of the following in combination:
The magic scheme also comprises a step of learning to sing the magic, and specifically comprises the following steps:
ST410 the learning magic singing comprises simple learning singing, and specifically comprises:
ST411, using the VEM processor and taking a bar (a combination of several beats) as the unit, the user sings all or part of the sample file more than once; the recordings are converted into several groups of user-file singing-voice VEM-Token2.1 sequences, and the user then selects, from these groups, the one he or she prefers most relative to the corresponding sample-file singing-voice VEM-Token1.1 sequence as the selected singing-voice VEM-Token2.1 sequence.
ST412, using ST310, ST311, ST312 steps, generating a beat captured and aligned VEM-token2.1 sequence from the selected singing voice VEM-token2.1 sequence.
ST420 is that the learning magic singing improvement also comprises the mixed learning singing, which comprises the following steps:
ST421, setting weights A and B for the VEM-Token1.1 sequence and the VEM-Token2.1 sequence, and computing the mixed singing result as the sequence A × VEM-Token1.1 + B × VEM-Token2.1, where A < 0.3 and A + B = 1.0.
The learning-to-sing modification is the first step in which a user imitates a sample file. In general, each lyric sentence and each beat need to be practiced and recorded many times, and the most satisfactory take is then selected in the learning-to-sing modification. Next, for the selected recording, the beat start and end points of the VEM-Token2.1 sequence are aligned with the beat start and end points of the VEM-Token1.1 sequence.
For mixed learning-to-sing, after beat capture and alignment, the data of the sample file's VEM-Token1.1 sequence for the corresponding beats are weighted at, for example, 5%, and the data of the user's learned-singing VEM-Token2.1 sequence are weighted at 95%, and the two are synthesized as the final user learned-singing file. The purpose of this step is to blend a small portion of the sample song into the user's result, so that the modification comes closer to the sample song. Whether 5% or some other value is used should be determined by audition; the ratio must not be too large, otherwise the difference between the sample and the user's voice becomes too apparent and the effect is poor.
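A minimal sketch of this weighted blend (ST421), assuming the sample and user sequences have already been beat-aligned and converted to spectrum frames of equal shape:

```python
import numpy as np

def mix_learned_singing(sample_tokens, user_tokens, a: float = 0.05):
    """ST421: blend each beat of the sample's VEM-Token1.1 with the user's
    VEM-Token2.1 using weights A and B, with A < 0.3 and A + B = 1.0
    (for example 5% sample + 95% user)."""
    assert a < 0.3, "the sample share must stay small or the result drifts away from the user's voice"
    b = 1.0 - a
    return [a * s + b * u for s, u in zip(sample_tokens, user_tokens)]
```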
6. Vocal cloning
Based on the foregoing, the present invention provides a method for voice-cloning magic modification, including, but not limited to, one or more of the following in combination:
The magic modification scheme further includes a voice-cloning step, specifically:
ST430: the voice-cloning magic modification includes the following steps:
ST431: query the VEM library; if VEM parameters for the user's voice style already exist, retrieve them.
If no VEM parameters exist for the user's voice style, or the existing parameters do not meet the user's requirements, a recording of the user singing a practice piece and a recording of the user speaking and reciting are collected. The VEM processor, using the singing voice filter, the emotional overtone filter and the emotional fluctuation filter, converts the recordings into a spectrum-format file carrying the user's voice style, obtains the VEM parameters of the user's voice style according to the VEM classification, and stores them in the VEM library.
ST432: according to the singing voice VEM-Token1.1 sequence of the sample file, use the speech recognition included in the VEM-Token model to recognize lyric spectrum 1 of the sample file; lyric spectrum 1 contains the lyrics together with the start-point and end-point positions of the beats in which they fall.
ST433: copy lyric spectrum 1 of the sample file to become lyric spectrum 2 of the user file; then, using a speech synthesizer and the VEM parameters of the user's voice style, complete the user's voice-cloning magic modification word by word according to lyric spectrum 2 of the user file and the beat start-point and end-point positions, producing the user's singing voice VEM-Token2.1 sequence.
The human voice, like a fingerprint, has personalized characteristics. They arise mainly from the structure of the singer's vocal tract and resonance cavities and show up acoustically as the fundamental frequency, the overtone pattern and their respective amplitudes in the spectrum-format file; these are recorded in the VEM parameters, and analysing them reveals the characteristics of the voice.
In principle, as long as the user's voice VEM parameters are in the VEM library, a speech synthesizer can combine them with the singing voice VEM-Token1.1 sequence of the sample file and lyric spectrum 2 to construct the user's singing voice VEM-Token2.1 sequence, thereby realizing the user's voice-cloning magic modification. Even if the VEM library does not yet contain the user's VEM parameters, the VEM processor can collect the user's voice characteristics and convert them into VEM parameters, so the voice-cloning scheme can still be realized.
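A minimal sketch of collecting the voice-style features mentioned above (fundamental frequency plus harmonic amplitudes) from a user recording. librosa.pyin is used as an assumed stand-in f0 tracker; the real VEM processor and its filters are not public, and the number of harmonics kept here is arbitrary.

import numpy as np
import librosa

def voice_style_features(path, n_harmonics=10):
    y, sr = librosa.load(path, sr=None)
    f0, voiced, _ = librosa.pyin(y, fmin=80, fmax=800, sr=sr)   # frame-wise pitch track
    f0_med = np.nanmedian(np.where(voiced, f0, np.nan))          # typical fundamental
    spec = np.abs(np.fft.rfft(y))
    freqs = np.fft.rfftfreq(len(y), 1.0 / sr)
    amps = []
    for k in range(1, n_harmonics + 1):                          # amplitude at k * f0
        idx = np.argmin(np.abs(freqs - k * f0_med))
        amps.append(float(spec[idx]))
    return f0_med, np.array(amps) / (max(amps) + 1e-9)           # normalized overtone profile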
7. Lyric magic
Based on the foregoing, the present invention provides a lyric magic modification method, including, but not limited to, one or more of the following in combination:
The magic modification scheme further includes a lyric magic modification step, specifically:
ST500: when lyric spectrum 2 is inconsistent with lyric spectrum 1, or the user needs to modify the lyrics, lyric magic modification is executed, specifically including:
ST510: decompose lyric spectrum 2 and lyric spectrum 1 according to semantics and grammar into lyric sentence 2 and lyric sentence 1, and proceed as follows:
ST511: when lyric sentence 2 and lyric sentence 1 have the same number of words, lyric spectrum 2 is copied from lyric spectrum 1 beat by beat, and each word of lyric sentence 2 is filled into its corresponding beat position one by one, or is modified manually by the user, forming the modified lyric spectrum 2.
ST512: when lyric sentence 2 and lyric sentence 1 have different numbers of words, the user modifies lyric sentence 2 manually and fills its words into the corresponding beat positions one by one, forming the modified lyric spectrum 2.
ST513: use a speech synthesizer, according to the VEM parameters of the user's voice style, lyric spectrum 2 and the beat start-point and end-point positions, to complete the user's voice-cloning magic modification word by word, producing the user's singing voice VEM-Token2.1 sequence.
Lyric magic modification is one of the most commonly used items in the magic modification model and one of the modifications users most often apply to a sample song. In most cases the change to lyric spectrum 2 is only a local modification of the original text, made with the meaning of the lyrics, the rhyme and the fit of the replacement words considered together, rather than on impulse.
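A small sketch of the word-to-beat filling in ST511/ST512: when the new line has the same word count as the original, each word inherits the beat slot of the word it replaces; otherwise placement is left to manual editing, as described above. The data layout is an illustrative assumption.

def fill_lyrics(beat_slots, old_words, new_words):
    """beat_slots: [(start, end), ...] taken from lyric spectrum 1."""
    if len(new_words) != len(old_words):
        raise ValueError("word counts differ - manual placement required (ST512)")
    return [
        {"word": w, "start": s, "end": e}      # one entry per beat of lyric spectrum 2
        for w, (s, e) in zip(new_words, beat_slots)
    ]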
8. Pitch calibration, decorative sounds and other magic modifications
On the basis of the foregoing, the present invention also includes, but is not limited to, steps or methods of one or more combinations of the following:
The magic modification scheme further includes the steps of pitch calibration, decorative sound, beat length, tempo, beat strength, timbre, emotion, video, free and multi-sample magic modification, together with what-you-hear-is-what-you-get real-time monitoring, specifically as follows:
ST520: pitch calibration magic modification: according to twelve-tone equal temperament in music theory, the fundamental frequency of a sound must equal one of the node frequencies of the equal-tempered scale; a sound frequency lying between two adjacent node frequencies must be adjusted up or down to a node frequency.
Pitch calibration is especially important for non-professional singers. In music theory, pitch is the fundamental frequency of a note and accuracy of pitch is called intonation; for professional singers it is rarely a problem thanks to long-term training. Pitch calibration magic modification here covers two aspects: automatic calibration of the overall pitch level, and pronunciation calibration within beats selected by the user.
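A minimal sketch of snapping a measured fundamental frequency to the nearest twelve-tone equal-temperament node. The reference pitch A4 = 440 Hz is an assumption; the text above does not fix it.

import numpy as np

def snap_to_12tet(freq_hz, a4=440.0):
    """Move freq_hz up or down to the nearest equal-temperament node frequency."""
    if freq_hz <= 0:
        return freq_hz
    semitones = 12.0 * np.log2(freq_hz / a4)       # distance from A4 in semitones
    return a4 * 2.0 ** (round(semitones) / 12.0)   # back to Hz on the 12-TET grid

# e.g. snap_to_12tet(445.0) -> 440.0 (A4), snap_to_12tet(455.0) -> ~466.16 (A#4)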
ST530: front decorative sound magic modification: within a beat, when the beat start point of the user file's singing voice VEM-Token2.1 lags behind the beat start point of the corresponding sample file's singing voice VEM-Token1.1 by no more than 1/2 beat on the time axis, a decorative sound is inserted before the VEM-Token2.1 beat start point as compensation so that the start points align; the decorative sounds include vibrato, glissando, sustained tones and breath sounds.
ST540: rear decorative sound magic modification: within a beat, when the beat end point of the user file's singing voice VEM-Token2.1 comes earlier than the beat end point of the corresponding sample file's singing voice VEM-Token1.1 by no more than 1/2 beat on the time axis, a decorative sound or a rest is added after the VEM-Token2.1 beat end point as compensation so that the end points align.
The choice of front or rear decorative sound type generally depends on the specific situation, such as the length of the time gap to be compensated and the emotion VEM parameters of the beat, and is made by the user according to their own understanding of the song.
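The following sketch only captures the decision logic of ST530/ST540: gaps of at most half a beat are padded with an ornament (or a rest at the end), while larger mismatches fall through to the beat-length step. Ornament synthesis itself is omitted, and the function and field names are illustrative assumptions.

def ornament_plan(user_start, user_end, sample_start, sample_end, beat_len):
    plan = []
    lag = user_start - sample_start               # user comes in late
    if 0 < lag <= beat_len / 2:
        plan.append(("front_ornament", lag))      # vibrato / glissando / breath, etc.
    lead = sample_end - user_end                  # user finishes early
    if 0 < lead <= beat_len / 2:
        plan.append(("rear_ornament_or_rest", lead))
    return plan                                   # empty plan: use beat-length modification instead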
ST550: beat length magic modification: when the length of a beat of the user file's singing voice VEM-Token2.1 does not match the length of the corresponding beat of the sample file's singing voice VEM-Token1.1, a time stretching algorithm step or a decorative sound magic modification step is used to compress or extend the VEM-Token2.1 beat so that it aligns with the corresponding VEM-Token1.1 beat.
ST560: tempo magic modification: when the overall tempo of the user file needs to be sped up or slowed down, a time stretching algorithm step is used to compress or extend the tempo of the user file's singing voice and accompaniment synchronously.
ST570: beat strength magic modification: for the singing voice VEM-Token2.1 of the user file and/or the accompaniment VEM-Token2.2 of the user file, apply, according to the user's needs, different processing based on a base layer and an emotion layer, wherein:
ST571: the base layer includes volume dynamics processing and envelope shaping steps, which adjust VEM-Token2.1 and VEM-Token2.2 against thresholds.
ST572: the emotion layer includes intelligent dynamic control based on AI/machine learning, specifically training a model to intelligently recognize the beats and instruments in the audio and automatically generate dynamic processing parameters according to preset emotion labels or target loudness curves, with the training results stored in the VEM library.
ST580: when the beat length of VEM-Token2.1 is changed, the beat length of VEM-Token2.2 must be checked synchronously; when the lengths are inconsistent, time-stretching capture is used and VEM-Token2.1 and VEM-Token2.2 are aligned.
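A base-layer sketch for ST571: a per-beat gain plus a simple attack/release envelope. The gain, attack and release values are placeholders, and the learned emotion-layer parameters of ST572 are not reproduced here.

import numpy as np

def shape_beat(beat_audio, sr, gain_db=3.0, attack_s=0.01, release_s=0.05):
    """Apply an attack/release envelope and a gain to one beat of audio."""
    n = len(beat_audio)
    env = np.ones(n)
    a = min(n, int(attack_s * sr))
    r = min(n, int(release_s * sr))
    if a > 0:
        env[:a] = np.linspace(0.0, 1.0, a)        # fade in (attack)
    if r > 0:
        env[n - r:] = np.linspace(1.0, 0.0, r)    # fade out (release)
    return beat_audio * env * (10.0 ** (gain_db / 20.0))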
Beat and rhythm are basic physical quantities in music and among the earliest musical elements humans invented: music grew out of the rhythmic work chants of collective labour and the synchronized dance steps that expressed joy. Magic modification of beat and rhythm is therefore very important.
It should be noted that a time stretching algorithm is a computation that changes only the length of the audio; it changes neither the fundamental frequency (pitch) nor the structure of the harmonic components of the sound. Time stretching algorithms include, but are not limited to, the following three families (a brief usage sketch follows this list):
(1) Time-domain algorithms: SOLA (Synchronous Overlap-Add) and its variants (e.g. WSOLA, SOLA-FS).
The basic principle is to divide the input signal into short, overlapping segments, time-scale each segment by compressing or expanding the overlap region between segments, find the best splice point based on waveform similarity so as to minimize distortion, and overlap-add the processed segments back into the output signal.
Their characteristics are fast computation and suitability for real-time, low-resource applications (such as tape-speed simulators and simple pitch shifters); they handle transients (drum hits, note onsets) poorly, easily producing clicks and a smeared, reverberant feel, so their quality on music is not ideal. Representative applications include early telephone voicemail speed control and some simple audio plug-ins.
(2) Frequency-domain algorithms (parametric / phase vocoder): currently the most popular and widely used family, based on the Short-Time Fourier Transform (STFT). They mainly include, but are not limited to:
The phase vocoder, whose sound quality is much better than the time-domain methods, especially for harmonically rich audio such as voice and piano; its drawback is that percussive audio (e.g. a snare drum) is still handled imperfectly, producing the characteristic "robotic" or "reverberant" artifacts.
Transient-aware enhanced phase vocoders, which greatly improve quality on drums, bass and similar material; representative examples are iZotope Radius, Serato Pitch 'n Time, and the "Complex" and "Complex Pro" modes of Ableton Live.
(3) Data-driven algorithms based on machine learning / deep learning: the current frontier of research, aiming to produce more natural time-stretching results by learning from large amounts of data. They mainly include, but are not limited to:
Generative-model-based methods, whose principle is that the model learns the latent distribution of the audio signal: given an input, it generates the best-sounding output that matches the target duration. Their strength is great potential, in theory yielding the most natural results with the fewest artifacts and even "imagining" plausible detail to fill the stretched time; their weaknesses are the need for massive training data, heavy compute, difficulty running in real time, and the possibility of uncontrollable hallucinated artifacts.
Neural-vocoder-based methods, whose principle is not to process the waveform directly but to first extract intermediate representations of the audio (such as the mel spectrogram, F0 pitch and harmonic information) and then resynthesize. Their sound quality is generally far better than a traditional phase vocoder, because the neural vocoder is trained on large amounts of high-quality audio and can reconstruct more natural sound textures; the drawbacks are dependence on the quality of the neural vocoder and, again, generally heavy computation.
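As a practical illustration of a phase-vocoder-style stretch, the sketch below uses librosa.effects.time_stretch, which changes duration without changing pitch, exactly the property the beat-length and tempo steps above require. Using librosa is an assumption; the patent does not name a library.

import librosa

def stretch_to_length(y, sr, target_seconds):
    """Stretch or compress audio y to target_seconds while preserving pitch."""
    current = len(y) / sr
    rate = current / target_seconds        # rate > 1 shortens, rate < 1 lengthens
    return librosa.effects.time_stretch(y, rate=rate)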
ST590: timbre magic modification, specifically including:
ST591: for the singing voice VEM-Token1.1 sequence of the sample file and the singing voice VEM-Token2.1 sequence of the user file, use filters covering multiple groups of higher harmonics of the fundamental frequency to decompose the harmonic frequencies F1, F2, F3, …, Fn and the overtone amplitudes A1, A2, A3, …, An, where n is the harmonic order and n is less than 50.
ST592: use the volume dynamics processing and envelope shaping steps to amplify or attenuate the overtone amplitude of one or more designated harmonic components among A1, A2, A3, …, An in the VEM-Token2.1 sequence, thereby changing the timbre of the user file.
ST593: alternatively, query the AI/machine-learning timbre dynamic processing parameters in the VEM library and adjust them to change the timbre of the user file.
Timbre is mainly the personalized characteristic of the user's own voice, similar to a voiceprint. It arises from the structure of the human resonance cavities and shows up acoustically as the fundamental frequency, the overtone pattern and their respective amplitudes in the spectrum-format file, all of which are recorded in the VEM parameters; analysing them reveals the characteristics of the voice.
In principle, as long as the user's VEM parameters are in the VEM library, the user's singing voice VEM-Token2.1 sequence can be constructed from the singing voice VEM-Token1.1 sequence of the sample file, realizing the user's timbre magic modification. Even if the user's VEM parameters are not yet in the VEM library, the VEM processor can collect the user's voice characteristics and convert them into VEM parameters, so the timbre scheme can still be realized.
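A sketch matching ST591/ST592: scale the amplitudes of chosen harmonics of a known fundamental in the spectrum. A real implementation would work frame by frame on an STFT; a single FFT and the fixed band half-width used here are simplifying assumptions.

import numpy as np

def scale_harmonics(y, sr, f0, gains):
    """gains: {harmonic_index (1-based, < 50): linear gain to apply}."""
    spec = np.fft.rfft(y)
    freqs = np.fft.rfftfreq(len(y), 1.0 / sr)
    width = f0 / 4.0                           # half-width of each harmonic band
    for k, g in gains.items():
        band = np.abs(freqs - k * f0) < width  # bins around the k-th harmonic
        spec[band] *= g
    return np.fft.irfft(spec, n=len(y))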
ST5A0: emotion magic modification, which specifically includes: according to the user's emotion modification needs, query the VEM library and, taking the VEM-Token2.1 sequence as the independent variable, adjust the VEM parameters covering the emotion, vocal-style and voice-style requirements respectively to obtain the emotion modification result.
Emotion magic modification is a key point of the invention. Based on the VEM-Token model, emotion is divided into multiple classes (the VEM classification), the degree of each basic emotion is recorded in the VEM parameters, and emotion points are expressed in the corresponding coordinate system.
For example, the opposed and mutually related emotion pairs "love-hate", "affection-enmity" and "joy-sorrow" are represented by an orthogonal three-dimensional X, Y, Z coordinate system, with ranges of -100% to 100% on the X axis ("love-hate"), -100% to 100% on the Y axis ("joy-sorrow") and -100% to 100% on the Z axis ("affection-enmity"), the three axes being orthogonal and mutually related, as shown in FIG. 2. For example, the emotion parameters recorded at the emotion point E(X, Y, Z) of one beat are (80%, -20%, 50%). In this way, emotion is recorded precisely.
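A minimal sketch of the emotion point E(X, Y, Z) on the three opposed pairs above, with values in [-1, 1] (percentages divided by 100), plus a simple interpolation that a user-facing slider could drive. The pair names follow the example in the text; the class and function names are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class EmotionPoint:
    love_hate: float         # X axis
    joy_sorrow: float        # Y axis
    affection_enmity: float  # Z axis

def blend(sample: EmotionPoint, user: EmotionPoint, t: float) -> EmotionPoint:
    """t = 0 keeps the user's emotion unchanged, t = 1 copies the sample's."""
    lerp = lambda a, b: (1 - t) * a + t * b
    return EmotionPoint(
        lerp(user.love_hate, sample.love_hate),
        lerp(user.joy_sorrow, sample.joy_sorrow),
        lerp(user.affection_enmity, sample.affection_enmity),
    )

# the example point above would be EmotionPoint(0.80, -0.20, 0.50)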
It is specifically pointed out here that the VEM coordinate system includes, but is not limited to:
Independent emotions: for an emotion that is independent and uncorrelated with others, a unidirectional one-dimensional coordinate axis is constructed, with the lowest point of the emotion as the coordinate origin and its highest point as the coordinate maximum.
Opposite emotion pairs: two mutually opposed emotion components form a bidirectional one-dimensional coordinate axis, with the midpoint of the pair as the origin, the highest point of the positive emotion as the positive maximum and the highest point of the negative emotion as the negative maximum.
Associated opposite emotion groups: one or more mutually related opposite emotion pairs whose coordinate origins are superposed and aligned; their respective bidirectional one-dimensional axes are made hyper-orthogonal and separated by a hyperplane, with the positive emotions on one side of the hyperplane and the negative emotions on the other, constructing a hyper-orthogonal coordinate system in which the highest point of each positive emotion is the positive coordinate maximum and the highest point of each negative emotion is the negative coordinate maximum.
It should be noted that independent emotions, opposite emotion pairs and associated opposite emotion groups are sometimes relative: for vocal works of different styles and different cultural backgrounds the classification of modalities is not always the same. In the extreme case, independent emotions, opposite emotion pairs and associated opposite emotion groups are all defined with respect to a single vocal work.
It should be emphasized that by "hyper-orthogonal coordinate system" we mean rectangular coordinate axes of more than three dimensions arranged in one system, e.g. four, five or more dimensions; since such a system cannot be drawn directly on paper, it is described and recorded mathematically as a hyperspace. Moreover, these hyper-dimensional axes need not be strictly orthogonal: they may intersect at other angles, or even be curved coordinates in the sense of Riemannian geometry.
FIG. 3 is a schematic diagram of part of the user operation interface. In the figure, three colour bands represent the adjustment sections of the three emotion pairs; the green ring on each band is a sliding cursor for the emotion VEM parameter, and dragging it left or right adjusts the value of the corresponding pair, namely "love-hate", "joy-sorrow" and "affection-enmity". The human-machine interaction interface designed here includes, but is not limited to, one-dimensional and two-dimensional screens, three-dimensional space and even multi-dimensional interaction interfaces; the ways of editing VEM parameters include, but are not limited to, dragging cursors, numeric input and colour editing, as well as a what-you-hear-is-what-you-get real-time monitoring mode. The editing results and editing modes are stored centrally in the VEM library.
ST5B0: video magic modification: when the user file needs to be matched with video, the content and rhythm of the video picture are adjusted according to the VEM parameters and the rhythm to suit the needs of the user file.
Because the invention is based on the VEM-Token model, this mode can be understood as "audio-to-text": a subsequent "text-to-image" model can therefore be connected, so that the textual Token information generated from the audio drives the text-to-image model to produce static images and dynamic video, and the user can also attach video manually to produce the video magic modification.
ST5C0: free magic modification: the user modifies the content and rhythm of VEM-Token2.1, VEM-Token2.2 and the video picture according to the VEM parameters and one or more modalities, to suit the needs of the user file.
Free magic modification means modifying the user's singing voice, the user's accompaniment and the video picture according to the user's own intentions and wishes, in manual or semi-automatic mode.
ST5D0: multi-sample magic modification: the user selects more than one sample file and, within the VEM parameters, assigns some parameters to some sample files and other parameters to other sample files, and the content and rhythm of VEM-Token2.1, VEM-Token2.2 and the video picture are generated by modification to suit the needs of the user file.
Multi-sample magic modification is performed on the basis of more than one sample file: for example, the singing voice of the user's song may reference both a folk-song-style sample file and a more refined tenor-style sample file, while the accompaniment references a sample file accompanied by traditional national instruments; three sample files are thus referenced in total, realizing the multi-sample magic modification.
ST5E0: what-you-hear-is-what-you-get real-time monitoring, which specifically includes the user monitoring, judging and scoring the modification result in real time, and generating from the modification process and result the VEM-Token2 sequence of the result file, which is submitted to and stored in the VEM library.
Real-time monitoring is an important function that helps users modify efficiently: while the user performs a magic modification, the result can be heard and seen immediately, giving a what-you-hear-is-what-you-get effect.
The user is again reminded that the human-machine interaction interface designed here includes, but is not limited to, one-dimensional and two-dimensional screens, three-dimensional space and even multi-dimensional interaction interfaces, that VEM parameter editing includes, but is not limited to, dragging cursors, numeric input and colour editing as well as the what-you-hear-is-what-you-get real-time monitoring mode, and that the editing results and modes are stored centrally in the VEM library.
9. Member management
Based on the foregoing, the present invention further includes, but is not limited to, member management for users, specifically including one or more of the following steps or methods in combination:
The magic model also comprises member management, and specifically comprises the following steps:
ST600, applying for members to the magic model according to the needs of the user, establishing a member file, and storing the member file in a VEM library.
ST610, the member archive comprises user information, sample file information, user file information, VEM parameters, voiceprint encryption and voiceprint decryption, wherein the encryption and decryption keys comprise member signatures, member images, member videos and member VEM parameters.
ST620, member management includes advancing, backing, rolling back, adding, deleting, inquiring, modifying, storing and maintaining operation steps in the magic process.
ST630, member management also comprises instant modification, instant monitoring, real-time scoring, supervised learning, reinforcement learning, rewarding and punishment of user files, and the results are stored in a VEM library.
This design facilitates a cloud-plus-terminal deployment (fixed or mobile terminals), in which the VEM library and VEM parameters can be stored in the cloud centre and on the terminal, or on blockchain nodes in a blockchain mode. Managing users as members in this way also makes intellectual-property management convenient for them.
10. Others
Based on the foregoing solution, the magic model of the present invention further includes, but is not limited to, the following expansion steps or methods:
ST700: the magic modification model further includes an application system for mobile terminals and an application system for the PC, as well as a cloud-mode application system and a blockchain application system.
ST800: the magic modification model further includes a supporting hardware system comprising a communication interface, a recording module, a tuning module, a playback module, an encryption module and a decryption module, together with an interface to the Douyin (TikTok) system, an interface to Video Accounts (WeChat Channels) and an AI karaoke support system.
ST900: by providing synchronization signals for subsequent large-model applications, AI systems including DeepSeek, Kimi.AI and ChatGPT are connected to form an AI Agent.
STA00: the magic modification model further includes interface protocols, providing the hardware- and network-based MIDI protocol, the MSC extension protocol and the OSC network protocol; the transport-layer AES3/PDIF and MADI protocols; and network audio transport protocols such as Dante, AVB/TSN and AES67.
Here MIDI is the Musical Instrument Digital Interface, MSC is MIDI Show Control, OSC is Open Sound Control, AES3/PDIF is a point-to-point digital audio transmission standard, MADI is the Multichannel Audio Digital Interface, Dante is an auto-discovery, low-latency, high-channel-count audio networking protocol developed by Audinate, AVB/TSN is a low-latency audio/video transport protocol, and AES67 is an interoperability standard built on existing network technology.
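As a small illustration of the MIDI support named above, the sketch below emits a MIDI start/clock/stop sequence that a downstream system (for example an AI karaoke rig) could lock to. The mido package and the port name are assumptions; the patent only lists MIDI, MSC, OSC and the other protocol families as supported.

import time
import mido

def send_midi_clock(port_name, bpm=120.0, beats=4):
    """Send a MIDI start message, 24 clock pulses per beat, then stop."""
    interval = 60.0 / bpm / 24.0            # MIDI clock runs at 24 pulses per quarter note
    with mido.open_output(port_name) as out:
        out.send(mido.Message('start'))
        for _ in range(int(beats * 24)):
            out.send(mido.Message('clock'))
            time.sleep(interval)
        out.send(mido.Message('stop'))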
It is specifically noted that, in this specification, a variable name (for example VEM-Token followed by a numeric suffix) follows the usual convention in the computer arts: the content of the same variable (for example VEM-Token1.1) is not necessarily identical at every step, because its value is recomputed as the steps proceed, as will be understood by those skilled in the art.

Claims (10)

1. A construction method of a VEM-Token vocal emotion multimodal magic modification model, characterized by comprising:
ST100: collect sample files and user files; based on the VEM-Token model, use beat capture and VEM-Token segmentation to obtain the VEM-Token1 sequence of the sample file and the VEM-Token2 sequence of the user file; based on the VEM-Token1 sequence, apply beat alignment to all VEM-Token2 sequences to generate the VEM-Token2 sequence;
ST200: based on the VEM parameters included in the VEM-Token model, identify the VEM parameters of the VEM-Token1 sequence; the user determines the magic modification scheme; process the VEM parameters of the VEM-Token2 sequence and generate, by magic modification, the VEM-Token2 sequence of a result file whose style matches the VEM-Token1 sequence and conforms to the modification scheme.
2. The method according to claim 1, characterized by specifically comprising:
ST110: for sample files that do not include loop segments, use beat capture to mark the start and end points of the beats and generate the entire VEM-Token1 sequence;
ST120: for sample files that include loop segments, from the second loop segment until all loop segments end, after beat capture, perform the start-point fine-tuning step and the end-point fine-tuning step to achieve beat alignment within the loop segments and generate the entire VEM-Token1 sequence;
ST130: based on the VEM-Token1 sequence, perform the beat capture and beat alignment steps, including the start-point fine-tuning step and the end-point fine-tuning step, on all VEM-Token2 sequences to process the corresponding beats and generate the VEM-Token2 sequence;
ST140: the user files specifically include: user files produced by the user singing in imitation of the sample file, user files produced by collecting the user's voice characteristics and cloning entirely according to the sample file, and result files produced by mixing user files from partial singing with user files from partial cloning.
3. The method according to claim 2, characterized by further comprising:
ST210: the VEM-Token model further includes a VEM library and a VEM processor;
ST211: the VEM parameters include the VEM classification recording multiple modalities of vocal emotion, the VEM functions and the established VEM coordinate system;
ST212: the modalities include one or a combination of lyrics, singing voice, accompaniment, commentary, vocal style, music, emotional basis, accompanying instruments, video, images and ambient sound;
ST213: the emotional basis includes one or a combination of joy, sorrow, sadness, anger, fear, disgust, surprise, calm, longing, anticipation, trust, love, hate, affection and enmity;
ST214: the vocal styles include one or a combination of folk singing, popular singing, rock singing, Western singing, pop singing, original-song singing and operatic singing;
ST215: the voice style includes the combination of fundamental frequency and overtones produced by the user's vocal cords and resonance cavities when singing, speaking and reciting, and is the inherent characteristic distinguishing the user from other people;
ST216: the VEM classification collects a variety of vocal files; human vocal experts or learning algorithms judge the emotion of the vocal files in singing voice and accompaniment, and supervised learning and deep learning are used to train the VEM functions to obtain VEM parameters, which are added to the VEM library;
ST220: the VEM processor provides the operating interface for user-machine interaction and completes the concrete implementation of the magic modification scheme.
4. The method according to claim 3, wherein the magic modification scheme includes the steps of singing voice, accompaniment, emotional overtone and emotional fluctuation magic modification, specifically comprising:
ST310: the scheme includes singing voice magic modification, specifically:
ST311: the VEM processor pre-processes the sample file and the user file, sets the singing voice filter, converts both files into spectrum-format files, and separates the singing voice VEM-Token1.1 sequence of the sample file and the singing voice VEM-Token2.1 sequence of the user file;
ST312: according to the beat capture and beat alignment models, capture the beat start and end points of all VEM-Token1.1 sequences and all VEM-Token2.1 sequences, and use the start-point and end-point fine-tuning models to align the beat start and end points of the VEM-Token2.1 sequence with those of the VEM-Token1.1 sequence; and/or,
ST320: the scheme further includes accompaniment magic modification, specifically:
ST321: the VEM processor pre-processes the sample file and the user file, sets the accompaniment filter, converts both files into spectrum-format files, and separates the accompaniment VEM-Token1.2 sequence of the sample file and the accompaniment VEM-Token2.2 sequence of the user file;
ST322: according to the beat capture and beat alignment models, capture the beat start and end points of all VEM-Token1.2 sequences and all VEM-Token2.2 sequences, and use the start-point and end-point fine-tuning models to align the beat start and end points of the VEM-Token2.2 sequence with the VEM-Token1.2 sequence; and/or,
ST330: the scheme further includes emotional overtone magic modification, specifically:
ST331: the VEM processor pre-processes the sample file and the user file, sets the emotional overtone filter, converts both files into spectrum-format files, and separates the emotional overtone VEM-Token1.3 sequence of the sample file and the emotional overtone VEM-Token2.3 sequence of the user file;
ST332: according to the beat capture and beat alignment models, capture the beat start and end points of all VEM-Token1.3 sequences and all VEM-Token2.3 sequences, and use the start-point and end-point fine-tuning models to align the beat start and end points of the VEM-Token2.3 sequence with the VEM-Token1.3 sequence; and/or,
ST340: the scheme further includes emotional fluctuation magic modification, specifically:
ST341: the VEM processor pre-processes the sample file and the user file, sets the emotional fluctuation filter, converts both files into spectrum-format files, and separates the emotional fluctuation VEM-Token1.4 sequence of the sample file and the emotional fluctuation VEM-Token2.4 sequence of the user file;
ST342: according to the beat capture and beat alignment models, capture the beat start and end points of all VEM-Token1.4 sequences and all VEM-Token2.4 sequences, and use the start-point and end-point fine-tuning models to align the beat start and end points of the VEM-Token2.4 sequence with the VEM-Token1.4 sequence.
5. The method according to claim 4, wherein the magic modification scheme further comprises a learning-to-sing magic modification step, specifically comprising:
ST410: the learning-to-sing modification includes simple learning singing, specifically:
ST411: using the VEM processor, with a bar formed of several beats of all or part of the sample file as the unit, the user learns to sing one or more times, the takes being recorded and converted into several groups of VEM-Token2.1 sequences of the user file, and the user then selects from these groups the VEM-Token2.1 sequence that best matches the VEM-Token1.1 sequence as the selected VEM-Token2.1 sequence;
ST412: using steps ST310, ST311 and ST312, generate a beat-captured and aligned VEM-Token2.1 sequence from the selected VEM-Token2.1 sequence; and/or,
ST420: the learning-to-sing modification further includes mixed learning singing, specifically:
ST421: set weights A and B for the VEM-Token1.1 sequence and the VEM-Token2.1 sequence, and compute the mixed learning-singing VEM-Token2.1 sequence as A × VEM-Token1.1 sequence + B × VEM-Token2.1 sequence, where A is less than 0.3 and A + B = 1.0.
6. The method according to claim 4, wherein the magic modification scheme further comprises a voice-cloning magic modification step, specifically comprising:
ST430: the scheme further includes voice-cloning magic modification, specifically:
ST431: query the VEM library; if VEM parameters for the user's voice style exist, obtain them; or,
if no VEM parameters for the user's voice style exist, or the existing ones do not meet the user's requirements, collect a recording of the user singing a practice piece and a recording of the user speaking and reciting; the VEM processor, using the singing voice filter, the emotional overtone filter and the emotional fluctuation filter, converts the recordings into a spectrum-format file carrying the user's voice style, obtains the VEM parameters of the user's voice style according to the VEM classification, and stores them in the VEM library;
ST432: based on the VEM-Token1.1 sequence, use the speech recognition included in the VEM-Token model to recognize lyric spectrum 1 of the sample file, lyric spectrum 1 including the lyrics and the start- and end-point positions of the beats in which they lie;
ST433: copy lyric spectrum 1 of the sample file to become lyric spectrum 2 of the user file; according to the VEM parameters of the user's voice style, use a speech synthesizer to complete the user's voice cloning word by word according to lyric spectrum 2 of the user file and its beat start- and end-point positions, producing the VEM-Token2.1 sequence.
7. The method according to claim 6, wherein the magic modification scheme further comprises a lyric magic modification step, specifically comprising:
ST500: when lyric spectrum 2 is inconsistent with lyric spectrum 1 or the user needs to modify it, perform lyric magic modification, specifically including:
ST510: decompose lyric spectrum 2 and lyric spectrum 1 according to semantics and grammar into lyric sentence 2 and lyric sentence 1, and proceed as follows:
ST511: when lyric sentence 2 and lyric sentence 1 have the same number of words, copy lyric spectrum 2 from lyric spectrum 1 according to the beat, and fill each word of lyric sentence 2 into its corresponding beat position one by one, or let the user modify it manually, forming the modified lyric spectrum 2;
ST512: when lyric sentence 2 and lyric sentence 1 have different numbers of words, the user modifies them manually, filling the words of lyric sentence 2 into the corresponding beat positions one by one to form the modified lyric spectrum 2;
ST513: use a speech synthesizer, according to the VEM parameters of the user's voice style, lyric spectrum 2 and the beat start- and end-point positions, to complete the user's voice cloning word by word, producing the VEM-Token2.1 sequence.
8. The method according to claim 5 or 7, wherein the magic modification scheme further includes the steps of pitch calibration, decorative sound, beat length, tempo, beat strength, timbre, emotion, video, free and multi-sample magic modification, and what-you-hear-is-what-you-get real-time monitoring, respectively comprising:
ST520: pitch calibration magic modification: according to twelve-tone equal temperament in music theory, the fundamental frequency of a sound must equal a node frequency of the equal-tempered scale, and a sound frequency between two adjacent node frequencies must be adjusted up or down to a node frequency;
ST530: front decorative sound magic modification: within a beat, when the beat start point of VEM-Token2.1 lags behind the beat start point of the corresponding VEM-Token1.1 by no more than 1/2 beat on the time axis, a decorative sound is used before the VEM-Token2.1 beat start point as compensation so that the start points align, the decorative sounds including vibrato, glissando, sustained tones and breath sounds;
ST540: rear decorative sound magic modification: within a beat, when the beat end point of VEM-Token2.1 is ahead of the beat end point of the corresponding VEM-Token1.1 by no more than 1/2 beat on the time axis, a decorative sound or rest is used after the VEM-Token2.1 beat end point as compensation so that the end points align;
ST550: beat length magic modification: when the length of a VEM-Token2.1 beat does not match that of the corresponding VEM-Token1.1 beat, use a time stretching algorithm step or a decorative sound magic modification step to compress or extend the VEM-Token2.1 beat so that it aligns with the corresponding VEM-Token1.1 beat;
ST560: tempo magic modification: when the overall tempo of the user file needs to be sped up or slowed down, use a time stretching algorithm step to compress or extend the tempo of the user file's singing voice and accompaniment synchronously;
ST570: beat strength magic modification: for VEM-Token2.1 and/or VEM-Token2.2, apply, according to the user's needs, different processing based on a base layer and an emotion layer, wherein:
ST571: the base layer includes volume dynamics processing and envelope shaping steps, which adjust VEM-Token2.1 and VEM-Token2.2 against thresholds;
ST572: the emotion layer includes intelligent dynamic control based on AI/machine learning, specifically training a model to intelligently recognize the beats and instruments in the audio and automatically generate dynamic processing parameters according to preset emotion labels or target loudness curves, the training results being stored in the VEM library;
ST580: when the beat length of VEM-Token2.1 is changed, synchronously check the beat length of VEM-Token2.2; when they are inconsistent, use time-stretching capture and align VEM-Token2.1 and VEM-Token2.2;
ST590: timbre magic modification, specifically including:
ST591: for the VEM-Token1.1 and VEM-Token2.1 sequences, use filters covering multiple groups of higher harmonics of the fundamental frequency to decompose the harmonic frequencies F1, F2, F3, …, Fn and the overtone amplitudes A1, A2, A3, …, An, where n is the harmonic order and n is less than 50;
ST592: use the volume dynamics processing and envelope shaping steps to amplify or attenuate the overtone amplitude of one or more designated harmonic components among A1, A2, A3, …, An in the VEM-Token2.1 sequence, so as to change the timbre of the user file; or,
ST593: query the AI/machine-learning timbre dynamic processing parameters in the VEM library and adjust them to change the timbre of the user file;
ST5A0: emotion magic modification, specifically including: according to the user's emotion modification needs, query the VEM library and, with the VEM-Token2.1 sequence as the independent variable, adjust the VEM parameters covering the emotion, vocal-style and voice-style requirements respectively to obtain the emotion modification result;
ST5B0: video magic modification: when the user file needs to be matched with video, adjust the content and rhythm of the video picture according to the VEM parameters and the rhythm to suit the needs of the user file;
ST5C0: free magic modification: the user modifies VEM-Token2.1, VEM-Token2.2 and the content and rhythm of the video picture according to the VEM parameters and one or more modalities, to suit the needs of the user file;
ST5D0: multi-sample magic modification: the user selects more than one sample file and, within the VEM parameters, assigns some parameters to some sample files and other parameters to other sample files, and the content and rhythm of VEM-Token2.1, VEM-Token2.2 and the video picture are generated by modification to suit the needs of the user file;
ST5E0: what-you-hear-is-what-you-get real-time monitoring, specifically including the user monitoring, judging and scoring the modification result in real time, generating from the modification process and result the VEM-Token2 sequence of the result file, and submitting it for storage in the VEM library.
9. The method according to claim 8, wherein the magic modification model further includes member management, specifically comprising:
ST600: according to the user's needs, apply for membership of the magic modification model, establish a member file and store it in the VEM library;
ST610: the member file includes user information, sample file information, user file information, VEM parameters, voiceprint encryption and voiceprint decryption, wherein the encryption and decryption keys include the member signature, member image, member video and member VEM parameters;
ST620: member management includes the operations of advancing, stepping back, rolling back, adding, deleting, querying, modifying, storing and retaining steps during the magic modification process;
ST630: member management further includes instant modification, instant monitoring, real-time scoring, supervised learning, reinforcement learning, reward and punishment of user files, with the results stored in the VEM library.
10. The method according to claim 9, wherein the magic modification model further comprises:
ST700: a mobile-terminal application system and a PC application system, as well as a cloud-mode application system and a blockchain application system;
ST800: a supporting hardware system including a communication interface, a recording module, a tuning module, a playback module, an encryption module and a decryption module, including an interface to the Douyin system and an interface to Video Accounts (WeChat Channels), and supporting an AI karaoke system;
ST900: by providing synchronization signals for subsequent large-model applications, access to AI systems including DeepSeek, Kimi.AI and ChatGPT to form an AI Agent;
STA00: interface protocols, providing the hardware- and network-based MIDI protocol, MSC extension protocol and OSC network protocol, the transport-layer AES3/PDIF and MADI protocols, and network audio transport protocols such as Dante, AVB/TSN and AES67.
CN202511340091.8A 2025-09-19 Construction Method of VEM-Token Vocal Emotion Multimodal Modification Model Active CN120853611B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202511340091.8A CN120853611B (en) 2025-09-19 Construction Method of VEM-Token Vocal Emotion Multimodal Modification Model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202511340091.8A CN120853611B (en) 2025-09-19 Construction Method of VEM-Token Vocal Emotion Multimodal Modification Model

Publications (2)

Publication Number Publication Date
CN120853611A true CN120853611A (en) 2025-10-28
CN120853611B CN120853611B (en) 2025-12-23


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW201040939A (en) * 2009-05-12 2010-11-16 Chunghwa Telecom Co Ltd Method for generating self-recorded singing voice
JP2017040858A (en) * 2015-08-21 2017-02-23 ヤマハ株式会社 Aligning device and program
CN120126506A (en) * 2025-05-13 2025-06-10 港湾之星健康生物(深圳)有限公司 VEM-Token Vocal Emotion Multimodal Tokenization Singing and Accompaniment Deep Learning Method
CN120748450A (en) * 2025-09-03 2025-10-03 港湾之星健康生物(深圳)有限公司 Method for constructing a VEM-Token beat capture and alignment model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Yang Wanyan: "Research and Application of Deep-Learning-Based Music Emotion Recognition Methods", Philosophy and Humanities Series, no. 07, 15 July 2025 (2025-07-15), pages 086-13 *

Similar Documents

Publication Publication Date Title
Goto et al. Music interfaces based on automatic music signal analysis: New ways to create and listen to music
CN108806656B (en) Automatic generation of songs
Humphrey et al. An introduction to signal processing for singing-voice analysis: High notes in the effort to automate the understanding of vocals in music
CN108806655B (en) Automatic generation of songs
JP2018537727A (en) Automated music composition and generation machines, systems and processes employing language and / or graphical icon based music experience descriptors
CN106971703A (en) A kind of song synthetic method and device based on HMM
JP2017107228A (en) Singing voice synthesis device and singing voice synthesis method
CN112382274B (en) Audio synthesis method, device, equipment and storage medium
Gupta et al. Deep learning approaches in topics of singing information processing
Foster et al. Filosax: A dataset of annotated jazz saxophone recordings
Goto Singing information processing
CN108922505B (en) Information processing method and device
Duinker Auto-Tune as instrument: trap music's embrace of a repurposed technology
Herremans et al. A multi-modal platform for semantic music analysis: visualizing audio-and score-based tension
CN115273806A (en) Song synthesis model training method and device, song synthesis method and device
CN114550690B (en) Song synthesis method and device
Delgado et al. A state of the art on computational music performance
Gounaropoulos et al. Synthesising timbres and timbre-changes from adjectives/adverbs
CN120853611B (en) Construction Method of VEM-Token Vocal Emotion Multimodal Modification Model
Nuanáin et al. Rhythmic concatenative synthesis for electronic music: techniques, implementation, and evaluation
Zhang Advancing deep learning for expressive music composition and performance modeling
Müller et al. Computational methods for melody and voice processing in music recordings (Dagstuhl seminar 19052)
CN120853611A (en) The construction method of VEM-Token vocal emotion multimodal magic modification model
Lu et al. A Novel Piano Arrangement Timbre Intelligent Recognition System Using Multilabel Classification Technology and KNN Algorithm
Furduj Virtual orchestration: a film composer's creative practice

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant