
CN120853611A - The construction method of VEM-Token vocal emotion multimodal magic modification model - Google Patents

The construction method of VEM-Token vocal emotion multimodal magic modification model

Info

Publication number
CN120853611A
CN120853611A (application CN202511340091.8A)
Authority
CN
China
Prior art keywords
vem
beat
user
modification
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202511340091.8A
Other languages
Chinese (zh)
Other versions
CN120853611B (en)
Inventor
丁贤根
丁远彤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbour Star Health Biology Shenzhen Co ltd
Original Assignee
Harbour Star Health Biology Shenzhen Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbour Star Health Biology Shenzhen Co ltd filed Critical Harbour Star Health Biology Shenzhen Co ltd
Priority to CN202511340091.8A
Priority claimed from CN202511340091.8A (CN120853611B)
Publication of CN120853611A
Application granted
Publication of CN120853611B
Status: Active


Landscapes

  • Electrophonic Musical Instruments (AREA)

Abstract


The VEM-Token multimodal vocal emotion modification model, unlike natural language processing (NLP-Token) models that interpret music through text, is an innovative "sound-to-text" model that inherently carries multimodal vocal emotion information. The model captures and aligns sample songs and songs the user is learning to sing on a per-beat basis and identifies multiple modalities of vocal emotion. It divides a file into tokens according to beats, uses supervised and reinforcement learning to obtain VEM parameters, and decomposes a song into vocals, accompaniment, and emotion. The modification model offers multimodal methods for imitating a sample song, including modification of vocals, accompaniment, emotional overtones, emotional fluctuation, learned singing, voice cloning, lyrics, pitch calibration, decorative sounds, beat length, tempo, and beat strength, as well as free and multi-sample modification. The model also provides membership management, mobile and PC application systems, dedicated supporting hardware, and communication protocols including MIDI, facilitating integration with popular AI large models, reducing model hallucinations, and forming AI vocal agents and AI karaoke.

Description

Construction method of the VEM-Token vocal emotion multimodal modification model
Technical Field
The invention relates to the field of artificial intelligence, in particular to model construction and processing for AI Agents, AI music, and speech recitation, and especially to the sub-field of modification models for a "sound-to-text" model that divides tokens by the multimodal method of music beats and vocal emotion. Specifically, when people imitate a sample vocal performance or a sample recitation, beat capture and beat alignment are carried out to realize and optimize the imitation effect, and on this basis a construction method of the VEM-Token vocal emotion multimodal modification model is constructed.
Background
Currently, in the field of artificial intelligence, the decomposition of information is still based on the natural-language token segmentation of NLP-Token (Natural Language Processing Token). For text-based information modalities the NLP-Token has a natural advantage: a large model has learned and memorized essentially all of humanity's text from books and web pages. For non-textual information modalities, such as singing voice, emotion, and music style, current large models still search their memory for textual descriptions taken from books and web pages learned in the past, and obtain an "explanation" and "understanding" through those descriptions; that is, they still use a text-based information modality such as NLP-Token. Since we can neither control where the large model obtains its corpus in such cases nor guarantee the correctness of that corpus, large-model "hallucinations" are unavoidable.
Specifically, the prior art includes:
(1) Tools based on manual editing and Digital Audio Workstations (DAWs) rely heavily on manual operation. Representative techniques and software include Melodyne (Celemony), whose "note blob" technique allows the user to manually adjust pitch, duration, volume, vibrato, and even formants for each note; Auto-Tune (Antares Audio Technologies), originally designed as a real-time pitch correction effect, whose graphical mode can also be used to manually fine-tune pitch curves; Waves Tune, iZotope Nectar, and similar plug-ins providing comparable pitch and voice editing functions; and DAW built-in tools in Adobe Audition, Audacity, Logic Pro, Cubase, etc., providing basic functions such as compression, equalization (EQ), reverberation, and volume-envelope editing.
(2) Automated methods based on rules and Digital Signal Processing (DSP) attempt to automatically complete part of the correction work through preset algorithmic rules. Representative techniques and software include one-click repair plug-ins such as Auto-Tune's "Auto Mode" and Waves Tune Real-Time; tempo-alignment algorithms, such as the warp-marker technique in Ableton Live, which can automatically analyze and stretch audio to align it to the grid; and automatic pitch correction, where most DAWs and plug-ins provide a "one-click repair" function.
(3) Data-driven and Machine Learning (ML) approaches: most of these research efforts focus on local applications such as "timbre conversion" and "singing voice synthesis".
It can be seen that model-level understanding of multimodal vocal emotion is not solved by the current, traditional NLP-Token approach.
The inventors' group has for the first time put forward a brand-new model design, comprising the granted Chinese patent "VEM-Token vocal emotion multimodal Token-based singing voice and accompaniment deep learning method", CN120126506 (hereinafter the "VEM-Token vocal multimodal model"), and the pending invention patent "VEM-Token beat capture and alignment model construction method", 202511249168.0 (hereinafter the "VEM-Token beat model"). These successfully take the music beat as the information token, i.e. the VEM-Token (Vocal-Emotion-Multimodal Token, the token of multimodal vocal emotion), to divide a music file, and interpret and understand the meaning of music through the multimodal vocal-emotion VEM parameters; this is what is meant by "sound-to-text", i.e. generating tokens from vocal music/music. In those two patents, a certain number of songs with standard styles and classifications are collected by human music experts; the vocal files are converted to spectra, beats are detected, and the spectral vocal files are divided into VEM-Token sequences according to the vocal beats. A VEM coordinate system, VEM functions and a VEM library are established according to multiple modalities such as lyrics, singing voice, accompaniment, singer emotion, accompaniment emotion, videos and images; VEM-Token identification is performed; song-voice streams and accompaniment streams are separated; the vocal samples are given multimodal emotion scores by vocal experts; VEM parameters are obtained by supervised learning and deep learning algorithms; and the multimodal emotion of the vocal samples is learned. For other vocal works, multimodal vocal emotion can then be identified, and the lyric spectrum, VEM-Token song-voice spectrum, VEM-Token accompaniment spectrum, and VEM-Token score can be output. The present invention can serve as a patent-pool patent together with CN120126506 and 202511249168.0, can be connected to an AI system or an independently developed application system, and can be developed into a vocal intelligent Agent that can recognize the score of a song by listening. Supervised learning is performed to obtain the VEM library.
Because such a specialized and accurate VEM library is generated through supervised learning by human experts, the possibility of a downstream large model producing "hallucinations" in application is almost eliminated.
The VEM-Token concepts referred to in this application are based on the basic concepts and steps defined in the CN120126506 and 202511249168.0 patents, unless otherwise emphasized or specifically defined in this application. References herein to "vocal files", "music files", "songs", and "musical compositions" mean the same thing unless specifically stated otherwise.
The invention is intended to be applied to the development and application of current large models, such as OpenAI, DeepSeek, Google Gemini, Kimi, Doubao, and other large models, which form various Agents once front ends and back ends are connected for particular applications. The invention aims to access these large models and realize two-way communication with them, forming artificial-intelligence applications based on music/vocal music, and even music-AI agents, providing powerful innovation and support for expanding AI applications.
Shortcomings of the prior art methods
(1) Without vocal emotion modeling, the quantitative characterization capability of vocal emotion multimodality is lacking.
(2) Recognition and understanding based on vocal music and music emotion cannot be achieved.
(3) Automatic and intelligent vocal music modification based on emotion recognition and understanding cannot be realized.
(4) Modification methods are based on manual operation: parameters can only be adjusted by hand, with low efficiency and no standardization or automation.
Disclosure of Invention
In view of the defects of the prior art, the invention provides a brand-new way of thinking about, and method for, multimodal vocal-emotion modification, achieving the objects and intents of the invention.
The aim and the intention of the invention are realized by adopting the following technical proposal and working steps:
1. VEM-Token magic model basic scheme implementation step
The invention is used as a construction method of a VEM-Token vocal emotion multi-mode magic model, which comprises the following steps:
ST100, collecting a sample file and a user file, capturing and dividing beats with the VEM-Token model to obtain the VEM-Token1 sequence of the sample file and the VEM-Token2 sequence of the user file respectively, aligning the beats of all VEM-Token2 sequences against the VEM-Token1 sequence, and generating the aligned VEM-Token2 sequence.
ST200, identifying the VEM parameters of the VEM-Token1 sequence according to the VEM parameters included in the VEM-Token model, letting the user determine a modification scheme, processing the VEM parameters of the VEM-Token2 sequence, and generating, through the modification, the VEM-Token2 sequence of a result file that conforms to the style of the VEM-Token1 sequence under the chosen modification scheme.
2. Sample file and user file beat capture alignment step
In the foregoing basic aspect, the present invention provides a method or step of beat capture and alignment in terms of sample files and user file models, including, but not limited to, one or more of the following combinations:
ST110 for a sample file that does not include loop segments, beat capture is used to mark the start and end points of the beat, generating the complete VEM-Token1 sequence.
ST120, for a sample file comprising loop segments, starting from the second loop segment until all loop segments are finished, after beat capture, executing a starting point fine tuning step and an end point fine tuning step, realizing beat alignment inside the loop segments, and generating all VEM-Token1 sequences.
ST130 is to execute beat capturing and beat aligning steps including a starting point fine tuning step and an end point fine tuning step in all VEM-Token2 sequences according to the VEM-Token1 sequences to process corresponding beats to generate the VEM-Token2 sequences.
ST140, the user files specifically comprise user files generated by the user singing in imitation of the sample file, user files generated by collecting the user's voice characteristics and fully cloning the sample file, and result files generated by mixing partly-sung and partly-cloned user files.
3. VEM-Token model
Based on the foregoing, the present invention provides steps or methods in terms of a VEM-Token model, including, but not limited to, a combination of one or more of the following vocal emotion multimodal models:
The ST210 is that the VEM-Token model also comprises a VEM library and a VEM processor.
The VEM parameters include VEM classification, VEM function and established VEM coordinate system recording multiple modalities in vocal emotion.
ST212 the modalities include one or a combination of lyrics, singing voice, accompaniment, illustration, vocal style, music, emotional basis, accompaniment instrument, video, image, and ambient sound.
ST213 the emotional basis includes one or a combination of happiness, sadness, anger, fear, aversion, surprise, calm, longing, expectancy, trust, love, hate, affection, and hostility.
ST214 the vocal style comprises one or a combination of folk, pop, rock, Western, popular, composed-song and opera singing styles.
ST215 the voice style includes the fundamental frequency and overtones produced by the user's vocal cords and resonant cavities during natural singing, speaking, and recitation; these are innate features that distinguish the user from other people.
ST216, collecting various vocal files by VEM classification, performing emotion judgment on singing and accompaniment on the vocal files by a vocal expert or a learning algorithm of a human, training a VEM function by adopting supervised learning and deep learning to obtain VEM parameters, and adding the VEM parameters to a VEM library.
ST220 is that the VEM processor provides an operation interface for the interaction of the user and the machine, and the specific implementation of the magic scheme is completed.
4. Magic scheme model
Based on the foregoing, the present invention provides a step or method in a magic solution model, including, but not limited to, one or more of the following combinations:
the magic scheme comprises the steps of singing magic, accompaniment magic, emotion overtone magic, and emotion fluctuation magic, and specifically comprises the following steps:
ST 310. The magic scheme comprises singing magic, and specifically comprises:
ST311, preprocessing the sample file and the user file by using a VEM processor, setting a singing filter, converting the sample file and the user file into a frequency spectrum format file, and separating a VEM-token1.1 sequence of singing voice of the sample file and a VEM-token2.1 sequence of singing voice of the user file.
ST312, capturing beat starting points and beat end points of all the VEM-Token1.1 sequences and all the VEM-Token2.1 sequences respectively according to beat capturing and beat aligning models, and aligning the beat starting points and the beat end points of the VEM-Token2.1 sequences with the beat starting points and the beat end points of the VEM-Token1.1 sequences by adopting a starting point fine tuning model and an end point fine tuning model.
ST 320. The magic scheme also comprises accompaniment sound magic, and specifically comprises:
ST321, preprocessing the sample file and the user file by using a VEM processor, setting an accompaniment filter, converting the sample file and the user file into a frequency spectrum format file, separating a VEM-token1.2 sequence of accompaniment sounds of the sample file and a VEM-token2.2 sequence of accompaniment sounds of the user file.
ST322, capturing beat starting points and beat end points of all the VEM-Token1.2 sequences and all the VEM-Token2.2 sequences respectively according to beat capturing and beat aligning models, and aligning the beat starting points and the beat end points of the VEM-Token2.2 sequences with the VEM-Token1.2 sequences by adopting a starting point fine tuning model and an end point fine tuning model.
ST330 the magic scheme also comprises emotion overtone magic, which comprises the following steps:
ST331, preprocessing the sample file and the user file by using a VEM processor, setting an emotion overtone filter, converting the sample file and the user file into a frequency spectrum format file, and separating a VEM-token1.3 sequence of emotion overtones of the sample file and a VEM-token2.3 sequence of emotion overtones of the user file.
ST332, capturing beat starting points and beat end points of all the VEM-Token1.3 sequences and all the VEM-Token2.3 sequences respectively according to beat capturing and beat aligning models, and aligning the beat starting points and the beat end points of the VEM-Token2.3 sequences with the VEM-Token1.3 sequences by adopting a starting point fine tuning model and an end point fine tuning model.
ST340 the magic scheme also comprises emotion fluctuation magic, which specifically comprises:
ST341, preprocessing the sample file and the user file by using a VEM processor, setting an emotion fluctuation filter, converting the sample file and the user file into a frequency spectrum format file, and separating a VEM-token1.4 sequence for forming emotion fluctuation of the sample file and a VEM-token2.4 sequence for forming emotion fluctuation of the user file.
ST342, capturing beat starting points and beat end points of all the VEM-Token1.4 sequences and all the VEM-Token2.4 sequences respectively according to beat capturing and beat aligning models, and aligning the beat starting points and the beat end points of the VEM-Token2.4 sequences with the VEM-Token1.4 sequences by adopting a starting point fine tuning model and an end point fine tuning model.
5. Learning and singing magic improvement
Based on the foregoing, the present invention provides steps or methods of a modification model for a user learning to sing a sample song file, including, but not limited to, one or more of the following in combination:
The magic scheme also comprises a step of learning to sing the magic, and specifically comprises the following steps:
ST410 the learning magic singing comprises simple learning singing, and specifically comprises:
ST411, using the VEM processor and taking a bar (a combination of several beats) as the unit, the user sings all or part of the sample file more than once; the recordings are converted into several groups of user-file singing-voice VEM-Token2.1 sequences, and the user then selects, from these groups, the one he or she prefers most relative to the corresponding sample-file singing-voice VEM-Token1.1 sequence as the selected singing-voice VEM-Token2.1 sequence.
ST412, using ST310, ST311, ST312 steps, generating a beat captured and aligned VEM-token2.1 sequence from the selected singing voice VEM-token2.1 sequence.
ST420 is that the learning magic singing improvement also comprises the mixed learning singing, which comprises the following steps:
ST421, setting weights A and B for the VEM-Token1.1 sequence and the VEM-Token2.1 sequence, and computing the mixed singing result as the sequence A × VEM-Token1.1 + B × VEM-Token2.1, where A < 0.3 and A + B = 1.0.
6. Vocal cloning
Based on the foregoing, the present invention provides steps or methods of a voice-cloning modification model, including, but not limited to, one or more of the following in combination:
the magic scheme also comprises a voice cloning magic step, and specifically comprises the following steps:
ST430 the magic scheme also comprises voice clone magic, which comprises the following steps:
ST431, the VEM library is queried; if VEM parameters for the user's voice style exist, they are obtained.
If no VEM parameters for the user's voice style exist, or the existing parameters do not meet the user's needs, a recording of the user singing a practice piece and a recording of the user speaking and reciting are collected; the VEM processor converts these recordings, through the singing filter, the emotional-overtone filter, and the emotional-fluctuation filter, into a spectrum-format file carrying the user's voice style; the VEM parameters of the user's voice style are then obtained according to the VEM classification and stored in the VEM library.
ST432, recognizing a lyric spectrum 1 in the sample file by adopting voice recognition included in a VEM-Token model according to the singing voice VEM-Token1.1 sequence of the sample file, wherein the lyric spectrum 1 comprises lyrics and positions of a start point and an end point of a beat where the lyrics are positioned.
ST433 copies the lyrics spectrum 1 of the sample file to become the lyrics spectrum 2 of the user file, adopts a voice synthesizer according to the VEM parameters of the voice style of the user, and finishes the voice cloning magic change of the user word by word according to the lyrics spectrum 2 of the user file and the starting point and the end point position of the beat, thus becoming the singing voice VEM-token2.1 sequence of the user.
7. Lyric magic
Based on the foregoing, the present invention provides a model step or method on a lyric magic model, including, but not limited to, one or more of the following in combination:
the magic scheme also comprises a lyric magic step, and specifically comprises the following steps:
ST500, when lyric spectrum 2 is inconsistent with lyric spectrum 1, or the user needs to modify the lyrics, lyric modification is executed, specifically including:
ST510, decomposing lyric spectrum 2 and lyric spectrum 1 according to semantic grammar into lyric sentence 2 and lyric sentence 1, and executing the following steps:
ST511, where lyric sentence 2 and lyric sentence 1 have the same number of words, lyric spectrum 1 is copied beat by beat as lyric spectrum 2, each word of lyric sentence 2 is filled into the corresponding beat position one by one, or the words are manually modified by the user, forming the modified lyric spectrum 2.
ST512, where lyric sentence 2 and lyric sentence 1 have different numbers of words, the words of lyric sentence 2 are manually modified by the user and filled into the corresponding beat positions one by one, forming the modified lyric spectrum 2.
ST513, using a voice synthesizer and the VEM parameters of the user's voice style, the user's voice-cloning modification is completed word by word according to lyric spectrum 2 and the beat start and end points, producing the user's singing-voice VEM-Token2.1 sequence.
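As a concrete illustration of the lyric-spectrum handling in ST500–ST513 above, the following is a minimal sketch (not part of the patented method itself) of one possible representation of a lyric spectrum as (word, beat start, beat end) entries and of copying lyric spectrum 1 into a modified lyric spectrum 2; the names LyricEntry and build_lyric_spectrum_2 are illustrative assumptions, not terms from the patent.

```python
from dataclasses import dataclass, replace
from typing import List

@dataclass
class LyricEntry:
    word: str          # one lyric word/syllable
    beat_start: float  # beat start point on the time axis, in seconds
    beat_end: float    # beat end point on the time axis, in seconds

def build_lyric_spectrum_2(lyric_spectrum_1: List[LyricEntry],
                           new_words: List[str]) -> List[LyricEntry]:
    """Copy lyric spectrum 1 beat by beat and fill the (possibly user-modified)
    words of lyric sentence 2 into the corresponding beat positions."""
    if len(new_words) != len(lyric_spectrum_1):
        raise ValueError("lyric sentence 2 must provide one word per beat slot")
    return [replace(entry, word=w) for entry, w in zip(lyric_spectrum_1, new_words)]
```

In this representation, ST511/ST512 only change the word field while every beat's start and end point is preserved, which is exactly the timing information the synthesizer in ST513 needs to place each cloned word.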
8. Pitch calibration, decorative sounds, and other modifications
On the basis of the foregoing, the present invention also includes, but is not limited to, steps or methods of one or more combinations of the following:
the modification scheme also comprises the steps of pitch calibration, decorative-sound modification, beat-length modification, tempo modification, beat-strength modification, timbre modification, emotion modification, video modification, free modification, multi-sample modification, and real-time audition monitoring, specifically including:
ST520, pitch calibration: a fundamental frequency that falls between two adjacent node frequencies of the twelve-tone equal-temperament scale is adjusted up or down to the nearer node frequency (a minimal sketch is given after this list).
ST530, leading decorative-sound modification: within one beat, when the beat start point of the user file's singing voice VEM-Token2.1 and the beat start point of the corresponding sample file's singing voice VEM-Token1.1 fall within 1/2 beat of each other on the time axis, a decorative sound is inserted before the start point of VEM-Token2.1 to compensate and align the start points; the decorative sounds include a vibrato, a slide, a sustained tone, and a breath sound.
ST540, trailing decorative-sound modification: within one beat, when the beat end point of the user file's singing voice VEM-Token2.1 precedes the beat end point of the corresponding sample file's singing voice VEM-Token1.1 by less than 1/2 beat on the time axis, a decorative sound or a rest is appended after the end point of VEM-Token2.1 to align the end points.
ST550, beat-length modification: when the length of a beat of the user file's singing voice VEM-Token2.1 does not match the length of the corresponding beat of the sample file's singing voice VEM-Token1.1, a time-warping step or a decorative-sound step is used to compress or stretch the beat of VEM-Token2.1 so that it aligns with the corresponding beat of VEM-Token1.1.
ST560, tempo modification: when the overall tempo of the user file needs to be sped up or slowed down, a time-stretching step is used to synchronously compress or stretch the singing voice and accompaniment of the user file (see the time-stretch sketch after this list).
ST570, beat strength magic, namely aiming at singing voice VEM-token2.1 of a user file and/or accompaniment voice VEM-token2.2 of the user file, adopting different treatments based on a base layer and an emotion layer according to the needs of a user, wherein:
ST571, the base layer comprises volume dynamic processing and envelope shaping steps, applied with thresholds to VEM-Token2.1 and VEM-Token2.2.
ST 572. The emotion layer comprises intelligent dynamic control based on AI/machine learning, specifically comprises training a model to intelligently identify beats and musical instruments in audio, automatically generating dynamic processing parameters according to preset emotion labels or target loudness curves, and storing training results in a VEM library.
ST580, when the beat length of VEM-Token2.1 is changed, the beat length of VEM-Token2.2 is checked synchronously; when the lengths are inconsistent, time stretching is used to re-capture and align VEM-Token2.1 and VEM-Token2.2.
ST590, tone color magic change, specifically comprising:
ST591, for the singing-voice VEM-Token1.1 sequence of the sample file and the singing-voice VEM-Token2.1 sequence of the user file, a filter covering several higher harmonics of the fundamental frequency is used to decompose the harmonic frequencies F1, F2, F3, ..., Fn and their amplitudes A1, A2, A3, ..., An, where n is the number of harmonics and n < 50.
ST592, volume dynamic processing and envelope shaping steps are used to amplify or attenuate the amplitude of one or more specified harmonic components among A1, A2, A3, ..., An in the VEM-Token2.1 sequence, so as to change the timbre of the user file (a harmonic-scaling sketch is given after this list).
ST593, inquiring the tone color dynamic processing parameters of AI/machine learning in the VEM library, and adjusting the dynamic processing parameters to change the tone color of the user file.
ST5A0, emotion modification: according to the user's emotion-modification demands, the VEM library is queried and, with the VEM-Token2.1 sequence as the independent variable, the VEM parameters for the required emotion, vocal style, and voice style are adjusted respectively to obtain the emotion-modification result.
ST5B0, when the user file needs to be matched with the video, the content and the rhythm of the video picture are adjusted according to the VEM parameters and the rhythm so as to adapt to the requirement of the user file.
ST5C0, free magic, the user changes the content and rhythm of the VEM-token2.1, VEM-token2.2 and video pictures according to the VEM parameters and more than one mode, so as to adapt to the requirements of user files.
ST5D0, multiple sample magic, wherein the user selects more than one sample file, selects part of sample files corresponding to part of parameters in the VEM parameters respectively, selects other part of sample files corresponding to the other part of parameters, and generates the content and rhythm of the VEM-Token2.1, VEM-Token2.2 and video pictures by magic so as to adapt to the requirements of the user files.
ST5E0, real-time audition monitoring: the user auditions, judges, and scores the modification result in real time; the modification process and result are generated as the VEM-Token2 sequence of the result file and submitted for storage in the VEM library.
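The three sketches below illustrate, in order, the pitch calibration of ST520, the time-warping step of ST550/ST560, and the harmonic decomposition and scaling of ST591/ST592. They are minimal illustrations only: the A4 = 440 Hz reference, the function names, and the use of librosa's phase-vocoder time stretch are assumptions of these sketches, not requirements of the patent.

```python
import numpy as np

A4 = 440.0  # reference pitch in Hz; the patent does not fix a reference, this is an assumption

def snap_to_equal_temperament(f0_hz: float) -> float:
    """ST520: move a fundamental frequency lying between two adjacent
    twelve-tone equal-temperament node frequencies to the nearer node."""
    if f0_hz <= 0:
        return f0_hz                               # unvoiced frame, leave untouched
    semitones = 12.0 * np.log2(f0_hz / A4)         # signed distance from A4 in semitones
    return A4 * 2.0 ** (round(semitones) / 12.0)   # nearest node frequency

# snap_to_equal_temperament(452.0) -> 440.0 (A4)
```

A time-stretch step of the kind named in ST550/ST560 could be sketched as:

```python
import numpy as np
import librosa

def stretch_to_length(y: np.ndarray, sr: int, target_len_s: float) -> np.ndarray:
    """ST550/ST560: compress or stretch a beat (or a whole file) so that its
    duration matches a target length, using a phase-vocoder time stretch."""
    current_len_s = len(y) / sr
    rate = current_len_s / target_len_s            # rate > 1 shortens, rate < 1 lengthens
    return librosa.effects.time_stretch(y, rate=rate)
```

And the harmonic decomposition and scaling of ST591/ST592 could be sketched as:

```python
import numpy as np

def scale_harmonics(frame: np.ndarray, sr: int, f0: float,
                    gains: dict, n_harmonics: int = 20) -> np.ndarray:
    """ST591/ST592: decompose a short frame into harmonics F1..Fn of the
    fundamental f0 and scale the amplitudes of selected harmonics;
    `gains` maps a 1-based harmonic index to a linear gain, e.g. {3: 1.5, 5: 0.7}."""
    spectrum = np.fft.rfft(frame)
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    for k in range(1, n_harmonics + 1):             # patent: n < 50
        if k in gains:
            band = np.abs(freqs - k * f0) < f0 / 4  # narrow band around the k-th harmonic
            spectrum[band] *= gains[k]
    return np.fft.irfft(spectrum, n=len(frame))
```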
9. Member management
Based on the foregoing, the present invention further includes, but is not limited to, member management for users, specifically including one or more of the following steps or methods in combination:
The magic model also comprises member management, and specifically comprises the following steps:
ST600, applying for members to the magic model according to the needs of the user, establishing a member file, and storing the member file in a VEM library.
ST610, the member archive comprises user information, sample file information, user file information, VEM parameters, voiceprint encryption and voiceprint decryption, wherein the encryption and decryption keys comprise member signatures, member images, member videos and member VEM parameters.
ST620, member management includes advancing, backing, rolling back, adding, deleting, inquiring, modifying, storing and maintaining operation steps in the magic process.
ST630, member management also comprises instant modification, instant monitoring, real-time scoring, supervised learning, reinforcement learning, rewarding and punishment of user files, and the results are stored in a VEM library.
10. Others
Based on the foregoing, the magic model of the present invention further includes, but is not limited to, the following steps or methods:
ST700 is that the magic model also comprises an application system of a mobile terminal and an application system of a PC terminal, and also comprises an application system of a cloud mode and a blockchain application system.
ST800, the modification model also comprises a supporting hardware system, including a communication interface, a recording module, a tuning module, a playback module, an encryption module and a decryption module, as well as a Douyin (TikTok) system interface, a video-account (WeChat Channels) interface, and an AI karaoke supporting system.
ST900, synchronization signals are provided for subsequent large-model applications, and AI systems including DeepSeek, Kimi and ChatGPT are accessed to form an AI Agent.
STA00, the modification model also includes interface protocols, providing the hardware- and network-based MIDI protocol, MSC extension protocol, and OSC network protocol, providing the transport-layer AES3/SPDIF and MADI protocols, and providing network audio transport protocols such as the Dante protocol, AVB/TSN protocol, and AES67 protocol.
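As one illustration of how such an interface could carry per-beat synchronization data for ST900/STA00, the sketch below sends a beat message over the OSC network protocol using the python-osc package; the "/vem/beat" address and the payload layout are assumptions of this sketch, not part of the patent.

```python
# pip install python-osc
from pythonosc.udp_client import SimpleUDPClient

def send_beat_sync(host: str, port: int, beat_index: int,
                   beat_start_s: float, beat_end_s: float) -> None:
    """Send one per-beat synchronization message so that an external system,
    e.g. an AI karaoke front end or a large-model Agent, can follow the beats."""
    client = SimpleUDPClient(host, port)
    client.send_message("/vem/beat", [beat_index, beat_start_s, beat_end_s])

# example: send_beat_sync("127.0.0.1", 9000, 17, 24.53, 25.01)
```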
11. Object and intent of the invention
The invention relates to a method for constructing a VEM-Token vocal emotion multi-mode magic model, which aims at and aims at:
Realizing "sound-to-text", i.e. recognition and understanding based on vocal music and musical emotion.
Creating an automatic and intelligent vocal-AI modification model based on emotion recognition and understanding of music files, and introducing VEM parameters.
Creating automated modification methods and steps for a user learning to sing from a sample music file.
Greatly increasing the degree of automatic parameter adjustment in the modification process, and realizing standardization and automation.
Innovating a model oriented to music/vocal music and connecting it to large models to form intelligent Agents for music-AI applications or dedicated application systems, providing powerful support for broad AI applications.
Realizing the modeling of vocal emotion and the capability of quantitatively characterizing multimodal vocal emotion.
12. Advantageous effects of the invention
(1) "Phonological" is achieved, enabling AI to recognize and understand music/sound moods.
(2) The modeling of the vocal emotion is solved, and the quantitative characterization capability of the vocal emotion in a multi-mode is established.
(3) Realizing the automatic and intelligent vocal music modification based on emotion understanding.
(4) An efficient modification method is realized by using automation, semi-automation and artificial intelligence in place of manual parameter adjustment.
(5) The "illusion" of AI operation is greatly reduced.
Drawings
List of drawings:
FIG. 1 is a schematic diagram of the VEM-Token modification model
FIG. 2 is a schematic diagram of associated emotions in three-dimensional space
FIG. 3 is a schematic diagram of part of the user operation interface
Detailed description of the drawings:
see the specific examples for details.
Detailed Description
The invention is applied as a patent-pool patent together with the granted Chinese invention patent "VEM-Token vocal emotion multimodal Token-based singing voice and accompaniment deep learning method", CN120126506, and the pending invention patent "VEM-Token beat capture and alignment model construction method", 202511249168.0. It focuses on the imitation and modification path taken when a user learns a sample song, and makes further basic innovations.
The objects and intentions of the present invention can be achieved by the following specific examples. It should be noted herein that the specific embodiments have specific application and industrial applicability. Therefore, the embodiments do not include all of the features and steps of the present invention, nor are they intended to be limiting of the present invention. The description of the claims of the present invention is a summary of the invention.
This example is one example of the present invention.
Specific embodiments of the present invention are exemplified as follows:
A construction method of the VEM-Token vocal emotion multimodal modification model: an AI singing-learning agent and AI karaoke system
Description of the drawings
The contents of this embodiment mainly include, but are not limited to, those composed of the following main schematic drawings, which are fig. 1 to 3.
Description of the implementation procedure
The method steps of the present embodiment mainly include steps 1 to 10. Wherein each of the 10 parts comprises a number of sub-steps. Unless specifically stated otherwise, these sub-steps are not all required, nor are their sequencing required unless otherwise stated, but are optimized and further selected by the patent practitioner according to the needs of some specific tasks.
The specific working steps are as follows:
1. VEM-Token magic model basic scheme implementation step
The invention is used as a construction method of a VEM-Token vocal emotion multi-mode magic model, which comprises the following steps:
ST100, collecting a sample file and a user file, capturing and dividing beats with the VEM-Token model to obtain the VEM-Token1 sequence of the sample file and the VEM-Token2 sequence of the user file respectively, aligning the beats of all VEM-Token2 sequences against the VEM-Token1 sequence, and generating the aligned VEM-Token2 sequence.
ST200, identifying the VEM parameters of the VEM-Token1 sequence according to the VEM parameters included in the VEM-Token model, letting the user determine a modification scheme, processing the VEM parameters of the VEM-Token2 sequence, and generating, through the modification, the VEM-Token2 sequence of a result file that conforms to the style of the VEM-Token1 sequence under the chosen modification scheme.
Wherein, the model of the VEM-Token vocal emotion multi-mode refers to the model in the VEM-Token vocal emotion multi-mode Token singing and accompaniment deep learning method, CN120126506, which specifically comprises the steps of 1-4:
(1) Emotion is recorded using more than one modality, multimodal vocal emotion is denoted VEM, and a VEM classification, VEM coordinate system, VEM function and VEM library are constructed, wherein the vocal emotion comprises one or a combination of happiness, sadness, anger, fear, aversion, surprise, calm, expectancy, trust, love, hate, affection, and hostility; the modalities comprise one or a combination of lyrics, singing voice, accompaniment, vocal style, music, emotional basis, accompaniment instrument, video and image; and the VEM coordinate system comprises a coordinate-axis system established according to independent emotions, opposite emotion pairs, and associated opposite emotion groups.
(2) Collecting vocal samples according to VEM classification, performing emotion judgment on the vocal sounds and accompaniments by a vocal music expert of a human being, training a VEM function by adopting supervised learning and deep learning to obtain VEM parameters, and adding the VEM parameters to a VEM library.
(3) And (3) carrying out beat calibration on the vocal files by adopting a VEM processor, separating out song sound streams and accompaniment streams, carrying out VEM-Token segmentation on the vocal files according to beats, converting the song sound streams into VEM-Token1 sequences, converting the accompaniment streams into VEM-Token2 sequences, and adding the VEM-Token2 sequences into a preprocessing library.
(4) Deep learning is used to generate, respectively, the lyric spectrum, the VEM-Token song-voice spectrum, the VEM-Token accompaniment spectrum, and the VEM-Token score.
The VEM-Token beat capture and beat alignment model refers to the model in the "VEM-Token beat capture and alignment model construction method", 202511249168.0, and specifically comprises steps (5)-(7):
(5) For a vocal file, setting a beat model comprising beat capturing and beat alignment according to a VEM-Token vocal emotion multi-mode model, capturing the beat of the vocal file, dividing the vocal file into VEM-Token sequences according to the beat, and marking the positions of the start point and the end point of the beat in each VEM-Token;
(6) Setting a starting point alignment model, comprising:
dividing a sample file included in the vocal music file and a user file generated by singing the user simulated sample file into a VEM-Token1 sequence and a VEM-Token2 sequence respectively, adopting a starting point fine tuning step according to the starting point of each VEM-Token1, and adjusting the starting points of the VEM-tokens 2 at corresponding positions one by one so as to align with the starting points of the VEM-tokens 1 at the corresponding positions;
For each segment of the circulating segments included in the vocal music file, starting from the second segment by taking the first segment as a reference, adopting a starting point fine adjustment step to adjust the starting point of each VEM-Token of each segment one by one, so that the starting point of each VEM-Token at the corresponding position of the first segment is aligned until all circulating segments are ended;
(7) Setting an endpoint alignment model, comprising:
According to the end point of each VEM-Token1, adopting an end point fine tuning step to adjust the end points of the VEM-Token2 at the corresponding positions one by one so as to align with the end points of the VEM-Token1 at the corresponding positions;
And for each segment of the circulating segments, taking the first segment as a reference, starting from the second segment, adopting an end point fine tuning step to adjust the end point of each VEM-Token of each segment one by one, so that the end points of the VEM-tokens at the positions corresponding to the first segment are aligned until all circulating segments are ended.
In the present application, the modification model is the algorithm by which a user file is modeled on a sample file.
It should be noted that there may be more than one sample file.
If the user only needs to imitate a single sample file, the sample file is that one song, and only the VEM parameters of that song are needed. For example, if the user only wants to imitate one singer's version of a particular song, the sample file is that version's song file, and all VEM parameters of that song are selected.
If the user needs to select different VEM parameters from two sample files, then there are two sample files. For example, if the user likes one part of a song as sung by one singer and another part as sung by a different singer, the two singers' versions and their VEM parameters both need to be selected.
FIG. 1 is a schematic diagram of the VEM-Token modification model.
In fig. 1, one or more sample files and user files are input to a VEM processor, and sent to a VEM library query by a path 101, and if the VEM library finds that the sample files and even the user files are stored, the VEM parameters of the files and/or the VEM-Token1 sequence/VEM-Token 2 sequence are sent to a beat capture and alignment module by a path 103. According to the needs of the user, the VEM parameters and the VEM-Token1 sequence of the sample file can be sent to the VEM-Token magic matrix through a 102 path. If no sample file exists in the VEM parameter library, the VEM parameters of the sample file and/or the user file are parsed by the VEM processor.
In the beat capture and alignment module, the data from the VEM processor includes at least the VEM-Token2 sequence of the user file, plus the VEM-Token1 sequence of the sample file from path 103, or the VEM-Token1 sequence of the sample file and the VEM-Token2 sequence of the user file both from the VEM processor; the beat capture and beat alignment steps are then executed. The result is sent to the multimodal learning analysis module to be further decomposed into the singing voice and accompaniment of the sample file and the user file: the VEM-Token1.1, VEM-Token1.2, VEM-Token2.1 and VEM-Token2.2 sequences. In addition, according to the user's needs, the emotional overtones and emotional fluctuation are further decomposed into the sample emotional-overtone VEM-Token1.3 and sample emotional-fluctuation VEM-Token1.4 sequences, and the user file's emotional-overtone VEM-Token2.3 and emotional-fluctuation VEM-Token2.4 sequences.
In fig. 1, for convenience of illustration, a VEM-Token magic matrix is provided, and all the magic items in the present invention are incorporated into the VEM-Token magic matrix for description. It should be understood by the user of this patent that this is only an illustrative description and is not necessarily a module or hardware nor a limitation of the invention.
The information sources of the VEM-Token magic matrix comprise 2 paths, which are respectively:
Either all the information originates from the multimodal learning analysis module, i.e. VEM-Token1.1, VEM-Token1.2, VEM-Token1.3 and VEM-Token1.4 of the sample file, together with VEM-Token2.1, VEM-Token2.2, VEM-Token2.3 and VEM-Token2.4 of the user file; this is the case when no content of the sample file is yet stored in the VEM library.
Or part of the information originates from the multimodal learning analysis module, namely the content of the user file, and part comes from the VEM library through path 104, namely the content of the sample file; this is the case when the content of the sample file is already stored in the VEM library.
The information output of the VEM-Token modification matrix comprises the modification result of the user file, which is stored in the VEM library at the user's discretion.
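Read as software, the flow of FIG. 1 can be summarized by the control-flow sketch below; every object and method name here (vem_library, vem_processor, lookup, analyze, align_beats, decompose, modify, store) is an illustrative placeholder, not an interface defined by the patent.

```python
def run_modification_pipeline(sample_path, user_path, vem_library, vem_processor):
    """Control-flow sketch of FIG. 1: VEM library query (paths 101/103/104),
    beat capture and alignment, multimodal decomposition, and the
    VEM-Token modification matrix."""
    sample = vem_library.lookup(sample_path)           # path 101: query the VEM library
    if sample is None:                                 # sample not stored yet
        sample = vem_processor.analyze(sample_path)    # parse VEM parameters / VEM-Token1
    user = vem_processor.analyze(user_path)            # VEM-Token2 of the user file

    aligned_user = vem_processor.align_beats(sample.tokens, user.tokens)

    # decompose into singing voice / accompaniment / emotional overtone / fluctuation
    sample_parts = vem_processor.decompose(sample.tokens)
    user_parts = vem_processor.decompose(aligned_user)

    result = vem_processor.modify(sample_parts, user_parts, sample.vem_params)
    vem_library.store(result)                          # at the user's discretion
    return result
```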
2. Sample file and user file beat capture alignment step
In the foregoing basic aspect, the present invention provides a method or step of beat capture and alignment in terms of sample files and user file models, including, but not limited to, one or more of the following combinations:
ST110 for a sample file that does not include loop segments, beat capture is used to mark the start and end points of the beat, generating the complete VEM-Token1 sequence.
ST120, for a sample file comprising loop segments, starting from the second loop segment until all loop segments are finished, after beat capture, executing a starting point fine tuning step and an end point fine tuning step, realizing beat alignment inside the loop segments, and generating all VEM-Token1 sequences.
ST130 is to execute beat capturing and beat aligning steps including a starting point fine tuning step and an end point fine tuning step in all VEM-Token2 sequences according to the VEM-Token1 sequences to process corresponding beats to generate the VEM-Token2 sequences.
ST140, the user files specifically comprise user files generated by the user singing in imitation of the sample file, user files generated by collecting the user's voice characteristics and fully cloning the sample file, and result files generated by mixing partly-sung and partly-cloned user files.
It should be emphasized that loop segments are frequent in songs, for example a three-verse loop. In the typical case the beats of the loop segments are aligned, which is why the start-point and end-point fine adjustments are performed. However, because of a singer's emotional treatment or differing song styles, the beats of individual bars are not absolutely aligned across loop iterations. The user of this patent therefore needs to handle this differently from song to song.
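A minimal sketch of the beat capture and start/end fine-tuning described in ST110–ST130 is given below; it uses librosa's beat tracker as a stand-in for the patent's beat-capture model and simply snaps the user token boundaries to the sample's, both of which are assumptions of this sketch.

```python
import librosa

def capture_beats(path: str):
    """ST110: detect beat times and return one (start, end) pair per beat,
    delimiting each VEM-Token on the time axis."""
    y, sr = librosa.load(path, sr=None)
    _, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
    times = librosa.frames_to_time(beat_frames, sr=sr)
    return y, sr, list(zip(times[:-1], times[1:]))

def fine_tune(user_beats, sample_beats):
    """ST130: start-point and end-point fine tuning, here reduced to snapping
    each user beat's boundaries to the corresponding sample beat's boundaries."""
    return [(s_start, s_end)
            for (_, _), (s_start, s_end) in zip(user_beats, sample_beats)]
```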
3. VEM-Token model
Based on the foregoing, the present invention provides steps or methods in terms of a VEM-Token model, including, but not limited to, a combination of one or more of the following vocal emotion multimodal models:
The ST210 is that the VEM-Token model also comprises a VEM library and a VEM processor.
The VEM parameters include VEM classification, VEM function and established VEM coordinate system recording multiple modalities in vocal emotion.
ST212 the modalities include one or a combination of lyrics, singing voice, accompaniment, illustration, vocal style, music, emotional basis, accompaniment instrument, video, image, and ambient sound.
ST213 the emotional basis includes one or a combination of happiness, sadness, anger, fear, aversion, surprise, calm, longing, expectancy, trust, love, hate, affection, and hostility.
ST214 the vocal style comprises one or a combination of folk, pop, rock, Western, popular, composed-song and opera singing styles.
ST215 the voice style includes the fundamental frequency and overtones produced by the user's vocal cords and resonant cavities during natural singing, speaking, and recitation; these are innate features that distinguish the user from other people.
ST216, collecting various vocal files by VEM classification, performing emotion judgment on singing and accompaniment on the vocal files by a vocal expert or a learning algorithm of a human, training a VEM function by adopting supervised learning and deep learning to obtain VEM parameters, and adding the VEM parameters to a VEM library.
ST220 is that the VEM processor provides an operation interface for the interaction of the user and the machine, and the specific implementation of the magic scheme is completed.
It should be noted here that the VEM-Token model is a model of multimodal vocal-emotion tokens and differs from the tokens processed by the natural language processing of the existing NLP-Token, which is based on the direct interpretation of words and Chinese characters. The VEM-Token is a brand-new interpretation that takes the music beat as the token; at the same time, it is associated with the VEM parameters of multimodal vocal emotion, which contain vector data for the various expression modalities, such as love, hate, affection, and hostility, together with the directions and scales of the VEM parameters. In addition, the model includes a VEM library built by supervised learning of typical songs by human vocal experts. In this case the accuracy of the VEM parameters of the songs in the VEM library is high, so the reliability of the computed result is high, and the probability of "hallucinations" of the kind produced by the NLP-Token model is low.
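To make the notion of a VEM-Token carrying its own emotion parameters concrete, the sketch below shows one possible in-memory layout of a VEM-Token and a VEM library entry; every field name is an assumption of this sketch rather than a definition from the patent.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class VEMToken:
    beat_start: float                     # beat start point, seconds
    beat_end: float                       # beat end point, seconds
    modality: str = "singing_voice"       # singing voice / accompaniment / overtone / fluctuation
    emotion: Dict[str, float] = field(default_factory=dict)   # e.g. {"happiness": 0.7, "calm": 0.2}
    vocal_style: str = ""                 # e.g. "pop", "folk", "rock"

@dataclass
class VEMLibraryEntry:
    song_id: str
    tokens: List[VEMToken]                                         # the song's VEM-Token sequence
    expert_scores: Dict[str, float] = field(default_factory=dict)  # human-expert emotion scores
```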
4. Magic scheme model
Based on the foregoing, the present invention provides a step or method in a magic solution model, including, but not limited to, one or more of the following combinations:
the magic scheme comprises the steps of singing magic, accompaniment magic, emotion overtone magic, and emotion fluctuation magic, and specifically comprises the following steps:
ST 310. The magic scheme comprises singing magic, and specifically comprises:
ST311, preprocessing the sample file and the user file by using a VEM processor, setting a singing filter, converting the sample file and the user file into a frequency spectrum format file, and separating a VEM-token1.1 sequence of singing voice of the sample file and a VEM-token2.1 sequence of singing voice of the user file.
ST312, capturing beat starting points and beat end points of all the VEM-Token1.1 sequences and all the VEM-Token2.1 sequences respectively according to beat capturing and beat aligning models, and aligning the beat starting points and the beat end points of the VEM-Token2.1 sequences with the beat starting points and the beat end points of the VEM-Token1.1 sequences by adopting a starting point fine tuning model and an end point fine tuning model.
ST 320. The magic scheme also comprises accompaniment sound magic, and specifically comprises:
ST321, preprocessing the sample file and the user file by using a VEM processor, setting an accompaniment filter, converting the sample file and the user file into a frequency spectrum format file, separating a VEM-token1.2 sequence of accompaniment sounds of the sample file and a VEM-token2.2 sequence of accompaniment sounds of the user file.
ST322, capturing beat starting points and beat end points of all the VEM-Token1.2 sequences and all the VEM-Token2.2 sequences respectively according to beat capturing and beat aligning models, and aligning the beat starting points and the beat end points of the VEM-Token2.2 sequences with the VEM-Token1.2 sequences by adopting a starting point fine tuning model and an end point fine tuning model.
ST330 the magic scheme also comprises emotion overtone magic, which comprises the following steps:
ST331, preprocessing the sample file and the user file by using a VEM processor, setting an emotion overtone filter, converting the sample file and the user file into a frequency spectrum format file, and separating a VEM-token1.3 sequence of emotion overtones of the sample file and a VEM-token2.3 sequence of emotion overtones of the user file.
ST332, capturing beat starting points and beat end points of all the VEM-Token1.3 sequences and all the VEM-Token2.3 sequences respectively according to beat capturing and beat aligning models, and aligning the beat starting points and the beat end points of the VEM-Token2.3 sequences with the VEM-Token1.3 sequences by adopting a starting point fine tuning model and an end point fine tuning model.
ST340 the magic scheme also comprises emotion fluctuation magic, which specifically comprises:
ST341, preprocessing the sample file and the user file by using a VEM processor, setting an emotion fluctuation filter, converting the sample file and the user file into a frequency spectrum format file, and separating a VEM-token1.4 sequence for forming emotion fluctuation of the sample file and a VEM-token2.4 sequence for forming emotion fluctuation of the user file.
ST342, capturing beat starting points and beat end points of all the VEM-Token1.4 sequences and all the VEM-Token2.4 sequences respectively according to beat capturing and beat aligning models, and aligning the beat starting points and the beat end points of the VEM-Token2.4 sequences with the VEM-Token1.4 sequences by adopting a starting point fine tuning model and an end point fine tuning model.
The four basic modification schemes above (singing-voice, accompaniment, emotional-overtone and emotional-fluctuation modification) are the basic schemes for a user's learned-singing file relative to a sample song file. One key point is capturing and aligning the beats of the two. Second, the four schemes can be combined; for example, in many cases the accompaniment can be taken directly from the accompaniment track of the sample file without modification. It is particularly emphasized that in beat fine-tuning, an end-point adjustment is recommended after each start-point adjustment, so that the length of the beat stays fixed.
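The following sketch illustrates the shape of ST311/ST321: separating a file into a vocal-like and an accompaniment-like part and slicing each into per-beat spectrum tokens. Harmonic/percussive separation (librosa's HPSS) is used here purely as a stand-in for the patent's singing filter and accompaniment filter, which are not published algorithms.

```python
import numpy as np
import librosa

def separate_and_tokenize(path: str, beat_bounds):
    """Separate a file into a vocal-like part and an accompaniment-like part
    and slice each into per-beat spectrum tokens (cf. ST311/ST321)."""
    y, sr = librosa.load(path, sr=None)
    vocal_like, accomp_like = librosa.effects.hpss(y)   # stand-in for the patent's filters

    def tokens(signal):
        out = []
        for start, end in beat_bounds:                  # beat boundaries in seconds
            seg = signal[int(start * sr):int(end * sr)]
            out.append(np.abs(librosa.stft(seg)))       # spectrum-format token for this beat
        return out

    return tokens(vocal_like), tokens(accomp_like)
```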
5. Learning and singing magic improvement
Based on the foregoing, the present invention provides steps or methods of a modification model for a user learning to sing a sample song file, including, but not limited to, one or more of the following in combination:
The magic scheme also comprises a step of learning to sing the magic, and specifically comprises the following steps:
ST410 the learning magic singing comprises simple learning singing, and specifically comprises:
ST411, using the VEM processor and taking a bar (a combination of several beats) as the unit, the user sings all or part of the sample file more than once; the recordings are converted into several groups of user-file singing-voice VEM-Token2.1 sequences, and the user then selects, from these groups, the one he or she prefers most relative to the corresponding sample-file singing-voice VEM-Token1.1 sequence as the selected singing-voice VEM-Token2.1 sequence.
ST412, using ST310, ST311, ST312 steps, generating a beat captured and aligned VEM-token2.1 sequence from the selected singing voice VEM-token2.1 sequence.
ST420 is that the learning magic singing improvement also comprises the mixed learning singing, which comprises the following steps:
ST421, setting weights A and B for the VEM-Token1.1 sequence and the VEM-Token2.1 sequence, and computing the mixed singing result as the sequence A × VEM-Token1.1 + B × VEM-Token2.1, where A < 0.3 and A + B = 1.0.
The learning-to-sing modification is the first step in which a user imitates a sample file. In general, each lyric sentence and each beat need to be practiced and recorded many times, and the most satisfactory take is then selected in the learning-to-sing modification. Next, for the selected recording, the beat start and end points of the VEM-Token2.1 sequence are aligned with the beat start and end points of the VEM-Token1.1 sequence.
For mixed learning-to-sing, after beat capture and alignment, the data of the sample file's VEM-Token1.1 sequence for the corresponding beats are weighted at, for example, 5%, and the data of the user's learned-singing VEM-Token2.1 sequence are weighted at 95%, and the two are synthesized as the final user learned-singing file. The purpose of this step is to blend a small portion of the sample song into the user's result, so that the modification comes closer to the sample song. Whether 5% or some other value is used should be determined by audition; the ratio must not be too large, otherwise the difference between the sample and the user's voice becomes too apparent and the effect is poor.
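A minimal sketch of this weighted blend (ST421), assuming the sample and user sequences have already been beat-aligned and converted to spectrum frames of equal shape:

```python
import numpy as np

def mix_learned_singing(sample_tokens, user_tokens, a: float = 0.05):
    """ST421: blend each beat of the sample's VEM-Token1.1 with the user's
    VEM-Token2.1 using weights A and B, with A < 0.3 and A + B = 1.0
    (for example 5% sample + 95% user)."""
    assert a < 0.3, "the sample share must stay small or the result drifts away from the user's voice"
    b = 1.0 - a
    return [a * s + b * u for s, u in zip(sample_tokens, user_tokens)]
```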
6. Vocal cloning
Based on the foregoing, the present invention provides a method for voice-cloning magic modification, including, but not limited to, one or more of the following in combination:
The magic modification scheme further includes a voice-cloning step, specifically:
ST430: the voice-cloning magic modification includes the following steps:
ST431: query the VEM library; if VEM parameters for the user's voice style already exist, retrieve them.
If no VEM parameters exist for the user's voice style, or the existing parameters do not meet the user's requirements, a recording of the user singing a practice piece and a recording of the user speaking and reciting are collected. The VEM processor, using the singing voice filter, the emotional overtone filter and the emotional fluctuation filter, converts the recordings into a spectrum-format file carrying the user's voice style, obtains the VEM parameters of the user's voice style according to the VEM classification, and stores them in the VEM library.
ST432: according to the singing voice VEM-Token1.1 sequence of the sample file, use the speech recognition included in the VEM-Token model to recognize lyric spectrum 1 of the sample file; lyric spectrum 1 contains the lyrics together with the start-point and end-point positions of the beats in which they fall.
ST433: copy lyric spectrum 1 of the sample file to become lyric spectrum 2 of the user file; then, using a speech synthesizer and the VEM parameters of the user's voice style, complete the user's voice-cloning magic modification word by word according to lyric spectrum 2 of the user file and the beat start-point and end-point positions, producing the user's singing voice VEM-Token2.1 sequence.
The human voice, like a fingerprint, has personalized characteristics. They arise mainly from the structure of the singer's vocal tract and resonance cavities and show up acoustically as the fundamental frequency, the overtone pattern and their respective amplitudes in the spectrum-format file; these are recorded in the VEM parameters, and analysing them reveals the characteristics of the voice.
In principle, as long as the user's voice VEM parameters are in the VEM library, a speech synthesizer can combine them with the singing voice VEM-Token1.1 sequence of the sample file and lyric spectrum 2 to construct the user's singing voice VEM-Token2.1 sequence, thereby realizing the user's voice-cloning magic modification. Even if the VEM library does not yet contain the user's VEM parameters, the VEM processor can collect the user's voice characteristics and convert them into VEM parameters, so the voice-cloning scheme can still be realized.
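A minimal sketch of collecting the voice-style features mentioned above (fundamental frequency plus harmonic amplitudes) from a user recording. librosa.pyin is used as an assumed stand-in f0 tracker; the real VEM processor and its filters are not public, and the number of harmonics kept here is arbitrary.

import numpy as np
import librosa

def voice_style_features(path, n_harmonics=10):
    y, sr = librosa.load(path, sr=None)
    f0, voiced, _ = librosa.pyin(y, fmin=80, fmax=800, sr=sr)   # frame-wise pitch track
    f0_med = np.nanmedian(np.where(voiced, f0, np.nan))          # typical fundamental
    spec = np.abs(np.fft.rfft(y))
    freqs = np.fft.rfftfreq(len(y), 1.0 / sr)
    amps = []
    for k in range(1, n_harmonics + 1):                          # amplitude at k * f0
        idx = np.argmin(np.abs(freqs - k * f0_med))
        amps.append(float(spec[idx]))
    return f0_med, np.array(amps) / (max(amps) + 1e-9)           # normalized overtone profile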
7. Lyric magic
Based on the foregoing, the present invention provides a lyric magic modification method, including, but not limited to, one or more of the following in combination:
The magic modification scheme further includes a lyric magic modification step, specifically:
ST500: when lyric spectrum 2 is inconsistent with lyric spectrum 1, or the user needs to modify the lyrics, lyric magic modification is executed, specifically including:
ST510: decompose lyric spectrum 2 and lyric spectrum 1 according to semantics and grammar into lyric sentence 2 and lyric sentence 1, and proceed as follows:
ST511: when lyric sentence 2 and lyric sentence 1 have the same number of words, lyric spectrum 2 is copied from lyric spectrum 1 beat by beat, and each word of lyric sentence 2 is filled into its corresponding beat position one by one, or is modified manually by the user, forming the modified lyric spectrum 2.
ST512: when lyric sentence 2 and lyric sentence 1 have different numbers of words, the user modifies lyric sentence 2 manually and fills its words into the corresponding beat positions one by one, forming the modified lyric spectrum 2.
ST513: use a speech synthesizer, according to the VEM parameters of the user's voice style, lyric spectrum 2 and the beat start-point and end-point positions, to complete the user's voice-cloning magic modification word by word, producing the user's singing voice VEM-Token2.1 sequence.
Lyric magic modification is one of the most commonly used items in the magic modification model and one of the modifications users most often apply to a sample song. In most cases the change to lyric spectrum 2 is only a local modification of the original text, made with the meaning of the lyrics, the rhyme and the fit of the replacement words considered together, rather than on impulse.
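A small sketch of the word-to-beat filling in ST511/ST512: when the new line has the same word count as the original, each word inherits the beat slot of the word it replaces; otherwise placement is left to manual editing, as described above. The data layout is an illustrative assumption.

def fill_lyrics(beat_slots, old_words, new_words):
    """beat_slots: [(start, end), ...] taken from lyric spectrum 1."""
    if len(new_words) != len(old_words):
        raise ValueError("word counts differ - manual placement required (ST512)")
    return [
        {"word": w, "start": s, "end": e}      # one entry per beat of lyric spectrum 2
        for w, (s, e) in zip(new_words, beat_slots)
    ]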
8. Pitch calibration, decorative sounds and other magic modifications
On the basis of the foregoing, the present invention also includes, but is not limited to, steps or methods of one or more combinations of the following:
The magic modification scheme further includes the steps of pitch calibration, decorative sound, beat length, tempo, beat strength, timbre, emotion, video, free and multi-sample magic modification, together with what-you-hear-is-what-you-get real-time monitoring, specifically as follows:
ST520: pitch calibration magic modification: according to twelve-tone equal temperament in music theory, the fundamental frequency of a sound must equal one of the node frequencies of the equal-tempered scale; a sound frequency lying between two adjacent node frequencies must be adjusted up or down to a node frequency.
Pitch calibration is especially important for non-professional singers. In music theory, pitch is the fundamental frequency of a note and accuracy of pitch is called intonation; for professional singers it is rarely a problem thanks to long-term training. Pitch calibration magic modification here covers two aspects: automatic calibration of the overall pitch level, and pronunciation calibration within beats selected by the user.
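A minimal sketch of snapping a measured fundamental frequency to the nearest twelve-tone equal-temperament node. The reference pitch A4 = 440 Hz is an assumption; the text above does not fix it.

import numpy as np

def snap_to_12tet(freq_hz, a4=440.0):
    """Move freq_hz up or down to the nearest equal-temperament node frequency."""
    if freq_hz <= 0:
        return freq_hz
    semitones = 12.0 * np.log2(freq_hz / a4)       # distance from A4 in semitones
    return a4 * 2.0 ** (round(semitones) / 12.0)   # back to Hz on the 12-TET grid

# e.g. snap_to_12tet(445.0) -> 440.0 (A4), snap_to_12tet(455.0) -> ~466.16 (A#4)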
ST530: front decorative sound magic modification: within a beat, when the beat start point of the user file's singing voice VEM-Token2.1 lags behind the beat start point of the corresponding sample file's singing voice VEM-Token1.1 by no more than 1/2 beat on the time axis, a decorative sound is inserted before the VEM-Token2.1 beat start point as compensation so that the start points align; the decorative sounds include vibrato, glissando, sustained tones and breath sounds.
ST540: rear decorative sound magic modification: within a beat, when the beat end point of the user file's singing voice VEM-Token2.1 comes earlier than the beat end point of the corresponding sample file's singing voice VEM-Token1.1 by no more than 1/2 beat on the time axis, a decorative sound or a rest is added after the VEM-Token2.1 beat end point as compensation so that the end points align.
The choice of front or rear decorative sound type generally depends on the specific situation, such as the length of the time gap to be compensated and the emotion VEM parameters of the beat, and is made by the user according to their own understanding of the song.
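The following sketch only captures the decision logic of ST530/ST540: gaps of at most half a beat are padded with an ornament (or a rest at the end), while larger mismatches fall through to the beat-length step. Ornament synthesis itself is omitted, and the function and field names are illustrative assumptions.

def ornament_plan(user_start, user_end, sample_start, sample_end, beat_len):
    plan = []
    lag = user_start - sample_start               # user comes in late
    if 0 < lag <= beat_len / 2:
        plan.append(("front_ornament", lag))      # vibrato / glissando / breath, etc.
    lead = sample_end - user_end                  # user finishes early
    if 0 < lead <= beat_len / 2:
        plan.append(("rear_ornament_or_rest", lead))
    return plan                                   # empty plan: use beat-length modification instead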
ST550: beat length magic modification: when the length of a beat of the user file's singing voice VEM-Token2.1 does not match the length of the corresponding beat of the sample file's singing voice VEM-Token1.1, a time stretching algorithm step or a decorative sound magic modification step is used to compress or extend the VEM-Token2.1 beat so that it aligns with the corresponding VEM-Token1.1 beat.
ST560: tempo magic modification: when the overall tempo of the user file needs to be sped up or slowed down, a time stretching algorithm step is used to compress or extend the tempo of the user file's singing voice and accompaniment synchronously.
ST570: beat strength magic modification: for the singing voice VEM-Token2.1 of the user file and/or the accompaniment VEM-Token2.2 of the user file, apply, according to the user's needs, different processing based on a base layer and an emotion layer, wherein:
ST571: the base layer includes volume dynamics processing and envelope shaping steps, which adjust VEM-Token2.1 and VEM-Token2.2 against thresholds.
ST572: the emotion layer includes intelligent dynamic control based on AI/machine learning, specifically training a model to intelligently recognize the beats and instruments in the audio and automatically generate dynamic processing parameters according to preset emotion labels or target loudness curves, with the training results stored in the VEM library.
ST580: when the beat length of VEM-Token2.1 is changed, the beat length of VEM-Token2.2 must be checked synchronously; when the lengths are inconsistent, time-stretching capture is used and VEM-Token2.1 and VEM-Token2.2 are aligned.
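A base-layer sketch for ST571: a per-beat gain plus a simple attack/release envelope. The gain, attack and release values are placeholders, and the learned emotion-layer parameters of ST572 are not reproduced here.

import numpy as np

def shape_beat(beat_audio, sr, gain_db=3.0, attack_s=0.01, release_s=0.05):
    """Apply an attack/release envelope and a gain to one beat of audio."""
    n = len(beat_audio)
    env = np.ones(n)
    a = min(n, int(attack_s * sr))
    r = min(n, int(release_s * sr))
    if a > 0:
        env[:a] = np.linspace(0.0, 1.0, a)        # fade in (attack)
    if r > 0:
        env[n - r:] = np.linspace(1.0, 0.0, r)    # fade out (release)
    return beat_audio * env * (10.0 ** (gain_db / 20.0))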
Beat and rhythm are basic physical quantities in music and among the earliest musical elements humans invented: music grew out of the rhythmic work chants of collective labour and the synchronized dance steps that expressed joy. Magic modification of beat and rhythm is therefore very important.
It should be noted that a time stretching algorithm is a computation that changes only the length of the audio; it changes neither the fundamental frequency (pitch) nor the structure of the harmonic components of the sound. Time stretching algorithms include, but are not limited to, the following three families (a brief usage sketch follows this list):
(1) Time-domain algorithms: SOLA (Synchronous Overlap-Add) and its variants (e.g. WSOLA, SOLA-FS).
The basic principle is to divide the input signal into short, overlapping segments, time-scale each segment by compressing or expanding the overlap region between segments, find the best splice point based on waveform similarity so as to minimize distortion, and overlap-add the processed segments back into the output signal.
Their characteristics are fast computation and suitability for real-time, low-resource applications (such as tape-speed simulators and simple pitch shifters); they handle transients (drum hits, note onsets) poorly, easily producing clicks and a smeared, reverberant feel, so their quality on music is not ideal. Representative applications include early telephone voicemail speed control and some simple audio plug-ins.
(2) Frequency-domain algorithms (parametric / phase vocoder): currently the most popular and widely used family, based on the Short-Time Fourier Transform (STFT). They mainly include, but are not limited to:
The phase vocoder, whose sound quality is much better than the time-domain methods, especially for harmonically rich audio such as voice and piano; its drawback is that percussive audio (e.g. a snare drum) is still handled imperfectly, producing the characteristic "robotic" or "reverberant" artifacts.
Transient-aware enhanced phase vocoders, which greatly improve quality on drums, bass and similar material; representative examples are iZotope Radius, Serato Pitch 'n Time, and the "Complex" and "Complex Pro" modes of Ableton Live.
(3) Data-driven algorithms based on machine learning / deep learning: the current frontier of research, aiming to produce more natural time-stretching results by learning from large amounts of data. They mainly include, but are not limited to:
Generative-model-based methods, whose principle is that the model learns the latent distribution of the audio signal: given an input, it generates the best-sounding output that matches the target duration. Their strength is great potential, in theory yielding the most natural results with the fewest artifacts and even "imagining" plausible detail to fill the stretched time; their weaknesses are the need for massive training data, heavy compute, difficulty running in real time, and the possibility of uncontrollable hallucinated artifacts.
Neural-vocoder-based methods, whose principle is not to process the waveform directly but to first extract intermediate representations of the audio (such as the mel spectrogram, F0 pitch and harmonic information) and then resynthesize. Their sound quality is generally far better than a traditional phase vocoder, because the neural vocoder is trained on large amounts of high-quality audio and can reconstruct more natural sound textures; the drawbacks are dependence on the quality of the neural vocoder and, again, generally heavy computation.
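As a practical illustration of a phase-vocoder-style stretch, the sketch below uses librosa.effects.time_stretch, which changes duration without changing pitch, exactly the property the beat-length and tempo steps above require. Using librosa is an assumption; the patent does not name a library.

import librosa

def stretch_to_length(y, sr, target_seconds):
    """Stretch or compress audio y to target_seconds while preserving pitch."""
    current = len(y) / sr
    rate = current / target_seconds        # rate > 1 shortens, rate < 1 lengthens
    return librosa.effects.time_stretch(y, rate=rate)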
ST590: timbre magic modification, specifically including:
ST591: for the singing voice VEM-Token1.1 sequence of the sample file and the singing voice VEM-Token2.1 sequence of the user file, use filters covering multiple groups of higher harmonics of the fundamental frequency to decompose the harmonic frequencies F1, F2, F3, …, Fn and the overtone amplitudes A1, A2, A3, …, An, where n is the harmonic order and n is less than 50.
ST592: use the volume dynamics processing and envelope shaping steps to amplify or attenuate the overtone amplitude of one or more designated harmonic components among A1, A2, A3, …, An in the VEM-Token2.1 sequence, thereby changing the timbre of the user file.
ST593: alternatively, query the AI/machine-learning timbre dynamic processing parameters in the VEM library and adjust them to change the timbre of the user file.
Timbre is mainly the personalized characteristic of the user's own voice, similar to a voiceprint. It arises from the structure of the human resonance cavities and shows up acoustically as the fundamental frequency, the overtone pattern and their respective amplitudes in the spectrum-format file, all of which are recorded in the VEM parameters; analysing them reveals the characteristics of the voice.
In principle, as long as the user's VEM parameters are in the VEM library, the user's singing voice VEM-Token2.1 sequence can be constructed from the singing voice VEM-Token1.1 sequence of the sample file, realizing the user's timbre magic modification. Even if the user's VEM parameters are not yet in the VEM library, the VEM processor can collect the user's voice characteristics and convert them into VEM parameters, so the timbre scheme can still be realized.
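A sketch matching ST591/ST592: scale the amplitudes of chosen harmonics of a known fundamental in the spectrum. A real implementation would work frame by frame on an STFT; a single FFT and the fixed band half-width used here are simplifying assumptions.

import numpy as np

def scale_harmonics(y, sr, f0, gains):
    """gains: {harmonic_index (1-based, < 50): linear gain to apply}."""
    spec = np.fft.rfft(y)
    freqs = np.fft.rfftfreq(len(y), 1.0 / sr)
    width = f0 / 4.0                           # half-width of each harmonic band
    for k, g in gains.items():
        band = np.abs(freqs - k * f0) < width  # bins around the k-th harmonic
        spec[band] *= g
    return np.fft.irfft(spec, n=len(y))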
ST5A0: emotion magic modification, which specifically includes: according to the user's emotion modification needs, query the VEM library and, taking the VEM-Token2.1 sequence as the independent variable, adjust the VEM parameters covering the emotion, vocal-style and voice-style requirements respectively to obtain the emotion modification result.
Emotion magic modification is a key point of the invention. Based on the VEM-Token model, emotion is divided into multiple classes (the VEM classification), the degree of each basic emotion is recorded in the VEM parameters, and emotion points are expressed in the corresponding coordinate system.
For example, the opposed and mutually related emotion pairs "love-hate", "affection-enmity" and "joy-sorrow" are represented by an orthogonal three-dimensional X, Y, Z coordinate system, with ranges of -100% to 100% on the X axis ("love-hate"), -100% to 100% on the Y axis ("joy-sorrow") and -100% to 100% on the Z axis ("affection-enmity"), the three axes being orthogonal and mutually related, as shown in FIG. 2. For example, the emotion parameters recorded at the emotion point E(X, Y, Z) of one beat are (80%, -20%, 50%). In this way, emotion is recorded precisely.
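A minimal sketch of the emotion point E(X, Y, Z) on the three opposed pairs above, with values in [-1, 1] (percentages divided by 100), plus a simple interpolation that a user-facing slider could drive. The pair names follow the example in the text; the class and function names are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class EmotionPoint:
    love_hate: float         # X axis
    joy_sorrow: float        # Y axis
    affection_enmity: float  # Z axis

def blend(sample: EmotionPoint, user: EmotionPoint, t: float) -> EmotionPoint:
    """t = 0 keeps the user's emotion unchanged, t = 1 copies the sample's."""
    lerp = lambda a, b: (1 - t) * a + t * b
    return EmotionPoint(
        lerp(user.love_hate, sample.love_hate),
        lerp(user.joy_sorrow, sample.joy_sorrow),
        lerp(user.affection_enmity, sample.affection_enmity),
    )

# the example point above would be EmotionPoint(0.80, -0.20, 0.50)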
It is specifically pointed out here that the VEM coordinate system includes, but is not limited to:
Independent emotions: for an emotion that is independent and uncorrelated with others, a unidirectional one-dimensional coordinate axis is constructed, with the lowest point of the emotion as the coordinate origin and its highest point as the coordinate maximum.
Opposite emotion pairs: two mutually opposed emotion components form a bidirectional one-dimensional coordinate axis, with the midpoint of the pair as the origin, the highest point of the positive emotion as the positive maximum and the highest point of the negative emotion as the negative maximum.
Associated opposite emotion groups: one or more mutually related opposite emotion pairs whose coordinate origins are superposed and aligned; their respective bidirectional one-dimensional axes are made hyper-orthogonal and separated by a hyperplane, with the positive emotions on one side of the hyperplane and the negative emotions on the other, constructing a hyper-orthogonal coordinate system in which the highest point of each positive emotion is the positive coordinate maximum and the highest point of each negative emotion is the negative coordinate maximum.
It should be noted that independent emotions, opposite emotion pairs and associated opposite emotion groups are sometimes relative: for vocal works of different styles and different cultural backgrounds the classification of modalities is not always the same. In the extreme case, independent emotions, opposite emotion pairs and associated opposite emotion groups are all defined with respect to a single vocal work.
It should be emphasized that by "hyper-orthogonal coordinate system" we mean rectangular coordinate axes of more than three dimensions arranged in one system, e.g. four, five or more dimensions; since such a system cannot be drawn directly on paper, it is described and recorded mathematically as a hyperspace. Moreover, these hyper-dimensional axes need not be strictly orthogonal: they may intersect at other angles, or even be curved coordinates in the sense of Riemannian geometry.
FIG. 3 is a schematic diagram of part of the user operation interface. In the figure, three colour bands represent the adjustment sections of the three emotion pairs; the green ring on each band is a sliding cursor for the emotion VEM parameter, and dragging it left or right adjusts the value of the corresponding pair, namely "love-hate", "joy-sorrow" and "affection-enmity". The human-machine interaction interface designed here includes, but is not limited to, one-dimensional and two-dimensional screens, three-dimensional space and even multi-dimensional interaction interfaces; the ways of editing VEM parameters include, but are not limited to, dragging cursors, numeric input and colour editing, as well as a what-you-hear-is-what-you-get real-time monitoring mode. The editing results and editing modes are stored centrally in the VEM library.
ST5B0: video magic modification: when the user file needs to be matched with video, the content and rhythm of the video picture are adjusted according to the VEM parameters and the rhythm to suit the needs of the user file.
Because the invention is based on the VEM-Token model, this mode can be understood as "audio-to-text": a subsequent "text-to-image" model can therefore be connected, so that the textual Token information generated from the audio drives the text-to-image model to produce static images and dynamic video, and the user can also attach video manually to produce the video magic modification.
ST5C0: free magic modification: the user modifies the content and rhythm of VEM-Token2.1, VEM-Token2.2 and the video picture according to the VEM parameters and one or more modalities, to suit the needs of the user file.
Free magic modification means modifying the user's singing voice, the user's accompaniment and the video picture according to the user's own intentions and wishes, in manual or semi-automatic mode.
ST5D0: multi-sample magic modification: the user selects more than one sample file and, within the VEM parameters, assigns some parameters to some sample files and other parameters to other sample files, and the content and rhythm of VEM-Token2.1, VEM-Token2.2 and the video picture are generated by modification to suit the needs of the user file.
Multi-sample magic modification is performed on the basis of more than one sample file: for example, the singing voice of the user's song may reference both a folk-song-style sample file and a more refined tenor-style sample file, while the accompaniment references a sample file accompanied by traditional national instruments; three sample files are thus referenced in total, realizing the multi-sample magic modification.
ST5E0: what-you-hear-is-what-you-get real-time monitoring, which specifically includes the user monitoring, judging and scoring the modification result in real time, and generating from the modification process and result the VEM-Token2 sequence of the result file, which is submitted to and stored in the VEM library.
Real-time monitoring is an important function that helps users modify efficiently: while the user performs a magic modification, the result can be heard and seen immediately, giving a what-you-hear-is-what-you-get effect.
The user is again reminded that the human-machine interaction interface designed here includes, but is not limited to, one-dimensional and two-dimensional screens, three-dimensional space and even multi-dimensional interaction interfaces, that VEM parameter editing includes, but is not limited to, dragging cursors, numeric input and colour editing as well as the what-you-hear-is-what-you-get real-time monitoring mode, and that the editing results and modes are stored centrally in the VEM library.
9. Member management
Based on the foregoing, the present invention further includes, but is not limited to, member management for users, specifically including one or more of the following steps or methods in combination:
The magic model also comprises member management, and specifically comprises the following steps:
ST600, applying for members to the magic model according to the needs of the user, establishing a member file, and storing the member file in a VEM library.
ST610, the member archive comprises user information, sample file information, user file information, VEM parameters, voiceprint encryption and voiceprint decryption, wherein the encryption and decryption keys comprise member signatures, member images, member videos and member VEM parameters.
ST620, member management includes advancing, backing, rolling back, adding, deleting, inquiring, modifying, storing and maintaining operation steps in the magic process.
ST630, member management also comprises instant modification, instant monitoring, real-time scoring, supervised learning, reinforcement learning, rewarding and punishment of user files, and the results are stored in a VEM library.
This design facilitates a cloud-plus-terminal deployment (fixed or mobile terminals), in which the VEM library and VEM parameters can be stored in the cloud centre and on the terminal, or on blockchain nodes in a blockchain mode. Managing users as members in this way also makes intellectual-property management convenient for them.
10. Others
Based on the foregoing solution, the magic model of the present invention further includes, but is not limited to, the following expansion steps or methods:
ST700: the magic modification model further includes an application system for mobile terminals and an application system for the PC, as well as a cloud-mode application system and a blockchain application system.
ST800: the magic modification model further includes a supporting hardware system comprising a communication interface, a recording module, a tuning module, a playback module, an encryption module and a decryption module, together with an interface to the Douyin (TikTok) system, an interface to Video Accounts (WeChat Channels) and an AI karaoke support system.
ST900: by providing synchronization signals for subsequent large-model applications, AI systems including DeepSeek, Kimi.AI and ChatGPT are connected to form an AI Agent.
STA00: the magic modification model further includes interface protocols, providing the hardware- and network-based MIDI protocol, the MSC extension protocol and the OSC network protocol; the transport-layer AES3/PDIF and MADI protocols; and network audio transport protocols such as Dante, AVB/TSN and AES67.
Here MIDI is the Musical Instrument Digital Interface, MSC is MIDI Show Control, OSC is Open Sound Control, AES3/PDIF is a point-to-point digital audio transmission standard, MADI is the Multichannel Audio Digital Interface, Dante is an auto-discovery, low-latency, high-channel-count audio networking protocol developed by Audinate, AVB/TSN is a low-latency audio/video transport protocol, and AES67 is an interoperability standard built on existing network technology.
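As a small illustration of the MIDI support named above, the sketch below emits a MIDI start/clock/stop sequence that a downstream system (for example an AI karaoke rig) could lock to. The mido package and the port name are assumptions; the patent only lists MIDI, MSC, OSC and the other protocol families as supported.

import time
import mido

def send_midi_clock(port_name, bpm=120.0, beats=4):
    """Send a MIDI start message, 24 clock pulses per beat, then stop."""
    interval = 60.0 / bpm / 24.0            # MIDI clock runs at 24 pulses per quarter note
    with mido.open_output(port_name) as out:
        out.send(mido.Message('start'))
        for _ in range(int(beats * 24)):
            out.send(mido.Message('clock'))
            time.sleep(interval)
        out.send(mido.Message('stop'))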
It is specifically noted that, in this specification, a variable name (for example VEM-Token followed by a numeric suffix) follows the usual convention in the computer arts: the content of the same variable (for example VEM-Token1.1) is not necessarily identical at every step, because its value is recomputed as the steps proceed, as will be understood by those skilled in the art.

Claims (10)

1. A construction method of a VEM-Token vocal emotion multimodal magic modification model, characterized by comprising:
ST100: collect sample files and user files; based on the VEM-Token model, use beat capture and VEM-Token segmentation to obtain the VEM-Token1 sequence of the sample file and the VEM-Token2 sequence of the user file; based on the VEM-Token1 sequence, apply beat alignment to all VEM-Token2 sequences to generate the VEM-Token2 sequence;
ST200: based on the VEM parameters included in the VEM-Token model, identify the VEM parameters of the VEM-Token1 sequence; the user determines the magic modification scheme; process the VEM parameters of the VEM-Token2 sequence and generate, by magic modification, the VEM-Token2 sequence of a result file whose style matches the VEM-Token1 sequence and conforms to the modification scheme.
2. The method according to claim 1, characterized by specifically comprising:
ST110: for sample files that do not include loop segments, use beat capture to mark the start and end points of the beats and generate the entire VEM-Token1 sequence;
ST120: for sample files that include loop segments, from the second loop segment until all loop segments end, after beat capture, perform the start-point fine-tuning step and the end-point fine-tuning step to achieve beat alignment within the loop segments and generate the entire VEM-Token1 sequence;
ST130: based on the VEM-Token1 sequence, perform the beat capture and beat alignment steps, including the start-point fine-tuning step and the end-point fine-tuning step, on all VEM-Token2 sequences to process the corresponding beats and generate the VEM-Token2 sequence;
ST140: the user files specifically include: user files produced by the user singing in imitation of the sample file, user files produced by collecting the user's voice characteristics and cloning entirely according to the sample file, and result files produced by mixing user files from partial singing with user files from partial cloning.
3. The method according to claim 2, characterized by further comprising:
ST210: the VEM-Token model further includes a VEM library and a VEM processor;
ST211: the VEM parameters include the VEM classification recording multiple modalities of vocal emotion, the VEM functions and the established VEM coordinate system;
ST212: the modalities include one or a combination of lyrics, singing voice, accompaniment, commentary, vocal style, music, emotional basis, accompanying instruments, video, images and ambient sound;
ST213: the emotional basis includes one or a combination of joy, sorrow, sadness, anger, fear, disgust, surprise, calm, longing, anticipation, trust, love, hate, affection and enmity;
ST214: the vocal styles include one or a combination of folk singing, popular singing, rock singing, Western singing, pop singing, original-song singing and operatic singing;
ST215: the voice style includes the combination of fundamental frequency and overtones produced by the user's vocal cords and resonance cavities when singing, speaking and reciting, and is the inherent characteristic distinguishing the user from other people;
ST216: the VEM classification collects a variety of vocal files; human vocal experts or learning algorithms judge the emotion of the vocal files in singing voice and accompaniment, and supervised learning and deep learning are used to train the VEM functions to obtain VEM parameters, which are added to the VEM library;
ST220: the VEM processor provides the operating interface for user-machine interaction and completes the concrete implementation of the magic modification scheme.
4. The method according to claim 3, wherein the magic modification scheme includes the steps of singing voice, accompaniment, emotional overtone and emotional fluctuation magic modification, specifically comprising:
ST310: the scheme includes singing voice magic modification, specifically:
ST311: the VEM processor pre-processes the sample file and the user file, sets the singing voice filter, converts both files into spectrum-format files, and separates the singing voice VEM-Token1.1 sequence of the sample file and the singing voice VEM-Token2.1 sequence of the user file;
ST312: according to the beat capture and beat alignment models, capture the beat start and end points of all VEM-Token1.1 sequences and all VEM-Token2.1 sequences, and use the start-point and end-point fine-tuning models to align the beat start and end points of the VEM-Token2.1 sequence with those of the VEM-Token1.1 sequence; and/or,
ST320: the scheme further includes accompaniment magic modification, specifically:
ST321: the VEM processor pre-processes the sample file and the user file, sets the accompaniment filter, converts both files into spectrum-format files, and separates the accompaniment VEM-Token1.2 sequence of the sample file and the accompaniment VEM-Token2.2 sequence of the user file;
ST322: according to the beat capture and beat alignment models, capture the beat start and end points of all VEM-Token1.2 sequences and all VEM-Token2.2 sequences, and use the start-point and end-point fine-tuning models to align the beat start and end points of the VEM-Token2.2 sequence with the VEM-Token1.2 sequence; and/or,
ST330: the scheme further includes emotional overtone magic modification, specifically:
ST331: the VEM processor pre-processes the sample file and the user file, sets the emotional overtone filter, converts both files into spectrum-format files, and separates the emotional overtone VEM-Token1.3 sequence of the sample file and the emotional overtone VEM-Token2.3 sequence of the user file;
ST332: according to the beat capture and beat alignment models, capture the beat start and end points of all VEM-Token1.3 sequences and all VEM-Token2.3 sequences, and use the start-point and end-point fine-tuning models to align the beat start and end points of the VEM-Token2.3 sequence with the VEM-Token1.3 sequence; and/or,
ST340: the scheme further includes emotional fluctuation magic modification, specifically:
ST341: the VEM processor pre-processes the sample file and the user file, sets the emotional fluctuation filter, converts both files into spectrum-format files, and separates the emotional fluctuation VEM-Token1.4 sequence of the sample file and the emotional fluctuation VEM-Token2.4 sequence of the user file;
ST342: according to the beat capture and beat alignment models, capture the beat start and end points of all VEM-Token1.4 sequences and all VEM-Token2.4 sequences, and use the start-point and end-point fine-tuning models to align the beat start and end points of the VEM-Token2.4 sequence with the VEM-Token1.4 sequence.
5. The method according to claim 4, wherein the magic modification scheme further comprises a learning-to-sing magic modification step, specifically comprising:
ST410: the learning-to-sing modification includes simple learning singing, specifically:
ST411: using the VEM processor, with a bar formed of several beats of all or part of the sample file as the unit, the user learns to sing one or more times, the takes being recorded and converted into several groups of VEM-Token2.1 sequences of the user file, and the user then selects from these groups the VEM-Token2.1 sequence that best matches the VEM-Token1.1 sequence as the selected VEM-Token2.1 sequence;
ST412: using steps ST310, ST311 and ST312, generate a beat-captured and aligned VEM-Token2.1 sequence from the selected VEM-Token2.1 sequence; and/or,
ST420: the learning-to-sing modification further includes mixed learning singing, specifically:
ST421: set weights A and B for the VEM-Token1.1 sequence and the VEM-Token2.1 sequence, and compute the mixed learning-singing VEM-Token2.1 sequence as A × VEM-Token1.1 sequence + B × VEM-Token2.1 sequence, where A is less than 0.3 and A + B = 1.0.
6. The method according to claim 4, wherein the magic modification scheme further comprises a voice-cloning magic modification step, specifically comprising:
ST430: the scheme further includes voice-cloning magic modification, specifically:
ST431: query the VEM library; if VEM parameters for the user's voice style exist, obtain them; or,
if no VEM parameters for the user's voice style exist, or the existing ones do not meet the user's requirements, collect a recording of the user singing a practice piece and a recording of the user speaking and reciting; the VEM processor, using the singing voice filter, the emotional overtone filter and the emotional fluctuation filter, converts the recordings into a spectrum-format file carrying the user's voice style, obtains the VEM parameters of the user's voice style according to the VEM classification, and stores them in the VEM library;
ST432: based on the VEM-Token1.1 sequence, use the speech recognition included in the VEM-Token model to recognize lyric spectrum 1 of the sample file, lyric spectrum 1 including the lyrics and the start- and end-point positions of the beats in which they lie;
ST433: copy lyric spectrum 1 of the sample file to become lyric spectrum 2 of the user file; according to the VEM parameters of the user's voice style, use a speech synthesizer to complete the user's voice cloning word by word according to lyric spectrum 2 of the user file and its beat start- and end-point positions, producing the VEM-Token2.1 sequence.
7. The method according to claim 6, wherein the magic modification scheme further comprises a lyric magic modification step, specifically comprising:
ST500: when lyric spectrum 2 is inconsistent with lyric spectrum 1 or the user needs to modify it, perform lyric magic modification, specifically including:
ST510: decompose lyric spectrum 2 and lyric spectrum 1 according to semantics and grammar into lyric sentence 2 and lyric sentence 1, and proceed as follows:
ST511: when lyric sentence 2 and lyric sentence 1 have the same number of words, copy lyric spectrum 2 from lyric spectrum 1 according to the beat, and fill each word of lyric sentence 2 into its corresponding beat position one by one, or let the user modify it manually, forming the modified lyric spectrum 2;
ST512: when lyric sentence 2 and lyric sentence 1 have different numbers of words, the user modifies them manually, filling the words of lyric sentence 2 into the corresponding beat positions one by one to form the modified lyric spectrum 2;
ST513: use a speech synthesizer, according to the VEM parameters of the user's voice style, lyric spectrum 2 and the beat start- and end-point positions, to complete the user's voice cloning word by word, producing the VEM-Token2.1 sequence.
8. The method according to claim 5 or 7, wherein the magic modification scheme further includes the steps of pitch calibration, decorative sound, beat length, tempo, beat strength, timbre, emotion, video, free and multi-sample magic modification, and what-you-hear-is-what-you-get real-time monitoring, respectively comprising:
ST520: pitch calibration magic modification: according to twelve-tone equal temperament in music theory, the fundamental frequency of a sound must equal a node frequency of the equal-tempered scale, and a sound frequency between two adjacent node frequencies must be adjusted up or down to a node frequency;
ST530: front decorative sound magic modification: within a beat, when the beat start point of VEM-Token2.1 lags behind the beat start point of the corresponding VEM-Token1.1 by no more than 1/2 beat on the time axis, a decorative sound is used before the VEM-Token2.1 beat start point as compensation so that the start points align, the decorative sounds including vibrato, glissando, sustained tones and breath sounds;
ST540: rear decorative sound magic modification: within a beat, when the beat end point of VEM-Token2.1 is ahead of the beat end point of the corresponding VEM-Token1.1 by no more than 1/2 beat on the time axis, a decorative sound or rest is used after the VEM-Token2.1 beat end point as compensation so that the end points align;
ST550: beat length magic modification: when the length of a VEM-Token2.1 beat does not match that of the corresponding VEM-Token1.1 beat, use a time stretching algorithm step or a decorative sound magic modification step to compress or extend the VEM-Token2.1 beat so that it aligns with the corresponding VEM-Token1.1 beat;
ST560: tempo magic modification: when the overall tempo of the user file needs to be sped up or slowed down, use a time stretching algorithm step to compress or extend the tempo of the user file's singing voice and accompaniment synchronously;
ST570: beat strength magic modification: for VEM-Token2.1 and/or VEM-Token2.2, apply, according to the user's needs, different processing based on a base layer and an emotion layer, wherein:
ST571: the base layer includes volume dynamics processing and envelope shaping steps, which adjust VEM-Token2.1 and VEM-Token2.2 against thresholds;
ST572: the emotion layer includes intelligent dynamic control based on AI/machine learning, specifically training a model to intelligently recognize the beats and instruments in the audio and automatically generate dynamic processing parameters according to preset emotion labels or target loudness curves, the training results being stored in the VEM library;
ST580: when the beat length of VEM-Token2.1 is changed, synchronously check the beat length of VEM-Token2.2; when they are inconsistent, use time-stretching capture and align VEM-Token2.1 and VEM-Token2.2;
ST590: timbre magic modification, specifically including:
ST591: for the VEM-Token1.1 and VEM-Token2.1 sequences, use filters covering multiple groups of higher harmonics of the fundamental frequency to decompose the harmonic frequencies F1, F2, F3, …, Fn and the overtone amplitudes A1, A2, A3, …, An, where n is the harmonic order and n is less than 50;
ST592: use the volume dynamics processing and envelope shaping steps to amplify or attenuate the overtone amplitude of one or more designated harmonic components among A1, A2, A3, …, An in the VEM-Token2.1 sequence, so as to change the timbre of the user file; or,
ST593: query the AI/machine-learning timbre dynamic processing parameters in the VEM library and adjust them to change the timbre of the user file;
ST5A0: emotion magic modification, specifically including: according to the user's emotion modification needs, query the VEM library and, with the VEM-Token2.1 sequence as the independent variable, adjust the VEM parameters covering the emotion, vocal-style and voice-style requirements respectively to obtain the emotion modification result;
ST5B0: video magic modification: when the user file needs to be matched with video, adjust the content and rhythm of the video picture according to the VEM parameters and the rhythm to suit the needs of the user file;
ST5C0: free magic modification: the user modifies VEM-Token2.1, VEM-Token2.2 and the content and rhythm of the video picture according to the VEM parameters and one or more modalities, to suit the needs of the user file;
ST5D0: multi-sample magic modification: the user selects more than one sample file and, within the VEM parameters, assigns some parameters to some sample files and other parameters to other sample files, and the content and rhythm of VEM-Token2.1, VEM-Token2.2 and the video picture are generated by modification to suit the needs of the user file;
ST5E0: what-you-hear-is-what-you-get real-time monitoring, specifically including the user monitoring, judging and scoring the modification result in real time, generating from the modification process and result the VEM-Token2 sequence of the result file, and submitting it for storage in the VEM library.
9. The method according to claim 8, wherein the magic modification model further includes member management, specifically comprising:
ST600: according to the user's needs, apply for membership of the magic modification model, establish a member file and store it in the VEM library;
ST610: the member file includes user information, sample file information, user file information, VEM parameters, voiceprint encryption and voiceprint decryption, wherein the encryption and decryption keys include the member signature, member image, member video and member VEM parameters;
ST620: member management includes the operations of advancing, stepping back, rolling back, adding, deleting, querying, modifying, storing and retaining steps during the magic modification process;
ST630: member management further includes instant modification, instant monitoring, real-time scoring, supervised learning, reinforcement learning, reward and punishment of user files, with the results stored in the VEM library.
10. The method according to claim 9, wherein the magic modification model further comprises:
ST700: a mobile-terminal application system and a PC application system, as well as a cloud-mode application system and a blockchain application system;
ST800: a supporting hardware system including a communication interface, a recording module, a tuning module, a playback module, an encryption module and a decryption module, including an interface to the Douyin system and an interface to Video Accounts (WeChat Channels), and supporting an AI karaoke system;
ST900: by providing synchronization signals for subsequent large-model applications, access to AI systems including DeepSeek, Kimi.AI and ChatGPT to form an AI Agent;
STA00: interface protocols, providing the hardware- and network-based MIDI protocol, MSC extension protocol and OSC network protocol, the transport-layer AES3/PDIF and MADI protocols, and network audio transport protocols such as Dante, AVB/TSN and AES67.
CN202511340091.8A 2025-09-19 Construction Method of VEM-Token Vocal Emotion Multimodal Modification Model Active CN120853611B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202511340091.8A CN120853611B (en) 2025-09-19 Construction Method of VEM-Token Vocal Emotion Multimodal Modification Model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202511340091.8A CN120853611B (en) 2025-09-19 Construction Method of VEM-Token Vocal Emotion Multimodal Modification Model

Publications (2)

Publication Number Publication Date
CN120853611A true CN120853611A (en) 2025-10-28
CN120853611B CN120853611B (en) 2025-12-23


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW201040939A (en) * 2009-05-12 2010-11-16 Chunghwa Telecom Co Ltd Method for generating self-recorded singing voice
JP2017040858A (en) * 2015-08-21 2017-02-23 ヤマハ株式会社 Aligning device and program
CN120126506A (en) * 2025-05-13 2025-06-10 港湾之星健康生物(深圳)有限公司 VEM-Token Vocal Emotion Multimodal Tokenization Singing and Accompaniment Deep Learning Method
CN120748450A (en) * 2025-09-03 2025-10-03 港湾之星健康生物(深圳)有限公司 Method for constructing a VEM-Token beat capture and alignment model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Yang Wanyan: "Research and Application of Deep-Learning-Based Music Emotion Recognition Methods", Philosophy and Humanities Series, no. 07, 15 July 2025 (2025-07-15), pages 086-13 *

Similar Documents

Publication Publication Date Title
Goto et al. Music interfaces based on automatic music signal analysis: New ways to create and listen to music
CN108806656B (en) Automatic generation of songs
Humphrey et al. An introduction to signal processing for singing-voice analysis: High notes in the effort to automate the understanding of vocals in music
CN108806655B (en) Automatic generation of songs
JP2018537727A (en) Automated music composition and generation machines, systems and processes employing language and / or graphical icon based music experience descriptors
CN106971703A (en) A kind of song synthetic method and device based on HMM
JP2017107228A (en) Singing voice synthesis device and singing voice synthesis method
CN112382274B (en) Audio synthesis method, device, equipment and storage medium
Gupta et al. Deep learning approaches in topics of singing information processing
Foster et al. Filosax: A dataset of annotated jazz saxophone recordings
Goto Singing information processing
CN108922505B (en) Information processing method and device
Duinker Auto-Tune as instrument: trap music's embrace of a repurposed technology
Herremans et al. A multi-modal platform for semantic music analysis: visualizing audio-and score-based tension
CN115273806A (en) Song synthesis model training method and device, song synthesis method and device
CN114550690B (en) Song synthesis method and device
Delgado et al. A state of the art on computational music performance
Gounaropoulos et al. Synthesising timbres and timbre-changes from adjectives/adverbs
CN120853611B (en) Construction Method of VEM-Token Vocal Emotion Multimodal Modification Model
Nuanáin et al. Rhythmic concatenative synthesis for electronic music: techniques, implementation, and evaluation
Zhang Advancing deep learning for expressive music composition and performance modeling
Müller et al. Computational methods for melody and voice processing in music recordings (Dagstuhl seminar 19052)
CN120853611A (en) The construction method of VEM-Token vocal emotion multimodal magic modification model
Lu et al. A Novel Piano Arrangement Timbre Intelligent Recognition System Using Multilabel Classification Technology and KNN Algorithm
Furduj Virtual orchestration: a film composer's creative practice

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant