
WO2025190785A1 - Apparatus and method for processing an audio file storing a music track and apparatus and method for determining a sample underlying a music track - Google Patents

Apparatus and method for processing an audio file storing a music track and apparatus and method for determining a sample underlying a music track

Info

Publication number
WO2025190785A1
Authority
WO
WIPO (PCT)
Prior art keywords
sound
music track
sound source
determined
pieces
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/EP2025/056195
Other languages
French (fr)
Inventor
Olivier Demarto
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sony Europe Bv
Sony Group Corp
Original Assignee
Sony Europe Bv
Sony Group Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sony Europe Bv and Sony Group Corp
Publication of WO2025190785A1

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00Details of electrophonic musical instruments
    • G10H1/0008Associated control or indicating means
    • G10H1/0025Automatic or semi-automatic music composition, e.g. producing random music, applying rules from music theory or modifying a musical piece
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/031Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • G10H2210/051Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal for extraction or detection of onsets of musical sounds or notes, i.e. note attack timings
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/031Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • G10H2210/056Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal for extraction or identification of individual instrumental parts, e.g. melody, chords, bass; Identification or separation of instrumental parts by their characteristic voices or timbres
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/031Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • G10H2210/066Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal for pitch analysis as part of wider processing for musical purposes, e.g. transcription, musical performance evaluation; Pitch recognition, e.g. in polyphonic sounds; Estimation or use of missing fundamental
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/101Music Composition or musical creation; Tools or processes therefor
    • G10H2210/125Medley, i.e. linking parts of different musical pieces in one single piece, e.g. sound collage, DJ mix
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2240/00Data organisation or data communication aspects, specifically adapted for electrophonic musical tools or instruments
    • G10H2240/121Musical libraries, i.e. musical databases indexed by musical parameters, wavetables, indexing schemes using musical parameters, musical rule bases or knowledge bases, e.g. for automatic composing methods
    • G10H2240/131Library retrieval, i.e. searching a database or selecting a specific musical piece, segment, pattern, rule or parameter set
    • G10H2240/141Library retrieval matching, i.e. any of the steps of matching an inputted segment or phrase with musical database contents, e.g. query by humming, singing or playing; the steps may include, e.g. musical analysis of the input, musical feature extraction, query formulation, or details of the retrieval process
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2240/00Data organisation or data communication aspects, specifically adapted for electrophonic musical tools or instruments
    • G10H2240/121Musical libraries, i.e. musical databases indexed by musical parameters, wavetables, indexing schemes using musical parameters, musical rule bases or knowledge bases, e.g. for automatic composing methods
    • G10H2240/145Sound library, i.e. involving the specific use of a musical database as a sound bank or wavetable; indexing, interfacing, protocols or processing therefor
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2250/00Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
    • G10H2250/311Neural networks for electrophonic musical instruments or musical processing, e.g. for musical recognition or control, automatic composition or improvisation

Definitions

  • Music tracks are based on various instruments and/or samples. There may be a demand for techniques for the decomposition of music tracks.
  • the present disclosure provides an apparatus for processing an audio file storing a music track.
  • the apparatus comprises processing circuitry configured to perform source separation on the audio file to determine an audio signal of a sound source in the music track.
  • the processing circuitry is further configured to determine times of occurrence of the sound source in the music track based on the determined audio signal of the sound source.
  • the processing circuitry is configured to extract sound pieces from the determined audio signal of the sound source at the determined times of occurrence of the sound source.
  • the processing circuitry is further configured to group the extracted sound pieces into groups of similar sound pieces and determine a respective sound that is common to each sound piece in the respective group of similar sound pieces as a sound sample of the sound source.
  • the present disclosure provides a method for processing an audio file storing a music track.
  • the method comprises performing source separation on the audio file to determine an audio signal of a sound source in the music track. Furthermore, the method comprises determining times of occurrence of the sound source in the music track based on the determined audio signal of the sound source.
  • the method additionally comprises extracting sound pieces from the determined audio signal of the sound source at the determined times of occurrence of the sound source.
  • the method comprises grouping the extracted sound pieces into groups of similar sound pieces and determining a respective sound that is common to each sound piece in the respective group of similar sound pieces as a sound sample of the sound source.
  • the present disclosure provides a method for determining a sample underlying a music track.
  • the method comprises determining a spectral representation of the music track based on an audio file storing the music track. Further, the method comprises determining a repeating pattern in the spectral representation of the music track and extracting sound pieces from the audio file comprising the determined repeating pattern. The method additionally comprises determining a sound that is common to each of the extracted sound pieces as the sample underlying the music track.
  • the present disclosure provides a non-transitory machine-readable medium having stored thereon a program having a program code for performing the method according to the second aspect or the fourth aspect, when the program is executed on a processor or programmable hardware.
  • the present disclosure provides a program having a program code for performing the method according to the second aspect or the fourth aspect, when the program is executed on a processor or programmable hardware.
  • Fig. 1 illustrates a flowchart of an example of a method for processing an audio file storing a music track.
  • Fig. 2 illustrates an exemplary GUI for interacting with a virtual instrument.
  • Fig. 3 illustrates an exemplary process flow for an unpitched percussion instrument.
  • Fig. 4 illustrates an exemplary process flow for a tonal instrument.
  • Fig. 5 illustrates a flowchart of an example of a method for determining a sample underlying a music track.
  • Fig. 6 illustrates exemplary spectral representations of parts of a music track and of a repeating pattern identified therein.
  • Fig. 7 illustrates an exemplary apparatus for performing the methods described herein.
  • Fig. 1 illustrates a method 100 for processing an audio file storing a music track (music piece, song).
  • the audio file is a digital file (i.e., machine- or computer-readable data) that contains audio data, which can be played back as sound.
  • the audio data represent the music track.
  • the audio file may be in any file format such as, e.g., MP3, WAV, FLAC or AAC.
  • the music track is an individual piece or recording of music.
  • the music track may be a specific portion of a larger musical work or a standalone composition.
  • the method 100 comprises performing 102 source separation (also known as “audio source separation” or “audio separation”) on the audio file to determine an audio signal (audio data) of a sound source in the music track.
  • Source separation is a technique (process) in which individual sound sources within an audio mixture such as the music track are separated or isolated from one another.
  • a sound source is an individual element or component that contributes to the overall auditory experience of the music track.
  • a sound source may be an instrument, a vocal (voice) or another audio element that produces distinct sounds within a musical composition (e.g., ambient or environmental sounds or electronic sounds of an electronic device such as a synthesizer or a drum machine).
  • source separation makes it possible to extract the distinct components or sound sources that contribute to the overall music track stored in the audio file.
  • Various techniques such as Blind Source Separation (BSS), Informed Source Separation (ISS) or trained machine-learning models may be used in performing source separation on the audio file.
  • the determined audio signal of the sound source in the music track is a (e.g., electronic or digital) representation of the sound of the sound source extracted from the music track.
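The disclosure does not prescribe a particular separation algorithm. As a rough, hedged illustration of the spectrogram-factorisation family of techniques it mentions, the following numpy-only sketch separates a toy two-tone mixture with multiplicative-update non-negative matrix factorisation (NMF); the function names, parameters and toy signal are ours, not from the disclosure:

```python
import numpy as np

def stft_mag(x, win=256, hop=128):
    """Magnitude spectrogram of a 1-D signal (toy STFT with a Hann window)."""
    w = np.hanning(win)
    frames = [x[i:i + win] * w for i in range(0, len(x) - win, hop)]
    return np.abs(np.fft.rfft(np.asarray(frames), axis=1)).T  # (freq, time)

def nmf(V, n_components, n_iter=200, eps=1e-9):
    """Multiplicative-update NMF: V ~= W @ H with W, H >= 0."""
    rng = np.random.default_rng(0)
    W = rng.random((V.shape[0], n_components)) + eps
    H = rng.random((n_components, V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy "music track": a 440 Hz tone in the first half, a 660 Hz tone in the second.
sr = 8000
t = np.arange(sr) / sr
x = np.where(t < 0.5, np.sin(2 * np.pi * 440 * t), np.sin(2 * np.pi * 660 * t))
V = stft_mag(x)
W, H = nmf(V, n_components=2)
# Each column of W is a spectral template; each row of H says when that
# template is active, i.e. a crude "audio signal of a sound source" in time.
half = H.shape[1] // 2
dominant_first_half = H[:, :half].sum(axis=1) > H[:, half:].sum(axis=1)
```

Real systems would instead use a trained separation network, but the factorisation view above is the classical baseline the BSS literature builds on.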
  • the method 100 comprises determining 104 times of occurrence of the sound source in the music track based on the determined audio signal of the sound source.
  • the times of occurrence of the sound source in the music track are the specific moments or time instances when the sound source appears or is audible in the music track.
  • the times of occurrence of the sound source in the music track may be timestamps and/or durations during which the sound source is present or active in the music track.
  • Various techniques for determining the times of occurrence of a sound source in a music track such as automated analysis of a waveform or a spectrogram of the determined audio signal of the sound source may be used. Specific examples will be described below in greater detail with reference to Fig. 3 and Fig. 4.
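A minimal sketch of such an automated waveform analysis, assuming a simple short-time-energy threshold (one of many possibilities, not necessarily what the disclosure uses; all names and parameters are ours):

```python
import numpy as np

def times_of_occurrence(signal, sr, frame=512, threshold=0.1):
    """Return timestamps (seconds) where the short-time RMS energy rises
    from below the threshold to above it, i.e. where the source starts."""
    n = len(signal) // frame
    rms = np.sqrt(np.mean(signal[:n * frame].reshape(n, frame) ** 2, axis=1))
    active = rms > threshold * rms.max()
    # rising edges: frame is active, previous frame was not
    rises = np.flatnonzero(active[1:] & ~active[:-1]) + 1
    if active[0]:
        rises = np.concatenate(([0], rises))
    return rises * frame / sr

# Toy separated signal: two bursts of a 220 Hz source separated by silence.
sr = 8000
sig = np.zeros(sr)
sig[1000:2000] = np.sin(2 * np.pi * 220 * np.arange(1000) / sr)
sig[5000:6000] = np.sin(2 * np.pi * 220 * np.arange(1000) / sr)
onsets = times_of_occurrence(sig, sr)
```

The returned timestamps are quantised to the frame size, which is why real systems use finer hop sizes or spectral methods.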
  • the method 100 additionally comprises extracting 106 sound pieces (excerpts, fragments, snippets) from the determined audio signal of the sound source at the determined times of occurrence of the sound source.
  • specific sections or portions of the audio signal of the sound source are isolated or separated at the previously determined times of occurrence of the sound source.
  • a plurality of sound pieces are obtained as a result of the extraction.
  • the sound pieces are representations of the sound of the sound source at the times of occurrence of the sound source in the music track.
  • the method 100 comprises grouping 108 the extracted sound pieces into groups of similar sound pieces. In other words, the extracted sound pieces are grouped based on similarities between the sound pieces.
  • Various characteristics may be used for determining the similarity between the extracted sound pieces. For example, sound pieces in which the sound source plays a specific (particular) note may be grouped together. In other words, sound pieces in which the sound source emits sound of a specific pitch or frequency may be grouped together. Similarly, sound pieces in which the sound source plays a specific type of beat may be grouped together. One or more groups of similar sound pieces are obtained as a result of the grouping.
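One concrete similarity criterion is the estimated fundamental frequency of each sound piece. The toy grouper below (our own construction, using an autocorrelation pitch estimate) collects pieces whose pitches lie within a tolerance of each other:

```python
import numpy as np

def estimate_pitch(piece, sr, fmin=50.0):
    """Estimate the fundamental frequency (Hz) of a sound piece from the
    strongest autocorrelation peak in a plausible lag range."""
    piece = piece - piece.mean()
    ac = np.correlate(piece, piece, mode="full")[len(piece) - 1:]
    lag_min = int(sr / 1000)  # ignore lags shorter than 1 ms (f > 1 kHz)
    lag_max = int(sr / fmin)  # ignore pitches below fmin
    lag = lag_min + np.argmax(ac[lag_min:lag_max])
    return sr / lag

def group_by_pitch(pieces, sr, tolerance=10.0):
    """Group sound pieces whose estimated pitches differ by < tolerance Hz."""
    groups = []  # list of (representative_pitch, [pieces])
    for p in pieces:
        f0 = estimate_pitch(p, sr)
        for g in groups:
            if abs(g[0] - f0) < tolerance:
                g[1].append(p)
                break
        else:
            groups.append((f0, [p]))
    return groups

# Four toy sound pieces: two near 220 Hz, two near 440 Hz.
sr = 8000
t = np.arange(2048) / sr
pieces = [np.sin(2 * np.pi * f * t) for f in (220.0, 440.0, 221.0, 441.0)]
groups = group_by_pitch(pieces, sr)
```

Grouping by beat type (for percussion) would replace the pitch estimate with a timbre feature, but the clustering step stays the same.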
  • the method 100 further comprises determining 110 a respective sound that is common to each sound piece in the respective group of similar sound pieces as a sound sample of the sound source.
  • Various techniques such as statistical approaches may be used to determine the respective sound that is common to each sound piece in the respective group of similar sound pieces.
  • the sound pieces in the respective group of similar sound pieces may be cross-correlated to determine the sound common to each sound piece in the respective group.
  • the common sound may then be extracted as a respective sound sample of the sound source.
  • a respective sound sample of the sound source is obtained for each group of similar sound pieces. For example, if the different groups relate to different played notes or pitches of the sound source, the samples may represent different played notes of the sound source omitting other interfering sounds present in the sound pieces. Similarly, if the different groups relate to different played beats, the samples may represent different played beats of the sound source omitting other interfering sounds present in the sound pieces.
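A hedged sketch of the cross-correlation approach mentioned above: align the pieces of one group to a common reference and take the sample-wise median, so that interference that differs between occurrences is suppressed. The toy note and the sparse-noise interference model are ours:

```python
import numpy as np

def common_sound(pieces):
    """Align each sound piece to the first one via cross-correlation and
    take the sample-wise median, keeping only what all pieces share."""
    ref = pieces[0]
    aligned = [ref]
    for p in pieces[1:]:
        xc = np.correlate(p, ref, mode="full")
        shift = np.argmax(xc) - (len(ref) - 1)  # samples p lags behind ref
        aligned.append(np.roll(p, -shift))
    return np.median(np.stack(aligned), axis=0)

sr = 8000
t = np.arange(1024) / sr
note = np.sin(2 * np.pi * 330 * t) * np.exp(-5 * t)  # the shared sound
rng = np.random.default_rng(1)
# Each occurrence carries different interference (here: sparse noise bursts).
pieces = [note + 0.5 * rng.standard_normal(1024) * (rng.random(1024) < 0.1)
          for _ in range(5)]
sample = common_sound(pieces)
```

Because the interference rarely hits the same sample in a majority of pieces, the median converges to the shared note, which is the intuition behind "keeping only the sound that is common".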
  • the method 100 makes it possible to separate one or more sound samples of a sound source from the music track.
  • the method 100 is described above for a single sound source. However, in case the sound of multiple sound sources (e.g., multiple instruments and/or vocals) is combined in the music track, the above described processing may be performed for more than one of the sound sources to obtain one or more respective samples for the respective sound source.
  • the obtained sound sample(s) may be used for various applications as will be described in the following in greater detail.
  • the obtained sound samples may be used for a virtual instrument.
  • the method 100 may further comprise generating a virtual instrument based on the determined sound samples of the sound source.
  • a virtual instrument is a software-based imitation of a traditional or electronic musical instrument. Unlike physical instruments that produce sound through vibrating strings, resonating air columns, or other tangible means, virtual instruments generate sound using digital signal processing algorithms.
  • the obtained sound samples of the sound source represent the sounds output by the sound source in great detail.
  • the obtained sound samples of the sound source capture the specific sound of the instrument in the music track.
  • the obtained sound samples of the sound source may be played back by the virtual instrument or used as a basis for sound output by the virtual instrument when the virtual instrument is played by a user.
  • the specific sound of the sound source in the music track may be used for creating a new piece of music with the virtual instrument. For example, if the sound source is a piano, a guitar or a bass, a corresponding virtual piano, guitar, bass or synthesizer may be generated according to the method 100.
  • the method 100 may further comprise causing output of a Graphical User Interface (GUI) on a display.
  • the GUI shows at least graphical icons allowing a user to interact with the virtual instrument.
  • the GUI may show further graphical elements such as a graphical representation of the virtual instrument.
  • the output of the GUI may be caused on a display of a mobile phone, a tablet computer, a laptop, a desktop computer or a TV set used by or accessible to a user.
  • the method 100 may, e.g., comprise transmitting control data to a target device which is to display the GUI.
  • the control data is encoded with one or more control commands for controlling (instructing) a display of the target device to output the GUI.
  • An exemplary GUI 200 is illustrated in Fig. 2.
  • the virtual instrument is a virtual piano.
  • the virtual piano is depicted in the GUI 200 by a corresponding graphical icon 210.
  • the virtual piano is based on the obtained sound samples of a piano used in the music track.
  • Exemplary additional graphical icons are denoted by reference numbers 220, ..., 260.
  • the graphical icon 220 allows the user to set parameters such as the ambience of the virtual piano or a reverb characteristic of the ambience.
  • the graphical icon 230 allows the user to actuate the various virtual keys of the virtual piano and to adjust key action settings.
  • the graphical icon 240 allows the user to actuate the various virtual pedals of the virtual piano and to adjust pedal action settings.
  • the graphical icon 250 allows the user to adjust resonance settings of the virtual piano, and the graphical icon 260 allows the user to adjust noise in the virtual piano’s ambience.
  • a user is able to interact with (play) the virtual piano by means of the graphical icons 210, ..., 260.
  • Since the virtual piano is based on the obtained sound samples of the piano used in the music track, the user may create a new piece of music with a virtual piano having the specific sound of the piano used in the music track.
  • the obtained sound samples may be used to recognize whether the music track uses copyrighted material.
  • the method 100 may comprise determining whether the samples of the sound source are taken from a group of music tracks (e.g., copyrighted music tracks). For example, the samples of the sound source may be compared to the music tracks in the group of music tracks or to samples of the music tracks in the group of music tracks. Based on the similarities of the samples of the sound source to the music tracks in the group of music tracks or to samples of the music tracks in the group of music tracks, it may be determined whether material from the group of music tracks is used in the music track.
  • the obtained sound samples may alternatively or additionally be used to obtain a transcription of the music track.
  • the method 100 may further comprise determining a transcription of the music track based on the samples of the sound source and the determined times of occurrence of the sound source in the music track.
  • the transcription of the music track refers to a written or symbolic representation of the musical elements found in the music track.
  • the musical elements of the music track may be melody, harmony, rhythm and/or lyrics.
  • the transcription of the music track may, e.g., be a Musical Instrument Digital Interface (MIDI) file.
  • the determined times of occurrence of the sound source in the music track and the samples of the sound source indicate the contribution of the sound source (e.g., a specific instrument or a vocal) to the music track. Accordingly, a corresponding written or symbolic representation of the contribution of the sound source to the music track may be determined.
  • the sound samples may be determined for multiple (e.g., all) sound sources of the music track.
  • the method 100 may comprise determining samples and times of occurrence for at least one further sound source according to the above-described principles. Accordingly, the transcription of the music track may be determined further based on the determined samples and times of occurrence for the at least one further sound source. This may allow a full transcription of the music track to be obtained.
  • the transcription of the music track may be used for various applications.
  • the transcription of the music track may be used to recognize or identify the music track.
  • the method 100 may in some examples comprise identifying the music track based on the transcription of the music track.
  • the transcription of the music track may be compared to transcriptions of known music tracks. Based on the similarities of the transcription of the music track to the transcriptions of the known music tracks, the music track may be identified (i.e., it may be determined whether the music track is one of the known music tracks).
  • a MIDI file of the music track may be compared to MIDI files of the known music tracks. Comparing transcriptions makes it possible to recognize the music track even if its speed has been altered compared to the original recording of the music track.
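A toy illustration of why a transcription-level comparison is robust to speed changes, assuming transcriptions reduced to (onset time, pitch) events (our own simplification of a MIDI comparison):

```python
def normalized_onsets(events):
    """events: list of (onset_seconds, midi_pitch) pairs. Onsets are
    rescaled so the transcription spans [0, 1]; uniformly speeding the
    recording up or down then leaves this representation unchanged."""
    events = sorted(events)
    t0, t1 = events[0][0], events[-1][0]
    return [((t - t0) / (t1 - t0), p) for t, p in events]

def same_track(a, b, tol=1e-6):
    """Compare two transcriptions in a tempo-invariant way."""
    na, nb = normalized_onsets(a), normalized_onsets(b)
    return len(na) == len(nb) and all(
        abs(x - y) < tol and p == q for (x, p), (y, q) in zip(na, nb))

original = [(0.0, 60), (0.5, 62), (1.0, 64), (2.0, 60)]  # (seconds, MIDI note)
faster = [(t * 0.8, p) for t, p in original]             # sped up by 1.25x
```

A production matcher would additionally tolerate missing notes and small timing errors (e.g. via dynamic time warping), which this sketch deliberately omits.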
  • the method 100 may further comprise modifying the audio file by replacing the audio signal of the sound source with an audio signal of another sound source.
  • the sound of the (original) audio source in the music track may be replaced by sound of another sound source.
  • one instrument may be replaced by another instrument in the music track to alter the music track.
  • two examples for determining sound samples of a sound source from a music track will be described in greater detail.
  • the sound source is an unpitched percussion instrument.
  • the unpitched percussion instrument is a type of percussion instrument that does not produce definite pitches or specific musical notes. Instead, these instruments create sounds with indistinct pitch characteristics. Examples of unpitched percussion instruments are drums, cymbals, tambourines, triangles and wood blocks.
  • the first example will be described with reference to Fig. 3.
  • the unpitched percussion instrument is a drum kit.
  • Source separation is performed on the audio file to determine an audio signal of the drum kit in the music track.
  • Subfigure (a) exemplarily illustrates two audio signals 305 and 310 obtained by source separation of the audio file.
  • the audio signal 310 is for the drum kit.
  • the audio signal 305 is for another instrument used in the music track.
  • the times of occurrence of the drum kit in the music track are determined by automatic beat detection.
  • Automatic beat detection is a technique (process) in which one or more software algorithms analyze an audio signal to identify and mark the locations of beats or rhythmic pulses in the music. The beats are the regular, recurring patterns that form the foundation of a musical rhythm.
  • the automatic beat detection is applied to the audio signal 310 for the drum kit. Exemplary algorithms for automatic beat detection are described in https://essentia.upf.edu/tutorial_rhythm_beatdetection.html or https://essentia.upf.edu/reference/std_BeatTrackerDegara.
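For illustration only, here is a drastically simplified numpy beat detector in the spirit of the cited algorithms (the real Essentia trackers are far more sophisticated; the energy-based onset envelope and all parameters below are our own toy choices):

```python
import numpy as np

def estimate_tempo(signal, sr, frame=256):
    """Toy automatic beat detection: rectified changes of the short-time
    energy form an onset envelope; the strongest autocorrelation lag of
    that envelope in a plausible range gives the beat period."""
    n = len(signal) // frame
    energy = np.sum(signal[:n * frame].reshape(n, frame) ** 2, axis=1)
    envelope = np.maximum(np.diff(energy), 0.0)  # keep energy increases only
    ac = np.correlate(envelope, envelope, mode="full")[len(envelope) - 1:]
    frame_rate = sr / frame
    lag_min = int(frame_rate * 60 / 240)  # ignore tempos above 240 BPM
    lag = lag_min + int(np.argmax(ac[lag_min:]))
    return 60.0 * frame_rate / lag  # beats per minute

# Toy drum track: a short noise burst every 0.5 s, i.e. 120 BPM.
sr = 8192
rng = np.random.default_rng(2)
sig = np.zeros(sr * 4)
for beat in range(8):
    start = beat * sr // 2
    sig[start:start + 200] = rng.standard_normal(200)
tempo = estimate_tempo(sig, sr)
```

Once the beat period and the beat positions are known, the drum hits can be cut out at those positions as the sound pieces described above.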
  • the resulting sound samples 341, 342 and 343 may be used as described above.
  • a virtual drum kit may be generated from the sound samples that reflects the characteristics of the drum kit used in the music track (e.g., the specific sound of the drum kit or the intensity with which the drum kit is played in the music track). This may make the virtual drum kit sound more natural.
  • the sound source is a tonal instrument.
  • the tonal instrument is a musical instrument that is capable of producing pitched or tonal sounds. Tonal instruments generate musical notes with discernible pitch, allowing for the creation of melodies and harmonies.
  • the tonal instrument may be a guitar, a bass, a piano or a violin.
  • the second example will be described in the following with reference to Fig. 4.
  • source separation is initially performed on the audio file to determine an audio signal of the tonal instrument in the music track.
  • the times of occurrence of the tonal instrument in the music track are determined by onset detection for played notes.
  • Onset detection for played notes is a technique (process) in which one or more software algorithms analyze an audio signal to identify the precise moments in time when musical notes or sound events begin (onset points) within a piece of audio. The onset detection for played notes is applied to the audio signal for the tonal instrument.
  • a respective sound that is common to each sound piece in the respective group of similar sound pieces is determined as a sound sample for the played notes of the tonal instrument.
  • statistical approaches such as cross-correlation may be used to determine the sound common to each sound piece in the respective group.
  • statistical approaches or methods may be used to keep only the sound that is common in each of the groups, thereby dismissing other interfering sounds.
  • the music track comprises a bass line consisting of three different notes with different durations.
  • sound samples may be obtained for the three different notes according to the method 100.
  • the sound samples obtained for the three different notes may be interpolated and/or extrapolated to the full scale to create a full instrument.
  • the resulting virtual bass is able to provide the full scale of notes.
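The interpolation/extrapolation to the full scale can be sketched by resampling one extracted note to neighbouring pitches. This is only a toy: resampling also changes the note's duration, and real virtual instruments use more elaborate time-scale modification. All names and the MIDI range are our assumptions:

```python
import numpy as np

def pitch_shift(sample, semitones):
    """Shift a note's pitch by resampling with linear interpolation
    (note: this also rescales the duration by the same factor)."""
    ratio = 2.0 ** (semitones / 12.0)
    n_out = int(round(len(sample) / ratio))
    positions = np.arange(n_out) * ratio
    return np.interp(positions, np.arange(len(sample)), sample)

def build_full_scale(note_sample, base_midi, lo=40, hi=76):
    """Create one sample per MIDI note from a single extracted note."""
    return {m: pitch_shift(note_sample, m - base_midi) for m in range(lo, hi)}

sr = 8000
t = np.arange(4096) / sr
a3 = np.sin(2 * np.pi * 220.0 * t)   # extracted sample: A3 (MIDI note 57)
scale = build_full_scale(a3, base_midi=57)
a4 = scale[69]                       # one octave up, should sound at ~440 Hz
```

With several extracted notes, each target pitch would be synthesized from the nearest extracted note to minimize the shift, which is the "interpolation" mentioned above.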
  • the proposed technology may further be used for identifying a sample underlying a music track.
  • a “sample” is a short piece, segment or snippet of audio taken from a pre-existing music track (recording) and used in a new music track (composition).
  • the sample may comprise a portion of a song, a drumbeat, a vocal line, or any other sound extracted from an existing music track.
  • a sample is usually repeated throughout the new music track.
  • a method 500 for determining a sample underlying (forming the foundation of) a music track (music piece, song) is illustrated in Fig. 5.
  • the method 500 comprises determining 502 a spectral representation of the music track based on the audio file storing the music track. Furthermore, the method 500 comprises determining 504 a repeating pattern in the spectral representation of the music track.
  • Various techniques for detecting a repeating pattern in the spectral representation of the music track may be used. For example, pattern recognition techniques may be used to identify a repeating pattern. Additionally or alternatively, autocorrelation of the spectral representation of the music track (or a part thereof) with a time-shifted replica of itself (or a part thereof) may be used to identify a repeating pattern. Further additionally or alternatively, techniques like Fourier analysis or trained machine-learning models may be used to determine a repeating pattern in the spectral representation of the music track.
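The autocorrelation variant can be sketched as follows: slide the spectrogram against a time-shifted replica of itself and score the normalized overlap at each lag; a strong peak reveals the period of a repeating pattern. The toy data and function names are ours:

```python
import numpy as np

def repeat_lag(spec):
    """Return the lag (in frames) at which the spectrogram best matches a
    time-shifted replica of itself, i.e. the period of a repeating pattern.
    Multiples of the true period score equally well by construction."""
    n = spec.shape[1]
    scores = []
    for lag in range(1, n // 2):
        a, b = spec[:, :-lag], spec[:, lag:]
        scores.append(np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return 1 + int(np.argmax(scores))

# Toy spectrogram: a 3-frame pattern repeated 8 times over 12 frequency bins.
rng = np.random.default_rng(3)
pattern = rng.random((12, 3))
spec = np.tile(pattern, 8)
lag = repeat_lag(spec)
```

Real music is never exactly periodic, so practical systems score approximate matches and then locate where in the track the pattern occurs before cutting out the sound pieces.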
  • Fig. 6 shows, in its left part, exemplary spectral representations 610, ..., 640 of four different parts of a music track. The spectral representation 650 of an exemplary repeating pattern identified in the spectral representations 610, ..., 640 of the four different parts of the music track is illustrated in the right part of Fig. 6.
  • the method 500 additionally comprises extracting 506 sound pieces (excerpts, fragments, snippets) from the audio file comprising the determined repeating pattern.
  • specific sections or portions of the audio signal represented by (encoded to) the audio file comprising the determined repeating pattern are isolated or separated.
  • a plurality of sound pieces are obtained as a result of the extraction.
  • the sound pieces are representations of the determined repeating pattern in the music track.
  • the method 500 comprises determining 508 a sound that is common to each of the extracted sound pieces as the sample underlying the music track.
  • Analogously to what is described above, various techniques such as statistical approaches may be used to determine the sound that is common to each of the extracted sound pieces. The common sound may then be extracted as the sample underlying the music track.
  • the method 500 makes it possible to identify and extract one or more samples underlying a music track.
  • the method 500 is described above for a single sample. However, in case the music track is based on multiple samples, the above-described processing may be performed to discover and extract more than one sample in the music track.
  • the obtained sample underlying the music track may be used for various applications as will be described in the following in greater detail.
  • the obtained sample underlying the music track may be used to further analyze the composition of the music track.
  • the obtained sample underlying the music track may be used to recognize whether the music track uses copyrighted material.
  • the method 500 may comprise determining whether the determined sample underlying the music track is taken from a group of music tracks (e.g., copyrighted music tracks).
  • the obtained sample underlying the music track may be compared to the music tracks in the group of music tracks or to samples of the music tracks in the group of music tracks. Based on the similarities of the obtained sample underlying the music track to the music tracks in the group of music tracks or to samples of the music tracks in the group of music tracks, it may be determined whether material from the group of music tracks is used in the music track.
  • Fig. 7 illustrates an exemplary apparatus 700 comprising processing circuitry 710.
  • the processing circuitry 710 is configured to receive an audio file 701 and process it as described herein (e.g., according to one of the methods 100 and 500 described above).
  • the processing circuitry 710 may be a single dedicated processor, a single shared processor, or a plurality of individual processors, some of which or all of which may be shared, a digital signal processor (DSP) hardware, an application specific integrated circuit (ASIC), a neuromorphic processor, a system-on-a-chip (SoC) or a field programmable gate array (FPGA).
  • the processing circuitry 710 may optionally be coupled to, e.g., memory such as read only memory (ROM) for storing software, random access memory (RAM) and/or non-volatile memory.
  • the apparatus 700 may comprise memory configured to store instructions, which when executed by the processing circuitry 710, cause the processing circuitry 710 to perform the steps and methods described herein.
  • the apparatus 700 may, e.g., be or be part of a server, a computing cloud, a personal computer or a mobile device (e.g., a mobile phone, a laptop-computer or a tablet computer).
  • Examples of the present disclosure may provide a system that can analyze a piece of music in order to decompose it into a transcription (like a MIDI file) and a set of sample-based instruments that accurately reproduce the original. For example, if a user loves the sound of the piano in a song, the user may use the proposed technology on this song to obtain a virtual piano instrument that sounds the same and use it for his/her own music.
  • a piece of music can be decomposed into a set of transcriptions (e.g., MIDI tracks) each associated with a number of samples making up a separate instrument.
  • the sample sets (instruments) obtained according to the proposed technology may, e.g., be used to automatically create a reusable instrument that makes it possible to use a specific sounding instrument in new work or to recognize samples from other copyrighted material.
  • An apparatus for processing an audio file storing a music track comprising processing circuitry configured to: perform source separation on the audio file to determine an audio signal of a sound source in the music track; determine times of occurrence of the sound source in the music track based on the determined audio signal of the sound source; extract sound pieces from the determined audio signal of the sound source at the determined times of occurrence of the sound source; group the extracted sound pieces into groups of similar sound pieces; and determine a respective sound that is common to each sound piece in the respective group of similar sound pieces as a sound sample of the sound source.
  • the sound source is a tonal instrument
  • the processing circuitry is configured to: determine the times of occurrence of the sound source in the music track by onset detection for played notes; and extract the sound pieces from the determined audio signal of the sound source by: determining a chromagram based on the determined audio signal of the sound source; determining pitches and lengths of the played notes from the chromagram; and extracting sound pieces for the determined pitches of played notes from the determined audio signal of the sound source based on the determined times of occurrence and lengths of the played notes.
  • processing circuitry is configured to: cause output of a graphical user interface on a display, wherein the graphical user interface shows graphical icons allowing a user to interact with the virtual instrument.
  • processing circuitry is further configured to: generate an audio signal for another sound source based on the transcription of the music track; and modify the audio file by replacing the audio signal of the sound source with the audio signal for the other sound source.
  • a method for processing an audio file storing a music track comprising: performing source separation on the audio file to determine an audio signal of a sound source in the music track; determining times of occurrence of the sound source in the music track based on the determined audio signal of the sound source; extracting sound pieces from the determined audio signal of the sound source at the determined times of occurrence of the sound source; grouping the extracted sound pieces into groups of similar sound pieces; and determining a respective sound that is common to each sound piece in the respective group of similar sound pieces as a sound sample of the sound source.
  • An apparatus for determining a sample underlying a music track comprising processing circuitry configured to: determine a spectral representation of the music track based on an audio file storing the music track; determine a repeating pattern in the spectral representation of the music track; extract sound pieces from the audio file comprising the determined repeating pattern; determine a sound that is common to each of the extracted sound pieces as the sample underlying the music track.
  • a method for determining a sample underlying a music track comprising: determining a spectral representation of the music track based on an audio file storing the music track; determining a repeating pattern in the spectral representation of the music track; extracting sound pieces from the audio file comprising the determined repeating pattern; determining a sound that is common to each of the extracted sound pieces as the sample underlying the music track.
  • the aspects and features described in relation to a particular one of the previous examples may also be combined with one or more of the further examples to replace an identical or similar feature of that further example or to additionally introduce the features into the further example.
  • Examples may further be or relate to a (computer) program including a program code to execute one or more of the above methods when the program is executed on a computer, processor or other programmable hardware component.
  • steps, operations or processes of different ones of the methods described above may also be executed by programmed computers, processors or other programmable hardware components.
  • Examples may also cover program storage devices, such as digital data storage media, which are machine-, processor- or computer-readable and encode and/or contain machine-executable, processor-executable or computer-executable programs and instructions.
  • Program storage devices may include or be digital storage devices, magnetic storage media such as magnetic disks and magnetic tapes, hard disk drives, or optically readable digital data storage media, for example.
  • Other examples may also include computers, processors, control units, (field) programmable logic arrays ((F)PLAs), (field) programmable gate arrays ((F)PGAs), graphics processor units (GPU), ASICs, integrated circuits (ICs) or SoCs programmed to execute the steps of the methods described above.
  • aspects described in relation to a device or system should also be understood as a description of the corresponding method.
  • a block, device or functional aspect of the device or system may correspond to a feature, such as a method step, of the corresponding method.
  • aspects described in relation to a method shall also be understood as a description of a corresponding block, a corresponding element, a property or a functional feature of a corresponding device or a corresponding system.
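The repeating-pattern detection summarized in the items above (autocorrelation of the spectral representation with a time-shifted replica of itself) can be illustrated with a minimal sketch. The toy spectrogram, the function name and all parameter choices below are illustrative assumptions and not part of the disclosed method:

```python
import numpy as np

def repetition_period(spectrogram):
    """Estimate the repetition period (in frames) of a magnitude
    spectrogram by correlating it with time-shifted copies of itself."""
    # Remove the per-bin mean so constant energy does not dominate.
    s = spectrogram - spectrogram.mean(axis=1, keepdims=True)
    n_frames = s.shape[1]
    scores = [(s[:, :-lag] * s[:, lag:]).sum()
              for lag in range(1, n_frames // 2)]
    return int(np.argmax(scores)) + 1  # +1 because lags start at 1

# Toy spectrogram: a 4-frame pattern repeated 8 times over 16 frequency bins.
pattern = np.random.default_rng(0).random((16, 4))
spec = np.tile(pattern, 8)            # shape (16, 32)
print(repetition_period(spec))        # 4
```

The strongest score occurs at the lag equal to the pattern length, because there the spectrogram aligns with itself exactly.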

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Auxiliary Devices For Music (AREA)

Abstract

Provided is a method for processing an audio file storing a music track. The method includes performing source separation on the audio file to determine an audio signal of a sound source in the music track. Furthermore, the method includes determining times of occurrence of the sound source in the music track based on the determined audio signal of the sound source. The method additionally includes extracting sound pieces from the determined audio signal of the sound source at the determined times of occurrence of the sound source. In addition, the method includes grouping the extracted sound pieces into groups of similar sound pieces and determining a respective sound that is common to each sound piece in the respective group of similar sound pieces as a sound sample of the sound source.

Description

APPARATUS AND METHOD FOR PROCESSING AN AUDIO FILE STORING A MUSIC TRACK AND APPARATUS AND METHOD FOR DETERMINING A SAMPLE UNDERLYING A MUSIC TRACK
Field
The present disclosure relates to processing of audio files. In particular, examples of the present disclosure relate to an apparatus and a method for processing an audio file storing a music track as well as an apparatus and a method for determining a sample underlying a music track.
Background
Music tracks are based on various instruments and/or samples. There may be a demand for techniques for the decomposition of music tracks.
Summary
This demand is met by an apparatus and a method for processing an audio file storing a music track as well as an apparatus and a method for determining a sample underlying a music track, a non-transitory machine-readable medium and a program in accordance with the independent claims. Advantageous embodiments are defined by the dependent claims.
According to a first aspect, the present disclosure provides an apparatus for processing an audio file storing a music track. The apparatus comprises processing circuitry configured to perform source separation on the audio file to determine an audio signal of a sound source in the music track. The processing circuitry is further configured to determine times of occurrence of the sound source in the music track based on the determined audio signal of the sound source. In addition, the processing circuitry is configured to extract sound pieces from the determined audio signal of the sound source at the determined times of occurrence of the sound source. The processing circuitry is further configured to group the extracted sound pieces into groups of similar sound pieces and determine a respective sound that is common to each sound piece in the respective group of similar sound pieces as a sound sample of the sound source.
According to a second aspect, the present disclosure provides a method for processing an audio file storing a music track. The method comprises performing source separation on the audio file to determine an audio signal of a sound source in the music track. Furthermore, the method comprises determining times of occurrence of the sound source in the music track based on the determined audio signal of the sound source. The method additionally comprises extracting sound pieces from the determined audio signal of the sound source at the determined times of occurrence of the sound source. In addition, the method comprises grouping the extracted sound pieces into groups of similar sound pieces and determining a respective sound that is common to each sound piece in the respective group of similar sound pieces as a sound sample of the sound source.
According to a third aspect, the present disclosure provides an apparatus for determining a sample underlying a music track. The apparatus comprises processing circuitry configured to determine a spectral representation of the music track based on an audio file storing the music track. The processing circuitry is further configured to determine a repeating pattern in the spectral representation of the music track and extract sound pieces from the audio file comprising the determined repeating pattern. In addition, the processing circuitry is configured to determine a sound that is common to each of the extracted sound pieces as the sample underlying the music track.
According to a fourth aspect, the present disclosure provides a method for determining a sample underlying a music track. The method comprises determining a spectral representation of the music track based on an audio file storing the music track. Further, the method comprises determining a repeating pattern in the spectral representation of the music track and extracting sound pieces from the audio file comprising the determined repeating pattern. The method additionally comprises determining a sound that is common to each of the extracted sound pieces as the sample underlying the music track.
According to a fifth aspect, the present disclosure provides a non-transitory machine-readable medium having stored thereon a program having a program code for performing the method according to the second aspect or the fourth aspect, when the program is executed on a processor or a programmable hardware.
According to a sixth aspect, the present disclosure provides a program having a program code for performing the method according to the second aspect or the fourth aspect, when the program is executed on a processor or a programmable hardware.
Brief description of the Figures
Some examples of apparatuses and/or methods will be described in the following by way of example only, and with reference to the accompanying figures, in which
Fig. 1 illustrates a flowchart of an example of a method for processing an audio file storing a music track;
Fig. 2 illustrates an exemplary virtual instrument;
Fig. 3 illustrates an exemplary process flow for an unpitched percussion instrument;
Fig. 4 illustrates an exemplary process flow for a tonal instrument;
Fig. 5 illustrates a flowchart of an example of a method for determining a sample underlying a music track;
Fig. 6 illustrates spectral representations of different parts of an exemplary music track together with a spectral representation of a repeating pattern; and
Fig. 7 illustrates an exemplary apparatus for performing the methods described herein.
Detailed Description
Some examples are now described in more detail with reference to the enclosed figures. However, other possible examples are not limited to the features of these embodiments described in detail. Other examples may include modifications of the features as well as equivalents and alternatives to the features. Furthermore, the terminology used herein to describe certain examples should not be restrictive of further possible examples.
Throughout the description of the figures same or similar reference numerals refer to same or similar elements and/or features, which may be identical or implemented in a modified form while providing the same or a similar function. The thickness of lines, layers and/or areas in the figures may also be exaggerated for clarification.
When two elements A and B are combined using an “or”, this is to be understood as disclosing all possible combinations, i.e., only A, only B as well as A and B, unless expressly defined otherwise in the individual case. As an alternative wording for the same combinations, "at least one of A and B" or "A and/or B" may be used. This applies equivalently to combinations of more than two elements.
If a singular form, such as “a”, “an” and “the” is used and the use of only a single element is not defined as mandatory either explicitly or implicitly, further examples may also use several elements to implement the same function. If a function is described below as implemented using multiple elements, further examples may implement the same function using a single element or a single processing entity. It is further understood that the terms "include", "including", "comprise" and/or "comprising", when used, describe the presence of the specified features, integers, steps, operations, processes, elements, components and/or a group thereof, but do not exclude the presence or addition of one or more other features, integers, steps, operations, processes, elements, components and/or a group thereof.
Fig. 1 illustrates a method 100 for processing an audio file storing a music track (music piece, song). The audio file is a digital file (i.e., machine- or computer-readable data) that contains audio data, which can be played back as sound. The audio data represent the music track. The audio file may be in any file format such as, e.g., MP3, WAV, FLAC or AAC. The music track is an individual piece or recording of music. The music track may be a specific portion of a larger musical work or a standalone composition.
The method 100 comprises performing 102 source separation (also known as “audio source separation” or “audio separation”) on the audio file to determine an audio signal (audio data) of a sound source in the music track. Source separation is a technique (process) in which individual sound sources within an audio mixture such as the music track are separated or isolated from one another. In a music track, a single sound source is recorded or multiple sound sources are combined into a single recording. A sound source is an individual element or component that contributes to the overall auditory experience of the music track. For example, a sound source may be an instrument, a vocal (voice) or another audio element that produces distinct sounds within a musical composition (e.g., ambient or environmental sounds or electronic sounds of an electronic device such as a synthesizer or a drum machine). Accordingly, source separation allows to extract the distinct components or sound sources that contribute to the overall music track stored in the audio file. Various techniques such as Blind Source Separation (BSS), Informed Source Separation (ISS) or trained machine-learning models may be used in performing source separation on the audio file. The determined audio signal of the sound source in the music track is a (e.g., electronic or digital) representation of the sound of the sound source extracted from the music track.
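As stated above, practical systems use BSS, ISS or trained machine-learning models for this step. Purely to illustrate the idea of isolating sound sources from a mixture, the following toy sketch (Python/NumPy; the tone frequencies and the fixed cutoff are arbitrary assumptions) separates a synthetic two-tone mixture by masking its spectrum:

```python
import numpy as np

sr = 8000
t = np.arange(sr) / sr                    # one second of audio
low = np.sin(2 * np.pi * 220 * t)         # "bass" sound source
high = np.sin(2 * np.pi * 1760 * t)       # "lead" sound source
mix = low + high                          # the music track (mixture)

# Separate the mixture by masking its spectrum around a cutoff frequency.
spectrum = np.fft.rfft(mix)
freqs = np.fft.rfftfreq(len(mix), 1 / sr)
mask = freqs < 1000.0
low_est = np.fft.irfft(spectrum * mask, n=len(mix))
high_est = np.fft.irfft(spectrum * ~mask, n=len(mix))

# Each estimated audio signal matches its original source closely.
print(np.max(np.abs(low_est - low)) < 1e-6)   # True
```

A real separator replaces the fixed mask with a learned, time-varying one; the principle of recovering a per-source audio signal from the mixture is the same.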
Furthermore, the method 100 comprises determining 104 times of occurrence of the sound source in the music track based on the determined audio signal of the sound source. The times of occurrence of the sound source in the music track are the specific moments or time instances when the sound source appears or is audible in the music track. For example, the times of occurrence of the sound source in the music track may be timestamps and/or durations during which the sound source is present or active in the music track. Various techniques for determining the times of occurrence of a sound source in a music track such as automated analysis of a waveform or a spectrogram of the determined audio signal of the sound source may be used. Specific examples will be described below in greater detail with reference to Fig. 3 and Fig. 4.
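One elementary realization of this step is thresholding the frame-wise energy of the determined audio signal. The frame size and threshold below are illustrative assumptions; real systems use dedicated onset-detection or beat-tracking algorithms:

```python
import numpy as np

def times_of_occurrence(signal, sr, frame=256, threshold=0.1):
    """Return timestamps (in seconds) at which the frame energy of the
    separated audio signal rises above a threshold."""
    n = len(signal) // frame
    energy = np.array([np.mean(signal[i * frame:(i + 1) * frame] ** 2)
                       for i in range(n)])
    active = energy > threshold
    # An occurrence starts where an inactive frame precedes an active one.
    onsets = np.flatnonzero(active[1:] & ~active[:-1]) + 1
    return onsets * frame / sr

# Toy audio signal of a sound source: silence with bursts at 0.5 s and 1.5 s.
sr = 8000
sig = np.zeros(2 * sr)
burst = np.sin(2 * np.pi * 440 * np.arange(int(0.2 * sr)) / sr)
for start in (0.5, 1.5):
    i = int(start * sr)
    sig[i:i + len(burst)] = burst

print(times_of_occurrence(sig, sr))   # approximately [0.5, 1.5]
```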
The method 100 additionally comprises extracting 106 sound pieces (excerpts, fragments, snippets) from the determined audio signal of the sound source at the determined times of occurrence of the sound source. When extracting the sound pieces from the determined audio signal of the sound source, specific sections or portions of the audio signal of the sound source are isolated or separated at the previously determined times of occurrence of the sound source. A plurality of sound pieces are obtained as a result of the extraction. The sound pieces are representations of the sound of the sound source at the times of occurrence of the sound source in the music track. In addition, the method 100 comprises grouping 108 the extracted sound pieces into groups of similar sound pieces. In other words, the extracted sound pieces are grouped based on similarities between the sound pieces. Various characteristics may be used for determining the similarity between the extracted sound pieces. For example, sound pieces in which the sound source plays a specific (particular) note may be grouped together. In other words, sound pieces in which the sound source emits sound of a specific pitch or frequency may be grouped together. Similarly, sound pieces in which the sound source plays a specific type of beat may be grouped together. One or more groups of similar sound pieces are obtained as a result of the grouping.
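One possible grouping criterion, sketched here as a toy: group sound pieces by their dominant pitch. The tolerance value and helper names are illustrative assumptions:

```python
import numpy as np

def dominant_frequency(piece, sr):
    """Frequency (Hz) of the strongest spectral bin of a sound piece."""
    mag = np.abs(np.fft.rfft(piece))
    return np.fft.rfftfreq(len(piece), 1 / sr)[int(np.argmax(mag))]

def group_by_pitch(pieces, sr, tolerance=20.0):
    """Group sound pieces whose dominant frequencies lie within
    `tolerance` Hz of each other."""
    groups = []   # list of (reference frequency, [member pieces])
    for p in pieces:
        f = dominant_frequency(p, sr)
        for ref, members in groups:
            if abs(f - ref) < tolerance:
                members.append(p)
                break
        else:
            groups.append((f, [p]))
    return groups

# Toy sound pieces: three at 440 Hz and two at 660 Hz.
sr = 8000
t = np.arange(sr // 4) / sr
make = lambda f: np.sin(2 * np.pi * f * t)
pieces = [make(440), make(660), make(440), make(440), make(660)]
print([len(m) for _, m in group_by_pitch(pieces, sr)])   # [3, 2]
```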
The method 100 further comprises determining 110 a respective sound that is common to each sound piece in the respective group of similar sound pieces as a sound sample of the sound source. Various techniques such as statistical approaches may be used to determine the respective sound that is common to each sound piece in the respective group of similar sound pieces. For example, the sound pieces in the respective group of similar sound pieces may be cross-correlated to determine the sound common to each sound piece in the respective group. The common sound may then be extracted as a respective sound sample of the sound source. A respective sound sample of the sound source is obtained for each group of similar sound pieces. For example, if the different groups relate to different played notes or pitches of the sound source, the samples may represent different played notes of the sound source omitting other interfering sounds present in the sound pieces. Similarly, if the different groups relate to different played beats, the samples may represent different played beats of the sound source omitting other interfering sounds present in the sound pieces.
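One elementary statistical approach, sketched under the assumption that the pieces in a group are already time-aligned: averaging the group keeps the common sound while uncorrelated interference cancels out.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(2000) / 8000.0
common = np.sin(2 * np.pi * 440 * t) * np.exp(-5 * t)  # shared note

# Each extracted sound piece = common sound + different interference.
pieces = [common + 0.3 * rng.standard_normal(len(t)) for _ in range(50)]

# Averaging keeps the common component and attenuates the interference
# (its power drops roughly by a factor of the group size).
sample = np.mean(pieces, axis=0)
err_before = np.mean((pieces[0] - common) ** 2)
err_after = np.mean((sample - common) ** 2)
print(err_after < err_before / 10)   # True
```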
The method 100 allows to separate one or more sound samples of a sound source from the music track. The method 100 is described above for a single sound source. However, in case the sound of multiple sound sources (e.g., multiple instruments and/or vocals) is combined in the music track, the above described processing may be performed for more than one of the sound sources to obtain one or more respective samples for the respective sound source. The obtained sound sample(s) may be used for various applications as will be described in the following in greater detail. For example, the obtained sound samples may be used for a virtual instrument. Accordingly, the method 100 may further comprise generating a virtual instrument based on the determined sound samples of the sound source. A virtual instrument is a software-based mimic of a traditional or electronic musical instrument. Unlike physical instruments that produce sound through vibrating strings, resonating air columns, or other tangible means, virtual instruments generate sound using digital signal processing algorithms.
The obtained sound samples of the sound source represent the sounds output by the sound source in great detail. In particular, the obtained sound samples of the sound source catch the specific sound of the instrument in the music track. The obtained sound samples of the sound source may be played back by the virtual instrument or used as a basis for sound output by the virtual instrument when the virtual instrument is played by a user. Accordingly, the specific sound of the sound source in the music track may be used for creating a new piece of music with the virtual instrument. For example, if the sound source is a piano, a guitar or a bass, a corresponding virtual piano, guitar, bass or synthesizer may be generated according to the method 100.
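How a sampler-based virtual instrument can reuse a determined sound sample may be sketched as follows. The resampling approach and the chosen note are illustrative; real virtual instruments additionally apply envelopes, filtering and multiple velocity layers:

```python
import numpy as np

def play_note(sample, semitones):
    """Repitch a recorded sound sample by resampling, the classic way a
    sampler-based virtual instrument derives different notes from one
    sample."""
    ratio = 2.0 ** (semitones / 12.0)
    idx = np.arange(0, len(sample) - 1, ratio)
    return np.interp(idx, np.arange(len(sample)), sample)

sr = 8000
t = np.arange(sr) / sr
c4 = np.sin(2 * np.pi * 261.63 * t)   # sound sample of one played note
c5 = play_note(c4, 12)                # one octave up: half the length
print(len(c5))                        # 4000
```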
The method 100 may further comprise causing output of a Graphical User Interface (GUI) on a display. The GUI shows at least graphical icons allowing a user to interact with the virtual instrument. Optionally the GUI may show further graphical elements such as a graphical representation of the virtual instrument. For example, the output of the GUI may be caused on a display of a mobile phone, a tablet computer, a laptop, a desktop computer or a TV-set used/accessible by a user. For causing output of the GUI, the method 100 may, e.g., comprise transmitting control data to a target device which is to display the GUI. The control data is encoded with one or more control commands for controlling (instructing) a display of the target device to output the GUI.
An exemplary GUI 200 is illustrated in Fig. 2. In the example of Fig. 2, the virtual instrument is a virtual piano. The virtual piano is depicted in the GUI 200 by a corresponding graphical icon 210. The virtual piano is based on the obtained sound samples of a piano used in the music track. Exemplary additional graphical icons are denoted by reference numbers 220, ..., 260. For example, the graphical icon 220 allows to set parameters like an ambience of the virtual piano or a reverb characteristic of the ambience. The graphical icon 230 allows to action the various virtual keys of the virtual piano and adjust action settings. Similarly, the graphical icon 240 allows to action the various virtual pedals of the virtual piano and adjust action settings. The graphical icon 250 allows to adjust resonance settings of the virtual piano and the graphical icon 260 allows to adjust noise in the virtual piano’s ambience.
A user is able to interact with (play) the virtual piano by means of the graphical icons 210, ..., 260. As the virtual piano is based on the obtained sound samples of the piano used in the music track, the user may create a new piece of music with a virtual piano having the specific sound of the piano used in the music track.
According to examples, the obtained sound samples may be used to recognize whether the music track uses copyrighted material. Accordingly, the method 100 may comprise determining whether the samples of the sound source are taken from a group of music tracks (e.g., copyrighted music tracks). For example, the samples of the sound source may be compared to the music tracks in the group of music tracks or to samples of the music tracks in the group of music tracks. Based on the similarities of the samples of the sound source to the music tracks in the group of music tracks or to samples of the music tracks in the group of music tracks, it may be determined whether material from the group of music tracks is used in the music track.
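A minimal similarity measure for such a comparison is the peak of the normalized cross-correlation between the determined sample and a candidate reference track. This is only a sketch; practical systems use more robust audio fingerprinting:

```python
import numpy as np

def peak_similarity(sample, reference):
    """Peak normalized cross-correlation between a determined sample and
    a reference track (close to 1.0 if the sample occurs verbatim)."""
    s = sample - sample.mean()
    s /= np.linalg.norm(s) + 1e-12
    r = reference - reference.mean()
    corr = np.correlate(r, s, mode="valid")
    # Norm of every sliding window of the reference, for normalization.
    norms = np.sqrt(np.convolve(r ** 2, np.ones(len(s)), mode="valid"))
    return float(np.max(corr / (norms + 1e-12)))

# Toy case: the sample is embedded verbatim inside the reference track.
rng = np.random.default_rng(2)
sample = rng.standard_normal(500)
reference = np.concatenate(
    [rng.standard_normal(1000), sample, rng.standard_normal(1000)])
print(peak_similarity(sample, reference) > 0.99)   # True
```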
The obtained sound samples may alternatively or additionally be used to obtain a transcription of the music track. Accordingly, the method 100 may further comprise determining a transcription of the music track based on the samples of the sound source and the determined times of occurrence of the sound source in the music track. The transcription of the music track refers to a written or symbolic representation of the musical elements found in the music track. For example, the musical elements of the music track may be melody, harmony, rhythm and/or lyrics. The transcription of the music track may, e.g., be a Musical Instrument Digital Interface (MIDI) file. However, it is to be noted that the present disclosure is not limited thereto. Other file formats may be used as well. The determined times of occurrence of the sound source in the music track and the samples of the sound source indicate the contribution of the sound source (e.g., a specific instrument or a vocal) to the music track. Accordingly, a corresponding written or symbolic representation of the contribution of the sound source to the music track may be determined. As described above, the sound samples may be determined for multiple (e.g., all) sound sources of the music track. For example, the method 100 may comprise determining samples and times of occurrence for at least one further sound source according to the above-described principles. Accordingly, the transcription of the music track may be determined further based on the determined samples and times of occurrence for the at least one further sound source. This may allow to obtain a full transcription of the music track.
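A minimal symbolic form of such a transcription may be sketched as a list of note events. The dataclass and its field names are illustrative assumptions; a real implementation would typically emit, e.g., a MIDI file:

```python
from dataclasses import dataclass

@dataclass
class NoteEvent:
    instrument: str   # which separated sound source
    pitch: int        # MIDI note number of the matched sample group
    onset: float      # determined time of occurrence in seconds
    duration: float   # note length in seconds

# Each detected occurrence of a sample group becomes one event.
transcription = [
    NoteEvent("piano", 60, 0.00, 0.50),
    NoteEvent("piano", 64, 0.50, 0.50),
    NoteEvent("drums", 36, 0.00, 0.10),
]

# Events sorted by onset give a simple symbolic score of the track.
score = sorted(transcription, key=lambda e: (e.onset, e.instrument))
print([(e.instrument, e.pitch) for e in score])
```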
The transcription of the music track may be used for various applications. For example, the transcription of the music track may be used to recognize or identify the music track. Accordingly, the method 100 may in some examples comprise identifying the music track based on the transcription of the music track. For example, the transcription of the music track may be compared to transcriptions of known music tracks. Based on the similarities of the transcription of the music track to the transcriptions of the known music tracks, the music track may be identified (i.e., it may be determined whether the music track is one of the known music tracks). According to examples, a MIDI file of the music track may be compared to MIDI files of the known music tracks. Comparing the transcription of the music track allows to recognize the music track also if the speed has been altered compared to the original recording of the music track.
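Why a transcription-level comparison is robust to speed changes can be sketched as follows: pitch sequences and the ratios of inter-onset intervals stay the same under a uniform tempo change, so they form a speed-invariant key. The function name and values are illustrative only:

```python
def tempo_invariant_key(pitches, onsets):
    """Reduce a transcription to a representation that is unchanged when
    the whole track is sped up or slowed down uniformly."""
    intervals = [b - a for a, b in zip(onsets, onsets[1:])]
    base = intervals[0]
    # Ratios of inter-onset intervals are invariant to a uniform speed change.
    return tuple(pitches), tuple(round(i / base, 3) for i in intervals)

original = tempo_invariant_key([60, 64, 67], [0.0, 0.5, 1.5])
sped_up  = tempo_invariant_key([60, 64, 67], [0.0, 0.4, 1.2])  # 25 % faster
print(original == sped_up)   # True
```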
The transcription of the music track may alternatively or additionally be used to alter the music track, for example, by changing one or more instruments. Accordingly, the method 100 may comprise generating an audio signal for another sound source based on the transcription of the music track. The audio signal for the other sound source is a representation of the sound generated for the other sound source based on the transcription of the music track. The sound source and the other sound source are different instruments. For example, if the (original) sound source is a piano, the other sound source may be a guitar. The audio signal for the other sound source may, e.g., be generated by having a virtual or electronic instrument play those parts of the transcription of the music track that relate to the sound source (e.g., by playing according to a MIDI file representing the transcription). The method 100 may then further comprise modifying the audio file by replacing the audio signal of the sound source with the audio signal for the other sound source. Accordingly, the sound of the (original) sound source in the music track may be replaced by the sound of another sound source. For example, as indicated above, one instrument may be replaced by another instrument in the music track to alter the music track.

As indicated above, two examples for determining sound samples of a sound source from a music track will be described in greater detail. In the first example, the sound source is an unpitched percussion instrument. The unpitched percussion instrument is a type of percussion instrument that does not produce definite pitches or specific musical notes. Instead, these instruments create sounds with indistinct pitch characteristics. Examples for unpitched percussion instruments are drums, cymbals, tambourines, triangles or wood blocks. The first example will be described with reference to Fig. 3. In the example of Fig. 3, the unpitched percussion instrument is a drum kit.
Analogously to what is described above, source separation is performed on the audio file to determine an audio signal of the drum kit in the music track. Subfigure (a) exemplarily illustrates two audio signals 305 and 310 obtained by source separation of the audio file. The audio signal 310 is for the drum kit. The audio signal 305 is for another instrument used in the music track.
The times of occurrence of the drum kit in the music track are determined by automatic beat detection. Automatic beat detection is a technique (process) in which one or more software algorithms analyze an audio signal to identify and mark the locations of beats or rhythmic pulses in the music. The beats are the regular, recurring patterns that form the foundation of a musical rhythm. The automatic beat detection is applied to the audio signal 310 for the drum kit. Exemplary algorithms for automatic beat detection are described in https://essentia.upf.edu/tutorial_rhythm_beatdetection.html or https://essentia.upf.edu/reference/std_BeatTrackerDegara.html or https://essentia.upf.edu/reference/std_BeatTrackerMultiFeature.html or https://vamp-plugins.org/plugin-doc/qm-vamp-plugins.html#qm-tempotracker. However, the present disclosure is not limited to these specific algorithms. Other algorithms for automatic beat detection may be used as well. As a result of the automatic beat detection, the times of occurrence 311, ..., 314 of the drum kit in the music track are obtained. In other words, applying automatic beat detection on the drum track results in timestamps 311, ..., 314 for detected impulses or drum hits.
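The beat-detection step may be sketched as follows. This is a strongly simplified stand-in for the cited algorithms, not any one of them: it merely picks peaks in the short-time energy envelope of the separated drum signal. The function name, frame size and threshold value are assumptions made for the sketch.

```python
import numpy as np

def detect_beats(signal, sr, frame=512, threshold_ratio=0.5):
    """Very simplified beat detection: pick local peaks in the
    short-time energy envelope of the (separated) drum signal.
    Real beat trackers are considerably more elaborate."""
    n = len(signal) // frame
    # short-time energy per frame
    env = np.array([np.sum(signal[i * frame:(i + 1) * frame] ** 2)
                    for i in range(n)])
    if env.max() == 0:
        return np.array([])
    env = env / env.max()  # normalize to [0, 1]
    beats = []
    for i in range(1, n - 1):
        # a beat is a local maximum above the threshold
        if env[i] > threshold_ratio and env[i] >= env[i - 1] and env[i] > env[i + 1]:
            beats.append(i * frame / sr)  # timestamp in seconds
    return np.array(beats)
```

Applied to the audio signal 310, such a function would return the timestamps 311, ..., 314 of the detected drum hits.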
Then, as illustrated in subfigure (b), sound pieces 321, ..., 326 are extracted from the audio signal 310 at the determined times of occurrence of the drum kit. The number of extracted sound pieces in subfigure (b) differs from the number of occurrences of the drum kit in subfigure (a) to indicate that the present disclosure is not limited to a specific number of occurrences of the drum kit in the music track. In general, one sound piece may be extracted from the audio signal 310 for each determined time of occurrence of the drum kit. In other words, bits of the sound corresponding to the drumbeats are extracted from the audio signal 310.
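The extraction step may be realized by cutting a fixed-length excerpt at each timestamp, for example as follows. The 200 ms piece length is an assumption for the sketch; the disclosure leaves the piece length open.

```python
import numpy as np

def extract_pieces(signal, sr, times, length=0.2):
    """Cut a fixed-length sound piece (here 200 ms, an assumption)
    from the separated signal at every detected time of occurrence."""
    n = int(length * sr)
    pieces = []
    for t in times:
        start = int(t * sr)
        piece = signal[start:start + n]
        if len(piece) == n:  # skip pieces truncated at the end of the signal
            pieces.append(piece)
    return pieces
```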
As illustrated in subfigure (c), the extracted sound pieces are grouped into groups of similar sound pieces. In the example of Fig. 3, the groups of similar sound pieces are groups for the different percussions (e.g., snare drum, bass drum, tom-toms, cymbals) forming the drum kit. Subfigure (c) exemplarily illustrates a first group 331 for the sound pieces relating to the snare drum, a second group 332 for the sound pieces relating to the bass drum and a third group 333 for the sound pieces relating to the tom-toms. Each group comprises multiple of the extracted sound pieces. The grouping is performed automatically using one or more software algorithms. An exemplary algorithm for grouping the extracted sound pieces into groups for the different percussions forming the drum kit is described in https://github.com/TylerMclaughlin/wav_clustering_workflow. However, the present disclosure is not limited to this specific algorithm. Other grouping algorithms may be used as well.
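A minimal grouping sketch, standing in for the referenced clustering workflow, could greedily assign each piece to a group based on the similarity of its magnitude spectrum. The similarity threshold is an assumption; the pieces are assumed to have equal length.

```python
import numpy as np

def group_pieces(pieces, threshold=0.9):
    """Greedy grouping of equal-length sound pieces by normalized
    similarity of their magnitude spectra. A simplified stand-in for
    a real clustering algorithm; the threshold is an assumption."""
    groups = []        # list of lists of piece indices
    fingerprints = []  # one reference spectrum per group
    for idx, piece in enumerate(pieces):
        spec = np.abs(np.fft.rfft(piece))
        spec = spec / (np.linalg.norm(spec) + 1e-12)
        placed = False
        for g, ref in enumerate(fingerprints):
            # cosine similarity between unit-norm spectra
            if float(np.dot(spec, ref)) > threshold:
                groups[g].append(idx)
                placed = True
                break
        if not placed:  # no similar group yet -> open a new one
            groups.append([idx])
            fingerprints.append(spec)
    return groups
```

In the drum-kit example, each resulting group would collect the pieces of one percussion (e.g., all snare-drum hits).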
Then, as illustrated in subfigure (d), a respective sound that is common to each sound piece in the respective group of similar sound pieces is determined as a sound sample for the respective percussion of the drum kit. For example, statistical approaches such as cross-correlation may be used to determine the sound common to each sound piece in the respective group. In the example, three sound samples 341, 342 and 343 are obtained for the respective one of the snare drum, the bass drum and the tom-toms of the drum kit. In other words, for each group (representing a certain drum hit type), statistical approaches or methods may be used to keep only the sound that is common in each of the groups, thereby dismissing other interfering sounds.
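One simple way to realize the common-sound step, under the assumptions of equal-length pieces and circular shifts, is to align every piece to the first one by the lag of maximum cross-correlation and then average; interfering sounds that differ between pieces are attenuated by the averaging. This is a sketch, not the only possible realization.

```python
import numpy as np

def common_sound(pieces):
    """Align each piece to the first one via the lag of maximum
    cross-correlation, then average. Sounds common to all pieces
    reinforce each other; interfering sounds average out.
    Uses circular shifts for simplicity (an assumption)."""
    ref = pieces[0].astype(float)
    n = len(ref)
    aligned = [ref]
    for piece in pieces[1:]:
        corr = np.correlate(piece, ref, mode="full")
        lag = int(np.argmax(corr)) - (n - 1)  # delay of piece w.r.t. ref
        aligned.append(np.roll(piece, -lag).astype(float))
    return np.mean(aligned, axis=0)
```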
The resulting sound samples 341, 342 and 343 may be used as described above. For example, a virtual drum kit may be generated from the sound samples that reflects the characteristics of the drum kit used in the music track (e.g., the specific sound of the drum kit or the intensity with which the drum kit is played in the music track). This may make the virtual drum kit sound more natural.

In the second example, the sound source is a tonal instrument. The tonal instrument is a musical instrument that is capable of producing pitched or tonal sounds. Tonal instruments generate musical notes with discernible pitch, allowing for the creation of melodies and harmonies. For example, the tonal instrument may be a guitar, a bass, a piano or a violin. The second example will be described in the following with reference to Fig. 4.
As described above in more general terms, source separation is initially performed on the audio file to determine an audio signal of the tonal instrument in the music track. The times of occurrence of the tonal instrument in the music track are determined by onset detection for played notes. Onset detection for played notes is a technique (process) in which one or more software algorithms analyze an audio signal to identify the precise moments in time when musical notes or sound events begin (onset points) within a piece of audio. The onset detection for played notes is applied to the audio signal for the tonal instrument. Exemplary algorithms for onset detection for played notes are described in https://librosa.org/doc/main/generated/librosa.onset.onset_detect.html or https://essentia.upf.edu/reference/std_OnsetDetection.html. However, the present disclosure is not limited to these specific algorithms. Other algorithms for onset detection for played notes may be used as well. The onset detection is exemplarily illustrated in subfigure (a), which shows a diagram 410. The abscissa of the diagram 410 denotes time and the ordinate denotes the onset strength in arbitrary units. The curve 411 denotes the onset strength over time determined from the audio signal for the tonal instrument. The peaks of the curve 411 are determined as onsets 412 of the tonal instrument for played notes. As a result of the onset detection for played notes, the times of occurrence 412 of the tonal instrument in the music track are obtained. In other words, precise timestamps 412 for the beginning of played notes are obtained via the onset detection.
Additionally, a chromagram is determined based on the determined audio signal of the tonal instrument. The chromagram is a representation of the pitch content of an audio signal or a musical piece over time. The chromagram provides information about the presence and intensity of different pitch classes (or chroma) in the audio signal (e.g., regardless of the octave). An exemplary chromagram 420 is illustrated in subfigure (b). The abscissa of the chromagram 420 denotes time and the ordinate denotes the pitch class (e.g., C, D, E, F, G, A, B as in subfigure (b)). The intensity of each pitch class at a given time is encoded in the chromagram 420 as shown by the scale 430 on the right side of the chromagram 420. The value 1.0 denotes the highest intensity and the value 0.0 the lowest intensity.
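A bare-bones chromagram may be computed by folding the magnitude bins of a short-time Fourier transform onto the twelve pitch classes (C = 0, ..., B = 11), for example as follows. Frame and hop sizes are assumptions; dedicated library routines would normally be used instead.

```python
import numpy as np

def chromagram(signal, sr, frame=2048, hop=512):
    """Fold STFT magnitude bins onto the 12 pitch classes.
    Each column is normalized to a maximum of 1.0, matching the
    intensity scale described for chromagram 420."""
    chroma = []
    for start in range(0, len(signal) - frame + 1, hop):
        windowed = signal[start:start + frame] * np.hanning(frame)
        spec = np.abs(np.fft.rfft(windowed))
        freqs = np.fft.rfftfreq(frame, 1.0 / sr)
        frame_chroma = np.zeros(12)
        for f, mag in zip(freqs[1:], spec[1:]):  # skip the DC bin
            midi = 69 + 12 * np.log2(f / 440.0)  # frequency -> MIDI note number
            frame_chroma[int(round(midi)) % 12] += mag
        chroma.append(frame_chroma / (frame_chroma.max() + 1e-12))
    return np.array(chroma).T  # shape: (12 pitch classes, time frames)
```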
The pitches and lengths of the played notes are determined from the chromagram. The lengths of the played notes for the different pitches can be read or measured from the chromagram. In other words, a chromagram is constructed to detect the pitches and lengths of played notes. Sound pieces are extracted for the determined pitches of played notes from the determined audio signal of the tonal instrument based on the determined times of occurrence and lengths of the played notes. In other words, bits of the sound corresponding to the played notes are extracted to obtain multiple representations for the tonal instrument in the different pitches used in the music track.
The extracted sound pieces are grouped into groups of similar sound pieces. In the example of Fig. 4, the groups of similar sound pieces are groups for different pitches of played notes. For example, a respective group for the played notes C, D, E, F, G, A, B may be used. Each group comprises multiple of the extracted sound pieces. The grouping is performed automatically using one or more software algorithms. An exemplary algorithm for grouping the extracted sound pieces into groups for the different pitches of played notes is described in https://github.com/TylerMclaughlin/wav_clustering_workflow. However, the present disclosure is not limited to this specific algorithm. Other grouping algorithms may be used as well.
Then, analogously to what is described above, a respective sound that is common to each sound piece in the respective group of similar sound pieces is determined as a sound sample for the played notes of the tonal instrument. For example, statistical approaches such as cross-correlation may be used to determine the sound common to each sound piece in the respective group. In other words, for each group (representing a played note), statistical approaches or methods may be used to keep only the sound that is common in each of the groups, thereby dismissing other interfering sounds.
The resulting sound samples may be used as described above. For example, a virtual tonal instrument such as a virtual guitar or a virtual piano may be generated from the sound samples that reflects the characteristics of the tonal instrument used in the music track (e.g., the specific sound of the tonal instrument or the intensity with which the tonal instrument is played in the music track). This may make the virtual tonal instrument sound more natural. In case no sound samples were obtained for one or more notes playable by the tonal instrument, these sound samples may be generated based on the obtained (available, generated) sound samples to generate a full set of sound samples for the tonal instrument. For example, the method 100 may comprise interpolating and/or extrapolating (e.g., by pitch shifting) determined sound samples of the tonal instrument to determine a sound sample for a note not played by the tonal instrument in the music track.
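A naive pitch shift by resampling can illustrate the extrapolation step. Note that this simple approach also changes the sample length and playback speed, which real samplers compensate for; the disclosure does not prescribe a specific pitch-shifting method.

```python
import numpy as np

def pitch_shift(sample, semitones):
    """Naive pitch shift by linear-interpolation resampling.
    Raising the pitch by 12 semitones halves the sample length and
    doubles its playback speed -- a sketch, not a production method."""
    factor = 2.0 ** (semitones / 12.0)  # frequency ratio per semitone step
    old_idx = np.arange(len(sample))
    new_idx = np.arange(0, len(sample), factor)
    return np.interp(new_idx, old_idx, sample)
```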
The sound samples obtained by interpolation and/or extrapolation of the sound samples extracted from the music track may be used together with the sound samples extracted from the music track for generating a virtual instrument such that the virtual instrument can provide the full range of notes. In other words, the method 100 may comprise generating the virtual instrument based on the determined sound samples of the sound source.
For example, if the music track comprises a bass line consisting of three different notes with different durations, sound samples may be obtained for the three different notes according to the method 100. The sound samples obtained for the three different notes may be interpolated and/or extrapolated to the full scale to create a full instrument. The resulting virtual bass is able to provide the full scale of notes.
The proposed technology may further be used for identifying a sample underlying a music track. A "sample" is a short piece, segment or snippet of audio taken from a pre-existing music track (recording) and used in a new music track (composition). For example, the sample may comprise a portion of a song, a drumbeat, a vocal line, or any other sound extracted from an existing music track. A sample is usually repeated throughout the new music track. A method 500 for determining a sample underlying (forming the foundation of) a music track (music piece, song) is illustrated in Fig. 5.
The method 500 comprises determining 502 a spectral representation of the music track based on an audio file (audio signal) storing the music track. The spectral representation of the music track is a (e.g., graphical or mathematical) depiction of the frequency content of the music track over time. This spectral representation indicates the distribution of energy across different frequencies and how it evolves throughout the duration of the music track. For example, the spectral representation of the music track may be a spectrogram. The spectral representation may, e.g., be obtained by segmenting the (audio file/signal of the) music track into short overlapping time frames or windows, windowing each time frame or window with a window function to reduce artifacts, applying a Fourier transform to each windowed frame or window to convert from the time domain to the frequency domain and computing the magnitude spectrum from the output of the Fourier transform. However, the present disclosure is not limited thereto. Other techniques for determining the spectral representation of the music track may be used as well.
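The outlined steps (overlapping frames, window function, Fourier transform, magnitude) can be sketched directly. Frame and hop sizes are assumptions for the sketch.

```python
import numpy as np

def spectrogram(signal, frame=1024, hop=256):
    """Magnitude spectrogram following the steps outlined above:
    overlapping frames, Hann window, Fourier transform, magnitude."""
    window = np.hanning(frame)  # reduces windowing artifacts
    frames = []
    for start in range(0, len(signal) - frame + 1, hop):
        frames.append(np.abs(np.fft.rfft(signal[start:start + frame] * window)))
    return np.array(frames).T  # shape: (frequency bins, time frames)
```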
Further, the method 500 comprises determining 504 a repeating pattern in the spectral representation of the music track. Various techniques for detecting a repeating pattern in the spectral representation of the music track may be used. For example, pattern recognition techniques may be used to identify a repeating pattern. Additionally or alternatively, autocorrelation of the spectral representation of the music track (or a part thereof) with a time-shifted replica of itself (or a part thereof) may be used to identify a repeating pattern. Further additionally or alternatively, techniques like Fourier analysis or trained machine-learning models may be used to determine a repeating pattern in the spectral representation of the music track. Fig. 6 exemplarily shows in the left part spectral representations 610, ..., 640 of four different parts of a music track. The spectral representation 650 of an exemplary repeating pattern identified in the spectral representations 610, ..., 640 of the four different parts of the music track is illustrated in the right part of Fig. 6.
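The autocorrelation-style approach mentioned above can be illustrated by comparing the spectrogram with time-shifted copies of itself and returning the lag at which they match best, i.e., the period of a repeating pattern in frames. This is a minimal sketch; real repetition detectors are more sophisticated.

```python
import numpy as np

def repetition_lag(spec, min_lag=1):
    """Find the lag (in frames) at which the spectrogram best matches
    a time-shifted copy of itself -- the period of a repeating pattern.
    spec has shape (frequency bins, time frames)."""
    n = spec.shape[1]
    best_lag, best_score = None, -np.inf
    for lag in range(min_lag, n // 2 + 1):
        a, b = spec[:, :-lag], spec[:, lag:]
        score = np.mean(a * b)  # unnormalized correlation at this lag
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag
```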
Returning to Fig. 5, the method 500 additionally comprises extracting 506 sound pieces (excerpts, fragments, snippets) from the audio file comprising the determined repeating pattern. When extracting the sound pieces from the audio file comprising the determined repeating pattern, specific sections or portions of the audio signal represented by (encoded to) the audio file comprising the determined repeating pattern are isolated or separated. A plurality of sound pieces are obtained as a result of the extraction. The sound pieces are representations of the determined repeating pattern in the music track.
In addition, the method 500 comprises determining 508 a sound that is common to each of the extracted sound pieces as the sample underlying the music track. Analogously to what is described above, various techniques such as statistical approaches may be used to determine the respective sound that is common to each of the extracted sound pieces. The common sound may then be extracted as the sample underlying the music track. The method 500 thus makes it possible to identify and extract one or more samples underlying a music track. The method 500 is described above for a single sample. However, in case the music track is based on multiple samples, the above-described processing may be performed to discover and extract more than one sample in the music track.
The obtained sample underlying the music track may be used for various applications as will be described in the following in greater detail. For example, the obtained sample underlying the music track may be used to further analyze the composition of the music track. Alternatively or additionally, the obtained sample underlying the music track may be used to recognize whether the music track uses copyrighted material. For example, the method 500 may comprise determining whether the determined sample underlying the music track is taken from a group of music tracks (e.g., copyrighted music tracks). For example, the obtained sample underlying the music track may be compared to the music tracks in the group of music tracks or to samples of the music tracks in the group of music tracks. Based on the similarities of the obtained sample underlying the music track to the music tracks in the group of music tracks or to samples of the music tracks in the group of music tracks, it may be determined whether material from the group of music tracks is used in the music track.
The proposed technology may be performed on various devices. Fig. 7 illustrates an exemplary apparatus 700 comprising processing circuitry 710. The processing circuitry 710 is configured to receive an audio file 701 and process it as described herein (e.g., according to one of the methods 100 and 500 described above). For example, the processing circuitry 710 may be a single dedicated processor, a single shared processor, or a plurality of individual processors, some of which or all of which may be shared, a digital signal processor (DSP) hardware, an application specific integrated circuit (ASIC), a neuromorphic processor, a system-on-a-chip (SoC) or a field programmable gate array (FPGA). The processing circuitry 710 may optionally be coupled to, e.g., memory such as read only memory (ROM) for storing software, random access memory (RAM) and/or non-volatile memory. For example, the apparatus 700 may comprise memory configured to store instructions, which when executed by the processing circuitry 710, cause the processing circuitry 710 to perform the steps and methods described herein. The apparatus 700 may, e.g., be or be part of a server, a computing cloud, a personal computer or a mobile device (e.g., a mobile phone, a laptop computer or a tablet computer).
Examples of the present disclosure may provide a system that can analyze a piece of music in order to decompose it into a transcription (like a MIDI file) and a set of sample-based instruments that accurately reproduce the original. For example, if a user loves the sound of the piano in a song, the user may use the proposed technology on this song to obtain a virtual piano instrument that sounds the same and use it for his/her own music.
As described above, the same sounds (with or without pitch changes or small variations depending on the instrument) typically occur repeatedly in a piece of music as these are the building blocks of the piece. By applying source separation and further techniques such as rhythm analysis and audio analysis (waveform and/or spectral) in accordance with the proposed technology, a piece of music can be decomposed into a set of transcriptions (e.g., MIDI tracks) each associated with a number of samples making up a separate instrument. For example, the accurate transcription (e.g., a MIDI file) may be used to alter the original by changing the instrument(s) and/or recognize the musical performance (e.g., also if the speed has been altered). The sample sets (instruments) obtained according to the proposed technology may, e.g., be used to automatically create a reusable instrument that makes it possible to use a specific-sounding instrument in new work or to recognize samples from other copyrighted material.
The following examples pertain to further embodiments:
(1) An apparatus for processing an audio file storing a music track, the apparatus comprising processing circuitry configured to: perform source separation on the audio file to determine an audio signal of a sound source in the music track; determine times of occurrence of the sound source in the music track based on the determined audio signal of the sound source; extract sound pieces from the determined audio signal of the sound source at the determined times of occurrence of the sound source; group the extracted sound pieces into groups of similar sound pieces; and determine a respective sound that is common to each sound piece in the respective group of similar sound pieces as a sound sample of the sound source.
(2) The apparatus of (1), wherein the sound source is an unpitched percussion instrument, and wherein the processing circuitry is configured to determine the times of occurrence of the sound source in the music track by automatic beat detection.
(3) The apparatus of (2), wherein the unpitched percussion instrument is a drum kit, and wherein the groups of similar sound pieces are groups for the different percussions forming the drum kit.
(4) The apparatus of (1), wherein the sound source is a tonal instrument, and wherein the processing circuitry is configured to: determine the times of occurrence of the sound source in the music track by onset detection for played notes; and extract the sound pieces from the determined audio signal of the sound source by: determining a chromagram based on the determined audio signal of the sound source; determining pitches and lengths of the played notes from the chromagram; and extracting sound pieces for the determined pitches of played notes from the determined audio signal of the sound source based on the determined times of occurrence and lengths of the played notes.
(5) The apparatus of (4), wherein the groups of similar sound pieces are groups for different pitches of played notes.
(6) The apparatus of (4) or (5), wherein the processing circuitry is configured to: interpolate and/or extrapolate determined sound samples of the tonal instrument to determine a sound sample for a note not played by the tonal instrument in the music track.
(7) The apparatus of any one of (1) to (6), wherein the processing circuitry is configured to: generate a virtual instrument based on the determined sound samples of the sound source.
(8) The apparatus of (7), wherein the processing circuitry is configured to: cause output of a graphical user interface on a display, wherein the graphical user interface shows graphical icons allowing a user to interact with the virtual instrument.
(9) The apparatus of any one of (1) to (8), wherein the processing circuitry is configured to: determine a transcription of the music track based on the samples of the sound source and the determined times of occurrence of the sound source in the music track.
(10) The apparatus of (9), wherein the processing circuitry is configured to: determine samples and times of occurrence for at least one further sound source; and determine the transcription of the music track further based on the determined samples and times of occurrence for the at least one further sound source.
(11) The apparatus of (9) or (10), wherein the processing circuitry is further configured to identify the music track based on the transcription of the music track.
(12) The apparatus of any one of (9) to (11), wherein the transcription of the music track is a Musical Instrument Digital Interface, MIDI, file.
(13) The apparatus of any one of (9) to (12), wherein the processing circuitry is further configured to: generate an audio signal for another sound source based on the transcription of the music track; and modify the audio file by replacing the audio signal of the sound source with the audio signal for the other sound source.
(14) The apparatus of (13), wherein the sound source and the other sound source are different instruments.
(15) A method for processing an audio file storing a music track, the method comprising: performing source separation on the audio file to determine an audio signal of a sound source in the music track; determining times of occurrence of the sound source in the music track based on the determined audio signal of the sound source; extracting sound pieces from the determined audio signal of the sound source at the determined times of occurrence of the sound source; grouping the extracted sound pieces into groups of similar sound pieces; and determining a respective sound that is common to each sound piece in the respective group of similar sound pieces as a sound sample of the sound source.
(16) An apparatus for determining a sample underlying a music track, the apparatus comprising processing circuitry configured to: determine a spectral representation of the music track based on an audio file storing the music track; determine a repeating pattern in the spectral representation of the music track; extract sound pieces from the audio file comprising the determined repeating pattern; determine a sound that is common to each of the extracted sound pieces as the sample underlying the music track.
(17) The apparatus of (16), wherein the processing circuitry is further configured to determine whether the determined sample underlying the music track is taken from a group of music tracks.
(18) A method for determining a sample underlying a music track, the method comprising: determining a spectral representation of the music track based on an audio file storing the music track; determining a repeating pattern in the spectral representation of the music track; extracting sound pieces from the audio file comprising the determined repeating pattern; determining a sound that is common to each of the extracted sound pieces as the sample underlying the music track.
(19) A non-transitory machine-readable medium having stored thereon a program having a program code for performing the method according to (15) or (18), when the program is executed on a processor or a programmable hardware.
(20) A program having a program code for performing the method according to (15) or (18), when the program is executed on a processor or a programmable hardware.

The aspects and features described in relation to a particular one of the previous examples may also be combined with one or more of the further examples to replace an identical or similar feature of that further example or to additionally introduce the features into the further example.
Examples may further be or relate to a (computer) program including a program code to execute one or more of the above methods when the program is executed on a computer, processor or other programmable hardware component. Thus, steps, operations or processes of different ones of the methods described above may also be executed by programmed computers, processors or other programmable hardware components. Examples may also cover program storage devices, such as digital data storage media, which are machine-, processor- or computer-readable and encode and/or contain machine-executable, processor-executable or computer-executable programs and instructions. Program storage devices may include or be digital storage devices, magnetic storage media such as magnetic disks and magnetic tapes, hard disk drives, or optically readable digital data storage media, for example. Other examples may also include computers, processors, control units, (field) programmable logic arrays ((F)PLAs), (field) programmable gate arrays ((F)PGAs), graphics processor units (GPU), ASICs, integrated circuits (ICs) or SoCs programmed to execute the steps of the methods described above.
It is further understood that the disclosure of several steps, processes, operations or functions disclosed in the description or claims shall not be construed to imply that these operations are necessarily dependent on the order described, unless explicitly stated in the individual case or necessary for technical reasons. Therefore, the previous description does not limit the execution of several steps or functions to a certain order. Furthermore, in further examples, a single step, function, process or operation may include and/or be broken up into several substeps, -functions, -processes or -operations.
If some aspects have been described in relation to a device or system, these aspects should also be understood as a description of the corresponding method. For example, a block, device or functional aspect of the device or system may correspond to a feature, such as a method step, of the corresponding method. Accordingly, aspects described in relation to a method shall also be understood as a description of a corresponding block, a corresponding element, a property or a functional feature of a corresponding device or a corresponding system.
The following claims are hereby incorporated in the detailed description, wherein each claim may stand on its own as a separate example. It should also be noted that although in the claims a dependent claim refers to a particular combination with one or more other claims, other examples may also include a combination of the dependent claim with the subject matter of any other dependent or independent claim. Such combinations are hereby explicitly proposed, unless it is stated in the individual case that a particular combination is not intended. Furthermore, features of a claim should also be included for any other independent claim, even if that claim is not directly defined as dependent on that other independent claim.

Claims

What is claimed is:
1. An apparatus for processing an audio file storing a music track, the apparatus comprising processing circuitry configured to: perform source separation on the audio file to determine an audio signal of a sound source in the music track; determine times of occurrence of the sound source in the music track based on the determined audio signal of the sound source; extract sound pieces from the determined audio signal of the sound source at the determined times of occurrence of the sound source; group the extracted sound pieces into groups of similar sound pieces; and determine a respective sound that is common to each sound piece in the respective group of similar sound pieces as a sound sample of the sound source.
2. The apparatus of claim 1, wherein the sound source is an unpitched percussion instrument, and wherein the processing circuitry is configured to determine the times of occurrence of the sound source in the music track by automatic beat detection.
3. The apparatus of claim 2, wherein the unpitched percussion instrument is a drum kit, and wherein the groups of similar sound pieces are groups for the different percussions forming the drum kit.
4. The apparatus of claim 1, wherein the sound source is a tonal instrument, and wherein the processing circuitry is configured to: determine the times of occurrence of the sound source in the music track by onset detection for played notes; and extract the sound pieces from the determined audio signal of the sound source by: determining a chromagram based on the determined audio signal of the sound source; determining pitches and lengths of the played notes from the chromagram; and extracting sound pieces for the determined pitches of played notes from the determined audio signal of the sound source based on the determined times of occurrence and lengths of the played notes.
5. The apparatus of claim 4, wherein the groups of similar sound pieces are groups for different pitches of played notes.
6. The apparatus of claim 4, wherein the processing circuitry is configured to: interpolate and/or extrapolate determined sound samples of the tonal instrument to determine a sound sample for a note not played by the tonal instrument in the music track.
7. The apparatus of claim 1, wherein the processing circuitry is configured to: generate a virtual instrument based on the determined sound samples of the sound source.
8. The apparatus of claim 7, wherein the processing circuitry is configured to: cause output of a graphical user interface on a display, wherein the graphical user interface shows graphical icons allowing a user to interact with the virtual instrument.
9. The apparatus of claim 1, wherein the processing circuitry is configured to: determine a transcription of the music track based on the samples of the sound source and the determined times of occurrence of the sound source in the music track.
10. The apparatus of claim 9, wherein the processing circuitry is configured to: determine samples and times of occurrence for at least one further sound source; and determine the transcription of the music track further based on the determined samples and times of occurrence for the at least one further sound source.
11. The apparatus of claim 9, wherein the processing circuitry is further configured to identify the music track based on the transcription of the music track.
12. The apparatus of claim 9, wherein the transcription of the music track is a Musical Instrument Digital Interface, MIDI, file.
13. The apparatus of claim 9, wherein the processing circuitry is further configured to: generate an audio signal for another sound source based on the transcription of the music track; and modify the audio file by replacing the audio signal of the sound source with the audio signal for the other sound source.
14. The apparatus of claim 13, wherein the sound source and the other sound source are different instruments.
15. A method for processing an audio file storing a music track, the method comprising: performing source separation on the audio file to determine an audio signal of a sound source in the music track; determining times of occurrence of the sound source in the music track based on the determined audio signal of the sound source; extracting sound pieces from the determined audio signal of the sound source at the determined times of occurrence of the sound source; grouping the extracted sound pieces into groups of similar sound pieces; and determining a respective sound that is common to each sound piece in the respective group of similar sound pieces as a sound sample of the sound source.
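The grouping and common-sound steps of the method of claim 15 can be sketched as clustering equal-length sound pieces by normalized correlation and averaging each cluster. This is a minimal illustration by the editor under simplifying assumptions (equal piece lengths, a fixed similarity threshold), not the claimed implementation.

```python
import numpy as np

def group_and_average(pieces, threshold=0.9):
    """Group equal-length sound pieces by normalized correlation and
    average each group to estimate the common underlying sample."""
    groups = []  # list of lists of piece indices
    for i, p in enumerate(pieces):
        p = p / (np.linalg.norm(p) + 1e-12)
        for g in groups:
            ref = pieces[g[0]]
            ref = ref / (np.linalg.norm(ref) + 1e-12)
            if float(np.dot(p, ref)) > threshold:
                g.append(i)   # similar enough: join existing group
                break
        else:
            groups.append([i])  # start a new group
    samples = [np.mean([pieces[i] for i in g], axis=0) for g in groups]
    return groups, samples
```

Averaging the pieces in a group attenuates the uncorrelated residue (other sources, noise) and retains the sound common to the group, i.e. the sound sample.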
16. An apparatus for determining a sample underlying a music track, the apparatus comprising processing circuitry configured to: determine a spectral representation of the music track based on an audio file storing the music track; determine a repeating pattern in the spectral representation of the music track; extract sound pieces from the audio file comprising the determined repeating pattern; determine a sound that is common to each of the extracted sound pieces as the sample underlying the music track.
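The repeating-pattern determination of claim 16 can be illustrated by autocorrelating magnitude-spectrogram frames over time and taking the best lag as the repetition period. The editor's sketch below makes assumed choices (Hann window, mean-subtracted frames, unnormalized correlation) that are not stated in the application.

```python
import numpy as np

def repeating_lag(signal, frame=1024, hop=512, max_lag=None):
    """Estimate the repetition period (in frames) of a music track by
    autocorrelating its magnitude-spectrogram frames over time."""
    n_frames = 1 + (len(signal) - frame) // hop
    window = np.hanning(frame)
    spec = np.stack([np.abs(np.fft.rfft(window * signal[t * hop:t * hop + frame]))
                     for t in range(n_frames)])
    spec = spec - spec.mean(axis=0)  # remove the average spectral shape
    max_lag = max_lag or n_frames // 2
    # Similarity between the spectrogram and itself shifted by each lag.
    score = np.array([(spec[:-lag] * spec[lag:]).sum() for lag in range(1, max_lag)])
    return int(np.argmax(score)) + 1
```

Sound pieces one period apart can then be extracted and their common sound taken as the sample underlying the track.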
17. The apparatus of claim 16, wherein the processing circuitry is further configured to determine whether the determined sample underlying the music track is taken from a group of music tracks.
18. A method for determining a sample underlying a music track, the method comprising: determining a spectral representation of the music track based on an audio file storing the music track; determining a repeating pattern in the spectral representation of the music track; extracting sound pieces from the audio file comprising the determined repeating pattern; determining a sound that is common to each of the extracted sound pieces as the sample underlying the music track.
19. A non-transitory machine-readable medium having stored thereon a program having a program code for performing the method according to claim 15, when the program is executed on a processor or programmable hardware.
20. A program having a program code for performing the method according to claim 15, when the program is executed on a processor or programmable hardware.
PCT/EP2025/056195 2024-03-11 2025-03-06 Apparatus and method for processing an audio file storing a music track and apparatus and method for determining a sample underlying a music track Pending WO2025190785A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP24162532.6 2024-03-11
EP24162532 2024-03-11

Publications (1)

Publication Number Publication Date
WO2025190785A1 true WO2025190785A1 (en) 2025-09-18

Family

ID=90364839

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2025/056195 Pending WO2025190785A1 (en) 2024-03-11 2025-03-06 Apparatus and method for processing an audio file storing a music track and apparatus and method for determining a sample underlying a music track

Country Status (1)

Country Link
WO (1) WO2025190785A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060075884A1 (en) * 2004-10-11 2006-04-13 Frank Streitenberger Method and device for extracting a melody underlying an audio signal
EP2342708B1 (en) * 2008-10-15 2012-07-18 Museeka S.A. Method for analyzing a digital music audio signal
US9099064B2 (en) * 2011-12-01 2015-08-04 Play My Tone Ltd. Method for extracting representative segments from music
CN110162660A (en) * 2019-05-28 2019-08-23 维沃移动通信有限公司 Audio processing method, device, mobile terminal and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CHEN KE ET AL: "Pac-HuBERT: Self-Supervised Music Source Separation Via Primitive Auditory Clustering And Hidden-Unit Bert", 2023 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING WORKSHOPS (ICASSPW), IEEE, 4 June 2023 (2023-06-04), pages 1 - 5, XP034389057, [retrieved on 20230802], DOI: 10.1109/ICASSPW59220.2023.10193575 *
JEONGSOO PARK ET AL: "Separation of Instrument Sounds using Non-negative Matrix Factorization with Spectral Envelope Constraints", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 12 January 2018 (2018-01-12), XP080852422 *
MO SHENTONG ET AL: "Semantic Grouping Network for Audio Source Separation", ARXIV.ORG, 1 July 2024 (2024-07-01), XP093277298, Retrieved from the Internet <URL:https://arxiv.org/pdf/2407.03736> [retrieved on 20250519] *

Similar Documents

Publication Publication Date Title
Xi et al. GuitarSet: A Dataset for Guitar Transcription.
CN112382257B (en) Audio processing method, device, equipment and medium
Brossier Automatic annotation of musical audio for interactive applications
US6798886B1 (en) Method of signal shredding
JP6735100B2 (en) Automatic transcription of music content and real-time music accompaniment
US8198525B2 (en) Collectively adjusting tracks using a digital audio workstation
US8153882B2 (en) Time compression/expansion of selected audio segments in an audio file
US9012754B2 (en) System and method for generating a rhythmic accompaniment for a musical performance
Salamon et al. An analysis/synthesis framework for automatic f0 annotation of multitrack datasets
Dixon On the computer recognition of solo piano music
US8831762B2 (en) Music audio signal generating system
US9263018B2 (en) System and method for modifying musical data
US9251773B2 (en) System and method for determining an accent pattern for a musical performance
Su et al. Sparse modeling of magnitude and phase-derived spectra for playing technique classification
CN108369800B (en) Sound processing device
JP2008250008A (en) Musical sound processing apparatus and program
Lerch Software-based extraction of objective parameters from music performances
US20090084250A1 (en) Method and device for humanizing musical sequences
Liang et al. Musical Offset Detection of Pitched Instruments: The Case of Violin.
Cuesta et al. A framework for multi-f0 modeling in SATB choir recordings
WO2025190785A1 (en) Apparatus and method for processing an audio file storing a music track and apparatus and method for determining a sample underlying a music track
Primavera et al. Audio morphing for percussive hybrid sound generation
CN113412513A (en) Sound signal synthesis method, training method for generating model, sound signal synthesis system, and program
US20250299655A1 (en) Generating musical instrument accompaniments
Suruceanu Automatic Music Transcription From Monophonic Signals

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 25709715

Country of ref document: EP

Kind code of ref document: A1