WO2025079040A1 - Method of and system for attributing influence to training data based on an output of a generative model - Google Patents
Method of and system for attributing influence to training data based on an output of a generative model
- Publication number
- WO2025079040A1 (PCT/IB2024/059997)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- generated
- instance
- training
- influence
- training data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Definitions
- the present technology relates to artificial intelligence in general and more specifically to methods, systems and non-transitory storage mediums for attributing influence to training data based on the output of a generative machine learning model.
- One or more implementations of the present technology may provide and/or broaden the scope of approaches to and/or methods of achieving the aims and objects of the present technology.
- a method for determining an influence of training data comprising: receiving a training dataset comprising a plurality of training data instances, a given generative machine learning model having been trained using the training dataset; receiving a generated data instance having been generated by the generative machine learning model; extracting first features from the training data instances, thereby obtaining source embeddings; extracting second features of the generated data instance, thereby obtaining a generated embedding; determining similarity measure values between the generated embedding and the source embeddings; assigning an influence value to at least one of the training data instances based on the similarity measure values, the influence value being indicative of an influence in a generation of the generated data instance by the generative machine learning model; and outputting the influence value and an indication of the training data instance assigned the influence value.
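The claimed pipeline can be illustrated with a short sketch. This is a hypothetical, minimal implementation: cosine similarity is assumed as the similarity measure and the similarities are normalized into influence values, neither of which the claim prescribes; all function names are illustrative.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def attribute_influence(source_embeddings, generated_embedding):
    """Assign an influence value to each training data instance from the
    similarity between its source embedding and the generated embedding."""
    sims = np.array([cosine_similarity(e, generated_embedding)
                     for e in source_embeddings])
    # Normalize so the influence values sum to 1 (an assumed convention).
    return sims / sims.sum()

# Toy example: three training instances; the generated instance is a slightly
# perturbed copy of the first one, so instance 0 should receive most influence.
rng = np.random.default_rng(0)
source_embeddings = rng.normal(size=(3, 8))
generated_embedding = source_embeddings[0] + 0.1 * rng.normal(size=8)
influence = attribute_influence(source_embeddings, generated_embedding)
most_influential = int(np.argmax(influence))
```

In practice the source embeddings would come from the feature extractors discussed later in the description, and the similarity measure could equally be a Euclidean distance or a learned metric.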
- the method further comprises: determining that the influence value exceeds a threshold value; and flagging or blocking the generated data instance.
- the method further comprises: regrouping the training data instances into a plurality of instance groups and extracting first features from the training data instances contained in each instance group, thereby obtaining a cloud of source embeddings for each instance group; dividing the generated data instance into a plurality of generated chunks and extracting second features from each generated chunk, thereby obtaining a cloud of generated embeddings; determining given ones of the instance groups that had a greatest influence in a generation of the generated data instance using a statistical analysis method, the clouds of source embeddings and the cloud of generated embeddings; and outputting an identification of the given ones of the instance groups.
- the method further comprises: modifying training of the generative machine learning model to reduce the influence in a generation of a future generated data instance by the generative machine learning model.
- a system for determining an influence of training data comprising: a processor; a non-transitory storage medium operatively connected to the processor, the non-transitory storage medium comprising computer-readable instructions; the processor, upon executing the instructions, being configured to: receive a training dataset comprising a plurality of training data instances, a given generative machine learning model having been trained using the training dataset; receive a generated data instance having been generated by the generative machine learning model; extract first features from the training data instances, thereby obtaining source embeddings; extract second features of the generated data instance, thereby obtaining a generated embedding; determine similarity measure values between the generated embedding and the source embeddings; assign an influence value to at least one of the training data instances based on the similarity measure values, the influence value being indicative of an influence in a generation of the generated data instance by the generative machine learning model; and output the influence value and an indication of the at least one of the training data instances to which the influence value was assigned.
- the processor upon executing the instructions, is further configured to: divide each training data instance into a respective plurality of training instance chunks and extract first features from each training instance chunk, thereby obtaining a cloud of source embeddings for each training data instance; divide the generated data instance into a plurality of generated chunks and extract second features from each generated chunk, thereby obtaining a cloud of generated embeddings; determine given ones of the training data instances that had a greatest influence in a generation of the generated data instance using a statistical analysis method, the clouds of source embeddings and the cloud of generated embeddings; and output an identification of the given ones of the training data instances.
- the processor upon executing the instructions, is further configured to: determine that the greatest influence exceeds a threshold value; and flag or block the generated data instance.
- the processor upon executing the instructions, is further configured to: modify training of the generative machine learning model to reduce the influence in a generation of a future generated data instance by the generative machine learning model.
- computer readable storage medium, also referred to as “storage medium” and “storage”, is intended to include non-transitory media of any nature and kind whatsoever, including without limitation RAM, ROM, disks (CD-ROMs, DVDs, floppy disks, hard drives, etc.), USB keys, solid-state drives, tape drives, etc.
- a plurality of components may be combined to form the computer information storage media, including two or more media components of a same type and/or two or more media components of different types.
- the degree of precision required in such an indication depends on the extent of any prior understanding about the interpretation to be given to information being exchanged as between the sender and the recipient of the indication. For example, if it is understood prior to a communication between a sender and a recipient that an indication of an information element will take the form of a database key for an entry in a particular table of a predetermined database containing the information element, then the sending of the database key is all that is required to effectively convey the information element to the recipient, even though the information element itself was not transmitted as between the sender and the recipient of the indication.
- Figure 1 depicts a schematic diagram of a computing device in accordance with one or more non-limiting implementations of the present technology.
- Figure 2 depicts a schematic diagram of a communication system in accordance with one or more non-limiting implementations of the present technology.
- FIG. 1 there is shown a computing device 100 suitable for use with some implementations of the present technology, the computing device 100 comprising various hardware components including one or more single or multicore processors collectively represented by processor 110, a graphics processing unit (GPU) 111, a solid-state drive 120, a random-access memory 130, a display interface 140, and an input/output interface 150.
- the solid-state drive 120 stores program instructions suitable for being loaded into the random-access memory 130 and executed by the processor 110 and/or the GPU 111 for attributing an influence to training data used to train a generative machine learning model based on an output of the generative machine learning model.
- the program instructions may be part of a library or an application.
- the second server 220 is configured to, inter alia: receive the training dataset that was used for training the generative machine learning model executed by the first server 210; receive a generated data instance having been generated by the generative machine learning model running on the first server 210; extract features of each one of the training data instances, thereby obtaining source embeddings each being associated with a respective one of the training data instances; extract features of the generated data instance, thereby obtaining a generated embedding; determine a similarity measure value between the generated embedding and each one of the source embeddings; assign an influence value to at least one of the training data instances based on the similarity measure value, the influence value being indicative of an influence of a respective one of the training data instances in a generation of the generated data instance; and output the influence value and an indication of the at least one of the training data instances to which the influence value was assigned.
- the second server 220 comprises a communication interface (not shown) configured to communicate with various entities (such as the database 230, for example, and other devices potentially coupled to the communication network 240) via the communication network 240.
- the second server 220 further comprises at least one computer processor (e.g., the processing device of the computing device 100) operationally connected with the communication interface and structured and configured to execute various processes to be described herein.
- the database 230 is configured to store, inter alia: (i) the training dataset comprising the plurality of training data instances, (ii) the data instance(s) generated by the first server 210 and/or (iii) the influence values determined by the second server 220.
- Figure 3 depicts a flowchart of a method 300 for attributing an influence to training data based on an output of a generative machine learning model, in accordance with one or more non-limiting implementations of the present technology.
- the processing device extracts features for each training data instance received at step 302 using a given feature extractor, thereby obtaining an embedding, i.e., a feature vector in an n-dimension embedding space, for each training data instance.
- the feature extractor to be used at step 306 is chosen at least based on the type of the training data instances. For example, when the training data instances correspond to training images, the feature extractor is adapted to extract features from images. Examples of such an adequate feature extractor for images may comprise: a convolutional neural network (CNN), an autoencoder, a histogram of oriented gradients (HOG) extractor, etc.
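As a rough illustration of this type-based selection, the following sketch dispatches to trivial stand-in extractors. The extractors and their names are assumptions for illustration only; a real system would use a CNN, autoencoder, or HOG extractor as noted above.

```python
import numpy as np

def image_extractor(instance):
    # Stand-in for a CNN/HOG image extractor: a normalized intensity histogram.
    hist, _ = np.histogram(instance, bins=8, range=(0.0, 1.0))
    return hist / max(hist.sum(), 1)

def audio_extractor(instance):
    # Stand-in for a spectrogram-based audio extractor: magnitude spectrum.
    return np.abs(np.fft.rfft(instance))

# Registry keyed by the type of the training data instances.
EXTRACTORS = {"image": image_extractor, "audio": audio_extractor}

def extract_features(instance, data_type):
    """Choose the feature extractor based on the data type, per step 306."""
    try:
        return EXTRACTORS[data_type](instance)
    except KeyError:
        raise ValueError(f"no feature extractor registered for {data_type!r}")

embedding = extract_features(np.linspace(0.0, 1.0, 64), "image")
```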
- step 306 comprises, for each training data instance, extracting an embedding for each training instance chunk and combining the embeddings of the training instance chunks together to obtain a source embedding for the training data instance.
- step 308 comprises extracting an embedding for each generated instance chunk and combining the embeddings of the generated instance chunks together to obtain a generated embedding for the generated data instance. For example, for each data instance, the embeddings of the chunks may be added together, aggregated, averaged, etc.
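The chunk-and-combine step can be sketched as follows, assuming averaging as the combination strategy and using an identity extractor purely for illustration (any of the extractors discussed above could be substituted):

```python
import numpy as np

def chunk(instance, chunk_size):
    """Divide a data instance into fixed-size chunks (any partial final
    chunk is dropped in this sketch)."""
    n = len(instance) // chunk_size
    return [instance[i * chunk_size:(i + 1) * chunk_size] for i in range(n)]

def combine_chunk_embeddings(chunks, extract):
    """Extract an embedding per chunk, then average them into a single
    embedding for the whole data instance (steps 306/308)."""
    return np.mean([extract(c) for c in chunks], axis=0)

signal = np.arange(12, dtype=float)
chunks = chunk(signal, 4)  # three chunks of length 4
# Identity "extractor" so the combination is easy to inspect.
embedding = combine_chunk_embeddings(chunks, extract=lambda c: c)
```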
- two types of optimization can be done: on n and on m. Optimization of m can be done with PCA, where a reduction on n and m, independently and also simultaneously, preserves attribution accuracy. In at least one implementation, this optimization improves speed by a factor of 100 (from 50 s to 0.5 s on an embedding group with 100 million rows) and memory by a factor of 150 (from 75 GB to 0.5 GB on the same embedding group).
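The PCA-based reduction mentioned above might be sketched with plain NumPy as follows; the description does not mandate a particular PCA implementation, and the dimensions here are toy values rather than the 100-million-row groups cited.

```python
import numpy as np

def pca_reduce(embeddings, k):
    """Project embeddings of dimension m onto their top-k principal
    components, reducing storage and similarity-computation cost."""
    centered = embeddings - embeddings.mean(axis=0)
    # SVD of the centered data; rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 32))   # 200 embeddings of dimension 32
reduced = pca_reduce(embeddings, k=8)     # reduced to dimension 8
```

Similarity measures are then computed in the reduced space, which is where the reported speed and memory gains would come from.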
- training instance chunks are regrouped into source chunk groups and generated instance chunks are similarly regrouped into generated chunk groups. Embeddings are then generated for each source chunk group, for each training data instance, and embeddings are also generated for each generated chunk group.
- a first similarity value is obtained for each source chunk group based on its respective embedding and the embedding of its respective generated chunk group of the generated data instance, and a second similarity value is obtained by combining together the first similarity values obtained for the training data instance.
- the first similarity values may be added, averaged, etc.
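The two-level similarity described above can be sketched as follows, assuming cosine similarity for the first similarity values and averaging for the second (addition would work equally well, per the passage above); the pairing of source and generated chunk groups is an illustrative assumption.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def instance_similarity(source_group_embeddings, generated_group_embeddings):
    """First similarity values: one per (source group, generated group) pair.
    Second similarity value: their combination (here, the average)."""
    firsts = [cosine(s, g)
              for s, g in zip(source_group_embeddings, generated_group_embeddings)]
    return float(np.mean(firsts)), firsts

# Toy data: the generated chunk groups closely track the source chunk groups,
# so the combined (second) similarity value should be high.
rng = np.random.default_rng(1)
src = rng.normal(size=(4, 16))
gen = src + 0.05 * rng.normal(size=(4, 16))
second, firsts = instance_similarity(src, gen)
```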
- while training data instances may be part of instance groups and an embedding may be generated for each instance group, an individual embedding may still be generated for each training data instance, or for at least some of the training data instances.
- a similarity measure value and an influence value are determined for each instance group and also for each training data instance.
- an influence value is calculated for each training data instance based on its respective similarity measure value determined at step 310.
- step 312 comprises iteratively determining the influence values, i.e., the influence value of a first training data instance is determined, then the influence value of a second training data instance is determined, then the influence value of a third training data instance, and so on.
- the method 300 may further comprise a step of iteratively adding together the determined influence values, thereby obtaining a total influence value, and a step of comparing at each iteration the total influence value to a total influence threshold. For example, after calculating the influence values of the first and second training data instances, the determined influence values for the first and second training data instances are added together to obtain a total influence value and the total influence value is compared to the total influence threshold.
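The running accumulation described above can be sketched as follows. Stopping once the total exceeds the threshold, and processing values in descending order, are assumptions for illustration; the text only specifies the per-iteration addition and comparison.

```python
def accumulate_influence(influence_values, total_influence_threshold):
    """Iteratively add influence values, comparing the running total to the
    total influence threshold at each iteration; return the total and the
    indices accumulated before the threshold was exceeded."""
    total, retained = 0.0, []
    for idx, value in enumerate(influence_values):
        total += value
        retained.append(idx)
        if total > total_influence_threshold:
            break
    return total, retained

# Influence values assumed sorted in descending order.
total, retained = accumulate_influence(
    [0.30, 0.25, 0.10, 0.05], total_influence_threshold=0.5)
```

Here the first two values already push the total past 0.5, so iteration stops after the second training data instance, matching the worked example in the passage above.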
- a method for determining an influence of training data combines one or more portions of one or more embodiments described above. These portions include: receiving a training dataset and a generated data instance, obtaining embeddings, determining influence (or training data instances or instance groups with the greatest influence), outputting the influence (or training data instances or instance groups with the greatest influence), and flagging or blocking the generated data instance.
- the receiving a training dataset and a generated data instance may include: receiving a training dataset comprising a plurality of training data instances (where a given generative machine learning model was trained using the training dataset) and receiving a generated data instance (that was generated by the generative machine learning model).
- the obtaining embeddings may include obtaining source embeddings (or a cloud of source embeddings) and a generated embedding (or a cloud of generated embeddings).
- the determining influence may be assigning an influence value based on similarity between the generated embedding and the source embedding, or determining which training data instances or instance groups had the greatest influence in the generation of the generated data instance.
- the outputting the influence may be outputting the influence value and an indication of the training data instance assigned the influence value, or outputting the training data instance or the instance groups that had the greatest influence in the generation of the generated data instance.
- the flagging or blocking the generated data instance may cause the model to regenerate a new generated data instance.
- the training musical pieces may come from the MagnaTagATune dataset, which offers a diverse representation of audio features with accompanying annotations and strong community validation, providing a well-rounded estimation of similarity based on human-tagged music.
- the MagnaTagATune dataset contains about 25,863 29-second music excerpts, each belonging to one of about 5223 songs, from about 445 albums, across about 230 artists.
- the clips span a broad range of genres like Classical, New Age, Electronica, Rock, Pop, World, Jazz, Blues, Metal, Punk, and more.
- the training musical pieces may come from the Free Music Archive (FMA) which is an interactive library of high-quality, legal audio downloads, curated by established audio curators and institutions. This database provides a rich trove of music across multiple genres, and each piece is available under various Creative Commons licenses or through direct permissions from the artists.
- the feature extractor may be the Audio-Adapted VGG model, which is another adaptation of the VGG model, modified for audio data.
- This model processes spectrograms with an architecture reminiscent of its visual counterpart but refocused on audio frequency bands.
- the convolutional layers are designed to recognize spatial patterns within spectrograms.
- although its foundational architecture is borrowed from image processing, the adaptability of this version of VGG demonstrates its efficacy in analyzing intricate audio data, particularly when the focus is on detailed spectrogram interpretation.
- each training musical piece is chunked into a plurality of training musical segments and an embedding is generated for each training musical segment. An average embedding is then determined using the embeddings of the training musical segments.
- each training musical piece may be chunked into 3-second musical segments and an embedding is extracted for each musical segment.
- an average embedding is calculated across all musical segments.
- STD-PCA Sparse Tikhonov-Regularized Discriminant-PCA
- STD-PCA is a refined version of traditional PCA, introducing elements of sparse and Tikhonov regularization. This combination often results in reduced dimensions that are more interpretable, particularly beneficial when variables or features exhibit collinearity.
- STD-PCA offers a balance between capturing variance and ensuring interpretability.
- the pie_share corresponds to the maximum total threshold described above and is indicative of the total share of attribution to be allocated among the training musical pieces.
- the pie_share may be set to 0.5 by default, meaning that 50% of the "attribution pie" will be distributed.
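The pie_share allocation might be sketched as follows, assuming (as an illustration, not from the description) that the distributed share is split in proportion to the similarity measure values of the training musical pieces:

```python
import numpy as np

def allocate_pie(similarities, pie_share=0.5):
    """Distribute `pie_share` of the attribution pie among training pieces
    in proportion to their similarity values; the remaining share of the
    pie is left unattributed."""
    sims = np.asarray(similarities, dtype=float)
    return pie_share * sims / sims.sum()

# Four training pieces with similarity values summing to 2.0;
# 50% of the attribution pie is distributed among them.
shares = allocate_pie([0.8, 0.6, 0.4, 0.2], pie_share=0.5)
```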
- the above-described methodology is model agnostic, structurally decoupled from the model code itself, and designed for broad applicability across various generative model architectures, enabling its application to existing pretrained models. This may ensure relevance of the method as Gen-AI technologies evolve, offering a standardized solution for all Gen-AI applications, not limited to music. It emphasizes a universal approach for models trained on copyrighted material, rather than bespoke solutions tailored to specific media — i.e., text, image, audio, or video — or to specific models such as latent diffusion models, GANs, or Transformers. By being independent of any specific generative model architecture, the method can be applied across a wide array of technologies.
- for Gen-AI for materials or molecules, similarity measures might include metrics related to production cost or environmental impact — special considerations of materials that are not obvious from the data alone — or functional similarity, as opposed to structural similarity.
- functional similarity might include measures for functional utility or integration with surroundings, which likewise might not be obvious from architectural design data alone.
- an attribution cutoff is incorporated into the method. This may help in dealing with the “long-tail” of training data instances that have less influence on the generated output but nevertheless display a non-trivial degree of similarity. Such cut-off mechanisms also reflect industry standards of royalty distribution, where ownership is typically assigned to a relatively short list of contributors to the production — e.g., songwriter, lyricist, producer, etc. Having a predetermined or empirically derived threshold ensures that attribution is both practical and meaningful, avoiding the noise introduced by large numbers of attributed data points, most of which demonstrate negligible influence on the output. As generative processes are inherently stochastic, variations across individual generative outputs will also naturally tend to distribute attribution across relevant portions of the training data as the system continues to be used.
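A minimal sketch of such a cutoff, assuming a simple fixed influence threshold (the description allows a predetermined or empirically derived one) and hypothetical instance identifiers:

```python
def apply_cutoff(influence_by_instance, cutoff):
    """Keep only training data instances whose influence meets the cutoff,
    discarding the long tail of negligible contributors."""
    return {instance: value
            for instance, value in influence_by_instance.items()
            if value >= cutoff}

# Hypothetical attribution results: two meaningful contributors and a
# long tail of instances with negligible influence.
influence = {"song_a": 0.41, "song_b": 0.07, "song_c": 0.004, "song_d": 0.001}
credited = apply_cutoff(influence, cutoff=0.05)
```

Only the instances above the cutoff remain in the credited set, mirroring the short-list convention of royalty distribution described above.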
- the method may flag or block this output, where flagging is the option in which a user is notified of the risk on a user interface (UI), and blocking is the option in which the UI requires regenerating a new output with an indication that the model owner is not allowing rights access to that output as it infringes on copyright agreements under which the model was trained.
- the method accordingly prevents copyright infringement by ensuring that any particular outputs that are too closely tied to certain training data instances are not distributed or used, thereby safeguarding the rights of original content creators where so determined in the licensing agreement of the training dataset used.
- the method may employ software tools that are configured to flag outputs when a similarity threshold exceeds a predetermined level, effectively serving as a digital rights management tool.
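Such a rights-management check might be sketched as follows. The description speaks of flagging or blocking against a threshold; the use of two separate thresholds (a lower one for flagging, a higher one for blocking) is an assumption for illustration.

```python
def review_output(influence_value, flag_threshold, block_threshold):
    """Return the action taken for a generated output: 'block' at or above
    the block threshold, 'flag' at or above the flag threshold, otherwise
    'allow'."""
    if influence_value >= block_threshold:
        return "block"
    if influence_value >= flag_threshold:
        return "flag"
    return "allow"

# An output whose top influence value falls between the two thresholds
# is flagged for user review rather than blocked outright.
action = review_output(0.72, flag_threshold=0.5, block_threshold=0.9)
```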
- the method can cause the blocking of generations attributable to artists that have been removed or have opted out from the training set. For example, if a model was trained on 100,000 artists’ content, removing an artist through retraining can be hugely expensive (e.g., weeks or months of training on multiple GPUs), and blocking offers a practical alternative.
- the method modifies training to reduce infringement on output based on the influence value.
- the method provides digital copyright management and enforcement based on the training data instances (or instance groups) that had the most influence (e.g., in a generation of the generated data instance).
- a model can be trained with an extra optimization technique by using attribution results: if output similarity exceeds a predetermined level, adjustments are made to the model’s weights. The method mitigates copyright infringement risks at the training level, reducing the likelihood that final outputs will too closely resemble specific copyrighted instances.
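One way to express this at the loss level is an auxiliary penalty term that grows when the output's similarity to any source embedding exceeds the predetermined level. This is a hypothetical sketch of the idea, not the specific weight-adjustment mechanism of the description; cosine similarity and the hinge form are assumptions.

```python
import numpy as np

def similarity_penalty(output_embedding, source_embeddings, level, weight=1.0):
    """Hypothetical auxiliary loss term: zero while the generated output's
    similarity to every source embedding stays below `level`, and growing
    linearly once the worst-case similarity exceeds it."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    worst = max(cosine(output_embedding, s) for s in source_embeddings)
    return weight * max(0.0, worst - level)

# Toy check: an output that nearly copies one training instance incurs a
# positive penalty, nudging training away from that instance.
rng = np.random.default_rng(2)
sources = rng.normal(size=(5, 16))
close_output = sources[0] + 0.01 * rng.normal(size=16)
penalty = similarity_penalty(close_output, sources, level=0.95)
```

During training, such a term would be added to the generative model's usual loss, so gradient updates trade off output quality against closeness to specific copyrighted instances.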
Abstract
Described herein are methods and systems for determining an influence of training data on the generation of an output of a generative machine learning model. The system receives: a training dataset that comprises training data instances, where a given generative machine learning model was trained using the training dataset; and a generated data instance that was generated by the generative machine learning model. The system obtains source embeddings and a generated embedding, then determines influence by assigning an influence value based on the similarity between the generated embedding and the source embeddings. The system may determine which training data instances or instance groups had the greatest influence in the generation of the generated data instance. The system outputs the influence value and an indication of the associated training data instance. The system may block the generated data instance, causing the model to regenerate a new generated data instance.
Description
METHOD OF AND SYSTEM FOR ATTRIBUTING INFLUENCE TO TRAINING DATA BASED ON AN OUTPUT OF A GENERATIVE MODEL
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] The present application claims priority to US Provisional Application No. 63/589,412 filed on October 11, 2023.
FIELD
[0002] The present technology relates to artificial intelligence in general and more specifically to methods, systems and non-transitory storage mediums for attributing influence to training data based on the output of a generative machine learning model.
BACKGROUND
[0003] Rapid advancements in Generative AI (Gen-AI) have brought forth an array of applications for generating data such as text, images, video, audio, and music. While these technologies offer unprecedented creative opportunities, they also raise important questions about the dependency of generative models on prior human creative works and the value of individual training data points on generative model outputs. Understanding this value is crucial for addressing the ethical questions surrounding fair compensation for contributors of data such as copyrighted data.
[0004] Therefore, there is a need for an improved method and system for attributing influence to training data based on the output of a generative machine learning model.
SUMMARY
[0005] It is an object of the present technology to ameliorate at least some of the inconveniences present in the prior art. One or more implementations of the present technology may provide and/or broaden the scope of approaches to and/or methods of achieving the aims and objects of the present technology.
[0006] Thus, one or more implementations of the present technology are directed to a method of and a system for determining the influence of training data into the generation of an output of a generative machine learning model.
[0007] According to a first broad aspect, there is provided a method for determining an influence of training data, the method being executed by a processor, the method comprising: receiving a training dataset comprising a plurality of training data instances, a given generative machine learning model having been trained using the training dataset; receiving a generated data instance having been generated by the generative machine learning model; extracting first features from the training data instances, thereby obtaining source embeddings; extracting second features of the generated data instance, thereby obtaining a generated embedding; determining similarity measure values between the generated embedding and the source embeddings; assigning an influence value to at least one of the training data instances based on the similarity measure values, the influence value being indicative of an influence in a generation of the generated data instance by the generative machine learning model; and outputting the influence value and an indication of the training data instance assigned the influence value.
[0008] In at least one embodiment, the method further comprises: determining that the influence value exceeds a threshold value; and flagging or blocking the generated data instance.
[0009] In at least one embodiment, the method further comprises: regrouping the training data instances into a plurality of instance groups and extracting first features from the training data instances contained in each instance group, thereby obtaining a cloud of source embeddings for each instance group; dividing the
generated data instance into a plurality of generated chunks and extracting second features from each generated chunk, thereby obtaining a cloud of generated embeddings; determining given ones of the instance groups that had a greatest influence in a generation of the generated data instance using a statistical analysis method, the clouds of source embeddings and the cloud of generated embeddings; and outputting an identification of the given ones of the instance groups.
[0010] In at least one embodiment, the method further comprises: determining that the greatest influence exceeds a threshold value; and flagging or blocking the generated data instance.
[0011] In at least one embodiment, the method further comprises: modifying training of the generative machine learning model to reduce the influence in a generation of a future generated data instance by the generative machine learning model.
[0012] According to a second broad aspect, there is provided a system for determining an influence of training data, the system comprising: a processor; a non-transitory storage medium operatively connected to the processor, the non-transitory storage medium comprising computer-readable instructions; the processor, upon executing the instructions, being configured to: receive a training dataset comprising a plurality of training data instances, a given generative machine learning model having been trained using the training dataset; receive a generated data instance having been generated by the generative machine learning model; extract first features from the training data instances, thereby obtaining source embeddings; extract second features of the generated data instance, thereby obtaining a generated embedding; determine similarity measure values between the generated embedding and the source embeddings; assign an influence value to at least one of the training data instances based on the similarity measure values, the influence value being indicative of an influence in a generation of the generated data instance by the generative machine learning model; and output the influence value and an indication of the at least one of the training data instances to which the influence value was assigned.
[0013] In at least one embodiment, the processor, upon executing the instructions, is further configured to: determine that the influence value exceeds a threshold value; and flag or block the generated data instance.
[0014] In at least one embodiment, the processor, upon executing the instructions, is further configured to: divide each training data instance into a respective plurality of training instance chunks and extract first features from each training instance chunk, thereby obtaining a cloud of source embeddings for each training data instance; divide the generated data instance into a plurality of generated chunks and extract second features from each generated chunk, thereby obtaining a cloud of generated embeddings; determine given ones of the training data instances that had a greatest influence in a generation of the generated data instance using a statistical analysis method, the clouds of source embeddings and the cloud of generated embeddings; and output an identification of the given ones of the training data instances.
[0015] In at least one embodiment, the processor, upon executing the instructions, is further configured to: determine that the greatest influence exceeds a threshold value; and flag or block the generated data instance.
[0016] In at least one embodiment, the processor, upon executing the instructions, is further configured to: regroup the training data instances into a plurality of instance groups and extract first features from the training data instances contained in each instance group, thereby obtaining a cloud of source embeddings for each instance group; divide the generated data instance into a plurality of generated chunks and extract second features from each generated chunk, thereby obtaining a cloud of generated embeddings; determine given ones of the instance groups that had a greatest influence in a generation of the generated data instance using a statistical analysis method, the clouds of source embeddings and the cloud of generated embeddings; and output an identification of the given ones of the instance groups.
[0017] In at least one embodiment, the processor, upon executing the instructions, is further configured to: determine that the greatest influence exceeds a threshold value; and flag or block the generated data instance.
[0018] In at least one embodiment, the processor, upon executing the instructions, is further configured to: modify training of the generative machine learning model to reduce the influence in a generation of a future generated data instance by the generative machine learning model.
[0019] According to a third broad aspect, there is provided a non-transitory computer-readable medium having stored thereon computer-readable instructions that, when executed by a processor, cause the processor to: receive a training dataset comprising a plurality of training data instances, a given generative machine learning model having been trained using the training dataset; receive a generated data instance having been generated by the generative machine learning model; extract first features from the training data instances, thereby obtaining source embeddings; extract second features of the generated data instance, thereby obtaining a generated embedding; determine similarity measure values between the generated embedding and the source embeddings; assign an influence value to at least one of the training data instances based on the similarity measure values, the influence value being indicative of an influence in a generation of the generated data instance by the generative machine learning model; and output the influence value and an indication of the at least one of the training data instances to which the influence value was assigned.
[0020] Terms and Definitions
[0021] In the context of the present specification, a “server” is a computer program that is running on appropriate hardware and is capable of receiving requests (e.g., from computing devices) over a network (e.g., a communication network), and carrying out those requests, or causing those requests to be carried out. The hardware may be one physical computer or one physical computer system, but neither is required to be the case with respect to the present technology. In the present context, the use of the expression a “server” is not intended to mean that every task (e.g.,
received instructions or requests) or any particular task will have been received, carried out, or caused to be carried out, by the same server (i.e., the same software and/or hardware); it is intended to mean that any number of software elements or hardware devices may be involved in receiving/sending, carrying out or causing to be carried out any task or request, or the consequences of any task or request; and all of this software and hardware may be one server or multiple servers, both of which are included within the expressions “at least one server” and “a server”.
[0022] In the context of the present specification, “computing device” is any computing apparatus or computer hardware that is capable of running software appropriate to the relevant task at hand. Thus, some (non-limiting) examples of computing devices include general purpose personal computers (desktops, laptops, netbooks, etc.), mobile computing devices, smartphones, and tablets, and network equipment such as routers, switches, and gateways. It should be noted that a computing device in the present context is not precluded from acting as a server to other computing devices. The use of the expression “a computing device” does not preclude multiple computing devices being used in receiving/sending, carrying out or causing to be carried out any task or request, or the consequences of any task or request, or steps of any method described herein. In the context of the present specification, a “client device” refers to any of a range of end-user client computing devices, associated with a user, such as personal computers, tablets, smartphones, and the like.
[0023] In the context of the present specification, the expression “computer readable storage medium” (also referred to as “storage medium” and “storage”) is intended to include non-transitory media of any nature and kind whatsoever, including without limitation RAM, ROM, disks (CD-ROMs, DVDs, floppy disks, hard drives, etc.), USB keys, solid-state drives, tape drives, etc. A plurality of components may be combined to form the computer information storage media, including two or more media components of a same type and/or two or more media components of different types.
[0024] In the context of the present specification, a “database” is any structured collection of data, irrespective of its particular structure, the database management software, or the computer hardware on which the data is stored, implemented or otherwise rendered available for use. A database may reside on the same hardware as the process that stores or makes use of the information stored in the database or it may reside on separate hardware, such as a dedicated server or plurality of servers.
[0025] In the context of the present specification, the expression “information” includes information of any nature or kind whatsoever capable of being stored in a database. Thus, information includes, but is not limited to audiovisual works (images, movies, sound records, presentations etc.), data (location data, numerical data, etc.), text (opinions, comments, questions, messages, etc.), documents, spreadsheets, lists of words, etc.
[0026] In the context of the present specification, the expression “data” or “data instance” refers to a piece of data, or information about the piece of data, that may be generated by a generative machine learning model. The data may be of different types. For example, a data instance may refer to media such as a text, a video, an image, a picture, a drawing, audio, music, a song, etc. In another example, a data instance may refer to a molecule or molecule structure, e.g., a chemical molecule, a protein, a peptide, deoxyribonucleic acid (DNA), ribonucleic acid (RNA), etc. In a further example, a data instance may refer to a material composition.
[0027] In the context of the present specification, unless expressly provided otherwise, an “indication” of an information element may be the information element itself or a pointer, reference, link, or other indirect mechanism enabling the recipient of the indication to locate a network, memory, database, or other computer-readable medium location from which the information element may be retrieved. For example, an indication of a document could include the document itself (i.e., its contents), or it could be a unique document descriptor identifying a file with respect to a particular file system, or some other means of directing the recipient of the indication to a network location, memory address, database table, or other location where the file
may be accessed. As one skilled in the art would recognize, the degree of precision required in such an indication depends on the extent of any prior understanding about the interpretation to be given to information being exchanged as between the sender and the recipient of the indication. For example, if it is understood prior to a communication between a sender and a recipient that an indication of an information element will take the form of a database key for an entry in a particular table of a predetermined database containing the information element, then the sending of the database key is all that is required to effectively convey the information element to the recipient, even though the information element itself was not transmitted as between the sender and the recipient of the indication.
[0028] In the context of the present specification, the expression “communication network” is intended to include a telecommunications network such as a computer network, the Internet, a telephone network, a Telex network, a TCP/IP data network (e.g., a WAN network, a LAN network, etc.), and the like. The term “communication network” includes a wired network or direct-wired connection, and wireless media such as acoustic, radio frequency (RF), infrared and other wireless media, as well as combinations of any of the above.
[0029] In the context of the present specification, the words “first”, “second”, “third”, etc. have been used as adjectives only for the purpose of allowing for distinction between the nouns that they modify from one another, and not for the purpose of describing any particular relationship between those nouns. Thus, for example, it should be understood that the use of the terms “server” and “third server” is not intended to imply any particular order, type, chronology, hierarchy or ranking (for example) of/between the servers, nor is their use (by itself) intended to imply that any “second server” must necessarily exist in any given situation. Further, as is discussed herein in other contexts, reference to a “first” element and a “second” element does not preclude the two elements from being the same actual real-world element. Thus, for example, in some instances, a “first” server and a “second” server may be the same software and/or hardware, in other cases they may be different software and/or hardware.
[0030] Implementations of the present technology each have at least one of the above-mentioned objects and/or aspects, but do not necessarily have all of them. It should be understood that some aspects of the present technology that have resulted from attempting to attain the above-mentioned object may not satisfy this object and/or may satisfy other objects not specifically recited herein.
[0031] Additional and/or alternative features, aspects and advantages of implementations of the present technology will become apparent from the following description, the accompanying drawings and the appended claims.
BRIEF DESCRIPTION OF THE DRAWINGS
[0032] For a better understanding of the present technology, as well as other aspects and further features thereof, reference is made to the following description which is to be used in conjunction with the accompanying drawings, where:
[0033] Figure 1 depicts a schematic diagram of a computing device in accordance with one or more non-limiting implementations of the present technology.
[0034] Figure 2 depicts a schematic diagram of a communication system in accordance with one or more non-limiting implementations of the present technology.
[0035] Figure 3 depicts a schematic diagram of a method for attributing an influence to training data in accordance with one or more non-limiting implementations of the present technology.
DETAILED DESCRIPTION
[0036] The examples and conditional language recited herein are principally intended to aid the reader in understanding the principles of the present technology and not to limit its scope to such specifically recited examples and conditions. It will be appreciated that those skilled in the art may devise various arrangements which, although not explicitly described or shown herein, nonetheless embody the principles of the present technology and are included within its spirit and scope.
[0037] Furthermore, as an aid to understanding, the following description may describe relatively simplified implementations of the present technology. As persons skilled in the art would understand, various implementations of the present technology may be of a greater complexity.
[0038] In some cases, what are believed to be helpful examples of modifications to the present technology may also be set forth. This is done merely as an aid to understanding, and, again, not to define the scope or set forth the bounds of the present technology. These modifications are not an exhaustive list, and a person skilled in the art may make other modifications while nonetheless remaining within the scope of the present technology. Further, where no examples of modifications have been set forth, it should not be interpreted that no modifications are possible and/or that what is described is the sole manner of implementing that element of the present technology.
[0039] Moreover, all statements herein reciting principles, aspects, and implementations of the present technology, as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof, whether they are currently known or developed in the future. Thus, for example, it will be appreciated by those skilled in the art that any block diagrams herein represent conceptual views of illustrative circuitry embodying the principles of the present technology. Similarly, it will be appreciated that any flowcharts, flow diagrams, state transition diagrams, pseudo-code, and the like represent various processes which may be substantially represented in computer-readable media and so executed by a computer or processor, whether or not such computer or processor is explicitly shown.
[0040] The functions of the various elements shown in the figures, including any functional block labeled as a “processor” or a “graphics processing unit”, may be provided through the use of dedicated hardware as well as hardware capable of executing software in association with appropriate software. When provided by a processor, the functions may be provided by a single dedicated processor, by a single shared processor, or by a plurality of individual processors, some of which may be shared. In one or more non-limiting implementations of the present technology, the
processor may be a general-purpose processor, such as a central processing unit (CPU) or a processor dedicated to a specific purpose, such as a graphics processing unit (GPU). Moreover, explicit use of the term “processor” or “controller” should not be construed to refer exclusively to hardware capable of executing software, and may implicitly include, without limitation, digital signal processor (DSP) hardware, network processor, application specific integrated circuit (ASIC), field programmable gate array (FPGA), read-only memory (ROM) for storing software, random access memory (RAM), and non-volatile storage. Other hardware, conventional and/or custom, may also be included.
[0041] Software modules, or simply modules which are implied to be software, may be represented herein as any combination of flowchart elements or other elements indicating performance of process steps and/or textual description. Such modules may be executed by hardware that is expressly or implicitly shown.
[0042] With these fundamentals in place, we will now consider some nonlimiting examples to illustrate various implementations of aspects of the present technology.
[0043] Computing device
[0044] Referring to Figure 1, there is shown a computing device 100 suitable for use with some implementations of the present technology, the computing device 100 comprising various hardware components including one or more single or multicore processors collectively represented by processor 110, a graphics processing unit (GPU) 111, a solid-state drive 120, a random-access memory 130, a display interface 140, and an input/output interface 150.
[0045] Communication between the various components of the computing device 100 may be enabled by one or more internal and/or external buses 160 (e.g., a PCI bus, universal serial bus, IEEE 1394 “Firewire” bus, SCSI bus, Serial-ATA bus, etc.), to which the various hardware components are electronically coupled.
[0046] The input/output interface 150 may be coupled to a touchscreen 190 and/or to the one or more internal and/or external buses 160. The touchscreen 190 may be part of the display. In one or more implementations, the touchscreen 190 is the display. The touchscreen 190 may equally be referred to as a screen 190. In the implementations illustrated in Figure 1, the touchscreen 190 comprises touch hardware 194 (e.g., pressure-sensitive cells embedded in a layer of a display allowing detection of a physical interaction between a user and the display) and a touch input/output controller 192 allowing communication with the display interface 140 and/or the one or more internal and/or external buses 160. In one or more implementations, the input/output interface 150 may be connected to a keyboard (not shown), a mouse (not shown) or a trackpad (not shown) allowing the user to interact with the computing device 100 in addition or in replacement of the touchscreen 190.
[0047] According to implementations of the present technology, the solid-state drive 120 stores program instructions suitable for being loaded into the random-access memory 130 and executed by the processor 110 and/or the GPU 111 for attributing an influence to training data used to train a generative machine learning model based on an output of the generative machine learning model. For example, the program instructions may be part of a library or an application.
[0048] The computing device 100 may be implemented as a server, a desktop computer, a laptop computer, a tablet, a smartphone, a personal digital assistant or any device that may be configured to implement the present technology, as it may be understood by a person skilled in the art.
[0049] System
[0050] Referring to Figure 2, there is shown a schematic diagram of a system 200, the system 200 being suitable for implementing one or more non-limiting implementations of the present technology. It is to be expressly understood that the system 200 as shown is merely an illustrative implementation of the present technology. Thus, the description thereof that follows is intended to be only a description of illustrative examples of the present technology. This description is not
intended to define the scope or set forth the bounds of the present technology. In some cases, what are believed to be helpful examples of modifications to the system 200 may also be set forth below. This is done merely as an aid to understanding, and, again, not to define the scope or set forth the bounds of the present technology. These modifications are not an exhaustive list, and, as a person skilled in the art would understand, other modifications are likely possible. Further, where this has not been done (i.e., where no examples of modifications have been set forth), it should not be interpreted that no modifications are possible and/or that what is described is the sole manner of implementing that element of the present technology. As a person skilled in the art would understand, this is likely not the case. In addition, it is to be understood that the system 200 may provide in certain instances simple implementations of the present technology, and that where such is the case they have been presented in this manner as an aid to understanding. As persons skilled in the art would understand, various implementations of the present technology may be of a greater complexity.
[0051] The system 200 comprises inter alia a first server 210, a second server 220, and a database 230 communicatively coupled over a communications network 240.
[0052] First server
[0053] The first server 210 is configured to inter alia: generate an output, i.e., a generated data instance, based on a training dataset of training data instances. It should be understood that the first server 210 is configured for being trained using any adequate training method based on the training dataset in order to execute any adequate generative machine learning model and thereby generate an output or generated data instance.
[0054] It will be appreciated that the first server 210 can be implemented as a conventional computer server and may comprise at least some of the features of the computing device 100 shown in Figure 1. In a non-limiting example of one or more implementations of the present technology, the first server 210 is implemented as a server running an operating system (OS). Needless to say, the first server 210 may be implemented in any suitable hardware and/or software and/or firmware or a combination thereof. In the disclosed non-limiting implementation of the present technology, the first server 210 is a single server. In one or more alternative non-limiting implementations of the present technology, the functionality of the first server 210 may be distributed and may be implemented via multiple servers (not shown).
[0055] The implementation of the first server 210 is well known to the person skilled in the art. However, the first server 210 comprises a communication interface (not shown) configured to communicate with various entities (such as the database 230, for example, and other devices potentially coupled to the communication network 240) via the communication network 240. The first server 210 further comprises at least one computer processor (e.g., the processing device of the computing device 100) operationally connected with the communication interface and structured and configured to execute various processes to be described herein.
[0056] Second server
[0057] The second server 220 is configured to, inter alia: receive the training dataset that was used for training the generative machine learning model executed by the first server 210; receive a generated data instance having been generated by the generative machine learning model running on the first server 210; extract features of each one of the training data instances, thereby obtaining source embeddings each being associated with a respective one of the training data instances; extract features of the generated data instance, thereby obtaining a generated embedding; determine a similarity measure value between the generated embedding and each one of the source embeddings;
assign an influence value to at least one of the training data instances based on the similarity measure value, the influence value being indicative of an influence of a respective one of the training data instances in a generation of the generated data instance; and output the influence value and an indication of the at least one of the training data instances to which the influence value was assigned.
[0058] It will be appreciated that the second server 220 can be implemented as a conventional computer server and may comprise at least some of the features of the computing device 100 shown in Figure 1. In a non-limiting example of one or more implementations of the present technology, the second server 220 is implemented as a server running an operating system (OS). Needless to say, the second server 220 may be implemented in any suitable hardware and/or software and/or firmware or a combination thereof. In the disclosed non-limiting implementation of the present technology, the second server 220 is a single server. In one or more alternative non-limiting implementations of the present technology, the functionality of the second server 220 may be distributed and may be implemented via multiple servers (not shown).
[0059] The implementation of the second server 220 is well known to the person skilled in the art. However, the second server 220 comprises a communication interface (not shown) configured to communicate with various entities (such as the database 230, for example, and other devices potentially coupled to the communication network 240) via the communication network 240. The second server 220 further comprises at least one computer processor (e.g., the processing device of the computing device 100) operationally connected with the communication interface and structured and configured to execute various processes to be described herein.
[0060] Database
[0061] A database 230 is communicatively coupled to the first server 210 and the second server 220 via the communications network 240 but, in one or more alternative implementations, the database 230 may be directly coupled to the second server 220 without departing from the teachings of the present technology. Although the database 230 is illustrated schematically herein as a single entity, it will be appreciated that the database 230 may be configured in a distributed manner, for example, the database 230 may have different components, each component being configured for a particular kind of retrieval therefrom or storage therein.
[0062] The database 230 may be a structured collection of data, irrespective of its particular structure or the computer hardware on which data is stored, implemented or otherwise rendered available for use. The database 230 may reside on the same hardware as a process that stores or makes use of the information stored in the database 230 or it may reside on separate hardware, such as on the second server 220. The database 230 may receive data from the second server 220 for storage thereof and may provide stored data to the second server 220 for use thereof.
[0063] In one or more implementations of the present technology, the database 230 is configured to store inter alia: (i) the training dataset comprising the plurality of training data instances, (ii) the data instance(s) generated by the first server 210 and/or (iii) the influence values determined by the second server 220.
[0064] Communication Network
[0065] In one or more implementations of the present technology, the communications network 240 is the Internet. In one or more alternative non-limiting implementations, the communication network 240 may be implemented as any suitable local area network (LAN), wide area network (WAN), a private communication network or the like. It will be appreciated that implementations for the communication network 240 are for illustration purposes only. How a communication link 245 between the first server 210, the second server 220, the database 230, and/or another computing device (not shown) and the communications network 240 is implemented will depend inter alia on how each computing device is implemented.
[0066] The communication network 240 may be used in order to transmit data packets between the first and second servers 210 and 220. For example, the
communication network 240 may be used to transmit requests from the second server 220 to the first server 210. In another example, the communication network 240 may be used to transmit the training dataset.
[0067] Method Description
[0068] Figure 3 depicts a flowchart of a method 300 for attributing an influence to training data based on an output of a generative machine learning model, in accordance with one or more non-limiting implementations of the present technology.
[0069] In one or more implementations, the second server 220 comprises a processing device such as the processor 110 and/or the GPU 111 operatively connected to a non-transitory computer readable storage medium such as the solid-state drive 120 and/or the random-access memory 130 storing computer-readable instructions. The processing device, upon executing the computer-readable instructions, is configured to or operable to execute the method 300.
[0070] According to processing step 302, the processing device receives the training data that was used to train a generative machine learning model. The training data comprises a plurality of training data instances. For example, when the generative machine learning model is configured for generating images, the training dataset comprises training images. In another example, when the generative machine learning model is configured for generating music, the training dataset comprises training pieces of music.
[0071] According to processing step 304, the processing device receives a data instance that was generated by the generative machine learning model. For example, when the generative machine learning model is configured for generating images and the training dataset comprises training images, the data instance received at step 304 is an image that was generated by the generative machine learning model. In another example, when the generative machine learning model is configured for generating music and the training dataset comprises training pieces of music, the data instance
received at step 304 is a piece of music that was generated by the generative machine learning model.
[0072] According to processing step 306, the processing device extracts features from each training data instance received at step 302 using a given feature extractor, thereby obtaining an embedding, i.e., a feature vector in an n-dimensional embedding space, for each training data instance.
[0073] It should be understood that any adequate feature extractor may be used at step 306. In at least some embodiments, the feature extractor to be used at step 306 is chosen at least based on the type of the training data instances. For example, when the training data instances correspond to training images, the feature extractor is adapted to extract features from images. Examples of such an adequate feature extractor for images may comprise: a convolutional neural network (CNN), an autoencoder, a histogram of oriented gradients (HOG) extractor, etc.
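By way of illustration only, step 306 could be sketched as follows. The feature extractor below (a normalized intensity histogram over grayscale pixel values) is a deliberately simple stand-in, not one of the extractors named above; it merely shows how a data instance is mapped to a fixed-length embedding:

```python
# Minimal sketch of step 306: map an image (a 2-D list of grayscale
# pixel values, 0-255) to a fixed-length embedding. A real feature
# extractor would be a CNN, autoencoder, or HOG extractor; this
# normalized intensity histogram is only illustrative.
def extract_features(image, bins=8):
    flat = [p for row in image for p in row]
    hist = [0] * bins
    for p in flat:
        hist[min(p * bins // 256, bins - 1)] += 1
    total = len(flat) or 1
    return [count / total for count in hist]  # n-dimensional embedding
```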
[0074] According to processing step 308, the processing device extracts features from the generated data instance received at step 304 using the same feature extractor as the one used at step 306, thereby obtaining an embedding for the generated data instance.
[0075] According to processing step 310, the processing device determines a similarity measure value between the embedding of the generated data instance and the embedding of each training data instance. It should be understood that any adequate similarity measure, such as the Euclidean distance, cosine similarity, dot product, etc., may be used.
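Step 310 can be illustrated with one of the similarity measures named above; a minimal cosine-similarity sketch over plain Python lists (the function name is an illustrative choice):

```python
import math

# Sketch of step 310: cosine similarity between the generated embedding
# and a source embedding; any adequate measure could be substituted.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```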
[0076] According to processing step 312, the processing device assigns an influence value to at least some of the training data instances based on their respective similarity measure values determined at step 310. For each training data instance to which an influence value has been assigned, the influence value is indicative of the influence or importance of the training data instance in the creation/generation of the generated data instance received at step 304.
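The mapping from similarity measure values to influence values is left open at step 312. As one hypothetical choice (an assumption for illustration, not the claimed method), the similarity values can be normalized so that the influence values sum to one across the training data instances:

```python
def assign_influence(similarities):
    # Sketch of step 312: clip negative similarities to zero and
    # normalize, so each influence value is the training data
    # instance's share of the total similarity.
    clipped = [max(s, 0.0) for s in similarities]
    total = sum(clipped)
    if total == 0.0:
        return [0.0] * len(similarities)
    return [s / total for s in clipped]
```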
[0077] According to processing step 314, the processing device outputs the influence values determined at step 312. For example, an indication of each training data instance for which an influence value has been determined and its respective influence value may also be output and/or stored in the database 230. Alternatively, or in addition, the processing device outputs an identification of the source of the training data instance for which an influence value was determined, such as a particular digital image file or a particular digital audio file (e.g., for a piece of music). In another embodiment, the influence values may be sent to a personal computing device over the network 240.
[0078] In some embodiments in which the source embeddings used by the method 300 correspond to the embeddings used for training the generative machine learning model, step 306 consists in receiving the source embeddings of the training data instances from the first server 210 or the database 230. In this case, step 308 consists in receiving the embedding of the generated data instance, also from the first server 210 or the database 230. In this case, step 302 may comprise receiving information about the training data instances such as identification (ID) information of each training data instance, and step 304 may also comprise receiving information about the generated data instance such as the generated data instance ID.
[0079] In some embodiments, the method 300 further comprises the steps of dividing each training data instance into a plurality of training instance chunks and dividing the generated data instance into a plurality of generated instance chunks.
[0080] In some embodiments in which the training data instances are divided into training instance chunks and the generated data instance is divided into generated instance chunks, step 306 comprises, for each training data instance, extracting an embedding for each training instance chunk and combining the embeddings of the training instance chunks together to obtain a source embedding for the training data instance. Similarly, step 308 comprises extracting an embedding for each generated instance chunk and combining the embeddings of the generated instance chunks together to obtain a generated embedding for the generated data instance. For example, for each training data instance, the embeddings of the training instance chunks may be added together,
aggregated, averaged, etc. It should be understood that the same combination method is also used for obtaining the embedding of the generated data instance, e.g., if the embeddings of the training instance chunks are averaged, then the embeddings of the generated instance chunks are also averaged. It should be understood that at step 310, the similarity measure values are obtained by using the source embeddings and the generated embedding.
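As a non-limiting illustration of this chunk-and-combine step, the following sketch assumes a hypothetical per-chunk feature extractor and averaging as the combination method; per the paragraph above, the same combination is applied on both the training side and the generated side.

```python
# Illustrative sketch of steps 306/308 with chunking: per-chunk embeddings are
# combined with the SAME operation (here a mean) for training instances and for
# the generated instance. `embed_chunk` is a stand-in for any feature extractor.
import numpy as np

def embed_instance(chunks, embed_chunk):
    """Embed each chunk, then average the chunk embeddings into one embedding."""
    return np.mean([embed_chunk(c) for c in chunks], axis=0)

# Toy extractor (assumption): embeds a chunk as [mean value, peak amplitude].
embed_chunk = lambda c: np.array([np.mean(c), np.max(np.abs(c))])

source_embedding = embed_instance([np.ones(4), np.zeros(4)], embed_chunk)
# The generated instance is combined the same way (averaged here too).
generated_embedding = embed_instance([np.full(4, 0.5)], embed_chunk)
```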
[0081] While in the above-described embodiment an embedding is generated for each chunk, it should be understood that some chunks of the training data instances and the corresponding chunks of the generated data instance may be discarded so that an embedding is generated for the non-discarded chunks of the training data instances and the non-discarded chunks of the generated data instance.
[0082] In some embodiments in which the embeddings are aggregated, the method 300 results in computational efficiencies or improvements. For example, aggregating embeddings reduces memory usage, speeds up processing, and optimizes similarity calculations. These optimizations are advantageous over conventional methods as they maintain similarity accuracy while reducing computation time and memory usage, which improves scalability for large datasets. Also, by aggregating embeddings, the similarity calculation phase requires fewer pairwise comparisons, improving computational efficiency.
[0083] In some embodiments, there is an embedding matrix for every object in the training set as well as for the generated output. For example, the embedding matrix may be an [n,m] embedding matrix, where:
• n represents the number of frames, which may vary based on the audio segment length selected; and
• m denotes the feature dimension, corresponding to the number of features in each frame that the model is extracting.
[0084] In another example, the embedding matrix may be an [n,m] embedding matrix, where:
• n represents the number of image segments, which may vary based on the type of segmentation used (e.g., semantic, instance, panoptic); and
• m denotes the feature dimension, corresponding to the number of features in each segment that the model is extracting.
[0085] With the [n,m] embedding matrix, two types of optimization can be performed: a reduction of n and a reduction of m. Optimization of m can be done with PCA; a reduction on n and m, performed independently and also simultaneously, preserves attribution accuracy. In at least one implementation, this optimization results in optimizing speed by a factor of 100 (from 50s to 0.5s on an embedding group with 100 million rows) and memory by a factor of 150 (from 75GB to 0.57GB on an embedding group with 100 million rows).
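A minimal sketch of the feature-dimension (m) reduction described above, implemented as PCA via the singular value decomposition; the matrix sizes and component count are illustrative assumptions, not the benchmark configuration reported in the paragraph above.

```python
# Illustrative sketch: reduce the feature dimension m of an [n, m] embedding
# matrix by projecting onto the top-k principal components (PCA via SVD).
import numpy as np

def pca_reduce(X: np.ndarray, k: int) -> np.ndarray:
    """Project the rows of X onto the top-k principal components."""
    Xc = X - X.mean(axis=0)                          # center the features
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                             # [n, k] reduced matrix

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))    # toy matrix: n=50 frames, m=8 features
X_reduced = pca_reduce(X, k=3)  # m reduced from 8 to 3
```

The n dimension may then be reduced separately, for example by the fixed-window averaging described below in paragraph [0086].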
[0086] In some embodiments, the number of frames is reduced. For example, the method 300 includes an averaging over fixed windows. In such a case, the frames are split into smaller groups of windows of a fixed size and then the average of each group is taken.
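The fixed-window averaging of frames may be sketched as follows; the window size and matrix values are illustrative assumptions.

```python
# Illustrative sketch of paragraph [0086]: frames are split into groups
# (windows) of a fixed size and each group is averaged, reducing n.
import numpy as np

def average_windows(frames: np.ndarray, window: int) -> np.ndarray:
    """frames: [n, m] matrix; returns a [ceil(n/window), m] matrix of window means."""
    return np.stack([frames[i:i + window].mean(axis=0)
                     for i in range(0, len(frames), window)])

frames = np.arange(12, dtype=float).reshape(6, 2)  # n=6 frames, m=2 features
reduced = average_windows(frames, window=3)        # n shrinks from 6 to 2
```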
[0087] In other embodiments in which each training data instance is divided into a plurality of training instance chunks and the generated data instance is divided into a plurality of generated instance chunks, training instance chunks are regrouped into source chunk groups and generated instance chunks are similarly regrouped into generated chunk groups. Embeddings are then generated for each source chunk group, for each training data instance, and embeddings are also generated for each generated chunk group.
[0088] In some embodiments, the embeddings generated for the generated chunk groups are combined to obtain a single embedding for the generated data instance and, for each training data instance, the embeddings generated for the source chunk groups are combined together to obtain a single embedding for the training data instance which is then used along with the embedding of the generated data instance to obtain the similarity measure value for the training data instance. For example, embeddings may be aggregated, added, averaged, etc. to obtain a single embedding. For each training data instance, a similarity measure value is then determined using
the single embedding obtained for the training data instance and the embedding obtained for the generated data instance.
[0089] In other embodiments, for each training data instance, a first similarity value is obtained for each source chunk group based on its respective embedding and the embedding of its respective generated chunk group of the generated data instance, and a second similarity value is obtained by combining together the first similarity values obtained for the training data instance. For example, the first similarity values may be added, averaged, etc.
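A minimal sketch of this per-group computation, assuming cosine similarity as the first similarity value and averaging as the combination; both choices are assumptions, since the method permits any adequate measure and combination.

```python
# Illustrative sketch of paragraph [0089]: a first similarity value is computed
# per chunk group (source group vs. its corresponding generated group), then the
# first similarity values are combined (here, averaged) into a second value.
import numpy as np

def grouped_similarity(source_groups, generated_groups):
    """Each argument: list of (d,) group embeddings, aligned by group index."""
    firsts = [float(s @ g) / (np.linalg.norm(s) * np.linalg.norm(g))
              for s, g in zip(source_groups, generated_groups)]  # cosine per group
    return firsts, float(np.mean(firsts))                         # combined value

firsts, combined = grouped_similarity(
    [np.array([1.0, 0.0]), np.array([0.0, 1.0])],
    [np.array([1.0, 0.0]), np.array([1.0, 1.0])])
```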
[0090] In some embodiments, step 306 comprises a step of regrouping at least some of the training data instances into a plurality of instance groups and step 308 comprises extracting an embedding for each group of training data instances. In this case, step 310 consists in determining a similarity measure value for each instance group based on the embedding of the instance group and the embedding of the generated data instance, and step 312 consists in assigning an influence value to each instance group based on the similarity measure values.
[0091] It should be understood that not all of the training data instances may be part of an instance group. For example, all of the training data instances except a given training data instance may be included into a respective one of n instance groups. In this case, a respective embedding is generated for each one of the n instance groups and a further embedding is generated for the given training data instance that is not part of any group. A similarity measure value and an influence value are then determined for each instance group and also for each training data instance that is not part of an instance group.
[0092] It should also be understood that a same training data instance may be part of different instance groups.
[0093] It should further be understood that while at least some of the training data instances may be part of instance groups and an embedding may be generated for each instance group, an individual embedding may still be generated for each training data instance or at least some of the training data instances. In this case, a similarity
measure value and an influence value are determined for each instance group and also for each training data instance.
[0094] In some embodiments in which instance groups are generated, the processing device is configured for automatically regrouping at least some of the training data instances into instance groups.
[0095] In other embodiments in which instance groups are generated, the instance groups are received by the processing device, i.e., the identification of the training data instances that are part of an instance group is received by the processing device.
[0096] In some embodiments in which the generated data instance is divided into a plurality of generated instance chunks and each training data instance is also divided into a plurality of training instance chunks, the method 300 further comprises a step of preselecting some training data instances to obtain a first group of training data instances for which an influence value will be determined and a second group of training data instances for which no influence value will be determined. In this case, the embeddings generated for the generated instance chunks form a cloud of generated embeddings and the embeddings generated for the training instance chunks of each training data instance form a respective cloud of source embeddings. Using any adequate statistical method, such as a permutation test, and adequately choosing the p-value, it is possible to select only the clouds of source embeddings that are the closest to the cloud of generated embeddings or that have the most overlap with the cloud of generated embeddings, thereby selecting the first group of training data instances for which an influence value will be subsequently determined as described above.
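A hedged sketch of such a preselection with a permutation test follows. The test statistic (distance between cloud centroids), the p-value handling, and all data are illustrative assumptions; the paragraph above only requires any adequate statistical method.

```python
# Illustrative permutation test between a cloud of generated-chunk embeddings
# and a cloud of source-chunk embeddings. A large p-value here means the
# observed cloud distance is NOT smaller than random regroupings, i.e., the
# clouds are far apart; clouds with small observed distances are candidates.
import numpy as np

def cloud_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Distance between the centroids of two embedding clouds."""
    return float(np.linalg.norm(a.mean(axis=0) - b.mean(axis=0)))

def permutation_p_value(generated, source, n_perm=500, seed=0):
    """Fraction of random regroupings whose distance is <= the observed one."""
    rng = np.random.default_rng(seed)
    observed = cloud_distance(generated, source)
    pooled = np.vstack([generated, source])
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # shuffle rows, then resplit into two pseudo-clouds
        if cloud_distance(pooled[:len(generated)], pooled[len(generated):]) <= observed:
            hits += 1
    return hits / n_perm

rng = np.random.default_rng(1)
close = rng.normal(0.0, 0.1, size=(30, 4))  # cloud overlapping the generated one
far = rng.normal(5.0, 0.1, size=(30, 4))    # distant cloud
gen = rng.normal(0.0, 0.1, size=(30, 4))
p_close = permutation_p_value(gen, close)
p_far = permutation_p_value(gen, far)       # distant cloud: p close to 1
```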
[0097] In some embodiments in which the generated data instance is divided into a plurality of generated instance chunks and the training data instances are regrouped into a plurality of instance groups, the method 300 further comprises a step of preselecting some instance groups to obtain a first grouping of instance groups for which an influence value will be determined and a second grouping of instance groups
for which no influence value will be determined. In this case, the embeddings generated for the generated instance chunks form a cloud of generated embeddings and the embeddings generated for the training data instances contained in an instance group form a respective cloud of source embeddings. Using any adequate statistical method, such as a permutation test, and adequately choosing the p-value, it is possible to select only the clouds of source embeddings that are the closest to the cloud of generated embeddings or that have the most overlap with the cloud of generated embeddings, thereby selecting the first grouping of instance groups for which an influence value will be subsequently determined as described above.
[0098] In some embodiments, the influence value corresponds to a percentage value.
[0099] In some embodiments, an influence value is calculated for each training data instance based on their respective similarity measure value determined at step 310.
[0100] In some embodiments, the method 300 further comprises a step of comparing the determined influence value to a minimal influence threshold and discarding any training data instance for which the respective influence value is below the minimal influence threshold. In this case, only the training data instances for which the respective influence value is equal to or above the minimal influence threshold are kept and outputted at step 314.
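This threshold-based filtering may be sketched as follows; the instance identifiers and values are illustrative assumptions.

```python
# Illustrative sketch of paragraph [0100]: keep only the training data
# instances whose influence value meets the minimal influence threshold.
influences = {"track_a": 0.22, "track_b": 0.03, "track_c": 0.11}
minimal_influence_threshold = 0.05

kept = {k: v for k, v in influences.items()
        if v >= minimal_influence_threshold}  # track_b is discarded
```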
[0101] In some embodiments, step 312 consists in iteratively determining the influence values, i.e., the influence value of a first training data instance is determined, then the influence value of a second training data instance is determined, the influence value of a third training data instance is determined, etc. In this case, the method 300 may further comprise a step of iteratively adding together the determined influence values, thereby obtaining a total influence value, and a step of comparing at each iteration the total influence value to a total influence threshold. For example, after calculating the influence values of the first and second training data instances, the determined influence values for the first and second training data instances are
added together to obtain a total influence value and the total influence value is compared to the total influence threshold. If the total influence value is equal to or above the total influence threshold, step 312 is stopped and only the first and second training data instances and their respective influence values are outputted at step 314. If the total influence value is below the total influence threshold, then the influence value is determined for a third training data instance and added to the total influence value. The new total influence value is then compared to the total influence threshold. If the new total influence value is equal to or above the total influence threshold, step 312 is stopped and only the first, second and third training data instances and their respective influence values are outputted at step 314. If the new total influence value is below the total influence threshold, then the influence value is determined for a fourth training data instance, etc.
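The iterative accumulation described above may be sketched as follows; the instance identifiers, values, and descending ordering are illustrative assumptions.

```python
# Illustrative sketch of paragraph [0101]: influence values are accumulated one
# training data instance at a time, stopping once the total influence value
# reaches the total influence threshold.
def accumulate_influence(influences, total_threshold):
    """influences: iterable of (instance_id, value) pairs; returns kept pairs and total."""
    kept, total = [], 0.0
    for instance_id, value in influences:
        kept.append((instance_id, value))
        total += value
        if total >= total_threshold:  # threshold reached: stop step 312
            break
    return kept, total

kept, total = accumulate_influence(
    [("a", 0.30), ("b", 0.25), ("c", 0.10), ("d", 0.05)], total_threshold=0.5)
```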
[0102] In some embodiments, the method 300 uses both a minimal influence threshold and a total influence threshold. In this case, the above-described method for calculating the total influence value and comparing the total influence value to the total influence threshold is used, but only the determined influence values being at least equal to the minimal influence threshold are taken into account for the calculation of the total influence value. The training data instances of which the influence value has been determined but was not taken into account for the calculation of the total influence value (i.e., of which the determined influence value is below the minimal influence threshold) and the training data instances for which no influence value has been determined are discarded.
[0103] It should be understood that the execution order of some steps of the method 300 may vary. For example, step 304 may be executed before step 302. In another example, step 308 may be executed prior to step 306.
[0104] In some embodiments, the method 300 further comprises a step of dimensionality reduction of the embeddings generated at steps 306 and 308, thereby obtaining a reduced embedding for each training data instance and the generated data
instance. In this case, it should be understood that step 310 is performed using the reduced embeddings.
[0105] It should be understood that any adequate method for embedding dimensionality reduction can be used. For example, methods such as principal component analysis (PCA), uniform manifold approximation and projection (UMAP), or the like can be used.
[0106] In embodiments in which the generated data instance is divided into a plurality of generated instance chunks, the method 300 may be adapted to determine the most influential training data instances amongst a set of training data instances or the most influential groups of training data instances.
[0107] In some embodiments in which each training data instance is divided into a plurality of training instance chunks, the most influential training data instances (i.e., the training data instances having the greatest influence) are determined as follows. The embeddings generated for the generated instance chunks form a cloud of generated embeddings and the embeddings generated for the training instance chunks of each training data instance form a respective cloud of source embeddings. Using any adequate statistical method, such as a permutation test, and adequately choosing the p-value, it is possible to select only the clouds of source embeddings that are the closest to the cloud of generated embeddings or that have the most overlap with the cloud of generated embeddings, thereby determining the training data instances that have the most influence in the generation of the generated data instance. These training data instances that are most influential may be output, or an identification of these training data instances may be output, or both.
[0108] In some embodiments in which the training data instances are regrouped into a plurality of instance groups, the most influential instance groups are determined as follows. The embeddings generated for the generated instance chunks form a cloud of generated embeddings and the embeddings generated for the training data instances contained in an instance group form a respective cloud of source embeddings (so that each instance group is provided with its respective cloud of
source embeddings). Using any adequate statistical method, such as a permutation test, and adequately choosing the p-value, it is possible to select only the clouds of source embeddings that are the closest to the cloud of generated embeddings or that have the most overlap with the cloud of generated embeddings, thereby selecting the instance groups that had the most influence in the generation of the generated data instance. These instance groups that are most influential may be output, or an identification of these instance groups may be output, or both.
[0109] In some embodiments, a predefined influence value may further be assigned to the selected most influential training data instances or the selected most influential groups of training data instances. For example, when the ten most influential training data instances are to be determined, the p-value may be chosen so that the output of the statistical analysis corresponds to the top ten most influential training data instances and an influence value of 10% is assigned to each selected training data instance.
[0110] In some embodiments, a method for determining an influence of training data combines one or more portions of one or more embodiments described above. These portions include: receiving a training dataset and a generated data instance, obtaining embeddings, determining influence (or training data instances or instance groups with the greatest influence), outputting the influence (or training data instances or instance groups with the greatest influence), and flagging or blocking the generated data instance. For example, the receiving a training dataset and a generated data instance may include: receiving a training dataset comprising a plurality of training data instances (where a given generative machine learning model was trained using the training dataset) and receiving a generated data instance (that was generated by the generative machine learning model). The obtaining embeddings may include obtaining source embeddings (or a cloud of source embeddings) and a generated embedding (or a cloud of generated embeddings). The determining influence may be assigning an influence value based on similarity between the generated embedding and the source embedding, or determining which training data instances or instance groups had the greatest influence in the generation of the generated data instance. The
outputting the influence may be outputting the influence value and an indication of the training data instance assigned the influence value, or outputting the training data instance or the instance groups that had the greatest influence in the generation of the generated data instance. The flagging or blocking the generated data instance may cause the model to regenerate a new generated data instance.
[0111] In the following, there is described a particular implementation of the method 300 when the generative machine learning model is configured for generating audio such as a musical piece. In this case, the method 300 is adapted to determine the influence of training audio pieces in the generation of a given audio piece by the generative machine learning model.
[0112] This particular implementation corresponds to a comprehensive methodology for attributing value to a limited set of individual data instances within a training dataset used for generation. It should be understood that this particular implementation may serve as a general framework for a wide range of generative artificial intelligence (AI) applications. This approach is also designed to be model-agnostic, enabling its application across a variety of generative model architectures (e.g., Transformers, GANs, Latent Diffusion). In some embodiments, by anchoring attribution in similarity metrics based on human perception, how each data point influences the generated output may be elucidated. This approach can be used for addressing the ethical implications of Gen-AI for content creation, ensuring alignment between the human creators of training data and the resulting AI-generated content.
[0113] In the case of the application of the method 300 to music, any adequate dataset of training musical pieces can be used.
[0114] For example, the training musical pieces may come from the MagnaTagATune dataset which presents a diverse representation of audio features and their accompanying annotations and a strong community validation, providing a well-rounded estimation of similarity based on human tagged music. The MagnaTagATune dataset contains about 25,863 29-second music excerpts, each belonging to one of about 5223 songs, from about 445 albums, across about 230
artists. The clips span a broad range of genres like Classical, New Age, Electronica, Rock, Pop, World, Jazz, Blues, Metal, Punk, and more. Each audio clip is supplied with a vector of binary annotations of 188 tags created via a “TagATune” game paradigm in which two players are presented with either the same or different audio clips and asked to come up with tags for the presented clip, after which they are asked to decide whether they were presented with the same audio clip.
[0115] In another example, the training musical pieces may come from the Million Song Dataset which is a large, freely available collection of audio features and metadata for a million contemporary popular music tracks. It was created to serve as a rich dataset for researchers working on music information retrieval (MIR) and related fields. Unlike audio datasets that contain the actual audio, the Million Song Dataset provides features extracted from the audio, which can be used for various tasks such as genre classification, recommendation, and more. It includes various pieces of metadata for each track like song title, artist name, album name, year of release, and more. Additionally, it provides pre-computed audio features like tempo, key, time signature, and other attributes that can describe a song's musical characteristics.
[0116] In a further example, the training musical pieces may come from the Free Music Archive (FMA) which is an interactive library of high-quality, legal audio downloads, curated by established audio curators and institutions. This database provides a rich trove of music across multiple genres, and each piece is available under various Creative Commons licenses or through direct permissions from the artists.
[0117] In still another example, the training musical pieces may come from AudioSet which is a vast dataset originating from YouTube™ video soundtracks. Comprising over 2 million 10-second audio clips extracted from these videos, it is annotated with a rich ontology of 527 sound event classes. The dataset covers a wide range of sounds from real-world environments, and notably, it includes music-related labels, encapsulating various genres, instruments, and other music-associated events. This comprehensive nature positions AudioSet as a versatile resource, not just for
generic audio recognition tasks but also for music-specific machine learning applications.
[0118] In the case of musical pieces, examples of adequate feature extractors are provided in the following.
[0119] For example, the feature extractor may be the VGGish model which is derived from the foundational architecture of the VGG model. VGGish has been specifically optimized for audio feature extraction. With an input configuration designed for mel-spectrograms, VGGish accepts dimensions of 96x64 and yields a 128-dimensional embedding. Such a design ensures its efficacy in capturing a wide spectrum of audio characteristics. Building upon the principles established by the original VGG for image processing, VGGish has been meticulously adapted to analyze and distinguish audio data, establishing it as a robust model suitable for diverse musical applications.
[0120] In another example, the feature extractor may be the MusiCNN model. Designed with a primary focus on musical attributes, MusiCNN effectively captures the temporal dynamics inherent in musical compositions. It utilizes filters of varying lengths to process audio input, thus discerning both short-term timbral features and longer-term rhythmic structures. The architecture's convolutional layers are designed to extract hierarchical musical features, resulting in the precise identification of patterns and timbres.
[0121] In a further example, the feature extractor may be the Audio-Adapted VGG model, which is another adaptation of the VGG model, modified for audio data. This model processes spectrograms with an architecture reminiscent of its visual counterpart but refocused on audio frequency bands. The convolutional layers are designed to recognize spatial patterns within spectrograms. Although its foundational architecture is borrowed from image processing, the adaptability of this version of VGG demonstrates its efficacy in analyzing intricate audio data, particularly when the focus is on detailed spectrogram interpretation.
[0122] In some embodiments, each training musical piece is chunked into a plurality of training musical segments and an embedding is generated for each training musical segment. An average embedding is then determined using the embeddings of the training musical segments.
[0123] For example, each training musical piece may be chunked into 3-second musical segments and an embedding is extracted for each musical segment. In order to reduce computation in the similarity calculation step, an average embedding is calculated across all musical segments.
[0124] For each training musical segment x_i, a feature vector φ_i is computed using an encoder or feature extractor E:

[0125] φ_i = E(x_i)

[0126] where x_i is an audio chunk, φ_i is the corresponding embedding in R^n, and n is the dimensionality of the embedding space.

[0127] These feature vectors lie in an n-dimensional space, where the size n depends on the model used. For example, for MusiCNN, the size of an embedding vector is 200, for VGG, the size of an embedding vector is 256 and for VGGish, the size of an embedding vector is 128.

[0128] As a result, for an entire training musical piece divided into m training musical segments, one can obtain:

[0129] a set of m feature vectors {φ_1, φ_2, ..., φ_m}, each in R^n,

[0130] which are then averaged to obtain a single embedding for each training musical piece:

[0131] φ̄ = (1/m) Σ_{i=1}^{m} φ_i
[0132] It should be understood that the features for the musical piece generated by the generative machine learning model are extracted in a similar manner using the same encoder.
[0133] In some embodiments, dimensionality reduction is applied to the determined embeddings prior to determining the similarity measure values.
[0134] In one example, dimensionality reduction is achieved using Principal Component Analysis (PCA) which is a statistical method rooted in linear algebra. Its primary goal is to reduce dimensionality by projecting data onto orthogonal axes, known as principal components. These axes are arranged in a manner where the first captures the highest variance in the data, followed by the second (orthogonal to the first), and so on, ensuring that most of the data's original variability is preserved in fewer dimensions.
[0135] In another example, dimensionality reduction is achieved using t-Distributed Stochastic Neighbor Embedding (t-SNE). t-SNE stands out as a probabilistic technique tailored for high-dimensional data. Unlike many other methods, t-SNE maintains local structures within data, making it proficient at capturing intricate non-linear relationships. This has made it a good choice for visualizing high-dimensional datasets.
[0136] In a further example, dimensionality reduction is achieved using STD- PCA (Sparse Tikhonov-Regularized Discriminant-PCA). STD-PCA is a refined version of traditional PCA, introducing elements of sparse and Tikhonov regularization. This combination often results in reduced dimensions that are more interpretable, particularly beneficial when variables or features exhibit collinearity. By integrating regularization, STD-PCA offers a balance between capturing variance and ensuring interpretability.
[0137] In still another example, dimensionality reduction is achieved using Uniform Manifold Approximation and Projection (UMAP) which is a non-linear dimensionality reduction technique grounded in the principle of preserving the topological structure inherent in the original space. UMAP stands out for its ability to maintain both local and global structures within the data. Additionally, it may offer a computational advantage, being faster than techniques like t-SNE in numerous scenarios.
[0138] Still considering the case of musical pieces, the similarity measure values determined at step 310 may be obtained using the Euclidean Distance method, the Cosine Similarity method, the Dot Product method, the Triangle Area Similarity (TS) method, the Sector Area Similarity (SS) method, the TS-SS similarity metric which is the difference between Triangle Area Similarity and Sector Area Similarity, or the like, as known in the art.
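As an illustration of the TS-SS metric named above, the following sketch follows the commonly published definition of the Triangle Area Similarity and Sector Area Similarity (including its 10-degree angle offset); the implementation contemplated by the present paragraph may differ in detail.

```python
# Illustrative TS-SS sketch: the product of a triangle area (from the angle
# between two vectors) and a sector area (from Euclidean distance and
# magnitude difference). Lower values indicate greater similarity.
import math
import numpy as np

def ts_ss(a: np.ndarray, b: np.ndarray) -> float:
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    cos = float(a @ b) / (na * nb)
    theta_deg = math.degrees(math.acos(max(-1.0, min(1.0, cos)))) + 10.0
    ts = na * nb * math.sin(math.radians(theta_deg)) / 2.0  # triangle area
    ed = float(np.linalg.norm(a - b))                       # Euclidean distance
    md = abs(na - nb)                                       # magnitude difference
    ss = math.pi * (ed + md) ** 2 * (theta_deg / 360.0)     # sector area
    return ts * ss                                          # lower = more similar

score_same = ts_ss(np.array([1.0, 2.0]), np.array([1.0, 2.0]))   # identical vectors
score_diff = ts_ss(np.array([1.0, 2.0]), np.array([-2.0, 1.0]))  # orthogonal vectors
```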
[0139] In some embodiments, a mechanism is present for determining when to cease attributing an influence value to training musical pieces. By setting an empirically derived or predetermined minimal influence threshold, training musical pieces that have negligible influence are discarded, thereby focusing on the most impactful training musical pieces.
[0140] The influence values are distributed up to a total influence threshold, e.g., 50% of influence amongst the nearest neighbors in the embedding space, which provides a cutoff when the smallest fraction falls below the minimal influence threshold, e.g., 5%. If the minimal influence threshold is set to 0, then each training musical piece is assigned an influence value, and if the total influence threshold is set to 100%, then the combined attributions per object will total 100%.
[0141] In some embodiments, the distance between each training musical piece and the musical piece generated by the generative machine learning model is calculated, which allows for ranking the embeddings of the training musical pieces based on their distances to the embedding of the generated musical piece. For example, for a minimum influence threshold of 5%, one can obtain the top 10 training musical pieces, as that should sufficiently fulfill the minimum and maximum limits (5%-50%).
[0142] The influence values (referred to as shares hereinafter) are then determined based on the determined distances, as follows.

[0143] For each of the training musical pieces, a weight w_i is first determined based on its distance d_i to the generated musical piece (e.g., an inverse-distance weight w_i = 1/d_i, so that closer pieces receive larger weights).

[0145] The total weight W corresponding to the sum of all individual weights is then calculated:

[0146] W = Σ_i w_i

[0147] The initial share s_i allocated to each training musical piece based on its weight is calculated as:

[0148] s_i = pie_share × (w_i / W)

[0149] where pie_share corresponds to the maximum total threshold described above and is indicative of the total share of attribution to be allocated among the training musical pieces. For example, the pie_share may be set to 0.5 by default, meaning that 50% of the "attribution pie" will be distributed.

[0150] Shares below the minimum influence threshold (e.g., 0.05) are discarded. If a share s_i is found to be less than the minimum influence threshold, the corresponding w_i and d_i are removed, and W is recalculated as:

[0151] W = Σ_{j ∈ remaining} w_j

[0152] Subsequently, the shares s_i are recalculated based on this new W using the formula for initial share allocation.

[0153] A list of shares {s_1, s_2, ..., s_k}, each of which represents the final influence value for at least some of the training musical pieces in the dataset, is then obtained.
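The share-allocation procedure of paragraphs [0142]-[0153] may be sketched end to end as follows. The inverse-distance weight w_i = 1/d_i is an assumption adopted for illustration only, while pie_share = 0.5 and the 0.05 minimum threshold follow the example defaults given above.

```python
# Illustrative end-to-end share allocation: inverse-distance weights (an
# assumption), proportional allocation of pie_share, discarding of shares
# below min_share, and recalculation of W until all remaining shares pass.
def allocate_shares(distances, pie_share=0.5, min_share=0.05):
    """distances: {instance_id: distance to generated piece}; returns final shares."""
    weights = {k: 1.0 / d for k, d in distances.items()}  # assumed weighting
    while weights:
        W = sum(weights.values())                          # total weight W
        shares = {k: pie_share * w / W for k, w in weights.items()}
        below = [k for k, s in shares.items() if s < min_share]
        if not below:                                      # all remaining shares pass
            return shares
        for k in below:                                    # discard and recalculate W
            del weights[k]
    return {}

shares = allocate_shares({"a": 1.0, "b": 2.0, "c": 10.0})  # "c" falls below 0.05
```

Note that the kept shares always sum to pie_share, so the distributed attribution totals 50% under the default.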
[0154] In at least some embodiments the above-described methodology is model agnostic, structurally decoupled from the model code itself, and designed for broad applicability across various generative model architectures, enabling its application to existing pretrained models. This may ensure relevance of
the method as Gen-AI technologies evolve, offering a standardized solution for all Gen-AI applications, and not limited to music. It emphasizes a universal approach for models trained on copyrighted material, rather than bespoke solutions tailored to specific media — i.e., text, image, audio, or video — or specific models such as Latent Diffusion, GAN, or Transformers. By being independent of any specific generative model architecture, the method can be applied across a wide array of technologies. This not only expands its utility but also circumvents the need to develop new attribution techniques each time a technological breakthrough occurs in a particular media space. As generative models continue to evolve at a rapid pace, a modelagnostic approach ensures that the methodology remains relevant and actionable, without requiring constant updates or modifications to keep up with Gen-AI progress.
[0155] In some embodiments, for the particular case of music, the method may emphasize measures of similarity anchored in human perception. Such a method not only makes the attribution process inherently more comprehensible (supporting interpretability and reflecting how a human being naturally experiences and appreciates music), but also recognizes and values the disproportionate influence certain artists may have, even with limited contributions; e.g., Michael Jackson's "Thriller" was hugely influential but will constitute only a very small portion of the training musical pieces. Moving beyond music, similarity measures might be devised that emphasize end use or mode of production of the generated outputs. For example, in the case of Gen-AI for materials or molecules, similarity measures might include metrics related to production cost or environmental impact (special considerations that are not obvious from the data alone), or functional similarity as opposed to structural similarity. For architectural Gen-AI, similarity might include measures of functional utility or integration with surroundings, which likewise might not be obvious from architectural design data alone.
[0156] In some embodiments, an attribution cutoff is incorporated into the method. This may help in dealing with the "long tail" of training data instances that have less influence on the generated output but nevertheless display a non-trivial degree of similarity. Such cutoff mechanisms also reflect industry standards of royalty distribution, where ownership is typically assigned to a relatively short list of contributors to the production (e.g., songwriter, lyricist, producer, etc.). Having a predetermined or empirically derived threshold ensures that attribution is both practical and meaningful, avoiding the noise introduced by large numbers of attributed data points, most of which demonstrate negligible influence on the output. As generative processes are inherently stochastic, variations across individual generative outputs will also naturally tend to distribute attribution across relevant portions of the training data as the system continues to be used.
[0157] In some embodiments, the method incorporates Dimensionality Reduction (DR) techniques, which allow for both streamlining computational demands and improving attribution quality. Such a DR technique may enable efficient processing of the large, high-dimensional datasets common in audio and music, offering increased scalability and speed. DR not only boosts efficiency but may also filter out noise from less pertinent dimensions, preserving the most information-rich features essential for real-time similarity computations. By addressing the curse of dimensionality, DR may ensure that distance-based metrics in the reduced spaces are more reliable, leading to robust similarity measurements that prioritize principal features capturing the most significant data variations. For example, the dimensions may be limited to 5-10 components, optimizing computational efficiency and precision in similarity assessments.
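A minimal sketch of such a DR step, assuming plain SVD-based PCA (the text does not prescribe a particular DR technique) and following the 5-10 component range given as an example:

```python
import numpy as np

def reduce_embeddings(embeddings, n_components=8):
    """Project embeddings onto their top principal components.

    SVD-based PCA is an illustrative assumption; any DR technique
    could be substituted. Distances in the reduced space then serve
    the similarity computations described above.
    """
    x = np.asarray(embeddings, dtype=float)
    x = x - x.mean(axis=0)                      # center the data before SVD
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return x @ vt[:n_components].T              # shape: (n_samples, n_components)
```

The reduced embeddings can be fed directly to a nearest-neighbor ranking, trading a small amount of reconstruction fidelity for faster and more stable distance computations.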
[0158] In some embodiments, the method provides digital copyright management and enforcement based on the influence value. Alternatively, or in addition, the method provides digital copyright management and enforcement based on the training data instances (or instance groups) that had the most influence (e.g., in a generation of the generated data instance). After running the attribution process (e.g., assigning the influence value, determining which training instances had the most influence, determining which instance groups had the most influence), the method determines that a particular training data instance or a small subset of instances has a disproportionately high influence (e.g., greater than a threshold influence value) on a generated output (e.g., the generated data instance). In this case, the method may flag or block this output: flagging notifies the user of the risk via a user interface (UI), while blocking causes the UI to require regenerating a new output, with an indication that the model owner is not granting rights to that output because it infringes on the copyright agreements under which the model was trained.
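The flag/block decision can be sketched as a simple rule over the strongest single-instance influence value; the default threshold and mode names here are illustrative, not values prescribed by the method:

```python
def review_output(max_influence, threshold=0.5, mode="flag"):
    """Decide how the UI should handle a generated output given the
    largest influence value attributed to any single training instance.

    threshold and mode defaults are illustrative assumptions.
    """
    if max_influence <= threshold:
        return "allow"                      # no disproportionate influence found
    return "block" if mode == "block" else "flag"
```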
[0159] The method accordingly prevents copyright infringement by ensuring that any outputs that are too closely tied to certain training data instances are not distributed or used, thereby safeguarding the rights of original content creators where the licensing agreement of the training dataset so provides. To accomplish this, the method may employ software tools configured to flag outputs when a similarity measure exceeds a predetermined threshold, effectively serving as a digital rights management tool. Similarly, the method can cause the blocking of generations attributable to artists that have been removed or have opted out from the training set. For example, if a model was trained on the content of 100,000 artists, additional training can be hugely expensive (e.g., weeks or months of training on multiple GPUs). If an artist from this list then decides to opt out from the model, that would usually require retraining the model, which may be computationally expensive. Instead, this attribution mechanism can be set up with a threshold and used to determine whether the model is drawing on this artist above the allowed threshold, for example 1%. The tool may then block or flag the output as above. This allows rights holders to opt out while the model remains in use without retraining.
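The opt-out check described above can be sketched by aggregating the per-instance shares into per-artist totals; the artist-label mapping and the 1% default threshold follow the example in the text, while the function shape is an assumption:

```python
def check_opt_out(shares, artist_of, opted_out, max_artist_share=0.01):
    """Aggregate influence shares per artist and report any opted-out
    artists whose total influence on this output exceeds the allowed
    threshold (1% in the example from the text)."""
    totals = {}
    for instance_id, share in shares.items():
        artist = artist_of[instance_id]
        totals[artist] = totals.get(artist, 0.0) + share
    violations = {a: t for a, t in totals.items()
                  if a in opted_out and t > max_artist_share}
    return totals, violations   # non-empty violations => flag or block the output
```

A non-empty `violations` dictionary triggers the flag/block handling described above, letting the model stay in service without retraining.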
[0160] In some embodiments, the method modifies training to reduce infringement on output based on the influence value. Alternatively, or in addition, the method provides digital copyright management and enforcement based on the training data instances (or instance groups) that had the most influence (e.g., in a generation of the generated data instance). In a particular context, at the time of generative model training, there may be alternative or additional interventions. A model can be trained with an extra optimization step that uses attribution results: if output similarity exceeds a predetermined level, the model's weights are adjusted. The method thereby mitigates copyright infringement risks at the training level, reducing the likelihood that final outputs will resemble specific copyrighted instances too closely.
[0161] In some embodiments, the method includes prompt-focused training. The method begins by training the model on a large dataset (e.g., 1,000,000 audio samples with text descriptions) to establish basic associations between text prompts and audio features. The method identifies a subset of "high-value" prompts (e.g., 1,000 prompts) that are especially important, such as those tied to unique or desired output characteristics, and calculates attribution values for these prompts during training. During training, the method focuses on examples that show strong attribution alignment with high-value prompts, emphasizing these samples in the model's learning process. This reweighting ensures that more influential data guides the model's early training stages. The model uses attribution scores to adjust the model weights such that generated outputs align with expected attribution profiles for high-value prompts. If attribution is low, the method updates the model to minimize the discrepancy, focusing the model's learning on achieving the desired output characteristics. The method continuously monitors attribution for these high-value prompts and refines the model to ensure that these prompts consistently generate strong, aligned attribution values. This iterative feedback loop enhances the model's ability to reproduce desired results aligned with important training inputs.
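One way to sketch the reweighting step: per-sample loss weights grow with attribution alignment to high-value prompts. The linear weighting form and its constants are assumptions for illustration, as the text does not specify a formula:

```python
def training_weights(attr_alignment, is_high_value, floor=1.0, scale=4.0):
    """Per-sample loss weights for prompt-focused training.

    Samples aligned with high-value prompts are emphasized in proportion
    to their attribution alignment score (assumed to lie in [0, 1]); the
    linear form and the floor/scale constants are illustrative assumptions.
    """
    return [
        floor + scale * score if high else floor   # boost high-value-aligned samples
        for score, high in zip(attr_alignment, is_high_value)
    ]
```

These weights would multiply each sample's loss term in the training loop, so strongly aligned samples dominate the gradient in the early stages described above.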
[0162] It should be expressly understood that not all technical effects mentioned herein need to be enjoyed in each and every implementation of the present technology. For example, implementations of the present technology may be implemented without the user enjoying some of these technical effects, while other non-limiting implementations may be implemented with the user enjoying other technical effects or none at all.
[0163] Some of these steps and signal sending-receiving operations are well known in the art and, as such, have been omitted in certain portions of this description for the sake of simplicity. The signals can be sent and received using optical means (such as a fiber-optic connection), electronic means (such as a wired or wireless connection), and mechanical means (such as pressure-based, temperature-based, or any other suitable physical-parameter-based means).
[0164] Modifications and improvements to the above-described implementations of the present technology may become apparent to those skilled in the art. The foregoing description is intended to be exemplary rather than limiting.
Claims
1. A method for determining an influence of training data, the method being executed by a processor, the method comprising: accessing a database containing a training dataset comprising a plurality of training data instances, a given generative machine learning model having been trained using the training dataset; receiving a generated data instance having been generated by the generative machine learning model; extracting first features from the training data instances, thereby obtaining source embeddings; extracting second features of the generated data instance, thereby obtaining a generated embedding; determining similarity measure values between the generated embedding and the source embeddings; assigning an influence value to at least one of the training data instances based on the similarity measure values, the influence value being indicative of an influence in a generation of the generated data instance by the generative machine learning model; and outputting the influence value and an indication of the at least one of the training data instances to which the influence value was assigned.
2. The method of claim 1, further comprising: determining that the influence value exceeds a threshold value; and flagging or blocking the generated data instance.
3. The method of claim 1, further comprising: dividing each training data instance into a respective plurality of training instance chunks and extracting first features from each training instance
chunk, thereby obtaining a cloud of source embeddings for each training data instance; dividing the generated data instance into a plurality of generated chunks and extracting second features from each generated chunk, thereby obtaining a cloud of generated embeddings; determining given ones of the training data instances that had a greatest influence in a generation of the generated data instance using a statistical analysis method, the clouds of source embeddings and the cloud of generated embeddings; and outputting an identification of the given ones of the training data instances.
4. The method of claim 3, further comprising: determining that the greatest influence exceeds a threshold value; and flagging or blocking the generated data instance.
5. The method of claim 1, further comprising: regrouping the training data instances into a plurality of instance groups and extracting first features from the training data instances contained in each instance group, thereby obtaining a cloud of source embeddings for each instance group; dividing the generated data instance into a plurality of generated chunks and extracting second features from each generated chunk, thereby obtaining a cloud of generated embeddings; determining given ones of the instance groups that had a greatest influence in a generation of the generated data instance using a statistical analysis method, the clouds of source embeddings and the cloud of generated embeddings; and outputting an identification of the given ones of the instance groups.
6. The method of claim 5, further comprising: determining that the greatest influence exceeds a threshold value; and
flagging or blocking the generated data instance.
7. The method of claim 1, further comprising: modifying training of the generative machine learning model to reduce the influence in a generation of a future generated data instance by the generative machine learning model.
8. A system for determining an influence of training data, the system comprising: a processor; a non-transitory storage medium operatively connected to the processor, the non-transitory storage medium comprising computer-readable instructions; the processor, upon executing the instructions, being configured to: access a database containing a training dataset comprising a plurality of training data instances, a given generative machine learning model having been trained using the training dataset; receive a generated data instance having been generated by the generative machine learning model; extract first features from the training data instances, thereby obtaining source embeddings; extract second features of the generated data instance, thereby obtaining a generated embedding; determine similarity measure values between the generated embedding and the source embeddings; assign an influence value to at least one of the training data instances based on the similarity measure values, the influence value being indicative of an influence in a generation of the generated data instance by the generative machine learning model; and output the influence value and an indication of the at least one of the training data instances to which the influence value was assigned.
9. The system of claim 8, wherein the processor, upon executing the instructions, is further configured to: determine that the influence value exceeds a threshold value; and flag or block the generated data instance.
10. The system of claim 8, wherein the processor, upon executing the instructions, is further configured to: divide each training data instance into a respective plurality of training instance chunks and extract first features from each training instance chunk, thereby obtaining a cloud of source embeddings for each training data instance; divide the generated data instance into a plurality of generated chunks and extract second features from each generated chunk, thereby obtaining a cloud of generated embeddings; determine given ones of the training data instances that had a greatest influence in a generation of the generated data instance using a statistical analysis method, the clouds of source embeddings and the cloud of generated embeddings; and output an identification of the given ones of the training data instances.
11. The system of claim 10, wherein the processor, upon executing the instructions, is further configured to: determine that the greatest influence exceeds a threshold value; and flag or block the generated data instance.
12. The system of claim 8, wherein the processor, upon executing the instructions, is further configured to: regroup the training data instances into a plurality of instance groups and extract first features from the training data instances contained in each instance group, thereby obtaining a cloud of source embeddings for each instance group;
divide the generated data instance into a plurality of generated chunks and extract second features from each generated chunk, thereby obtaining a cloud of generated embeddings; determine given ones of the instance groups that had a greatest influence in a generation of the generated data instance using a statistical analysis method, the clouds of source embeddings and the cloud of generated embeddings; and output an identification of the given ones of the instance groups.
13. The system of claim 12, wherein the processor, upon executing the instructions, is further configured to: determine that the greatest influence exceeds a threshold value; and flag or block the generated data instance.
14. The system of claim 8, wherein the processor, upon executing the instructions, is further configured to: modify training of the generative machine learning model to reduce the influence in a generation of a future generated data instance by the generative machine learning model.
15. A non-transitory computer-readable medium having stored thereon computer- readable instructions that, when executed by a processor, cause the processor to: access a database containing a training dataset comprising a plurality of training data instances, a given generative machine learning model having been trained using the training dataset; receive a generated data instance having been generated by the generative machine learning model; extract first features from the training data instances, thereby obtaining source embeddings; extract second features of the generated data instance, thereby obtaining a generated embedding;
determine similarity measure values between the generated embedding and the source embeddings; assign an influence value to at least one of the training data instances based on the similarity measure values, the influence value being indicative of an influence in a generation of the generated data instance by the generative machine learning model; and output the influence value and an indication of the at least one of the training data instances to which the influence value was assigned.
16. The non-transitory computer-readable medium of claim 15, wherein the computer-readable instructions, when executed by the processor, further cause the processor to: determine that the influence value exceeds a threshold value; and flag or block the generated data instance.
17. The non-transitory computer-readable medium of claim 15, wherein the computer-readable instructions, when executed by the processor, further cause the processor to: divide each training data instance into a respective plurality of training instance chunks and extract first features from each training instance chunk, thereby obtaining a cloud of source embeddings for each training data instance; divide the generated data instance into a plurality of generated chunks and extract second features from each generated chunk, thereby obtaining a cloud of generated embeddings; determine given ones of the training data instances that had a greatest influence in a generation of the generated data instance using a statistical analysis method, the clouds of source embeddings and the cloud of generated embeddings; and output an identification of the given ones of the training data instances.
18. The non-transitory computer-readable medium of claim 17, wherein the computer-readable instructions, when executed by the processor, further cause the processor to: determine that the greatest influence exceeds a threshold value; and flag or block the generated data instance.
19. The non-transitory computer-readable medium of claim 15, wherein the computer-readable instructions, when executed by a processor, further cause the processor to: regroup the training data instances into a plurality of instance groups and extract first features from the training data instances contained in each instance group, thereby obtaining a cloud of source embeddings for each instance group; divide the generated data instance into a plurality of generated chunks and extract second features from each generated chunk, thereby obtaining a cloud of generated embeddings; determine given ones of the instance groups that had a greatest influence in a generation of the generated data instance using a statistical analysis method, the clouds of source embeddings and the cloud of generated embeddings; and output an identification of the given ones of the instance groups.
20. The non-transitory computer-readable medium of claim 19, wherein the computer-readable instructions, when executed by the processor, further cause the processor to: determine that the greatest influence exceeds a threshold value; and flag or block the generated data instance.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202363589412P | 2023-10-11 | 2023-10-11 | |
| US63/589,412 | 2023-10-11 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025079040A1 true WO2025079040A1 (en) | 2025-04-17 |
Family
ID=95396700
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/IB2024/059997 Pending WO2025079040A1 (en) | 2023-10-11 | 2024-10-11 | Method of and system for attributing influence to training data based on an output of a generative model |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2025079040A1 (en) |
Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20180075581A1 (en) * | 2016-09-15 | 2018-03-15 | Twitter, Inc. | Super resolution using a generative adversarial network |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 24876815 Country of ref document: EP Kind code of ref document: A1 |