
CN116306771A - Model training method and related equipment - Google Patents

Model training method and related equipment

Info

Publication number
CN116306771A
Authority
CN
China
Prior art keywords
action
information
target state
occurrence probability
agent
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310224144.4A
Other languages
Chinese (zh)
Inventor
Li Yinchuan (李银川)
Shao Yunfeng (邵云峰)
Hao Jianye (郝建业)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202310224144.4A
Publication of CN116306771A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/004 - Artificial life, i.e. computing arrangements simulating life
    • G06N 3/008 - Artificial life, i.e. computing arrangements simulating life based on physical entities controlled by simulated intelligence so as to replicate intelligent life forms, e.g. based on robots replicating pets or humans in their appearance or behaviour
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 - Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Robotics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application discloses a model training method and related equipment, in which a generative flow model is obtained through offline training so that the model has better performance. The method comprises the following steps: when a model to be trained needs to be trained, first information of an agent can be obtained from a preset offline data set, the first information indicating that the agent is in a target state. The first information can then be input into the model to be trained and processed by it to obtain the occurrence probability of a first action of the agent, where the first action causes the agent to enter a next state of the target state from the target state. Finally, the model to be trained can be trained based on the occurrence probability of the first action and the true occurrence probability of the first action derived from the offline data set, thereby obtaining the generative flow model.

Description

Model training method and related equipment thereof
Technical Field
The embodiment of the application relates to the technical field of artificial intelligence (artificial intelligence, AI), in particular to a model training method and related equipment thereof.
Background
With the rapid development of AI technology, generative flow models are widely used to describe and solve the action policy selection of an agent interacting with an environment, so that the agent can obtain the maximum return or achieve a specific goal after executing the corresponding actions.
At present, a generative flow model provided by the related art can, after determining that an agent is in a target state, process information associated with the target state so as to predict the occurrence probabilities of one or more actions of the agent, where these actions cause the agent to enter one or more next states of the target state from the target state. In this way, the agent can execute the action with the largest occurrence probability predicted by the neural network model, so as to enter a certain next state of the target state.
The above generative flow model is generally trained in an online mode, that is, during training, for any state of the agent, the model can apply the predicted action to that state in an environment simulator, thereby randomly generating the next state of the agent. Although this training mode lets the model learn as many states of the agent as possible, some of these states do not match the actual environment in which the agent is located, so the performance of the trained generative flow model is only mediocre.
Disclosure of Invention
The embodiments of the application provide a model training method and related equipment, in which a generative flow model is obtained through offline training so that the generative flow model has better performance.
A first aspect of an embodiment of the present application provides a model training method, including:
when the model to be trained needs to be trained, a preset offline data set can first be obtained, and first information is extracted from the offline data set, the first information indicating that the agent is in a target state.
After the first information is obtained, it can be input into the model to be trained and processed by the model to obtain the (predicted) occurrence probability of a first action of the agent, where the first action causes the agent to enter a next state of the target state from the target state. At this point, the model to be trained has completed the action prediction for the target state. In one possible implementation, when obtaining the occurrence probability of the first action of the agent, the model to be trained can follow, as far as possible, the following constraint: the difference between the occurrence probability of the first action of the agent and the true occurrence probability of the first action of the agent falls within a preset range, where the true occurrence probability of the first action can be extracted from the offline data set.
After the occurrence probability of the first action of the agent is obtained, the model to be trained can be trained based on this occurrence probability until the model training condition is met, thereby obtaining the generative flow model.
From the above method, it can be seen that: when the model to be trained needs to be trained, first information of the agent can be obtained from a preset offline data set, the first information indicating that the agent is in a target state. Then, the first information can be input into the model to be trained and processed by it to obtain the occurrence probability of a first action of the agent, where the first action causes the agent to enter a next state of the target state from the target state. Finally, the model to be trained can be trained based on the occurrence probability of the first action and the true occurrence probability of the first action, thereby obtaining a generative flow model; the true occurrence probability of the first action is derived from the offline data set. In the foregoing process, the occurrence probability of the first action can be regarded as the action policy predicted by the model to be trained for the target state, and the true occurrence probability of the first action can be regarded as the true action policy for the target state recorded in the offline data set. Training therefore makes the predicted action policy for the target state fit the true action policy for the target state as closely as possible. Because the true action policy determines the true probability with which the agent enters each next state of the target state from the target state, the model to be trained not only learns as many of the next states of the target state as possible, but the learned states also conform to the actual environment in which the agent is located (since the data in the offline data set are all set in advance based on that actual environment). The generative flow model obtained by this offline training mode can therefore have better performance.
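For illustration only, the following is a minimal sketch of one such offline training step, assuming the policy-fitting constraint is realized as a divergence loss between the predicted and recorded action probabilities; the network architecture, the choice of loss, and all names are assumptions and are not specified by the application.

```python
import torch
import torch.nn as nn

class PolicyModel(nn.Module):
    """Illustrative model to be trained: maps state information to the predicted
    occurrence probability of each action available in that state."""
    def __init__(self, state_dim: int, num_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, num_actions),
        )

    def forward(self, state_info: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.net(state_info), dim=-1)

def offline_training_step(model, optimizer, batch):
    """One training step on samples drawn from the preset offline data set.

    batch["state"]       : first information (the agent is in the target state)
    batch["true_policy"] : true occurrence probabilities recorded in the data set
    """
    predicted_policy = model(batch["state"])          # predicted action policy
    loss = torch.nn.functional.kl_div(                # fit predicted to true policy
        predicted_policy.log(), batch["true_policy"], reduction="batchmean"
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```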
In one possible implementation, training the model to be trained based on the occurrence probability of the first action to obtain the generative flow model includes: correcting the occurrence probability of a second action of the agent based on the offline data set to obtain a corrected occurrence probability of the second action, where the second action causes the agent to enter the target state from a previous state of the target state; correcting the reward value corresponding to the target state based on the offline data set to obtain a corrected reward value corresponding to the target state; and training the model to be trained based on the occurrence probability of the first action, the corrected occurrence probability of the second action, and the corrected reward value corresponding to the target state to obtain the generative flow model. In this implementation, after the occurrence probability of the first action of the agent is obtained, the occurrence probability of the second action of the agent can also be obtained, the second action being used to make the agent enter the target state from the previous state of the target state. It should be noted that, since the model to be trained has already completed the action prediction for the previous state of the target state, the occurrence probability of the second action can be obtained directly. Then, the occurrence probability of the second action can be corrected using some of the data in the offline data set, thereby obtaining the corrected occurrence probability of the second action. Likewise, the reward value corresponding to the target state can be obtained from the offline data set and corrected using some of the data in the offline data set, thereby obtaining the corrected reward value corresponding to the target state. After the corrected occurrence probability of the second action and the corrected reward value corresponding to the target state are obtained, the model to be trained can be trained based on the occurrence probability of the first action, the corrected occurrence probability of the second action, and the corrected reward value corresponding to the target state, thereby obtaining the generative flow model.
In one possible implementation, the offline data set includes M pieces of first candidate information and M pieces of second candidate information, where the i-th first candidate information indicates that the agent is in the i-th candidate state, the i-th second candidate information indicates that the agent is in the previous state of the i-th candidate state, the M pieces of first candidate information include the first information, the M pieces of second candidate information include second information indicating that the agent is in the previous state of the target state, the M candidate states include the target state, and M is greater than or equal to 1. Correcting the occurrence probability of the second action of the agent based on the offline data set to obtain the corrected occurrence probability of the second action includes: correcting the occurrence probability of the second action of the agent based on the first information, the second information, the M pieces of first candidate information, and the M pieces of second candidate information to obtain the corrected occurrence probability of the second action. In this implementation, the offline data set includes M data groups: the 1st data group includes the 1st first candidate information, the 1st second candidate information, the 1st third candidate information, the reward value corresponding to the 1st candidate state, and the 1st true action policy; similarly, the M-th data group includes the M-th first candidate information, the M-th second candidate information, the M-th third candidate information, the reward value corresponding to the M-th candidate state, and the M-th true action policy. One of the M data groups can then be selected: the first candidate information in that group is called the first information, the candidate state indicated by that first candidate information is called the target state, and the second candidate information in that group is called the second information. It follows that the first information indicates that the agent is in the target state, the second information indicates that the agent is in the previous state of the target state, and the true occurrence probability of the first action of the agent is known (from the corresponding true action policy). In this way, the M pieces of first candidate information and the M pieces of second candidate information can be extracted from the offline data set, and the first information, the second information, the M pieces of first candidate information, and the M pieces of second candidate information can be calculated to obtain a new transition value for the target state. The occurrence probability of the second action of the agent is then corrected using the new transition value for the target state, thereby obtaining the corrected occurrence probability of the second action.
In one possible implementation, the offline data set further includes reward values corresponding to the M candidate states, and correcting the reward value corresponding to the target state based on the offline data set to obtain the corrected reward value corresponding to the target state includes: correcting the reward value corresponding to the target state based on the first information, the M pieces of first candidate information, and the reward values corresponding to the M candidate states to obtain the corrected reward value corresponding to the target state. In this implementation, the M pieces of first candidate information and the reward values corresponding to the M candidate states can be extracted from the offline data set, and the first information, the M pieces of first candidate information, and the reward values corresponding to the M candidate states can be calculated to obtain the corrected reward value corresponding to the target state.
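The application does not disclose the correction formulas or the exact training objective. Purely as an illustrative sketch, the corrected quantities named above could be combined as follows, with the correction functions left as hypothetical placeholders computed from the offline data set and a log-ratio squared term standing in for the real flow objective.

```python
import torch

def correct_second_action_prob(p_second, first_info, second_info,
                               candidate_first_infos, candidate_second_infos):
    """Hypothetical placeholder: re-weight the probability of the second action
    (the action that led into the target state) using the M first/second
    candidate information items from the offline data set. The exact formula is
    not disclosed by the application."""
    raise NotImplementedError

def correct_reward(reward, first_info, candidate_first_infos, candidate_rewards):
    """Hypothetical placeholder: adjust the target state's reward value using the
    reward values of the M candidate states from the offline data set."""
    raise NotImplementedError

def illustrative_flow_loss(p_first, p_second_corrected, reward_corrected):
    """Illustrative stand-in only: penalize the mismatch between the corrected
    in-flow into the target state (corrected second-action probability weighted
    by the corrected reward) and the out-flow from it (first-action probability)."""
    in_flow = p_second_corrected * reward_corrected
    out_flow = p_first
    return (torch.log(in_flow + 1e-8) - torch.log(out_flow + 1e-8)).pow(2).mean()
```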
In one possible implementation, the first information is information collected when the agent is in a target state, where the information includes at least one of: image, video, audio or text.
A second aspect of the embodiments of the present application provides an action prediction method implemented by the generative flow model in the first aspect or any one of the possible implementations of the first aspect, the method including: obtaining information of an agent, the information indicating that the agent is in a target state; and processing the information through the generative flow model to obtain the occurrence probability of an action of the agent, where the action causes the agent to enter a next state of the target state from the target state.
From the above method, it can be seen that: when the agent is currently in the target state, the agent can first collect information indicating that it is in the target state, in order to predict an action that will take it from the target state into a next state of the target state. After obtaining the information, the agent can input it into the generative flow model and process it through the generative flow model, thereby obtaining the occurrence probabilities of one or more actions of the agent. The agent can then select the action with the highest occurrence probability among these one or more actions and execute it, so as to enter a certain next state of the target state. In this process, the generative flow model is built into the agent, so the agent can accurately complete its own transitions between different states based on the completed action prediction and action execution.
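A minimal inference sketch of this action prediction step, reusing the illustrative model interface assumed in the training sketch above; none of the names come from the application.

```python
import torch

def predict_and_select_action(flow_model, state_info: torch.Tensor) -> int:
    """Process the collected state information with the trained generative flow
    model and return the index of the action with the highest predicted
    occurrence probability; executing it moves the agent into a next state."""
    with torch.no_grad():
        action_probs = flow_model(state_info)  # occurrence probability of each action
    return int(torch.argmax(action_probs).item())
```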
A third aspect of the embodiments of the present application provides an action prediction method, including: obtaining information of an agent, the information indicating that the agent is in a target state; processing the information through the generative flow model to obtain the occurrence probability of an action of the agent, where the action causes the agent to enter a next state of the target state from the target state; and determining the action to be executed based on the occurrence probability of the action and the occurrence probability of a preset action.
From the above method, it can be seen that: when the agent is currently in the target state, the agent can first collect information indicating that it is in the target state, in order to predict an action that will take it from the target state into a next state of the target state. After obtaining the information, the agent can input it into the generative flow model and process it through the generative flow model, thereby obtaining the occurrence probabilities of one or more actions of the agent. The agent can then select the action with the highest occurrence probability among these one or more actions and execute it, so as to enter a certain next state of the target state. In this process, the generative flow model is built into the agent, so the agent can accurately complete its own transitions between different states based on the completed action prediction and action execution.
In one possible implementation, determining the action to be executed based on the occurrence probability of the action and the occurrence probability of the preset action includes: among the action and the preset action, determining the action with the highest occurrence probability as the action to be executed.
A fourth aspect of the embodiments of the present application provides a model training apparatus, the apparatus including: an obtaining module, configured to obtain first information of an agent from a preset offline data set, the first information indicating that the agent is in a target state; a processing module, configured to process the first information through a model to be trained to obtain the occurrence probability of a first action of the agent, where the first action causes the agent to enter a next state of the target state from the target state; and a training module, configured to train the model to be trained based on the occurrence probability of the first action and the true occurrence probability of the first action to obtain a generative flow model, where the true occurrence probability is derived from the offline data set.
From the above apparatus, it can be seen that: when the model to be trained needs to be trained, first information of the agent can be obtained from a preset offline data set, the first information indicating that the agent is in a target state. Then, the first information can be input into the model to be trained and processed by it to obtain the occurrence probability of a first action of the agent, where the first action causes the agent to enter a next state of the target state from the target state. Finally, the model to be trained can be trained based on the occurrence probability of the first action and the true occurrence probability of the first action, thereby obtaining a generative flow model; the true occurrence probability of the first action is derived from the offline data set. In the foregoing process, the occurrence probability of the first action can be regarded as the action policy predicted by the model to be trained for the target state, and the true occurrence probability of the first action can be regarded as the true action policy for the target state recorded in the offline data set. Training therefore makes the predicted action policy for the target state fit the true action policy for the target state as closely as possible. Because the true action policy determines the true probability with which the agent enters each next state of the target state from the target state, the model to be trained not only learns as many of the next states of the target state as possible, but the learned states also conform to the actual environment in which the agent is located (since the data in the offline data set are all set in advance based on that actual environment). The generative flow model obtained by this offline training mode can therefore have better performance.
In one possible implementation, the training module is configured to train the model to be trained based on the occurrence probability of the first action so that the difference between the occurrence probability of the first action and the true occurrence probability of the first action falls within a preset range, thereby obtaining the generative flow model.
In one possible implementation, the training module is configured to: correct the occurrence probability of a second action of the agent based on the offline data set to obtain a corrected occurrence probability of the second action, where the second action causes the agent to enter the target state from a previous state of the target state; correct the reward value corresponding to the target state based on the offline data set to obtain a corrected reward value corresponding to the target state; and train the model to be trained based on the occurrence probability of the first action, the corrected occurrence probability of the second action, and the corrected reward value corresponding to the target state to obtain the generative flow model.
In one possible implementation, the offline data set includes M first candidate information and M second candidate information, the ith first candidate information being used to indicate that the agent is in an ith candidate state, the ith second candidate information being used to indicate that the agent is in a previous state to the ith candidate state, the M first candidate information including first information, the M second candidate information including second information, the second information being used to indicate that the agent is in a previous state to the target state, the M candidate states including the target state, M being greater than or equal to 1; the training module is used for correcting the occurrence probability of the second action of the intelligent agent based on the first information, the second information, the M pieces of first candidate information and the M pieces of second candidate information to obtain the corrected occurrence probability of the second action.
In one possible implementation manner, the offline data set further includes reward values corresponding to M candidate states, and the training module is configured to correct the reward values corresponding to the target states based on the first information, the M first candidate information, and the reward values corresponding to the M candidate states, to obtain corrected reward values corresponding to the target states.
In one possible implementation, the first information is information collected when the agent is in a target state, where the information includes at least one of: image, video, audio or text.
A fifth aspect of the embodiments of the present application provides an action prediction apparatus including the generative flow model in the third aspect or any one of the possible implementations of the third aspect, the apparatus including: an obtaining module, configured to obtain information of an agent, the information indicating that the agent is in a target state; and a processing module, configured to process the information through the generative flow model to obtain the occurrence probability of an action of the agent, where the action causes the agent to enter a next state of the target state from the target state.
From the above apparatus, it can be seen that: when the agent is currently in the target state, the agent can first collect information indicating that it is in the target state, in order to predict an action that will take it from the target state into a next state of the target state. After obtaining the information, the agent can input it into the generative flow model and process it through the generative flow model, thereby obtaining the occurrence probabilities of one or more actions of the agent. The agent can then select the action with the highest occurrence probability among these one or more actions and execute it, so as to enter a certain next state of the target state. In this process, the generative flow model is built into the agent, so the agent can accurately complete its own transitions between different states based on the completed action prediction and action execution.
A sixth aspect of the embodiments of the present application provides an action prediction apparatus, including: an obtaining module, configured to obtain information of an agent, the information indicating that the agent is in a target state; a processing module, configured to process the information through the generative flow model to obtain the occurrence probability of an action of the agent, where the action causes the agent to enter a next state of the target state from the target state; and a determining module, configured to determine the action to be executed based on the occurrence probability of the action and the occurrence probability of a preset action.
From the above apparatus, it can be seen that: when the agent is currently in the target state, the agent can first collect information indicating that it is in the target state, in order to predict an action that will take it from the target state into a next state of the target state. After obtaining the information, the agent can input it into the generative flow model and process it through the generative flow model, thereby obtaining the occurrence probabilities of one or more actions of the agent. The agent can then select the action with the highest occurrence probability among these one or more actions and execute it, so as to enter a certain next state of the target state. In this process, the generative flow model is built into the agent, so the agent can accurately complete its own transitions between different states based on the completed action prediction and action execution.
In one possible implementation, determining the action to be executed based on the occurrence probability of the action and the occurrence probability of the preset action includes: among the action and the preset action, determining the action with the highest occurrence probability as the action to be executed.
A seventh aspect of embodiments of the present application provides a model training apparatus, the apparatus comprising a memory and a processor; the memory stores code, the processor being configured to execute the code, and when the code is executed, the model training apparatus performs the method as described in the first aspect or any one of the possible implementations of the first aspect.
An eighth aspect of the embodiments of the present application provides an action prediction apparatus, the apparatus comprising a memory and a processor; the memory stores code, the processor is configured to execute the code, and when the code is executed, the action prediction apparatus performs the method according to the second aspect, the third aspect, or any one of the possible implementations of the third aspect.
A ninth aspect of the embodiments of the present application provides a circuit system comprising a processing circuit configured to perform the method according to any one of the first aspect, any one of the possible implementations of the first aspect, the second aspect, the third aspect or any one of the possible implementations of the third aspect.
A tenth aspect of the embodiments of the present application provides a chip system, which includes a processor for invoking a computer program or computer instructions stored in a memory to cause the processor to perform a method as described in any one of the first aspect, any one of the possible implementations of the first aspect, the second aspect, the third aspect, or any one of the possible implementations of the third aspect.
In one possible implementation, the processor is coupled to the memory through an interface.
In one possible implementation, the system on a chip further includes a memory having a computer program or computer instructions stored therein.
An eleventh aspect of the embodiments of the present application provides a computer storage medium storing a computer program which, when executed by a computer, causes the computer to implement a method as described in the first aspect, the second aspect, the third aspect, or any one of their possible implementations.
A twelfth aspect of the embodiments of the present application provides a computer program product storing instructions that, when executed by a computer, cause the computer to implement a method as described in the first aspect, the second aspect, the third aspect, or any one of their possible implementations.
In this embodiment, when the model to be trained needs to be trained, first information of the agent can be obtained from a preset offline data set, the first information indicating that the agent is in a target state. Then, the first information can be input into the model to be trained and processed by it to obtain the occurrence probability of a first action of the agent, where the first action causes the agent to enter a next state of the target state from the target state. Finally, the model to be trained can be trained based on the occurrence probability of the first action, thereby obtaining the generative flow model. In the foregoing process, when the model to be trained obtains the occurrence probability of the first action of the agent, the difference between this occurrence probability and the true occurrence probability of the first action falls within a preset range. The occurrence probability of the first action can be regarded as the action policy predicted by the model to be trained for the target state, and the true occurrence probability of the first action can be regarded as the true action policy for the target state recorded in the offline data set, so the predicted action policy for the target state fits the true action policy for the target state as closely as possible. Because the true action policy determines the true probability with which the agent enters each next state of the target state from the target state, the model to be trained not only learns as many of the next states of the target state as possible, but the learned states also conform to the actual environment in which the agent is located (since the data in the offline data set are all set in advance based on that actual environment). The generative flow model obtained by this offline training mode can therefore have better performance.
Drawings
FIG. 1 is a schematic structural diagram of an artificial intelligence main framework;
FIG. 2a is a schematic diagram of a motion prediction system according to an embodiment of the present disclosure;
FIG. 2b is a schematic diagram of another configuration of the motion prediction system according to the embodiments of the present application;
FIG. 2c is a schematic diagram of a device related to motion prediction provided in an embodiment of the present application;
FIG. 3 is a schematic diagram of a system 100 architecture according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a generative flow model according to an embodiment of the present application;
FIG. 5a is a schematic flow chart of a model training method according to an embodiment of the present application;
FIG. 5b is an exemplary illustration of an application of the model training method provided by embodiments of the present application;
FIG. 6 is a schematic flow chart of a motion prediction method according to an embodiment of the present disclosure;
FIG. 7 is another schematic structural diagram of the generative flow model according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a model training device according to an embodiment of the present disclosure;
fig. 9a is a schematic structural diagram of an action prediction device according to an embodiment of the present application;
fig. 9b is another schematic structural diagram of the motion prediction apparatus according to the embodiment of the present application;
Fig. 10 is a schematic structural diagram of an execution device according to an embodiment of the present application;
FIG. 11 is a schematic structural diagram of a training device according to an embodiment of the present disclosure;
fig. 12 is a schematic structural diagram of a chip according to an embodiment of the present application.
Detailed Description
The embodiments of the application provide a model training method and related equipment, in which a generative flow model is obtained through offline training so that the generative flow model has better performance.
The terms "first", "second" and the like in the description, the claims and the above figures of the present application are used to distinguish between similar objects and are not necessarily used to describe a particular sequence or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances, and are merely a way of distinguishing objects of the same nature when describing the embodiments of the application. Furthermore, the terms "comprises", "comprising" and "having", and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article or apparatus that comprises a list of elements is not necessarily limited to those elements, but may include other elements not expressly listed or inherent to such process, method, article or apparatus.
With the rapid development of AI technology, generative flow models are widely used to describe and solve the action policy selection of an agent interacting with an environment, so that the agent can obtain the maximum return or achieve a specific goal after executing the corresponding actions.
At present, a generative flow model provided by the related art can, after determining that an agent is in a target state, process information associated with the target state so as to predict the occurrence probabilities of one or more actions of the agent, where these actions cause the agent to enter one or more next states of the target state from the target state. In this way, the agent can execute the action with the largest occurrence probability predicted by the neural network model, so as to enter a certain next state of the target state. This process is repeated continuously, so that the agent can move from the initial state, through intermediate states, to the final state. For example, let the agent be a vehicle in an autonomous driving scenario. When the vehicle drives straight toward an intersection, it detects that the intersection has a red light (for example, the vehicle's camera captures the red light at the intersection). At this moment, driving toward the intersection where the red light appears can be regarded as the initial state of the vehicle. The vehicle can then input information indicating that it is in the initial state (for example, the image of the intersection with the red light captured by the camera) into the neural network model, and the model can analyze the information to predict the occurrence probabilities of the actions the vehicle may perform (for example, a 99% probability of stopping and a 1% probability of continuing to drive). The vehicle then performs the action with the highest occurrence probability and stops in front of the intersection where the red light appears, which can be regarded as an intermediate state of the vehicle.
The above generative flow model is generally trained in an online mode, that is, during training, for any state of the agent, the model can apply the predicted action to that state in an environment simulator, thereby randomly generating the next state of the agent. Although this training mode lets the model learn as many states of the agent as possible, some of these states do not match the actual environment in which the agent is located, so the performance of the trained generative flow model is only mediocre.
Further, when the model is trained in the online mode, for any state of the agent the model needs to determine not only the next state of that state but also its previous state. However, the model often has difficulty finding the correct previous state, which further degrades the performance of the trained generative flow model.
To address the above issues, embodiments of the present application provide an action prediction method that may be implemented in conjunction with artificial intelligence (AI) technology. AI technology is a technical discipline that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence; it obtains optimal results by perceiving the environment, acquiring knowledge and using knowledge. In other words, artificial intelligence technology is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Data processing using artificial intelligence is a common application of artificial intelligence.
First, the overall workflow of the artificial intelligence system will be described. Referring to FIG. 1, FIG. 1 is a schematic structural diagram of an artificial intelligence main framework, which is described below from two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis). The "intelligent information chain" reflects a series of processes from data acquisition to processing. For example, it may be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision making, and intelligent execution and output. In this process, the data undergoes a condensation process of "data - information - knowledge - wisdom". The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of human intelligence and information (providing and processing technology implementations) to the industrial ecology of the system.
(1) Infrastructure
The infrastructure provides computing capability support for the artificial intelligence system, realizes communication with the outside world, and is supported by the base platform. Communication with the outside is performed through sensors; computing power is provided by smart chips (hardware acceleration chips such as CPUs, NPUs, GPUs, ASICs and FPGAs); the base platform includes distributed computing frameworks, networks and other related platform guarantees and support, and may include cloud storage and computing, interconnection networks, and the like. For example, the sensors communicate with the outside to obtain data, and the data are provided to the smart chips in the distributed computing system provided by the base platform for computation.
(2) Data
The data of the upper layer of the infrastructure is used to represent the data source in the field of artificial intelligence. The data relate to graphics, images, voice and text, and also relate to the internet of things data of the traditional equipment, including service data of the existing system and sensing data such as force, displacement, liquid level, temperature, humidity and the like.
(3) Data processing
Data processing typically includes data training, machine learning, deep learning, searching, reasoning, decision making, and the like.
Wherein machine learning and deep learning can perform symbolized and formalized intelligent information modeling, extraction, preprocessing, training and the like on data.
Reasoning refers to the process of simulating human intelligent reasoning modes in a computer or an intelligent system, and carrying out machine thinking and problem solving by using formal information according to a reasoning control strategy, and typical functions are searching and matching.
Decision making refers to the process of making decisions after intelligent information is inferred, and generally provides functions of classification, sequencing, prediction and the like.
(4) General capability
After the data has been processed, some general-purpose capabilities can be formed based on the result of the data processing, such as algorithms or a general-purpose system, for example, translation, text analysis, computer vision processing, speech recognition, image recognition, etc.
(5) Intelligent product and industry application
Intelligent products and industry applications refer to the products and applications of the artificial intelligence system in various fields; they are the encapsulation of the overall artificial intelligence solution, turning intelligent information decision making into products and realizing practical applications. The application fields mainly include: intelligent terminals, intelligent transportation, intelligent healthcare, autonomous driving, smart cities, and the like.
Next, several application scenarios of the present application are described.
Fig. 2a is a schematic structural diagram of a motion prediction system according to an embodiment of the present application, where the motion prediction system includes an agent and a data processing device. The agent may be an intelligent terminal such as a robot, a vehicle-mounted device or an unmanned aerial vehicle. The agent is the initiating end of action prediction and, as the initiator of an action prediction request, can initiate the request by itself.
The data processing device may be a device or a server with a data processing function, such as a cloud server, a web server, an application server or a management server. The data processing device receives the action prediction request from the intelligent terminal through an interactive interface, and then performs information processing by means of machine learning, deep learning, searching, reasoning, decision making and the like, using a memory that stores data and a processor that processes data. The memory in the data processing device may be a general term that includes a local database storing historical data, and the database may be located on the data processing device or on another network server.
In the motion prediction system shown in fig. 2a, during its interaction with the environment, the agent may collect its own state information and then initiate a request to the data processing device, so that the data processing device executes the action prediction application on the state information collected by the agent and obtains the occurrence probability of the agent's actions. For example, an agent may collect information indicating that it is in a certain state and initiate a processing request for that information to the data processing device. The data processing device may then call the generative flow model to process the information, obtain the occurrence probabilities of the agent's actions, and return them to the agent, where these actions can take the agent from that state into a next state; at this point the action prediction for the agent is completed. The agent may then select the action with the highest occurrence probability and execute it to enter the next state.
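The application does not specify how the agent and the data processing device exchange data. Purely as an illustration of the flow in fig. 2a, the sketch below assumes a simple JSON-over-HTTP interface with a hypothetical endpoint and field names.

```python
import requests  # assumption: the agent and data processing device communicate over HTTP

def request_action_prediction(state_info: dict) -> int:
    """Hypothetical client-side flow for fig. 2a: the agent sends its collected
    state information to the data processing device, which runs the generative
    flow model and returns the occurrence probability of each candidate action."""
    response = requests.post(
        "http://data-processing-device.example/predict",  # illustrative endpoint
        json={"state_info": state_info},
        timeout=5.0,
    )
    action_probs = response.json()["action_probabilities"]
    # The agent executes the action with the highest predicted occurrence probability.
    return max(range(len(action_probs)), key=lambda i: action_probs[i])
```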
In fig. 2a, a data processing device may perform the action prediction method of the embodiments of the present application.
Fig. 2b is another schematic structural diagram of the motion prediction system provided in the embodiment of the present application, in fig. 2b, an agent may perform motion prediction, and the agent may directly obtain its own state information and directly process the state information by the hardware of the agent, and the specific process is similar to fig. 2a, and reference is made to the above description and will not be repeated here.
In the motion prediction system shown in fig. 2b, for example, a certain agent may obtain information indicating that the agent itself and other agents are in a certain state, and process the information to obtain the occurrence probability of the motion of the agent, where the motion may make the agent enter a next state from the state, so far, motion prediction for the agent is completed, and then the agent may select the motion with the highest occurrence probability and execute the motion, thereby entering the next state.
In fig. 2b, the agent itself may perform the action prediction method of the embodiments of the present application.
Fig. 2c is a schematic diagram of a related device for motion prediction according to an embodiment of the present application.
The agent in fig. 2a and 2b may be specifically the local device 301 or the local device 302 in fig. 2c, and the data processing device in fig. 2a may be specifically the execution device 210 in fig. 2c, where the data storage system 250 may store data to be processed of the execution device 210, and the data storage system 250 may be integrated on the execution device 210, or may be disposed on a cloud or other network server.
The processors in fig. 2a and fig. 2b may perform data training/machine learning/deep learning through a neural network model or another model (e.g., a generative flow model), and use the model finally trained or learned from the data to complete the action prediction application for the state information of the agent, thereby predicting the action of the agent.
Fig. 3 is a schematic diagram of a system 100 architecture provided in an embodiment of the present application, in fig. 3, an execution device 110 configures an input/output (I/O) interface 112 for data interaction with an external device, and a client device 140 (i.e. the foregoing agent) inputs data to the I/O interface 112, where the input data may include in an embodiment of the present application: each task to be scheduled, callable resources, and other parameters.
In the process of preprocessing input data by the execution device 110, or performing relevant processing (such as performing functional implementation of a neural network in the present application) such as calculation by the calculation module 111 of the execution device 110, the execution device 110 may call data, codes, etc. in the data storage system 150 for corresponding processing, or may store data, instructions, etc. obtained by corresponding processing in the data storage system 150.
Finally, the I/O interface 112 returns the processing results to the client device 140.
It should be noted that the training device 120 may generate, based on different training data, a corresponding target model/rule for different targets or different tasks, where the corresponding target model/rule may be used to achieve the targets or complete the tasks, thereby providing the user with the desired result. Wherein the training data may be stored in database 130 and derived from training samples collected by data collection device 160.
In the case shown in FIG. 3, the user may manually give input data, which may be manipulated through an interface provided by the I/O interface 112. In another case, the client device 140 may automatically send the input data to the I/O interface 112, and if the client device 140 is required to automatically send the input data requiring the user's authorization, the user may set the corresponding permissions in the client device 140. The user may view the results output by the execution device 110 at the client device 140, and the specific presentation may be in the form of a display, a sound, an action, or the like. The client device 140 may also be used as a data collection terminal to collect input data of the input I/O interface 112 and output results of the output I/O interface 112 as new sample data as shown in the figure, and store the new sample data in the database 130. Of course, instead of being collected by the client device 140, the I/O interface 112 may directly store the input data input to the I/O interface 112 and the output result output from the I/O interface 112 as new sample data into the database 130.
It should be noted that fig. 3 is only a schematic diagram of a system architecture provided in the embodiment of the present application, and the positional relationship among devices, apparatuses, modules, etc. shown in the drawing is not limited in any way, for example, in fig. 3, the data storage system 150 is an external memory with respect to the execution device 110, and in other cases, the data storage system 150 may be disposed in the execution device 110. As shown in fig. 3, the neural network may be trained in accordance with the training device 120.
The embodiment of the application also provides a chip, which comprises the NPU. The chip may be provided in an execution device 110 as shown in fig. 3 for performing the calculation of the calculation module 111. The chip may also be provided in the training device 120 as shown in fig. 3 to complete the training work of the training device 120 and output the target model/rule.
The neural network processor NPU is mounted as a coprocessor to a main central processing unit (central processing unit, CPU) (host CPU) which distributes tasks. The core part of the NPU is an operation circuit, and the controller controls the operation circuit to extract data in a memory (a weight memory or an input memory) and perform operation.
In some implementations, the arithmetic circuitry includes a plurality of processing units (PEs) internally. In some implementations, the operational circuit is a two-dimensional systolic array. The arithmetic circuitry may also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, the operational circuitry is a general-purpose matrix processor.
For example, assume that there is an input matrix a, a weight matrix B, and an output matrix C. The arithmetic circuit takes the data corresponding to the matrix B from the weight memory and caches the data on each PE in the arithmetic circuit. The operation circuit takes the matrix A data and the matrix B from the input memory to perform matrix operation, and the obtained partial result or the final result of the matrix is stored in an accumulator (accumulator).
The vector calculation unit may further process the output of the operation circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, magnitude comparison, etc. For example, the vector computation unit may be used for network computation of non-convolutional/non-FC layers in a neural network, such as pooling, batch normalization (batch normalization), local response normalization (local response normalization), and the like.
In some implementations, the vector computation unit can store the vector of processed outputs to a unified buffer. For example, the vector calculation unit may apply a nonlinear function to an output of the arithmetic circuit, such as a vector of accumulated values, to generate the activation value. In some implementations, the vector calculation unit generates a normalized value, a combined value, or both. In some implementations, the vector of processed outputs can be used as an activation input to an arithmetic circuit, for example for use in subsequent layers in a neural network.
The unified memory is used for storing input data and output data.
A storage unit access controller (direct memory access controller, DMAC) transfers input data in the external memory to the input memory and/or the unified memory, stores the weight data in the external memory into the weight memory, and stores the data in the unified memory into the external memory.
And a bus interface unit (bus interface unit, BIU) for implementing interaction among the main CPU, the DMAC and the instruction fetch memory through a bus.
The instruction fetching memory (instruction fetch buffer) is connected with the controller and used for storing instructions used by the controller;
and the controller is used for calling the instructions cached in the instruction fetch memory, so as to control the working process of the operation accelerator.
Typically, the unified memory, the input memory, the weight memory and the instruction fetch memory are all On-Chip memories, and the external memory is a memory external to the NPU, which may be a double data rate synchronous dynamic random access memory (double data rate synchronous dynamic random access memory, DDR SDRAM), a high bandwidth memory (high bandwidth memory, HBM), or another readable and writable memory.
Since the embodiments of the present application relate to a large number of applications of neural networks, for ease of understanding, related terms and related concepts of the neural networks related to the embodiments of the present application will be described below.
(1) Neural network
The neural network may be composed of neural units. A neural unit may refer to an operation unit that takes x_s and an intercept of 1 as inputs, and the output of the operation unit may be:

$$f\left(\sum_{s=1}^{n} W_s x_s + b\right)$$

where s = 1, 2, ..., n, n is a natural number greater than 1, $W_s$ is the weight of $x_s$, and b is the bias of the neural unit. f is the activation function (activation function) of the neural unit, which is used to introduce a nonlinear characteristic into the neural network so as to convert the input signal of the neural unit into an output signal. The output signal of the activation function may be used as the input of the next convolutional layer, and the activation function may be a sigmoid function. A neural network is a network formed by connecting many of the above single neural units together, that is, the output of one neural unit may be the input of another neural unit. The input of each neural unit may be connected to the local receptive field of the previous layer to extract features of the local receptive field, and the local receptive field may be an area composed of several neural units.
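For illustration only, the output of a single neural unit described above can be sketched in Python as follows; the sigmoid activation and the concrete numbers are assumptions chosen for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neural_unit(x, W, b):
    """Output of one neural unit: f(sum_s W_s * x_s + b), with f = sigmoid here."""
    return sigmoid(np.dot(W, x) + b)

x = np.array([0.5, -1.0, 2.0])   # inputs x_s
W = np.array([0.1, 0.4, -0.3])   # weights W_s
b = 0.2                          # bias of the neural unit
print(neural_unit(x, W, b))      # activation value passed on to the next layer
```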
The operation of each layer in a neural network can be described by the mathematical expression y = a(Wx + b). At the physical level, the operation of each layer can be understood as completing a transformation from the input space to the output space (that is, from the row space to the column space of the matrix) through five operations on the input space (the set of input vectors), including: 1. raising/lowering the dimension; 2. scaling up/down; 3. rotation; 4. translation; 5. "bending". Operations 1, 2 and 3 are completed by Wx, operation 4 is completed by +b, and operation 5 is completed by a(). The word "space" is used here because the object being classified is not a single thing but a class of things, and the space refers to the collection of all individuals of that class of things. W is a weight vector, and each value in the vector represents the weight value of one neuron in that layer of the neural network. The vector W determines the spatial transformation from the input space to the output space described above, that is, the weight W of each layer controls how the space is transformed. The purpose of training the neural network is to finally obtain the weight matrices of all layers of the trained neural network (the weight matrices formed by the vectors W of the plurality of layers). Therefore, the training process of the neural network is essentially learning how to control the spatial transformation, and more specifically, learning the weight matrices.
Since it is desirable that the output of the neural network be as close as possible to the value that is actually desired to be predicted, the predicted value of the current network can be compared with the actually desired target value, and the weight vector of each layer of the neural network can then be updated according to the difference between the two (of course, there is usually an initialization process before the first update, that is, parameters are pre-configured for each layer of the neural network). For example, if the predicted value of the network is too high, the weight vectors are adjusted so that the prediction becomes lower, and the adjustment continues until the neural network can predict the actually desired target value. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value"; this is the role of the loss function (loss function) or the objective function (objective function), which are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, the higher the output value (loss) of the loss function, the larger the difference, and the training of the neural network thus becomes a process of reducing this loss as much as possible.
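As a minimal illustration of a loss function that measures the difference between the predicted value and the target value, the sketch below uses mean squared error; the specific loss form is an assumption, since the embodiment does not prescribe one here.

```python
import numpy as np

def mse_loss(predicted, target):
    """A higher output value (loss) means a larger difference between prediction and target."""
    predicted, target = np.asarray(predicted), np.asarray(target)
    return float(np.mean((predicted - target) ** 2))

print(mse_loss([0.9, 0.2], [1.0, 0.0]))  # training aims to drive this value down
```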
(2) Back propagation algorithm
The neural network can adopt the back propagation (back propagation, BP) algorithm to correct the parameters in the initial neural network model during training, so that the reconstruction error loss of the neural network model becomes smaller and smaller. Specifically, the input signal is passed forward until an error loss is produced at the output, and the parameters in the initial neural network model are updated by back-propagating the error loss information, so that the error loss converges. The back propagation algorithm is a back propagation movement dominated by the error loss, and aims to obtain the parameters of the optimal neural network model, for example, the weight matrix.
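A minimal sketch of one forward pass followed by a back propagation update is given below; the use of PyTorch autograd, the toy linear model and the learning rate are assumptions made only to illustrate the BP procedure.

```python
import torch

model = torch.nn.Linear(3, 1)                            # toy initial neural network model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 3)                                    # input signal, passed forward
target = torch.randn(8, 1)                               # actually desired target values
loss = torch.nn.functional.mse_loss(model(x), target)    # error loss at the output

optimizer.zero_grad()
loss.backward()                                          # back-propagate the error loss information
optimizer.step()                                         # update the parameters (weight matrix, bias)
```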
(3) Generating a flow model (generative flow networks, GFlowNet)
A generated flow model generally refers to a model constructed in the form of a directed acyclic graph, that is, a state node may have more than one parent state node, as opposed to a tree structure in which each state node has only a unique parent state node. The generated flow model has a unique initial state node and a plurality of termination state nodes. The generated flow model predicts actions starting from the initial state node, and executing these actions completes the transitions between different states until a termination state node is reached.
The initial state node may include an output flow, an intermediate state node may include an input flow, an output flow and a preset reward value, and a termination state node may include an input flow and a preset reward value. Intuitively, the generated flow model can be imagined as a network of water pipes: the output flow of the initial state node is the total inflow of the whole generated flow model, and the sum of the input flows of all the termination state nodes is the total outflow of the whole generated flow model. For each intermediate state node, the input flow is equal to the output flow. The input flow and the output flow of each intermediate state node are predicted by the neural network, and finally the input flow of each termination state node can be predicted.
For example, as shown in fig. 4 (fig. 4 is a schematic structural diagram of a generated flow model provided in an embodiment of the present application), in the generated flow model, si represents a state node (i = 0, ..., 11) and xj represents a composite structure (j = 3, 4, 6, 10, 11). s0 is the initial state node, s10 and s11 are termination state nodes, x3, x4, x6, x10 and x11 are composite structures, and reward values are arranged in the composite structures. The output flow of the initial state node s0 is equal to the sum of the input flow of the intermediate state node s1 and the input flow of the intermediate state node s2.
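The flow conservation property described above (the input flow of every intermediate state node equals its output flow, and the total inflow of the model equals the sum of the input flows of the termination state nodes) can be checked on a small directed acyclic graph as sketched below; the node names and edge flows are made-up numbers for illustration.

```python
# edge flows: (parent_state, child_state) -> flow carried by the action between them
flows = {
    ("s0", "s1"): 3.0, ("s0", "s2"): 2.0,
    ("s1", "s3"): 3.0, ("s2", "s3"): 1.0, ("s2", "s4"): 1.0,
}
terminal_states = {"s3", "s4"}

def in_flow(state):
    return sum(f for (parent, child), f in flows.items() if child == state)

def out_flow(state):
    return sum(f for (parent, child), f in flows.items() if parent == state)

for state in ("s1", "s2"):                      # intermediate state nodes
    assert abs(in_flow(state) - out_flow(state)) < 1e-9, f"flow not conserved at {state}"

total_inflow = out_flow("s0")                   # total inflow of the whole generated flow model
total_outflow = sum(in_flow(s) for s in terminal_states)
assert abs(total_inflow - total_outflow) < 1e-9
```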
The method provided in the present application is described below from the training side of the neural network and the application side of the neural network.
The model training method provided by the embodiment of the application relates to data sequence processing, and can be particularly applied to methods such as data training, machine learning, deep learning and the like, and intelligent information modeling, extraction, preprocessing, training and the like of symbolizing and formalizing training data (for example, first information of a first agent in the model training method provided by the embodiment of the application) are performed, so that a trained neural network (such as a generated flow model in the model training method provided by the embodiment of the application) is finally obtained; in addition, the motion prediction method provided in the embodiment of the present application may use the trained neural network, and input data (for example, information of the first agent in the motion prediction method provided in the embodiment of the present application, etc.) into the trained neural network, so as to obtain output data (for example, occurrence probability of the motion of the first agent in the motion prediction method provided in the embodiment of the present application). It should be noted that, the model training method and the motion prediction method provided in the embodiments of the present application are inventions based on the same concept, and may be understood as two parts in a system or two stages of an overall flow: such as a model training phase and a model application phase.
The model training method provided by the embodiment of the present application is described below with reference to fig. 5a. Fig. 5a is a schematic flow chart of the model training method provided by the embodiment of the present application; as shown in fig. 5a, the method includes:
501. and acquiring first information of the intelligent agent from a preset offline data set, wherein the first information is used for indicating that the intelligent agent is in a target state.
In this embodiment, when the model to be trained (the neural network model to be trained) needs to be trained, a preset offline data set (offline dataset) may be acquired first, where the offline data set includes M data sets (M is a positive integer greater than or equal to 1). The 1st data set includes the 1st first candidate information, the 1st second candidate information, the 1st third candidate information, the reward value corresponding to the 1st candidate state, and the 1st real action policy. Similarly, the M-th data set includes the M-th first candidate information, the M-th second candidate information, the M-th third candidate information, the reward value corresponding to the M-th candidate state, and the M-th real action policy. For ease of explanation, the following description takes the i-th data set as an example (i = 1, ..., M):
in the ith data set, the ith first candidate information is used for indicating that the intelligent agent is in the ith candidate state, the ith second candidate information is used for indicating that the intelligent agent is in the previous state of the ith candidate state, the ith third candidate information is used for indicating that the intelligent agent is in the next state of the ith candidate state, and it should be noted that the ith candidate state can have one or more previous states, and similarly, the ith candidate state can also have one or more next states. Also, the previous state and the next state of the i-th candidate state are generally the remaining candidate states other than the i-th candidate state among the M candidate states, that is, the transition relationship between the M candidate states has been set to be completed in advance (for example, the transition relationship between the M candidate states may refer to the transition relationship between the 11 states of s1 to s11 in fig. 4).
In the ith data set, the ith real action policy refers to the real occurrence probability of actions of the agent flowing out of the ith candidate state, the actions of the agent flowing out of the ith candidate state are used for enabling the agent to enter the next state of the ith candidate state from the ith candidate state, and it is required to be noted that the number of actions flowing out of the ith candidate state is the same as the number of the next states of the ith candidate state, and the actions are in one-to-one correspondence with each other.
In the ith data set, the prize value corresponding to the ith candidate state is a preset value, when the ith candidate state is an intermediate state, the corresponding prize value is zero, and when the ith candidate state is a termination state, the corresponding prize value is not zero (the size of the prize value can be set according to actual requirements, and the method is not limited).
Then, after the offline data set is obtained, since M data sets of the offline data set may be used as training data of the model to be trained, a certain data set is schematically described below, and the first candidate information in the data set is referred to as first information, the candidate state indicated by the first candidate information in the data set is referred to as a target state, the second candidate information in the data set is referred to as second information, and the third candidate information in the data set is referred to as third information. It follows that the first information is used to indicate that the agent is in the target state, the second information is used to indicate that the agent is in a previous state to the target state, the third information is used to indicate that the agent is in a next state to the target state, and the true occurrence probability of the action (i.e., the first action mentioned below) flowing out of the target state is known.
For example, the offline data set D includes the M data sets $(s''_1, s_1, s'_1, R_1, \pi_1), \ldots, (s''_M, s_M, s'_M, R_M, \pi_M)$. Here, $s_1$ is the information indicating the 1st (candidate) state, and so on, up to $s_M$, which is the information indicating the M-th state; $s''_1$ is the information indicating the previous state of the 1st state, and so on, up to $s''_M$, which is the information indicating the previous state of the M-th state; $s'_1$ is the information indicating the next state of the 1st state, and so on, up to $s'_M$, which is the information indicating the next state of the M-th state; $R_1$ is the reward value corresponding to the 1st state, and so on, up to $R_M$, which is the reward value corresponding to the M-th state; $\pi_1$ is the 1st real action policy, i.e., the true occurrence probability of the actions flowing out of the 1st state, and so on, up to $\pi_M$, which is the M-th real action policy, i.e., the true occurrence probability of the actions flowing out of the M-th state.

Then, one of the data sets $(s''_1, s_1, s'_1, R_1, \pi_1), \ldots, (s''_M, s_M, s'_M, R_M, \pi_M)$ can be selected as $(s'', s, s', R, \pi)$, where s is the information indicating the target state, s'' is the information indicating the previous state of the target state, s' is the information indicating the next state of the target state, R is the reward value corresponding to the target state, and $\pi$ is the true occurrence probability of the actions flowing out of the target state.
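As a sketch of how such an offline data set might be represented in code, the structure below mirrors the tuple (s'', s, s', R, π); the field names are illustrative assumptions, and the state information could equally be images, video, audio or text.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

@dataclass
class OfflineRecord:
    prev_state: Any                 # s'': information indicating the previous state
    state: Any                      # s  : information indicating the candidate state
    next_state: Any                 # s' : information indicating the next state
    reward: float                   # R  : zero for intermediate states, non-zero for termination states
    true_policy: Dict[str, float]   # pi : true occurrence probability of each outgoing action

offline_dataset: List[OfflineRecord] = [
    OfflineRecord(prev_state="s0", state="s1", next_state="s3",
                  reward=0.0, true_policy={"a3": 0.7, "a4": 0.3}),
    # ... one record per candidate state, M records in total
]
```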
It should be understood that the information mentioned in this embodiment may be presented in various manners, for example, the first information for indicating that the agent is in the target state may be an image for presenting that the agent is in the target state, a video for presenting that the agent is in the target state, an audio for recording that the agent is in the target state, a text for describing that the agent is in the target state, or the like.
It should be further understood that, in this embodiment, the occurrence probability of an action flowing out of a certain state is the output flow of that state, and correspondingly, the occurrence probability of an action flowing into a certain state is the input flow of that state; these equivalences will not be repeated later.
502. And processing the first information through the model to be trained to obtain the occurrence probability of a first action of the intelligent agent, wherein the first action is used for enabling the intelligent agent to enter a next state of the target state from the target state.
After the first information is obtained, the first information may be input into the model to be trained, so that the first information is processed by the model to be trained, thereby obtaining the (prediction) occurrence probability of the first action (i.e., the action flowing out from the target state) of the intelligent agent, where the first action is used to make the intelligent agent enter the next state of the target state from the target state. So far, the model to be trained completes the action prediction aiming at the target state.
As still the above example, after s is input into the model to be trained, the model to be trained can perform a series of processing on s to obtain the predicted occurrence probabilities of the actions flowing out of the target state:

$$\hat{F}(s, a_1), \hat{F}(s, a_2), \ldots, \hat{F}(s, a_N)$$

where $\hat{F}(s, a_j)$ is the predicted occurrence probability of the j-th action $a_j$ flowing out of the target state, and N is the number of actions flowing out of the target state.
503. Training the model to be trained based on the occurrence probability of the first action and the true occurrence probability of the first action to obtain a generated flow model, wherein the true occurrence probability is derived from the offline data set.
It should be noted that, when the to-be-trained model acquires the occurrence probability of the first action of the agent, the following constraint condition is to be followed as far as possible (it is also understood that the to-be-trained model takes the constraint condition as a model training target): the difference between the occurrence probability of the first action of the agent and the true occurrence probability of the first action of the agent is made to lie within a preset range (the size of the range may be set according to the actual implementation, and is not limited here).
As still the above example, the model to be trained will, as far as possible, make the predicted occurrence probabilities satisfy the following condition:

$$\hat{F}(s, a_j) \approx F(s, a_j), \quad j = 1, \ldots, N$$

that is, the difference between $\hat{F}(s, a_j)$ and $F(s, a_j)$ lies within the preset range, where $F(s, a_j)$ is the true occurrence probability of the j-th action flowing out of the target state.
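A minimal sketch of this prediction step is given below, assuming the model to be trained is a small network that maps the encoded state information to one predicted flow value per outgoing action; the architecture, input encoding and numbers are assumptions for illustration.

```python
import torch

N_ACTIONS = 4                                        # number of actions flowing out of the target state
model_to_train = torch.nn.Sequential(
    torch.nn.Linear(16, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, N_ACTIONS), torch.nn.Softplus(),   # keeps the predicted flows positive
)

s = torch.randn(1, 16)                               # first information (target state), already encoded
predicted_outflow = model_to_train(s)                # predicted occurrence probabilities of a_1 .. a_N
true_outflow = torch.tensor([[0.5, 0.2, 0.2, 0.1]])  # true occurrence probabilities from the offline data set

# training pushes each |predicted - true| gap into the preset range
gap = (predicted_outflow - true_outflow).abs()
print(gap)
```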
After the occurrence probability of the first action of the intelligent agent is obtained, the model to be trained can be trained based on the occurrence probability of the first action of the intelligent agent until the model training condition is met, so that the generated flow model is obtained.
Specifically, the model to be trained may be trained by:
(1) After obtaining the occurrence probability of the first action of the agent, the occurrence probability of the second action of the agent (i.e., the action flowing into the target state) for causing the agent to enter the target state from the previous state of the target state may also be obtained. It should be noted that, since the model to be trained has completed the motion prediction for the previous state of the target state, the occurrence probability of the second motion of the agent can be directly obtained. Then, the probability of occurrence of the second action of the agent may be corrected using some of the data in the offline data set, thereby obtaining a corrected probability of occurrence of the second action of the agent.
More specifically, the corrected occurrence probability of the second action of the agent may be obtained by:
M pieces of first candidate information and M pieces of second candidate information are extracted from the offline data set, and a calculation is performed on the first information, the second information, the M pieces of first candidate information and the M pieces of second candidate information, so as to obtain a new transition value for the target state. Then, the occurrence probability of the second action of the agent is corrected by using the new transition value for the target state, so as to obtain the corrected occurrence probability of the second action.
Still as in the above example, the new transition value for the target state may be obtained by performing a calculation on s, s'', the M pieces of first candidate information $s_i$ and the M pieces of second candidate information $s''_i$ (i = 1, ..., M), where $s_i$ is the information indicating the i-th state and $s''_i$ is the information indicating the previous state of the i-th state. After the new transition value is obtained, it is used to correct the predicted occurrence probability $F(s, a'_k)$ of each action flowing into the target state, so as to obtain the corrected occurrence probability $\tilde{F}(s, a'_k)$ of that action, where $a'_k$ is the k-th action flowing into the target state (k = 1, ..., P), P is the number of actions flowing into the target state, $F(s, a'_k)$ is the predicted occurrence probability of the k-th action flowing into the target state, and $\tilde{F}(s, a'_k)$ is the corrected occurrence probability of the k-th action flowing into the target state.
(2) After the occurrence probability of the first action of the agent is obtained, the reward value corresponding to the target state can also be obtained from the offline data set, and the reward value corresponding to the target state is corrected by using some of the data in the offline data set, so as to obtain the corrected reward value corresponding to the target state.
More specifically, the corrected reward value corresponding to the target state may be obtained in the following manner:

M pieces of first candidate information and the reward values corresponding to the M candidate states are extracted from the offline data set, and a calculation is performed on the first information, the M pieces of first candidate information and the reward values corresponding to the M candidate states, so as to obtain the corrected reward value corresponding to the target state.
As still the above example, the corrected reward value corresponding to the target state may be obtained by performing a calculation on s, the M pieces of first candidate information $s_i$ and the reward values $R_i$ corresponding to the M candidate states, where $R_i$ is the reward value corresponding to the i-th state (i = 1, ..., M).
(3) After the corrected occurrence probability of the second action of the agent and the corrected reward value corresponding to the target state are obtained, a calculation can be performed on the occurrence probability of the first action of the agent, the corrected occurrence probability of the second action of the agent and the corrected reward value corresponding to the target state, so as to obtain the loss for the target state.
Still as in the above example, the loss L(s) for the target state may be obtained by combining the predicted occurrence probabilities $\hat{F}(s, a_j)$ of the actions flowing out of the target state, the corrected occurrence probabilities $\tilde{F}(s, a'_k)$ of the actions flowing into the target state, and the corrected reward value corresponding to the target state.
(4) After the loss for the target state is obtained, similar operations can be performed for the remaining candidate states other than the target state among the M candidate states, so that the losses for the M candidate states can finally be obtained. These losses can be summed to obtain the target loss, the parameters of the model to be trained are updated by using the target loss, and the model to be trained with updated parameters is obtained. The next batch of training data can then be used to continue training the model to be trained with updated parameters until the model training condition is met (for example, the target loss converges), so as to obtain the generated flow model.
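Putting steps (1) to (4) together, the sketch below shows one possible shape of the offline training loop. The correction of the inflow probabilities, the correction of the reward value and the exact loss are given by the calculations of this embodiment; here they are replaced by placeholder functions and a flow-matching-style objective (inflow balanced against reward plus outflow), which is an assumption made only so that the loop is runnable.

```python
import torch

N_ACTIONS = 4
model_to_train = torch.nn.Sequential(
    torch.nn.Linear(16, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, N_ACTIONS), torch.nn.Softplus(),
)
optimizer = torch.optim.Adam(model_to_train.parameters(), lr=1e-3)

# each record: (encoded previous state, encoded target state, reward value);
# real training data would come from the offline data set described above
offline_batch = [
    (torch.randn(1, 16), torch.randn(1, 16), 0.0),   # intermediate state, reward 0
    (torch.randn(1, 16), torch.randn(1, 16), 1.0),   # termination state, non-zero reward
]

def correct_inflow(predicted_inflow, transition_value=1.0):
    # placeholder: the real correction uses the new transition value computed from the
    # first/second information and the M candidate records of the offline data set
    return predicted_inflow * transition_value

def correct_reward(reward, scale=1.0):
    # placeholder for the reward correction of this embodiment
    return reward * scale

for prev_s, s, reward in offline_batch:              # iterate over the data sets
    predicted_outflow = model_to_train(s)            # occurrence probabilities of actions leaving s
    predicted_inflow = model_to_train(prev_s)        # occurrence probabilities of actions entering s
    corrected_inflow = correct_inflow(predicted_inflow)
    corrected_reward = correct_reward(reward)

    # flow-matching-style loss (assumed form): inflow should balance reward + outflow
    loss = (corrected_inflow.sum() - corrected_reward - predicted_outflow.sum()) ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                  # repeat over batches until the target loss converges
```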
In this embodiment, when the model to be trained needs to be trained, first information of the agent may be obtained from a preset offline data set, where the first information is used to indicate that the agent is in a target state. Then, the first information can be input into the model to be trained, so that the first information is processed by the model to be trained to obtain the occurrence probability of a first action of the agent, where the first action is used to make the agent enter a next state of the target state from the target state. Finally, the model to be trained can be trained based on the occurrence probability of the first action of the agent, so as to obtain the generated flow model. In the foregoing process, when the model to be trained obtains the occurrence probability of the first action of the agent, the difference between the occurrence probability of the first action and the true occurrence probability of the first action lies within a preset range. The occurrence probability of the first action can be regarded as the predicted action policy of the model to be trained for the target state, and the true occurrence probability of the first action can be regarded as the true action policy for the target state recorded in the offline data set, so the predicted action policy for the target state fits the true action policy for the target state as closely as possible. Since the true action policy for the target state determines the true probability with which the agent enters each next state of the target state from the target state, the model to be trained can not only learn to reach each next state of the target state, but what it learns also conforms to the actual environment in which the agent is located (because the data in the offline data set are all set in advance based on the actual environment in which the agent is located). Therefore, the generated flow model obtained by training in this offline training manner can have better performance.
Further, in the process of training the model to be trained by using the offline training manner provided by the embodiment of the present application, the transition value of the target state is corrected to obtain a new transition value for the target state, and the occurrence probability of the second action of the agent is then corrected based on this new transition value, so that the loss for the target state constructed based on the corrected occurrence probability is more accurate, which helps to further improve the performance of the generated flow model.
Further, in the process of training the model to be trained by using the offline training manner provided by the embodiment of the present application, the reward value corresponding to the target state is corrected, so as to obtain the corrected reward value corresponding to the target state; the loss for the target state constructed based on the corrected reward value is then more accurate, and training the generated flow model based on this loss can further improve the performance of the generated flow model.
For a further understanding of the model training method provided in the embodiments of the present application, the method is further described below in conjunction with fig. 5b. Fig. 5b is a schematic diagram illustrating an application of the model training method provided in the embodiment of the present application. As shown in fig. 5b, an offline data set may be constructed by a third party (for example, a developer of the agent). First, a plurality of questions (that is, information for indicating a plurality of states), a corresponding plurality of answers (answer, that is, a plurality of actions) and labels of the plurality of answers (label, which can be regarded as the true occurrence probabilities of the plurality of answers, that is, the true occurrence probabilities of the plurality of actions) are collected, and reward values (reward) of the plurality of questions are constructed based on the labels of the plurality of answers, so that this information can form the offline data set.
For the generated flow model to be trained (GFlowNet), the plurality of questions in the offline data set may be input into the generated flow model to be trained, so as to obtain the predicted occurrence probabilities of the corresponding plurality of answers. In this way, a target loss (loss) may be calculated based on the reward values, the true occurrence probabilities and the predicted occurrence probabilities, and the parameters of the generated flow model may be updated with the target loss. This training process is executed in a loop a number of times until the loss converges, and the trained generated flow model is obtained. The trained generated flow model has intelligent question-answering capabilities, and can be used either directly as an automatic question-answering model or as part of a large automatic question-answering model (e.g., ChatGPT model, etc.), which can refine and enhance the functionality of these automatic question-answering models (e.g., enhance the diversity of answers in the ChatGPT model, etc.).
The foregoing is a detailed description of the model training method provided in the embodiments of the present application, and the motion prediction method provided in the embodiments of the present application will be described below. It should be noted that, the motion prediction method provided in the embodiment of the present application may be applied to various scenarios, where concepts such as an agent, a motion, and a state involved in the method are changed along with a change of an application scenario, for example, in an automatic driving scenario, a vehicle may predict a next driving motion of the vehicle according to a driving state of the vehicle in a traffic environment, and execute the driving motion, so as to change the driving state of the vehicle in the traffic environment. As another example, in a supply chain scenario, a certain robot may predict its own next transport direction and advance in these directions according to its own current transport state in the shop, so as to change its own transport state in the shop. In another example, in an advertisement recommendation scenario, an advertiser may predict a switch between advertisement contents according to advertisement contents that itself is currently recommended to a user and perform the switch to change advertisement contents that itself is recommended to the user. In another example, in a game scenario, a game player may predict its own next operation and perform the operation according to its own competitive state in the virtual game environment, so as to change its own competitive state in the virtual game environment, and so on. As another example, in an intelligent question-and-answer scenario, a machine may predict its own next operation based on its current dialog with the user and perform that operation to generate the next dialog with the user, and so on. Fig. 6 is a schematic flow chart of a motion prediction method provided in an embodiment of the present application, as shown in fig. 6, where the method is performed by an agent, and the agent has the generated flow model trained in fig. 5 built therein, and the method includes:
601. And acquiring information of the intelligent agent, wherein the information is used for indicating that the intelligent agent is in a target state.
In this embodiment, it is assumed that an agent exists in the environment, and the agent continuously interacts with the environment, i.e., performs various actions in the environment, thereby changing the state of the agent in the environment. It should be noted that, the agent may predict its own action by itself and perform the action, thereby continuously changing the state of the agent in the environment.
When the agent is currently in the target state, in order to predict the action that makes the agent enter a next state of the target state from the target state, the agent can collect information for indicating that the agent is in the target state. This information may be an image shot by the agent through a camera, a video shot by the agent through a camera, audio collected by the agent through a microphone, a text generated by the agent, or the like.
602. And processing the information through the generated flow model to obtain the occurrence probability of the action of the agent, wherein the action is used for enabling the agent to enter a next state of the target state from the target state.
After obtaining the information of the agent, the agent may input the information of the agent into a generated flow model provided in the agent, so as to process the information of the agent (for example, a series of feature extraction processes, etc.) through the generated flow model, thereby predicting the occurrence probability of one or more actions of the agent.
Any action predicted by the agent is used for enabling the agent to enter a state next to the target state in the environment. Thus, the agent completes the action prediction for the target state.
In this way, the agent can select the action with the highest occurrence probability from the predicted one or more actions, and execute that action so as to enter a certain next state of the target state. Of course, the agent may also select the action with the highest occurrence probability from among the predicted one or more actions and one or more preset actions (the occurrence probabilities of these preset actions are derived from a data set constructed empirically, for example by an expert; a preset action may be an action predicted by the agent earlier or an action set in the agent in advance by a person, which is not limited here), and execute that action so as to enter a certain next state of the target state. Thus, the agent completes the execution of the action for the target state.
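A small sketch of this selection step is given below; the action names and probabilities are made-up numbers, and merging the predicted actions with the preset actions by simply taking the single largest occurrence probability is an assumption for illustration.

```python
predicted_actions = {"turn_left": 0.55, "go_straight": 0.30, "turn_right": 0.15}
preset_actions = {"brake": 0.40}                 # e.g. from an expert-constructed data set

candidates = {**predicted_actions, **preset_actions}
action_to_execute = max(candidates, key=candidates.get)
print(action_to_execute)                         # the agent executes this action next
```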
For example, as shown in fig. 7 (fig. 7 is another schematic structural diagram of a generated flow model provided in the embodiment of the present application), after the vehicle 1 inputs a photo taken by itself into the generated flow model of the vehicle 1, the generated flow model can determine, based on the photo, that the vehicle 1 is at the initial state node S0 (that is, the vehicle 1 is in the initial state S0). The generated flow model can then process the photo to obtain the flow of the action a1 of the vehicle 1 (which may also be referred to as the occurrence probability of the action a1) and the flow of the action a2 of the vehicle 1.
The action a1 of the vehicle 1 is used to make the vehicle 1 directly enter the intermediate state node S1 from the initial state node S0 (may also be referred to as that the action a1 of the vehicle 1 flows out from the initial state node S0 and flows into the intermediate state node S1), and the action a2 of the vehicle 1 is used to make the vehicle 1 directly enter the intermediate state node S2 from the initial state node S0. The sum of the flow rate of the operation a1 of the vehicle 1 and the flow rate of the operation a2 of the vehicle 1 is the output flow rate of the initial state node S0, the flow rate of the operation a1 of the vehicle 1 is the input flow rate of the intermediate state node S1, and the flow rate of the operation a2 of the vehicle 1 is the input flow rate of the intermediate state node S2. If the flow rate of the operation a1 of the vehicle 1 is greater than the flow rate of the operation a2 of the vehicle 1, the vehicle 1 may select the operation a1 and execute the operation a1. It can be seen that the vehicle 1 is now in the intermediate state S1.
After the vehicle 1 is in the intermediate state S1, the vehicle 1 may take a new photo indicating that the vehicle 1 is in the intermediate state S1. Then, the vehicle 1 inputs a new photograph into the generated flow model of the vehicle 1, and the generated flow model can process the photograph to obtain the flow rate of the operation a3 of the vehicle 1 (may also be referred to as the occurrence probability of the operation a3 of the vehicle 1).
Wherein the action a3 of the vehicle 1 is for letting the vehicle 1 enter the intermediate state node S3 directly from the intermediate state node S1. The flow rate of the operation a3 of the vehicle 1 is the output flow rate of the intermediate state node S1, and the flow rate of the operation a3 of the vehicle 1 is the input flow rate of the intermediate state node S3. Since the flow rate of the action a3 of the vehicle 1 is maximum, the vehicle 1 may perform the action a3 of the vehicle 1. It can be seen that the vehicle 1 is now in the intermediate state S3.
And so on, until the vehicle 1 is at a certain termination state node. For example, the vehicle 1 continues to perform state transitions (that is, performs action prediction and action execution) from the intermediate state node S3, passes through the intermediate state node S7 and the intermediate state node S10, and finally reaches the termination state node S13, after which no further state transition is performed.
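The repeated predict-then-act process of this example can be sketched as a rollout loop; the environment interface (observe / step / is_terminal) and the callable generated flow model stand in for the vehicle's camera, the built-in model and the driving actions, and are assumptions for illustration.

```python
def rollout(env, generated_flow_model, max_steps=100):
    """Drive the agent from its current state until a termination state is reached."""
    for _ in range(max_steps):
        if env.is_terminal():
            break                                      # e.g. the vehicle reached node S13
        info = env.observe()                           # photo / video / audio / text of the current state
        action_probs = generated_flow_model(info)      # occurrence probability of each candidate action
        action = max(action_probs, key=action_probs.get)
        env.step(action)                               # executing the action changes the agent's state
```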
In this embodiment of the present application, when the agent is currently in the target state, in order to predict the action that makes the agent enter a next state of the target state from the target state, the agent can collect information for indicating that the agent is in the target state. After obtaining this information, the agent can input it into the generated flow model, so that the information is processed by the generated flow model to obtain the occurrence probabilities of one or more actions of the agent. Then, the agent can select the action with the highest occurrence probability among the one or more actions, and execute that action so as to enter a certain next state of the target state. In this process, the generated flow model built into the agent enables the agent to accurately complete the transitions between different states based on action prediction and action execution.
The model training method and the motion prediction method provided in the embodiments of the present application are described in detail above, and the model training apparatus and the motion prediction apparatus provided in the embodiments of the present application will be described below. Fig. 8 is a schematic structural diagram of a model training device according to an embodiment of the present application, as shown in fig. 8, where the device includes:
an obtaining module 801, configured to obtain first information of an agent from a preset offline data set, where the first information is used to indicate that the agent is in a target state;
the processing module 802 is configured to process the first information through a model to be trained to obtain an occurrence probability of a first action of the agent, where the first action is used to enable the agent to enter a next state of the target state from the target state;
the training module 803 is configured to train the model to be trained based on the occurrence probability of the first action and the true occurrence probability of the first action, so as to obtain a generated flow model, where the true occurrence probability is derived from the offline data set.
In this embodiment, when the model to be trained needs to be trained, first information of the agent may be obtained from a preset offline data set, where the first information is used to indicate that the agent is in a target state. Then, the first information can be input into the model to be trained, so that the first information is processed by the model to be trained to obtain the occurrence probability of a first action of the agent, where the first action is used to make the agent enter a next state of the target state from the target state. Finally, the model to be trained can be trained based on the occurrence probability of the first action of the agent and the true occurrence probability of the first action, so as to obtain a generated flow model, where the true occurrence probability is derived from the offline data set. In the foregoing process, the occurrence probability of the first action of the agent can be regarded as the predicted action policy of the model to be trained for the target state, and the true occurrence probability of the first action can be regarded as the true action policy for the target state recorded in the offline data set, so the predicted action policy for the target state fits the true action policy for the target state as closely as possible. Since the true action policy for the target state determines the true probability with which the agent enters each next state of the target state from the target state, the model to be trained can not only learn to reach each next state of the target state, but what it learns also conforms to the actual environment in which the agent is located (because the data in the offline data set are all set in advance based on the actual environment in which the agent is located). Therefore, the generated flow model obtained by training in this offline training manner can have better performance.
In one possible implementation, the training module 803 is configured to train the model to be trained based on the occurrence probability of the first action, so that a difference between the occurrence probability of the first action and the true occurrence probability of the first action is within a preset range, to obtain the generated flow model.
In one possible implementation, training module 803 is configured to: correct the occurrence probability of a second action of the agent based on the offline data set to obtain the corrected occurrence probability of the second action, where the second action is used to make the agent enter the target state from a previous state of the target state; correct the reward value corresponding to the target state based on the offline data set to obtain the corrected reward value corresponding to the target state; and train the model to be trained based on the occurrence probability of the first action, the corrected occurrence probability of the second action and the corrected reward value corresponding to the target state, so as to obtain the generated flow model.
In one possible implementation, the offline data set includes M first candidate information and M second candidate information, the ith first candidate information being used to indicate that the agent is in an ith candidate state, the ith second candidate information being used to indicate that the agent is in a previous state to the ith candidate state, the M first candidate information including first information, the M second candidate information including second information, the second information being used to indicate that the agent is in a previous state to the target state, the M candidate states including the target state, M being greater than or equal to 1; the training module 803 is configured to correct the occurrence probability of the second action of the agent based on the first information, the second information, the M first candidate information, and the M second candidate information, and obtain a corrected occurrence probability of the second action.
In one possible implementation manner, the offline data set further includes reward values corresponding to M candidate states, and the training module 803 is configured to correct the reward value corresponding to the target state based on the first information, the M first candidate information, and the reward value corresponding to the M candidate states, to obtain a corrected reward value corresponding to the target state.
In one possible implementation, the first information is information collected when the agent is in a target state, where the information includes at least one of: image, video, audio or text.
Fig. 9a is a schematic structural diagram of an action prediction device according to an embodiment of the present application, where, as shown in fig. 9a, the device includes a generated flow model trained by the model training device, and the device includes:
the acquiring module 901 is configured to acquire information of an agent, where the information is used to indicate that the agent is in a target state;
the processing module 902 is configured to process the information through the generated flow model to obtain the occurrence probability of an action of the agent, where the action is used to enable the agent to enter a next state of the target state from the target state.
In this embodiment of the present application, when the agent is currently in the target state, in order to predict the action that makes the agent enter a next state of the target state from the target state, the agent can collect information for indicating that the agent is in the target state. After obtaining this information, the agent can input it into the generated flow model, so that the information is processed by the generated flow model to obtain the occurrence probabilities of one or more actions of the agent. Then, the agent can select the action with the highest occurrence probability among the one or more actions, and execute that action so as to enter a certain next state of the target state. In this process, the generated flow model built into the agent enables the agent to accurately complete the transitions between different states based on action prediction and action execution.
Fig. 9b is another schematic structural diagram of an action prediction device according to an embodiment of the present application, where, as shown in fig. 9b, the device includes a generated flow model obtained by training by the model training device, and the device includes:
the acquiring module 901 is configured to acquire information of an agent, where the information is used to indicate that the agent is in a target state;
the processing module 902 is configured to process the information through the generated flow model to obtain the occurrence probability of an action of the agent, where the action is used to enable the agent to enter a next state of the target state from the target state;
and a determining module 903, configured to determine the action to be performed based on the occurrence probability of the action and the occurrence probability of a preset action.
In this embodiment of the present application, when the agent is currently in the target state, in order to predict the action that makes the agent enter a next state of the target state from the target state, the agent can collect information for indicating that the agent is in the target state. After obtaining this information, the agent can input it into the generated flow model, so that the information is processed by the generated flow model to obtain the occurrence probabilities of one or more actions of the agent. Then, the agent can select the action with the highest occurrence probability among the one or more actions, and execute that action so as to enter a certain next state of the target state. In this process, the generated flow model built into the agent enables the agent to accurately complete the transitions between different states based on action prediction and action execution.
In one possible implementation, the determining module 903 is configured to: determine, from among the actions predicted by the generated flow model and the preset actions, the action with the largest occurrence probability as the action to be executed.
It should be noted that, because the content of information interaction and execution process between the modules/units of the above-mentioned apparatus is based on the same concept as the method embodiment of the present application, the technical effects brought by the content are the same as the method embodiment of the present application, and specific content may refer to the description in the foregoing illustrated method embodiment of the present application, which is not repeated herein.
The embodiment of the application also relates to an execution device, and fig. 10 is a schematic structural diagram of the execution device provided in the embodiment of the application. As shown in fig. 10, the execution device 1000 may be embodied as a mobile phone, a tablet, a notebook, a smart wearable device, a server, etc., which is not limited herein. The execution device 1000 may be configured with the motion prediction apparatus described in the corresponding embodiment of fig. 9, to implement the motion prediction function in the corresponding embodiment of fig. 6. Specifically, the execution apparatus 1000 includes: receiver 1001, transmitter 1002, processor 1003, and memory 1004 (where the number of processors 1003 in execution device 1000 may be one or more, one processor is exemplified in fig. 10), where processor 1003 may include application processor 10031 and communication processor 10032. In some embodiments of the present application, the receiver 1001, transmitter 1002, processor 1003, and memory 1004 may be connected by a bus or other means.
Memory 1004 may include read only memory and random access memory and provide instructions and data to processor 1003. A portion of the memory 1004 may also include non-volatile random access memory (non-volatile random access memory, NVRAM). The memory 1004 stores a processor and operating instructions, executable modules or data structures, or a subset thereof, or an extended set thereof, wherein the operating instructions may include various operating instructions for performing various operations.
The processor 1003 controls the operation of the execution device. In a specific application, the individual components of the execution device are coupled together by a bus system, which may include, in addition to a data bus, a power bus, a control bus, a status signal bus, etc. For clarity of illustration, however, the various buses are referred to in the figures as bus systems.
The methods disclosed in the embodiments of the present application may be applied to the processor 1003 or implemented by the processor 1003. The processor 1003 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuitry of hardware in the processor 1003 or instructions in the form of software. The processor 1003 may be a general purpose processor, digital signal processor (digital signal processing, DSP), microprocessor or microcontroller, and may further include an application specific integrated circuit (application specific integrated circuit, ASIC), field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components. The processor 1003 may implement or execute the methods, steps and logical blocks disclosed in the embodiments of the present application. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with the embodiments of the present application may be embodied directly in hardware, in a decoded processor, or in a combination of hardware and software modules in a decoded processor. The software modules may be located in a random access memory, flash memory, read only memory, programmable read only memory, or electrically erasable programmable memory, registers, etc. as well known in the art. The storage medium is located in a memory 1004, and the processor 1003 reads information in the memory 1004 and performs the steps of the method in combination with its hardware.
The receiver 1001 may be used to receive input numeric or character information and to generate signal inputs related to performing relevant settings and function control of the device. The transmitter 1002 may be configured to output numeric or character information via a first interface; the transmitter 1002 may also be configured to send instructions to the disk stack via the first interface to modify data in the disk stack; the transmitter 1002 may also include a display device such as a display screen.
In this embodiment, in one case, the processor 1003 is configured to predict the action of the agent through the generated flow model, as in the corresponding embodiment of fig. 6.
The embodiment of the application also relates to training equipment, and fig. 11 is a schematic structural diagram of the training equipment provided by the embodiment of the application. As shown in FIG. 11, the training device 1100 is implemented by one or more servers, the training device 1100 may vary considerably in configuration or performance, and may include one or more central processing units (central processing units, CPU) 1114 (e.g., one or more processors) and memory 1132, one or more storage mediums 1130 (e.g., one or more mass storage devices) that store applications 1142 or data 1144. Wherein the memory 1132 and the storage medium 1130 may be transitory or persistent. The program stored on the storage medium 1130 may include one or more modules (not shown), each of which may include a series of instruction operations on the training device. Still further, central processor 1114 may be configured to communicate with storage medium 1130 to perform a series of instruction operations in storage medium 1130 on training device 1100.
The training device 1100 may also include one or more power supplies 1126, one or more wired or wireless network interfaces 1150, one or more input-output interfaces 1158; or one or more operating systems 1141, such as Windows ServerTM, mac OS XTM, unixTM, linuxTM, freeBSDTM, etc.
Specifically, the training apparatus may perform the model training method in the corresponding embodiment of fig. 5.
The embodiments of the present application also relate to a computer storage medium in which a program for performing signal processing is stored, which when run on a computer causes the computer to perform the steps as performed by the aforementioned performing device or causes the computer to perform the steps as performed by the aforementioned training device.
Embodiments of the present application also relate to a computer program product storing instructions that, when executed by a computer, cause the computer to perform steps as performed by the aforementioned performing device or cause the computer to perform steps as performed by the aforementioned training device.
The execution device, training device or terminal device provided in the embodiment of the present application may specifically be a chip, where the chip includes: a processing unit, which may be, for example, a processor, and a communication unit, which may be, for example, an input/output interface, pins or circuitry, etc. The processing unit may execute the computer-executable instructions stored in the storage unit to cause the chip in the execution device to perform the data processing method described in the above embodiment, or to cause the chip in the training device to perform the data processing method described in the above embodiment. Optionally, the storage unit is a storage unit in the chip, such as a register, a cache, etc., and the storage unit may also be a storage unit in the wireless access device side located outside the chip, such as a read-only memory (ROM) or other type of static storage device that may store static information and instructions, a random access memory (random access memory, RAM), etc.
Specifically, referring to fig. 12, fig. 12 is a schematic structural diagram of a chip provided in an embodiment of the present application, where the chip may be represented as a neural network processor NPU 1200, and the NPU 1200 is mounted as a coprocessor on a main CPU (Host CPU), and the Host CPU distributes tasks. The core part of the NPU is an operation circuit 1203, and the operation circuit 1203 is controlled by the controller 1204 to extract matrix data in the memory and perform multiplication operation.
In some implementations, the operation circuit 1203 internally includes a plurality of processing units (PEs). In some implementations, the operational circuit 1203 is a two-dimensional systolic array. The operation circuit 1203 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the operation circuit 1203 is a general purpose matrix processor.
For example, assume that there is an input matrix a, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to the matrix B from the weight memory 1202 and buffers the data on each PE in the arithmetic circuit. The arithmetic circuit takes matrix a data from the input memory 1201 and performs matrix operation with matrix B, and the obtained partial result or final result of the matrix is stored in an accumulator (accumulator) 1208.
The unified memory 1206 is used to store input data and output data. Weight data is transferred directly into the weight memory 1202 through the direct memory access controller (DMAC) 1205. Input data is also carried into the unified memory 1206 through the DMAC.
The bus interface unit (BIU) 1213 is used for the AXI bus to interact with the DMAC and the instruction fetch buffer (IFB) 1209. Specifically, the bus interface unit 1213 is used by the instruction fetch memory 1209 to obtain instructions from the external memory, and is also used by the memory unit access controller 1205 to obtain the raw data of the input matrix A or the weight matrix B from the external memory.
The DMAC is mainly used to transfer input data from the external memory DDR to the unified memory 1206, to transfer weight data to the weight memory 1202, or to transfer input data to the input memory 1201.
The vector calculation unit 1207 includes a plurality of operation processing units and, when needed, further processes the output of the operation circuit 1203 with operations such as vector multiplication, vector addition, exponentiation, logarithm, and magnitude comparison. It is mainly used for computations of the layers other than the convolutional/fully-connected layers in the neural network, such as batch normalization, pixel-level summation, and up-sampling of a predicted label plane.
In some implementations, the vector calculation unit 1207 can store the processed output vector to the unified memory 1206. For example, the vector calculation unit 1207 may apply a linear function or a nonlinear function to the output of the operation circuit 1203, for example, perform linear interpolation on a predicted label plane extracted by a convolutional layer, or apply a nonlinear function to a vector of accumulated values to generate an activation value. In some implementations, the vector calculation unit 1207 generates normalized values, pixel-level summed values, or both. In some implementations, the processed output vector can be used as an activation input to the operation circuit 1203, for example for use in a subsequent layer of the neural network.
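To make the role of the vector calculation unit concrete, the following NumPy sketch reproduces two of the listed post-processing steps, batch normalization followed by an activation, and pixel-level summation; the function names, tensor shapes, and the choice of ReLU are illustrative assumptions rather than part of the hardware described above.

import numpy as np

def batch_norm_activation(x: np.ndarray, gamma: np.ndarray, beta: np.ndarray,
                          eps: float = 1e-5) -> np.ndarray:
    """Normalize over the batch dimension, scale and shift, then apply a nonlinear function."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    y = gamma * (x - mean) / np.sqrt(var + eps) + beta
    return np.maximum(y, 0.0)   # example activation applied to the accumulated values

def pixel_level_sum(feature_maps: np.ndarray) -> np.ndarray:
    """Pixel-level summation across channels: shape (C, H, W) -> (H, W)."""
    return feature_maps.sum(axis=0)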
The instruction fetch buffer (instruction fetch memory) 1209 is connected to the controller 1204 and is used to store instructions used by the controller 1204.
The unified memory 1206, the input memory 1201, the weight memory 1202, and the instruction fetch memory 1209 are all on-chip memories, and the external memory is private to the NPU hardware architecture.
The processor mentioned in any of the above may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the above-mentioned programs.
It should be further noted that the apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solution of this embodiment. In addition, in the drawings of the apparatus embodiments provided by the present application, the connection relationships between the modules indicate that they have communication connections with each other, which may be specifically implemented as one or more communication buses or signal lines.
From the above description of the embodiments, it will be apparent to those skilled in the art that the present application may be implemented by means of software plus the necessary general-purpose hardware, or of course by dedicated hardware including application-specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components, and the like. Generally, any function performed by a computer program can easily be implemented by corresponding hardware, and the specific hardware structures used to implement the same function can vary, for example analog circuits, digital circuits, or dedicated circuits. For the present application, however, a software implementation is the preferred embodiment in many cases. Based on such an understanding, the part of the technical solution of the present application that in essence contributes to the prior art may be embodied in the form of a software product. The software product is stored in a readable storage medium, such as a floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk of a computer, and includes several instructions for causing a computer device (which may be a personal computer, a training device, a network device, or the like) to perform the methods described in the embodiments of the present application.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When software is used for implementation, the embodiments may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center by wired (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (for example, infrared, radio, or microwave) means. The computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a training device or a data center, that integrates one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state drive (SSD)), among others.

Claims (21)

1. A method of model training, the method comprising:
acquiring first information of an intelligent agent from a preset offline data set, wherein the first information is used for indicating that the intelligent agent is in a target state;
processing the first information through a model to be trained to obtain the occurrence probability of a first action of the intelligent agent, wherein the first action is used for enabling the intelligent agent to enter a next state of the target state from the target state;
training the model to be trained based on the occurrence probability of the first action and the true occurrence probability of the first action to obtain a generated flow model, wherein the true occurrence probability is derived from the offline data set.
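A minimal sketch of the training step recited in claim 1, assuming a PyTorch module whose output is the occurrence probability of the first action and a squared-error objective; the network architecture, tensor layout, and loss function are illustrative assumptions rather than the specific design claimed here.

import torch
import torch.nn as nn

def training_step(model: nn.Module, optimizer: torch.optim.Optimizer,
                  first_info: torch.Tensor, true_action_prob: torch.Tensor) -> float:
    """One offline training step: predict the occurrence probability of the first action
    from the first information and fit it to the true occurrence probability derived
    from the offline data set."""
    pred_action_prob = model(first_info)     # occurrence probability of the first action
    loss = nn.functional.mse_loss(pred_action_prob, true_action_prob)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()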
2. The method of claim 1, wherein the training the model to be trained based on the probability of occurrence of the first action and the probability of true occurrence of the first action to obtain a generated flow model comprises:
and training the model to be trained based on the occurrence probability of the first action, so that the difference between the occurrence probability of the first action and the true occurrence probability of the first action is within a preset range, thereby obtaining a generated flow model.
3. The method of claim 2, wherein training the model to be trained based on the probability of occurrence of the first action to obtain a generated flow model comprises:
correcting the occurrence probability of a second action of the intelligent agent based on the offline data set to obtain corrected occurrence probability of the second action, wherein the second action is used for enabling the intelligent agent to enter the target state from the previous state of the target state;
correcting the reward value corresponding to the target state based on the offline data set to obtain a corrected reward value corresponding to the target state;
and training the model to be trained based on the occurrence probability of the first action, the corrected occurrence probability of the second action and the corrected reward value corresponding to the target state to obtain a generated flow model.
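A hedged sketch of how the quantities in claims 3 to 5 might be combined, assuming a GFlowNet-style flow-matching objective in PyTorch; the support-masking form of the correction and the log-balance loss are illustrative assumptions, as the claims do not fix a particular correction rule or loss.

import torch

def correct_with_offline_support(probs: torch.Tensor, in_dataset_mask: torch.Tensor) -> torch.Tensor:
    """One possible reading of the corrected occurrence probability: keep only actions
    that actually appear in the offline data set and renormalize."""
    corrected = probs * in_dataset_mask
    return corrected / corrected.sum().clamp_min(1e-8)

def flow_balance_loss(prob_first_action: torch.Tensor,
                      corrected_prob_second_action: torch.Tensor,
                      corrected_reward: torch.Tensor,
                      eps: float = 1e-8) -> torch.Tensor:
    """Balance the flow into the target state (corrected second-action terms) against
    the flow out of it (first-action terms plus the corrected reward)."""
    inflow = corrected_prob_second_action.sum()
    outflow = prob_first_action.sum() + corrected_reward
    return (torch.log(inflow + eps) - torch.log(outflow + eps)) ** 2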
4. The method of claim 3, wherein the offline data set includes M pieces of first candidate information and M pieces of second candidate information, an i-th piece of first candidate information is used to indicate that the agent is in an i-th candidate state, an i-th piece of second candidate information is used to indicate that the agent is in a previous state of the i-th candidate state, the M pieces of first candidate information include the first information, the M pieces of second candidate information include second information, the second information is used to indicate that the agent is in the previous state of the target state, the M candidate states include the target state, and M ≥ 1;
wherein the correcting the occurrence probability of the second action of the intelligent agent based on the offline data set to obtain the corrected occurrence probability of the second action comprises:
correcting the occurrence probability of the second action of the intelligent agent based on the first information, the second information, the M pieces of first candidate information, and the M pieces of second candidate information, to obtain the corrected occurrence probability of the second action.
5. The method of claim 4, wherein the offline data set further includes reward values corresponding to the M candidate states, and the correcting the reward value corresponding to the target state based on the offline data set to obtain the corrected reward value corresponding to the target state comprises:
correcting the reward value corresponding to the target state based on the first information, the M pieces of first candidate information, and the reward values corresponding to the M candidate states, to obtain the corrected reward value corresponding to the target state.
6. The method of any one of claims 1 to 5, wherein the first information is information collected when the agent is in the target state, the information comprising at least one of: image, video, audio or text.
7. An action prediction method, wherein the method is implemented by using the generated flow model obtained by the method of any one of claims 1 to 6, and the method comprises:
acquiring information of an intelligent agent, wherein the information is used for indicating that the intelligent agent is in a target state;
and processing the information through the generated flow model to obtain the occurrence probability of the action of the intelligent agent, wherein the action is used for enabling the intelligent agent to enter the next state of the target state from the target state.
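A minimal sketch of the inference step in claim 7, assuming the trained generated flow model is a PyTorch module; the function and variable names are illustrative assumptions.

import torch

@torch.no_grad()
def predict_action_probability(gflow_model: torch.nn.Module, state_info: torch.Tensor) -> torch.Tensor:
    """Run the trained generated flow model on the state information to obtain the
    occurrence probability of the action leading to the next state."""
    gflow_model.eval()
    return gflow_model(state_info)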
8. An action prediction method, the method comprising:
acquiring information of an intelligent agent, wherein the information is used for indicating that the intelligent agent is in a target state;
processing the information through a model to be trained to obtain the occurrence probability of the action of the intelligent agent, wherein the action is used for enabling the intelligent agent to enter the next state of the target state from the target state;
and determining the action to be executed based on the occurrence probability of the action and the occurrence probability of the preset action.
9. The method of claim 8, wherein the determining the action to be executed based on the occurrence probability of the action and the occurrence probability of the preset action comprises:
determining, from among the action and the preset action, the action with the largest occurrence probability as the action to be executed.
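A minimal sketch of the selection rule in claims 8 and 9, with hypothetical names; it simply compares the occurrence probability produced for the model's action with that of the preset action and keeps whichever is larger.

def choose_action(predicted_action: str, predicted_prob: float,
                  preset_action: str, preset_prob: float) -> str:
    """Return whichever of the predicted action and the preset action has the larger
    occurrence probability, as the action to be executed."""
    return predicted_action if predicted_prob >= preset_prob else preset_action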
10. A model training apparatus, the apparatus comprising:
the acquisition module is used for acquiring first information of the intelligent agent from a preset offline data set, wherein the first information is used for indicating that the intelligent agent is in a target state;
the processing module is used for processing the first information through a model to be trained to obtain the occurrence probability of a first action of the intelligent agent, wherein the first action is used for enabling the intelligent agent to enter a next state of the target state from the target state;
the training module is used for training the model to be trained based on the occurrence probability of the first action and the real occurrence probability of the first action to obtain a generated flow model, and the real occurrence probability is derived from the offline data set.
11. The apparatus of claim 10, wherein the training module is configured to train the model to be trained based on the occurrence probability of the first action such that a difference between the occurrence probability of the first action and the true occurrence probability of the first action is within a preset range, resulting in a generated flow model.
12. The apparatus of claim 11, wherein the training module is configured to:
correcting the occurrence probability of a second action of the intelligent agent based on the offline data set to obtain corrected occurrence probability of the second action, wherein the second action is used for enabling the intelligent agent to enter the target state from the previous state of the target state;
correcting the reward value corresponding to the target state based on the offline data set to obtain a corrected reward value corresponding to the target state;
and training the model to be trained based on the occurrence probability of the first action, the corrected occurrence probability of the second action, and the corrected reward value corresponding to the target state, to obtain a generated flow model.
13. The apparatus of claim 12, wherein the offline data set includes M pieces of first candidate information and M pieces of second candidate information, an i-th piece of first candidate information is used to indicate that the agent is in an i-th candidate state, an i-th piece of second candidate information is used to indicate that the agent is in a previous state of the i-th candidate state, the M pieces of first candidate information include the first information, the M pieces of second candidate information include second information, the second information is used to indicate that the agent is in the previous state of the target state, the M candidate states include the target state, and M ≥ 1;
The training module is configured to correct the occurrence probability of the second action of the agent based on the first information, the second information, the M first candidate information, and the M second candidate information, and obtain the corrected occurrence probability of the second action.
14. The apparatus of claim 13, wherein the offline data set further includes reward values corresponding to the M candidate states, and the training module is configured to correct the reward value corresponding to the target state based on the first information, the M pieces of first candidate information, and the reward values corresponding to the M candidate states, to obtain the corrected reward value corresponding to the target state.
15. The apparatus of any one of claims 10 to 14, wherein the first information is information collected when the agent is in the target state, the information comprising at least one of: image, video, audio or text.
16. An action prediction device, wherein the device comprises the generated flow model obtained by the model training apparatus according to any one of claims 10 to 15, and the device comprises:
the acquisition module is used for acquiring information of the intelligent agent, wherein the information is used for indicating that the intelligent agent is in a target state;
the processing module is used for processing the information through the generated flow model to obtain the occurrence probability of the action of the intelligent agent, wherein the action is used for enabling the intelligent agent to enter the next state of the target state from the target state.
17. An action prediction device, the device comprising:
the acquisition module is used for acquiring information of the intelligent agent, wherein the information is used for indicating that the intelligent agent is in a target state;
the processing module is used for processing the information through a model to be trained to obtain the occurrence probability of the action of the intelligent agent, wherein the action is used for enabling the intelligent agent to enter the next state of the target state from the target state;
and the determining module is used for determining the action to be executed based on the occurrence probability of the action and the occurrence probability of the preset action.
18. A model training apparatus, wherein the apparatus comprises a memory and a processor, the memory stores code, and the processor is configured to execute the code; when the code is executed, the model training apparatus performs the method of any one of claims 1 to 6.
19. An action prediction device, wherein the device comprises a memory and a processor, the memory stores code, and the processor is configured to execute the code; when the code is executed, the action prediction device performs the method of any one of claims 7 to 9.
20. A computer storage medium storing one or more instructions which, when executed by one or more computers, cause the one or more computers to implement the method of any one of claims 1 to 9.
21. A computer program product, characterized in that it stores instructions that, when executed by a computer, cause the computer to implement the method of any one of claims 1 to 9.
CN202310224144.4A 2023-02-28 2023-02-28 A kind of model training method and related equipment Pending CN116306771A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310224144.4A CN116306771A (en) 2023-02-28 2023-02-28 A kind of model training method and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310224144.4A CN116306771A (en) 2023-02-28 2023-02-28 A kind of model training method and related equipment

Publications (1)

Publication Number Publication Date
CN116306771A true CN116306771A (en) 2023-06-23

Family

ID=86791968

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310224144.4A Pending CN116306771A (en) 2023-02-28 2023-02-28 A kind of model training method and related equipment

Country Status (1)

Country Link
CN (1) CN116306771A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117112767A (en) * 2023-09-20 2023-11-24 北京金堤科技有限公司 Question and answer result generation method and commercial query large model training method and device
CN118865409A (en) * 2024-09-23 2024-10-29 山东浪潮科学研究院有限公司 A method and device for dynamically generating a multimodal hybrid expert model

Similar Documents

Publication Publication Date Title
CN113065633B (en) A model training method and related equipment
CN114997412A (en) Recommendation method, training method and device
WO2022068623A1 (en) Model training method and related device
CN114169393B (en) Image classification method and related equipment thereof
CN115293227A (en) Model training method and related equipment
US20250225405A1 (en) Action prediction method and related device therefor
CN117173626A (en) A target detection method and related equipment
CN116312489A (en) A kind of model training method and related equipment
CN116306771A (en) A kind of model training method and related equipment
CN115292583A (en) Project recommendation method and related equipment thereof
CN114707643A (en) Model segmentation method and related equipment thereof
CN117061733A (en) A video evaluation method and related equipment
CN116611861A (en) Consumption prediction method and related equipment thereof
CN116739154A (en) A fault prediction method and related equipment
CN117011620A (en) A target detection method and related equipment
CN118262380A (en) A model training method and related equipment
CN114841361A (en) Model training method and related equipment thereof
WO2025031343A1 (en) Image processing method and related device
WO2024235107A1 (en) Object model rotation method and related device thereof
WO2024239927A1 (en) Model training method and related device
CN117852603A (en) A method for obtaining task information and related equipment
CN117056589A (en) An item recommendation method and related equipment
WO2023045949A1 (en) Model training method and related device
CN113065638B (en) A neural network compression method and related equipment
CN117217982A (en) A visual task processing method and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination