Disclosure of Invention
The invention provides a multi-agent-based recommendation method and device, which derive the final recommendation values from both the individual (agent) and the group (agent set) perspectives, so that more objective and accurate recommendations are achieved.
In a first aspect, the present disclosure provides a multi-agent based recommendation method, comprising:
acquiring historical state information of the current agent under the condition that the current input information of the current agent is determined, wherein the historical state information comprises the historical input information of the current agent and list information related to other agents;
inputting the historical state information of the current agent and the current input information into a policy network, and generating a first real value of the current agent and first real values of other agents;
processing the first real value of the current agent and the first real values of other agents to obtain a feedback value;
inputting the feedback value of the current agent, the feedback values of the other agents and a preset discount factor into an evaluation network, and outputting evaluation value vectors corresponding to the current agent and the other agents;
and inputting the evaluation value vector into a policy network, outputting the final recommended value of other agents relative to the current agent, and determining a recommended value agent set.
According to the multi-agent-based recommendation method provided by the present disclosure, the processing based on the first real values of the current agent and the first real values of the other agents to obtain the feedback value includes:
based on the historical state information and the current input information of the current agent, obtaining the initial recommended values of the other agents corresponding to the input information of each time relative to the current agent through the policy network;
labeling a sample label for each input information, wherein the label value of the sample label is 1 or 0;
sampling a first real value of the current agent to obtain a sampling value;
updating the feedback value based on the sampling value and the tag value.
According to the multi-agent-based recommendation method provided by the present disclosure, the inputting the feedback value of the current agent, the feedback values of the other agents and a preset discount factor into an evaluation network, and outputting evaluation value vectors corresponding to the current agent and the other agents includes:
inputting the feedback value of the current agent, the feedback values of the other agents and a preset discount factor into an evaluation network, and outputting evaluation values corresponding to the current agent and the other agents;
and evaluating the evaluation value through an evaluation network, and outputting evaluation value vectors corresponding to the current agent and the other agents.
According to the multi-agent-based recommendation method provided by the present disclosure, the inputting the evaluation value vector into a policy network and outputting the final recommendation values of other agents relative to the current agent includes:
inputting the evaluation value vector into a strategy network to obtain an updated strategy network;
generating a second true value of the current agent and second true values of other agents based on the updated policy network, the historical state information of the current agent, and the current input information;
determining the weight values of other agents relative to the current agent based on a preset agent utility matrix; wherein the agent utility matrix comprises a weight value of each agent;
and outputting the final recommended value of the other agents relative to the current agent based on the second real value of the current agent, the second real values of the other agents and the weight value.
According to the multi-agent based recommendation method provided by the present disclosure, the determining a recommendation value agent set comprises:
comparing the recommended values of the other agents relative to the current agent with a preset threshold value;
if the recommendation value is larger than or equal to a preset threshold value, recommending a recommendation value agent set for the current agent, wherein the recommendation value agent set is the set of agents related to the input information and the historical state information of the current agent;
and if the recommended value is smaller than the preset threshold value, randomly recommending the agent for the current agent.
According to the multi-agent-based recommendation method provided by the present disclosure, updating the feedback value based on the sampling value and the tag value is realized by the following formula:
where p_t denotes the number of historical positive samples in the historical state information, n_t denotes the number of historical negative samples in the historical state information, G_t denotes the tag value at the t-th time, and o_{t,i} denotes the recommended value of agent i at the t-th time.
According to a multi-agent based recommendation method provided by the present disclosure, the method further comprises:
updating a weight value in the agent utility matrix based on a first loss function and a second loss function; wherein the first loss function is:
the second loss function is:
where p_t denotes the number of historical positive samples in the historical state information, n_t denotes the number of historical negative samples in the historical state information, G_t denotes the tag value at the t-th time, o_t denotes the recommended value, μ denotes the policy of each agent, i and j denote agents, s_ij denotes the similarity between the state of agent i and the state of agent j, the agent utility matrix is decomposed into two small matrices, denoted A and B, of sizes N×d and d×N respectively, with d less than N, a_i is the i-th row of matrix A, a_j is the j-th row of matrix A, b_i is the i-th column of matrix B, and b_j is the j-th column of matrix B.
In a second aspect, the present disclosure provides a multi-agent based recommendation device comprising:
the acquisition module is used for acquiring historical state information of a current intelligent agent under the condition that current input information of the current intelligent agent is determined, wherein the historical state information comprises historical input information of the current intelligent agent and list information related to other intelligent agents;
the generating module is used for inputting the historical state information and the current input information of the current intelligent agent into the strategy network and generating a first true value of the current intelligent agent and first true values of other intelligent agents;
the processing module is used for processing the first real value of the current agent and the first real values of other agents to obtain a feedback value;
the input module is used for inputting the feedback value of the current agent, the feedback values of the other agents and a preset discount factor into an evaluation network and outputting evaluation value vectors corresponding to the current agent and the other agents;
and the determining module is used for inputting the evaluation value vector into a policy network, outputting the final recommended value of other intelligent agents relative to the current intelligent agent and determining a recommended value intelligent agent set.
In a third aspect, the present disclosure provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the multi-agent based recommendation method according to any of the above when executing the program.
In a fourth aspect, the present disclosure provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the multi-agent based recommendation method as in any one of the above.
In a fifth aspect, the present disclosure provides a computer program product comprising a computer program which, when executed by a processor, performs the steps of the multi-agent based recommendation method as defined in any one of the above.
The present disclosure provides a multi-agent based recommendation method and apparatus, by determining current input information and historical state information of a current agent, wherein the historical state information includes historical input information of the current agent and list information related to other agents; inputting the historical state information of the current agent and the current input information into a policy network, and generating a first real value of the current agent and first real values of other agents; processing the first real value of the current agent and the first real values of other agents to obtain a feedback value; inputting the feedback value of the current agent, the feedback values of other agents and a preset discount factor into an evaluation network, and outputting evaluation value vectors corresponding to the current agent and other agents; and inputting the evaluation value vector into a policy network, outputting the final recommended value of other agents relative to the current agent, and determining a recommended value agent set. The feedback value is obtained through the first real value of the current agent and the first real values of other agents, the evaluation value vector is input into the strategy network, the final recommended values of other agents relative to the current agent are output, and the agent recommended for the current agent is determined in other agents, so that the final recommended values are obtained from individual (agent) and group (agent set) consideration, and the current agent can be recommended more objectively and accurately.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, but not all, of the embodiments of the present disclosure. Every other embodiment that can be obtained by a person of ordinary skill in the art without making creative efforts based on the embodiments of the present disclosure falls within the protection scope of the embodiments of the present disclosure.
The multi-agent-based recommendation method provided by the embodiments of the present disclosure is applied to a multi-agent delayed aggregation graph neural network architecture (MA-DGNN). A multi-agent system is composed of a series of interacting agents, and the agents inside it complete complex, large-scale tasks that cannot be completed by a single agent through mutual communication, cooperation, competition and other modes of interaction.
In the following, the prior-art aggregation Graph Neural Network (GNN) is described first, followed by the improved delayed aggregation graph neural network (DGNN) of the present disclosure.
Aggregation Graph Neural Networks (GNNs) are a class of information processing frameworks that operate on network data in a decentralized manner and obtain useful information through repeated communication with neighbors. An ordinary aggregation GNN operates on a fixed graph support and processes fixed signals on that fixed graph.
The prior art also extends the application of GNNs from fixed graphs to the processing of time-varying graphs supported on a time-varying graph, specifically:
(i, j) ∈ ε_n is introduced to indicate that j may send data to i at time n, where ε_n denotes the set of edges of the time-varying graph at time n. A graph-shift operator S_n is employed to describe the graph support G_n, where [S_n]_{ij} is possibly non-zero if and only if (j, i) ∈ ε_n or i = j, which represents the sparsity of the graph. More specifically, the state of node j at time n−1 is denoted x_{j(n−1)}; then, since [S_n]_{ij} is possibly non-zero if and only if (j, i) ∈ ε_n or i = j, the set of all nodes that send data to i at time n is obtained, resulting in equation (1):
Based on the locality of (1), the delayed aggregation GNN constructs a recursive k-hop neighborhood aggregation sequence as shown in formulas (2), (3) and (4). In particular, a signal sequence {y_{kn}} is defined with

y_{0n} = x_n    (2)

y_{kn} = S_n y_{(k−1)(n−1)}    (3)

so that y_{kn} = (S_n S_{n−1} ... S_{n−k+1}) x_{n−k}. Thus, (3) characterizes the diffusion of x_{n−k} over the time-varying network sequence S_{n−k+1} to S_n. Next, the aggregated y_{kn} (k ∈ {0, 1, ..., K−1}) are stacked to form a nested state along the multi-hop neighborhood:

z_{in} = [[y_{0n}]_i ; [y_{1n}]_i ; … ; [y_{(K−1)n}]_i]    (4)
where the (k+1)-th element of z_{in}, [y_{kn}]_i = [S_n S_{n−1} ... S_{n−k+1} x_{n−k}]_i, is an average of the values x_{j(n−k)} over the k-hop neighborhood of node i at time n−k.
Note that z_{in} has a regular temporal structure due to its nested aggregation property, so z_{in} is modelled with a convolutional neural network of length L, i.e., for l ∈ [L] we set:

where σ^(l) is a point-wise nonlinear function and H^(l) is a small-support filter with learnable parameters shared by every node.

In summary, the delayed aggregation neural network described by equations (3)-(5) contains a local parameterization of the policy that captures the dynamic and sparse interactions of the network and allows long-range communication through multi-hop information diffusion.
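To make the recursion in equations (2)-(4) concrete, the following is a minimal NumPy sketch of the delayed aggregation computation; the per-node CNN of equation (5) is replaced here by a single shared linear layer with a tanh nonlinearity, and all function and variable names are illustrative assumptions rather than the disclosure's implementation.

```python
import numpy as np

def delayed_aggregation_states(S_seq, x_seq, K):
    """Build the nested state z_in of eq. (4) from the recursion y_kn = S_n y_{(k-1)(n-1)} (eq. 3).

    S_seq: list of graph-shift operators [S_0, ..., S_n], each of shape (N, N)
    x_seq: list of node signals          [x_0, ..., x_n], each of shape (N,)
    K:     number of hops aggregated (k = 0, ..., K-1)
    Returns an (N, K) array whose i-th row is z_in.
    """
    n = len(x_seq) - 1
    N = x_seq[-1].shape[0]
    z = np.zeros((N, K))
    for k in range(K):
        if n - k < 0:                     # not enough history yet
            break
        # y_kn = (S_n S_{n-1} ... S_{n-k+1}) x_{n-k}
        y = x_seq[n - k]
        for m in range(n - k + 1, n + 1):
            y = S_seq[m] @ y
        z[:, k] = y                       # [y_kn]_i fills the (k+1)-th slot of z_in
    return z

def policy_readout(z, W, b):
    """Hypothetical stand-in for the length-L CNN of eq. (5): one shared linear
    layer plus tanh, mapping each node's nested state z_in to a real value."""
    return np.tanh(z @ W + b)             # shape (N,)

# toy usage: N = 4 agents, K = 3 hops, T = 5 time steps
rng = np.random.default_rng(0)
N, K, T = 4, 3, 5
S_seq = [rng.random((N, N)) * (rng.random((N, N)) < 0.4) + np.eye(N) for _ in range(T)]
x_seq = [rng.standard_normal(N) for _ in range(T)]
z = delayed_aggregation_states(S_seq, x_seq, K)
first_real_values = policy_readout(z, rng.standard_normal(K), 0.0)
print(first_real_values.shape)            # (4,)
```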
Referring to fig. 1, a schematic flow chart of a multi-agent-based recommendation method provided in an embodiment of the present disclosure includes:
and 110, acquiring historical state information of the current agent under the condition that the current input information of the current agent is determined, wherein the historical state information comprises the historical input information of the current agent and list information related to other agents.
In this step, an agent refers to an adaptive and autonomous hardware, software or other entity, for example, a robot, a drone, a game character, etc.; in the embodiments of the present disclosure, an agent can be understood as a user.
The current input information may be understood as information of a user when the message is published, including content information of the published message or information of the user related to the published message, and the like.
The list information related to other agents can be understood as the tweet list information published by the user in the past and the referee (mentioned-user) list information.
And 120, inputting the historical state information of the current agent and the current input information into the policy network, and generating a first real value of the current agent and first real values of other agents.
In this step, the policy network may be understood as a delay aggregation neural network, and the real value may be understood as a real number.
Specifically, historical state information of the current user and current input information are input into the delay aggregation neural network, and a first real value of the current user and first real values of other users are generated.
And 130, processing the first real value of the current agent and the first real values of other agents to obtain a feedback value.
In this step, the processing refers to modeling, and is specifically realized by a Markov Decision Process (MDP).
The Markov decision process is a mathematical model of sequential decision making, used to simulate the stochastic policies and returns achievable by an agent in an environment whose system state has the Markov property. The MDP is built on a set of interacting objects, namely the agent and the environment, and its elements include states, actions, feedback values and rewards. In an MDP simulation, the agent perceives the current system state and takes an action on the environment according to its policy, thereby changing the state of the environment and receiving a feedback value; the accumulation of feedback values over time is referred to as the reward.
The feedback value is obtained based on the Markov decision process, and can also be understood as a feedback value calculated by the model in order to improve the system performance.
And 140, inputting the feedback value of the current agent, the feedback values of the other agents and a preset discount factor into an evaluation network, and outputting evaluation value vectors corresponding to the current agent and the other agents.
In this step, a preset discount factor γ is set to 0.99.
The evaluation network can be understood as performing centralized evaluation for each agent: an evaluating agent in the evaluation network takes in the cascade (concatenation) of the states and actions of all agents, evaluates the state-action pair of each agent, and outputs an evaluation value vector, which is then used to update the policy network.
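For illustration only, the centralized evaluation described above can be sketched as a MADDPG/M3DDPG-style critic: a small network that takes the cascade of all agents' states and actions and returns one evaluation value per agent. The two-layer architecture, shapes and names below are assumptions, not the disclosure's network.

```python
import numpy as np

def centralized_critic(states, actions, W1, W2):
    """Hypothetical centralized evaluation: one shared two-layer network scoring the
    concatenation of every agent's state and action; returns one value per agent."""
    x = np.concatenate([states.ravel(), actions.ravel()])   # cascade of states and actions
    h = np.maximum(0.0, W1 @ x)                              # hidden layer (ReLU)
    return W2 @ h                                            # evaluation value vector, shape (N,)

# toy usage: N agents, state dimension D, scalar actions
rng = np.random.default_rng(1)
N, D, H = 4, 3, 16
states, actions = rng.standard_normal((N, D)), rng.standard_normal(N)
W1 = rng.standard_normal((H, N * D + N)) * 0.1
W2 = rng.standard_normal((N, H)) * 0.1
eval_vector = centralized_critic(states, actions, W1, W2)
print(eval_vector.shape)   # (4,)
```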
And 150, inputting the evaluation value vector into a policy network, outputting the final recommended value of other agents relative to the current agent, and determining a recommended value agent set.
In this step, the set of recommended-value agents refers to one or more agents that are referenced by the current agent in the list information of other agents.
The recommendation method based on the multi-agent comprises the steps of determining current input information and historical state information of a current agent, wherein the historical state information comprises the historical input information of the current agent and list information related to other agents; inputting the historical state information of the current agent and the current input information into a policy network, and generating a first real value of the current agent and first real values of other agents; processing the first real value of the current agent and the first real values of other agents to obtain a feedback value; inputting the feedback value of the current agent, the feedback values of other agents and a preset discount factor into an evaluation network, and outputting evaluation value vectors corresponding to the current agent and other agents; and inputting the evaluation value vector into a policy network, outputting the final recommended value of other agents relative to the current agent, and determining a recommended value agent set. The feedback value is obtained through the first real value of the current agent and the first real values of other agents, the evaluation value vector is input into the strategy network, the final recommended values of other agents relative to the current agent are output, and the agent recommended for the current agent is determined in other agents, so that the final recommended values are obtained from individual (agent) and group (agent set) consideration, and the current agent can be recommended more objectively and accurately.
Based on any of the above embodiments, referring to fig. 2, a schematic flow chart for obtaining a feedback value provided in the embodiments of the present disclosure includes:
and 210, obtaining initial recommendation values of the other agents corresponding to the input information each time relative to the current agent through the policy network based on the historical state information of the current agent and the current input information.
And 220, labeling a sample label for each input information, wherein the label value of the sample label is 1 or 0.
In this step, when an agent j mentioned by agent i in the tweet list information also belongs to the agents in the callee list information, agent j is marked as a positive sample with a label value of 1; otherwise, agent j is marked as a negative sample with a label value of 0.
And 230, sampling the first real value of the current agent to obtain a sampling value.
In this step, each agent outputs a real value, and thus each agent corresponds to a sampling value. When the first real value of the current agent is sampled, a binary value is obtained, and the binary value obeys a Bernoulli distribution. This can be expressed as: the current agent i ∈ [N] outputs a first real value, from which a binary value is sampled.
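A minimal sketch of this sampling step, under the assumption (not stated in the disclosure) that the first real value is mapped to a Bernoulli probability with a sigmoid before the binary value is drawn:

```python
import numpy as np

def sample_binary(first_real_value, rng):
    """Draw a Bernoulli sample from an agent's first real value.
    Assumption: the real value is squashed to a probability via a sigmoid."""
    p = 1.0 / (1.0 + np.exp(-first_real_value))
    return int(rng.random() < p)           # 1 or 0

rng = np.random.default_rng(2)
first_real_values = np.array([0.8, -1.2, 0.1, 2.3])   # one per agent i in [N]
sampling_values = [sample_binary(v, rng) for v in first_real_values]
print(sampling_values)   # e.g. [1, 0, 1, 1]
```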
And 240, updating the feedback value based on the sampling value and the label value.
In this step, since the number of negative samples far exceeds the number of positive samples in the real data set, it is necessary to adjust the terms in the feedback values to solve the problem of unbalanced data sets, and therefore, the feedback values are updated by the acquired sampling values and tag values to make the data sets relatively balanced.
Based on any of the above embodiments, the step 140 specifically includes the following steps 141 to 142:
141, inputting the feedback value of the current agent, the feedback values of the other agents, and a predetermined discount factor into an evaluation network, and outputting evaluation values corresponding to the current agent and the other agents.
In this step, the method is specifically realized by the following formula: a minimax target is customized for each agent to update the evaluation network at the t-th time:

where y_i denotes the minimax target of each agent and the inner minimization term denotes the loss of the policy; π is the probability of obtaining a_{t,i} in state s_t under the policy μ; s_t is the concatenation of the states of all agents at the t-th time; a_{t,i} (i ∈ [N]) is the action of agent i; r_{t,i} denotes the feedback value; γ denotes the discount factor; a'_i denotes the action of agent i in the target policy network; μ is the policy of each agent; μ_i(s_t) is the i-th component of μ(s_t); μ' is the target policy network of M3DDPG; ∗ denotes the optimal solution; and M3DDPG denotes the MiniMax Multi-agent Deep Deterministic Policy Gradient.
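Because the formula itself is not reproduced above, the following LaTeX sketch gives the commonly used M3DDPG-style form of such a minimax target, with Q_i denoting agent i's evaluation (critic) network; this is an assumed reference form rather than the disclosure's exact expression:

```latex
% Assumed M3DDPG-style minimax target for agent i at the t-th time
y_i \;=\; r_{t,i} \;+\; \gamma\,
    \min_{a'_j,\; j \neq i}
    Q_i\!\left(s_{t+1},\, a'_1, \dots, a'_N\right)
    \Big|_{a'_i = \mu'_i(s_{t+1})}

% The evaluation network is then updated by minimizing the squared error
\mathcal{L}_i \;=\; \mathbb{E}\!\left[\bigl(Q_i(s_t,\, a_{t,1}, \dots, a_{t,N}) - y_i\bigr)^{2}\right]
```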
In order to deduce y_i for each agent i ∈ [N], an end-to-end solution is adopted, which replaces the inner-loop minimization in equation (1) with a single gradient-descent step. Specifically, y_i in (1) is re-computed as follows:

where ∇ denotes the gradient operator, a'_k represents the action of agent k in the target policy network, and ϵ_j represents the minimum value of the policy loss.

After y_j and the corresponding quantities are obtained, for each i ∈ [N] they are linearly combined, and the result is used as the evaluation value in this step.
This step ensures that, in the face of multiple sources of non-stationarity from the environment and the other agents, a minimax target is customized for each agent, so that the multi-agent delayed aggregation graph neural network architecture learns a robust policy during training.
And 142, evaluating the evaluation value through an evaluation network, and outputting evaluation value vectors corresponding to the current agent and the other agents.
In this step, the evaluation network evaluates the state-action pair of each agent, and outputs evaluation value vectors corresponding to the current agent and the other agents, thereby updating the policy network.
Referring to fig. 3, a schematic flow chart for obtaining a final recommended value according to the embodiment of the present disclosure includes:
and 310, inputting the evaluation value vector into a policy network to obtain an updated policy network.
In this step, a delayed aggregation neural network is used as the policy network.
And 320, generating a second true value of the current agent and second true values of other agents based on the updated policy network, the historical state information of the current agent and the current input information.
330, determining the weight values of other agents relative to the current agent based on a preset agent utility matrix; wherein the agent utility matrix comprises a weight value for each agent.
In this step, the preset agent utility matrix models the dynamic relationship between each pair of users and is updated by a neural network matrix decomposition method; its weights are based on the similarity between users and are optimized independently of the DGNN, rather than being coupled to the DGNN and updated slowly with one feedback at a time.
And 340, outputting the final recommended value of the other agents relative to the current agent based on the second real value of the current agent, the second real values of the other agents and the weight value.
In this step, the preset agent utility matrix is used to aggregate the second real value of each agent and output a linearly weighted value o_t for making the final decision; the final recommended value is denoted by o_t.
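A minimal sketch of this aggregation, assuming the weights are read from the current agent's row of the agent utility matrix (function and variable names are illustrative):

```python
import numpy as np

def aggregate_final_value(second_real_values, utility_matrix, current_agent):
    """o_t: linear combination of all agents' second real values, weighted by the
    current agent's row of the preset agent utility matrix."""
    weights = utility_matrix[current_agent]        # weight of each agent w.r.t. the current agent
    return float(weights @ second_real_values)

rng = np.random.default_rng(3)
N = 4
utility_matrix = rng.random((N, N))
second_real_values = rng.standard_normal(N)
o_t = aggregate_final_value(second_real_values, utility_matrix, current_agent=0)
print(o_t)
```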
Based on any of the above embodiments, the step 150 specifically includes the following steps 151 to 153:
and 151, comparing the recommended value of the other agents relative to the current agent with a preset threshold value.
In this step, the preset threshold value is denoted by thre_t, and o_t is compared with thre_t.
And 152, if the recommendation value is greater than or equal to a preset threshold value, recommending a recommendation value agent set for the current agent, wherein the recommendation value agent set is the set of agents related to the input information and the historical state information of the current agent.
In this step, if o_t ≥ thre_t, the last-mentioned K_0 agents are recommended for the current agent i_t, where K_0 is one or more;
153, if the recommended value is smaller than the preset threshold, randomly recommending the agent for the current agent.
In this step, if o_t < thre_t, an agent is randomly recommended for the current agent i_t.
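The decision logic of steps 151 to 153 can be summarized in the following sketch; thre_t, K_0 and the ordering of the last-mentioned agents are illustrative placeholders:

```python
import numpy as np

def recommend(o_t, thre_t, last_mentioned, all_agents, k0, rng):
    """If o_t >= thre_t, recommend the last-mentioned K_0 agents; otherwise
    fall back to a random recommendation."""
    if o_t >= thre_t:
        return last_mentioned[:k0]
    return list(rng.choice(all_agents, size=k0, replace=False))

rng = np.random.default_rng(4)
print(recommend(0.72, 0.5, last_mentioned=[7, 2, 9], all_agents=np.arange(20), k0=2, rng=rng))
```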
Based on any of the above embodiments, updating the feedback value based on the sampling value and the tag value is implemented by the following formula:
where p_t denotes the number of historical positive samples in the historical state information, n_t denotes the number of historical negative samples in the historical state information, G_t denotes the tag value at the t-th time, and o_{t,i} denotes the recommended value of agent i at the t-th time.
In this step, when an agent j mentioned by agent i in the tweet list information also belongs to the agents in the callee list information, agent j is marked as a positive sample with a label value of 1; otherwise, agent j is marked as a negative sample with a label value of 0.
The number of positive samples is the number of samples having a tag value of 1 in the entire sample data, and the number of negative samples is the number of samples having a tag value of 0 in the entire sample data.
Based on any of the above embodiments, the method further comprises:
updating a weight value in the agent utility matrix based on a first loss function and a second loss function; wherein the first loss function is:
the second loss function is:
where p_t denotes the number of historical positive samples in the historical state information, n_t denotes the number of historical negative samples in the historical state information, G_t denotes the tag value at the t-th time, o_t denotes the recommended value, μ denotes the policy of each agent, i and j denote agents, s_ij denotes the similarity between the states of agent i and agent j, the agent utility matrix is decomposed into two small matrices, denoted A and B, of sizes N×d and d×N respectively, with d less than N, a_i is the i-th row of matrix A, a_j is the j-th row of matrix A, b_i is the i-th column of matrix B, and b_j is the j-th column of matrix B.
In this step, the weight values are updated based on the preset agent utility matrix. The preset agent utility matrix contains N² parameters, so it is decomposed into two small matrices A and B of sizes N×d and d×N respectively, where d is less than N. To estimate these two matrices, a neural network matrix decomposition method is applied, whose loss consists of the two loss functions described above.
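A minimal sketch of this low-rank decomposition: the N×N utility matrix is approximated by A·B with A of size N×d and B of size d×N. Since the disclosure's first and second loss functions are not reproduced above, a plain squared-error reconstruction loss is used here purely for illustration:

```python
import numpy as np

def factorize_utility_matrix(U, d, lr=0.01, steps=2000, seed=0):
    """Approximate the N x N agent utility matrix U as A @ B, with A: (N, d), B: (d, N), d < N.
    Uses gradient descent on a squared-error surrogate loss (illustrative, not the
    disclosure's first/second loss functions)."""
    rng = np.random.default_rng(seed)
    N = U.shape[0]
    A = rng.standard_normal((N, d)) * 0.1
    B = rng.standard_normal((d, N)) * 0.1
    for _ in range(steps):
        E = A @ B - U                 # residual of the reconstruction
        A -= lr * (E @ B.T)           # gradient of 0.5 * ||AB - U||^2 w.r.t. A
        B -= lr * (A.T @ E)           # gradient w.r.t. B
    return A, B

rng = np.random.default_rng(5)
N, d = 8, 3
U_true = rng.random((N, d)) @ rng.random((d, N))     # a rank-d utility matrix for the demo
A, B = factorize_utility_matrix(U_true, d)
print(np.abs(A @ B - U_true).max())                   # small reconstruction error
```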
Based on any of the above embodiments, the delayed aggregation graph neural network architecture (MA-DGNN) provided in the embodiments of the present disclosure needs to be trained before the multi-agent-based recommendation method is executed. In order to obtain other agents with greater similarity and better decision quality, a customized experience replay mechanism is adopted when collecting sample agents in the replay buffer of the delayed aggregation graph neural network architecture; specifically, the factors to be considered include: (i) time; (ii) whether the user feedback is positive or negative; (iii) the number of times a user mentions a certain user; (iv) the number of times a user is mentioned by other users; (v) the average performance of the user in assisting other users to make decisions; (vi) the average similarity between the user's state and the states of other users.
For each sample agent i ∈ [length_buffer], the six factors mentioned above are calculated. For each factor j ∈ [6], a softmax function is executed over the buffer to obtain a per-factor score, and for each sampling process the sample of agent i in the replay buffer (i ∈ [length_buffer]) then corresponds to a sampling probability derived from these scores. In order to improve the diversity of experience replay, part of the training samples in a batch are obtained by the above sampling method, and in addition the remaining part of the batch is obtained by state clustering. Specifically, the samples in the buffer are divided into 10 groups according to their states, and samples are randomly and uniformly selected from each group. Furthermore, during sampling, the standard error of the true label is measured for each group of samples, and the average of the 10 standard errors, std_{avg,t}, is obtained. If std_{avg,t} is greater than 80% of the historical minimum min{std_{avg,i}}_{i∈[t−1]}, it is determined that significant non-stationarity may exist, and the oldest min{100, std_{avg,t} × 100} samples are removed, so as to keep track of the users' latest preferences.
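A sketch of the customized experience replay under several explicit assumptions: the six factors are precomputed per sample, the per-factor softmax scores are averaged into one sampling probability, half of each batch comes from this priority sampling and the other half from 10 state groups (the exact split, score combination and clustering method are not specified in the text):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sample_batch(factors, states, batch_size, rng):
    """factors: (buffer_len, 6) matrix of the six factors for every sample in the buffer.
    states:  (buffer_len, state_dim), used only for the state-clustered half of the batch."""
    buffer_len = factors.shape[0]
    # priority half: softmax over the buffer for each factor, then average the six scores
    per_factor = np.stack([softmax(factors[:, j]) for j in range(6)], axis=1)
    probs = per_factor.mean(axis=1)
    probs /= probs.sum()
    k = batch_size // 2
    prioritized = rng.choice(buffer_len, size=k, replace=False, p=probs)
    # clustered half: split the buffer into 10 state groups and draw uniformly from each
    order = np.argsort(states[:, 0])               # crude 1-D "clustering" for illustration
    groups = np.array_split(order, 10)
    clustered = [rng.choice(g) for g in groups[: batch_size - k]]
    return np.concatenate([prioritized, np.array(clustered)])

rng = np.random.default_rng(6)
factors = rng.random((200, 6))
states = rng.standard_normal((200, 5))
batch = sample_batch(factors, states, batch_size=20, rng=rng)
print(batch.shape)   # (20,)
```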
Further, to describe the implementation of the present disclosure in more detail, reference is made to fig. 4, which is a block diagram of the multi-agent-based recommendation method provided in an embodiment of the present disclosure; the process in fig. 5 is implemented by using the MDP, the policy network and the evaluation network in fig. 4. Referring to fig. 5, the specific flow of the multi-agent-based recommendation method provided in the embodiment of the present disclosure includes the following steps 510 to 590:
and 510, acquiring current input information and historical state information of the current agent through MDP, wherein the historical state information comprises historical input information of the current agent and list information related to other agents, and the list information of other agents comprises tweet list information and list information of the referred persons.
And 520, inputting the historical state information of the current agent and the current input information into the policy network, and generating a first real value of the current agent and the first real values of other agents.
And 530, obtaining the initial recommended value of other agents corresponding to the input information each time relative to the current agent through the policy network based on the historical state information of the current agent and the current input information.
And 540, labeling a sample label for each input information, wherein the label value of the sample label is 1 or 0, sampling the first real value of the current intelligent agent to obtain a sampling value, and updating the feedback value based on the sampling value and the label value.
And 550, inputting the feedback value of the current agent, the feedback values of other agents and a preset discount factor into an evaluation network, outputting evaluation values corresponding to the current agent and other agents, evaluating the evaluation values through the evaluation network, and outputting evaluation value vectors corresponding to the current agent and other agents.
560, inputting the evaluation value vector into the policy network, outputting the final recommended value o_t of the other agents relative to the current agent, and determining a recommended value agent set.
570, comparing o_t with a preset threshold thre_t.
580, if o_t ≥ thre_t, recommending the last-mentioned K_0 agents for the current agent i_t, where K_0 may be one or more.
590, if o_t < thre_t, randomly recommending an agent for the current agent i_t.
In the following, a multi-agent based recommendation device provided by an embodiment of the present disclosure is described, and the multi-agent based recommendation device described below and the multi-agent based recommendation method described above may be referred to correspondingly.
Referring specifically to fig. 6, a schematic structural diagram of a multi-agent based recommendation device provided in an embodiment of the present disclosure is shown, where the device includes:
an obtaining module 610, configured to, in a case that current input information of a current agent is determined, obtain historical state information of the current agent, where the historical state information includes historical input information of the current agent and list information related to other agents.
And the generating module 620 is configured to input the historical state information of the current agent and the current input information into the policy network, and generate the first real value of the current agent and the first real values of the other agents.
The processing module 630 is configured to perform processing based on the first actual value of the current agent and the first actual values of the other agents to obtain the feedback value.
An input module 640, configured to input the feedback value of the current agent, the feedback values of the other agents, and a preset discount factor into an evaluation network, and output evaluation value vectors corresponding to the current agent and the other agents.
And the determining module 650 is configured to input the evaluation value vector into a policy network, output a final recommended value of the other agents relative to the current agent, and determine a recommended value agent set.
The recommendation device based on the multi-agent determines current input information and historical state information of a current agent, wherein the historical input information of the current agent and list information related to other agents are included in the historical state information; inputting the historical state information of the current agent and the current input information into a policy network, and generating a first real value of the current agent and first real values of other agents; processing the first real value of the current agent and the first real values of other agents to obtain a feedback value; inputting the feedback value of the current agent, the feedback values of other agents and a preset discount factor into an evaluation network, and outputting evaluation value vectors corresponding to the current agent and other agents; and inputting the evaluation value vector into a policy network, outputting the final recommended value of other agents relative to the current agent, and determining a recommended value agent set. The feedback value is obtained through the first real value of the current agent and the first real values of other agents, the evaluation value vector is input into the strategy network, the final recommended values of other agents relative to the current agent are output, and the agent recommended for the current agent is determined in other agents, so that the final recommended values are obtained from individual (agent) and group (agent set) consideration, and the current agent can be recommended more objectively and accurately.
Based on any of the above embodiments, the processing module 630 is specifically configured to:
based on the historical state information and the current input information of the current agent, obtaining the initial recommended values of the other agents corresponding to the input information of each time relative to the current agent through the policy network;
labeling a sample label for each input information, wherein the label value of the sample label is 1 or 0;
sampling a first real value of the current agent to obtain a sampling value;
updating the feedback value based on the sampling value and the tag value.
Based on any of the above embodiments, the input module 640 is specifically configured to:
inputting the feedback value of the current agent, the feedback values of the other agents and a preset discount factor into an evaluation network, and outputting evaluation values corresponding to the current agent and the other agents;
and evaluating the evaluation value through an evaluation network, and outputting evaluation value vectors corresponding to the current agent and the other agents.
Based on any of the embodiments above, the determining module 650 is specifically configured to:
inputting the evaluation value vector into a strategy network to obtain an updated strategy network;
generating a second true value of the current agent and second true values of other agents based on the updated policy network, the historical state information of the current agent, and the current input information;
determining the weight values of other agents relative to the current agent based on a preset agent utility matrix; wherein the agent utility matrix comprises a weight value of each agent;
and outputting the final recommended value of the other agents relative to the current agent based on the second real value of the current agent, the second real values of the other agents and the weight value.
Based on any of the above embodiments, the determining module 650 is further configured to:
comparing the recommended values of the other agents relative to the current agent with a preset threshold value;
if the recommendation value is larger than or equal to a preset threshold value, recommending a recommendation value agent set for the current agent, wherein the recommendation value agent set is the set of agents related to the input information and the historical state information of the current agent;
and if the recommended value is smaller than the preset threshold value, randomly recommending the agent for the current agent.
Based on any of the above embodiments, updating the feedback value based on the sampling value and the tag value is implemented by the following formula:
where p_t denotes the number of historical positive samples in the historical state information, n_t denotes the number of historical negative samples in the historical state information, G_t denotes the tag value at the t-th time, and o_{t,i} denotes the recommended value of agent i at the t-th time.
The multi-agent-based recommendation device provided by the embodiment of the disclosure further comprises an updating module, which is specifically configured to:
updating a weight value in the agent utility matrix based on a first loss function and a second loss function; wherein the first loss function is:
the second loss function is:
where p_t denotes the number of historical positive samples in the historical state information, n_t denotes the number of historical negative samples in the historical state information, G_t denotes the tag value at the t-th time, o_t denotes the recommended value, μ denotes the policy of each agent, i and j denote agents, s_ij denotes the similarity between the state of agent i and the state of agent j, the agent utility matrix is decomposed into two small matrices, denoted A and B, of sizes N×d and d×N respectively, with d less than N, a_i is the i-th row of matrix A, a_j is the j-th row of matrix A, b_i is the i-th column of matrix B, and b_j is the j-th column of matrix B.
Fig. 7 illustrates a physical structure diagram of an electronic device, and as shown in fig. 7, the electronic device may include: a processor (processor)710, a communication Interface (Communications Interface)720, a memory (memory)730, and a communication bus 740, wherein the processor 710, the communication Interface 720, and the memory 730 communicate with each other via the communication bus 740. Processor 710 may invoke logic instructions in memory 730 to perform a multi-agent based recommendation method comprising: acquiring historical state information of the current agent under the condition that the current input information of the current agent is determined, wherein the historical state information comprises the historical input information of the current agent and list information related to other agents; inputting the historical state information of the current agent and the current input information into a policy network, and generating a first real value of the current agent and first real values of other agents; processing based on the first real value of the current agent and the first real values of other agents to obtain a feedback value; inputting the feedback value of the current agent, the feedback values of the other agents and a preset discount factor into an evaluation network, and outputting evaluation value vectors corresponding to the current agent and the other agents; and inputting the evaluation value vector into a policy network, outputting the final recommended values of other agents relative to the current agent, and determining a recommended value agent set.
In addition, the logic instructions in the memory 730 can be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solutions of the embodiments of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present disclosure. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, the present disclosure also provides a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the multi-agent based recommendation method provided by the above methods, including: acquiring historical state information of the current agent under the condition that the current input information of the current agent is determined, wherein the historical state information comprises the historical input information of the current agent and list information related to other agents; inputting the historical state information of the current agent and the current input information into a policy network, and generating a first real value of the current agent and first real values of other agents; processing the first real value of the current agent and the first real values of other agents to obtain a feedback value; inputting the feedback value of the current agent, the feedback values of the other agents and a preset discount factor into an evaluation network, and outputting evaluation value vectors corresponding to the current agent and the other agents; and inputting the evaluation value vector into a policy network, outputting the final recommended value of other agents relative to the current agent, and determining a recommended value agent set.
In yet another aspect, the present disclosure also provides a non-transitory computer-readable storage medium having stored thereon a computer program, which when executed by a processor is implemented to perform the multi-agent based recommendation method provided above, including: acquiring historical state information of the current agent under the condition that the current input information of the current agent is determined, wherein the historical state information comprises the historical input information of the current agent and list information related to other agents; inputting the historical state information of the current agent and the current input information into a policy network, and generating a first real value of the current agent and first real values of other agents; processing the first real value of the current agent and the first real values of other agents to obtain a feedback value; inputting the feedback value of the current agent, the feedback values of the other agents and a preset discount factor into an evaluation network, and outputting evaluation value vectors corresponding to the current agent and the other agents; and inputting the evaluation value vector into a policy network, outputting the final recommended value of other agents relative to the current agent, and determining a recommended value agent set.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present disclosure, not to limit it; although the present disclosure has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present disclosure.