CN120048123A - Vehicle-road collaborative dynamic scheduling system and method for multi-agent reinforcement learning - Google Patents
- Publication number
- CN120048123A (application number CN202510521087.5A)
- Authority
- CN
- China
- Prior art keywords
- vehicle
- network
- strategy
- module
- collaborative
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G08—SIGNALLING
- G08G—TRAFFIC CONTROL SYSTEMS
- G08G1/00—Traffic control systems for road vehicles
- G08G1/01—Detecting movement of traffic to be counted or controlled
- G08G1/0104—Measuring and analyzing of parameters relative to traffic conditions
- G08G1/0125—Traffic data processing
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/004—Artificial life, i.e. computing arrangements simulating life
- G06N3/006—Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/092—Reinforcement learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
- G06N5/041—Abduction
-
- G—PHYSICS
- G08—SIGNALLING
- G08G—TRAFFIC CONTROL SYSTEMS
- G08G1/00—Traffic control systems for road vehicles
- G08G1/01—Detecting movement of traffic to be counted or controlled
- G08G1/0104—Measuring and analyzing of parameters relative to traffic conditions
- G08G1/0108—Measuring and analyzing of parameters relative to traffic conditions based on the source of data
- G08G1/0112—Measuring and analyzing of parameters relative to traffic conditions based on the source of data from the vehicle, e.g. floating car data [FCD]
-
- G—PHYSICS
- G08—SIGNALLING
- G08G—TRAFFIC CONTROL SYSTEMS
- G08G1/00—Traffic control systems for road vehicles
- G08G1/01—Detecting movement of traffic to be counted or controlled
- G08G1/0104—Measuring and analyzing of parameters relative to traffic conditions
- G08G1/0108—Measuring and analyzing of parameters relative to traffic conditions based on the source of data
- G08G1/0116—Measuring and analyzing of parameters relative to traffic conditions based on the source of data from roadside infrastructure, e.g. beacons
-
- G—PHYSICS
- G08—SIGNALLING
- G08G—TRAFFIC CONTROL SYSTEMS
- G08G1/00—Traffic control systems for road vehicles
- G08G1/01—Detecting movement of traffic to be counted or controlled
- G08G1/0104—Measuring and analyzing of parameters relative to traffic conditions
- G08G1/0137—Measuring and analyzing of parameters relative to traffic conditions for specific applications
- G08G1/0145—Measuring and analyzing of parameters relative to traffic conditions for specific applications for active traffic flow control
-
- G—PHYSICS
- G08—SIGNALLING
- G08G—TRAFFIC CONTROL SYSTEMS
- G08G1/00—Traffic control systems for road vehicles
- G08G1/07—Controlling traffic signals
- G08G1/08—Controlling traffic signals according to detected number or speed of vehicles
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W4/00—Services specially adapted for wireless communication networks; Facilities therefor
- H04W4/30—Services specially adapted for particular environments, situations or purposes
- H04W4/40—Services specially adapted for particular environments, situations or purposes for vehicles, e.g. vehicle-to-pedestrians [V2P]
- H04W4/44—Services specially adapted for particular environments, situations or purposes for vehicles, e.g. vehicle-to-pedestrians [V2P] for communication between vehicles and infrastructures, e.g. vehicle-to-cloud [V2C] or vehicle-to-home [V2H]
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Artificial Intelligence (AREA)
- Chemical & Material Sciences (AREA)
- Analytical Chemistry (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- General Health & Medical Sciences (AREA)
- Biophysics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Traffic Control Systems (AREA)
Abstract
The invention discloses a vehicle-road collaborative dynamic scheduling system and method for multi-agent reinforcement learning, relating to the fields of artificial intelligence and intelligent transportation and comprising a perception layer, a decision layer and an execution layer. The perception layer is provided with a roadside sensing module and a vehicle sensing module and collects environment and vehicle information. The decision layer receives this information and maps it to a multidimensional topological space through a topological dynamic state characterization unit, realizing nonlinear dimension reduction and a manifold space representation. A dual-network collaborative decision unit comprises a vehicle dynamic allocation network (VDDPG) and a roadside dynamic scheduling network (RDDPG), which respectively generate the vehicle and roadside scheduling policies; policy matrix decomposition converts these policies into a low-rank representation. A multi-agent collaborative optimization unit constructs an agent relationship graph, optimizes the communication policy based on information entropy, infers the collaborative scheduling policy through variational inference, and generates an optimized collaborative scheduling scheme. The execution layer receives the scheme; a vehicle execution module and a roadside control module execute the instructions and feed the results back to the perception layer, forming closed-loop control.
Description
Technical Field
The invention relates to the field of artificial intelligence and intelligent traffic, in particular to a vehicle-road collaborative dynamic scheduling system and a vehicle-road collaborative dynamic scheduling method based on a multi-agent reinforcement learning technology, which are used for realizing collaborative optimization management of intelligent vehicles and road side facilities.
Background
With accelerating urbanization, urban traffic congestion has become increasingly severe, seriously affecting residents' travel efficiency and quality of life. Traditional traffic management relies mainly on fixed-cycle signal control and manual intervention, cannot adapt to dynamically changing traffic flows, and therefore makes poor use of traffic resources. With the rapid development of Internet of Vehicles technology and artificial intelligence in recent years, vehicle-road cooperation is gradually emerging as a core technology of next-generation intelligent transportation systems.
At present, vehicle-road cooperation technologies fall mainly into three types: centralized control methods, which uniformly dispatch vehicles and roadside facilities through a central server; distributed control methods, which achieve cooperation through local decisions of the vehicles and roadside facilities; and reinforcement learning methods, which learn an optimal control strategy through interaction between agents and the environment. However, the prior art has shortcomings: centralized control methods have high computational complexity and struggle to meet real-time requirements; distributed control methods cannot guarantee global optimality; and existing reinforcement learning methods have difficulty handling the high-dimensional state space and multi-agent collaborative decision problems of the vehicle-road environment.
In order to solve the above-mentioned problems, a multi-agent reinforcement learning system capable of efficiently processing a cooperative decision of a vehicle and a road in a complex traffic environment is needed.
Disclosure of Invention
The invention aims to provide a vehicle-road collaborative dynamic scheduling system and a vehicle-road collaborative dynamic scheduling method for multi-agent reinforcement learning, and aims to solve the problems that in the prior art, complex traffic environment state characterization, vehicle-road collaborative decision, multi-agent information interaction and the like are difficult to process efficiently.
The invention discloses a vehicle-road collaborative dynamic scheduling system for multi-agent reinforcement learning, which comprises:
The sensing layer comprises a road side sensing module and a vehicle sensing module and is used for collecting environment state information and vehicle information;
The decision layer is in communication connection with the perception layer and comprises:
The topological dynamic state characterization unit is used for receiving the environment state information and the vehicle information, mapping the environment state information and the vehicle information to a multidimensional topological space to form a topological state representation, and forming a manifold space representation through nonlinear dimension reduction;
The dual-network collaborative decision-making unit is configured to receive the manifold space representation, and includes a vehicle dynamic allocation network VDDPG and a roadside dynamic scheduling network RDDPG, where the VDDPG is configured to generate a vehicle scheduling policy, the RDDPG is configured to generate a roadside scheduling policy, and convert the vehicle scheduling policy and the roadside scheduling policy into a low-rank representation through policy matrix decomposition;
the multi-agent collaborative optimization unit is used for constructing an agent relationship graph, optimizing a communication policy in the agent relationship graph based on information entropy, coordinating the vehicle scheduling policy and the roadside scheduling policy through a variational inference method, and generating an optimized collaborative scheduling scheme;
And the execution layer is in communication connection with the decision layer and comprises a vehicle execution module and a road side control module, and is used for receiving the cooperative scheduling scheme, executing a corresponding scheduling instruction and feeding back an execution result to the perception layer to form closed-loop control.
Preferably, the topology dynamic state characterization unit includes:
The state acquisition subunit is used for receiving the environment state information and the vehicle information, and preprocessing the environment state information and the vehicle information to generate standardized data;
the topology characterization subunit is used for mapping the standardized data to a multidimensional topology space, extracting topology features and establishing a local coordinate system;
And the dimension conversion subunit is used for performing nonlinear dimension reduction on the data of the multidimensional topological space and generating manifold space representation which keeps the key topological relation.
Preferably, the dual-network collaborative decision-making unit includes:
The VDDPG Actor network is used for receiving the manifold space representation related to the vehicle and generating the vehicle scheduling policy;
The VDDPG Critic network is used for evaluating the value of the vehicle scheduling policy;
The RDDPG Actor network is used for receiving the manifold space representation related to the road side and generating the roadside scheduling policy;
The RDDPG Critic network is used for evaluating the value of the roadside scheduling policy;
the strategy characterization subunit is used for converting the vehicle scheduling strategy and the road side scheduling strategy into strategy matrixes and executing matrix decomposition to generate low-rank representation;
And the collaborative optimization subunit is used for identifying potential conflict between the vehicle scheduling strategy and the road side scheduling strategy and adjusting strategy parameters through a conjugate gradient method.
Preferably, the multi-agent cooperative optimization unit includes:
The agent relationship construction subunit is used for constructing an agent relationship graph based on the current traffic conditions and identifying the conditional dependency relationships among the agents;
the message transmission optimizing subunit is used for calculating the information entropy of each communication channel in the intelligent agent relation diagram, generating an optimal message transmission strategy and distributing communication resources;
The global consistency optimization subunit is used for decomposing the global optimization target into local sub-targets, coordinating the local decisions through a variational inference method, and verifying the global consistency of the final decisions.
Preferably, the sensing layer further includes:
the data preprocessing module is used for filtering, denoising and standardizing the environment state information and the vehicle information;
the state buffer module is used for storing historical environment state information and vehicle information and supporting time sequence analysis;
And the attention distribution module is used for adjusting the distribution strategy of the perceived resources according to the feedback of the decision layer.
Preferably, the execution layer further includes:
the instruction analysis module is used for converting the collaborative scheduling scheme into a specific control instruction;
the execution monitoring module is used for monitoring the execution condition of the control instruction and identifying an abnormal state;
The execution history recording module is used for maintaining execution history data for reference of a decision layer;
and the degradation processing module is used for executing a preset degradation strategy under the condition of communication interruption or equipment failure.
Preferably, the vehicle execution module and the roadside control module are provided with:
a timing synchronization unit for ensuring time synchronization of the vehicle control instruction and the road side control instruction;
the conflict detection unit is used for detecting potential conflicts in the execution process in real time and triggering emergency treatment;
and the cooperative effect evaluation unit is used for quantitatively evaluating the execution effect of the vehicle-road cooperative control and generating an effect evaluation report.
Preferably, the VDDPG network and the RDDPG network exchange information through a shared hidden layer, wherein:
The shared hidden layer receives common features of the manifold space representation and outputs intermediate feature representations;
the VDDPG network and the RDDPG network respectively receive the intermediate feature representations and generate corresponding strategy output by combining respective specific input features;
the VDDPG network and the RDDPG network realize collaborative parameter updating through a gradient locking mechanism, so as to ensure policy coordination consistency.
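As a hedged illustration of the shared hidden layer and gradient locking described above, the following NumPy sketch lets two policy heads read one shared layer whose weights are updated with gradients from both heads in the same step. The layer sizes, squared-error losses and update rule are assumptions for illustration, not the patent's actual VDDPG/RDDPG implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumed): 16-d manifold features, 8-d shared layer,
# 4 vehicle-action dims, 3 roadside-action dims.
W_s = rng.normal(0, 0.1, (16, 8))   # shared hidden layer
W_v = rng.normal(0, 0.1, (8, 4))    # vehicle policy head
W_r = rng.normal(0, 0.1, (8, 3))    # roadside policy head

def forward(s):
    h = np.tanh(s @ W_s)            # intermediate feature representation
    return h, h @ W_v, h @ W_r      # shared features plus two policy outputs

def locked_step(s, t_v, t_r, lr=0.05):
    """One SGD step: the shared layer receives gradients from BOTH heads."""
    global W_s, W_v, W_r
    h, a_v, a_r = forward(s)
    e_v, e_r = a_v - t_v, a_r - t_r          # per-head errors
    dh = e_v @ W_v.T + e_r @ W_r.T           # combined gradient flowing into h
    dWs = s.T @ (dh * (1 - h**2))            # backprop through tanh
    W_v -= lr * (h.T @ e_v)
    W_r -= lr * (h.T @ e_r)
    W_s -= lr * dWs                          # joint ("locked") update
    return float((e_v**2).sum() + (e_r**2).sum())

s = rng.normal(size=(1, 16))                 # one manifold-space state
t_v, t_r = np.zeros((1, 4)), np.zeros((1, 3))  # dummy policy targets
losses = [locked_step(s, t_v, t_r) for _ in range(50)]
```

Because both head errors contribute to the shared-layer gradient, the two policies cannot drift independently, which is one way to read the "gradient locking" coordination above.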
Preferably, the system further comprises:
the experience playback module is used for storing historical data of system and environment interaction and supporting offline batch learning;
the target network updating module is used for copying parameters from the main network to the target network at regular intervals, so as to ensure learning stability;
the self-adaptive learning rate adjusting module is used for dynamically adjusting the network learning rate according to training progress;
the model evaluation and deployment module is used for evaluating the performance of the model and deploying the trained model to the production environment.
The vehicle-road collaborative dynamic scheduling method for multi-agent reinforcement learning is applied to the vehicle-road collaborative dynamic scheduling system for multi-agent reinforcement learning, and comprises the following steps:
initializing a system, including initializing VDDPG network and RDDPG network parameters, establishing communication connection, loading road network topology structure and historical traffic data;
Collecting state information, including collecting environmental state information through a road side sensing module and collecting vehicle information through a vehicle sensing module;
Performing topology state characterization, including mapping state information to a multidimensional topology space, extracting topology features, forming manifold space representations by nonlinear dimension reduction;
Generating a dual-network decision, including generating a vehicle scheduling policy by the VDDPG Actor network, generating a roadside scheduling policy by the RDDPG Actor network, and performing policy matrix decomposition to generate a low-rank representation;
Executing collaborative optimization, including constructing an agent relationship graph, optimizing the message passing policy, coordinating local decisions through a variational inference method, and verifying global consistency;
Executing the scheduling scheme, including converting the optimized decision into specific execution instructions and executing the control instructions by the vehicles and roadside facilities;
Collecting execution feedback, including monitoring execution results, updating environmental states, updating network parameters and experience pools based on the execution results;
Iterative optimization, including repeatedly executing the above steps to continuously optimize the scheduling effect.
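The method steps above can be sketched as one closed control loop. Every function below is a stand-in stub introduced purely to show the loop's shape, not the patent's implementation:

```python
import random

def collect_state():                  # perception layer (stub)
    return [random.random() for _ in range(4)]

def characterize(state):              # topology state characterization (stub)
    return state[:2]                  # pretend nonlinear dimension reduction

def decide(features):                 # dual-network decision (stub)
    return {"vehicle": features[0], "roadside": features[1]}

def cooperate(policy):                # multi-agent collaborative optimization (stub)
    return policy

def execute(scheme):                  # execution layer (stub)
    return {"delay": scheme["vehicle"] + scheme["roadside"]}

def scheduling_loop(iterations=3):
    history = []
    for _ in range(iterations):       # iterative optimization
        s = collect_state()
        feats = characterize(s)
        scheme = cooperate(decide(feats))
        feedback = execute(scheme)    # feedback closes the control loop
        history.append(feedback)
    return history
```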
The invention adopts three innovative mechanisms, namely topological dynamic state characterization, matrix-decomposition-driven dual-network collaborative optimization, and probabilistic-graph-driven multi-agent collaborative decision-making, to construct a complete multi-agent reinforcement learning vehicle-road collaborative dynamic scheduling system with the following beneficial effects:
1) The high-dimensional complex traffic environment state is mapped to the low-dimensional manifold space through the topology dynamic state representation mechanism, key topology features are reserved, the calculation complexity is reduced, and the adaptability of the system to complex traffic scenes is improved.
2) The cooperative decision of the vehicle network and the road side network is realized through a matrix decomposition driven double-network cooperative optimization mechanism, the conflict problem of the vehicle strategy and the road side strategy is solved, and the scheduling efficiency is improved.
3) The probabilistic-graph-driven multi-agent collaborative decision mechanism optimizes the information interaction strategy among agents, reducing communication cost and ensuring consistency between local decisions and the global optimization objective.
4) The system is designed in a layering and modularization way, has good expandability and adaptability, and can adapt to urban traffic environments with different scales and complexities.
Drawings
In order to more clearly illustrate the technical solution of the embodiments of the present invention, the drawings that are required to be used in the description of the embodiments will be briefly described below.
FIG. 1 is a schematic diagram of the overall architecture of a vehicle-road collaborative dynamic scheduling system for multi-agent reinforcement learning according to the present invention.
FIG. 2 is a schematic diagram of the topology dynamic state characterization unit of the present invention.
Fig. 3 is a schematic diagram of the structure of the dual-network collaborative decision-making unit of the present invention.
FIG. 4 is a schematic diagram of the architecture of the multi-agent co-optimization unit of the present invention.
FIG. 5 is a flow chart of the vehicle-road cooperative dynamic scheduling method of the invention.
FIG. 6 is a schematic diagram of a topology state characterization process in an embodiment of the present invention.
Fig. 7 is a schematic diagram of a dual-network collaborative optimization process in an embodiment of the invention.
FIG. 8 is a schematic diagram of a multi-agent collaborative decision-making process in an embodiment of the invention.
Detailed Description
The following detailed description of the preferred embodiments of the present invention is provided in conjunction with the accompanying drawings, it being understood that the preferred embodiments described herein are merely illustrative and explanatory of the invention, and are not restrictive of the invention.
Referring to fig. 1, the vehicle-road collaborative dynamic scheduling system for multi-agent reinforcement learning provided by the invention comprises three main parts of a perception layer 101, a decision layer 102 and an execution layer 103.
The sensing layer 101 includes a road side sensing module 111 and a vehicle sensing module 112 for collecting environmental status information and vehicle information. The road side sensing module 111 may be various sensing devices distributed in the urban road network, such as cameras, radars, signal lamp controllers, etc., for collecting environmental information such as traffic flow, signal lamp states, road network structures, etc. The vehicle awareness module 112 may be various sensors and communication devices mounted on the vehicle for collecting vehicle information such as vehicle position, speed, direction, destination, etc.
The decision layer 102 is the core part of the system and comprises a topology dynamic state characterization unit 121, a dual-network collaborative decision unit 122 and a multi-agent collaborative optimization unit 123. The topology dynamic state characterization unit 121 is configured to receive the environmental state information and the vehicle information, map them to a multidimensional topological space to form a topological state representation, and form the manifold space representation through nonlinear dimension reduction. The dual-network collaborative decision unit 122 includes the vehicle dynamic allocation network VDDPG and the roadside dynamic scheduling network RDDPG, which respectively generate the vehicle scheduling policy and the roadside scheduling policy and convert these policies into a low-rank representation through policy matrix decomposition. The multi-agent collaborative optimization unit 123 is configured to construct an agent relationship graph, optimize the communication policy, coordinate the decisions of each agent, and generate the final collaborative scheduling scheme.
The execution layer 103 includes a vehicle execution module 131 and a roadside control module 132, configured to receive the collaborative scheduling scheme, execute the corresponding scheduling instructions, and feed the execution results back to the perception layer 101 to form closed-loop control. The vehicle execution module 131 is responsible for controlling vehicle driving behavior, such as adjusting speed or rerouting. The roadside control module 132 is responsible for controlling the operating state of roadside facilities, such as adjusting signal light timing or limiting entry.
Referring to fig. 2, the topology dynamic state characterization unit 121 includes a state acquisition subunit 1211, a topology characterization subunit 1212, and a dimension conversion subunit 1213.
The state acquisition subunit 1211 is configured to receive the environmental state information and the vehicle information from the perception layer 101 and to preprocess these raw data to generate standardized data. In one embodiment of the invention, the preprocessing includes denoising, filtering and normalization. For example, for vehicle speed data, mean filtering may be used to remove noise, and min-max normalization then maps the speed values to the [0,1] interval.
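A small NumPy sketch of this preprocessing step (the window size and the speed trace are illustrative values, not from the patent): moving-average filtering followed by min-max normalization onto [0,1].

```python
import numpy as np

def mean_filter(x, window=3):
    """Moving-average denoising of a 1-D speed trace."""
    kernel = np.ones(window) / window
    return np.convolve(x, kernel, mode="same")

def min_max_normalize(x):
    """Map values onto the [0, 1] interval."""
    lo, hi = x.min(), x.max()
    return np.zeros_like(x) if hi == lo else (x - lo) / (hi - lo)

speeds = np.array([30.0, 32.0, 95.0, 31.0, 29.0])  # trace with one noisy spike
clean = min_max_normalize(mean_filter(speeds))
```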
The topology characterization subunit 1212 is configured to map the standardized data to a multidimensional topology space, extract topology features, and establish a local coordinate system. Specifically, a state space of the vehicle-road environment is first defined as $S = S_v \times S_r$, where the vehicle states (position, speed, direction) and the roadside states (traffic light state, road network topology) form the subspaces $S_v$ and $S_r$ respectively. For each state point $s \in S$, the vehicle-road environment is expressed accurately through a local coordinate chart. Preferably, the topology characterization subunit 1212 also builds a diffeomorphic mapping $\varphi$ from the state space $S$ to the policy space $A$, preserving topological invariance between state changes and policy adjustments so that the system remains stable in complex traffic environments. The diffeomorphic mapping can be defined as:

$$\varphi(s) = \sigma(W s + b),$$

where $s$ is a state point, $W$ is a weight matrix, $\sigma$ is a nonlinear activation function, and $b$ is a bias vector. Through this mapping, points with similar topological structure in the state space are mapped to nearby points in the policy space, ensuring continuity and stability of the policy.
The dimension conversion subunit 1213 is configured to perform nonlinear dimension reduction on the data of the multidimensional topological space to generate a manifold space representation that retains the key topological relationships. In one embodiment of the invention, a Locally Linear Embedding (LLE) algorithm may be employed. The core idea of LLE is to preserve the local linear relationships among data points while performing a globally nonlinear dimension reduction. Specifically, for each data point $x_i$, its $k$ nearest neighbors $x_j$ are first found, and a weight matrix $W$ is computed such that $x_i$ can be reconstructed from a linear combination of its neighbors:

$$\min_W \sum_i \Big\| x_i - \sum_j w_{ij} x_j \Big\|^2, \quad \text{s.t.} \; \sum_j w_{ij} = 1,$$

where $w_{ij}$ is the contribution weight of point $x_j$ to point $x_i$. Then, low-dimensional representations $y_i$ satisfying the same weight relationships are found:

$$\min_Y \sum_i \Big\| y_i - \sum_j w_{ij} y_j \Big\|^2,$$

which yields a representation of the original high-dimensional data on a low-dimensional manifold. In practical application, the number of neighbors $k$ can be adjusted dynamically according to the complexity of the traffic scene: when traffic flow is heavy, a larger $k$ can be selected to capture more interactions, and when traffic flow is light, a smaller $k$ can be selected to reduce the amount of computation.
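A self-contained NumPy implementation of the LLE steps just described (neighbor search, constrained local weight solve, bottom eigenvectors of $(I-W)^\top(I-W)$); the regularization constant and toy data are assumptions:

```python
import numpy as np

def lle(X, k=5, d=2, reg=1e-3):
    """Locally Linear Embedding: X is (n, D); returns an (n, d) embedding."""
    n = X.shape[0]
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    nbrs = np.argsort(dist, axis=1)[:, 1:k + 1]   # k nearest neighbors, skip self
    W = np.zeros((n, n))
    for i in range(n):
        Z = X[nbrs[i]] - X[i]                     # neighbors centered on x_i
        C = Z @ Z.T                               # local covariance
        C += reg * np.trace(C) * np.eye(k)        # regularize for stability
        w = np.linalg.solve(C, np.ones(k))
        W[i, nbrs[i]] = w / w.sum()               # reconstruction weights sum to 1
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    _, vecs = np.linalg.eigh(M)                   # eigenvalues in ascending order
    return vecs[:, 1:d + 1]                       # drop the constant eigenvector

rng = np.random.default_rng(1)
t = rng.uniform(0, 3, 60)
X = np.c_[np.cos(t), np.sin(t), 0.05 * rng.normal(size=60)]  # noisy 1-D arc in 3-D
Y = lle(X, k=8, d=1)                              # recover a 1-D manifold coordinate
```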
The policy characterization subunit 1225 is configured to convert the vehicle scheduling policy and the roadside scheduling policy into a policy matrix, and perform matrix decomposition to generate a low-rank representation. Specifically, the policies of the two networks are represented as a higher-order tensor $\mathcal{M}$, which is then decomposed into a combination of a core tensor and factor matrices through a Tucker decomposition:

$$\mathcal{M} \approx \mathcal{G} \times_1 U^{(1)} \times_2 U^{(2)} \times_3 U^{(3)},$$

where $\mathcal{G}$ is the core tensor, $U^{(1)}, U^{(2)}, U^{(3)}$ are factor matrices, and $\times_n$ denotes the tensor-matrix product along the $n$-th mode. In this way, the complexity of the policy representation can be substantially reduced: truncating the ranks of the factor matrices shrinks the stored parameter count from the product of the full mode dimensions to that of the much smaller truncated ranks, greatly reducing storage and computation requirements.
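A compact HOSVD-style Tucker decomposition in NumPy illustrating the core-tensor and factor-matrix structure above; the tensor shape and ranks are arbitrary examples, and full ranks are used so the reconstruction is lossless:

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding of a tensor into a matrix."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_dot(T, M, mode):
    """Tensor-matrix product along the given mode."""
    return np.moveaxis(np.tensordot(M, np.moveaxis(T, mode, 0), axes=(1, 0)), 0, mode)

def tucker(T, ranks):
    """HOSVD: returns core tensor G and factor matrices U[n]."""
    U = [np.linalg.svd(unfold(T, m), full_matrices=False)[0][:, :r]
         for m, r in enumerate(ranks)]
    G = T
    for m, u in enumerate(U):
        G = mode_dot(G, u.T, m)                   # project onto each factor basis
    return G, U

rng = np.random.default_rng(2)
T = rng.normal(size=(6, 5, 4))                    # toy policy tensor
G, U = tucker(T, ranks=(6, 5, 4))                 # full ranks: exact decomposition
R = G
for m, u in enumerate(U):
    R = mode_dot(R, u, m)                         # reconstruct M = G x1 U1 x2 U2 x3 U3
```

With truncated ranks the same code yields the low-rank approximation the text describes, trading a small reconstruction error for far fewer parameters.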
The collaborative optimization subunit 1226 is configured to identify potential conflicts between the vehicle scheduling policy and the roadside scheduling policy, and adjust the policy parameters by a conjugate gradient method. In one particular embodiment, collaborative optimization can be expressed as the following optimization problem:

$$\min_{\theta_v, \theta_r} \; L_v(\theta_v) + L_r(\theta_r) + \lambda \, L_{\mathrm{joint}}(\theta_v, \theta_r),$$

where $\theta_v$ and $\theta_r$ are the parameters of the VDDPG and RDDPG networks respectively, $L_v$ and $L_r$ are the independent loss functions of the two networks, $L_{\mathrm{joint}}$ is a joint loss function measuring the degree of coordination between the two networks' output policies, and $\lambda$ is a trade-off coefficient adjusting the ratio of independent to collaborative optimization. The joint loss can be written as:

$$L_{\mathrm{joint}} = \sum_{i,j} p_{ij} \, c(a_i^v, a_j^r),$$

where $a_i^v$ denotes the action of vehicle $i$, $a_j^r$ denotes the action of roadside facility $j$, $c(\cdot,\cdot)$ is a measure of the conflict between the two actions, and $p_{ij}$ is the probability of interaction between vehicle $i$ and roadside facility $j$.
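A tiny numeric sketch of this combined objective; the squared-difference conflict measure and the interaction probabilities are made-up examples, since the patent does not fix a particular $c(\cdot,\cdot)$:

```python
import numpy as np

def joint_loss(a_v, a_r, p):
    """L_joint = sum_{i,j} p_ij * c(a_i, a_j), with an assumed squared-difference conflict."""
    conflict = (a_v[:, None] - a_r[None, :]) ** 2   # c(a_i^v, a_j^r), illustrative
    return float(np.sum(p * conflict))

def total_loss(loss_v, loss_r, a_v, a_r, p, lam=0.5):
    """L_v + L_r + lambda * L_joint."""
    return loss_v + loss_r + lam * joint_loss(a_v, a_r, p)

a_v = np.array([1.0, 0.0])          # two vehicle actions (scalars for brevity)
a_r = np.array([1.0])               # one roadside action
p = np.array([[1.0], [0.5]])        # interaction probabilities p_ij
```

Here vehicle 0 agrees with the roadside action (zero conflict) while vehicle 1 conflicts, so only the second term contributes, weighted by its interaction probability.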
Referring to fig. 4, the multi-agent collaborative optimization unit 123 includes an agent relationship creation subunit 1231, a message passing optimization subunit 1232, and a global consistency optimization subunit 1233.
The agent relationship creation subunit 1231 is used to create an agent relationship graph based on the current traffic conditions and to identify the conditional dependency relationships between agents. Specifically, the vehicles and roadside devices are modeled as a Markov random field G = (V, E), where the nodes V represent agents (e.g., vehicles, signal lights) and the edges E represent interaction relationships between agents (e.g., a following relationship between vehicles, or a control relationship between a vehicle and a signal light). Through conditional random field theory, the dependency relationships among multiple agents can be captured and formally expressed as:
P(S | O) = (1 / Z(O)) ∏_{c ∈ C} ψ_c(S_c, O),
Wherein, S represents the set of agent states, O represents the observed variables, Z(O) is a normalization factor, and ψ_c is a potential function defined on the clique c. In the present system, the position, speed, etc. of a vehicle may be used as state variables, and the environment awareness information may be used as observation variables. The message passing optimization subunit 1232 is configured to calculate the information entropy of each communication channel in the agent relationship graph, generate an optimal message passing policy, and allocate communication resources. Based on information entropy theory, the information value of each communication channel can be estimated:
H(S_j | S_i) = −Σ_{s_i, s_j} p(s_i, s_j) log p(s_j | s_i),
Wherein, H(S_j | S_i) is the conditional entropy of agent j's state given that agent i's state is known; the lower this conditional entropy, the higher the value of the communication channel. Based on this, a communication policy can be designed that preferentially allocates resources to channels of high information value.
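The conditional-entropy criterion can be sketched as follows; the probability tables and the base-2 logarithm are illustrative assumptions.

```python
import numpy as np

def conditional_entropy(joint):
    """H(S_j | S_i) from a joint probability table p(s_i, s_j), in bits.
    Lower values mean agent i's state already reveals agent j's state,
    which the text treats as a high-value communication channel."""
    p_i = joint.sum(axis=1, keepdims=True)                  # marginal p(s_i)
    cond = joint / np.where(p_i > 0, p_i, 1.0)              # p(s_j | s_i)
    log_cond = np.where(cond > 0, np.log2(np.where(cond > 0, cond, 1.0)), 0.0)
    return float(-(joint * log_cond).sum())

# Strongly coupled agents: knowing S_i nearly determines S_j.
coupled = np.array([[0.45, 0.05], [0.05, 0.45]])
# Independent agents: knowing S_i tells us nothing about S_j.
independent = np.array([[0.25, 0.25], [0.25, 0.25]])
```

Ranking channels by this quantity and allocating bandwidth to the lowest-entropy (highest-value) pairs first reproduces the allocation rule described above.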
For example, in practical application, when two vehicles are close together and traveling at high speed, the communication between them has high value and should be guaranteed preferentially; when two vehicles are far apart or on different road sections, their communication frequency can be reduced to save resources. Experiments show that this information-entropy-based communication strategy can reduce communication overhead by more than 40% while preserving 90% of the communication effect.
The global consistency optimization subunit 1233 is configured to decompose the global optimization objective into local sub-objectives, coordinate the local decisions using variational inference, and verify the global consistency of the final decisions. Specifically, through the variational Bayesian method, the complex posterior distribution p(S | O) can be approximated by a simpler factorized distribution:
q(S) = ∏_i q_i(S_i),
The goal is to minimize the KL divergence between the two distributions:
q* = argmin_q KL(q(S) ‖ p(S | O)),
By iteratively optimizing each agent's local distribution q_i(S_i), a globally consistent decision is finally reached.
In one embodiment of the invention, the global optimization objective may be set to minimize the overall system delay time, maximize traffic flow, minimize energy consumption, etc. This objective can be decomposed into local objectives for each agent, and variational inference ensures that the local decisions remain consistent with the global objective. For example, the goal of minimizing overall system delay can be decomposed into local goals such as minimizing the transit delay at each intersection and optimizing the route of each vehicle. Through the message passing mechanism, these local decisions are coordinated to jointly achieve the global goal.
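A minimal mean-field sketch of this variational scheme, assuming a two-variable discrete joint distribution: each local factor q_i is updated in turn, which drives down the KL divergence to the target posterior. The function names and the toy distribution are assumptions.

```python
import numpy as np

def mean_field(p, iters=100):
    """Approximate a two-variable joint p(s1, s2) by q1(s1) * q2(s2) via
    coordinate-ascent mean-field updates, reducing KL(q || p)."""
    n1, n2 = p.shape
    q1 = np.full(n1, 1.0 / n1)
    q2 = np.full(n2, 1.0 / n2)
    logp = np.log(p)
    for _ in range(iters):
        q1 = np.exp(logp @ q2); q1 /= q1.sum()   # q1(s1) ∝ exp(E_q2[log p])
        q2 = np.exp(q1 @ logp); q2 /= q2.sum()   # q2(s2) ∝ exp(E_q1[log p])
    return q1, q2

def kl_factorized(q1, q2, p):
    """KL( q1 ⊗ q2 || p ) for strictly positive p."""
    q = np.outer(q1, q2)
    return float((q * np.log(q / p)).sum())

p = np.array([[0.5, 0.2], [0.2, 0.1]])   # assumed toy posterior p(S | O)
q1, q2 = mean_field(p)
kl_opt = kl_factorized(q1, q2, p)
kl_uniform = kl_factorized(np.array([0.5, 0.5]), np.array([0.5, 0.5]), p)
```

The converged factors approximate the marginals of p; in the scheduling setting each q_i would instead range over one agent's local decisions.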
In one embodiment of the present invention, the perception layer 101 further includes a data preprocessing module, a state caching module, and an attention distribution module.
The data preprocessing module is used for filtering, denoising, and standardizing the environmental state information and vehicle information. In a specific implementation, a Kalman filter can filter vehicle position and speed data, mean filtering can smooth traffic flow data, and min-max or Z-score normalization then brings data of different types and scales into a common frame.
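A minimal sketch of the preprocessing pipeline described above, assuming scalar measurements; the noise variances and sample values are illustrative.

```python
import numpy as np

class ScalarKalman:
    """Minimal 1-D Kalman filter for smoothing a noisy position/speed stream.
    The process (q) and measurement (r) noise variances are assumptions."""
    def __init__(self, q=1e-3, r=0.25):
        self.q, self.r = q, r
        self.x, self.p = None, 1.0      # state estimate and its variance
    def update(self, z):
        if self.x is None:              # initialize from the first measurement
            self.x = z
            return self.x
        self.p += self.q                # predict: variance grows
        k = self.p / (self.p + self.r)  # Kalman gain
        self.x += k * (z - self.x)      # correct toward the measurement
        self.p *= (1.0 - k)
        return self.x

def zscore(values):
    """Z-score normalization (zero mean, unit variance)."""
    v = np.asarray(values, dtype=float)
    std = v.std()
    return (v - v.mean()) / std if std > 0 else np.zeros_like(v)

kf = ScalarKalman()
noisy_speed = [10.2, 9.8, 10.1, 10.0, 9.9, 10.3]
smoothed = [kf.update(z) for z in noisy_speed]
normalized = zscore(noisy_speed)
```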
The state buffer module is used to store historical environmental state information and vehicle information, supporting time-series analysis. By maintaining a state sequence over a time window, the system can analyze trends in traffic parameters, predict future traffic conditions, and make scheduling decisions in advance. For example, by analyzing traffic flow changes over the past 30 minutes, the system can predict the flow trend for the next 15 minutes and adjust the signal timing accordingly.
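A sketch of such a windowed state buffer, using a least-squares slope as the simplest possible trend estimate; the window size and sample values are assumptions.

```python
from collections import deque

class StateBuffer:
    """Fixed-window buffer of traffic observations supporting simple
    time-series analysis; the window size is an illustrative assumption."""
    def __init__(self, maxlen=360):        # e.g. 30 min at one sample / 5 s
        self.window = deque(maxlen=maxlen)
    def push(self, flow):
        self.window.append(flow)
    def trend(self):
        """Least-squares slope of flow over the window (> 0 means rising)."""
        n = len(self.window)
        if n < 2:
            return 0.0
        mx = (n - 1) / 2.0
        my = sum(self.window) / n
        num = sum((x - mx) * (y - my) for x, y in enumerate(self.window))
        den = sum((x - mx) ** 2 for x in range(n))
        return num / den

buf = StateBuffer(maxlen=10)
for flow in [100, 102, 105, 107, 111]:     # vehicles per interval (toy data)
    buf.push(flow)
```

A positive slope would prompt the decision layer to lengthen green phases on the congested approach before the queue actually forms.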
The attention allocation module is configured to adjust the allocation policy of sensing resources according to feedback from the decision layer 102. Since sensing resources are limited, the perception needs of different regions and times differ. The attention allocation module dynamically adjusts the allocation of sensing resources according to current traffic conditions and decision requirements, for example by increasing the sampling frequency in areas with heavy traffic or improving data precision at key intersections.
In another embodiment of the present invention, the execution layer 103 further includes an instruction parsing module, an execution monitoring module, an execution history module, and a degradation processing module.
The instruction parsing module is used to convert the collaborative scheduling scheme into specific control instructions. For example, a high-level command to reduce traffic flow is converted into concrete control parameters, such as setting a signal light to 30 seconds of red and 45 seconds of green.
The execution monitoring module is used to monitor the execution of control instructions and identify abnormal states. When the system detects that the execution deviation exceeds a threshold, an emergency handling mechanism may be triggered. For example, when the vehicle's actual deceleration is smaller in magnitude than the commanded value, the system may issue a warning and adjust subsequent commands.
The execution history module is used for maintaining execution history data for reference by the decision layer 102. By analyzing the historical execution data, the system can learn the actual effect of instruction execution and optimize the decision model. For example, by analyzing the actual effects of different timing schemes, the relationship between traffic flow and signal timing is learned.
The degradation processing module is used for executing a preset degradation strategy under the condition of communication interruption or equipment failure. The system designs a multi-level degradation strategy to ensure that basic functionality is maintained in various abnormal situations. For example, when the vehicle road communication is interrupted, the vehicle can switch to a local decision mode, and when the central server fails, the road side controller can adopt a preset fixed timing scheme.
In still another embodiment of the present invention, a timing synchronization unit, a conflict detection unit, and a cooperative effect evaluation unit are provided between the vehicle execution module 131 and the roadside control module 132.
The timing synchronization unit is used to ensure the time synchronization of vehicle control commands and roadside control commands. In a distributed system, time synchronization across nodes is a critical issue. The timing synchronization unit adopts the Network Time Protocol (NTP) to synchronize the clocks of all nodes in the system, ensuring that control instructions are executed in the correct order.
The conflict detection unit is used to detect potential conflicts during execution in real time and trigger emergency handling. For example, when the system detects that multiple vehicles may enter the same road section at the same time and cause congestion, it can adjust the scheduling scheme in advance to avoid the conflict.
The cooperative effect evaluation unit is used for quantitatively evaluating the execution effect of the vehicle-road cooperative control and generating an effect evaluation report. The system designs multidimensional evaluation indexes including average delay time, energy consumption, system throughput and the like, and comprehensively evaluates the scheduling effect through the indexes to provide basis for system optimization.
In one embodiment of the invention, information exchange is achieved between VDDPG and RDDPG networks through a shared hidden layer.
In particular, the shared hidden layer receives common features of the manifold spatial representation, outputting an intermediate feature representation. The VDDPG network and RDDPG network each receive this intermediate feature representation and, in combination with the respective specific input features, generate corresponding policy outputs. This design enables two networks to share a basic understanding of the environment while maintaining their own expertise.
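The shared-hidden-layer design can be sketched as follows. All layer sizes, the random weights, and the activation choices are assumptions; only the wiring (one shared trunk feeding two heads, each head concatenating its own specific features) reflects the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Assumed layer sizes; the real networks' dimensions are not given.
W_shared = rng.normal(size=(16, 8))       # shared hidden layer
W_vehicle = rng.normal(size=(8 + 4, 2))   # VDDPG head: shared + vehicle features
W_roadside = rng.normal(size=(8 + 3, 2))  # RDDPG head: shared + roadside features

def forward(env_state, vehicle_feat, roadside_feat):
    """Both heads consume the same shared representation, then combine it
    with their own specific inputs to produce separate policy outputs."""
    shared = relu(env_state @ W_shared)    # common understanding of environment
    a_vehicle = np.tanh(np.concatenate([shared, vehicle_feat]) @ W_vehicle)
    a_roadside = np.tanh(np.concatenate([shared, roadside_feat]) @ W_roadside)
    return a_vehicle, a_roadside

a_v, a_r = forward(rng.normal(size=16), rng.normal(size=4), rng.normal(size=3))
```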
In addition, the VDDPG network and the RDDPG network realize collaborative parameter updates through a gradient locking mechanism, ensuring the consistency of policy coordination. The core idea of the gradient locking mechanism is to account for the interaction of the two networks during gradient updates:
θ_V ← θ_V − α(∇_{θ_V} L_V + β ∇_{θ_V} L_joint),
θ_R ← θ_R − α(∇_{θ_R} L_R + β ∇_{θ_R} L_joint),
Wherein, α is the learning rate and β is a factor between 0 and 1 controlling the intensity of collaborative optimization. In practical application, β can be dynamically adjusted according to the stability of the system: for example, a smaller value (e.g., 0.1) is used in the early stage of training to ensure convergence, and it is gradually increased to 0.5 or higher as training proceeds to enhance the synergistic effect.
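The gradient-locking update and the β schedule can be sketched as below, assuming scalar parameters and externally supplied gradients; the warm-up length is an illustrative assumption.

```python
def beta_schedule(step, warmup=10_000, beta_min=0.1, beta_max=0.5):
    """Ramp the coupling factor from 0.1 to 0.5 over `warmup` steps,
    matching the training schedule described above."""
    frac = min(step / warmup, 1.0)
    return beta_min + frac * (beta_max - beta_min)

def gradient_lock_step(theta_v, theta_r, grad_v, grad_r,
                       grad_joint_v, grad_joint_r, lr=1e-3, beta=0.1):
    """One coupled update: each network descends its own loss gradient plus
    a beta-weighted gradient of the shared coordination loss."""
    theta_v = theta_v - lr * (grad_v + beta * grad_joint_v)
    theta_r = theta_r - lr * (grad_r + beta * grad_joint_r)
    return theta_v, theta_r

tv, tr = gradient_lock_step(1.0, 1.0, 1.0, 2.0, 1.0, 1.0, lr=0.1, beta=0.1)
```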
In one embodiment of the invention, the system further comprises an experience playback module, a target network update module, an adaptive learning rate adjustment module, and a model evaluation and deployment module.
The experience playback module is used for storing historical data of system and environment interaction and supporting offline batch learning. Specifically, the current state of the agent's interaction with the environment, the selected action, the next state, and the rewards obtained are stored as a quadruple (s, a, s', r) in an experience pool. In the training process, batch data are randomly extracted from the experience pool for learning, so that time sequence correlation among samples is broken, and learning stability and learning efficiency are improved.
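A minimal uniform replay buffer matching the (s, a, s', r) quadruple described above; the capacity and toy transitions are assumptions.

```python
import random
from collections import deque

class ReplayBuffer:
    """Uniform experience replay for (s, a, s', r) quadruples: sampling
    random minibatches breaks the temporal correlation between samples."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)
    def push(self, s, a, s_next, r):
        self.buffer.append((s, a, s_next, r))
    def sample(self, batch_size):
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))
    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=1000)
for t in range(50):                        # toy transitions (assumed values)
    buf.push(t, t % 4, t + 1, float(t))
batch = buf.sample(8)
```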
The target network update module is used to periodically copy parameters from the main network to the target network, ensuring learning stability. In deep reinforcement learning, using a single network for both value estimation and target calculation can lead to instability, so a slowly updated target network is typically maintained. The target network parameters follow a soft update strategy:
θ' ← τθ + (1 − τ)θ',
Wherein, τ is the soft update coefficient, typically a small value such as 0.01, to ensure the stability of the target network.
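The soft update rule, applied parameter-wise; the dictionary representation of parameters is an assumption, and a real implementation would iterate over network tensors instead.

```python
def soft_update(target, main, tau=0.01):
    """theta_target <- tau * theta_main + (1 - tau) * theta_target,
    applied per named parameter; a small tau keeps the target network
    changing slowly, stabilizing the learning targets."""
    return {name: tau * main[name] + (1.0 - tau) * target[name] for name in target}

target_params = {"w": 0.0, "b": 1.0}   # toy scalar "parameters" (assumed)
main_params = {"w": 1.0, "b": 0.0}
target_params = soft_update(target_params, main_params, tau=0.01)
```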
The self-adaptive learning rate adjustment module is used for dynamically adjusting the network learning rate according to training progress. In the initial stage of training, a larger learning rate (e.g. 0.001) can be used for quick exploration, and the learning rate is gradually reduced (e.g. to 0.0001) to realize fine adjustment as training progresses. In addition, the learning rate can be adaptively adjusted according to the change trend of the loss function, for example, when the loss is continuous for a plurality of rounds and does not drop, the learning rate is reduced.
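A reduce-on-plateau scheduler matching the behavior described above; the patience and halving factor are illustrative assumptions, while the initial and minimum rates follow the text.

```python
class PlateauScheduler:
    """Halve the learning rate when the loss has not improved for `patience`
    consecutive checks, never dropping below `lr_min`."""
    def __init__(self, lr=1e-3, lr_min=1e-4, patience=5, factor=0.5):
        self.lr, self.lr_min = lr, lr_min
        self.patience, self.factor = patience, factor
        self.best, self.bad_rounds = float("inf"), 0
    def step(self, loss):
        if loss < self.best:
            self.best, self.bad_rounds = loss, 0
        else:
            self.bad_rounds += 1
            if self.bad_rounds >= self.patience:
                self.lr = max(self.lr * self.factor, self.lr_min)
                self.bad_rounds = 0
        return self.lr

sched = PlateauScheduler(patience=3)
losses = [1.0, 0.9, 0.9, 0.9, 0.9]   # loss stalls after the second round
rates = [sched.step(l) for l in losses]
```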
The model evaluation and deployment module is used to evaluate model performance and deploy the trained model to the production environment. The system designs a complete model evaluation flow, including offline evaluation and online A/B testing. Before deployment, the model undergoes stress testing and safety evaluation to ensure it works normally under various conditions.
The invention also provides a vehicle-road collaborative dynamic scheduling method for multi-agent reinforcement learning, which is applied to the vehicle-road collaborative dynamic scheduling system for multi-agent reinforcement learning and comprises the following steps:
1. Initializing the system, including initializing the VDDPG and RDDPG network parameters, establishing communication connections, and loading the road network topology and historical traffic data. In this step, network parameters may be initialized randomly or from a pre-trained model; standard protocols such as MQTT ensure stable communication connections; and the road network and historical traffic data are used for preliminary system configuration and model pre-training.
2. Collecting state information, including collecting environmental state information through the roadside sensing module and vehicle information through the vehicle sensing module. The environmental state information includes traffic flow, signal light states, road network structure, etc.; the vehicle information includes position, speed, direction, destination, etc. The system acquires these data in real time through the sensor network, with a sampling frequency dynamically adjusted according to scene complexity, generally 5-10 Hz.
3. Performing topological state characterization, including mapping the state information into a multidimensional topological space, extracting topological features, and forming the spatial representation through nonlinear denoising. This step adopts the topological dynamic state characterization mechanism described in detail above to compress the high-dimensional, complex traffic environment state into a low-dimensional representation while preserving key topological features.
4. Generating a dual-network decision, including the VDDPG Actor network generating a vehicle scheduling policy, the RDDPG Actor network generating a roadside scheduling policy, and performing policy matrix decomposition to generate a low-rank representation. This step adopts the dual-network collaborative decision mechanism described in detail above, processing the vehicle strategy and the roadside strategy through two dedicated networks and reducing complexity through matrix decomposition.
5. Performing collaborative optimization, including building an agent relationship graph, optimizing a message passing strategy, coordinating local decisions by a variation inference method, and verifying global consistency. The step adopts the multi-agent cooperative decision mechanism described in detail above to ensure that the local decisions of all agents can be coordinated and consistent, and the global optimization target is realized together.
6. Executing the scheduling scheme, including converting the optimized decision into specific execution instructions, which the vehicles and roadside facilities then carry out. Vehicle control instructions include speed adjustment, path planning, etc.; roadside control instructions include signal timing adjustment, lane allocation, etc. The system ensures that instructions are executed at the correct times and monitors execution in real time.
7. The execution feedback is collected, including monitoring execution results, updating environmental conditions, updating network parameters and experience pools based on the execution results. The system collects the environmental change and the vehicle state after execution through the sensor network, calculates the difference between the actual effect and the expected effect, and generates a reward signal for updating network parameters.
8. Iterative optimization, repeatedly executing the steps, and continuously optimizing the scheduling effect. The system continuously operates, continuously learns and adapts to the changed traffic environment, and gradually improves the dispatching efficiency and effect.
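The eight steps above can be sketched as a single control loop. Every object here is a trivial stub standing in for the modules described earlier, so only the control flow, not any real scheduling logic, is meaningful.

```python
class _Stub:
    """Trivial placeholder for the environment, networks, and optimizer,
    so the control flow below is runnable."""
    def observe(self): return 0.0
    def topology_features(self, s): return s
    def act(self, f): return 0.0
    def coordinate(self, av, ar): return av, ar
    def execute(self, av, ar): return 0.0, 1.0
    def learn(self, *args): pass

def scheduling_loop(env, vddpg, rddpg, optimizer, steps=1000):
    """High-level sketch of steps 2-8: sense, characterize, decide,
    coordinate, act, and learn from feedback."""
    state = env.observe()                                    # step 2: collect state
    total_reward = 0.0
    for _ in range(steps):
        features = env.topology_features(state)              # step 3: characterization
        a_v, a_r = vddpg.act(features), rddpg.act(features)  # step 4: dual decision
        a_v, a_r = optimizer.coordinate(a_v, a_r)            # step 5: collaboration
        state, reward = env.execute(a_v, a_r)                # step 6: execute scheme
        vddpg.learn(state, a_v, reward)                      # step 7: learn from feedback
        rddpg.learn(state, a_r, reward)
        total_reward += reward                               # step 8: iterate
    return total_reward

total = scheduling_loop(_Stub(), _Stub(), _Stub(), _Stub(), steps=3)
```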
The multi-agent reinforcement learning vehicle-road collaborative dynamic scheduling system provided by the invention has good performance in practical application. Through test verification in a plurality of urban traffic scenes, the system can obviously improve traffic efficiency, reduce congestion and reduce energy consumption.
Specifically, compared with traditional fixed-timing signal control, the system reduces average vehicle delay by more than 30%; compared with simple adaptive signal control, delay is reduced by 15%. In complex traffic scenes, the system's computing resource requirements are 50% lower and its communication bandwidth requirements 40% lower than a centralized decision architecture, greatly improving system scalability.
In addition, the system has good adaptability and robustness, and can cope with abnormal conditions such as abrupt change of traffic flow, equipment failure, communication interruption and the like, and ensure the stable operation of the traffic system.
The foregoing description is only of the preferred embodiments of the invention and is not intended to limit the invention thereto. Various obvious changes and modifications to the present invention may be made by those skilled in the art without departing from the scope of the invention, and such changes and modifications are intended to be within the scope of the invention.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202510521087.5A CN120048123B (en) | 2025-04-24 | 2025-04-24 | Vehicle-road collaborative dynamic scheduling system and method for multi-agent reinforcement learning |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN120048123A true CN120048123A (en) | 2025-05-27 |
| CN120048123B CN120048123B (en) | 2025-07-04 |
Citations (9)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN102169631A (en) * | 2011-04-21 | 2011-08-31 | 福州大学 | Manifold-learning-based traffic jam event cooperative detecting method |
| CN114757092A (en) * | 2022-03-24 | 2022-07-15 | 南京大学 | System and method for training multi-agent cooperative communication strategy based on teammate perception |
| US20230252895A1 (en) * | 2020-12-11 | 2023-08-10 | China Intelligent And Connected Vehicles (Beijing) Research Institute Co., Ltd | Method, device and electronic equipment for vehicle cooperative decision-making and computer storage medium |
| WO2024016386A1 (en) * | 2022-07-19 | 2024-01-25 | 江苏大学 | Multi-agent federated reinforcement learning-based vehicle-road collaborative control system and method under complex intersection |
| CN117935562A (en) * | 2024-03-22 | 2024-04-26 | 山东双百电子有限公司 | Traffic light control method and system based on deep learning |
| US20240298294A1 (en) * | 2023-03-02 | 2024-09-05 | Hong Kong Applied Science And Technology Research Institute Co., Ltd. | System And Method For Road Monitoring |
| CN119233227A (en) * | 2024-12-02 | 2024-12-31 | 南京信息职业技术学院 | Internet of vehicles multi-parameter monitoring method and system based on 5G multi-access edge calculation |
| CN119428755A (en) * | 2024-10-29 | 2025-02-14 | 天津大学 | A collaborative decision-making method for intelligent connected vehicles based on dual interactive perception |
| WO2025074369A1 (en) * | 2023-10-03 | 2025-04-10 | Telefonaktiebolaget Lm Ericsson (Publ) | System and method for efficient collaborative marl training using tensor networks |
Non-Patent Citations (3)
| Title |
|---|
| ZHU Yingda: "Research on Distributed Clustering and Inference Algorithms Based on Multi-Agent Consensus Theory", China Master's Theses Full-text Database (Information Science and Technology), 15 January 2021 (2021-01-15), pages 140 - 124 * |
| LIN Anya: "Dynamic Consensus Averaging Algorithms in Multi-Agent Networks and Their Applications", China Master's Theses Full-text Database (Information Science and Technology), 15 January 2017 (2017-01-15), pages 140 - 9 * |
| HU Jun; WANG Kai: "Attribute Reduction of Continuous-Valued Distributed Data Based on Neighborhood Rough Sets", Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition), no. 06, 15 December 2017 (2017-12-15) * |