CN118200135A - Method for optimizing transmission performance of AOC optical module by using deep reinforcement learning - Google Patents
- Publication number
- CN118200135A (application CN202410607959.5A)
- Authority
- CN
- China
- Prior art keywords
- action
- policy
- state
- optical module
- value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/08—Configuration management of networks or network elements
- H04L41/0803—Configuration setting
- H04L41/0813—Configuration setting characterised by the conditions triggering a change of settings
- H04L41/0816—Configuration setting characterised by the conditions triggering a change of settings the condition being an adaptation, e.g. in response to network events
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/092—Reinforcement learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/098—Distributed learning, e.g. federated learning
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/08—Configuration management of networks or network elements
- H04L41/0803—Configuration setting
- H04L41/0823—Configuration setting characterised by the purposes of a change of settings, e.g. optimising configuration for enhancing reliability
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/08—Configuration management of networks or network elements
- H04L41/0876—Aspects of the degree of configuration automation
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/08—Configuration management of networks or network elements
- H04L41/0894—Policy-based network configuration management
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/16—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks using machine learning or artificial intelligence
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04Q—SELECTING
- H04Q11/00—Selecting arrangements for multiplex systems
- H04Q11/0001—Selecting arrangements for multiplex systems using optical switching
- H04Q11/0062—Network aspects
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04Q—SELECTING
- H04Q11/00—Selecting arrangements for multiplex systems
- H04Q11/0001—Selecting arrangements for multiplex systems using optical switching
- H04Q11/0062—Network aspects
- H04Q2011/0086—Network resource allocation, dimensioning or optimisation
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Signal Processing (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Software Systems (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Automation & Control Theory (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
- Telephonic Communication Services (AREA)
Abstract
The invention relates to the technical field of optical modules, and in particular to a method for optimizing the transmission performance of an AOC optical module by using deep reinforcement learning, which comprises the following steps. Step 1: establishing an AOC optical module reinforcement learning environment model, which comprises a state space, an action space, a state transition probability and a return function. Step 2: estimating an action value function and a policy value function. Step 3: improving the policy using a policy gradient method, wherein the policy parameters are updated by a policy gradient that maximizes the expected return. Step 4: updating the action value function using distributed reinforcement learning. Step 5: based on the improved policy value function, selecting the policy that maximizes the action value function so as to optimize the transmission performance of the AOC optical module. Through intelligent autonomous learning and optimization, the invention realizes adaptive improvement of AOC optical module performance, maximizes data transmission efficiency and reliability, continuously improves performance, realizes automatic operation and maintenance, and reduces cost.
Description
Technical Field
The invention relates to the technical field of optical modules, in particular to a method for optimizing transmission performance of an AOC optical module by using deep reinforcement learning.
Background
With the rapid development of information technology, high-speed data transmission has become increasingly important. As a high-speed and high-bandwidth data transmission mode, the optical communication technology plays a vital role in the fields of data centers, communication networks, large-scale computing and the like. The optical module is used as a key component of an optical communication system, and the performance of the optical module has a decisive influence on the stability and reliability of the whole system. In this context, deep reinforcement learning techniques have been developed aimed at optimizing AOC optical module transmission performance to meet modern communication requirements.
An AOC optical module is a high-speed, high-bandwidth optical communication device that is commonly used for high-speed data transmission inside a data center. It consists of a transmitter and a receiver, which are capable of converting electrical signals into optical signals and transmitting data between optical fibers. However, in practical applications, the performance of an AOC optical module may be affected by various factors, such as transmission speed, signal-to-noise ratio, power consumption, and the like. Therefore, an intelligent method is needed to optimize the transmission performance of the AOC optical module to improve the efficiency and reliability of data transmission.
Traditionally, methods of optimizing AOC optical modules have generally relied on manual parameter design and adjustment, which requires significant manpower and time and may not fully exploit the potential of the AOC optical module. Furthermore, conventional approaches often fail to accommodate complex communication environments and changing demands, so a more intelligent approach is needed to address these issues. Reinforcement learning is a method of learning an optimal action strategy based on an agent's interactions with its environment. It has the following potential for optimizing the transmission performance of an AOC optical module:
Autonomous learning: reinforcement learning allows the AOC optical module agent to autonomously learn optimal strategies from interactions with the environment, without manually designed parameters.
Adaptability: reinforcement learning can dynamically adjust strategies to adapt to different situations according to changing communication environments and requirements.
Performance optimization: reinforcement learning can improve data transmission efficiency and reliability by continuously trying different strategies to optimize the performance of the AOC optical module.
Automation: reinforcement learning can realize automatic optimization of the AOC optical module, reduce the need for manual intervention, and improve the degree of automation of the system.
Disclosure of Invention
The invention aims to provide a method for optimizing the transmission performance of an AOC optical module by using deep reinforcement learning, which realizes the self-adaptive improvement of the performance of the AOC optical module by intelligent autonomous learning and optimization, maximizes the data transmission efficiency and reliability, continuously improves the performance, realizes automatic operation and maintenance and reduces the cost.
In order to solve the technical problems, the invention provides a method for optimizing transmission performance of an AOC optical module by using deep reinforcement learning, which comprises the following steps:
Step 1: establishing an AOC optical module reinforcement learning environment model, which comprises a state space, an action space, a state transition probability and a return function; the state space represents a set of possible transmission speeds of the AOC optical module; the action space represents a set of actions that may be taken to optimize the transmission speed of the AOC optical module; the state transition probability is the probability distribution of transitioning to the next state after a given action is executed in a given state; the return function calculates the percentage of performance improvement or reduction of the AOC optical module when a given action is performed in a given state and the module transitions to the next state; this percentage of performance improvement or reduction is the return; the returns include expected returns and actual returns;
Step 2: estimating an action value function and a strategy value function; the action value function represents the expected return for performing a given action in a given state; the policy value function represents the sum of the expected rewards obtained for each action under a given policy, starting from the current state to execute the plurality of actions contained in the policy, and following the policy until the end; each policy is a set of multiple actions in sequence;
Step 3: modifying the strategy using a strategy gradient approach, wherein the strategy parameters are updated by a strategy gradient that maximizes the expected return; calculating a new policy value function using a monte carlo tree search; circularly executing the step until the set first execution times are reached;
Step 4: updating the action value function using distributed reinforcement learning; using the updated action value function, and updating the strategy parameters again by calculating the strategy gradient so as to improve the strategy; circularly executing the step until the set second execution times are reached;
Step 5: based on the improved policy value function, a policy is selected that maximizes the action value function to optimize transmission performance of the AOC optical module.
Further, the state space is set as S, the action space as A, the state transition probability as P, and the return function as R; each element in the state space S represents a possible transmission speed of the AOC optical module and is denoted by a state s; each element in the action space A represents an action that may be taken to optimize the transmission speed of the AOC optical module and is denoted by an action a; the categories of action a include: increase, decrease and remain unchanged.
Further, in step 2, the action value function is estimated using the following formula:

Q^π(s, a) = R(s, a) + γ·V^π(s′);

where R(s, a) denotes the actual return obtained after selecting action a in state s, γ is the discount factor, s′ is the next state, and V^π(s′) is the policy value function of the next state.
Further, in step 2, the policy value function is estimated using the following formula:
V^π(s) = Σ_a π(a|s)·Q^π(s, a);

where π(a|s) denotes the probability of selecting action a in state s.
Further, the loss function of the action value function in step 2 is expressed by using the following formula:
L(θ) = E_{(s,a,r,s′)∼D}[ (r + γ·max_{a′} Q(s′, a′; θ⁻) − Q(s, a; θ))² ];

where Q(s, a; θ) is the estimated action value function, giving the expected return of executing action a in state s when the Q network parameters are θ; r is the actual return obtained after executing action a in state s; s′ is the next state, i.e. the state transitioned to after executing action a; γ is the discount factor, set to a fixed value and used to balance the importance of the actual return and the expected return; max_{a′} Q(s′, a′; θ⁻) is the maximum expected return obtainable by executing an action a′ in the given next state s′; θ denotes the Q network parameters, updated by gradient descent to minimize the loss function L(θ); θ⁻ denotes the target Q network parameters; D denotes the experience replay buffer, in which historical experiences from interaction with the environment are stored as tuples of state s, action a, return r and next state s′; the expectation E_{(s,a,r,s′)∼D} denotes that, for all four-tuples sampled from the experience replay buffer D, the average of the following expression is calculated:

(r + γ·max_{a′} Q(s′, a′; θ⁻) − Q(s, a; θ))²;

this expectation is the loss function L(θ), which measures the mean square error between the estimate of the Q network and the target Q value; the Q network is trained by minimizing this expectation.
Further, the method for improving the policy by using the policy gradient method in step 3 comprises: improving the policy by adjusting the policy parameters θ so as to maximize the expected return; a policy gradient method is used, wherein the policy gradient is the gradient of the expected return with respect to the policy parameters θ, and its direction is the direction that increases the expected return; first, the expected return of a sampled trajectory τ is estimated; the trajectory τ is a sequence of states and actions generated according to the current policy π_θ; then, the gradient with respect to the policy parameters θ is calculated, whose direction is the direction that maximizes the expected return; finally, the policy parameters θ are updated based on the learning rate α and the gradient.
Further, the policy parameters are updated using the following formula:
θ′ = θ + α·E_{τ∼π_θ}[ Σ_{t=0}^{T} ∇_θ log π_θ(a_t|s_t) · A(s_t, a_t) ];

where θ denotes the policy parameters of the policy π_θ; ∇_θ denotes the gradient with respect to the policy parameters θ; E_{τ∼π_θ} denotes taking the expectation over trajectories τ generated by the policy π_θ; Σ_{t=0}^{T} denotes summation over the time steps t in the trajectory, where T is the maximum length of the trajectory; ∇_θ log π_θ(a_t|s_t) denotes the gradient of the logarithmic probability of the policy executing action a_t in state s_t; A(s_t, a_t) denotes the advantage function, defined as A(s, a) = Q(s, a) − V(s), where Q(s, a) is the expected return of the state-action pair (s, a) and V(s) is the policy value function of state s; θ′ is the updated policy parameters; s_t denotes the state at time step t; a_t denotes the action at time step t. A new policy value function is calculated using a Monte Carlo tree search by the following formula:

V^π(s) = (1/N)·Σ_{i=1}^{N} V_i(s);

where V^π(s) denotes the policy value function, i.e. the expected return obtained by executing the current policy π in state s; N denotes the number of independent D-MCTS search trees; V_i(s) denotes the estimate of the i-th D-MCTS search tree at state s.
Further, when the action value function is updated by using distributed reinforcement learning in step 4, the following formula is used:

Q′(s, a; θ) = Q(s, a; θ) + α·[ r + γ·max_{a′} Q(s′, a′; θ) − Q(s, a; θ) ];

where Q(s, a; θ) is the action value function, i.e. the expected return of executing action a in state s when the Q network parameters are θ; Q′(s, a; θ) is the updated action value function; max_{a′} Q(s′, a′; θ) denotes the estimate of the maximum action value obtainable by executing an action a′ in the next state s′.
Further, when the updated action value function is used and the policy parameters are updated again by calculating the policy gradient so as to improve the policy, the following formula is used:

θ′ = θ + α·E_{τ∼π_θ}[ Σ_{t=0}^{T} ∇_θ log π_θ(a_t|s_t) · Q′(s_t, a_t; θ) ].
The method for optimizing the transmission performance of the AOC optical module by using deep reinforcement learning has the following beneficial effects: traditional AOC optical module optimization methods often have difficulty adapting to rapidly changing communication requirements. The method of the invention can dynamically adjust the strategy according to the continuously changing communication environment and requirements so as to ensure that the AOC optical module is always in the optimal performance state. This means that the AOC optical module can maintain high efficiency and reliability in different communication scenarios, providing better support for various applications. The invention introduces deep reinforcement learning, so that the AOC optical module can intelligently and autonomously learn and optimize the performance of the AOC optical module. By establishing a reinforcement learning environment model, the AOC optical module is allowed to select the optimal transmission speed and parameter configuration according to real-time state information. Compared with the traditional method of manually adjusting parameters, the intelligent optimization method has obvious advantages and can be better adapted to complex communication environments. The deep reinforcement learning method aims at maximizing the transmission performance of the AOC optical module. By using reinforcement learning algorithms to select the optimal transmission speed and parameter configuration, it can be ensured that the AOC optical module is operating in an optimal state at any given moment. This will significantly improve the efficiency and reliability of data transmission.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a method for optimizing transmission performance of an AOC optical module using deep reinforcement learning according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1: referring to fig. 1, a method of optimizing AOC optical module transmission performance using deep reinforcement learning, the method comprising:
Step 1: establishing an AOC optical module reinforcement learning environment model, which comprises a state space, an action space, a state transition probability and a return function; the state space represents a set of possible transmission speeds of the AOC optical module; the action space represents a set of actions that may be taken to optimize the transmission speed of the AOC optical module; the state transition probability is the probability distribution of transitioning to the next state after a given action is executed in a given state; the return function calculates the percentage of performance improvement or reduction of the AOC optical module when a given action is performed in a given state and the module transitions to the next state; this percentage of performance improvement or reduction is the return; the returns include expected returns and actual returns;
Step 2: estimating an action value function and a strategy value function; the action value function represents the expected return for performing a given action in a given state; the policy value function represents the sum of the expected rewards obtained for each action under a given policy, starting from the current state to execute the plurality of actions contained in the policy, and following the policy until the end; each policy is a set of multiple actions in sequence;
Step 3: modifying the strategy using a strategy gradient approach, wherein the strategy parameters are updated by a strategy gradient that maximizes the expected return; calculating a new policy value function using a monte carlo tree search; circularly executing the step until the set first execution times are reached;
Step 4: updating the action value function using distributed reinforcement learning; using the updated action value function, and updating the strategy parameters again by calculating the strategy gradient so as to improve the strategy; circularly executing the step until the set second execution times are reached;
Step 5: based on the improved policy value function, a policy is selected that maximizes the action value function to optimize transmission performance of the AOC optical module.
In particular, the state space represents a set of possible transmission speeds of the AOC optical module. This is a representation of the state of the system, which may include various parameters such as optical signal strength, noise level, data transmission rate, etc. The choice of state space is critical to reinforcement learning because it determines the different states that the system can explore during the learning process. The action space represents the set of actions that may be taken to optimize the transmission speed of the AOC optical module. This may include different options to adjust the transmission rate to achieve better performance. The design of the action space affects how the reinforcement learning algorithm searches for the best strategy. State transition probabilities describe the probability distribution that a system will transition to the next state after a given action is performed in a given state. This probability distribution reflects the uncertainty and dynamics of the environment. Reinforcement learning algorithms rely on these probabilities to estimate future rewards. The reward function represents a function that calculates the percentage of AOC light module performance improvement or reduction when a given action is performed in a given state and transferred to a new state. The return function is a reinforcement learning return signal that directs the learning algorithm to learn in a direction that maximizes the expected return.
The action value function represents the expected return obtained in performing a given action in a given state. By calculating an action value function for each action, it can be determined which action should be taken in a particular state to obtain the best performance. The policy value function represents the sum of the expected rewards obtained for each action under a given policy, starting from the current state, to execute the plurality of actions contained in the policy, and following the policy until the end. The policy value function helps evaluate the effect of the overall policy, thereby selecting the optimal policy.
The policy gradient method directs the improvement of policies by calculating the policy gradients. The policy gradient tells the probability of which actions should be increased to increase the overall return. In this way, the agent can select actions more intelligently to achieve better performance. In reinforcement learning, it is necessary to explore new actions in order to understand the environment and find the best strategy. Policy gradient methods can balance trade-offs between exploration and utilization by varying policy parameters. It may encourage more exploration at an early stage and then gradually shift to a more stable utilization optimization strategy. Using a monte carlo tree search to calculate a new policy value function can help the agent to better estimate the value of each action and state, thereby better guiding policy improvement.
The action value function represents the expected return for taking a particular action in a given state. It is used to evaluate the value of each action, thereby helping the agent to select the optimal action. Distributed reinforcement learning divides the training process into multiple parallel subtasks. Each subtask learns the action value function independently and then improves learning efficiency by collaboration and sharing information. In distributed learning, each subtask uses different empirical data to update its local action value function. This may be accomplished by various reinforcement learning algorithms, such as Q-learning or deep Q-network (DQN). By updating the action value function, the agent can better understand the value of each action, thereby improving the policy. Policies are typically based on action value functions to select optimal actions, so updating action value functions may indirectly improve policies. Distributed learning methods are generally more robust by learning using multiple agents. If one agent encounters difficulty or falls into a locally optimal solution, the other agents may provide more information to help overcome the problem.
The optimization strategy is implemented by selecting the action that maximizes the action value function (Q function). The Q function represents the expected return for taking a particular action in a given state, so the agent selects the action with the highest Q value in each state. Selecting the optimal policy means that the agent takes the best action to obtain the greatest performance improvement in the transmission of the AOC optical module; this helps to improve the transmission speed and reduce indicators such as the bit error rate. With the improved policy value function, the agent can make decisions automatically without human intervention, which is very useful for complex systems and large-scale data.
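For illustration only, a minimal Python sketch of this greedy selection over an action value table (the dictionary-based q_table and the helper name are assumptions introduced here, not elements of the patent):

```python
def select_best_action(q_table, state, actions):
    # Pick the action with the highest estimated Q value in the given state.
    return max(actions, key=lambda a: q_table.get((state, a), float("-inf")))

# Example: q_table = {(2, "increase"): 1.3, (2, "keep"): 0.4, (2, "decrease"): -0.7}
# select_best_action(q_table, 2, ["increase", "decrease", "keep"]) -> "increase"
```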
Example 2: set the state space asThe action space is/>The state transition probability is/>And the return function is/>; State space/>Each element in (2) represents the possible transmission speed of an AOC optical module, in terms of state/>To represent; action space/>Each element in (2) represents an action that may be taken to optimize the transmission speed of the AOC optical module, with action/>To represent; wherein, action/>The included categories include: increase, decrease and remain unchanged.
Specifically, the state space S represents the set of possible transmission speeds of the AOC optical module. Each state s represents one possible transmission speed of the AOC optical module and can be seen as a different scenario or configuration of the problem. The action space A represents the set of actions that may be taken to optimize the transmission speed of the AOC optical module; here, the action space includes action types such as increase, decrease and remain unchanged, and each action a represents an operation or regulation strategy for the transmission speed. The state transition probability P describes the probability distribution over the next state s′ to which the AOC optical module will transition after executing action a in a given state s; it reflects the law of state change after taking an action in the environment and is used to determine the transition probabilities between states. The return function R measures the immediate return obtained after executing action a in state s and transitioning to state s′; it represents the degree of performance improvement or degradation that the AOC optical module achieves under a given operation. The return function is the return signal of the decision process in reinforcement learning and guides the behavior of the agent.
Embodiment 2 defines the action space A as including increase, decrease and remain unchanged. The definition of these action types has the following principles and roles. It defines a discrete action space, namely the action set A, which is a common approach in reinforcement learning; a discrete action space means that each action represents a particular strategy or behavior rather than an infinite number of possibilities within a continuous range, which simplifies the modeling and solving of the problem. Defining the actions as increase, decrease and remain unchanged gives each action an explicit operational meaning; this increases the interpretability of the problem and enables the agent to understand and select the appropriate operation more easily. Defining different action types allows the agent to select the appropriate operation in each state; these action types reflect a set of policies that may be taken in practical applications, such as the manner in which the transmission speed is adjusted. One key problem in reinforcement learning is the balance between exploration and exploitation; defining different action types helps the agent better explore different strategies to find the optimal one. For example, the agent may attempt to increase the speed, decrease the speed, or keep it unchanged in order to learn which strategy performs best in different circumstances. The definition of the action types also helps model the problem as a reinforcement learning task: explicitly defining the action space as a discrete set of actions helps the algorithm handle state transitions and return calculations, making it easier to find the optimal policy.
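As a concrete illustration of such an environment model, the following Python sketch encodes a state space of candidate transmission speeds, the discrete action set {increase, decrease, keep} and a return defined as the percentage performance change; the speed grid, the externally supplied measure_performance hook and the reward scaling are assumptions for illustration, not values specified by the patent:

```python
# Hypothetical environment model for the AOC optical module (assumed speed grid in Gbps).
SPEEDS = [10, 25, 40, 50, 100]               # state space S: candidate transmission speeds
ACTIONS = ["increase", "decrease", "keep"]   # action space A

class AocEnv:
    def __init__(self, measure_performance):
        # measure_performance(speed) -> scalar score; supplied by the user (assumption).
        self.measure = measure_performance
        self.state = 0                        # index into SPEEDS

    def step(self, action):
        prev_score = self.measure(SPEEDS[self.state])
        if action == "increase":
            self.state = min(self.state + 1, len(SPEEDS) - 1)
        elif action == "decrease":
            self.state = max(self.state - 1, 0)
        # "keep" leaves the state unchanged
        new_score = self.measure(SPEEDS[self.state])
        # Return = percentage performance improvement (or reduction) caused by the action.
        reward = 100.0 * (new_score - prev_score) / max(prev_score, 1e-9)
        return self.state, reward
```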
Example 3: in step 2, the motion value function is estimated using the following formula:
;
Wherein,Expressed in state/>Select action/>Actual return obtained after,/>Is a discount factor; /(I)Is the next state; /(I)As a function of the policy value.
Specifically, the action value function Q^π(s, a) represents the expected cumulative return obtained by taking action a in a given state s and then following the policy π; it measures the value of taking a particular action in a particular state. R(s, a) denotes the actual immediate return obtained after selecting action a in state s; this return is the reward signal provided by the environment and is used to evaluate how good the taken action a is. The discount factor γ is a value between 0 and 1 used to discount the impact of future returns; it determines how much attention the agent pays to future returns: a higher γ gives more weight to future returns, while a lower γ gives more weight to immediate returns. In reinforcement learning, after the agent takes action a it transitions to the next state s′, the new state reached after executing the action. The policy value function V^π(s′) represents the sum of expected returns obtained by starting from state s′ and following the policy π until the end; it is used to evaluate the performance of the policy π in different states. The main role of the formula is to estimate the action value Q^π(s, a) of taking action a in state s; this estimate is based on the actual return R(s, a), the discount factor γ, and the expected value V^π(s′) of the next state. The action value function Q^π(s, a) can be used for policy improvement: the agent can select the optimal action by comparing the action values of different actions so as to maximize the expected return, which helps optimize the policy π so that the agent selects the best action in each state. The discount factor γ affects how strongly the agent weighs future returns: a higher γ encourages the agent to take future returns into account and explore longer-term benefits, while a lower γ focuses more on immediate returns and short-term benefit.
Example 4: in step 2, the policy value function is estimated using the following formula:
V^π(s) = Σ_a π(a|s)·Q^π(s, a);

where π(a|s) denotes the probability of selecting action a in state s.
Specifically, the policy value estimate calculates, for the policy π under consideration, the policy value V^π(s) in state s. This value is the weighted sum of the action values of the different actions a taken in state s, where the weights are determined by the policy π(a|s). V^π(s) provides a performance assessment of the policy under different conditions; by comparing policy values in different states, the merits of policies can be determined. The goal is to maximize V^π(s), i.e. to maximize the expected return: by evaluating different policies π, the optimal policy, namely the one that maximizes V^π(s), can be chosen, which helps the agent improve the policy and optimize performance.
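A minimal sketch of the two estimates in Examples 3 and 4, assuming tabular quantities stored in Python dictionaries and, for simplicity, a deterministic next state (all names here are illustrative):

```python
def estimate_q(R, V, s, a, next_state, gamma=0.9):
    # Q^pi(s, a) = R(s, a) + gamma * V^pi(s'), assuming a known deterministic next state.
    return R[(s, a)] + gamma * V[next_state]

def estimate_v(policy, Q, s, actions):
    # V^pi(s) = sum_a pi(a|s) * Q^pi(s, a)
    return sum(policy[(s, a)] * Q[(s, a)] for a in actions)
```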
Example 5: the loss function of the action value function in step 2 is expressed using the following formula:
L(θ) = E_{(s,a,r,s′)∼D}[ (r + γ·max_{a′} Q(s′, a′; θ⁻) − Q(s, a; θ))² ];

where Q(s, a; θ) is the estimated action value function, giving the expected return of executing action a in state s when the Q network parameters are θ; r is the actual return obtained after executing action a in state s; s′ is the next state, i.e. the state transitioned to after executing action a; γ is the discount factor, set to a fixed value and used to balance the importance of the actual return and the expected return; max_{a′} Q(s′, a′; θ⁻) is the maximum expected return obtainable by executing an action a′ in the given next state s′; θ denotes the Q network parameters, updated by gradient descent to minimize the loss function L(θ); θ⁻ denotes the target Q network parameters; D denotes the experience replay buffer, in which historical experiences from interaction with the environment are stored as tuples of state s, action a, return r and next state s′; the expectation E_{(s,a,r,s′)∼D} denotes that, for all four-tuples sampled from the experience replay buffer D, the average of the following expression is calculated:

(r + γ·max_{a′} Q(s′, a′; θ⁻) − Q(s, a; θ))²;

this expectation is the loss function L(θ), which measures the mean square error between the estimate of the Q network and the target Q value; the Q network is trained by minimizing this expectation.
Specifically, the core of the loss function is the term (r + γ·max_{a′} Q(s′, a′; θ⁻) − Q(s, a; θ))². This term measures the mean square error between the Q value estimated by the Q network for executing action a in state s and the target Q value; by minimizing this error, the Q network gradually approaches the true Q value. The expectation E_{(s,a,r,s′)∼D} denotes randomly sampling the historical experiences stored in the experience replay buffer D and then computing the expected value of the loss; this step ensures that the loss function takes into account the different scenarios of the agent's interaction with the environment. By minimizing the loss function L(θ) with gradient descent, the parameters θ of the Q network are updated to improve the estimate of the Q value; this process is repeated and, through successive iterations, the Q network gradually approaches the optimal Q function. The introduction of the target Q network θ⁻ helps stabilize training: the target Q value r + γ·max_{a′} Q(s′, a′; θ⁻), i.e. the maximum expected return obtainable after selecting the optimal action a′ in the next state s′, is used to compute the target in the loss function, making training more stable.
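A compact PyTorch-style sketch of this loss (the network architecture, batch layout and hyperparameters are illustrative assumptions; the patent itself does not specify them):

```python
import torch
import torch.nn as nn

class QNet(nn.Module):
    def __init__(self, n_states, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, n_actions))

    def forward(self, x):
        return self.net(x)

def dqn_loss(q_net, target_net, batch, gamma=0.9):
    # batch: tensors (states, actions, rewards, next_states) sampled from the replay buffer D
    states, actions, rewards, next_states = batch
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)   # Q(s, a; theta)
    with torch.no_grad():
        max_next = target_net(next_states).max(dim=1).values          # max_a' Q(s', a'; theta^-)
        target = rewards + gamma * max_next                           # r + gamma * max_a' Q(...)
    return nn.functional.mse_loss(q_sa, target)                       # mean square error over the batch
```

Minimizing this loss with a standard optimizer while periodically copying the Q network weights into the target network corresponds to the training procedure described above.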
Example 6: the method for improving the strategy by using the strategy gradient method in the step 3 comprises the following steps: by adjusting policy parametersTo improve the strategy to maximize the expected return; a policy gradient method is used, wherein the policy gradient is related to a policy parameterThe gradient of which is oriented to increase the expected return; first for sampling trace/>Estimating the expected return of (2); trackIs according to the current strategy/>A generated sequence of states and actions; then, calculate the parameters/>, about the policyThe direction of the gradient is the direction that maximizes the desired return; finally, based on learning rate/>And gradients to update policy parameters/>。
Specifically, the policy π_θ is a parameterized policy function that defines the probability of taking action a in state s. The policy parameters θ are the parameters of the policy function, which may be the weights and biases of a neural network. The policy gradient method is a reinforcement learning algorithm that directly adjusts the policy parameters θ to improve the policy and maximize the expected return; its core idea is to find an update direction of the policy parameters such that the expected return increases. A sampled trajectory τ is a sequence of states and actions generated by the current policy π_θ; it represents one interaction of the agent with the environment, starting from an initial state, selecting actions according to the policy, then observing the return and transitioning to the next state over the entire sequence.
The role of the policy gradient method is to improve the policy by adjusting the policy parameters θ so as to maximize the expected return. First, the expected return of the sampled trajectory τ is estimated; the expected return is the expected cumulative value of the returns over the entire trajectory and represents the average performance of the policy in the environment. Next, the gradient with respect to the policy parameters θ, i.e. the policy gradient, is calculated; it indicates how to adjust the policy parameters to maximize the expected return, and its direction is the direction in which the expected return increases. Finally, the learning rate α and the policy gradient are used to update the policy parameters θ; this update adjusts the policy towards obtaining a higher expected return, thereby improving the policy.
Example 7: updating policy parameters using the following formula:
θ′ = θ + α·E_{τ∼π_θ}[ Σ_{t=0}^{T} ∇_θ log π_θ(a_t|s_t) · A(s_t, a_t) ];

where θ denotes the policy parameters of the policy π_θ; ∇_θ denotes the gradient with respect to the policy parameters θ; E_{τ∼π_θ} denotes taking the expectation over trajectories τ generated by the policy π_θ; Σ_{t=0}^{T} denotes summation over the time steps t in the trajectory, where T is the maximum length of the trajectory; ∇_θ log π_θ(a_t|s_t) denotes the gradient of the logarithmic probability of the policy executing action a_t in state s_t; A(s_t, a_t) denotes the advantage function, defined as A(s, a) = Q(s, a) − V(s), where Q(s, a) is the expected return of the state-action pair (s, a) and V(s) is the policy value function of state s; θ′ is the updated policy parameters; s_t denotes the state at time step t; a_t denotes the action at time step t. A new policy value function is calculated using a Monte Carlo tree search by the following formula:

V^π(s) = (1/N)·Σ_{i=1}^{N} V_i(s);

where V^π(s) denotes the policy value function, i.e. the expected return obtained by executing the current policy π in state s; N denotes the number of independent D-MCTS search trees; V_i(s) denotes the estimate of the i-th D-MCTS search tree at state s.
In particular, θ denotes the parameters of the policy π_θ; this policy defines the probability of selecting action a in state s. The policy gradient method is a reinforcement learning algorithm that directly adjusts the policy parameters θ to improve the policy and maximize the expected return; its core idea is to find the direction of a gradient that can increase the expected return. In the formula, E_{τ∼π_θ} denotes taking the expectation over trajectories τ generated by the policy π_θ; a trajectory τ represents one interaction of the agent with the environment, starting from an initial state, selecting actions according to the policy, then observing returns and transitioning to the next state over the entire sequence. ∇_θ denotes the gradient with respect to the policy parameters θ; the policy gradient indicates how to adjust the policy parameters θ to maximize the expected return, and its direction is the direction in which the expected return increases. The advantage function A(s, a) measures the value of executing action a in state s relative to the expected value of the policy; it is defined as A(s, a) = Q(s, a) − V(s), where Q(s, a) is the expected return of the state-action pair (s, a) and V(s) is the policy value function of state s.
The policy gradient method updates the policy parameters θ to improve the policy and maximize the expected return as follows: first, the expected return is estimated by taking the expectation E_{τ∼π_θ} over the trajectories τ; this expected return represents the performance of the policy in the average case. Next, the gradient with respect to the policy parameters θ, i.e. the policy gradient, is calculated; it indicates how to adjust the policy parameters θ to maximize the expected return, and its direction is the direction in which the expected return increases. Finally, the learning rate α and the policy gradient are used to update the policy parameters θ; the update direction is the direction that maximizes the expected return. By repeating this process, the policy parameters θ gradually converge to values that achieve a higher expected return, thereby improving the policy.
The policy value function V^π(s) denotes the expected return obtained by executing the current policy π in state s, i.e. the quality of the policy in state s; it is an important index in reinforcement learning for evaluating the quality of a policy. N denotes the number of independent D-MCTS search trees, each of which is used to estimate the policy value function; each independent D-MCTS search tree V_i estimates the value of the policy value function in state s. D-MCTS is a search algorithm that selects actions in a search tree and simulates environmental interactions in order to estimate policy values.
Computing the new policy value function using Monte Carlo tree search (D-MCTS): the policy value function V^π(s) in the formula is calculated by averaging the estimates V_i(s) of all independent D-MCTS search trees, which yields a more accurate policy value estimate. D-MCTS is a search algorithm in reinforcement learning used to select actions, simulate environmental interactions and estimate the policy value function; it builds a search tree by simulating trajectories multiple times, each simulation starting from the current state, selecting actions according to the policy, simulating environmental interactions, evaluating the value of the actions and selecting the optimal action. To obtain a more stable and accurate policy value estimate, multiple independent D-MCTS search trees are used and their estimates are averaged, which reduces the variance of the estimate and improves its accuracy.
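A minimal sketch of the advantage-weighted policy gradient step and the averaging of independent search-tree value estimates, assuming a categorical policy network and precomputed advantages (an illustrative reading of the formulas rather than the patent's exact implementation):

```python
import torch

def policy_gradient_step(policy_net, optimizer, states, actions, advantages):
    # states: [T, n_states], actions: [T] (int64), advantages: [T] = Q(s_t, a_t) - V(s_t)
    logits = policy_net(states)
    log_probs = torch.log_softmax(logits, dim=1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)   # log pi_theta(a_t | s_t)
    loss = -(chosen * advantages).mean()   # gradient ascent on E[ sum_t grad log pi * A ]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def averaged_policy_value(tree_estimates):
    # V^pi(s) = (1/N) * sum_i V_i(s), with V_i(s) from N independent D-MCTS search trees
    return sum(tree_estimates) / len(tree_estimates)
```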
Example 8: in step 4, when the motion value function is updated by using distributed reinforcement learning, the following formula is used:
;
Wherein, As a function of the action value, expressed in Q network parameter as/>At the time of state/>Down execution action/>Expected return on time; /(I)The updated action value function is used; Representing the next state/> Under, perform action/>An estimate of the maximum motion value that can be obtained.
Specifically, the action value function Q(s, a; θ) denotes the expected return of executing action a in state s, where θ are the parameters of the Q network; it is used to estimate the value of each state-action pair under the given policy. The updated action value function Q′(s, a; θ) is the action value function after executing step 4 and denotes the new expected-return estimate for executing action a in state s. The learning rate α controls the update step size and is typically a small positive number. r denotes the actual return obtained after executing the action in state s; this return is the reward signal provided by the environment. γ is the discount factor used to weigh the importance of current and future returns; it is a value between 0 and 1. max_{a′} Q(s′, a′; θ) denotes the estimate of the maximum action value obtainable by selecting an action a′ in the next state s′; this value is used to estimate the future return.
The action value function Q is updated with the Q-learning method in distributed reinforcement learning in order to better estimate the value of each state-action pair: the right-hand side of the formula represents the update computation, in which a new estimate Q′(s, a; θ) is calculated from the current action value function Q(s, a; θ), the actual return r obtained after executing action a, and the estimate γ·max_{a′} Q(s′, a′; θ) of the future return. The learning rate α adjusts the step size of the update; a smaller learning rate makes the update more stable but may require more iterations to converge. This formula uses the core idea of Q-learning, namely improving the current action value estimate by estimating the maximum future return, which helps the agent learn to select the optimal action in different states. The term max_{a′} Q(s′, a′; θ) in the formula takes future returns into account and helps adjust the value of the current state-action pair.
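A small sketch of this update rule; the "distributed" aspect is approximated here by workers that each apply the rule to a local Q table and then average their tables, which is an assumption about the synchronization scheme rather than something the patent specifies:

```python
def q_update(q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    # Q'(s, a) = Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
    best_next = max(q.get((s_next, a2), 0.0) for a2 in actions)
    q[(s, a)] = q.get((s, a), 0.0) + alpha * (r + gamma * best_next - q.get((s, a), 0.0))

def merge_q_tables(tables):
    # Average the local Q tables produced by independent workers (illustrative merge step).
    merged = {}
    for key in set().union(*tables):
        vals = [t[key] for t in tables if key in t]
        merged[key] = sum(vals) / len(vals)
    return merged
```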
Example 9: and when the strategy parameters are updated again by calculating the strategy gradient by using the updated action value function so as to improve the strategy, the following formula is used:
。
Specifically, the policy parameters θ are the parameters of the policy π_θ; this policy defines the probability of selecting action a in state s. The learning rate α is a positive number that controls the size of the update step. In the formula, E_{τ∼π_θ} denotes taking the expectation over trajectories τ generated by the policy π_θ; a trajectory τ represents one interaction of the agent with the environment, starting from an initial state, selecting actions according to the policy, then observing returns and transitioning to the next state over the entire sequence. ∇_θ denotes the gradient with respect to the policy parameters θ; the policy gradient indicates how to adjust the policy parameters θ to maximize the expected return. ∇_θ log π_θ(a_t|s_t) is the gradient, with respect to the policy parameters θ, of the logarithmic probability of executing action a_t in state s_t. Q′(s_t, a_t; θ) is the value of the state-action pair (s_t, a_t) estimated with the updated action value function.
First, the expected return is estimated by taking the expectation E_{τ∼π_θ} over the trajectories τ. Next, the gradient with respect to the policy parameters θ, i.e. the policy gradient, is calculated; this gradient indicates how to adjust the policy parameters θ to maximize the expected return, and its direction is the direction in which the expected return increases. When calculating the gradient, the updated action value function Q′(s_t, a_t; θ) is used to estimate the value of each state-action pair (s_t, a_t); this value measures the contribution of each state-action pair to the expected return. Finally, the learning rate α and the policy gradient are used to update the policy parameters θ; the update direction is the direction that maximizes the expected return. By repeating this process, the policy parameters θ gradually converge to values that achieve a higher expected return, thereby improving the policy.
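This second policy update differs from the earlier one only in that the updated action values replace the advantage term; a sketch under that reading (reusing the hypothetical policy-network setup from the earlier examples) might be:

```python
import torch

def improve_policy_with_q(policy_net, q_net, optimizer, states, actions):
    # Weight each log pi_theta(a_t|s_t) with the updated action value Q'(s_t, a_t; theta).
    with torch.no_grad():
        q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    log_probs = torch.log_softmax(policy_net(states), dim=1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = -(chosen * q_values).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```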
The present invention has been described in detail above. The principles and embodiments of the present invention have been described herein with reference to specific examples, the description of which is intended only to facilitate an understanding of the method of the present invention and its core ideas. It should be noted that it will be apparent to those skilled in the art that various modifications and adaptations of the invention can be made without departing from the principles of the invention and these modifications and adaptations are intended to be within the scope of the invention as defined in the following claims.
Claims (9)
1. A method for optimizing transmission performance of an AOC optical module using deep reinforcement learning, the method comprising:
Step 1: establishing an AOC optical module reinforcement learning environment model, which comprises a state space, an action space, a state transition probability and a return function; the state space represents a set of possible transmission speeds of the AOC optical module; the action space represents a set of actions that may be taken to optimize the transmission speed of the AOC optical module; the state transition probability is the probability distribution of transitioning to the next state after a given action is executed in a given state; the return function calculates the percentage of performance improvement or reduction of the AOC optical module when a given action is performed in a given state and the module transitions to the next state; this percentage of performance improvement or reduction is the return; the returns include expected returns and actual returns;
Step 2: estimating an action value function and a strategy value function; the action value function represents the expected return for performing a given action in a given state; the policy value function represents the sum of the expected rewards obtained for each action under a given policy, starting from the current state to execute the plurality of actions contained in the policy, and following the policy until the end; each policy is a set of multiple actions in sequence;
Step 3: modifying the strategy using a strategy gradient approach, wherein the strategy parameters are updated by a strategy gradient that maximizes the expected return; calculating a new policy value function using a monte carlo tree search; circularly executing the step until the set first execution times are reached;
Step 4: updating the action value function using distributed reinforcement learning; using the updated action value function, and updating the strategy parameters again by calculating the strategy gradient so as to improve the strategy; circularly executing the step until the set second execution times are reached;
Step 5: based on the improved policy value function, a policy is selected that maximizes the action value function to optimize transmission performance of the AOC optical module.
2. The method for optimizing transmission performance of an AOC optical module using deep reinforcement learning of claim 1, wherein the state space is set as S, the action space as A, the state transition probability as P, and the return function as R; each element in the state space S represents a possible transmission speed of the AOC optical module and is denoted by a state s; each element in the action space A represents an action that may be taken to optimize the transmission speed of the AOC optical module and is denoted by an action a; the categories of action a include: increase, decrease and remain unchanged.
3. The method for optimizing transmission performance of an AOC optical module using deep reinforcement learning of claim 2, wherein in step 2, the action value function is estimated using the following formula:
Q^π(s, a) = R(s, a) + γ·V^π(s′);

where R(s, a) denotes the actual return obtained after selecting action a in state s, γ is the discount factor, s′ is the next state, and V^π(s′) is the policy value function of the next state.
4. A method for optimizing transmission performance of an AOC optical module using deep reinforcement learning as claimed in claim 3, wherein in step 2, the policy value function is estimated using the following formula:
V^π(s) = Σ_a π(a|s)·Q^π(s, a);

where π(a|s) denotes the probability of selecting action a in state s.
5. The method for optimizing transmission performance of an AOC optical module using deep reinforcement learning of claim 4, wherein the loss function of the action value function in step 2 is expressed using the following formula:
L(θ) = E_{(s,a,r,s′)∼D}[ (r + γ·max_{a′} Q(s′, a′; θ⁻) − Q(s, a; θ))² ];

where Q(s, a; θ) is the estimated action value function, giving the expected return of executing action a in state s when the Q network parameters are θ; r is the actual return obtained after executing action a in state s; s′ is the next state, i.e. the state transitioned to after executing action a; γ is the discount factor, set to a fixed value and used to balance the importance of the actual return and the expected return; max_{a′} Q(s′, a′; θ⁻) is the maximum expected return obtainable by executing an action a′ in the given next state s′; θ denotes the Q network parameters, updated by gradient descent to minimize the loss function L(θ); θ⁻ denotes the target Q network parameters; D denotes the experience replay buffer, in which historical experiences from interaction with the environment are stored as tuples of state s, action a, return r and next state s′; the expectation E_{(s,a,r,s′)∼D} denotes that, for all four-tuples sampled from the experience replay buffer D, the average of the following expression is calculated:

(r + γ·max_{a′} Q(s′, a′; θ⁻) − Q(s, a; θ))²;

this expectation is the loss function L(θ), which measures the mean square error between the estimate of the Q network and the target Q value; the Q network is trained by minimizing this expectation.
6. The method for optimizing transmission performance of an AOC optical module using deep reinforcement learning of claim 5, wherein the method for improving the policy using the policy gradient method in step 3 comprises: improving the policy by adjusting the policy parameters θ so as to maximize the expected return; a policy gradient method is used, wherein the policy gradient is the gradient of the expected return with respect to the policy parameters θ, and its direction is the direction that increases the expected return; first, the expected return of a sampled trajectory τ is estimated; the trajectory τ is a sequence of states and actions generated according to the current policy π_θ; then, the gradient with respect to the policy parameters θ is calculated, whose direction is the direction that maximizes the expected return; finally, the policy parameters θ are updated based on the learning rate α and the gradient.
7. The method for optimizing transmission performance of an AOC optical module using deep reinforcement learning of claim 6, wherein the policy parameters are updated using the following formula:
$\theta' = \theta + \alpha \nabla_{\theta} J(\theta)$, with $\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}}\left[\sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, A(s_t, a_t)\right]$;
Wherein $\theta$ denotes the policy parameters of the policy $\pi_{\theta}$; $\nabla_{\theta} J(\theta)$ denotes the gradient of the expected return $J(\theta)$ with respect to the policy parameters $\theta$; $\mathbb{E}_{\tau \sim \pi_{\theta}}$ denotes the expectation over trajectories $\tau$ generated by the policy $\pi_{\theta}$; $\sum_{t=0}^{T}$ denotes summation over the time steps $t$ of the trajectory, where $T$ is the maximum trajectory length; $\nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)$ denotes the gradient of the logarithmic probability of the policy executing action $a_t$ in state $s_t$; $A(s_t, a_t)$ denotes the advantage function, defined as $A(s_t, a_t) = Q(s_t, a_t) - V^{\pi}(s_t)$, where $Q(s_t, a_t)$ is the expected return of the state-action pair $(s_t, a_t)$ and $V^{\pi}(s_t)$ is the policy value function of state $s_t$; $\theta'$ denotes the updated policy parameters; $s_t$ denotes the state at time step $t$; $a_t$ denotes the action at time step $t$; a new policy value function is calculated using a Monte Carlo tree search by the following formula:
$V^{\pi}(s) = \frac{1}{N} \sum_{i=1}^{N} V_{i}(s)$;
Wherein $V^{\pi}(s)$ denotes the policy value function, i.e. the expected return obtained by executing the current policy $\pi$ in state $s$; $N$ denotes the number of independent D-MCTS search trees; $V_{i}(s)$ denotes the estimate of the value of state $s$ given by the $i$-th D-MCTS search tree.
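The helpers below sketch the two quantities used in claim 7: the advantage $A(s_t, a_t) = Q(s_t, a_t) - V^{\pi}(s_t)$ that weights the log-probability gradient, and the combination of the $N$ independent D-MCTS value estimates. The equal-weight average follows the reconstruction above, and the numbers are made up for the example.

```python
import numpy as np

def advantage(q_sa: float, v_s: float) -> float:
    """A(s_t, a_t) = Q(s_t, a_t) - V_pi(s_t): how much better the action is than the state average."""
    return q_sa - v_s

def dmcts_value(tree_estimates) -> float:
    """Combine the value estimates V_i(s) of N independent D-MCTS search trees by averaging."""
    return float(np.mean(tree_estimates))

# Hypothetical numbers: three independent search trees evaluate the same state
v_s = dmcts_value([1.20, 0.90, 1.05])      # (1.20 + 0.90 + 1.05) / 3 = 1.05
print(advantage(q_sa=1.40, v_s=v_s))       # 1.40 - 1.05 = 0.35
```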
8. The method of optimizing transmission performance of an AOC optical module using deep reinforcement learning of claim 7, wherein when using distributed reinforcement learning to update the action value function in step 4, the following formula is used:
$Q'(s, a; \theta) = Q(s, a; \theta) + \alpha \left(r + \gamma \max_{a'} Q(s', a'; \theta) - Q(s, a; \theta)\right)$;
Wherein $Q(s, a; \theta)$ denotes the action value function, i.e. the expected return when action $a$ is executed in state $s$ given the Q network parameters $\theta$; $Q'(s, a; \theta)$ denotes the updated action value function; $\max_{a'} Q(s', a'; \theta)$ denotes the estimate of the maximum action value obtainable by executing an action $a'$ in the next state $s'$.
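A one-line worker-side sketch of the action-value update in claim 8, assuming the standard temporal-difference form implied by the symbols defined above; the learning rate and the sample values are illustrative only.

```python
def q_update(q_sa, reward, max_q_next, alpha=0.1, gamma=0.99):
    """Q'(s, a) = Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))."""
    return q_sa + alpha * (reward + gamma * max_q_next - q_sa)

# Hypothetical values gathered by one distributed worker
print(q_update(q_sa=1.0, reward=0.5, max_q_next=1.2))   # 1.0 + 0.1 * (0.5 + 1.188 - 1.0) = 1.0688
```

In a distributed setting, each worker would apply updates of this kind against a shared or periodically synchronized Q network, which is one plausible reading of the distributed update in step 4.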
9. The method for optimizing transmission performance of an AOC optical module using deep reinforcement learning of claim 8, wherein the policy parameters are updated again by calculating a policy gradient using the updated action value function to improve the policy using the following formula:
$\theta'' = \theta' + \alpha \mathbb{E}_{\tau \sim \pi_{\theta'}}\left[\sum_{t=0}^{T} \nabla_{\theta'} \log \pi_{\theta'}(a_t \mid s_t)\, Q'(s_t, a_t; \theta)\right]$.
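Finally, a minimal sketch of re-estimating the policy gradient with the updated action values $Q'(s, a)$ and taking one more ascent step on the policy parameters, as claim 9 describes; the function names, trajectory layout and step size are assumptions.

```python
import numpy as np

def policy_update_with_q(theta, trajectory, grad_log_pi, q_updated, alpha=0.01):
    """theta <- theta + alpha * sum_t grad log pi_theta(a_t|s_t) * Q'(s_t, a_t).

    trajectory  : [(s_0, a_0), ..., (s_T, a_T)] sampled under the current policy
    grad_log_pi : callable (theta, s, a) -> gradient of log pi_theta(a|s)
    q_updated   : callable (s, a) -> updated action value Q'(s, a)
    """
    grad_J = np.zeros_like(theta)
    for s, a in trajectory:
        grad_J = grad_J + grad_log_pi(theta, s, a) * q_updated(s, a)
    return theta + alpha * grad_J
```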
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410607959.5A CN118200135B (en) | 2024-05-16 | 2024-05-16 | Method for optimizing transmission performance of AOC optical module by using deep reinforcement learning |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN118200135A (en) | 2024-06-14 |
| CN118200135B (en) | 2024-08-09 |
Family
ID=91413837
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202410607959.5A Active CN118200135B (en) | 2024-05-16 | 2024-05-16 | Method for optimizing transmission performance of AOC optical module by using deep reinforcement learning |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN118200135B (en) |
Patent Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113031441A (en) * | 2021-03-03 | 2021-06-25 | 北京航空航天大学 | Rotary mechanical diagnosis network automatic search method based on reinforcement learning |
| CN116306938A (en) * | 2022-12-13 | 2023-06-23 | 哈尔滨工业大学 | A multi-model reasoning acceleration system and method for automatic driving full scene perception |
| CN116448117A (en) * | 2023-04-18 | 2023-07-18 | 安徽大学 | A Path Planning Method Fused with Deep Neural Networks and Reinforcement Learning Methods |
| CN117520956A (en) * | 2023-11-02 | 2024-02-06 | 江苏鸿程大数据技术与应用研究院有限公司 | A two-stage automated feature engineering method based on reinforcement learning and meta-learning |
| CN117992743A (en) * | 2023-12-29 | 2024-05-07 | 贵州电网有限责任公司 | A knowledge graph-based intelligent analysis and treatment method and system for power grid faults |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN118870235A (en) * | 2024-06-24 | 2024-10-29 | 烽火通信科技股份有限公司 | A dumb resource modeling management method and device |
| CN118870235B (en) * | 2024-06-24 | 2025-09-30 | 烽火通信科技股份有限公司 | A dumb resource modeling management method and device |
| CN119511910A (en) * | 2024-11-25 | 2025-02-25 | 南京师范大学 | A multi-sensor integrated system for remote monitoring of industrial equipment |
Also Published As
| Publication number | Publication date |
|---|---|
| CN118200135B (en) | 2024-08-09 |
Similar Documents
| Publication | Title |
|---|---|
| CN118200135B (en) | Method for optimizing transmission performance of AOC optical module by using deep reinforcement learning |
| CN115086992B (en) | A distributed semantic communication system and bandwidth resource allocation method and device |
| CN111491358B (en) | Adaptive modulation and power control system based on energy acquisition and optimization method |
| CN113971089A (en) | Method and device for selecting equipment nodes of federal learning system |
| CN120103715B (en) | Crane remote instruction response delay detection and preceding and following compensation method and system |
| CN120215678B (en) | Energy efficiency data detection processing method and system for data center |
| CN118607610B (en) | Method, device, equipment and storage medium for accelerating training neural network model |
| CN114867123B (en) | A multi-user scheduling method and system for 5G Internet of Things system based on reinforcement learning |
| CN119907021B (en) | Intelligent perception method and system for edge equipment based on information freshness |
| CN117436485A (en) | End-edge-cloud collaboration system and method based on multiple exit points that trade off latency and accuracy |
| CN116565841A (en) | An accurate prediction method of grid load carbon rate |
| CN120146481A (en) | A network resource scheduling method based on artificial intelligence |
| CN120163068A (en) | Dynamic strategy real-time optimization method based on DQN and hybrid Nash equilibrium |
| CN120218539A (en) | Intelligent collaborative management method and system for network security operation and maintenance |
| CN118775147A (en) | Wind turbine yaw control method, device, wind power system and storage medium |
| CN118539996A (en) | Control method for radio communication equipment |
| CN117933439A (en) | Micro-grid optimized energy management system based on deep reinforcement learning |
| CN115150335A (en) | Optimal flow segmentation method and system based on deep reinforcement learning |
| CN114785458A (en) | Demand response service optimization method and system based on LSTM |
| CN120029694B (en) | Task offloading method and system using deep reinforcement learning and attention mechanism |
| CN120335287B (en) | Course control method and device of ditching cabling robot and ditching cabling robot |
| CN120710674B (en) | A Method and System for Optimizing Quantum Key Distribution Parameters Based on Dynamic Adaptive Loss |
| CN120499744B (en) | Method and system for unloading tasks of mobile vehicle based on prediction and game |
| CN120711519B (en) | Method for estimating deep reinforcement learning bandwidth based on signal enhancement in mobile scene |
| CN117749625B (en) | Network performance optimization system and method based on deep Q network |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |