Disclosure of Invention
In view of the defects of the prior art, the invention aims to provide a method, a system and a medium for game-based Q-learning contention window adjustment.
The game-based Q-learning contention window adjustment method provided by the invention comprises the following steps:
Step 1: initialize the network node settings, including the communication protocol, the network topology, the traffic arrival model and the physical-layer standard; after communication starts, the network self-organizes and builds routing tables by broadcasting routing information; in the initial state each node backs off according to the default contention window size, and thereafter it adjusts the contention window size according to the action output by the Q-learning algorithm and backs off accordingly;
Step 2: every node in the network obtains its number of one-hop neighbor nodes from the routing table and broadcasts it to its neighbors through RTS/CTS signaling; it also receives the RTS/CTS information broadcast by its neighbors and computes the average neighbor count; by inspecting the traffic waiting in its buffer, each node calculates its traffic load per unit time and the corresponding load factor;
Step 3: each node calculates its weight in the network and broadcasts it, computes a network difference index from the maximum and minimum weights within its one-hop range, and then plays the game according to that difference index;
Step 4: if the network difference index exceeds a preset threshold, a balanced backoff strategy is adopted: the node with the largest weight in the network uses a smaller contention window state set in the reinforcement learning, while the other nodes use a larger contention window state set; otherwise the default backoff strategy is adopted and all nodes in the network use the same contention window state set;
Step 5: each node in the network performs Q-learning over the contention window state set generated in step 4, outputs an optimal contention window interval and communicates according to that interval;
Step 6: when the network topology changes or the traffic load fluctuates strongly, steps 2-5 are executed again.
Preferably, the importance of a node in the network is determined by querying the routing table and collecting traffic statistics. For any node k, the number N_k of its one-hop neighbor nodes is obtained by querying the routing table; the neighbor counts N_i of the neighbor nodes within the one-hop communication range are obtained from the RTS/CTS additional bit information, giving Σ_i N_i, from which the average neighbor count of the neighbor nodes is calculated.
For any node k, the traffic load L_k in the current time period is counted; according to the size of L_k relative to the channel transmission rate R_k, the node is classified as a heavy-service node or not, which is expressed by the load factor l_k.
The weight of node k is then calculated from the load factor l_k, the neighbor count N_k and the average neighbor count of its neighbor nodes, wherein b is a constant that guarantees a basic weight for edge nodes in the network.
Preferably, the node with the maximum weight within its one-hop communication range broadcasts its weight y_i through RTS/CTS control signaling, obtains the minimum node weight within the one-hop range, and calculates a difference index G; the degree of difference among the network nodes is determined according to the value of G.
If the value of G exceeds a preset threshold, the network is judged to be strongly differentiated and a balanced backoff strategy is adopted: the node with the largest weight uses a smaller contention window interval, i.e. a smaller left endpoint for the same right endpoint, while the node with the smallest weight uses a larger contention window interval, i.e. a larger left endpoint for the same right endpoint.
Preferably, each node acts as an agent; the left endpoint of the contention window interval in the backoff algorithm, CW_min = 2^s − 1, constitutes the environment state set, and adjusting the contention window left endpoint CW_min once per unit time constitutes the action set. Learning is performed with the network transmission success rate and the average delay as the optimization objectives, and the update formula is:
Q(S,A) ← Q(S,A) + α[r + γ·max_a Q(S',a) − Q(S,A)]
where γ is the discount factor, weighting the estimated future value max_a Q(S',a) in the update; r is the reward, a function evaluated from the transmission success rate and the average delay obtained by taking action A in the current state S; α is the learning-rate (convergence) factor, the main factor affecting the convergence speed; and S' is the next state.
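As an illustration, a minimal tabular Q-learning sketch of this update rule is given below (Python); the particular CW_min state set, the three shift actions, the ε-greedy exploration and the random placeholder reward are assumptions made for the example, and only the update formula itself is taken from the text above.

import random

# Illustrative tabular Q-learning over CW_min states; only the update rule
# inside the loop is taken from the text, everything else is an assumed example.
states = [7, 15, 31]          # example CW_min state set (values of 2**s - 1)
actions = [-1, 0, +1]         # move to a smaller / same / larger CW_min
alpha, gamma, eps = 0.5, 0.8, 0.1
Q = {(s, a): 0.0 for s in states for a in actions}

def step(state, action):
    # Hypothetical environment: returns (next_state, reward).
    idx = max(0, min(len(states) - 1, states.index(state) + action))
    # Placeholder reward; in the invention it would combine PDR and average delay.
    return states[idx], random.uniform(-1.0, 1.0)

state = random.choice(states)
for _ in range(1000):
    a = random.choice(actions) if random.random() < eps else max(actions, key=lambda x: Q[(state, x)])
    next_state, r = step(state, a)
    # Q(S,A) <- Q(S,A) + alpha * [r + gamma * max_a Q(S',a) - Q(S,A)]
    Q[(state, a)] += alpha * (r + gamma * max(Q[(next_state, x)] for x in actions) - Q[(state, a)])
    state = next_state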
The game-based Q-learning contention window adjustment system provided by the invention comprises:
Module M1: initializes the network node settings, including the communication protocol, the network topology, the traffic arrival model and the physical-layer standard; after communication starts, the network self-organizes and builds routing tables by broadcasting routing information; in the initial state each node backs off according to the default contention window size, and thereafter it adjusts the contention window size according to the action output by the Q-learning algorithm and backs off accordingly;
Module M2: every node in the network obtains its number of one-hop neighbor nodes from the routing table and broadcasts it to its neighbors through RTS/CTS signaling; it also receives the RTS/CTS information broadcast by its neighbors and computes the average neighbor count; by inspecting the traffic waiting in its buffer, each node calculates its traffic load per unit time and the corresponding load factor;
Module M3: each node calculates its weight in the network and broadcasts it, computes a network difference index from the maximum and minimum weights within its one-hop range, and then plays the game according to that difference index;
Module M4: if the network difference index exceeds a preset threshold, a balanced backoff strategy is adopted: the node with the largest weight in the network uses a smaller contention window state set in the reinforcement learning, while the other nodes use a larger contention window state set; otherwise the default backoff strategy is adopted and all nodes in the network use the same contention window state set;
Module M5: each node in the network performs Q-learning over the contention window state set generated by module M4, outputs an optimal contention window interval and communicates according to that interval;
Module M6: when the network topology changes or the traffic load fluctuates strongly, modules M2-M5 are invoked again in sequence.
Preferably, the importance of a node in the network is determined by querying the routing table and collecting traffic statistics. For any node k, the number N_k of its one-hop neighbor nodes is obtained by querying the routing table; the neighbor counts N_i of the neighbor nodes within the one-hop communication range are obtained from the RTS/CTS additional bit information, giving Σ_i N_i, from which the average neighbor count of the neighbor nodes is calculated.
For any node k, the traffic load L_k in the current time period is counted; according to the size of L_k relative to the channel transmission rate R_k, the node is classified as a heavy-service node or not, which is expressed by the load factor l_k.
The weight of node k is then calculated from the load factor l_k, the neighbor count N_k and the average neighbor count of its neighbor nodes, wherein b is a constant that guarantees a basic weight for edge nodes in the network.
Preferably, the node with the maximum weight within its one-hop communication range broadcasts its weight y_i through RTS/CTS control signaling, obtains the minimum node weight within the one-hop range, and calculates a difference index G; the degree of difference among the network nodes is determined according to the value of G.
If the value of G exceeds a preset threshold, the network is judged to be strongly differentiated and a balanced backoff strategy is adopted: the node with the largest weight uses a smaller contention window interval, i.e. a smaller left endpoint for the same right endpoint, while the node with the smallest weight uses a larger contention window interval, i.e. a larger left endpoint for the same right endpoint.
Preferably, each node acts as an agent; the left endpoint of the contention window interval in the backoff algorithm, CW_min = 2^s − 1, constitutes the environment state set, and adjusting the contention window left endpoint CW_min once per unit time constitutes the action set. Learning is performed with the network transmission success rate and the average delay as the optimization objectives, and the update formula is:
Q(S,A) ← Q(S,A) + α[r + γ·max_a Q(S',a) − Q(S,A)]
where γ is the discount factor, weighting the estimated future value max_a Q(S',a) in the update; r is the reward, a function evaluated from the transmission success rate and the average delay obtained by taking action A in the current state S; α is the learning-rate (convergence) factor, the main factor affecting the convergence speed; and S' is the next state.
According to the present invention, a computer-readable storage medium is provided, in which a computer program is stored, which, when being executed by a processor, carries out the steps of the method as described above.
Compared with the prior art, the invention has the following beneficial effects:
the method analyzes the network scene by using the game theory, determines the state set of Q learning of different nodes, generates a decision by using a Q learning algorithm, and updates the competition window interval so as to optimize the overall network performance.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that various changes and modifications can be made by those skilled in the art without departing from the spirit of the invention, all of which fall within the scope of the present invention.
Example:
the invention provides a design scheme for using a game-based reinforcement learning competition window in a mobile ad hoc network, which comprises the following steps: the network node inquires the number of neighbor nodes and the load size of the network node by inquiring a routing table and a traffic statistic mode, broadcasts the information by an RTS/CTS mechanism, further determines the difference of the network, judges which game strategy is adopted according to the difference of the network, determines a state set of reinforcement learning according to the corresponding game strategy, then performs reinforcement learning, finally determines the interval of a competition window, and performs communication according to the state set of reinforcement learning. The method adopts a mode that the system carries out communication and training at the same time, namely, the system carries out reinforcement learning and communication according to the current state set, when the network load changes or the topology changes, the network node carries out decision-making again to determine the state set, then carries out reinforcement learning and communication according to the new state set, and continuously repeats the processes until the iteration is terminated.
The invention adopts the Q-Learning algorithm as the reinforcement-learning method. The optimization object is the left endpoint of the contention window CW, i.e. CW_min, and the goal is to improve the transmission success rate of the system and to reduce the average delay. The communication environment is modeled as follows:
State S: the current minimum contention window size of a node in communication, CW_min = 2^s − 1. The minimum contention window size directly determines the lower bound of the backoff time and indirectly affects its average value, and thus has a significant influence on channel collisions and on the data delay; the set of states s is given by the game strategy.
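For illustration only, the small script below enumerates the mapping from s to CW_min = 2^s − 1; the chosen range of s is an assumption, but the resulting values 7, 15, 31, 63, 127 and 255 correspond to the contention window values used in the decision stage later in this description.

# State s maps to the minimum contention window CW_min = 2**s - 1.
for s in range(3, 9):
    print(s, 2 ** s - 1)   # prints 7, 15, 31, 63, 127, 255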
Action A: the action in the present invention adjusts the value of the state s. In a communication system a network node must communicate with the current parameters for some time before its communication performance can be measured, so actions are taken periodically: after each equal time interval the state s is adjusted.
Reward R: the invention takes the change of the indicators measured over each communication period of a network node as the reward; the indicators are the transmission success rate PDR and the average delay DELAY. In order to reflect the degree of change of indicators with different dimensions, the indicators are normalized as follows:
R = μ·PDR + (1 − μ)·DELAY
where μ is a weighting factor representing the proportion of the transmission-success-rate change in the reward, and PDR and DELAY here denote the normalized changes of the two indicators; PDR_now and PDR_past denote the transmission success rate in the current unit time and in the previous unit time, respectively, and delay_now and delay_past denote the average delay in the current unit time and in the previous unit time, respectively.
Transition probability P: in the invention, all nodes independently perform the Q learning algorithm, and the selected action can be accurately executed by the communication equipment, so that the state transition probability is 1.
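One possible way to compute the reward R defined above from the per-period statistics is sketched below; since the exact normalization is not reproduced in the text, the relative-change form and the example value μ = 0.5 are assumptions consistent with the roles of PDR_now, PDR_past, delay_now and delay_past.

def reward(pdr_now, pdr_past, delay_now, delay_past, mu=0.5):
    # Hypothetical normalization: mu weighs the relative improvement of the
    # transmission success rate against the relative reduction of the delay.
    pdr_term = (pdr_now - pdr_past) / max(pdr_past, 1e-9)
    delay_term = (delay_past - delay_now) / max(delay_past, 1e-9)
    return mu * pdr_term + (1 - mu) * delay_term

# Example: PDR rises from 0.90 to 0.95 while the delay drops from 3.7 ms to 1.7 ms.
print(reward(0.95, 0.90, 1.7, 3.7))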
In a distributed network scenario the traffic loads of different nodes differ, which amplifies the influence of the choice of the state set of s on the network performance: if the state set spans too wide an interval, the reinforcement-learning results of different nodes may oscillate repeatedly; if the state set spans too narrow an interval, the backoff algorithm degenerates into a fixed-parameter mechanism that cannot adapt to changes in the environment. The invention therefore proposes a contention window design scheme based on the Cournot duopoly model of game theory, so as to adapt to differentiated network topologies.
The Cournot duopoly model from industrial-organization theory states that market capacity is limited and that the output quantities of the different firms determine the product price and hence each firm's profit. This is analogous to the channel-access problem: if the nodes access the channel with low probability the channel is under-utilized, while if every node accesses the channel frequently, collisions occur and the channel utilization drops.
Observing the influence of the backoff window on the average delay and the transmission success rate, one finds that under a given traffic load a larger backoff window lowers the collision probability but increases the average delay, so the transmission success rate and the average delay trade off against each other. A Cournot duopoly model is therefore used to describe the decision behaviour of two nodes, the game is then extended to multiple nodes, and finally a cooperative game model is designed for network scenarios with large differences.
Game-theoretic modeling:
Let q_1 and q_2 denote the capabilities of nodes 1 and 2 to access the channel (adjusted through their backoff parameter settings), and let Q = q_1 + q_2 denote the total channel-access capability in the network. Let P(Q) = k(a − Q) denote the capability of the network to transmit successfully under these backoff parameters, where a represents the maximum transmission capability of the channel and k is a constant. Let the signaling overhead of the network be a constant c, with c < a. Following the Cournot assumption, the two nodes decide their channel-access capabilities simultaneously.
Assuming that the profit of a node is positively and linearly correlated with the number of data packets it transmits successfully per unit time, in an ordinary two-player strategic game the profit u_i(s_i, s_j) of participant i can be written as:
u_i(q_i, q_j) = q_i[k(a − Q − c)]…………(1)
In the Cournot duopoly model, a pair of decisions (q_1*, q_2*) is a Nash equilibrium if, for each node i, q_i* solves the maximization problem
max over q_i of u_i(q_i, q_j*) = q_i[k(a − q_i − q_j* − c)].
Under Nash equilibrium the access parameters of the nodes must therefore satisfy the first-order conditions
a − 2q_1* − q_2* − c = 0 and a − q_1* − 2q_2* − c = 0.
Solving this system of equations yields q_1* = q_2* = (a − c)/3, which satisfies the above assumptions.
The physical meaning of the above equilibrium is that every network node wishes to monopolize the channel and selects the parameter q_i that maximizes its own revenue u_i(q_i, q_j). Before Q = q_1 + q_2 reaches the equilibrium value, each node tends to increase q_i in order to increase its own gain; after the equilibrium value is reached a node could continue to increase q_i, but for each node a unilateral deviation from the equilibrium solution reduces its own revenue, so under the non-cooperation premise such equilibrium-breaking behaviour does not occur.
In fact, a distributed network may contain n network nodes. Let q_i denote the access capability of node i and Q = q_1 + q_2 + … + q_n the total channel-access capability in the network; the inverse demand function is P(Q) = a − Q, and the signaling overhead is a fixed value c. No node knows the decisions of the other nodes, and each node decides independently. For the i-th network node the goal is to maximize its own revenue, i.e.:
max over q_i of u_i(q_i, q_{-i}) = q_i(a − Q − c)…………(4)
Solving equation (4) through the first-order condition a − Q − q_i − c = 0 for every node i yields q_i* = (a − c)/(n + 1) for all nodes.
the above results show that in the non-cooperative game model in which n nodes share a channel, when nash equilibrium is reached, all nodes will have the same access capability, and the sum of the access capabilities of all nodes will not exceed the upper limit of the channel carrying capability. In the distributed network, each node calculates the own equilibrium strategy according to the number of the own one-hop neighbor nodes, and the parameter is used as a state set to perform reinforcement learning to access a channel.
The cooperative game model: in the above game model every node aims to maximize its own communication performance and selects the corresponding equilibrium strategy; however, such an equilibrium strategy may cause the performance of some nodes to deteriorate rapidly when the node loads are unbalanced. A cooperative game model is therefore proposed to analyze the equilibrium strategy of each node with the goal of optimizing the overall performance of the network. To facilitate the analysis, the following assumptions are made:
each node has the capacity of counting the self traffic load. Let the traffic load of node i be LiThe channel transmission rate is RiDividing whether the node is a heavy service node according to the relative size of the service load to the channel transmission rate, and using a load factor liExpressed, defined as follows:
in general, we consider that when liWhen the load factor is larger than or equal to 1, the node is a heavy service node, and the load factor is increased linearly along with the increase of the service load; when l isi<1, the node is a normal node, and the load factor increases exponentially with the increase of the traffic load, so that when the node is almost unloaded, the load factor is very small and the growth trend is very small.
Node i obtains the number N_i of nodes within its one-hop range by querying the routing table; it obtains the neighbor counts of the surrounding one-hop nodes by broadcasting RTS/CTS control signaling and calculates their average neighbor count. Combining this information with the load factor l_i, node i obtains its weight y_i in the network according to equation (7), in which b is a constant whose physical meaning is to guarantee a basic weight for edge nodes in the network.
For node i, the benefit is defined as:
u_i(q_i, q_{-i}) = y_i·q_i·(a − Q)…………(8)
The overall revenue function is the sum of the revenues of all nodes, i.e.:
Σ_i u_i(q_i, q_{-i}) = Σ_i y_i·q_i·(a − Q)…………(9)
in quasi-static networks, the backoff window of each node must exist, so q
iP is the minimum value set by the backoff window, and the weight of the node in the network is kept unchanged in a short time, namely y
iIs a constant. Assuming that the channel transmission capacity a is constant, and
then the gain function is scaled, and the maximum value is solved to obtain:
the above equation shows that when the overall performance tends to be optimal, the capacities of the access channels of the nodes with non-maximum weights are all the minimum value p, and the access channel capacity of the node with the maximum weight can be obtained by calculation according to the actual situation.
The system model of the present invention is shown in fig. 1 and the implementation flow in fig. 9; the implementation of the invention is described below with reference to a specific example:
Step 1: for any node k, obtain the number N_k of its one-hop neighbor nodes by querying the routing table; obtain the neighbor counts N_i of the nodes within the one-hop communication range through the additional bit information of the RTS/CTS control signaling, giving Σ_i N_i, and calculate the average neighbor count of the neighbor nodes.
Step 2: for any node k, count the traffic load L_k in the current time period; according to the size of L_k relative to the channel transmission rate R_k, decide whether the node is a heavy-service node, which is expressed by the load factor l_k.
Step 3: for any node k, calculate the weight y_k from the load factor l_k, the neighbor count N_k and the average neighbor count of its neighbor nodes.
Step 4: for any node k, enter the decision stage and decide as follows according to the node's situation (a code sketch of this decision logic is given after the list below):
For the node with the maximum weight in its one-hop communication range: broadcast the weight y_i through RTS/CTS control signaling (an indication bit is added to the control signaling, with a corresponding increase in overhead), obtain the minimum node weight within the one-hop range, compute the difference index G, compare it with the difference-index threshold G_t to judge the degree of difference among the network nodes, and decide as follows:
(1) If the network nodes differ strongly, i.e. G > G_t, then for the node with the largest weight the decision is:
if the network traffic load is heavy, adopt the default backoff strategy and set the backoff window to (15, 1023);
if the network traffic load is light, adopt the balanced backoff strategy, set the parameter set of the minimum backoff window in reinforcement learning to CW_min ∈ {7, 15, 31}, and back off with the CW_min output by the reinforcement learning.
After the node with the maximum weight has made its decision, it informs the other neighbor nodes by broadcasting control information.
For the nodes whose weight is not the maximum within one hop, after the neighbor with the maximum weight has been identified, the decision is:
if the network traffic load is heavy, adopt the default backoff strategy and set the backoff window to (15, 1023);
if the network traffic load is light, adopt the balanced backoff strategy, set the parameter set of the minimum backoff window in reinforcement learning to CW_min ∈ {63, 127, 255}, and back off with the CW_min output by the reinforcement learning.
All nodes repeat the above actions until the whole-network decision is completed.
(2) If the network nodes differ little, i.e. G ≤ G_t, the decision is:
all nodes adopt the default backoff window (15, 1023).
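A compact sketch of the per-node decision logic of step 4 is given below; the function name, the heavy/light traffic classification and the threshold in the example call are assumptions, while the state sets {7, 15, 31} and {63, 127, 255} and the default window (15, 1023) come from the decisions above.

def choose_backoff_config(is_max_weight, g, g_threshold, heavy_traffic):
    # Returns either a fixed default window or a CW_min state set for Q-learning.
    if g <= g_threshold or heavy_traffic:
        return {"strategy": "default", "window": (15, 1023)}
    if is_max_weight:
        return {"strategy": "balanced", "cw_min_states": [7, 15, 31]}
    return {"strategy": "balanced", "cw_min_states": [63, 127, 255]}

# Example: the maximum-weight node in a strongly differentiated, lightly loaded network.
print(choose_backoff_config(is_max_weight=True, g=2.5, g_threshold=1.5, heavy_traffic=False))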
Step 5: when the network topology changes or the traffic load fluctuates strongly, repeat steps 1-4.
Simulation setting and performance analysis:
the NS3 software network simulator is used for simulation, a network scene with large node difference is considered, as shown in figure 2, a MAC protocol is CSMA/CA, related parameters refer to an IEEE802.11a protocol standard, 9 nodes are set to participate in communication, the maximum communication hop number is 2, and the load of a central node is obviously higher than that of other nodes. The network topology is shown in fig. 2, and the parameter settings are shown in table 1.
TABLE 1
In the simulation scenario shown in fig. 2, nodes 0 to 8 participate in the communication. Node 0 can communicate only with node 1, while node 1 can communicate with all nodes and is a neighbor of nodes 2-8. The configured communication links are: 0->1, 1->4, 2->0, 3->0, 4->5, 5->6, 6->7, 7->8, and all links carry the same load. Node 1 is a heavy-service node whose traffic accounts for 3/9 of the total traffic load and therefore has a larger weight.
Simulation result analysis:
according to the invention, a competition window design scheme adopting a game theory is simulated and compared with a default competition window design scheme, the transmission success rate and the average time delay are selected as judgment indexes, and a simulation result can be analyzed from three aspects of a whole body, a central node and a marginal node.
As shown in fig. 3 and fig. 4, when the network load does not exceed 3 Mbps the balancing strategy applies the game decisions that assign different backoff windows to different nodes; compared with the default backoff strategy the average delay increases and the transmission success rate improves, but both changes are small.
For the central node (node 1) the network performance is shown in fig. 5 and fig. 6. Although the traffic load of the central node is heavier, after the balanced backoff strategy is adopted the central node obtains more opportunities to access the channel because the backoff windows of the other nodes are extended, so the average delay drops from the previous 3.7 ms to 1.7 ms, a marked improvement, while the transmission success rate is maintained at 100%.
For the edge node (node 0) the network performance is shown in fig. 7 and fig. 8. For the edge node, the balanced backoff strategy improves the transmission success rate substantially, by about 10%, under low and medium traffic loads, while the average delay increases by only 2-3 ms, which is negligible for delay-insensitive services.
In summary, under low traffic load the network adopts the balanced backoff strategy, its average delay is almost the same as with the default strategy, and the transmission success rate increases slightly. The balancing strategy effectively improves the transmission success rate of the edge nodes and effectively reduces the average delay of the central node. Under medium and high loads the network switches its backoff strategy and the performance stays consistent with the default backoff strategy. The backoff algorithm based on the cooperative game model can therefore improve the transmission success rate of the whole network and of the edge nodes in differentiated networks under partial low load while hardly affecting the average delay, and it can further reduce the delay of the central node while maintaining its high transmission success rate. This game strategy is suitable for networks in which the central node is delay-sensitive and the other nodes are delay-insensitive.
Those skilled in the art will appreciate that, in addition to implementing the systems, apparatus, and various modules thereof provided by the present invention in purely computer readable program code, the same procedures can be implemented entirely by logically programming method steps such that the systems, apparatus, and various modules thereof are provided in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system, the device and the modules thereof provided by the present invention can be considered as a hardware component, and the modules included in the system, the device and the modules thereof for implementing various programs can also be considered as structures in the hardware component; modules for performing various functions may also be considered to be both software programs for performing the methods and structures within hardware components.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.