Disclosure of Invention
In view of the defects of the prior art, the invention aims to provide a method, a system and a medium for game-based Q-learning contention window adjustment.
The game-based Q-learning contention window adjustment method provided by the invention comprises the following steps:
Step 1: initialize the network node settings, including the communication protocol, the network topology, the traffic arrival model and the physical-layer standard; after communication starts, the network self-organizes and builds routing tables by broadcasting routing information; in the initial state each node backs off according to the default contention window size, and thereafter it adjusts the contention window size according to the action output by the Q-learning algorithm and backs off accordingly;
Step 2: every node in the network obtains its number of one-hop neighbor nodes from the routing table and broadcasts it to its neighbors through RTS/CTS signaling; it also receives the RTS/CTS information broadcast by its neighbors and computes the average neighbor count; by inspecting the traffic waiting in its buffer, each node calculates its traffic load per unit time and the corresponding load factor;
Step 3: each node calculates its weight in the network and broadcasts it, computes a network difference index from the maximum and minimum weights within its one-hop range, and then plays the game according to that difference index;
Step 4: if the network difference index exceeds a preset threshold, a balanced backoff strategy is adopted: the node with the largest weight in the network uses a smaller contention window state set in the reinforcement learning, while the other nodes use a larger contention window state set; otherwise the default backoff strategy is adopted and all nodes in the network use the same contention window state set;
Step 5: each node in the network performs Q-learning over the contention window state set generated in step 4, outputs an optimal contention window interval and communicates according to that interval;
Step 6: when the network topology changes or the traffic load fluctuates strongly, steps 2-5 are executed again.
Preferably, the importance of a node in the network is determined by querying the routing table and collecting traffic statistics. For any node k, the number N_k of its one-hop neighbor nodes is obtained by querying the routing table; the neighbor counts N_i of the neighbor nodes within the one-hop communication range are obtained from the RTS/CTS additional bit information, giving Σ_i N_i, from which the average neighbor count of the neighbor nodes is calculated.
For any node k, the traffic load L_k in the current time period is counted; according to the size of L_k relative to the channel transmission rate R_k, the node is classified as a heavy-service node or not, which is expressed by the load factor l_k.
The weight of node k is then calculated from the load factor l_k, the neighbor count N_k and the average neighbor count of its neighbor nodes, wherein b is a constant that guarantees a basic weight for edge nodes in the network.
Preferably, the node with the maximum weight within its one-hop communication range broadcasts its weight y_i through RTS/CTS control signaling, obtains the minimum node weight within the one-hop range, and calculates a difference index G; the degree of difference among the network nodes is determined according to the value of G.
If the value of G exceeds a preset threshold, the network is judged to be strongly differentiated and a balanced backoff strategy is adopted: the node with the largest weight uses a smaller contention window interval, i.e. a smaller left endpoint for the same right endpoint, while the node with the smallest weight uses a larger contention window interval, i.e. a larger left endpoint for the same right endpoint.
Preferably, each node acts as an agent; the left endpoint of the contention window interval in the backoff algorithm, CW_min = 2^s − 1, constitutes the environment state set, and adjusting the contention window left endpoint CW_min once per unit time constitutes the action set. Learning is performed with the network transmission success rate and the average delay as the optimization objectives, and the update formula is:
Q(S,A) ← Q(S,A) + α[r + γ·max_a Q(S',a) − Q(S,A)]
where γ is the discount factor, weighting the estimated future value max_a Q(S',a) in the update; r is the reward, a function evaluated from the transmission success rate and the average delay obtained by taking action A in the current state S; α is the learning-rate (convergence) factor, the main factor affecting the convergence speed; and S' is the next state.
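As an illustration, a minimal tabular Q-learning sketch of this update rule is given below (Python); the particular CW_min state set, the three shift actions, the ε-greedy exploration and the random placeholder reward are assumptions made for the example, and only the update formula itself is taken from the text above.

import random

# Illustrative tabular Q-learning over CW_min states; only the update rule
# inside the loop is taken from the text, everything else is an assumed example.
states = [7, 15, 31]          # example CW_min state set (values of 2**s - 1)
actions = [-1, 0, +1]         # move to a smaller / same / larger CW_min
alpha, gamma, eps = 0.5, 0.8, 0.1
Q = {(s, a): 0.0 for s in states for a in actions}

def step(state, action):
    # Hypothetical environment: returns (next_state, reward).
    idx = max(0, min(len(states) - 1, states.index(state) + action))
    # Placeholder reward; in the invention it would combine PDR and average delay.
    return states[idx], random.uniform(-1.0, 1.0)

state = random.choice(states)
for _ in range(1000):
    a = random.choice(actions) if random.random() < eps else max(actions, key=lambda x: Q[(state, x)])
    next_state, r = step(state, a)
    # Q(S,A) <- Q(S,A) + alpha * [r + gamma * max_a Q(S',a) - Q(S,A)]
    Q[(state, a)] += alpha * (r + gamma * max(Q[(next_state, x)] for x in actions) - Q[(state, a)])
    state = next_state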
The game-based Q-learning contention window adjustment system provided by the invention comprises:
Module M1: initializes the network node settings, including the communication protocol, the network topology, the traffic arrival model and the physical-layer standard; after communication starts, the network self-organizes and builds routing tables by broadcasting routing information; in the initial state each node backs off according to the default contention window size, and thereafter it adjusts the contention window size according to the action output by the Q-learning algorithm and backs off accordingly;
Module M2: every node in the network obtains its number of one-hop neighbor nodes from the routing table and broadcasts it to its neighbors through RTS/CTS signaling; it also receives the RTS/CTS information broadcast by its neighbors and computes the average neighbor count; by inspecting the traffic waiting in its buffer, each node calculates its traffic load per unit time and the corresponding load factor;
Module M3: each node calculates its weight in the network and broadcasts it, computes a network difference index from the maximum and minimum weights within its one-hop range, and then plays the game according to that difference index;
Module M4: if the network difference index exceeds a preset threshold, a balanced backoff strategy is adopted: the node with the largest weight in the network uses a smaller contention window state set in the reinforcement learning, while the other nodes use a larger contention window state set; otherwise the default backoff strategy is adopted and all nodes in the network use the same contention window state set;
Module M5: each node in the network performs Q-learning over the contention window state set generated by module M4, outputs an optimal contention window interval and communicates according to that interval;
Module M6: when the network topology changes or the traffic load fluctuates strongly, modules M2-M5 are invoked again in sequence.
Preferably, the importance of a node in the network is determined by querying the routing table and collecting traffic statistics. For any node k, the number N_k of its one-hop neighbor nodes is obtained by querying the routing table; the neighbor counts N_i of the neighbor nodes within the one-hop communication range are obtained from the RTS/CTS additional bit information, giving Σ_i N_i, from which the average neighbor count of the neighbor nodes is calculated.
For any node k, the traffic load L_k in the current time period is counted; according to the size of L_k relative to the channel transmission rate R_k, the node is classified as a heavy-service node or not, which is expressed by the load factor l_k.
The weight of node k is then calculated from the load factor l_k, the neighbor count N_k and the average neighbor count of its neighbor nodes, wherein b is a constant that guarantees a basic weight for edge nodes in the network.
Preferably, the node with the maximum weight within its one-hop communication range broadcasts its weight y_i through RTS/CTS control signaling, obtains the minimum node weight within the one-hop range, and calculates a difference index G; the degree of difference among the network nodes is determined according to the value of G.
If the value of G exceeds a preset threshold, the network is judged to be strongly differentiated and a balanced backoff strategy is adopted: the node with the largest weight uses a smaller contention window interval, i.e. a smaller left endpoint for the same right endpoint, while the node with the smallest weight uses a larger contention window interval, i.e. a larger left endpoint for the same right endpoint.
Preferably, each node acts as an agent; the left endpoint of the contention window interval in the backoff algorithm, CW_min = 2^s − 1, constitutes the environment state set, and adjusting the contention window left endpoint CW_min once per unit time constitutes the action set. Learning is performed with the network transmission success rate and the average delay as the optimization objectives, and the update formula is:
Q(S,A) ← Q(S,A) + α[r + γ·max_a Q(S',a) − Q(S,A)]
where γ is the discount factor, weighting the estimated future value max_a Q(S',a) in the update; r is the reward, a function evaluated from the transmission success rate and the average delay obtained by taking action A in the current state S; α is the learning-rate (convergence) factor, the main factor affecting the convergence speed; and S' is the next state.
According to the present invention, a computer-readable storage medium is provided, in which a computer program is stored, which, when being executed by a processor, carries out the steps of the method as described above.
Compared with the prior art, the invention has the following beneficial effects:
the method analyzes the network scene by using the game theory, determines the state set of Q learning of different nodes, generates a decision by using a Q learning algorithm, and updates the competition window interval so as to optimize the overall network performance.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that various changes and modifications can be made by those skilled in the art without departing from the spirit of the invention, all of which fall within the scope of the present invention.
Example:
the invention provides a design scheme for using a game-based reinforcement learning competition window in a mobile ad hoc network, which comprises the following steps: the network node inquires the number of neighbor nodes and the load size of the network node by inquiring a routing table and a traffic statistic mode, broadcasts the information by an RTS/CTS mechanism, further determines the difference of the network, judges which game strategy is adopted according to the difference of the network, determines a state set of reinforcement learning according to the corresponding game strategy, then performs reinforcement learning, finally determines the interval of a competition window, and performs communication according to the state set of reinforcement learning. The method adopts a mode that the system carries out communication and training at the same time, namely, the system carries out reinforcement learning and communication according to the current state set, when the network load changes or the topology changes, the network node carries out decision-making again to determine the state set, then carries out reinforcement learning and communication according to the new state set, and continuously repeats the processes until the iteration is terminated.
The invention adopts the Q-Learning algorithm as the reinforcement-learning method. The optimization object is the left endpoint of the contention window CW, i.e. CW_min, and the goal is to improve the transmission success rate of the system and to reduce the average delay. The communication environment is modeled as follows:
State S: the current minimum contention window size of a node in communication, CW_min = 2^s − 1. The minimum contention window size directly determines the lower bound of the backoff time and indirectly affects its average value, and thus has a significant influence on channel collisions and on the data delay; the set of states s is given by the game strategy.
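For illustration only, the small script below enumerates the mapping from s to CW_min = 2^s − 1; the chosen range of s is an assumption, but the resulting values 7, 15, 31, 63, 127 and 255 correspond to the contention window values used in the decision stage later in this description.

# State s maps to the minimum contention window CW_min = 2**s - 1.
for s in range(3, 9):
    print(s, 2 ** s - 1)   # prints 7, 15, 31, 63, 127, 255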
Action A: the action in the present invention adjusts the value of the state s. In a communication system a network node must communicate with the current parameters for some time before its communication performance can be measured, so actions are taken periodically: after each equal time interval the state s is adjusted.
Reward R: the invention takes the change of the indicators measured over each communication period of a network node as the reward; the indicators are the transmission success rate PDR and the average delay DELAY. In order to reflect the degree of change of indicators with different dimensions, the indicators are normalized as follows:
R = μ·PDR + (1 − μ)·DELAY
where μ is a weighting factor representing the proportion of the transmission-success-rate change in the reward, and PDR and DELAY here denote the normalized changes of the two indicators; PDR_now and PDR_past denote the transmission success rate in the current unit time and in the previous unit time, respectively, and delay_now and delay_past denote the average delay in the current unit time and in the previous unit time, respectively.
Transition probability P: in the invention, all nodes independently perform the Q learning algorithm, and the selected action can be accurately executed by the communication equipment, so that the state transition probability is 1.
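One possible way to compute the reward R defined above from the per-period statistics is sketched below; since the exact normalization is not reproduced in the text, the relative-change form and the example value μ = 0.5 are assumptions consistent with the roles of PDR_now, PDR_past, delay_now and delay_past.

def reward(pdr_now, pdr_past, delay_now, delay_past, mu=0.5):
    # Hypothetical normalization: mu weighs the relative improvement of the
    # transmission success rate against the relative reduction of the delay.
    pdr_term = (pdr_now - pdr_past) / max(pdr_past, 1e-9)
    delay_term = (delay_past - delay_now) / max(delay_past, 1e-9)
    return mu * pdr_term + (1 - mu) * delay_term

# Example: PDR rises from 0.90 to 0.95 while the delay drops from 3.7 ms to 1.7 ms.
print(reward(0.95, 0.90, 1.7, 3.7))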
In a distributed network scenario the traffic loads of different nodes differ, which amplifies the influence of the choice of the state set of s on the network performance: if the state set spans too wide an interval, the reinforcement-learning results of different nodes may oscillate repeatedly; if the state set spans too narrow an interval, the backoff algorithm degenerates into a fixed-parameter mechanism that cannot adapt to changes in the environment. The invention therefore proposes a contention window design scheme based on the Cournot duopoly model of game theory, so as to adapt to differentiated network topologies.
The Cournot duopoly model from industrial-organization theory states that market capacity is limited and that the output quantities of the different firms determine the product price and hence each firm's profit. This is analogous to the channel-access problem: if the nodes access the channel with low probability the channel is under-utilized, while if every node accesses the channel frequently, collisions occur and the channel utilization drops.
Observing the influence of the backoff window on the average delay and the transmission success rate, one finds that under a given traffic load a larger backoff window lowers the collision probability but increases the average delay, so the transmission success rate and the average delay trade off against each other. A Cournot duopoly model is therefore used to describe the decision behaviour of two nodes, the game is then extended to multiple nodes, and finally a cooperative game model is designed for network scenarios with large differences.
Game-theoretic modeling:
Let q_1 and q_2 denote the capabilities of nodes 1 and 2 to access the channel (adjusted through their backoff parameter settings), and let Q = q_1 + q_2 denote the total channel-access capability in the network. Let P(Q) = k(a − Q) denote the capability of the network to transmit successfully under these backoff parameters, where a represents the maximum transmission capability of the channel and k is a constant. Let the signaling overhead of the network be a constant c, with c < a. Following the Cournot assumption, the two nodes decide their channel-access capabilities simultaneously.
Assuming that the profit of a node is positively and linearly correlated with the number of data packets it transmits successfully per unit time, in an ordinary two-player strategic game the profit u_i(s_i, s_j) of participant i can be written as:
u_i(q_i, q_j) = q_i[k(a − Q − c)]…………(1)
In the Cournot duopoly model, a pair of decisions (q_1*, q_2*) is a Nash equilibrium if, for each node i, q_i* solves the maximization problem
max over q_i of u_i(q_i, q_j*) = q_i[k(a − q_i − q_j* − c)].
Under Nash equilibrium the access parameters of the nodes must therefore satisfy the first-order conditions
a − 2q_1* − q_2* − c = 0 and a − q_1* − 2q_2* − c = 0.
Solving this system of equations yields q_1* = q_2* = (a − c)/3, which satisfies the above assumptions.
The physical meaning of the above equilibrium is that every network node wishes to monopolize the channel and selects the parameter q_i that maximizes its own revenue u_i(q_i, q_j). Before Q = q_1 + q_2 reaches the equilibrium value, each node tends to increase q_i in order to increase its own gain; after the equilibrium value is reached a node could continue to increase q_i, but for each node a unilateral deviation from the equilibrium solution reduces its own revenue, so under the non-cooperation premise such equilibrium-breaking behaviour does not occur.
In fact, a distributed network may contain n network nodes. Let q_i denote the access capability of node i and Q = q_1 + q_2 + … + q_n the total channel-access capability in the network; the inverse demand function is P(Q) = a − Q, and the signaling overhead is a fixed value c. No node knows the decisions of the other nodes, and each node decides independently. For the i-th network node the goal is to maximize its own revenue, i.e.:
max over q_i of u_i(q_i, q_{-i}) = q_i(a − Q − c)…………(4)
Solving equation (4) through the first-order condition a − Q − q_i − c = 0 for every node i yields q_i* = (a − c)/(n + 1) for all nodes.
the above results show that in the non-cooperative game model in which n nodes share a channel, when nash equilibrium is reached, all nodes will have the same access capability, and the sum of the access capabilities of all nodes will not exceed the upper limit of the channel carrying capability. In the distributed network, each node calculates the own equilibrium strategy according to the number of the own one-hop neighbor nodes, and the parameter is used as a state set to perform reinforcement learning to access a channel.
The cooperative game model: in the above game model every node aims to maximize its own communication performance and selects the corresponding equilibrium strategy; however, such an equilibrium strategy may cause the performance of some nodes to deteriorate rapidly when the node loads are unbalanced. A cooperative game model is therefore proposed to analyze the equilibrium strategy of each node with the goal of optimizing the overall performance of the network. To facilitate the analysis, the following assumptions are made:
each node has the capacity of counting the self traffic load. Let the traffic load of node i be LiThe channel transmission rate is RiDividing whether the node is a heavy service node according to the relative size of the service load to the channel transmission rate, and using a load factor liExpressed, defined as follows:
in general, we consider that when liWhen the load factor is larger than or equal to 1, the node is a heavy service node, and the load factor is increased linearly along with the increase of the service load; when l isi<1, the node is a normal node, and the load factor increases exponentially with the increase of the traffic load, so that when the node is almost unloaded, the load factor is very small and the growth trend is very small.
Node i obtains the number N_i of nodes within its one-hop range by querying the routing table; it obtains the neighbor counts of the surrounding one-hop nodes by broadcasting RTS/CTS control signaling and calculates their average neighbor count. Combining this information with the load factor l_i, node i obtains its weight y_i in the network according to equation (7), in which b is a constant whose physical meaning is to guarantee a basic weight for edge nodes in the network.
For node i, the benefit is defined as:
u_i(q_i, q_{-i}) = y_i·q_i·(a − Q)…………(8)
The overall revenue function is the sum of the revenues of all nodes, i.e.:
Σ_i u_i(q_i, q_{-i}) = Σ_i y_i·q_i·(a − Q)…………(9)
in quasi-static networks, the backoff window of each node must exist, so q
iP is the minimum value set by the backoff window, and the weight of the node in the network is kept unchanged in a short time, namely y
iIs a constant. Assuming that the channel transmission capacity a is constant, and
then the gain function is scaled, and the maximum value is solved to obtain:
the above equation shows that when the overall performance tends to be optimal, the capacities of the access channels of the nodes with non-maximum weights are all the minimum value p, and the access channel capacity of the node with the maximum weight can be obtained by calculation according to the actual situation.
The system model of the present invention is shown in fig. 1 and the implementation flow in fig. 9; the implementation of the invention is described below with reference to a specific example:
Step 1: for any node k, obtain the number N_k of its one-hop neighbor nodes by querying the routing table; obtain the neighbor counts N_i of the nodes within the one-hop communication range through the additional bit information of the RTS/CTS control signaling, giving Σ_i N_i, and calculate the average neighbor count of the neighbor nodes.
Step 2: for any node k, count the traffic load L_k in the current time period; according to the size of L_k relative to the channel transmission rate R_k, decide whether the node is a heavy-service node, which is expressed by the load factor l_k.
Step 3: for any node k, calculate the weight y_k from the load factor l_k, the neighbor count N_k and the average neighbor count of its neighbor nodes.
Step 4: for any node k, enter the decision stage and decide as follows according to the node's situation (a code sketch of this decision logic is given after the list below):
For the node with the maximum weight in its one-hop communication range: broadcast the weight y_i through RTS/CTS control signaling (an indication bit is added to the control signaling, with a corresponding increase in overhead), obtain the minimum node weight within the one-hop range, compute the difference index G, compare it with the difference-index threshold G_t to judge the degree of difference among the network nodes, and decide as follows:
(1) If the network nodes differ strongly, i.e. G > G_t, then for the node with the largest weight the decision is:
if the network traffic load is heavy, adopt the default backoff strategy and set the backoff window to (15, 1023);
if the network traffic load is light, adopt the balanced backoff strategy, set the parameter set of the minimum backoff window in reinforcement learning to CW_min ∈ {7, 15, 31}, and back off with the CW_min output by the reinforcement learning.
After the node with the maximum weight has made its decision, it informs the other neighbor nodes by broadcasting control information.
For the nodes whose weight is not the maximum within one hop, after the neighbor with the maximum weight has been identified, the decision is:
if the network traffic load is heavy, adopt the default backoff strategy and set the backoff window to (15, 1023);
if the network traffic load is light, adopt the balanced backoff strategy, set the parameter set of the minimum backoff window in reinforcement learning to CW_min ∈ {63, 127, 255}, and back off with the CW_min output by the reinforcement learning.
All nodes repeat the above actions until the whole-network decision is completed.
(2) If the network nodes differ little, i.e. G ≤ G_t, the decision is:
all nodes adopt the default backoff window (15, 1023).
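A compact sketch of the per-node decision logic of step 4 is given below; the function name, the heavy/light traffic classification and the threshold in the example call are assumptions, while the state sets {7, 15, 31} and {63, 127, 255} and the default window (15, 1023) come from the decisions above.

def choose_backoff_config(is_max_weight, g, g_threshold, heavy_traffic):
    # Returns either a fixed default window or a CW_min state set for Q-learning.
    if g <= g_threshold or heavy_traffic:
        return {"strategy": "default", "window": (15, 1023)}
    if is_max_weight:
        return {"strategy": "balanced", "cw_min_states": [7, 15, 31]}
    return {"strategy": "balanced", "cw_min_states": [63, 127, 255]}

# Example: the maximum-weight node in a strongly differentiated, lightly loaded network.
print(choose_backoff_config(is_max_weight=True, g=2.5, g_threshold=1.5, heavy_traffic=False))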
Step 5: when the network topology changes or the traffic load fluctuates strongly, repeat steps 1-4.
Simulation setting and performance analysis:
the NS3 software network simulator is used for simulation, a network scene with large node difference is considered, as shown in figure 2, a MAC protocol is CSMA/CA, related parameters refer to an IEEE802.11a protocol standard, 9 nodes are set to participate in communication, the maximum communication hop number is 2, and the load of a central node is obviously higher than that of other nodes. The network topology is shown in fig. 2, and the parameter settings are shown in table 1.
TABLE 1
In the simulation scenario shown in fig. 2, nodes 0 to 8 participate in the communication. Node 0 can communicate only with node 1, while node 1 can communicate with all nodes and is a neighbor of nodes 2-8. The configured communication links are: 0->1, 1->4, 2->0, 3->0, 4->5, 5->6, 6->7, 7->8, and all links carry the same load. Node 1 is a heavy-service node whose traffic accounts for 3/9 of the total traffic load and therefore has a larger weight.
Simulation result analysis:
according to the invention, a competition window design scheme adopting a game theory is simulated and compared with a default competition window design scheme, the transmission success rate and the average time delay are selected as judgment indexes, and a simulation result can be analyzed from three aspects of a whole body, a central node and a marginal node.
As shown in fig. 3 and fig. 4, when the network load does not exceed 3 Mbps the balancing strategy applies the game decisions that assign different backoff windows to different nodes; compared with the default backoff strategy the average delay increases and the transmission success rate improves, but both changes are small.
For the central node (node 1) the network performance is shown in fig. 5 and fig. 6. Although the traffic load of the central node is heavier, after the balanced backoff strategy is adopted the central node obtains more opportunities to access the channel because the backoff windows of the other nodes are extended, so the average delay drops from the previous 3.7 ms to 1.7 ms, a marked improvement, while the transmission success rate is maintained at 100%.
For the edge node (node 0) the network performance is shown in fig. 7 and fig. 8. For the edge node, the balanced backoff strategy improves the transmission success rate substantially, by about 10%, under low and medium traffic loads, while the average delay increases by only 2-3 ms, which is negligible for delay-insensitive services.
In summary, under low traffic load the network adopts the balanced backoff strategy, its average delay is almost the same as with the default strategy, and the transmission success rate increases slightly. The balancing strategy effectively improves the transmission success rate of the edge nodes and effectively reduces the average delay of the central node. Under medium and high loads the network switches its backoff strategy and the performance stays consistent with the default backoff strategy. The backoff algorithm based on the cooperative game model can therefore improve the transmission success rate of the whole network and of the edge nodes in differentiated networks under partial low load while hardly affecting the average delay, and it can further reduce the delay of the central node while maintaining its high transmission success rate. This game strategy is suitable for networks in which the central node is delay-sensitive and the other nodes are delay-insensitive.
Those skilled in the art will appreciate that, in addition to implementing the systems, apparatus, and various modules thereof provided by the present invention in purely computer readable program code, the same procedures can be implemented entirely by logically programming method steps such that the systems, apparatus, and various modules thereof are provided in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system, the device and the modules thereof provided by the present invention can be considered as a hardware component, and the modules included in the system, the device and the modules thereof for implementing various programs can also be considered as structures in the hardware component; modules for performing various functions may also be considered to be both software programs for performing the methods and structures within hardware components.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.