WO2025171818A1 - Data processing method, apparatus and device, and computer-readable storage medium - Google Patents
- Publication number
- WO2025171818A1 (PCT/CN2025/077841)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- network
- target
- risk
- state information
- information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F18/20: Pattern recognition; Analysing
- G06N3/0464: Neural networks; Architecture, e.g. interconnection topology; Convolutional networks [CNN, ConvNet]
- G06N3/0499: Neural networks; Architecture, e.g. interconnection topology; Feedforward networks
- G06N3/08: Neural networks; Learning methods
- G06Q40/04: Finance; Trading; Exchange, e.g. stocks, commodities, derivatives or currency exchange
Definitions
- the present application relates to the fields of financial risk control and artificial intelligence technology, and in particular to a data processing method, apparatus, device, and computer-readable storage medium.
- the total value of the hedging portfolio over the life of the financial derivative contract is kept equal to, or not less than, the value of the financial derivative, that is, the value of one or more payments (i.e., the payment amount) stipulated in the financial derivative contract to be made to the buyer of the financial derivative.
- the payment amount is determined by the price of the underlying asset. For example, if the financial derivative is a CSI 300 Index put option, the underlying asset is the CSI 300 Index.
- the contractually agreed-upon amount payable to the option buyer is equal to the maximum of zero and the difference obtained by subtracting the market price of the underlying asset (i.e., the CSI 300 Index) from the contractually agreed-upon strike price.
- the hedging error of a hedge portfolio is the difference between the value of the hedge portfolio and the value of the financial derivative. When the hedging error is less than zero—that is, when the value of the hedge portfolio falls below the value of the derivative—the securities firm incurs a loss. Therefore, securities firms need to effectively manage the financial risks associated with selling financial derivatives, particularly the tail risk of hedging error, which is the risk of the firm incurring large losses.
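- As an illustration of the two quantities just described (the notation P_t, V_t, W_t and K is introduced here for exposition and does not come from the claims), the payment of a European index put and the hedging error can be written as:

```latex
% amount payable to the buyer of a put with strike K at expiration T
\text{payment at } T = \max\bigl(K - S_T,\; 0\bigr)

% hedging error: hedge portfolio value P_t minus derivative value V_t; a loss occurs when W_t < 0
W_t = P_t - V_t
```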
- the main purpose of this application is to provide a data processing method, apparatus, device and computer-readable storage medium, which aims to solve the technical problem that the current model training method cannot reduce the time cost, computing power cost and hardware resource cost of training while considering the tail risk of the hedging error of the hedge portfolio, and solve the technical problem that the existing method considering tail risk can only use data generated by simulation based on parameter model for training, but cannot directly use real market observation data for training, thereby failing to directly and effectively use the information of real market data, and failing to avoid modeling errors and parameter estimation errors.
- the present application provides a data processing method, which includes the following steps:
- a feedback sample set is obtained by performing multiple interactions with the environment using a policy network to be trained, a value function network to be trained, and a tail risk network to be trained, wherein the policy network is used to output policy information based on state information at a target moment, and the policy information is used to generate an adjustment action for the financial derivatives hedging portfolio at the target moment.
- the feedback sample set includes multiple samples, each of which includes sample data used to calculate a loss value, at least one of which is calculated based on reward information, the reward information is calculated based on a hedging error of the financial derivatives hedging portfolio and a tail risk of the hedging error, the tail risk is calculated based on risk information, and the risk information is predicted by the tail risk network based on initial state information corresponding to the financial derivative and/or state information corresponding to the target moment;
- the trained policy network is obtained, for adjusting the financial derivatives hedging portfolio based on the policy network.
- the step of using the policy network to be trained, the value function network to be trained, and the tail risk network to be trained to interact with the environment multiple times to obtain a feedback sample set includes:
- trajectory data corresponding to a target financial derivative in a preset environmental data set is obtained, the trajectory data including a price trajectory, a delivery amount trajectory, a risk-free interest rate trajectory, a price trajectory of each asset within the contract period, and a cash flow trajectory of each asset within the contract period for the target financial derivative;
- the to-be-trained policy network, the to-be-trained value function network, and the to-be-trained tail risk network are used to interact at least once with an environment defined by the trajectory data corresponding to the target financial derivative, to obtain a sub-feedback sample set corresponding to the target financial derivative, wherein the sub-feedback sample set includes a plurality of samples, each of which includes sample data for calculating a loss value;
- the feedback sample set is generated by using the sub-feedback sample sets corresponding to a plurality of financial derivatives with different initial state information.
- the value function network is configured to output a state value based on state information input at a target time, at least one item of sample data is calculated based on the state value, and the step of using the to-be-trained policy network, the to-be-trained value function network, and the to-be-trained tail risk network to interact at least once with an environment defined by the trajectory data corresponding to the target financial derivative to obtain a sub-feedback sample set corresponding to the target financial derivative includes:
- initial state information corresponding to the target financial derivative is determined according to the trajectory data corresponding to the target financial derivative, wherein the initial state information is the state information corresponding to the start time of the contract period;
- the reward information is calculated based on the hedging error of the target financial derivative hedging portfolio at the moment immediately following the target time and the tail risk of the hedging error, wherein the tail risk is calculated based on risk information, which is calculated by inputting initial state information corresponding to the target financial derivative and/or state information corresponding to the target time into a tail risk network to be trained;
- the first sample includes sample data for calculating the policy loss value corresponding to the policy network and sample data for calculating the evaluation loss value corresponding to the value function network;
- the second sample includes sample data for calculating the risk information predicted loss value corresponding to the tail risk network
- a sub-feedback sample set corresponding to the target financial derivative is generated based on the first sample and the second sample.
- when the risk information is calculated by respectively inputting the initial state information corresponding to the target financial derivative and the state information corresponding to the target time into the tail risk network to be trained, the second sample includes the initial state information of the target financial derivative, the state information corresponding to the target time, and the hedging error corresponding to the end time of the contract period.
- when the risk information is calculated by inputting the state information corresponding to the target financial derivative at the target time into the tail risk network to be trained, the second sample includes the state information corresponding to the target financial derivative at the target time and the hedging error corresponding to the end time of the contract period.
- before the step of obtaining trajectory data corresponding to a target financial derivative in the preset environmental data set, the method further includes:
- the environmental data set is obtained according to the trajectory data corresponding to each financial derivative with different initial state information.
- the feedback sample set includes a plurality of first samples and second samples, wherein the first samples include sample data for calculating the policy loss value corresponding to the policy network and sample data for calculating the evaluation loss value corresponding to the value function network, and the second samples include sample data for calculating the risk information prediction loss value corresponding to the tail risk network;
- the steps of using the feedback sample set to perform a round of training on the policy network, the value function network, and the tail risk network respectively include:
- the value function network includes two independent multi-layer feedforward neural networks, each of which includes an input layer, several hidden layers and an output layer, and the number of nodes in the input layer is consistent with that in the policy network.
- Each hidden layer contains several nodes, and the output layer has only one node for outputting a value function variable for evaluating the value of the input state information.
- the present application further provides a data processing device, comprising:
- the present application also provides a data processing device, which includes: a memory, a processor, and a data processing program stored in the memory and executable on the processor, wherein the data processing program implements the steps of the data processing method described above when executed by the processor.
- the feedback sample set includes multiple samples, each of which includes sample data for calculating a loss value, at least one sample data item is calculated based on reward information, the reward information is calculated based on a hedging error of the financial derivatives hedging portfolio and a tail risk of the hedging error, the tail risk is calculated based on risk information, and the risk information is predicted by the tail risk network based on initial state information corresponding to the financial derivative and/or state information corresponding to the target time; the policy network, the value function network, and the tail risk network are each trained for at least one round using the feedback sample set; after each network is trained for at least one round and a preset training end condition is detected to be met, the trained policy network is obtained for use in adjusting the financial derivatives hedging portfolio based on the policy network.
- the policy loss value is calculated based on reward information, which is calculated based on the tail risk of the hedging error of the financial derivative hedging portfolio
- adjustment actions made based on the trained policy network can achieve a lower tail risk of the hedging error.
- the risk information used to calculate the tail risk is predicted by the tail risk network based on the input initial state information and/or the state information corresponding to the target time, and the tail risk network is trained based on the loss value predicted by the risk information, it is possible to calculate different tail risks of the hedging error for the same financial derivative with different initial state information, thereby allowing the feedback sample set to include multiple sub-feedback sample sets corresponding to the same financial derivative with different initial state information.
- the policy network trained based on the feedback sample set is applicable to each financial derivative of the same type with different initial state information, eliminating the need to train different models for each financial derivative of the same type with different initial state information, thereby reducing the time cost, computing power cost, and hardware resource cost of training the policy network.
- existing methods for considering tail risk can only use data generated by parameter model simulation for training, and cannot directly use real market observation data for training. This makes it impossible to directly and effectively use the information of real market data, nor can it avoid modeling errors and parameter estimation errors.
- the solution provided in the embodiments of the present application can be trained using real market data alone, so it can avoid modeling errors and parameter estimation errors, and can also directly use the information in real market data, thereby effectively reducing the tail risk of hedging errors, which helps securities companies carry out risk management.
- FIG1 is a flow chart of an embodiment of a data processing method of the present application.
- FIG2 is a schematic diagram of the structure of the hardware operating environment involved in the embodiment of the present application.
- FIG3 is a flow chart of another embodiment of the data processing method of the present application.
- FIG4 is a flow chart of another embodiment of the data processing method of the present application.
- FIG5 is a flow chart of another embodiment of the data processing method of the present application.
- FIG6 is a flow chart of another embodiment of the data processing method of the present application.
- FIG7 is a flow chart of another embodiment of the data processing method of the present application.
- the embodiments of the present application provide a data processing method. It should be noted that although a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in an order different from that shown here.
- the execution subject of the data processing method can be a smart phone, a personal computer, a server, or another device, which is not limited in this embodiment. In this embodiment, for ease of description, the execution subject is omitted from the description of each step.
- the data processing method includes the following steps:
- Step S10 using the policy network to be trained, the value function network to be trained, and the tail risk network to be trained to interact with the environment multiple times to obtain a feedback sample set, wherein the policy network is used to output policy information based on the state information at the target moment, and the policy information is used to generate an adjustment action for the financial derivatives hedging portfolio at the target moment; the feedback sample set includes multiple samples, each of which includes sample data for calculating a loss value; at least one sample data item is calculated based on the reward information; the reward information is calculated based on the hedging error of the financial derivatives hedging portfolio and the tail risk of the hedging error; the tail risk is calculated according to the risk information; and the risk information is predicted by the tail risk network based on the initial state information corresponding to the financial derivative and/or the state information corresponding to the target moment.
- Financial derivatives are financial products whose value depends on the value of the underlying asset. They are an important tool for businesses and financial institutions to manage financial risk. For example, steel producers often hold iron ore inventories. To hedge against the risk of future price declines, they may purchase iron ore put options from securities firms to hedge their value. The underlying asset of these put options is iron ore, and their value depends on the price of iron ore.
- when securities firms sell financial derivatives, they construct a hedge portfolio to replicate the derivative.
- the hedge portfolio consists of cash and hedge assets. Before the expiration of a financial derivative contract, securities firms adjust the amount of hedge assets in the portfolio to ensure that, during the contract period, the total value of the hedge portfolio is equal to or greater than the value of the derivative.
- the payment amount (which may be negative) is determined by the price of the underlying asset.
- a European iron ore put option contract stipulates that on the option's expiration date, the option seller must pay the option buyer the maximum of zero and the difference between the strike price agreed in the contract and the price of iron ore on the expiration date.
- the hedging error of a hedge portfolio is the difference between the value of the hedge portfolio and the value of the financial derivative. When the hedging error is less than zero, that is, when the value of the hedge portfolio is lower than the value of the financial derivative, the securities company will bear the loss.
- the policy network can be a neural network that outputs policy information based on the state information of the financial derivative at the target time.
- the policy information is used to generate adjustments to the financial derivative's hedging portfolio at the target time.
- the target time refers to any time within the contract period of the financial derivative.
- the contract period starts at the time the financial derivative is sold, and ends at the expiration time specified in the financial derivative contract (hereinafter referred to as the contract expiration time).
- the state information corresponding to the financial derivative at the target time may include the values of each state variable corresponding to the financial derivative at the target time.
- each state variable corresponding to the financial derivative may include information variables that affect or are correlated with the hedging error of the financial derivative's hedging portfolio, the price of the financial derivative, the price or cash flow of the hedging asset, or the price or cash flow of the underlying asset.
- the information variables may include the volatility of the underlying asset of the financial derivative, the number of each hedging asset in the hedging portfolio of the financial derivative, etc.
- the specific information variables included can be set as needed and are not limited here.
- for different financial derivatives, the types of parameters agreed in the contract that affect the price of the financial derivative are different.
- the parameter that affects the price of the European option agreed in the contract of the European option is the strike price of the European option (which can be represented by K);
- the parameters that affect the price of the knock-out call option agreed in the contract of the knock-out call option are the strike price and the knock-out barrier price of the knock-out call option;
- the parameters that affect the price of the snowball option agreed in the contract of the snowball option are the knock-in barrier price, the knock-out barrier price and the coupon rate of the snowball option.
- the state information corresponding to the financial derivative at the target moment can also include the sensitivity information of the financial derivative to price changes of the underlying asset and other information that affects the value of the derivative and the hedging error.
- the sensitivity information can include the Delta of the financial derivative (which can be denoted Δ_t), Gamma (which can be denoted Γ_t), Vega (which can be denoted ν_t) and Theta (which can be denoted Θ_t). It should be noted that when the hedging asset in the hedging portfolio of a financial derivative is the same as the underlying asset of the financial derivative, S_t and G_t are the same, and G_t can be omitted from the state information.
- the state information s t corresponding to the European option at time t can be expressed as:
- when S_t and G_t are the same, G_t can be omitted from the state information.
- S_0 represents the price of the underlying asset of the European option at the initial moment; it can also be replaced by K.
- the state information s t at time t corresponding to the upward knock-out call option can be expressed as:
- J is the knock-out barrier agreed in the contract for the upward knock-out call option
- K is the strike price
- the state information s t corresponding to the Snowball option at time t can be expressed as:
- ⁇ (s 0 ; ⁇ ) is a function with the initial state information s 0 as input, and the tail risk network of the aforementioned neural network structure can be used.
- the dimension of the input layer is the same as the dimension of the state information s_t, and the network contains three linear hidden layers with 64 neurons each; a nonlinear activation function (such as Swish) and a batch normalization layer are applied after each linear hidden layer, and the output layer is 1-dimensional.
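- A minimal sketch of such a network, written in PyTorch for illustration (the class name, framework choice, and constructor arguments are assumptions; the application fixes only the layer sizes, the Swish activation, batch normalization, and the 1-dimensional output):

```python
import torch
import torch.nn as nn

class TailRiskNetwork(nn.Module):
    """MLP implementing rho(s; omega): state (or initial state) information -> risk information."""
    def __init__(self, state_dim: int, hidden_dim: int = 64):
        super().__init__()
        layers, in_dim = [], state_dim
        for _ in range(3):                          # three linear hidden layers with 64 neurons
            layers += [nn.Linear(in_dim, hidden_dim),
                       nn.SiLU(),                   # Swish activation
                       nn.BatchNorm1d(hidden_dim)]  # batch normalization after each hidden layer
            in_dim = hidden_dim
        layers.append(nn.Linear(in_dim, 1))         # 1-dimensional output
        self.net = nn.Sequential(*layers)

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s)
```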
- the tail risk network can be divided into two cases: 1. the model is trained on samples of financial derivatives with unique initial state information.
- when the policy network is set to output two parameter variables of the probability distribution obeyed by the adjustment action, the value function network can be used to output the state value based on the state information input at the target time.
- the value function network can be represented by a multi-layer feedforward neural network, consisting of an input layer, several hidden layers and an output layer, and the number of nodes in the input layer is consistent with that of the policy network, each hidden layer contains several nodes, and the output layer has only one node for outputting a value function variable for evaluating the value of the input state information (i.e., the state value).
- the value function network can be represented as V ⁇ (s), where ⁇ is the network parameter of the value function network, the input is the state information s t , and the output is the state value; taking a fully connected neural network as an example (other types of networks can be used, such as residual neural networks), its structure can be set as follows: the input layer dimension is the same as the dimension of the state information s t , and it contains three linear hidden layers with 64 neurons. After each linear hidden layer, a nonlinear activation function (such as Swish) and a batch normalization layer are applied, and the output layer is 1-dimensional.
- the structure of the Q-function network can be set as follows: the input layer dimension is equal to the dimension of the state information s t plus the dimension of the action a t , and it contains three linear hidden layers with 64 neurons. Each linear hidden layer is followed by a nonlinear activation function (such as Swish) and a batch normalization layer, and the output layer is 1-dimensional.
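- Following the same pattern, a sketch of a Q-function network whose input is the concatenation of the state information and the action (again a PyTorch illustration with assumed names, not code from the application):

```python
import torch
import torch.nn as nn

class QFunctionNetwork(nn.Module):
    """Q(s, a): input dimension = state dimension + action dimension, scalar output."""
    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 64):
        super().__init__()
        layers, in_dim = [], state_dim + action_dim
        for _ in range(3):                          # three linear hidden layers with 64 neurons
            layers += [nn.Linear(in_dim, hidden_dim),
                       nn.SiLU(),                   # Swish activation
                       nn.BatchNorm1d(hidden_dim)]
            in_dim = hidden_dim
        layers.append(nn.Linear(in_dim, 1))         # 1-dimensional output
        self.net = nn.Sequential(*layers)

    def forward(self, s: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([s, a], dim=-1))  # concatenate state and action
```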
- Step S20 Using the feedback sample set to perform at least one round of training on the policy network, the value function network, and the tail risk network.
- the policy network, value function network, and tail risk network are each trained for at least one round using the feedback sample set.
- the loss value of the loss function can be calculated based on the samples in the feedback sample set.
- the gradient values of the network parameters of each network are calculated based on the loss value, and the network parameters of each network are updated based on the gradient values.
- the network parameters of each network can be updated at least once. There are many specific training methods, which are not limited in this embodiment.
- Step S30 After each network is trained for at least one round and a preset training end condition is detected to be satisfied, the trained policy network is obtained for adjusting the financial derivatives hedging portfolio based on the policy network.
- if the preset training end condition is not met, the next round of training can be carried out. That is, the policy network, value function network and tail risk network with the most recently updated network parameters interact with the environment to obtain a feedback sample set for the new round of training, and then a new round of training is carried out on each of these networks based on that feedback sample set.
- the policy network with the most recently updated network parameters can be used as the trained policy network.
- policy information can be obtained by inputting the state information corresponding to a particular financial derivative at a particular moment into the trained policy network.
- an adjustment action to be executed on the hedging portfolio of the financial derivative at that moment can be generated, and the hedging portfolio can be adjusted based on this adjustment action.
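- For illustration only, inference with a trained policy network might look like the following sketch, assuming for simplicity a deterministic policy head that outputs the adjustment amounts directly; the names policy_net, state_info, and apply_adjustment are hypothetical placeholders, not identifiers from the application:

```python
import torch

def adjust_hedging_portfolio(policy_net, state_info, apply_adjustment):
    """Generate and apply one adjustment action from the trained policy network."""
    policy_net.eval()
    with torch.no_grad():
        s = torch.as_tensor(state_info, dtype=torch.float32).unsqueeze(0)  # batch of one state
        action = policy_net(s).squeeze(0)        # adjustment amounts for each hedge asset
    apply_adjustment(action)                     # broker-side logic that trades the hedge assets
    return action
```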
- the policy loss value is calculated based on reward information
- the reward information is calculated based on the tail risk of the hedging error of the financial derivative hedging portfolio
- adjustment actions made based on the trained policy network can achieve a lower tail risk of the hedging error.
- the risk information used to calculate the tail risk is predicted by the tail risk network based on the input initial state information and/or state information corresponding to the target moment, and the tail risk network is trained based on the risk information prediction loss value, so different tail risks of the hedging error can be calculated for the same type of financial derivative with different initial state information, thereby enabling the feedback sample set to include multiple sub-feedback sample sets corresponding to the same type of financial derivative with different initial state information. Therefore, the policy network trained based on the feedback sample set can be applied to each financial derivative of the same type with different initial state information, so there is no need to train different models for each such derivative, which reduces the time cost, computing power cost and hardware resource cost of training the policy network.
- the existing methods for considering tail risk can only use data generated by simulation based on parameter models for training, and cannot directly use real market observation data for training. This makes it impossible to directly and effectively use the information of real market data, and also cannot avoid modeling errors and parameter estimation errors.
- the solution provided in the embodiments of the present application can be trained using real market data alone, so it can avoid modeling errors and parameter estimation errors, and can also directly use the information in market data, thereby effectively reducing the tail risk of hedging errors, which is helpful for securities companies to carry out risk management.
- Step S101 obtaining trajectory data corresponding to a target financial derivative in a preset environmental data set, wherein the trajectory data includes the price trajectory, delivery amount trajectory, risk-free interest rate trajectory, price trajectory of each asset in the contract period, and cash flow trajectory of each asset of the target financial derivative during the contract period.
- An environmental data set can be set in advance.
- the environmental data set can include trajectory data corresponding to multiple financial derivatives with given initial state information.
- the initial state information of each financial derivative can be different. The following takes one of the financial derivatives as an example and calls it the target financial derivative for distinction.
- the trajectory data corresponding to the target financial derivative may include the target financial derivative's price trajectory, delivery amount trajectory, risk-free interest rate trajectory, price trajectory of each asset during the contract period, and cash flow trajectory of each asset.
- the cash flow of each asset refers to the cash flow generated by each asset.
- the cash flow generated by stock assets includes stock dividends and cash distributions
- the cash flow generated by bond assets includes bond coupons.
- the trajectory data may also include the volatility trajectory of the target financial derivative's underlying asset during the contract period, information on the sensitivity of the target financial derivative's price to changes in the underlying asset price and other variables (which may include the financial derivative's Delta, Gamma, Vega, and Theta), and other information variables that affect or are correlated with the hedging error of the financial derivative's hedging portfolio, the price of the financial derivative, the price or cash flow of the hedged asset, or the price or cash flow of the underlying asset.
- the specific information variables included can be set as needed and are not limited here.
- the delivery amount Ut of the target financial derivative at time t refers to the cash value (which may be a negative value) that the seller of the financial derivative is required to pay to the buyer at time t, as specified in the financial derivative contract.
- the target financial derivative may have one or more underlying assets, as specified in the financial derivative contract.
- the assets within the contract period include the hedging assets in the target financial derivative's hedging portfolio and the target financial derivative's underlying assets. It should be noted that there may be one or more hedging assets, selected based on the financial derivative and its underlying assets. The hedging assets can be the same as the underlying assets.
- the method of obtaining the environmental data set is not limited in this embodiment.
- it can be obtained through data model simulation, or by obtaining real market information data based on a preset financial information data interface. This is not limited in this embodiment.
- obtaining trajectory data corresponding to a target financial derivative through data model simulation may specifically include:
- ⁇ , a, b, c, and d are given model parameters
- zt is an independent and identically distributed standard normal random variable
- ⁇ t is the random volatility.
- the model for the underlying asset's cash flow Dt depends on the specific definition of the underlying asset. For example, when the underlying asset is a stock, its cash flow Dt includes dividends and bonuses, and a model for Dt can be established using historical data. When the underlying asset is a bond, its cash flow includes bond coupons, and the model for Dt depends on the terms of the bond contract and market information variables such as market interest rates. Similarly, the price of the hedging asset Gt and its cash flow It can be assumed to follow a model. It should be noted that the hedging asset Gt and the underlying asset St can be identical; in the absence of cash flows, Dt or It is zero.
- the risk-free interest rate can be assumed to follow a CIR model for the instantaneous spot rate.
- T is the expiration time of the financial derivative contract. T can be specified as a fixed value or multiple different values according to the needs.
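- The concrete model equations are not reproduced in the text above, so the following sketch only illustrates how one trajectory could be simulated under common assumptions: a GARCH(1,1)-style recursion for the random volatility (requiring a + b < 1), a log-return price model with drift μ and i.i.d. standard normal shocks, and an Euler-discretized CIR short rate. The functional forms and the subset of parameters used here are illustrative, not those fixed by the application:

```python
import numpy as np

def simulate_trajectory(S0, sigma0, r0, n_steps, dt, mu, a, b, kappa, theta, xi, seed=0):
    """Illustrative simulation of one price / volatility / risk-free-rate trajectory.

    All functional forms here are assumptions made for illustration only.
    """
    rng = np.random.default_rng(seed)
    S = np.empty(n_steps + 1); sigma = np.empty(n_steps + 1); r = np.empty(n_steps + 1)
    S[0], sigma[0], r[0] = S0, sigma0, r0
    for t in range(n_steps):
        z = rng.standard_normal()                                    # z_t ~ N(0, 1), i.i.d.
        S[t + 1] = S[t] * np.exp((mu - 0.5 * sigma[t] ** 2) * dt
                                 + sigma[t] * np.sqrt(dt) * z)       # price with random volatility
        sigma[t + 1] = np.sqrt((1.0 - a - b) * sigma0 ** 2
                               + a * sigma[t] ** 2
                               + b * (sigma[t] * z) ** 2)            # GARCH(1,1)-style update
        r[t + 1] = max(r[t] + kappa * (theta - r[t]) * dt
                       + xi * np.sqrt(max(r[t], 0.0) * dt)
                       * rng.standard_normal(), 0.0)                 # CIR short rate (Euler)
    return S, sigma, r
```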
- Step S103 generating the feedback sample set by combining the sub-feedback sample sets corresponding to a plurality of financial derivatives having different initial state information.
- the state information at the starting time is s 0 .
- the tail risk is calculated from the risk information ρ, which is obtained by inputting s_0 and/or the state information s_{T-1} at time T-1 into the tail risk network to be trained.
- T is the end time of the contract period of the target financial derivative.
- the calculation method of reward information at other times is not limited here.
- ⁇ 1 >0 and ⁇ 2 ⁇ 0 are preset constants. It should be noted that the definition of the reward information variable r t is not limited to the above five types.
- ⁇ (s 0 ; ⁇ ) is the risk information corresponding to the initial state information s 0 , and represents the ⁇ quantile of the opposite number of the hedging error W T.
- G_{t+1} = (G_{t+1,1}, ..., G_{t+1,n})
- G t+1,i is the price of the i-th hedge asset at time t+1
- Z t+1 is the price of the financial derivative at time t+1 after excluding U t+1
- U t+1 is the delivery amount at time t+1. All three can be obtained from the price trajectory.
- ⁇ t+1,i is the i-th component of ⁇ t+1 , that is, the amount of the i-th hedge asset before adjustment at time t+1.
- B t+1 can be calculated as follows:
- a t,i , ⁇ t,i , and G t,i are the i-th components of a t , ⁇ t , and G t , respectively.
- C t is the transaction fee at time t.
- I t+1,i is the cash flow generated by the i-th hedge asset at time t+1, obtained from the cash flow trajectory.
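- The original expression for B_{t+1} is not reproduced in the extracted text; one self-financing form that is consistent with the quantities listed above, offered purely as an assumption, is:

```latex
B_{t+1} = e^{r_t \Delta t}\Bigl(B_t - \sum_{i=1}^{n} a_{t,i}\,G_{t,i} - C_t\Bigr)
        + \sum_{i=1}^{n} \varphi_{t+1,i}\, I_{t+1,i}
```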
- different definitions of the hedging error tail risk may be used. For example, assuming a constant 0 < α_1 < 1, the following hedging error tail risk may be used:
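- The expression that followed in the original is not reproduced here; as one example of a tail risk measure that fits the quantile interpretation of ρ (an illustration only, not the application's own formula), an expected-shortfall-style form would be:

```latex
\text{tail risk} = \rho(s_0;\omega) + \frac{1}{1-\alpha_1}\,\max\bigl(-W_T - \rho(s_0;\omega),\ 0\bigr)
```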
- each first sample generated based on a piece of trajectory data of a target financial derivative can be expressed as a tuple of the state information, the adjustment action, the reward information and the state information of the next moment for one moment in the contract period; that is, T first samples can be generated.
- Step S1028 Generate a second sample based on the initial state information of the target financial derivative and the hedging error corresponding to the termination time of the contract period, wherein the second sample includes sample data for calculating the risk information predicted loss value corresponding to the tail risk network.
- the method for generating the second sample based on the initial state information of the target financial derivative and the hedging error corresponding to the end time of the contract period is not limited, and different generation methods can be adopted depending on the risk information prediction loss function selected for the tail risk network.
- the risk information prediction loss function of the tail risk network can be selected according to specific needs, and therefore, the method for calculating the risk information prediction loss value is not limited.
- the second sample may include the initial state information of the target financial derivative and the hedging error corresponding to the end time of the contract period.
- the risk information predicted loss value of the tail risk network can be calculated based on the second sample, and then the gradient value of the network parameter of the tail risk network is calculated according to the risk information predicted loss value, and the network parameters of the tail risk network are updated according to the gradient value.
- the second sample may include the initial state information of the target financial derivative, the state information corresponding to the target time and the hedging error corresponding to the end time of the contract period.
- the risk information predicted loss value of the tail risk network can be calculated based on the second sample, and then the gradient value of the network parameter of the tail risk network can be calculated according to the risk information predicted loss value, and the network parameters of the tail risk network can be updated according to the gradient value.
- the second sample may include the state information corresponding to the target financial derivative at the target time and the hedging error corresponding to the termination time of the contract period.
- the risk information predicted loss value of the tail risk network can be calculated based on the second sample, and then the gradient value of the network parameter of the tail risk network is calculated according to the risk information predicted loss value, and the network parameters of the tail risk network are updated according to the gradient value.
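- Since the application leaves the risk information prediction loss function open, one natural choice when ρ is interpreted as an α-quantile of -W_T is the pinball (quantile) loss. The sketch below shows that assumed choice; it is not the loss prescribed by the application:

```python
import torch

def quantile_loss(rho_pred: torch.Tensor, neg_hedging_error: torch.Tensor, alpha: float) -> torch.Tensor:
    """Pinball loss for learning the alpha-quantile of -W_T from second samples.

    rho_pred          : tail risk network outputs rho(s; omega), shape (batch,)
    neg_hedging_error : negated hedging errors -W_T at contract termination, shape (batch,)
    """
    diff = neg_hedging_error - rho_pred
    return torch.mean(torch.maximum(alpha * diff, (alpha - 1.0) * diff))

# One assumed update step on a batch of second samples (net and optimizer are placeholders):
#   loss = quantile_loss(net(states).squeeze(-1), -w_terminal, alpha=0.95)
#   optimizer.zero_grad(); loss.backward(); optimizer.step()
```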
- Step S1029 Generate a sub-feedback sample set corresponding to the target financial derivative based on the first sample and the second sample.
- the generated first sample and the second sample can be combined to obtain a sub-feedback sample set corresponding to the target financial derivative.
- the sub-feedback sample set can be generated in the following manner:
- the initial state information is the state information corresponding to the start time of the contract period.
- the start time is used as the target time, and the state information corresponding to the target time is input into the strategy network.
- the state-action value corresponding to the target moment is obtained; the next moment after the target moment is taken as the new target moment, and the process returns to the step of inputting the state information corresponding to the target moment into the policy network to obtain the policy information of the target moment, until the next moment after the target moment is the end moment of the contract period; based on the state information, adjustment action, reward information and state information of the next moment corresponding to each moment in the contract period, a first sample corresponding to each moment is generated, wherein the first sample includes the sample data for calculating the policy loss value corresponding to the policy network and the sample data for calculating the evaluation loss value corresponding to the Q-function network; a second sample is generated based on the initial state information of the target financial derivative and/or the state information corresponding to the target moment, and the hedging error corresponding to the termination moment of the contract period, wherein the second sample includes sample data for calculating the risk information prediction loss value corresponding to the tail risk network ρ(s; ω_1); and a sub-feedback sample set corresponding to the target financial derivative is generated based on the first samples and the second sample.
- the state information at the starting time is s 0
- a noise term ε_0 is randomly sampled, e.g., from a normal distribution, where σ_0 is a preset parameter
- the noise term is added to a_0 to obtain the adjustment action
- This tail risk is calculated by inputting s 0 and/or the state information s T-1 at time T-1 into the tail risk network to be trained, ⁇ (s; ⁇ 1 ), and then calculating the risk information.
- T is the end time of the contract period of the target financial derivative.
- the calculation method of reward information at other times is not limited here. For example, it can be calculated based on the hedging error, set to 0, or calculated based on the tail risk and hedging error.
- the trajectory data corresponding to a target financial derivative can be obtained from the environmental data set.
- the following operations are performed.
- the initial state s 0 is determined.
- the target financial derivative is a European option
- the initial state s 0 is:
- the feedback sample set may include multiple first samples and second samples.
- the first samples include sample data for calculating the policy loss value corresponding to the policy network and sample data for calculating the evaluation loss value corresponding to the value function network.
- the second samples include sample data for calculating the risk information prediction loss value corresponding to the tail risk network.
- step S20 includes:
- the parameters of the policy network, the Q-function network and the tail risk network ρ(s_0; ω_1) are initialized, together with a target policy network, a target Q-function network and a target tail risk network ρ(s_0; ω_2), whose parameters are initialized accordingly. First, the feedback sample set is obtained through interaction. At the beginning of each update, the first samples in the feedback sample set are randomly shuffled, and then m first samples at a time are taken in turn as a subsample set. For each sample point, the corresponding target value is calculated according to the policy network and the Q-function network, where γ ∈ [0,1] is the discount rate. The following loss function is thus determined:
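- The symbols of the target value were lost in extraction. Under the common actor-critic reading, in which the target value for each first sample (s, a, r, s') is computed with the target policy network and the target Q-function network, the computation and the resulting Q-function loss could be sketched as follows (an assumption, not the application's exact formulas):

```python
import torch
import torch.nn.functional as F

def q_loss_on_subsample(batch, q_net, target_policy, target_q, gamma: float) -> torch.Tensor:
    """Target values y_j = r_j + gamma * Q_target(s'_j, pi_target(s'_j)) and the Q-function loss
    over one subsample set of m first samples (s, a, r, s'). Illustrative only."""
    s, a, r, s_next = batch                              # tensors with leading dimension m
    with torch.no_grad():
        a_next = target_policy(s_next)                   # action proposed by the target policy network
        y = r + gamma * target_q(s_next, a_next).squeeze(-1)
    q = q_net(s, a).squeeze(-1)
    return F.mse_loss(q, y)                              # loss used to update the Q-function network
```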
- the following function is used for tail risk:
- Step S50 obtaining the environmental data set according to the trajectory data corresponding to each financial derivative having different initial state information.
- a preset financial information data interface is a pre-configured, customized financial information data interface. Through this interface, historical, real-world information related to financial derivatives can be obtained, thereby obtaining trajectory data corresponding to these financial derivatives. Preset assets and time periods can be configured as needed and are not limited in this embodiment.
- an embodiment of the present application further provides a data processing device, the device comprising:
- the feedback sample set is generated by using the sub-feedback sample sets corresponding to a plurality of financial derivatives with different initial state information.
- the value function network is configured to output a state value based on state information input at a target time, at least one item of sample data is calculated based on the state value, and the interaction module is further configured to:
- initial state information corresponding to the target financial derivative is determined according to the trajectory data corresponding to the target financial derivative, wherein the initial state information is the state information corresponding to the start time of the contract period;
- the reward information is calculated based on the hedging error of the target financial derivative hedging portfolio at the moment immediately following the target time and the tail risk of the hedging error, wherein the tail risk is calculated based on risk information, which is calculated by inputting initial state information corresponding to the target financial derivative and/or state information corresponding to the target time into a tail risk network to be trained;
- the first sample includes sample data for calculating the policy loss value corresponding to the policy network and sample data for calculating the evaluation loss value corresponding to the value function network;
- the second sample includes sample data for calculating the risk information predicted loss value corresponding to the tail risk network
- a sub-feedback sample set corresponding to the target financial derivative is generated based on the first sample and the second sample.
- when the risk information is calculated by inputting the initial state information corresponding to the target financial derivative into the tail risk network to be trained, the second sample includes the initial state information of the target financial derivative and the hedging error corresponding to the end time of the contract period.
- when the risk information is calculated by respectively inputting the initial state information corresponding to the target financial derivative and the state information corresponding to the target time into the tail risk network to be trained, the second sample includes the initial state information of the target financial derivative, the state information corresponding to the target time, and the hedging error corresponding to the end time of the contract period.
- when the risk information is calculated by inputting the state information corresponding to the target financial derivative at the target time into the tail risk network to be trained, the second sample includes the state information corresponding to the target financial derivative at the target time and the hedging error corresponding to the end time of the contract period.
- the data processing device further includes:
- the feedback sample set includes a plurality of first samples and second samples, wherein the first samples include sample data for calculating the policy loss value corresponding to the policy network and sample data for calculating the evaluation loss value corresponding to the value function network, and the second samples include sample data for calculating the risk information prediction loss value corresponding to the tail risk network;
- the training module is also used to:
- the second sample includes initial state information corresponding to a target financial derivative and a hedging error of the hedging combination of the target financial derivative at the time of contract termination of the target financial derivative;
- the training module is also used to:
- the status information corresponding to a financial derivative at a target time includes the cash amount in the hedging portfolio of the financial derivative at the target time, the amount of each hedging asset, the price of each hedging asset, the price of each underlying asset of the financial derivative, the remaining time to maturity of the financial derivative, the parameters affecting the price of the financial derivative stipulated in the financial derivative contract, or the ratio of the parameters affecting the price of the financial derivative stipulated in the financial derivative contract to the initial price of each underlying asset of the financial derivative, the hedging error of the hedging portfolio, and the risk-free interest rate at the target time.
- the status information corresponding to the financial derivative at the target time may also include the volatility of each underlying asset of the financial derivative at the target time.
- the value function network includes two independent multi-layer feedforward neural networks, each of which includes an input layer, several hidden layers and an output layer, and the number of nodes in the input layer is consistent with that in the policy network, each of the hidden layers contains several nodes, and the output layer has only one node for outputting a value function variable for evaluating the value of the input state information.
- the to-be-trained policy network, the to-be-trained value function network, and the to-be-trained tail risk network are used to interact at least once with an environment defined by the trajectory data corresponding to the target financial derivative, to obtain a sub-feedback sample set corresponding to the target financial derivative, wherein the sub-feedback sample set includes a plurality of samples, each of which includes sample data for calculating a loss value;
- the feedback sample set is generated by using the sub-feedback sample sets corresponding to a plurality of financial derivatives with different initial state information.
- initial state information corresponding to the target financial derivative is determined according to the trajectory data corresponding to the target financial derivative, wherein the initial state information is the state information corresponding to the start time of the contract period;
- the first sample includes sample data for calculating the policy loss value corresponding to the policy network and sample data for calculating the evaluation loss value corresponding to the value function network;
- the second sample includes sample data for calculating the risk information predicted loss value corresponding to the tail risk network
- the environmental data set is obtained according to the trajectory data corresponding to each financial derivative with different initial state information.
- the feedback sample set includes a plurality of first samples and second samples, wherein the first samples include sample data for calculating the policy loss value corresponding to the policy network and sample data for calculating the evaluation loss value corresponding to the value function network, and the second samples include sample data for calculating the risk information prediction loss value corresponding to the tail risk network;
- the operation of using the feedback sample set to perform a round of training on the policy network, the value function network, and the tail risk network respectively includes:
Description
Related applications

This application claims priority to Chinese patent application No. 202410179678.4 filed on February 18, 2024, the entire contents of which are incorporated herein by reference.

The present application relates to the fields of financial risk control and artificial intelligence technology, and in particular to a data processing method, apparatus, device, and computer-readable storage medium.

In the financial derivatives market-making business of securities firms, real-economy enterprises need to purchase financial derivatives from securities firms to manage financial risk. For example, steel producers often hold iron ore inventories; to hedge against the risk of future price declines, they need to purchase iron ore put options from securities firms to protect their value. After selling financial derivatives to real-economy enterprises or financial institutions, securities firms need to hedge effectively to control the financial risks they assume, particularly the tail risk of the hedging error of the hedge portfolio. Specifically, they need to construct, for the financial derivative, a hedge portfolio consisting of cash (i.e., a risk-free asset) and hedging assets to replicate the financial derivative. That is, by adjusting the amount of hedging assets in the hedge portfolio, the total value of the hedge portfolio over the life of the financial derivative contract is kept equal to, or not less than, the value of the financial derivative, namely the value of one or more payments (i.e., the payment amount) stipulated in the financial derivative contract to be made to the buyer of the financial derivative. The payment amount is determined by the price of the underlying asset. For example, if the financial derivative is a CSI 300 Index put option, the underlying asset is the CSI 300 Index. At the expiration of the CSI 300 Index put option contract, the contractually agreed-upon amount payable to the option buyer is equal to the maximum of zero and the difference obtained by subtracting the market price of the underlying asset (i.e., the CSI 300 Index) from the contractually agreed-upon strike price. The hedging error of a hedge portfolio is the difference between the value of the hedge portfolio and the value of the financial derivative. When the hedging error is less than zero, that is, when the value of the hedge portfolio falls below the value of the derivative, the securities firm incurs a loss. Therefore, securities firms need to effectively manage the financial risks associated with selling financial derivatives, particularly the tail risk of the hedging error, which is the risk of the firm incurring large losses. Currently, trained models can be used to help adjust the hedging assets in a hedge portfolio to reduce this tail risk.
However, while current model training methods consider the tail risk of hedging errors of a hedge portfolio, the resulting models are not applicable to the same type of financial derivative with different initial state information (e.g., different initial underlying prices, different strike prices, and different expiration times). Consequently, different models must be trained for each financial derivative with different initial state information, resulting in high training costs in terms of time, computing power, and hardware resources. Furthermore, existing methods that consider tail risk can only be trained on data generated by simulations based on parametric models, rather than directly on real market observation data. This prevents direct and effective use of the information in real market data and cannot avoid modeling errors and parameter estimation errors. Consequently, securities companies are unable to effectively reduce the tail risk of hedging errors.
本申请的主要目的在于提供一种数据处理方法、装置、设备及计算机可读存储介质,旨在解决目前训练模型的方法不能够在考虑对冲组合的对冲误差的尾部风险的同时,降低训练的时间成本、算力成本和硬件资源成本的技术问题,以及解决现有的考虑尾部风险的方法只能用基于参数模型模拟生成的数据来进行训练,而不能直接使用真实的市场观测数据来进行训练,从而不能直接有效地使用真实的市场数据的信息,也不能避免建模的错误和参数估计的错误的技术问题。The main purpose of this application is to provide a data processing method, apparatus, device and computer-readable storage medium, which aims to solve the technical problem that the current model training method cannot reduce the time cost, computing power cost and hardware resource cost of training while considering the tail risk of the hedging error of the hedge portfolio, and solve the technical problem that the existing method considering tail risk can only use data generated by simulation based on parameter model for training, but cannot directly use real market observation data for training, thereby failing to directly and effectively use the information of real market data, and failing to avoid modeling errors and parameter estimation errors.
为实现上述目的,本申请提供一种数据处理方法,所述方法包括以下步骤:To achieve the above objectives, the present application provides a data processing method, which includes the following steps:
采用待训练的策略网络、待训练的值函数网络和待训练的尾部风险网络与环境进行多次交互得到反馈样本集合,其中,所述策略网络用于基于目标时刻的状态信息输出策略信息,所述策略信息用于生成在所述目标时刻对金融衍生品对冲组合做出的调整动作,所述反馈样本集合包括多条样本,所述样本中包括用于计算损失值的各项样本数据,至少一项样本数据基于奖励信息计算得到,所述奖励信息基于金融衍生品对冲组合的对冲误差和所述对冲误差的尾部风险计算得到,所述尾部风险根据风险信息计算得到,所述风险信息由所述尾部风险网络基于金融衍生品对应的初始状态信息和/或目标时刻对应的状态信息预测得到;A feedback sample set is obtained by performing multiple interactions with the environment using a policy network to be trained, a value function network to be trained, and a tail risk network to be trained, wherein the policy network is used to output policy information based on state information at a target moment, and the policy information is used to generate an adjustment action for the financial derivatives hedging portfolio at the target moment. The feedback sample set includes multiple samples, each of which includes sample data used to calculate a loss value, at least one of which is calculated based on reward information, the reward information is calculated based on a hedging error of the financial derivatives hedging portfolio and a tail risk of the hedging error, the tail risk is calculated based on risk information, and the risk information is predicted by the tail risk network based on initial state information corresponding to the financial derivative and/or state information corresponding to the target moment;
采用所述反馈样本集合对所述策略网络、所述值函数网络和所述尾部风险网络分别进行至少一轮训练;Using the feedback sample set to perform at least one round of training on the policy network, the value function network, and the tail risk network respectively;
在对各网络分别进行至少一轮的训练并检测到满足预设训练结束条件后,得到训练完成的所述策略网络,以供基于所述策略网络进行金融衍生品对冲组合的调整。After each network is trained for at least one round and it is detected that a preset training end condition is met, the trained strategy network is obtained for adjusting the financial derivatives hedging portfolio based on the strategy network.
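As an illustration only (it is not part of the original disclosure), the three steps above can be summarized in the following minimal Python sketch; the three callables are hypothetical placeholders for the sample-collection step, the per-round training step and the end-of-training check described below:
# Hypothetical skeleton of the method; the callables are supplied by the caller.
def train_policy(collect_feedback_samples, train_one_round, training_finished, max_rounds=1000):
    for round_index in range(max_rounds):
        feedback_samples = collect_feedback_samples()   # interact with the environment to build the feedback sample set
        train_one_round(feedback_samples)               # train the policy, value function and tail risk networks at least once
        if training_finished(round_index):              # preset end-of-training condition (e.g. convergence or a round budget)
            break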
在一实施例中,所述采用待训练的策略网络、待训练的值函数网络和待训练的尾部风险网络与环境进行多次交互得到反馈样本集合的步骤包括:In one embodiment, the step of using the policy network to be trained, the value function network to be trained, and the tail risk network to be trained to interact with the environment multiple times to obtain a feedback sample set includes:
获取预设的环境数据集中一个目标金融衍生品对应的轨迹数据,所述轨迹数据包括所述目标金融衍生品在合约时段内的价格轨迹、交付金额轨迹、所述合约时段内的无风险利率轨迹、所述合约时段内各个资产的价格轨迹和各个资产的现金流轨迹;Obtaining trajectory data corresponding to a target financial derivative in a preset environment data set, the trajectory data including a price trajectory of the target financial derivative within the contract period, a delivery amount trajectory, a risk-free interest rate trajectory within the contract period, a price trajectory of each asset within the contract period, and a cash flow trajectory of each asset;
采用待训练的策略网络、待训练的值函数网络和待训练的尾部风险网络与所述目标金融衍生品对应的轨迹数据所限定的环境进行至少一次交互,得到所述目标金融衍生品对应的子反馈样本集合,所述子反馈样本集合中包括多条样本,所述样本中包括用于计算损失值的各项样本数据;Using the to-be-trained policy network, the to-be-trained value function network, and the to-be-trained tail risk network to interact at least once with an environment defined by the trajectory data corresponding to the target financial derivative, to obtain a sub-feedback sample set corresponding to the target financial derivative, wherein the sub-feedback sample set includes a plurality of samples, each of which includes sample data for calculating a loss value;
将多个具有不同初始状态信息的金融衍生品对应的所述子反馈样本集合生成所述反馈样本集合。The feedback sample set is generated by using the sub-feedback sample sets corresponding to a plurality of financial derivatives with different initial state information.
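Purely as an illustrative sketch (the class and field names below are hypothetical and not taken from this application), the trajectory data of one target financial derivative could be organized as follows in Python:
from dataclasses import dataclass
from typing import List

@dataclass
class TrajectoryData:
    """Hypothetical container for the trajectory data of one target financial derivative."""
    derivative_prices: List[float]       # price trajectory of the derivative over the contract period
    payment_amounts: List[float]         # delivery (payment) amount trajectory
    risk_free_rates: List[float]         # risk-free interest rate trajectory over the contract period
    asset_prices: List[List[float]]      # price trajectory of each asset, one list per asset
    asset_cash_flows: List[List[float]]  # cash-flow trajectory of each asset, one list per asset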
在一实施例中,所述值函数网络用于基于在目标时刻输入的状态信息输出状态价值,至少一项所述样本数据基于所述状态价值计算得到,所述采用待训练的策略网络、待训练的值函数网络和待训练的尾部风险网络与所述目标金融衍生品对应的轨迹数据所限定的环境进行至少一次交互,得到所述目标金融衍生品对应的子反馈样本集合的步骤包括:In one embodiment, the value function network is configured to output a state value based on state information input at a target time, at least one item of sample data is calculated based on the state value, and the step of using the to-be-trained policy network, the to-be-trained value function network, and the to-be-trained tail risk network to interact at least once with an environment defined by the trajectory data corresponding to the target financial derivative to obtain a sub-feedback sample set corresponding to the target financial derivative includes:
根据所述目标金融衍生品对应的轨迹数据确定所述目标金融衍生品对应的初始状态信息,所述初始状态信息为所述合约时段的起始时刻对应的状态信息;determining initial state information corresponding to the target financial derivative according to the trajectory data corresponding to the target financial derivative, wherein the initial state information is state information corresponding to the start time of the contract period;
将所述起始时刻作为目标时刻,将所述目标时刻对应的状态信息输入所述策略网络,得到所述目标时刻的策略信息,并根据所述策略信息生成在所述目标时刻对金融衍生品对冲组合做出的调整动作;Using the starting time as the target time, inputting state information corresponding to the target time into the strategy network to obtain strategy information at the target time, and generating an adjustment action for the financial derivatives hedging portfolio at the target time based on the strategy information;
根据所述调整动作和所述目标金融衍生品对应的轨迹数据,计算得到所述目标时刻的下一时刻的状态信息和做出所述调整动作后所述对冲组合在所述目标时刻的下一时刻的对冲误差;Calculating, based on the adjustment action and the trajectory data corresponding to the target financial derivative, state information at a moment immediately following the target moment and a hedging error of the hedging portfolio at a moment immediately following the target moment after the adjustment action is performed;
将所述目标时刻的状态信息输入待训练的值函数网络,得到所述目标时刻对应的状态价值;Inputting the state information of the target moment into the value function network to be trained to obtain the state value corresponding to the target moment;
计算在所述目标时刻做出所述调整动作后的奖励信息,其中,当所述目标时刻的下一时刻为所述合约时段的终止时刻时,所述奖励信息根据所述目标金融衍生品对冲组合在所述目标时刻的下一时刻的对冲误差和所述对冲误差的尾部风险计算得到,其中,所述尾部风险是根据风险信息计算得到的,所述风险信息是通过将所述目标金融衍生品对应的初始状态信息和/或目标时刻对应的状态信息输入待训练的尾部风险网络计算得到的;Calculating reward information after performing the adjustment action at the target time, wherein, when the moment immediately following the target time is the end time of the contract period, the reward information is calculated based on the hedging error of the target financial derivative hedging portfolio at the moment immediately following the target time and the tail risk of the hedging error, wherein the tail risk is calculated based on risk information, which is calculated by inputting initial state information corresponding to the target financial derivative and/or state information corresponding to the target time into a tail risk network to be trained;
将所述目标时刻的下一时刻作为新的所述目标时刻,并返回执行所述将所述目标时刻对应的状态信息输入所述策略网络,得到所述目标时刻的策略信息的步骤,直到所述目标时刻的下一时刻为所述合约时段的终止时刻为止;Taking the moment after the target moment as the new target moment, and returning to the step of inputting the state information corresponding to the target moment into the policy network to obtain the policy information for the target moment, until the moment after the target moment is the end moment of the contract period;
基于所述合约时段内每个时刻对应的状态信息、调整动作、奖励信息和状态价值,生成每个时刻对应的第一样本,所述第一样本中包括用于计算所述策略网络对应的策略损失值的样本数据和用于计算所述值函数网络对应的评估损失值的样本数据;Generate a first sample corresponding to each moment based on the state information, adjustment action, reward information, and state value corresponding to each moment within the contract period, wherein the first sample includes sample data for calculating the policy loss value corresponding to the policy network and sample data for calculating the evaluation loss value corresponding to the value function network;
基于所述目标金融衍生品的初始状态信息和所述合约时段的终止时刻对应的对冲误差,生成第二样本,所述第二样本中包括用于计算所述尾部风险网络对应的风险信息预测损失值的样本数据;generating a second sample based on the initial state information of the target financial derivative and the hedging error corresponding to the termination time of the contract period, wherein the second sample includes sample data for calculating the risk information predicted loss value corresponding to the tail risk network;
根据所述第一样本和所述第二样本生成所述目标金融衍生品对应的子反馈样本集合。A sub-feedback sample set corresponding to the target financial derivative is generated based on the first sample and the second sample.
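The interaction described above can be sketched as follows; this is a simplified, hypothetical Python rendering: the environment interface, the zero intermediate reward and the terminal reward formula (hedging error minus a tail-risk penalty) are assumptions made for illustration, since the application does not fix these details here:
def run_episode(policy_net, value_net, tail_risk_net, env, tail_risk_fn):
    """Hypothetical single interaction with the environment defined by one derivative's trajectory data.

    env is assumed to expose:
      env.reset() -> initial state information s_0 (start of the contract period)
      env.step(action) -> (next_state, hedging_error, done)
    tail_risk_fn turns the predicted risk information and the terminal hedging error into a tail risk.
    """
    first_samples = []                    # (state, action, reward, state_value) per time step
    state = env.reset()
    initial_state = state
    done = False
    while not done:
        action = policy_net(state)        # policy information -> adjustment action at the target time
        state_value = value_net(state)    # value of the state at the target time
        next_state, hedging_error, done = env.step(action)
        if done:
            # Terminal time: reward built from the terminal hedging error and its tail risk.
            # The additive form below is an illustrative assumption, not the application's formula.
            risk_info = tail_risk_net(initial_state)
            reward = hedging_error - tail_risk_fn(risk_info, hedging_error)
        else:
            reward = 0.0                  # simplifying assumption for non-terminal time steps
        first_samples.append((state, action, reward, state_value))
        state = next_state
    # Second sample: the initial state information and the hedging error at contract termination,
    # later used to compute the risk information prediction loss of the tail risk network.
    second_sample = (initial_state, hedging_error)
    return first_samples, second_sample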
在一实施例中,当所述风险信息是通过将所述目标金融衍生品对应的初始状态信息输入所述待训练的尾部风险网络计算得到的情况下,所述第二样本包括目标金融衍生品的初始状态信息和合约时段的终止时刻对应的对冲误差。In one embodiment, when the risk information is calculated by inputting the initial state information corresponding to the target financial derivative into the tail risk network to be trained, the second sample includes the initial state information of the target financial derivative and the hedging error corresponding to the end time of the contract period.
在一实施例中,当所述风险信息是通过将所述目标金融衍生品对应的初始状态信息和在目标时刻对应的状态信息分别输入所述待训练的尾部风险网络计算得到的情况下,所述第二样本包括目标金融衍生品的初始状态信息、目标时刻对应的状态信息和合约时段的终止时刻对应的对冲误差。In one embodiment, when the risk information is calculated by respectively inputting the initial state information corresponding to the target financial derivative and the state information corresponding at the target time into the tail risk network to be trained, the second sample includes the initial state information of the target financial derivative, the state information corresponding to the target time, and the hedging error corresponding to the end time of the contract period.
在一实施例中,当所述风险信息是通过将所述目标金融衍生品在目标时刻对应的状态信息输入所述待训练的尾部风险网络计算得到的情况下,所述第二样本包括目标金融衍生品在目标时刻对应的状态信息和合约时段的终止时刻对应的对冲误差。In one embodiment, when the risk information is calculated by inputting the state information corresponding to the target financial derivative at the target time into the tail risk network to be trained, the second sample includes the state information corresponding to the target financial derivative at the target time and the hedging error corresponding to the end time of the contract period.
在一实施例中,所述获取预设的环境数据集中一个目标金融衍生品对应的轨迹数据的步骤之前,还包括:In one embodiment, before the step of obtaining trajectory data corresponding to a target financial derivative in the preset environmental data set, the method further includes:
对于预设时段内以预设资产为标的资产的各个具有不同初始状态信息的金融衍生品中的任一金融衍生品,基于预设金融信息数据接口采集所述金融衍生品对应的轨迹数据;For any financial derivative with a preset asset as the underlying asset and having different initial state information within a preset time period, collecting trajectory data corresponding to the financial derivative based on a preset financial information data interface;
根据各个具有不同初始状态信息的金融衍生品对应的所述轨迹数据得到所述环境数据集。The environmental data set is obtained according to the trajectory data corresponding to each financial derivative with different initial state information.
在一实施例中,反馈样本集合中包括多条第一样本和第二样本,所述第一样本中包括用于计算所述策略网络对应的策略损失值的样本数据和用于计算所述值函数网络对应的评估损失值的样本数据,所述第二样本中包括用于计算所述尾部风险网络对应的风险信息预测损失值的样本数据;In one embodiment, the feedback sample set includes a plurality of first samples and second samples, wherein the first samples include sample data for calculating the policy loss value corresponding to the policy network and sample data for calculating the evaluation loss value corresponding to the value function network, and the second samples include sample data for calculating the risk information prediction loss value corresponding to the tail risk network;
采用所述反馈样本集合对所述策略网络、所述值函数网络和所述尾部风险网络分别进行一轮训练的步骤包括:The steps of using the feedback sample set to perform a round of training on the policy network, the value function network, and the tail risk network respectively include:
采用所述第一样本计算所述策略网络对应的策略损失值和所述值函数网络对应的评估损失值;Calculating a policy loss value corresponding to the policy network and an evaluation loss value corresponding to the value function network using the first sample;
根据所述策略损失值和所述评估损失值计算所述策略网络和所述值函数网络中网络参数对应的梯度值,并根据梯度值更新网络参数;Calculating the gradient values corresponding to the network parameters in the policy network and the value function network according to the policy loss value and the evaluation loss value, and updating the network parameters according to the gradient values;
采用所述第二样本计算所述尾部风险网络对应的风险信息预测损失值,根据所述风险信息预测损失值计算所述尾部风险网络中网络参数对应的梯度值,并根据梯度值更新网络参数;Calculating a risk information predicted loss value corresponding to the tail risk network using the second sample, calculating a gradient value corresponding to a network parameter in the tail risk network according to the risk information predicted loss value, and updating the network parameter according to the gradient value;
采用所述反馈样本集合对所述策略网络、所述值函数网络和所述尾部风险网络中的网络参数分别进行至少一轮的更新后,完成对所述策略网络、所述值函数网络和所述尾部风险网络的一轮训练过程。After using the feedback sample set to update the network parameters in the policy network, the value function network and the tail risk network for at least one round, a round of training process for the policy network, the value function network and the tail risk network is completed.
在一实施例中,所述第二样本中包括一个目标金融衍生品对应的初始状态信息和所述目标金融衍生品的对冲组合在所述目标金融衍生品的合约终止时刻的对冲误差;In one embodiment, the second sample includes initial state information corresponding to a target financial derivative and a hedging error of the hedging combination of the target financial derivative at the time of contract termination of the target financial derivative;
所述采用所述第二样本计算所述尾部风险网络对应的风险信息预测损失值,根据所述风险信息预测损失值计算所述尾部风险网络中网络参数对应的梯度值,并根据梯度值更新网络参数的步骤包括:The steps of using the second sample to calculate the risk information predicted loss value corresponding to the tail risk network, calculating the gradient value corresponding to the network parameter in the tail risk network according to the risk information predicted loss value, and updating the network parameter according to the gradient value include:
将所述第二样本中的初始状态信息输入所述尾部风险网络预测得到所述第二样本对应的风险信息;Inputting the initial state information in the second sample into the tail risk network to predict and obtain the risk information corresponding to the second sample;
将所述第二样本对应的风险信息和所述第二样本中的对冲误差代入预设的风险信息预测损失函数,得到所述第二样本对应的风险信息预测损失值;Substituting the risk information corresponding to the second sample and the hedging error in the second sample into a preset risk information prediction loss function to obtain a risk information prediction loss value corresponding to the second sample;
根据所述反馈样本集合中多个所述第二样本对应的风险信息预测损失值,计算所述尾部风险网络中网络参数对应的梯度值,并根据梯度值更新网络参数。Calculating the gradient values corresponding to the network parameters in the tail risk network based on the risk information prediction loss values corresponding to the plurality of second samples in the feedback sample set, and updating the network parameters based on the gradient values.
在一实施例中,金融衍生品对应的目标时刻的状态信息包括在所述目标时刻所述金融衍生品的对冲组合中的现金数量、各个对冲资产的数量、各个对冲资产的价格、所述金融衍生品的各个标的资产的价格、所述金融衍生品的剩余到期时间、所述金融衍生品的合约中约定的影响所述金融衍生品价格的参数或所述金融衍生品的合约中约定的影响所述金融衍生品价格的参数与所述金融衍生品的各个标的资产初始时刻的价格的比值、所述对冲组合的对冲误差和在所述目标时刻的无风险利率。In one embodiment, the status information corresponding to the financial derivative at the target moment includes the amount of cash in the hedging portfolio of the financial derivative at the target moment, the amount of each hedging asset, the price of each hedging asset, the price of each underlying asset of the financial derivative, the remaining maturity time of the financial derivative, the parameters affecting the price of the financial derivative agreed in the contract of the financial derivative or the ratio of the parameters affecting the price of the financial derivative agreed in the contract of the financial derivative to the prices of each underlying asset of the financial derivative at the initial moment, the hedging error of the hedging portfolio and the risk-free interest rate at the target moment.
在一实施例中,所述策略网络包括两个独立的多层前馈神经网络,每个所述前馈神经网络包括一个输入层,若干个隐藏层和一个输出层。In one embodiment, the policy network includes two independent multi-layer feedforward neural networks, each of which includes an input layer, several hidden layers and an output layer.
在一实施例中,所述值函数网络包括两个独立的多层前馈神经网络,每个所述前馈神经网络包括一个输入层,若干个隐藏层和一个输出层,且所述输入层节点数和所述策略网络一致,每个所述隐藏层包含若干个节点,而所述输出层仅有一个节点用于输出一个值函数变量,用于评估输入的状态信息的价值。In one embodiment, the value function network includes two independent multi-layer feedforward neural networks, each of which includes an input layer, several hidden layers and an output layer, and the number of nodes in the input layer is consistent with that in the policy network. Each hidden layer contains several nodes, and the output layer has only one node for outputting a value function variable for evaluating the value of the input state information.
为实现上述目的,本申请还提供一种数据处理装置,所述装置包括:To achieve the above objectives, the present application further provides a data processing device, comprising:
交互模块,用于采用待训练的策略网络、待训练的值函数网络和待训练的尾部风险网络与环境进行多次交互得到反馈样本集合,其中,所述策略网络用于基于目标时刻的状态信息输出策略信息,所述策略信息用于生成在所述目标时刻对金融衍生品对冲组合做出的调整动作,所述反馈样本集合包括多条样本,所述样本中包括用于计算损失值的各项样本数据,至少一项样本数据基于奖励信息计算得到,所述奖励信息基于金融衍生品对冲组合的对冲误差和所述对冲误差的尾部风险计算得到,所述尾部风险根据风险信息计算得到,所述风险信息由所述尾部风险网络基于金融衍生品对应的初始状态信息和/或目标时刻对应的状态信息预测得到;An interaction module, configured to employ a policy network to be trained, a value function network to be trained, and a tail risk network to be trained to interact multiple times with an environment to obtain a feedback sample set, wherein the policy network is configured to output policy information based on state information at a target moment, the policy information being configured to generate an adjustment action for the financial derivatives hedging portfolio at the target moment, the feedback sample set comprising a plurality of samples, each of which includes various sample data items used to calculate a loss value, at least one of which is calculated based on reward information, the reward information being calculated based on a hedging error of the financial derivatives hedging portfolio and a tail risk of the hedging error, the tail risk being calculated based on risk information, and the risk information being predicted by the tail risk network based on initial state information corresponding to the financial derivative and/or state information corresponding to the target moment;
训练模块,用于采用所述反馈样本集合对所述策略网络、所述值函数网络和所述尾部风险网络分别进行至少一轮训练;在对各网络分别进行至少一轮的训练并检测到满足预设训练结束条件后,得到训练完成的所述策略网络,以供基于所述策略网络进行金融衍生品对冲组合的调整。A training module is used to use the feedback sample set to perform at least one round of training on the strategy network, the value function network and the tail risk network respectively; after performing at least one round of training on each network and detecting that the preset training end conditions are met, the trained strategy network is obtained for adjustment of the financial derivatives hedging portfolio based on the strategy network.
为实现上述目的,本申请还提供一种数据处理设备,所述数据处理设备包括:存储器、处理器及存储在所述存储器上并可在所述处理器上运行的数据处理程序,所述数据处理程序被所述处理器执行时实现如上所述的数据处理方法的步骤。To achieve the above-mentioned objectives, the present application also provides a data processing device, which includes: a memory, a processor, and a data processing program stored in the memory and executable on the processor, wherein the data processing program implements the steps of the data processing method described above when executed by the processor.
此外,为实现上述目的,本申请还提出一种计算机可读存储介质,所述计算机可读存储介质上存储有数据处理程序,所述数据处理程序被处理器执行时实现如上所述的数据处理方法的步骤。In addition, to achieve the above-mentioned purpose, the present application also proposes a computer-readable storage medium, on which a data processing program is stored. When the data processing program is executed by a processor, the steps of the data processing method described above are implemented.
本申请实施例中,采用待训练的策略网络、待训练的值函数网络和待训练的尾部风险网络与环境进行多次交互得到反馈样本集合,其中,所述策略网络用于基于目标时刻的状态信息输出策略信息,所述策略信息用于生成在所述目标时刻对金融衍生品对冲组合做出的调整动作,所述反馈样本集合包括多条样本,所述样本中包括用于计算损失值的各项样本数据,至少一项样本数据基于奖励信息计算得到,所述奖励信息基于金融衍生品对冲组合的对冲误差和所述对冲误差的尾部风险计算得到,所述尾部风险根据风险信息计算得到,所述风险信息由所述尾部风险网络基于金融衍生品对应的初始状态信息和/或目标时刻对应的状态信息预测得到;采用所述反馈样本集合对所述策略网络、所述值函数网络和所述尾部风险网络分别进行至少一轮训练;在对各网络分别进行至少一轮的训练并检测到满足预设训练结束条件后,得到训练完成的所述策略网络,以供基于所述策略网络进行金融衍生品对冲组合的调整。由于在策略网络的训练过程中,策略损失值基于奖励信息计算得到,而奖励信息基于金融衍生品对冲组合的对冲误差的尾部风险计算得到,所以能够使得基于训练完成的策略网络做出的调整动作获得更低的对冲误差的尾部风险。并且,由于设置用于计算尾部风险的风险信息由尾部风险网络基于输入的初始状态信息和/或目标时刻对应的状态信息预测得到,并且尾部风险网络基于风险信息预测损失值训练得到,这使得可以对具有不同初始状态信息的同一种金融衍生品计算得到对应的不同的对冲误差的尾部风险,从而使得反馈样本集合中可以包括多个具有不同初始状态信息的同一种金融衍生品对应的子反馈样本集合。因此,基于反馈样本集合训练完成的策略网络针对各个具有不同初始状态信息的同一种金融衍生品均能够适用,从而无需针对各个具有不同初始状态信息的同一种金融衍生品训练不同的模型,降低了对策略网络进行训练所耗费的时间成本、算力成本和硬件资源成本。此外,已有的考虑尾部风险的方法,都只能用基于参数模型模拟生成的数据来进行训练,而不能直接使用真实的市场观测数据来进行训练,这就不能直接有效地使用真实的市场数据的信息,也不能避免建模的错误和参数估计的错误。本申请实施例提供的方案可以只使用真实的市场数据来进行训练,因此既能够避免建模的错误和参数估计的错误,也能够直接利用市场数据的信息,从而能够有效地降低对冲误差的尾部风险,有助于证券公司进行风险管控。In an embodiment of the present application, a policy network to be trained, a value function network to be trained, and a tail risk network to be trained are used to interact with the environment multiple times to obtain a feedback sample set, wherein the policy network is used to output policy information based on state information at a target time, and the policy information is used to generate an adjustment action for the financial derivatives hedging portfolio at the target time. The feedback sample set includes multiple samples, each of which includes sample data for calculating a loss value, at least one sample data item is calculated based on reward information, the reward information is calculated based on a hedging error of the financial derivatives hedging portfolio and a tail risk of the hedging error, the tail risk is calculated based on risk information, and the risk information is predicted by the tail risk network based on initial state information corresponding to the financial derivative and/or state information corresponding to the target time; the policy network, the value function network, and the tail risk network are each trained for at least one round using the feedback sample set; after each network is trained for at least one round and a preset training end condition is detected to be met, the trained policy network is obtained for use in adjusting the financial derivatives hedging portfolio based on the policy network. Because during the training of the policy network, the policy loss value is calculated based on reward information, which is calculated based on the tail risk of the hedging error of the financial derivative hedging portfolio, adjustment actions made based on the trained policy network can achieve a lower tail risk of the hedging error. Furthermore, because the risk information used to calculate the tail risk is predicted by the tail risk network based on the input initial state information and/or the state information corresponding to the target time, and the tail risk network is trained based on the loss value predicted by the risk information, it is possible to calculate different tail risks of the hedging error for the same financial derivative with different initial state information, thereby allowing the feedback sample set to include multiple sub-feedback sample sets corresponding to the same financial derivative with different initial state information. 
Therefore, the policy network trained based on the feedback sample set is applicable to each financial derivative of the same type with different initial state information, eliminating the need to train different models for each financial derivative of the same type with different initial state information, thereby reducing the time cost, computing power cost, and hardware resource cost of training the policy network. In addition, existing methods that consider tail risk can only be trained on data generated by simulation based on parametric models and cannot be trained directly on real market observation data; as a result, they cannot directly and effectively use the information in real market data, nor can they avoid modeling errors and parameter estimation errors. The solution provided in the embodiments of the present application can be trained using only real market data; it therefore avoids modeling errors and parameter estimation errors while directly using the information in market data, thereby effectively reducing the tail risk of hedging errors and helping securities firms carry out risk management.
图1为本申请数据处理方法一实施例的流程示意图;FIG1 is a flow chart of an embodiment of a data processing method of the present application;
图2为本申请实施例方案涉及的硬件运行环境的结构示意图;FIG2 is a schematic diagram of the structure of the hardware operating environment involved in the embodiment of the present application;
图3为本申请数据处理方法另一实施例的流程示意图;FIG3 is a flow chart of another embodiment of the data processing method of the present application;
图4为本申请数据处理方法另一实施例的流程示意图;FIG4 is a flow chart of another embodiment of the data processing method of the present application;
图5为本申请数据处理方法另一实施例的流程示意图;FIG5 is a flow chart of another embodiment of the data processing method of the present application;
图6为本申请数据处理方法另一实施例的流程示意图;FIG6 is a flow chart of another embodiment of the data processing method of the present application;
图7为本申请数据处理方法另一实施例的流程示意图。FIG7 is a flow chart of another embodiment of the data processing method of the present application.
本申请目的的实现、功能特点及优点将结合实施例,参照附图做进一步说明。The realization of the objectives, functional features and advantages of this application will be further explained in conjunction with embodiments and with reference to the accompanying drawings.
应当理解,此处所描述的具体实施例仅仅用以解释本申请,并不用于限定本申请。It should be understood that the specific embodiments described herein are only used to explain the present application and are not intended to limit the present application.
参照图1,图1为本申请数据处理方法一实施例的流程示意图。Refer to FIG1 , which is a flow chart of an embodiment of a data processing method of the present application.
本申请实施例提供了数据处理方法的实施例,需要说明的是,虽然在流程图中示出了逻辑顺序,但是在某些情况下,可以以不同于此处的顺序执行所示出或描述的步骤。在本实施例中,数据处理方法的执行主体可以是智能手机、个人电脑、服务器等设备,在本实施例中并不做限制。在本实施例中,为便于表述,省略执行主体进行阐述。在本实施例中,所述数据处理方法包括以下步骤:The embodiments of the present application provide embodiments of the data processing method. It should be noted that although a logical order is shown in the flowchart, in some cases, the steps shown or described may be performed in an order different from that shown here. In this embodiment, the execution subject of the data processing method can be a smart phone, a personal computer, a server and other devices, which are not limited in this embodiment. In this embodiment, for ease of description, the execution subject is omitted for elaboration. In this embodiment, the data processing method includes the following steps:
步骤S10,采用待训练的策略网络、待训练的值函数网络和待训练的尾部风险网络与环境进行多次交互得到反馈样本集合,其中,所述策略网络用于基于目标时刻的状态信息输出策略信息,所述策略信息用于生成在所述目标时刻对金融衍生品对冲组合做出的调整动作,所述反馈样本集合包括多条样本,所述样本中包括用于计算损失值的各项样本数据,至少一项样本数据基于奖励信息计算得到,所述奖励信息基于金融衍生品对冲组合的对冲误差和所述对冲误差的尾部风险计算得到,所述尾部风险根据风险信息计算得到,所述风险信息由所述尾部风险网络基于金融衍生品对应的初始状态信息和/或目标时刻对应的状态信息预测得到。Step S10, using the policy network to be trained, the value function network to be trained, and the tail risk network to be trained to interact with the environment multiple times to obtain a feedback sample set, wherein the policy network is used to output policy information based on the state information at the target moment, and the policy information is used to generate an adjustment action for the financial derivatives hedging portfolio at the target moment, and the feedback sample set includes multiple samples, and the samples include various sample data for calculating the loss value, at least one sample data is calculated based on the reward information, and the reward information is calculated based on the hedging error of the financial derivatives hedging portfolio and the tail risk of the hedging error, and the tail risk is calculated according to the risk information, and the risk information is predicted by the tail risk network based on the initial state information corresponding to the financial derivative and/or the state information corresponding to the target moment.
金融衍生品是指其价值依赖于标的资产价值的金融产品,是实体企业和金融机构管理金融风险的重要工具。例如,钢铁生产企业常备有铁矿石库存,为了规避铁矿石价格在未来下跌的风险,需要从证券公司买入铁矿石看跌期权来保值。铁矿石看跌期权的标的资产是铁矿石,铁矿石看跌期权的价值依赖于铁矿石的价值。证券公司卖出金融衍生品时,需要构建对冲组合用于复制该金融衍生品,对冲组合是由现金和对冲资产构成的组合。证券公司在金融衍生品的合约到期时刻之前,需要调整对冲组合中的对冲资产的数量,以保证在金融衍生品的合约期限内,对冲组合的总价值等于或不低于金融衍生品的价值,即金融衍生品合约中约定的一笔或多笔需要支付给金融衍生品的买方的金额(即支付金额)的价值,支付金额由标的资产的价格决定(支付金额可能为负值)。例如,欧式铁矿石看跌期权合约中约定,在期权到期日,期权的卖方需要支付给期权买方的金额为合约中约定的行权价格减去铁矿石在期权到期日的价格所得的差值与零的最大值。对冲组合的对冲误差是对冲组合的价值减去金融衍生品的价值所得的差值。当对冲误差小于零时,即对冲组合的价值低于金融衍生品的价值时,证券公司就要承担损失;因此,证券公司需要有效地控制卖出金融衍生品所带来的金融风险,尤其是对冲误差的尾部风险,即证券公司遭受大额损失的风险。在本实施例中,策略网络可以是用于基于金融衍生品对应的目标时刻的状态信息输出策略信息的神经网络,策略信息用于生成目标时刻对金融衍生品的对冲组合做出的调整动作。其中,目标时刻是指金融衍生品的合约时段内的任一时刻,合约时段的起始时刻即金融衍生品的卖出时刻,终止时刻即金融衍生品的合约中规定的到期时刻(以下称合约到期时刻)。Financial derivatives are financial products whose value depends on the value of the underlying asset. They are an important tool for businesses and financial institutions to manage financial risk. For example, steel producers often hold iron ore inventories. To hedge against the risk of future price declines, they may purchase iron ore put options from securities firms to hedge their value. The underlying asset of these put options is iron ore, and their value depends on the price of iron ore. When securities firms sell financial derivatives, they construct a hedge portfolio to replicate the derivative. The hedge portfolio consists of cash and hedge assets. Before the expiration of a financial derivative contract, securities firms adjust the amount of hedge assets in the portfolio to ensure that, during the contract period, the total value of the hedge portfolio is equal to or greater than the value of the derivative. This is the value of one or more payments (i.e., the payment amount) stipulated in the contract to the buyer of the derivative. The payment amount is determined by the price of the underlying asset (which may be negative). For example, a European iron ore put option contract stipulates that on the option's expiration date, the option seller must pay the option buyer the maximum value of the difference between the strike price agreed in the contract and the price of iron ore on the option's expiration date and zero. The hedging error of a hedge portfolio is the difference between the value of the hedge portfolio and the value of the financial derivative. When the hedging error is less than zero, that is, when the value of the hedge portfolio is lower than the value of the financial derivative, the securities company will bear the loss. Therefore, securities companies need to effectively control the financial risks associated with selling financial derivatives, especially the tail risk of the hedging error, which is the risk of the securities company suffering large losses. In this embodiment, the policy network can be a neural network that outputs policy information based on the state information of the financial derivative at the target time. The policy information is used to generate adjustments to the financial derivative's hedging portfolio at the target time. The target time refers to any time within the contract period of the financial derivative. The contract period starts at the time the financial derivative is sold, and ends at the expiration time specified in the financial derivative contract (hereinafter referred to as the contract expiration time).
The state information of a financial derivative at the target time may include the values, at the target time, of the state variables corresponding to the financial derivative. These state variables may include information variables that affect, or are correlated with, the hedging error of the financial derivative's hedge portfolio, the price of the financial derivative, the prices or cash flows of the hedging assets, or the prices or cash flows of the underlying assets; for example, they may include the volatility of the underlying assets of the financial derivative and the quantities of the hedging assets in the hedge portfolio. Which information variables are included can be set as needed and is not limited here. For example, in one embodiment, the state information of a financial derivative at the target time may include: the amount of cash in the financial derivative's hedge portfolio at the target time (which can be denoted Bt, where t denotes the target time); the quantities of the n hedging assets in the hedge portfolio (which can be denoted δt = (δt,1, …, δt,n)); the prices of the n hedging assets in the hedge portfolio (which can be denoted Gt = (Gt,1, …, Gt,n)); the prices of the underlying assets of the financial derivative (which can be denoted St); the volatilities of the underlying assets of the financial derivative; the remaining time to maturity of the financial derivative (which can be denoted τt); the parameters agreed in the financial derivative's contract that affect the price of the financial derivative, or the ratios of those parameters to the initial prices of the underlying assets of the financial derivative; the hedging error of the hedge portfolio (which can be denoted Wt); and the risk-free interest rate at the target time. For different types of financial derivatives, the types of contract parameters affecting the price of the financial derivative differ. For example: for a European option, the contract parameter affecting the option price is the strike price (which can be denoted K); for an up-and-out call option, the contract parameters affecting the option price are the strike price and the knock-out barrier price; for a snowball option, the contract parameters affecting the option price are the knock-in barrier price, the knock-out barrier price and the coupon rate. It should be noted that items can be added to or removed from the state information listed above according to the actual situation, and they are not enumerated one by one here. For example, in some embodiments, the state information of a financial derivative at the target time may also include the sensitivity of the financial derivative to changes in the price of the underlying asset, as well as other information affecting the value of the derivative and the hedging error; the sensitivity information may include one or more of the financial derivative's Delta, Gamma, Vega and Theta. It should also be noted that when the hedging assets in the financial derivative's hedge portfolio are the same as the underlying assets of the financial derivative, St and Gt coincide, in which case Gt can be omitted from the state information.
For example, the state information st of a European option at time t can be expressed as a vector of the state variables listed above (for instance Bt, δt, Gt, St, τt, S0, Wt and the risk-free rate), where, when St and Gt are the same, Gt can be omitted from the state information, and S0, the price of the underlying asset of the European option at the initial time, can also be replaced by the strike price K.
For example, the state information st at time t corresponding to an up-and-out call option can be expressed analogously, where J, the knock-out barrier agreed in the up-and-out call option contract, and K, the strike price, are additionally included.
For example, the state information st at time t corresponding to a snowball option can be expressed analogously, where J is the knock-out barrier agreed in the snowball option contract, L is the knock-in barrier agreed in the snowball option contract, and D is the dividend rate agreed in the snowball option contract.
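As a hedged illustration of how such a state vector might be assembled for a European option (the ordering and the exact set of components are assumptions; the embodiments above only enumerate the candidate state variables), a small Python helper could look like this:
import numpy as np

def european_option_state(B_t, delta_t, G_t, S_t, tau_t, S0_or_K, W_t, r_t):
    # Hypothetical assembly of the state information s_t for a European option from the
    # state variables listed above; the ordering and exact composition are illustrative.
    return np.concatenate([
        np.atleast_1d(B_t),        # cash amount in the hedge portfolio
        np.ravel(delta_t),         # quantities of the n hedging assets
        np.ravel(G_t),             # prices of the n hedging assets (may be omitted when G_t equals S_t)
        np.atleast_1d(S_t),        # price of the underlying asset
        np.atleast_1d(tau_t),      # remaining time to maturity
        np.atleast_1d(S0_or_K),    # initial underlying price S_0, or the strike price K instead
        np.atleast_1d(W_t),        # hedging error of the hedge portfolio
        np.atleast_1d(r_t),        # risk-free interest rate at the target time
    ]).astype(np.float32)

# Example usage with made-up numbers:
# s_t = european_option_state(0.3, [0.5], [100.0], 100.0, 0.25, 100.0, 0.0, 0.02)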
The adjustment action made to the hedge portfolio can be expressed by the quantities of the hedging assets in the portfolio after the adjustment; that is, the quantities of the hedging assets in the adjusted hedge portfolio can be generated based on the policy information. For example, define the adjustment action selected at time t based on the state information st as at, meaning the quantities of the hedging assets contained in the hedge portfolio after the portfolio is adjusted at time t, i.e., at ∈ Rn, where n is the number of hedging assets. The policy information output by the policy network can be predefined information from which the adjustment action can be generated. For example, in one embodiment, the policy network can be set to output the parameter values of the probability distribution of the adjustment action made at the target time, and the adjustment action is obtained by sampling from the probability distribution defined by those parameter values. For another example, in another embodiment, the policy network can be set to output the action value of the adjustment action made at the target time, the action value being the quantities of the hedging assets in the adjusted hedge portfolio.
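For the first mode above (the policy network outputs distribution parameters), a minimal sketch of generating the adjustment action could look like the following; interpreting the two parameters as the mean and scale of a Normal distribution, and bounding the sampled quantity with a scaled Sigmoid, are illustrative assumptions:
import torch

def sample_adjustment_action(param_1, param_2, b=1.0):
    # Hypothetical sampling step: param_1 and param_2 are interpreted as the mean and (raw) scale
    # of a Normal distribution, and the sampled value is bounded to [0, b] with a scaled Sigmoid
    # so it can serve as a hedging-asset quantity.
    dist = torch.distributions.Normal(param_1, torch.nn.functional.softplus(param_2) + 1e-6)
    raw_action = dist.rsample()               # sample the adjustment action from the distribution
    return b * torch.sigmoid(raw_action)      # quantities of hedging assets after the adjustment

# Example: a_t = sample_adjustment_action(torch.tensor(0.4), torch.tensor(0.1))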
策略网络、值函数网络和尾部风险网络中的网络参数需要经过多轮训练来进行更新,最终获得满足设定预期的策略网络,进而基于该符合设定预期的策略网络来进行金融衍生品的对冲组合的调整。在本实施例中,将网络参数值还未最终确定的策略网络称为待训练的策略网络,同理,也有待训练的值函数网络和待训练的尾部风险网络的概念。在具体实施方式中,在第一轮训练开始之前,待训练的策略网络、待训练的值函数网络和待训练的尾部风险网络中的网络参数可以根据需要进行初始化,在后续的每一轮的训练开始之前,待训练的策略网络、待训练的值函数网络和待训练的尾部风险网络中的网络参数是上一轮训练中所更新得到的网络参数。对待训练的策略网络、待训练的值函数网络和待训练的尾部风险网络进行的各轮训练过程类似,因此,以下以一轮训练过程为例进行说明。The network parameters in the policy network, value function network, and tail risk network need to be updated through multiple rounds of training to ultimately obtain a policy network that meets the set expectations. The hedging portfolio of financial derivatives can then be adjusted based on this policy network that meets the set expectations. In this embodiment, a policy network whose network parameter values have not yet been finalized is referred to as a policy network to be trained. Similarly, there are also concepts of a value function network to be trained and a tail risk network to be trained. In a specific embodiment, before the start of the first round of training, the network parameters in the policy network to be trained, the value function network to be trained, and the tail risk network to be trained can be initialized as needed. Before the start of each subsequent round of training, the network parameters in the policy network to be trained, the value function network to be trained, and the tail risk network to be trained are the network parameters updated in the previous round of training. The training process for each round of training for the policy network to be trained, the value function network to be trained, and the tail risk network to be trained are similar. Therefore, the following description uses a single round of training as an example.
采用待训练的策略网络、待训练的值函数网络和待训练的尾部风险网络与环境进行交互的过程可以参照相关技术中强化学习的智能体基于策略函数与环境进行交互的过程,在此并不做限制。经过多次交互,可以获得反馈样本集合,反馈样本集合中包括多条样本,每条样本中包括至少一项样本数据,各项样本数据用于计算损失值。在具体实施方式中,根据所选取的损失函数不同,样本中包括的具体样本数据不同;在本实施例中,设置至少一项样本数据根据奖励信息计算得到,并设置奖励信息基于金融衍生品的对冲组合的对冲误差和对冲误差的尾部风险计算得到,也即,在损失函数中考虑了对冲误差的尾部风险,从而使得在金融衍生品的合约到期时刻的对冲误差的尾部风险尽可能小;本实施例中,对基于奖励信息计算样本数据,以及基于样本数据计算损失值的方式并不做限制,具体根据所选取的损失函数不同而不同;其他各项样本数据的计算方式在本实施例中也并不做限制,具体根据所选取的损失函数不同而不同。此外,本实施例中,对冲组合的对冲误差的尾部风险基于风险信息计算得到,风险信息则由待训练的尾部风险网络基于金融衍生品对应的初始状态信息和/或目标时刻对应的状态信息预测得到。金融衍生品对应的初始状态信息是指金融衍生品的合约时段中起始时刻所对应的状态信息。通过设置尾部风险网络,由尾部风险网络来基于金融衍生品对应的初始状态信息预测风险信息,并通过风险信息来计算对冲组合的对冲误差的尾部风险,并且尾部风险网络基于风险信息预测损失值训练得到,这使得可以对具有不同初始状态信息的同一种金融衍生品计算得到对应的不同的对冲误差的尾部风险,从而使得反馈样本集合中可以包括多个具有不同初始状态信息的同一种金融衍生品对应的子反馈样本集合。因此,基于反馈样本集合训练得到的符合设定预期的策略网络,可以适用于多个具有不同初始状态信息的同一种金融衍生品;这些多个具有不同初始状态信息的同一种金融衍生品可能具有不同标的初始价格、不同的合约中约定的影响金融衍生品价格的参数、不同的合约到期时刻等,从而不需要针对每个具有不同初始状态信息的同一种金融衍生品单独进行模型训练,从而提高了训练得到的用于对金融衍生品对冲组合进行调整的策略网络的适用范围,降低了训练所需的硬件设备资源成本、算力成本和时间成本。金融衍生品的种类包括但不限于欧式看涨期权、欧式看跌期权、向上敲入障碍期权、向上敲出障碍期权、回看期权、亚式看涨期权、亚式看跌期权、雪球期权、利率期权、债券期权等。The process of using the policy network to be trained, the value function network to be trained, and the tail risk network to be trained to interact with the environment can refer to the process of interaction between an intelligent agent in reinforcement learning based on a policy function and the environment in related technologies, and is not limited here. After multiple interactions, a feedback sample set can be obtained. The feedback sample set includes multiple samples, each sample including at least one sample data item, and each sample data item is used to calculate the loss value. In specific embodiments, the specific sample data included in the sample varies depending on the selected loss function. In this embodiment, at least one sample data item is calculated based on reward information, and the reward information is calculated based on the hedging error and the tail risk of the hedging error of the hedging portfolio of financial derivatives. That is, the tail risk of the hedging error is considered in the loss function, thereby minimizing the tail risk of the hedging error at the expiration of the financial derivative contract. In this embodiment, the method of calculating the sample data based on the reward information and the method of calculating the loss value based on the sample data are not limited, and specifically vary depending on the selected loss function. The method of calculating the other sample data items is also not limited in this embodiment, and specifically varies depending on the selected loss function. Furthermore, in this embodiment, the tail risk of the hedged error of the hedging portfolio is calculated based on risk information, which is predicted by the tail risk network to be trained based on the initial state information of the financial derivative and/or the state information corresponding to the target time. The initial state information corresponding to the financial derivative refers to the state information corresponding to the starting time of the contract period of the financial derivative. By setting up a tail risk network, the tail risk network predicts risk information based on the initial state information of the financial derivative, and calculates the tail risk of the hedged error of the hedging portfolio using the risk information. The tail risk network is trained based on the loss value predicted by the risk information. 
This allows different tail risks of hedging errors to be calculated for the same financial derivative with different initial state information, thereby allowing the feedback sample set to include multiple sub-feedback sample sets corresponding to the same financial derivative with different initial state information. Therefore, a strategy network trained on a feedback sample set that meets the set expectations can be applied to multiple financial derivatives of the same type with different initial state information. These multiple financial derivatives with different initial state information may have different underlying initial prices, different contract parameters affecting the price of the financial derivative, different contract expiration times, etc. This eliminates the need for separate model training for each financial derivative with different initial state information. This increases the applicability of the trained strategy network for adjusting financial derivative hedging portfolios and reduces the hardware resource costs, computing power costs, and time costs required for training. Financial derivatives include, but are not limited to, European call options, European put options, knock-in barrier options, knock-out barrier options, lookback options, Asian call options, Asian put options, snowball options, interest rate options, and bond options.
在具体实施方式中,可以但不限于用多层前馈神经网络表示策略网络。In a specific embodiment, the policy network may be represented by, but is not limited to, a multi-layer feedforward neural network.
在一实施方式中,策略网络可以具有两个独立的多层前馈神经网络,每个前馈神经网络由一个输入层,若干个隐藏层和一个输出层构成。其结构可以设置为:输入层维数与状态信息st的维数相同,共包含三层具有64个神经元的线性隐藏层,每一层线性隐藏层之后分别应用非线性激活函数(例如Swish)和连接批标准化层,输出层维数与动作at的维数相同。另外,可以根据目标金融衍生品的特征和输出变量的含义,通过在输出前应用常数倍的Sigmoid激活函数控制输出变量的取值范围。例如,当目标金融衍生品为以1股股票作为标的资产的欧式看涨期权且输出变量为调整动作的均值变量时,可以在输出前应用Sigmoid激活函数使得输出变量的取值范围为[0,1](目标金融衍生品为欧式看跌期权时,在输出前应用Sigmoid激活函数并乘以-1)。In one embodiment, the policy network can have two independent multi-layer feedforward neural networks, each feedforward neural network consists of an input layer, several hidden layers and an output layer. Its structure can be set to: the input layer dimension is the same as the dimension of the state information s t , and a total of three layers of linear hidden layers with 64 neurons are included. After each layer of linear hidden layer, a nonlinear activation function (such as Swish) and a batch normalization layer are applied, and the output layer dimension is the same as the dimension of the action a t . In addition, the value range of the output variable can be controlled by applying a constant multiple of the Sigmoid activation function before output according to the characteristics of the target financial derivative and the meaning of the output variable. For example, when the target financial derivative is a European call option with 1 stock as the underlying asset and the output variable is the mean variable of the adjustment action, the Sigmoid activation function can be applied before output so that the value range of the output variable is [0,1] (when the target financial derivative is a European put option, the Sigmoid activation function is applied before output and multiplied by -1).
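A minimal sketch of such a policy network, assuming PyTorch (the application does not name a framework) and treating the Sigmoid-bounded branch as the one producing the mean-type parameter, could be:
import torch
import torch.nn as nn

def make_branch(state_dim, action_dim, hidden_dim=64):
    # One feedforward branch: three 64-neuron linear hidden layers, each followed by a
    # Swish (SiLU) activation and a batch normalization layer, then a linear output layer.
    layers, in_dim = [], state_dim
    for _ in range(3):
        layers += [nn.Linear(in_dim, hidden_dim), nn.SiLU(), nn.BatchNorm1d(hidden_dim)]
        in_dim = hidden_dim
    layers.append(nn.Linear(hidden_dim, action_dim))
    return nn.Sequential(*layers)

class PolicyNetwork(nn.Module):
    # Two independent feedforward networks, each producing one parameter variable of the
    # probability distribution that the adjustment action follows.
    def __init__(self, state_dim, action_dim, out_scale=1.0):
        super().__init__()
        self.branch_1 = make_branch(state_dim, action_dim)
        self.branch_2 = make_branch(state_dim, action_dim)
        self.out_scale = out_scale   # e.g. 1.0 for a European call on one share of stock
    def forward(self, s):
        # The first parameter (e.g. the mean variable) is bounded with a scaled Sigmoid;
        # for a European put option the result would instead be multiplied by -1.
        param_1 = self.out_scale * torch.sigmoid(self.branch_1(s))
        param_2 = self.branch_2(s)
        return param_1, param_2

# Example: net = PolicyNetwork(state_dim=9, action_dim=1); p1, p2 = net(torch.randn(32, 9))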
The two multi-layer feedforward neural networks output, respectively, the two parameter variables of the probability distribution that the adjustment action follows. At time t, the state information is st. The constant b can be set according to the situation of the target financial derivative; for example, when the target financial derivative is a European option with one share of stock as the underlying asset, b = 1 can be set. The structure of the networks can be set as follows: the input layer has the same dimension as the state information st; there are three linear hidden layers with 64 neurons each, each linear hidden layer being followed by a nonlinear activation function (for example Swish) and a batch normalization layer; and the output layer has the same dimension as the action at. When the target financial derivative is a European call option with one share of stock as the underlying asset, a Sigmoid activation function can be applied before the output of the two policy networks so that the action value lies in the range [0, 1] (when the target financial derivative is a European put option, the Sigmoid activation function is applied before the output and the result is multiplied by -1).
The risk information output by the tail risk network can be predefined information from which the tail risk of the hedging error of the hedge portfolio can be calculated. For example, in one embodiment, the risk information can be the α-quantile of the negative of the hedging error of the financial derivative's hedge portfolio at the contract expiration time (the end of the contract period); that is, the risk information can be the quantile, at a given level α (for example α = 0.975), of the negative of the hedging error of the hedge portfolio at contract expiration. In another embodiment, the risk information can be the α1-quantile of the negative of the hedging error of the financial derivative's hedge portfolio at the contract expiration time (the end of the contract period), where 0 < α1 < 1 is a fixed constant.
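The application leaves the tail-risk formula and the risk information prediction loss unspecified; a sketch consistent with the α-quantile interpretation above would use the pinball (quantile) loss to train the tail risk network and a CVaR-style quantity as the tail risk. Both choices are assumptions made here for illustration:
import torch

def risk_prediction_loss(predicted_quantile, terminal_hedging_error, alpha=0.975):
    # Pinball (quantile) loss: drives predicted_quantile toward the alpha-quantile of the
    # negative of the hedging error at contract expiration.
    target = -terminal_hedging_error
    diff = target - predicted_quantile
    return torch.mean(torch.maximum(alpha * diff, (alpha - 1.0) * diff))

def tail_risk(predicted_quantile, terminal_hedging_error, alpha=0.975):
    # CVaR-style tail risk: the predicted quantile plus the average loss exceeding it.
    excess = torch.clamp(-terminal_hedging_error - predicted_quantile, min=0.0)
    return predicted_quantile + torch.mean(excess) / (1.0 - alpha)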
在一实施方式中,在策略网络设置为输出调整动作所服从的概率分布的两个参数变量的情况下,尾部风险网络可以表示为ω(s0;ζ),其中,s0为金融衍生品对应的初始状态信息,ζ为尾部风险网络中的网络参数。尾部风险网络可以采用神经网络结构,网络的输入为s0,输出为风险信息。以全连接的神经网络为例(可采用其他类型的网络,例如残差神经网络),其结构可以设置为:输入层维数与状态信息st的维数相同,共包含三层具有64个神经元的线性隐藏层,每一层线性隐藏层之后分别应用非线性激活函数(例如Swish)和连接批标准化层,输出层为1维。在具体实施方式中,尾部风险网络可以分为两种情况:1、针对具有唯一的初始状态信息的金融衍生品的样本来训练模型的情况;此时s0为一个固定的向量,因此ω(s0;ζ)为一个不依赖于状态信息的参数ζ,即ω(s0;ζ)=ζ。2、针对多个具有不同初始状态信息的金融衍生品的样本来训练模型的情况:此时ω(s0;ζ)是一个以初始状态信息s0为输入的函数,可以采用前述的神经网络结构的尾部风险网络。In one embodiment, when the policy network is configured to output two parameter variables of the probability distribution obeyed by the adjustment action, the tail risk network can be represented as ω(s 0 ; ζ), where s 0 is the initial state information corresponding to the financial derivative, and ζ is a network parameter in the tail risk network. The tail risk network can adopt a neural network structure, with the network input being s 0 and the output being the risk information. Taking a fully connected neural network as an example (other network types, such as residual neural networks, can also be used), its structure can be configured as follows: the input layer dimension is the same as the dimension of the state information s t , and it contains three linear hidden layers with 64 neurons. Each linear hidden layer is followed by a nonlinear activation function (such as Swish) and a batch normalization layer, and the output layer is 1-dimensional. In specific embodiments, the tail risk network can be divided into two cases: 1. The model is trained on samples of financial derivatives with unique initial state information. In this case, s 0 is a fixed vector, so ω(s 0 ; ζ) is a parameter ζ that does not depend on the state information, that is, ω(s 0 ; ζ) = ζ. 2. The case of training the model for samples of multiple financial derivatives with different initial state information: In this case, ω(s 0 ;ζ) is a function with the initial state information s 0 as input, and the tail risk network of the aforementioned neural network structure can be used.
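A compact PyTorch sketch of ω(s0; ζ) covering the two cases above (a single trainable constant when all samples share one initial state, and a small feedforward network otherwise) might look like this; PyTorch and the class name are assumptions:
import torch
import torch.nn as nn

class TailRiskNetwork(nn.Module):
    # Hypothetical rendering of omega(s0; zeta). With state_dim=None it degenerates to a single
    # trainable constant (case 1: all samples share one initial state); otherwise it is a small
    # feedforward network with the structure described above (case 2).
    def __init__(self, state_dim=None, hidden_dim=64):
        super().__init__()
        if state_dim is None:
            self.constant = nn.Parameter(torch.zeros(1))
            self.net = None
        else:
            self.constant = None
            self.net = nn.Sequential(
                nn.Linear(state_dim, hidden_dim), nn.SiLU(), nn.BatchNorm1d(hidden_dim),
                nn.Linear(hidden_dim, hidden_dim), nn.SiLU(), nn.BatchNorm1d(hidden_dim),
                nn.Linear(hidden_dim, hidden_dim), nn.SiLU(), nn.BatchNorm1d(hidden_dim),
                nn.Linear(hidden_dim, 1))
    def forward(self, s0):
        if self.net is None:
            return self.constant.expand(s0.shape[0], 1)   # omega(s0; zeta) = zeta
        return self.net(s0)                               # risk information predicted from s0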
在另一实施方式中,在策略网络设置为输出动作值的情况下,可以搭建一个尾部风险网络和一个目标尾部风险网络,分别表示为ω(s0;ζ1)和ω(s0;ζ2),其中,s0为金融衍生品对应的初始状态信息,ζ1和ζ2分别表示两个尾部风险网络中的网络参数,两个尾部风险网络的结构完全相同,且初始化的网络参数也相同。两个尾部风险网络可以采用神经网络结构,网络的输入为s0,输出为风险信息。以全连接的神经网络为例(可采用其他类型的网络,例如残差神经网络),其结构可以设置为:输入层维数与状态信息st的维数相同,共包含三层具有64个神经元的线性隐藏层,每一层线性隐藏层之后分别应用非线性激活函数(例如Swish)和连接批标准化层,输出层为1维。在具体实施方式中,尾部风险网络可以分为两种情况:1、针对具有唯一的初始状态信息的金融衍生品的样本来训练模型的情况;此时s0为一个固定的向量,因此ω(s0;ζ1)为一个不依赖于状态信息的参数ζ1且ω(s0;ζ2)为一个不依赖于状态信息的参数ζ2,即ω(s0;ζ1)=ζ1且ω(s0;ζ2)=ζ2。2、针对多个具有不同初始状态信息的金融衍生品的样本来训练模型的情况:此时ω(s0;ζ1)是一个以初始状态信息s0为输入的函数,可以采用前述的神经网络结构的尾部风险网络,且ω(s0;ζ2)与ω(s0;ζ1)的结构相同。In another embodiment, when the policy network is set to output action values, a tail risk network and a target tail risk network can be built, represented as ω(s 0 ; ζ 1 ) and ω(s 0 ; ζ 2 ), respectively, where s 0 is the initial state information corresponding to the financial derivative, ζ 1 and ζ 2 represent the network parameters in the two tail risk networks, respectively. The structures of the two tail risk networks are exactly the same, and the initialized network parameters are also the same. The two tail risk networks can adopt a neural network structure, with the input of the network being s 0 and the output being risk information. Taking a fully connected neural network as an example (other types of networks can be used, such as a residual neural network), its structure can be set as follows: the dimension of the input layer is the same as the dimension of the state information s t , and it contains three linear hidden layers with 64 neurons. A nonlinear activation function (such as Swish) and a batch normalization layer are applied after each linear hidden layer, and the output layer is 1-dimensional. In a specific embodiment, the tail risk network can be divided into two cases: 1. The model is trained on samples of financial derivatives with unique initial state information. In this case, s0 is a fixed vector, so ω( s0 ; ζ1 ) is a parameter ζ1 that does not depend on the state information, and ω( s0 ; ζ2 ) is a parameter ζ2 that does not depend on the state information, that is, ω( s0 ; ζ1 ) = ζ1 and ω( s0 ; ζ2 ) = ζ2 . 2. The model is trained on samples of multiple financial derivatives with different initial state information. In this case, ω( s0 ; ζ1 ) is a function that takes the initial state information s0 as input. The tail risk network with the aforementioned neural network structure can be used, and the structure of ω( s0 ; ζ2 ) is the same as that of ω( s0 ; ζ1 ).
在一实施方式中,在策略网络设置为输出调整动作所服从的概率分布的两个参数变量的情况下,值函数网络可以用于基于在目标时刻输入的状态信息输出状态价值。可以用多层前馈神经网络表示值函数网络,由一个输入层,若干个隐藏层和一个输出层构成,且输入层节点数和策略网络一致,每个隐藏层包含若干个节点,而输出层仅有一个节点用于输出一个值函数变量,用于评估输入的状态信息的价值(也即状态价值)。值函数网络可以表示为Vφ(s),φ为值函数网络的网络参数,输入为状态信息st,输出为状态价值;以全连接的神经网络为例(可采用其他类型的网络,例如残差神经网络),其结构可以设置为:输入层维数与状态信息st的维数相同,共包含三层具有64个神经元的线性隐藏层,每一层线性隐藏层之后分别应用非线性激活函数(例如Swish)和连接批标准化层,输出层为1维。In one embodiment, when the policy network is set to output two parameter variables of the probability distribution obeyed by the adjustment action, the value function network can be used to output the state value based on the state information input at the target time. The value function network can be represented by a multi-layer feedforward neural network, consisting of an input layer, several hidden layers and an output layer, and the number of nodes in the input layer is consistent with that of the policy network, each hidden layer contains several nodes, and the output layer has only one node for outputting a value function variable for evaluating the value of the input state information (i.e., the state value). The value function network can be represented as V φ (s), where φ is the network parameter of the value function network, the input is the state information s t , and the output is the state value; taking a fully connected neural network as an example (other types of networks can be used, such as residual neural networks), its structure can be set as follows: the input layer dimension is the same as the dimension of the state information s t , and it contains three linear hidden layers with 64 neurons. After each linear hidden layer, a nonlinear activation function (such as Swish) and a batch normalization layer are applied, and the output layer is 1-dimensional.
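A minimal PyTorch sketch of the value function network Vφ(s) described above (input dimension equal to that of the state information, three 64-neuron hidden layers with Swish activations and batch normalization, and one output node) could be:
import torch.nn as nn

class ValueFunctionNetwork(nn.Module):
    # V_phi(s): input dimension equal to that of the state information, three 64-neuron linear
    # hidden layers each followed by Swish (SiLU) and batch normalization, and one output node
    # giving the state value of the input state information.
    def __init__(self, state_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.SiLU(), nn.BatchNorm1d(hidden_dim),
            nn.Linear(hidden_dim, hidden_dim), nn.SiLU(), nn.BatchNorm1d(hidden_dim),
            nn.Linear(hidden_dim, hidden_dim), nn.SiLU(), nn.BatchNorm1d(hidden_dim),
            nn.Linear(hidden_dim, 1))
    def forward(self, s):
        return self.net(s)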
In another embodiment, when the policy network is set to output action values, the value function network can be used to output a state-action value based on the state information and the adjustment action input at the target time; in this case the value function network may also be called a Q-function network. A Q-function network and a target Q-function network can be built, each consisting of an input layer, several hidden layers and an output layer, where the input-layer dimension equals the dimension of the state information plus the dimension of the action. Each hidden layer contains several nodes, while the output layer has only one node, which outputs a Q-function variable used to evaluate the value of an input pair of state information and adjustment action (i.e., the state-action value). Taking a fully connected neural network as an example (other network types, such as residual neural networks, may also be used), the structure of the Q-function network can be set as follows: the input-layer dimension equals the dimension of the state information st plus the dimension of the action at; there are three linear hidden layers with 64 neurons each, each followed by a nonlinear activation function (for example Swish) and a batch normalization layer; and the output layer is one-dimensional.
需要说明的是,上述实施例所提供的策略网络和值函数网络的实现仅为示例性的说明,并非对本申请做任何限定,即并非仅限于前述的网络结构,且网络结构可以随动作所服从的概率分布的指定而相应变化。It should be noted that the implementation of the policy network and value function network provided in the above embodiments is only an exemplary description and does not impose any limitation on this application. That is, it is not limited to the aforementioned network structure, and the network structure can change accordingly with the specification of the probability distribution that the action obeys.
步骤S20,采用所述反馈样本集合对所述策略网络、所述值函数网络和所述尾部风险网络分别进行至少一轮训练。Step S20: Using the feedback sample set to perform at least one round of training on the policy network, the value function network, and the tail risk network.
采用反馈样本集合对策略网络、值函数网络和尾部风险网络分别进行至少一轮训练,一轮训练过程中可以基于反馈样本集合中的样本计算损失函数的损失值,根据损失值来计算各个网络的网络参数的梯度值,根据梯度值来更新各个网络的网络参数。一轮训练过程中,可以对各个网络的网络参数分别进行至少一次的更新。具体的训练方式有很多种,在本实施例中并不做限制。The policy network, value function network, and tail risk network are each trained for at least one round using the feedback sample set. During each round of training, the loss value of the loss function can be calculated based on the samples in the feedback sample set. The gradient values of the network parameters of each network are calculated based on the loss value, and the network parameters of each network are updated based on the gradient values. During each round of training, the network parameters of each network can be updated at least once. There are many specific training methods, which are not limited in this embodiment.
步骤S30,在对各网络分别进行至少一轮的训练并检测到满足预设训练结束条件后,得到训练完成的所述策略网络,以供基于所述策略网络进行金融衍生品对冲组合的调整。Step S30: After each network is trained for at least one round and a preset training end condition is detected to be satisfied, the trained strategy network is obtained for adjusting the financial derivatives hedging portfolio based on the strategy network.
在训练过程中可以检测是否满足预设训练结束条件。具体可以在每轮训练结束后检测,也可以是在对各个网络的网络参数进行一次更新后检测。预设训练结束条件可以根据需要进行设置,在本实施例中并不做限制,例如可以设置为损失函数收敛,又如可以设置为训练轮次达到设定的轮次,又如可以设置为训练时长达到设定的训练时长。During the training process, it is possible to detect whether a preset training end condition is met. Specifically, the detection can be performed after each round of training, or after the network parameters of each network are updated. The preset training end condition can be set as needed and is not limited in this embodiment. For example, it can be set to when the loss function converges, when the number of training rounds reaches a set number, or when the training duration reaches a set duration.
若在一轮训练后仍未满足预设训练结束条件,则可以进行下一轮训练,也即,基于最新更新网络参数后的策略网络、值函数网络和尾部风险网络与环境进行交互获得新一轮训练中的反馈样本集合,再基于反馈样本集合对最新更新网络参数后的各个网络进行新一轮的训练。If the preset training end conditions are still not met after one round of training, the next round of training can be carried out. That is, the policy network, value function network and tail risk network after the latest updated network parameters interact with the environment to obtain a set of feedback samples in the new round of training, and then a new round of training is carried out on each network after the latest updated network parameters based on the feedback sample set.
After detecting that the preset training end condition is met, the policy network with the most recently updated network parameters can be taken as the trained policy network. Based on the trained policy network, policy information can be obtained by inputting the state information corresponding to a given financial derivative at a given time; from this policy information, an adjustment action to be executed on the hedging portfolio of that financial derivative at that time can be generated, and the hedging portfolio can be adjusted according to this adjustment action. Because, during the training of the policy network, the policy loss value is calculated from the reward information, and the reward information is calculated from the tail risk of the hedging error of the financial derivative hedging portfolio, the adjustment actions produced by the trained policy network achieve a lower tail risk of the hedging error. Furthermore, because the risk information used to calculate the tail risk is predicted by the tail risk network from the input initial state information and/or the state information corresponding to the target time, and the tail risk network is trained with the risk information prediction loss value, different tail risks of the hedging error can be calculated for the same kind of financial derivative with different initial state information. The feedback sample set can therefore include multiple sub-feedback sample sets corresponding to the same kind of financial derivative with different initial state information. As a result, the policy network trained on the feedback sample set is applicable to all instances of the same kind of financial derivative with different initial state information, so there is no need to train a separate model for each initial state, which reduces the time, computing power and hardware resource costs of training the policy network. In addition, existing methods that consider tail risk can only be trained on data simulated from parametric models and cannot be trained directly on real market observations; they therefore cannot make direct and effective use of the information in real market data, nor can they avoid modeling errors and parameter estimation errors. The solution provided by the embodiments of the present application can be trained using real market data alone, so it both avoids modeling errors and parameter estimation errors and makes direct use of the information in market data, thereby effectively reducing the tail risk of the hedging error and helping securities firms manage risk.
基于上述实施例,提出本申请数据处理方法另一实施例,在本实施例中,参见图3,所述步骤S10包括:Based on the above embodiment, another embodiment of the data processing method of the present application is proposed. In this embodiment, referring to FIG3 , step S10 includes:
步骤S101,获取预设的环境数据集中一个目标金融衍生品对应的轨迹数据,所述轨迹数据包括所述目标金融衍生品在合约时段内的价格轨迹、交付金额轨迹、所述合约时段内的无风险利率轨迹、所述合约时段内各个资产的价格轨迹和各个资产的现金流轨迹。Step S101, obtaining trajectory data corresponding to a target financial derivative in a preset environmental data set, wherein the trajectory data includes the price trajectory, delivery amount trajectory, risk-free interest rate trajectory, price trajectory of each asset in the contract period, and cash flow trajectory of each asset of the target financial derivative during the contract period.
预先可以设置环境数据集,环境数据集中可以包括多个给定初始状态信息的金融衍生品分别对应的轨迹数据,各个金融衍生品的初始状态信息可以不同,以下以其中一个金融衍生品为例进行说明,并称为目标金融衍生品以示区分。An environmental data set can be set in advance. The environmental data set can include trajectory data corresponding to multiple financial derivatives with given initial state information. The initial state information of each financial derivative can be different. The following takes one of the financial derivatives as an example and calls it the target financial derivative for distinction.
目标金融衍生品对应的轨迹数据可以包括目标金融衍生品在合约时段内的价格轨迹、交付金额轨迹、合约时段内的无风险利率轨迹、合约时段内各个资产的价格轨迹和各个资产的现金流轨迹。各个资产的现金流是指各个资产产生的现金流,例如股票资产产生的现金流包括股票的股息和现金分红,债券资产产生的现金流包括债券的票息。在一些实施方式中,轨迹数据还可以包括目标金融衍生品的标的资产在合约时段内的波动率轨迹,目标金融衍生品的价格对标的资产价格及其他变量变动的敏感度信息(可以包括金融衍生品的Delta、Gamma、Vega和Theta)以及其它对金融衍生品的对冲组合的对冲误差、金融衍生品的价格、对冲资产的价格或现金流或标的资产的价格或现金流产生影响或具有相关性的信息变量,具体包括哪些信息变量,可以根据需要进行设置,在此并不做限制。目标金融衍生品在t时刻的交付金额Ut是指该金融衍生品合约里指定的该金融衍生品卖方在t时刻需付给买方的现金价值(可以为负值)。需要说明的是,目标金融衍生品的标的资产可能有一个或多个,由金融衍生品的合约规定。合约时段内的各个资产包括目标金融衍生品的对冲组合中的对冲资产和目标金融衍生品的标的资产。需要说明的是,对冲资产可能有一个或多个,根据金融衍生品及其标的资产来选取,对冲资产可以和标的资产相同。The trajectory data corresponding to the target financial derivative may include the target financial derivative's price trajectory, delivery amount trajectory, risk-free interest rate trajectory, price trajectory of each asset during the contract period, and cash flow trajectory of each asset. The cash flow of each asset refers to the cash flow generated by each asset. For example, the cash flow generated by stock assets includes stock dividends and cash distributions, and the cash flow generated by bond assets includes bond coupons. In some embodiments, the trajectory data may also include the volatility trajectory of the target financial derivative's underlying asset during the contract period, information on the sensitivity of the target financial derivative's price to changes in the underlying asset price and other variables (which may include the financial derivative's Delta, Gamma, Vega, and Theta), and other information variables that affect or are correlated with the hedging error of the financial derivative's hedging portfolio, the price of the financial derivative, the price or cash flow of the hedged asset, or the price or cash flow of the underlying asset. The specific information variables included can be set as needed and are not limited here. The delivery amount Ut of the target financial derivative at time t refers to the cash value (which may be a negative value) that the seller of the financial derivative is required to pay to the buyer at time t, as specified in the financial derivative contract. It should be noted that the target financial derivative may have one or more underlying assets, as specified in the financial derivative contract. The assets within the contract period include the hedging assets in the target financial derivative's hedging portfolio and the target financial derivative's underlying assets. It should be noted that there may be one or more hedging assets, selected based on the financial derivative and its underlying assets. The hedging assets can be the same as the underlying assets.
环境数据集的获取方式在本实施例中并不做限制,例如可以通过数据模型模拟得到,又如,可以通过基于预设金融信息数据接口获取真实的市场信息数据得到,在本实施例中并不做限制。The method of obtaining the environmental data set is not limited in this embodiment. For example, it can be obtained through data model simulation, or by obtaining real market information data based on a preset financial information data interface. This is not limited in this embodiment.
在一实施方式中,通过数据模型模拟得到目标金融衍生品对应的轨迹数据具体可以包括:In one embodiment, obtaining trajectory data corresponding to a target financial derivative through data model simulation may specifically include:
Specify a model for the hedge asset price and cash flow, a model for the underlying asset price and cash flow, and a model for the risk-free interest rate. For example, the underlying asset can be assumed to follow a geometric Brownian motion model,

dS_t = μ S_t dt + σ S_t dW_t,

where μ and σ are given model parameters and W_t is a standard Brownian motion. Alternatively, the underlying asset can be assumed to follow a GARCH model,
where λ, a, b, c and d are given model parameters, the z_t are independent and identically distributed standard normal random variables, and σ_t is the stochastic volatility. The model for the underlying asset's cash flow D_t is determined by the specific definition of the underlying asset. For example, when the underlying asset is a stock, its cash flow D_t includes dividends and cash distributions, and a model for D_t can be fitted from historical data. When the underlying asset is a bond, its cash flow includes the bond coupons, and the model for D_t depends on the terms of the bond contract and on market information variables such as market interest rates. Similarly, models can be specified for the price G_t of the hedge asset and its cash flow I_t. It should be noted that the hedge asset G_t and the underlying asset S_t can be identical, and that D_t or I_t is zero when there is no cash flow. The risk-free interest rate can be assumed to follow an instantaneous spot CIR model.
Assume that S_0, σ_0 and G_0 follow given initial distributions. Then, based on the models above and a given discrete time interval Δt, a preset number of trajectories of the underlying asset price, underlying asset cash flow, underlying asset volatility, hedge asset price and hedge asset cash flow can be simulated, where T is the expiration time of the financial derivative contract; T can be specified as a fixed value or as several different values as required. If the model is a geometric Brownian motion, the volatility at every time along all volatility trajectories is the constant σ. According to the contract definition of the financial derivative and the specified models of the underlying asset price and cash flow, for each trajectory of underlying asset price, underlying asset cash flow, underlying asset volatility, hedge asset price and hedge asset cash flow, at each time t and based on (S_t, D_t, σ_t), the value of the delivery amount U_t at time t is calculated according to the contract definition of the financial derivative, the price Z_t of the financial derivative at time t excluding U_t is calculated, and the sensitivity information of the financial derivative price at time t is calculated. For example, if the financial derivative is a European put option with strike price K expiring at time T, then U_t = 0 for t = 0, ..., T-1, U_T = max(K - S_T, 0), and Z_T = 0. All trajectories of underlying asset prices, underlying asset cash flows, underlying asset volatilities, hedge asset prices, hedge asset cash flows, financial derivative prices, financial derivative delivery amounts, risk-free rates and sensitivity information are then saved. It should be noted that when the hedge asset and the underlying asset are identical, G_t and I_t can be omitted from the trajectories.
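The following NumPy sketch illustrates how trajectories could be simulated under the geometric Brownian motion example above for a European put option. The number of paths, time grid, parameter values and function name are illustrative assumptions rather than values prescribed by the method.

```python
import numpy as np

def simulate_gbm_paths(s0=100.0, mu=0.05, sigma=0.2, T=1.0, steps=252,
                       n_paths=10000, strike=100.0, seed=0):
    """Simulate underlying price paths under GBM and the terminal delivery
    amount U_T = max(K - S_T, 0) of a European put written on the underlying."""
    rng = np.random.default_rng(seed)
    dt = T / steps
    z = rng.standard_normal((n_paths, steps))
    # Exact discretization of dS_t = mu*S_t*dt + sigma*S_t*dW_t
    log_increments = (mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z
    prices = s0 * np.exp(np.cumsum(log_increments, axis=1))
    prices = np.concatenate([np.full((n_paths, 1), s0), prices], axis=1)
    payoff_T = np.maximum(strike - prices[:, -1], 0.0)  # delivery amount at T
    vol = np.full_like(prices, sigma)  # constant volatility under GBM
    return prices, vol, payoff_T

prices, vol, payoff = simulate_gbm_paths()
print(prices.shape, payoff.mean())
```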
步骤S102,采用待训练的策略网络、待训练的值函数网络和待训练的尾部风险网络与所述目标金融衍生品对应的轨迹数据所限定的环境进行至少一次交互,得到所述目标金融衍生品对应的子反馈样本集合,所述子反馈样本集合中包括多条样本,所述样本中包括用于计算损失值的各项样本数据。Step S102: The policy network to be trained, the value function network to be trained, and the tail risk network to be trained are used to interact at least once with the environment defined by the trajectory data corresponding to the target financial derivative to obtain a sub-feedback sample set corresponding to the target financial derivative, wherein the sub-feedback sample set includes multiple samples, and the samples include various sample data used to calculate the loss value.
采用待训练的策略网络、待训练的值函数网络和待训练的尾部风险网络与目标金融衍生品所限定的环境进行交互的过程可以参照相关技术中智能体与环境进行交互的过程,在本实施例中并不做限制。The process of using the strategy network to be trained, the value function network to be trained, and the tail risk network to be trained to interact with the environment defined by the target financial derivatives can refer to the process of interaction between the intelligent agent and the environment in the relevant technology, and is not limited in this embodiment.
步骤S103,将多个具有不同初始状态信息的金融衍生品对应的所述子反馈样本集合生成所述反馈样本集合。Step S103 : generating the feedback sample set by combining the sub-feedback sample sets corresponding to a plurality of financial derivatives having different initial state information.
所收集的反馈样本集合中的样本数量可以预先根据需要设置,在本实施例中并不做限制。The number of samples in the collected feedback sample set can be pre-set as needed and is not limited in this embodiment.
在一实施方式中,在策略网络设置为输出调整动作所服从的概率分布的两个参数变量,值函数网络用于基于在目标时刻输入的状态信息输出状态价值的情况下,至少一项所述样本数据基于所述状态价值计算得到,参见图4,所述步骤S102包括:In one embodiment, when the policy network is configured to output two parameter variables of a probability distribution obeyed by the adjustment action, and the value function network is configured to output a state value based on the state information input at the target time, at least one item of the sample data is calculated based on the state value. Referring to FIG. 4 , step S102 includes:
步骤S1021,根据所述目标金融衍生品对应的轨迹数据确定所述目标金融衍生品对应的初始状态信息,所述初始状态信息为所述合约时段的起始时刻对应的状态信息。Step S1021 : determining initial state information corresponding to the target financial derivative according to the trajectory data corresponding to the target financial derivative, wherein the initial state information is state information corresponding to the start time of the contract period.
根据目标金融衍生品对应的轨迹数据可以确定目标金融衍生品对应的初始状态信息。根据状态信息中具体包含的信息项不同,根据轨迹数据确定初始状态信息的具体实施方式不同。The initial state information corresponding to the target financial derivative can be determined based on the trajectory data corresponding to the target financial derivative. The specific implementation method of determining the initial state information based on the trajectory data varies depending on the specific information items included in the state information.
步骤S1022,将所述起始时刻作为目标时刻,将所述目标时刻对应的状态信息输入所述策略网络,得到所述目标时刻的策略信息,并根据所述策略信息生成在所述目标时刻对金融衍生品对冲组合做出的调整动作。Step S1022: Taking the starting time as the target time, inputting the state information corresponding to the target time into the strategy network, obtaining the strategy information of the target time, and generating an adjustment action for the financial derivatives hedging portfolio at the target time based on the strategy information.
步骤S1023,根据所述调整动作和所述目标金融衍生品对应的轨迹数据,计算得到所述目标时刻的下一时刻的状态信息和做出所述调整动作后所述对冲组合在所述目标时刻的下一时刻的对冲误差。Step S1023, based on the adjustment action and the trajectory data corresponding to the target financial derivative, calculate the state information of the next moment after the target moment and the hedging error of the hedging combination after the adjustment action is made at the next moment after the target moment.
步骤S1024,将所述目标时刻的状态信息输入待训练的值函数网络,得到所述目标时刻对应的状态价值。Step S1024: input the state information of the target moment into the value function network to be trained to obtain the state value corresponding to the target moment.
步骤S1025,计算在所述目标时刻做出所述调整动作后的奖励信息,其中,当所述目标时刻的下一时刻为所述合约时段的终止时刻时,所述奖励信息根据所述目标金融衍生品对冲组合在所述目标时刻的下一时刻的对冲误差和所述对冲误差的尾部风险计算得到,其中,所述尾部风险是根据风险信息计算得到的,所述风险信息是通过将所述目标金融衍生品对应的初始状态信息和/或目标时刻对应的状态信息输入待训练的尾部风险网络计算得到的。Step S1025, calculating the reward information after the adjustment action is made at the target moment, wherein, when the next moment after the target moment is the end moment of the contract period, the reward information is calculated based on the hedging error of the target financial derivative hedging combination at the next moment after the target moment and the tail risk of the hedging error, wherein the tail risk is calculated based on risk information, and the risk information is calculated by inputting the initial state information corresponding to the target financial derivative and/or the state information corresponding to the target moment into the tail risk network to be trained.
步骤S1026,将所述目标时刻的下一时刻作为新的所述目标时刻,并返回执行所述将所述目标时刻对应的状态信息输入所述策略网络,得到所述目标时刻的策略信息的步骤,直到所述目标时刻的下一时刻为所述合约时段的终止时刻为止。Step S1026, taking the next moment after the target moment as the new target moment, and returning to the step of inputting the state information corresponding to the target moment into the policy network to obtain the policy information of the target moment, until the next moment after the target moment is the end moment of the contract period.
As an example, the starting time is t = 0 and the state information at the starting time is s_0. Inputting s_0 into the policy network gives the policy information π_0 at time t = 0; based on π_0, the adjustment action a_0 at time t = 0 is generated. Based on a_0 and the trajectory data corresponding to the target financial derivative, the state information s_1 at time t = 1 and the hedging error W_1 of the hedging portfolio at time t = 1 after making adjustment action a_0 are calculated, and the reward information r_0 at time t = 0 is calculated; the state information s_0 at time t = 0 is input into the value function network to be trained to obtain the state value V(s_0) at time t = 0. Then the state information s_1 at time t = 1 is input into the policy network to obtain the policy information π_1 at time t = 1; based on π_1, the adjustment action a_1 at time t = 1 is generated, and based on a_1 and the trajectory data of the target financial derivative, the state information s_2 at time t = 2 and the hedging error W_2 of the hedging portfolio at time t = 2 after making adjustment action a_1 are calculated; the reward information r_1 at time t = 1 is calculated, and the state information s_1 at time t = 1 is input into the value function network to be trained to obtain the state value V(s_1) at time t = 1. This continues until the state information s_{T-1} at time t = T-1, the adjustment action a_{T-1}, the tail risk, the reward information r_{T-1}, the state value V(s_{T-1}) and the hedging error W_T at time t = T have been calculated. It should be noted that the reward information r_{T-1} at time T-1 is calculated from the hedging error W_T of the target financial derivative hedging portfolio and from the tail risk of that hedging error, where the tail risk is obtained by inputting s_0 and/or the state information s_{T-1} at time T-1 into the tail risk network to be trained to compute the risk information ω, and is then calculated from ω. Here T is the end time of the contract period of the target financial derivative. The way the reward information at other times is calculated is not limited here; for example, it can be calculated from the hedging error, set to zero, or calculated from the tail risk and the hedging error.
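A minimal sketch of this interaction loop is given below, assuming a Gaussian policy network that returns a mean and a log standard deviation, and a hypothetical helper env_step that applies the adjustment action to the trajectory data and returns the next state, the hedging error and the reward. The helper name and tensor shapes are illustrative assumptions.

```python
import torch

@torch.no_grad()
def collect_episode(policy_net, value_net, s0, T, env_step):
    """Roll out one trajectory: at each time t sample an adjustment action
    from the policy's Gaussian distribution, step the environment defined by
    the trajectory data, and record (s_t, a_t, r_t, V(s_t))."""
    policy_net.eval()
    value_net.eval()
    samples, s = [], s0
    hedging_error = None
    for t in range(T):
        mean, log_std = policy_net(s.unsqueeze(0))     # two distribution parameters
        dist = torch.distributions.Normal(mean, log_std.exp())
        a = dist.sample().squeeze(0)                   # adjustment action a_t
        v = value_net(s.unsqueeze(0)).squeeze(0)       # state value V(s_t)
        s_next, hedging_error, r = env_step(t, s, a)   # environment feedback
        samples.append((s, a, r, v))
        s = s_next
    return samples, hedging_error                      # final hedging error W_T
```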
In the initial state information s_0 of this example, δ_0 := 0 and τ_0 = T, where T is the expiration time of the financial derivative; σ_0, S_0 and G_0 are specified from the trajectory data; B_0 = Z_0 is the net cash value obtained from selling the European option; and W_0 = 0. The interaction proceeds in this way until the end time T of the trajectory. Five different definitions of the reward information r_t may be used, in which λ_1 > 0 and λ_2 ≥ 0 are preset constants; one of them can be selected during training for the calculation of the reward variable r_t. It should be noted that the definition of the reward information variable r_t is not limited to these five.
The above calculation of the reward variable r_t is based on a tail risk of the hedging error defined through the risk information ω(s_0; ζ), where ω(s_0; ζ) is the risk information corresponding to the initial state information s_0 and represents the α-quantile of the negative of the hedging error W_T. W_{t+1} is the hedging error at time t+1; its value can be calculated from the state information s_t at time t, the adjustment action a_t and the state information s_{t+1} at time t+1, where δ_{t+1} = a_t, t = 0, 1, ..., T-1.
In the calculation of W_{t+1}, n is the number of hedge assets, G_{t+1} = (G_{t+1,1}, ..., G_{t+1,n}), G_{t+1,i} is the price of the i-th hedge asset at time t+1, Z_{t+1} is the price of the financial derivative at time t+1 excluding U_{t+1}, and U_{t+1} is the delivery amount at time t+1; all three can be obtained from the price trajectories. δ_{t+1,i} is the i-th component of δ_{t+1}, that is, the quantity of the i-th hedge asset held at time t+1 before adjustment, and B_{t+1} can be calculated as follows, where a_{t,i}, δ_{t,i} and G_{t,i} are the i-th components of a_t, δ_t and G_t respectively, C_t is the transaction cost at time t, and I_{t+1,i} is the cash flow generated by the i-th hedge asset at time t+1, obtained from the cash flow trajectories. Compared with the related art, which assumes a frictionless market with no transaction costs, this embodiment incorporates transaction costs in the calculation, which better matches actual market conditions and makes the trained policy network more suitable for real application scenarios.
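As an illustration of this bookkeeping, the following Python sketch updates a cash account and computes a hedging error, under the assumptions that rebalancing from δ_t to a_t is executed at prices G_t with a proportional transaction cost, that the remaining cash accrues at the risk-free rate over Δt, that hedge-asset cash flows are credited to cash, and that the hedging error is the hedge-portfolio value minus the derivative value Z_{t+1} + U_{t+1}. These modeling choices are illustrative assumptions, not the exact formulas of the present embodiment.

```python
import numpy as np

def step_hedge_account(B_t, delta_t, a_t, G_t, G_t1, I_t1, Z_t1, U_t1,
                       r_t, dt, cost_rate=0.001):
    """Illustrative cash-account update and hedging error at time t+1."""
    trade = a_t - delta_t                          # change in hedge positions
    C_t = cost_rate * np.sum(np.abs(trade) * G_t)  # transaction cost at time t
    cash = (B_t - np.dot(trade, G_t) - C_t) * np.exp(r_t * dt)
    cash += np.dot(a_t, I_t1)                      # cash flows from hedge assets
    delta_t1 = a_t                                 # positions held into t+1
    W_t1 = cash + np.dot(delta_t1, G_t1) - Z_t1 - U_t1
    return cash, delta_t1, W_t1
```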
In other embodiments, other tail risks of the hedging error can be used; for example, with a fixed constant 0 < α_1 < 1, an alternative tail risk of the hedging error can be adopted.
步骤S1027,基于所述合约时段内每个时刻对应的状态信息、调整动作、奖励信息和状态价值,生成每个时刻对应的第一样本,所述第一样本中包括用于计算所述策略网络对应的策略损失值的样本数据和用于计算所述值函数网络对应的评估损失值的样本数据。Step S1027: Based on the state information, adjustment action, reward information and state value corresponding to each moment within the contract period, generate a first sample corresponding to each moment, wherein the first sample includes sample data for calculating the strategy loss value corresponding to the strategy network and sample data for calculating the evaluation loss value corresponding to the value function network.
在本实施方式中,策略网络的策略损失函数可以根据具体需要选取,因此,对于策略损失值的计算方式也并不做限制,第一样本中用于计算策略损失值的样本数据具体包括哪些数据在此并不做限制。在本实施方式中,值函数网络的评估损失函数可以根据具体需要选取,因此,对于评估损失值的计算方式也并不做限制,第一样本中用于计算评估损失值的样本数据具体包含哪些数据在此并不做限制。In this embodiment, the policy loss function of the policy network can be selected according to specific needs. Therefore, there is no restriction on the calculation method of the policy loss value, and there is no restriction on the specific data included in the sample data used to calculate the policy loss value in the first sample. In this embodiment, the evaluation loss function of the value function network can be selected according to specific needs. Therefore, there is no restriction on the calculation method of the evaluation loss value, and there is no restriction on the specific data included in the sample data used to calculate the evaluation loss value in the first sample.
In one embodiment, the first samples generated from one trajectory of the target financial derivative can each be expressed as a tuple of sample data that includes, among other items, an advantage function variable and an accumulated reward variable; that is, T first samples can be generated. Here γ ∈ [0, 1] is the discount rate, which can be set in advance as needed, and λ_gae is the parameter of the generalized advantage function, which can also be set in advance as needed.
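Since the advantage variable is described as coming from generalized advantage estimation with parameters γ and λ_gae, a standard GAE computation over one trajectory is sketched below as an illustration; the function name and the assumption that the value after the terminal step is zero are illustrative choices.

```python
def compute_gae(rewards, values, gamma=0.99, lam_gae=0.95):
    """Generalized advantage estimation over one trajectory of length T.

    rewards[t] is r_t and values[t] is V(s_t); the value after the terminal
    step is assumed to be zero (illustrative assumption)."""
    T = len(rewards)
    advantages = [0.0] * T
    returns = [0.0] * T
    gae = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * lam_gae * gae
        advantages[t] = gae
        returns[t] = advantages[t] + values[t]   # accumulated reward target
    return advantages, returns
```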
步骤S1028,基于所述目标金融衍生品的初始状态信息和所述合约时段的终止时刻对应的对冲误差,生成第二样本,所述第二样本中包括用于计算所述尾部风险网络对应的风险信息预测损失值的样本数据。Step S1028: Generate a second sample based on the initial state information of the target financial derivative and the hedging error corresponding to the termination time of the contract period, wherein the second sample includes sample data for calculating the risk information predicted loss value corresponding to the tail risk network.
在本实施方式中,根据目标金融衍生品的初始状态信息和合约时段的终止时刻对应的对冲误差生成第二样本的方式并不做限制,可以根据所选取的尾部风险网络的风险信息预测损失函数不同而采取不同的生成方式。在本实施方式中,尾部风险网络的风险信息预测损失函数可以根据具体需要选取,因此,对于风险信息预测损失值的计算方式也并不做限制。In this embodiment, the method for generating the second sample based on the initial state information of the target financial derivative and the hedging error corresponding to the end time of the contract period is not limited, and different generation methods can be adopted depending on the risk information prediction loss function selected for the tail risk network. In this embodiment, the risk information prediction loss function of the tail risk network can be selected according to specific needs, and therefore, the method for calculating the risk information prediction loss value is not limited.
在一实施方式中,当步骤S1025中风险信息是通过将目标金融衍生品对应的初始状态信息输入待训练的尾部风险网络计算得到的情况下,第二样本中可以包括目标金融衍生品的初始状态信息和合约时段的终止时刻对应的对冲误差,在基于第二样本对尾部风险网络进行训练时,可以基于第二样本来计算尾部风险网络的风险信息预测损失值,进而根据风险信息预测损失值来计算尾部风险网络的网络参数的梯度值,根据梯度值更新该尾部风险网络的网络参数。In one embodiment, when the risk information in step S1025 is calculated by inputting the initial state information corresponding to the target financial derivative into the tail risk network to be trained, the second sample may include the initial state information of the target financial derivative and the hedging error corresponding to the end time of the contract period. When the tail risk network is trained based on the second sample, the risk information predicted loss value of the tail risk network can be calculated based on the second sample, and then the gradient value of the network parameter of the tail risk network is calculated according to the risk information predicted loss value, and the network parameters of the tail risk network are updated according to the gradient value.
在另一实施方式中,当步骤S1025中风险信息是通过将目标金融衍生品对应的初始状态信息和在目标时刻对应的状态信息分别输入待训练的尾部风险网络计算得到的情况下,第二样本中可以包括目标金融衍生品的初始状态信息、目标时刻对应的状态信息和合约时段的终止时刻对应的对冲误差,在基于第二样本对尾部风险网络进行训练时,可以基于第二样本来计算尾部风险网络的风险信息预测损失值,进而根据风险信息预测损失值来计算尾部风险网络的网络参数的梯度值,根据梯度值更新该尾部风险网络的网络参数。In another embodiment, when the risk information in step S1025 is obtained by respectively inputting the initial state information corresponding to the target financial derivative and the state information corresponding at the target time into the tail risk network to be trained, the second sample may include the initial state information of the target financial derivative, the state information corresponding to the target time and the hedging error corresponding to the end time of the contract period. When the tail risk network is trained based on the second sample, the risk information predicted loss value of the tail risk network can be calculated based on the second sample, and then the gradient value of the network parameter of the tail risk network can be calculated according to the risk information predicted loss value, and the network parameters of the tail risk network can be updated according to the gradient value.
在另一实施方式中,当步骤S1025中风险信息是通过将目标金融衍生品在目标时刻对应的状态信息输入待训练的尾部风险网络计算得到的情况下,第二样本中可以包括目标金融衍生品在目标时刻对应的状态信息和合约时段的终止时刻对应的对冲误差,在基于第二样本对尾部风险网络进行训练时,可以基于第二样本来计算尾部风险网络的风险信息预测损失值,进而根据风险信息预测损失值来计算尾部风险网络的网络参数的梯度值,根据梯度值更新该尾部风险网络的网络参数。In another embodiment, when the risk information in step S1025 is calculated by inputting the state information corresponding to the target financial derivative at the target time into the tail risk network to be trained, the second sample may include the state information corresponding to the target financial derivative at the target time and the hedging error corresponding to the termination time of the contract period. When the tail risk network is trained based on the second sample, the risk information predicted loss value of the tail risk network can be calculated based on the second sample, and then the gradient value of the network parameter of the tail risk network is calculated according to the risk information predicted loss value, and the network parameters of the tail risk network are updated according to the gradient value.
步骤S1029,根据所述第一样本和所述第二样本生成所述目标金融衍生品对应的子反馈样本集合。Step S1029: Generate a sub-feedback sample set corresponding to the target financial derivative based on the first sample and the second sample.
可以将生成的第一样本和第二样本组合得到目标金融衍生品对应的子反馈样本集合。The generated first sample and the second sample can be combined to obtain a sub-feedback sample set corresponding to the target financial derivative.
在一实施方式中,在策略网络设置为输出动作值,值函数网络用于基于在目标时刻输入的状态信息和调整动作输出状态动作价值的情况下,可以采用如下方式生成子反馈样本集合:In one embodiment, when the policy network is configured to output action values and the value function network is configured to output state-action values based on the state information input at the target time and the adjustment action, the sub-feedback sample set can be generated in the following manner:
The initial state information corresponding to the target financial derivative is determined from the trajectory data of the target financial derivative; the initial state information is the state information corresponding to the start time of the contract period. Taking the start time as the target time, the state information corresponding to the target time is input into the policy network to obtain the action value at the target time, and the adjustment action to be made to the financial derivative hedging portfolio at the target time is generated from the action value. Based on the adjustment action and the trajectory data of the target financial derivative, the state information at the time following the target time and the hedging error of the hedging portfolio at the time following the target time after the adjustment action is made are calculated. The reward information after making the adjustment action at the target time is then calculated; in particular, when the time following the target time is the end time of the contract period, the initial state information of the target financial derivative and/or the state information corresponding to the target time are input into the tail risk network to obtain the risk information, the tail risk of the hedging error of the target financial derivative hedging portfolio at the time following the target time is calculated from the risk information, and the reward information is then calculated from that hedging error and its tail risk. The state information and the adjustment action at the target time are input into the Q-function network to be trained to obtain the state-action value corresponding to the target time. The time following the target time is then taken as the new target time, and the procedure returns to the step of inputting the state information corresponding to the target time into the policy network to obtain the policy information of the target time, until the time following the target time is the end time of the contract period. Based on the state information, adjustment action, reward information and next-time state information corresponding to each time within the contract period, a first sample corresponding to each time is generated; the first sample includes sample data for calculating the policy loss value corresponding to the policy network and sample data for calculating the evaluation loss value corresponding to the Q-function network. Based on the initial state information of the target financial derivative and/or the state information corresponding to the target time, together with the hedging error corresponding to the end time of the contract period, a second sample is generated; the second sample includes sample data for calculating the risk information prediction loss value corresponding to the tail risk network ω(s; ζ_1). The sub-feedback sample set corresponding to the target financial derivative is generated from the first samples and the second samples.
As an example, the starting time is t = 0 and the state information at the starting time is s_0. Inputting s_0 into the policy network gives the action value at time t = 0; a noise term ε_0 is randomly sampled (for example, from a normal distribution with a preset scale parameter β_0) and added to this action value to obtain the adjustment action a_0. Then, based on a_0 and the trajectory data of the target financial derivative, the state information s_1 at time t = 1 and the hedging error W_1 of the hedging portfolio at time t = 1 after making adjustment action a_0 are calculated, and the reward information r_0 at time t = 0 is calculated; the state information s_0 and the adjustment action a_0 at time t = 0 are input into the Q-function network to be trained to obtain the state-action value at time t = 0. Then the state information s_1 at time t = 1 is input into the policy network to obtain the action value at time t = 1; a noise term ε_1 is randomly sampled (for example, from a normal distribution) and added to obtain the adjustment action a_1. Based on a_1 and the trajectory data of the target financial derivative, the state information s_2 at time t = 2 and the hedging error W_2 of the hedging portfolio at time t = 2 after making adjustment action a_1 are calculated; the reward information r_1 at time t = 1 is calculated, and the state information s_1 and the adjustment action a_1 at time t = 1 are input into the Q-function network to be trained to obtain the state-action value at time t = 1. This continues until the state information s_{T-1} at time t = T-1, the adjustment action a_{T-1}, the tail risk, the reward information r_{T-1}, the state-action value at time t = T-1 and the hedging error W_T at time t = T have been calculated. It should be noted that the reward information r_{T-1} at time T-1 is calculated from the hedging error W_T of the hedging portfolio of the target financial derivative at time t = T and from the tail risk of that hedging error, where the tail risk is obtained by inputting s_0 and/or the state information s_{T-1} at time T-1 into the tail risk network ω(s; ζ_1) to be trained to compute the risk information, and is then calculated from that risk information. Here T is the end time of the contract period of the target financial derivative. The way the reward information at other times is calculated is not limited here; for example, it can be calculated from the hedging error, set to zero, or calculated from the tail risk and the hedging error.
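A minimal sketch of this exploration step is shown below, assuming a deterministic policy network that outputs the action value directly, a Gaussian noise scale beta, and a clipping range for admissible adjustment actions; all three are illustrative assumptions.

```python
import torch

@torch.no_grad()
def noisy_action(policy_net, s_t, beta=0.1, low=-1.0, high=1.0):
    """Deterministic action value plus Gaussian exploration noise."""
    policy_net.eval()
    a = policy_net(s_t.unsqueeze(0)).squeeze(0)      # action value from the policy
    eps = beta * torch.randn_like(a)                 # noise term epsilon_t
    return torch.clamp(a + eps, low, high)           # adjustment action a_t
```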
示例性地,可以从环境数据集中获取一目标金融衍生品对应的轨迹数据进行如下操作。首先确定初始状态s0,s0由前述st的表达式中取t=0得到。示例性地,目标金融衍生品为欧式期权时,初始状态s0为:
For example, the trajectory data corresponding to a target financial derivative can be obtained from the environmental data set. The following operations are performed. First, the initial state s 0 is determined. s 0 is obtained by taking t=0 from the above expression of s t . For example, when the target financial derivative is a European option, the initial state s 0 is:
其中,δ0:=0,τ0=T,T为该金融衍生品的到期时刻,σ0,S0,G0根据该轨迹数据指定,B0=Z0为卖出该欧式期权所获得的现金净值,W0=0。当标的资产St和对冲资产Gt相同时,s0中可以省略G0。Where δ 0 :=0, τ 0 =T, where T is the expiration time of the financial derivative, σ 0 , S 0 , and G 0 are specified based on the trajectory data, B 0 =Z 0 is the net cash value obtained from selling the European option, and W 0 = 0. When the underlying asset S t and the hedge asset G t are the same, G 0 can be omitted in S 0 .
Then, at the beginning of the k-th round of training, the parameters of the policy network, the value function network and the tail risk network take their values from the previous round. Starting from time t = 0, the state information s_t is input into the policy network to obtain the action value; a noise term ε_t is randomly sampled (for example, from a normal distribution) and added to obtain the adjustment action. The adjustment action a_t is then made, the environment moves from the state variable s_t to the subsequent state variable s_{t+1}, and the environment gives feedback that determines the reward variable r_t. This process continues until the end time T of the trajectory. Five different definitions of the reward information r_t may be used, in which λ_1 > 0 and λ_2 ≥ 0 are preset constants; one of them is selected during training for the calculation of the reward variable r_t. It should be noted that the definition of the reward information variable r_t is not limited to these five.
The above calculation of the reward variable r_t is based on a tail risk of the hedging error defined through the risk information corresponding to the initial state information s_0, which represents a quantile of the negative of the hedging error W_T. W_{t+1} is the hedging error at time t+1; its value can be calculated from the state information s_t at time t, the adjustment action a_t and the state information s_{t+1} at time t+1, where δ_{t+1} = a_t, t = 0, 1, ..., T-1.
In the calculation of W_{t+1}, n is the number of hedge assets, G_{t+1} = (G_{t+1,1}, ..., G_{t+1,n}), G_{t+1,i} is the price of the i-th hedge asset at time t+1, Z_{t+1} is the price of the financial derivative at time t+1 excluding U_{t+1}, and U_{t+1} is the delivery amount at time t+1; all three can be obtained from the price trajectories. δ_{t+1,i} is the i-th component of δ_{t+1}, that is, the quantity of the i-th hedge asset held at time t+1 before adjustment, and B_{t+1} can be calculated in the same way as described above, where a_{t,i}, δ_{t,i} and G_{t,i} are the i-th components of a_t, δ_t and G_t respectively, C_t is the transaction cost at time t, and I_{t+1,i} is the cash flow generated by the i-th hedge asset at time t+1, obtained from the cash flow trajectories. Compared with the related art, which assumes a frictionless market with no transaction costs, this embodiment incorporates transaction costs in the calculation, which better matches actual market conditions and makes the trained policy network more suitable for real application scenarios.
In other embodiments, other tail risks of the hedging error can be used; for example, with a fixed constant 0 < α_1 < 1, an alternative tail risk of the hedging error can be adopted.
In one embodiment, the first samples generated from one trajectory of the target financial derivative can each be expressed as a tuple of the sample data described above; that is, T first samples can be generated.
基于上述各实施例,提出本申请数据处理方法另一实施例,在本实施例中,反馈样本集合中可以包括多条第一样本和第二样本,第一样本中包括用于计算策略网络对应的策略损失值的样本数据和用于计算值函数网络对应的评估损失值的样本数据,第二样本中包括用于计算尾部风险网络对应的风险信息预测损失值的样本数据。参见图5,所述步骤S20包括:Based on the above embodiments, another embodiment of the data processing method of the present application is proposed. In this embodiment, the feedback sample set may include multiple first samples and second samples. The first samples include sample data for calculating the policy loss value corresponding to the policy network and sample data for calculating the evaluation loss value corresponding to the value function network. The second samples include sample data for calculating the risk information prediction loss value corresponding to the tail risk network. Referring to Figure 5, step S20 includes:
步骤S201,采用所述第一样本计算所述策略网络对应的策略损失值和所述值函数网络对应的评估损失值。Step S201: Calculate the policy loss value corresponding to the policy network and the evaluation loss value corresponding to the value function network using the first sample.
步骤S202,根据所述策略损失值和所述评估损失值计算所述策略网络和所述值函数网络中网络参数对应的梯度值,并根据梯度值更新网络参数。Step S202: Calculate the gradient values corresponding to the network parameters in the policy network and the value function network according to the policy loss value and the evaluation loss value, and update the network parameters according to the gradient values.
步骤S203,采用所述第二样本计算所述尾部风险网络对应的风险信息预测损失值,根据所述风险信息预测损失值计算所述尾部风险网络中网络参数对应的梯度值,并根据梯度值更新网络参数。Step S203: use the second sample to calculate the risk information predicted loss value corresponding to the tail risk network, calculate the gradient value corresponding to the network parameter in the tail risk network according to the risk information predicted loss value, and update the network parameter according to the gradient value.
步骤S204,采用所述反馈样本集合对所述策略网络、所述值函数网络和所述尾部风险网络中的网络参数分别进行至少一轮的更新后,完成对所述策略网络、所述值函数网络和所述尾部风险网络的一轮训练过程。Step S204, after using the feedback sample set to update the network parameters in the policy network, the value function network and the tail risk network for at least one round, a round of training process for the policy network, the value function network and the tail risk network is completed.
在一实施方式中,可以从反馈样本集合的各个第一样本中,抽取部分第一样本组成一个子集合,对于子集合中的每个第一样本,采用该第一样本计算策略网络对应的策略损失值和值函数网络对应的评估损失值,基于该子集合中的各个第一样本对应的策略损失值和评估损失值计算一个平均损失值,根据平均损失值计算策略网络和值函数网络中网络参数对应的梯度值,根据梯度值对策略网络和值函数网络中网络参数进行一次更新。再从反馈样本集合剩余的各个第一样本中,抽取部分第一样本组成一个子集合,基于该子集合对策略网络和值函数网络中网络参数进行一次更新,依次类推,直到反馈样本集合中的第一样本用完。对于反馈样本集合中的每个第二样本,采用该第二样本计算尾部风险网络对应的风险信息预测损失值,基于反馈样本集合中的各个第二样本对应的风险信息预测损失值计算一个平均损失值,根据该平均损失值计算尾部风险网络中网络参数对应的梯度值,根据梯度值对尾部风险网络中网络参数进行一次更新。In one embodiment, a subset of first samples may be extracted from each first sample in the feedback sample set to form a subset. For each first sample in the subset, the policy loss value corresponding to the policy network and the evaluation loss value corresponding to the value function network are calculated using the first sample. An average loss value is calculated based on the policy loss values and evaluation loss values corresponding to each first sample in the subset. Gradient values corresponding to network parameters in the policy network and the value function network are calculated based on the average loss value, and the network parameters in the policy network and the value function network are updated based on the gradient values. A subset of first samples may then be extracted from each remaining first sample in the feedback sample set to form a subset. The network parameters in the policy network and the value function network are updated based on the subset, and so on, until all first samples in the feedback sample set are exhausted. For each second sample in the feedback sample set, the risk information predicted loss value corresponding to the tail risk network is calculated using the second sample. An average loss value is calculated based on the risk information predicted loss values corresponding to each second sample in the feedback sample set. The gradient values corresponding to the network parameters in the tail risk network are calculated based on the average loss value, and the network parameters in the tail risk network are updated based on the gradient values.
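The minibatch procedure described in this embodiment could be organized as in the following sketch, where the subset size m, the helper functions and the way losses are averaged inside each helper are illustrative assumptions.

```python
import random

def one_update_pass(first_samples, second_samples, m,
                    actor_critic_step, tail_risk_step):
    """Shuffle the first samples, update the policy and value networks on
    consecutive subsets of size m, then update the tail risk network once on
    all second samples (illustrative structure)."""
    random.shuffle(first_samples)
    for start in range(0, len(first_samples), m):
        actor_critic_step(first_samples[start:start + m])   # one gradient update
    tail_risk_step(second_samples)                           # one gradient update
```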
在一实施方式中,所述第二样本中包括一个目标金融衍生品对应的初始状态信息和所述目标金融衍生品的对冲组合在所述目标金融衍生品的合约终止时刻的对冲误差;参见图6,所述步骤S203包括:In one embodiment, the second sample includes initial state information corresponding to a target financial derivative and the hedging error of the hedging combination of the target financial derivative at the time of contract termination of the target financial derivative. Referring to FIG. 6 , step S203 includes:
步骤S2031,将所述第二样本中的初始状态信息输入所述尾部风险网络预测得到所述第二样本对应的风险信息。Step S2031: input the initial state information of the second sample into the tail risk network prediction to obtain risk information corresponding to the second sample.
在具体实施方式中,当步骤S1025中风险信息是通过将目标金融衍生品对应的初始状态信息输入待训练的尾部风险网络计算得到的情况下,第二样本中包括目标金融衍生品的初始状态信息和合约时段的终止时刻对应的对冲误差,且本步骤将第二样本中的初始状态信息输入尾部风险网络预测得到第二样本对应的风险信息。In a specific embodiment, when the risk information in step S1025 is calculated by inputting the initial state information corresponding to the target financial derivative into the tail risk network to be trained, the second sample includes the initial state information of the target financial derivative and the hedging error corresponding to the end time of the contract period, and this step inputs the initial state information in the second sample into the tail risk network to predict the risk information corresponding to the second sample.
当步骤S1025中风险信息是通过将目标金融衍生品对应的初始状态信息和在目标时刻对应的状态信息分别输入待训练的尾部风险网络计算得到的情况下,第二样本中包括目标金融衍生品的初始状态信息、目标时刻对应的状态信息和合约时段的终止时刻对应的对冲误差,且本步骤将第二样本中的初始状态信息和目标时刻对应的状态信息分别输入尾部风险网络预测得到第二样本对应的风险信息。When the risk information in step S1025 is calculated by respectively inputting the initial state information corresponding to the target financial derivative and the state information corresponding at the target time into the tail risk network to be trained, the second sample includes the initial state information of the target financial derivative, the state information corresponding to the target time and the hedging error corresponding to the end time of the contract period, and this step inputs the initial state information and the state information corresponding to the target time in the second sample into the tail risk network to predict the risk information corresponding to the second sample.
当步骤S1025中风险信息是通过将目标金融衍生品在目标时刻对应的状态信息输入待训练的尾部风险网络计算得到的情况下,第二样本中包括目标金融衍生品在目标时刻对应的状态信息和合约时段的终止时刻对应的对冲误差,且本步骤将第二样本中的目标时刻对应的状态信息输入尾部风险网络预测得到第二样本对应的风险信息。When the risk information in step S1025 is calculated by inputting the state information corresponding to the target financial derivative at the target time into the tail risk network to be trained, the second sample includes the state information corresponding to the target financial derivative at the target time and the hedging error corresponding to the end time of the contract period, and this step inputs the state information corresponding to the target time in the second sample into the tail risk network to predict the risk information corresponding to the second sample.
步骤S2032,将所述第二样本对应的风险信息和所述第二样本中的对冲误差代入预设的风险信息预测损失函数,得到所述第二样本对应的风险信息预测损失值。Step S2032: Substitute the risk information corresponding to the second sample and the hedging error in the second sample into a preset risk information prediction loss function to obtain a risk information prediction loss value corresponding to the second sample.
步骤S2033,根据所述反馈样本集合中多个所述第二样本对应的风险信息预测损失值,计算所述尾部风险网络中网络参数对应的梯度值,并根据梯度值更新网络参数。Step S2033: predicting loss values based on the risk information corresponding to the plurality of second samples in the feedback sample set, calculating gradient values corresponding to network parameters in the tail risk network, and updating the network parameters according to the gradient values.
In one embodiment, when the policy network is configured to output the two parameter variables of the probability distribution that the adjustment action obeys, and the value function network outputs the state value based on the state information input at the target time, the parameters of the policy network, the value function network and the tail risk network at the beginning of the k-th training round are θ = θ_{k-1}, φ = φ_{k-1} and ζ = ζ_{k-1}, respectively. The feedback sample set is first obtained through interaction. Then, at the beginning of each update pass, the first samples in the feedback sample set are randomly shuffled, and groups of m first samples are taken in turn as subsample sets. The samples in the first subsample set are first used to determine a loss function in which c_1 > 0 and c_2 ≥ 0 are preset constants: the first term on the right-hand side is the policy loss value, in which ε > 0 is a preset constant that restricts the probability ratio to a small range; the second term is the evaluation loss value; and the third term, Entropy(π_θ(·|s_i)), is the entropy of the probability distribution π_θ(·|s_i), computed as Entropy(π_θ(·|s_i)) = -∫ π_θ(x|s_i) log(π_θ(x|s_i)) dx. In this embodiment, π_θ(·|s_i) is a diagonal Gaussian distribution whose mean and standard deviation are the two outputs of the policy network, so its entropy takes the standard closed form for a diagonal Gaussian, namely the sum over action dimensions of (1/2) log(2πe σ_j^2), where σ_j is the j-th standard deviation. It should be noted that the loss function used in the above embodiment is only an exemplary description and does not limit the present application in any way; that is, the application is not limited to the aforementioned loss function.
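The clipped probability ratio and entropy term described above match the usual PPO-style surrogate objective; one such loss is sketched below under the assumptions that each first sample carries the state, action, old log-probability, advantage and return, and that the policy network returns a mean and a log standard deviation. The weights c1 and c2 and the clipping constant eps are placeholder values.

```python
import torch

def ppo_style_loss(policy_net, value_net, batch, eps=0.2, c1=0.5, c2=0.01):
    """Clipped surrogate policy loss + value (evaluation) loss - entropy bonus.

    batch is assumed to hold tensors: states, actions, old_log_probs,
    advantages, returns (illustrative sample layout)."""
    mean, log_std = policy_net(batch["states"])
    dist = torch.distributions.Normal(mean, log_std.exp())
    log_probs = dist.log_prob(batch["actions"]).sum(dim=-1)
    ratio = torch.exp(log_probs - batch["old_log_probs"])        # probability ratio
    adv = batch["advantages"]
    policy_loss = -torch.min(ratio * adv,
                             torch.clamp(ratio, 1 - eps, 1 + eps) * adv).mean()
    value_loss = (value_net(batch["states"]) - batch["returns"]).pow(2).mean()
    entropy = dist.entropy().sum(dim=-1).mean()                  # diagonal Gaussian entropy
    return policy_loss + c1 * value_loss - c2 * entropy
```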
Then, according to the loss function, the gradient descent algorithm is used to update the parameter θ of the policy network and the parameter φ of the value function network, where α_θ and α_φ are the training step sizes, i.e., the learning rates, and the updates use the gradients of the loss with respect to θ and φ, respectively. Next, the second subsample set of the feedback sample set, containing the (m+1)-th to 2m-th first samples of the feedback sample set, is obtained, and the preceding steps are repeated to update the parameter θ of the policy network and the parameter φ of the value function network, and so on, until all subsample sets in the feedback sample set have been used. Then, based on all the second samples in the feedback sample set, the risk information prediction loss value is calculated.
In one embodiment, when the tail risk is measured by a first function of the hedging error, the risk information prediction loss function is defined accordingly. In another embodiment, when the tail risk is measured by a second function of the hedging error (where α_1 is a fixed constant, 0 < α_1 < 1), the risk information prediction loss function is defined accordingly.
Then, according to the risk information prediction loss value, the gradient descent algorithm is used to update the parameter ζ of the tail risk network, where α_ζ is the training step size, i.e., the learning rate, and the update uses the gradient of the loss with respect to ζ. This constitutes one update pass.
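Because the risk information ω(s_0; ζ) is described earlier as a quantile of -W_T, a loss consistent with that description is the standard quantile-regression (pinball) loss, sketched below as an illustrative assumption rather than the exact loss definition of this embodiment.

```python
import torch

def quantile_regression_loss(omega, neg_hedging_error, alpha=0.95):
    """Pinball loss whose minimizer is the alpha-quantile of -W_T.

    omega: risk information predicted by the tail risk network for each
    second sample; neg_hedging_error: -W_T from the same samples.
    alpha is an illustrative confidence level."""
    diff = neg_hedging_error - omega
    return torch.mean(torch.maximum(alpha * diff, (alpha - 1.0) * diff))
```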
然后将反馈样本集合中的样本进行随机排序后,再进入下一次更新过程。采用反馈样本集合对各个网络的网络参数进行至少一次的更新后,策略网络、值函数网络和尾部风险网络的参数分别被更新为θk=θ,φk=φ,ζk=ζ。然后进入下一轮的训练,也即基于更新后的网络获取新的反馈样本集合,依据新的反馈样本集合再对网络的网络参数进行多次的更新。The samples in the feedback sample set are then randomly sorted before entering the next update process. After updating the network parameters of each network at least once using the feedback sample set, the parameters of the policy network, value function network, and tail risk network are updated to θk = θ, φk = φ, and ζk = ζ, respectively. The next round of training then begins, where a new feedback sample set is obtained based on the updated network, and the network parameters are updated multiple times based on this new feedback sample set.
In a specific embodiment, the optimizer used by the gradient descent algorithm can be set as needed; for example, an adaptive moment estimation (Adam) optimizer can be used, and the learning rate can be 0.0005.
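As one concrete illustration of this choice, the snippet below builds small placeholder networks and attaches an Adam optimizer with a learning rate of 0.0005 to each of them; the layer sizes are arbitrary illustrative values, not dimensions prescribed by the embodiments.

```python
import torch
import torch.nn as nn

# Placeholder networks; real input/output sizes depend on the state and action definitions.
policy_net    = nn.Sequential(nn.Linear(12, 64), nn.ReLU(), nn.Linear(64, 2))
value_net     = nn.Sequential(nn.Linear(12, 64), nn.ReLU(), nn.Linear(64, 1))
tail_risk_net = nn.Sequential(nn.Linear(12, 64), nn.ReLU(), nn.Linear(64, 1))

# Adam optimizers with the example learning rate of 0.0005.
policy_opt = torch.optim.Adam(policy_net.parameters(), lr=0.0005)
value_opt  = torch.optim.Adam(value_net.parameters(), lr=0.0005)
tail_opt   = torch.optim.Adam(tail_risk_net.parameters(), lr=0.0005)
```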
In one embodiment, the policy network is configured to output action values and the value function network is configured to output a state-action value based on the state information and the adjustment action input at the target time. In this case, at the beginning of the k-th round of training, the policy network, the Q-function network and the tail risk network ω(s_0; ζ_1) have parameters θ_1, φ_1 and ζ_1, respectively, and the target policy network, the target Q-function network and the target tail risk network ω(s_0; ζ_2) have parameters θ_2, φ_2 and ζ_2, respectively. First, the feedback sample set is obtained through interaction. At the beginning of each update process, the first samples in the feedback sample set are randomly shuffled, and every m first samples are then taken in turn as one sub-sample set. For each sample point in the first sub-sample set, the corresponding target value is calculated according to the policy network and the Q-function network, where γ ∈ [0, 1] is the discount rate. The following loss function is thus determined:
where c1 > 0 is a preset constant, the first term on the right-hand side of the equation is the policy loss value, and the second term is the evaluation loss value. It should be noted that the loss function used in the above embodiment is only an illustrative example and does not limit the present application in any way; that is, the method is not limited to the aforementioned loss function.
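The target value and the loss formula themselves are not reproduced above. The sketch below shows one common target-network-based (actor-critic) form that is consistent with the surrounding description: a discounted target value computed from the target networks, a squared evaluation (critic) error, and a policy term combined with weight c1. Whether the target networks or the online networks enter the target value, the concatenation of state and action as the Q-network input, and the exact combination of terms are assumptions made for illustration.

```python
import torch

def actor_critic_loss(policy_net, q_net, target_policy_net, target_q_net,
                      states, actions, rewards, next_states, gamma=0.99, c1=1.0):
    with torch.no_grad():
        next_actions = target_policy_net(next_states)                      # action proposed for the next state
        q_next = target_q_net(torch.cat([next_states, next_actions], dim=-1)).squeeze(-1)
        targets = rewards + gamma * q_next                                 # discounted target value
    q_values = q_net(torch.cat([states, actions], dim=-1)).squeeze(-1)     # Q(s_i, a_i; phi_1)
    eval_loss = (q_values - targets).pow(2).mean()                         # evaluation loss term
    policy_loss = -q_net(torch.cat([states, policy_net(states)], dim=-1)).mean()  # policy loss term
    return policy_loss + c1 * eval_loss
```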
According to the loss value of the loss function, the gradient descent algorithm is used to update the parameter θ_1 of the policy network and the parameter φ_1 of the Q-function network: θ_1 ← θ_1 − α_θ·∇_θ1 Loss and φ_1 ← φ_1 − α_φ·∇_φ1 Loss, where α_θ and α_φ are the training step sizes (i.e., the learning rates), and ∇_θ1 Loss and ∇_φ1 Loss denote the gradients with respect to the parameters θ_1 and φ_1, respectively. In addition, a replication parameter 0 ≤ ρ ≤ 1 is preset, and the parameter θ_2 of the target policy network and the parameter φ_2 of the target Q-function network are updated proportionally according to the replication parameter ρ: θ_2 = (1 − ρ)θ_2 + ρθ_1 and φ_2 = (1 − ρ)φ_2 + ρφ_1. Then, the second sub-sample set of the feedback sample set, which contains the (m+1)-th to 2m-th samples in the feedback sample set, is obtained, and the previous steps are repeated to update the parameters of the policy network and the Q-function network and to proportionally update the parameters of the target policy network and the target Q-function network, and so on, until all sub-sample sets in the feedback sample set have been used.
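The proportional update of the target parameters is the familiar Polyak (soft) update. A minimal sketch:

```python
import torch

@torch.no_grad()
def soft_update(target_net, online_net, rho):
    # p2 <- (1 - rho) * p2 + rho * p1, applied parameter by parameter.
    for p2, p1 in zip(target_net.parameters(), online_net.parameters()):
        p2.mul_(1.0 - rho).add_(rho * p1)
```

Calling soft_update(target_q_net, q_net, rho) after each gradient step on φ_1 keeps φ_2 trailing φ_1, and the same helper can be reused for θ_2 and ζ_2.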
Then, based on all the second samples in the feedback sample set, the risk information prediction loss value is calculated.
In one embodiment, the tail risk is measured by a first preset function, and the risk information prediction loss function is defined correspondingly. In another embodiment, the tail risk is measured by a second preset function, and the risk information prediction loss function is defined correspondingly.
Then, according to the risk information prediction loss value, the gradient descent algorithm is used to update the parameter ζ_1 of the tail risk network ω(s_0; ζ_1): ζ_1 ← ζ_1 − α_ζ·∇_ζ1 Loss_risk, where α_ζ is the training step size (i.e., the learning rate) and ∇_ζ1 Loss_risk denotes the gradient with respect to the parameter ζ_1. Meanwhile, the parameter ζ_2 of the target tail risk network ω(s_0; ζ_2) is updated proportionally according to the replication parameter ρ_ζ: ζ_2 = (1 − ρ_ζ)ζ_2 + ρ_ζ·ζ_1, where 0 ≤ ρ_ζ ≤ 1 is a preset replication parameter. This constitutes one update process.
The samples in the feedback sample set are then randomly shuffled before the next update process begins. After the network parameters of each network have been updated at least once using the feedback sample set, the parameters of the two policy networks, the two value function networks and the two tail risk networks are each replaced by their updated values. The next round of training then begins: a new feedback sample set is obtained based on the updated networks, and the network parameters are again updated multiple times based on this new feedback sample set.
In one embodiment, based on the above embodiments, another embodiment of the data processing method of the present application is proposed. In this embodiment, referring to FIG. 7, before step S101 the method further includes:
Step S40: for any one of the financial derivatives that take a preset asset as the underlying asset within a preset time period and have different initial state information, collecting the trajectory data corresponding to the financial derivative based on a preset financial information data interface;
Step S50: obtaining the environment data set according to the trajectory data corresponding to the financial derivatives having different initial state information.
This embodiment proposes a specific implementation for acquiring the environment data set. The preset financial information data interface is a financial information data interface configured in advance as required; through this interface, historical, real-world information related to financial derivatives can be obtained, and the trajectory data corresponding to those financial derivatives can then be derived. The preset asset and the preset time period can be configured as needed and are not limited in this embodiment.
In one embodiment, for a preset asset (for example, the SSE 50 ETF) and for each day within a preset time period (for example, January 1, 2015 to December 31, 2022), all financial derivatives of a preset type that take the preset asset as the underlying asset (for example, SSE 50 ETF put options) on that day can be obtained through the preset financial information data interface. For each such financial derivative, all prices (transaction prices, or the mid-price between the best ask and the best bid) from t = 0 on that day (that is, the start of the derivative's contract period) to the derivative's expiration time t = T are extracted as the derivative's price trajectory, and the parameters agreed in the derivative's contract that affect its price (for example, the strike price and the time to maturity) are recorded. For each financial derivative of the preset type, the preset financial information data interface can be used to collect, over the derivative's contract period, the trajectories of the underlying asset prices, the underlying asset cash flows, the prices G_t = (G_t,1, ..., G_t,n) of the n hedge assets in the hedging portfolio and the hedge asset cash flows I_t = (I_t,1, ..., I_t,n), and to record the trajectory of the derivative's delivery amounts over the contract period. The Shanghai Interbank Offered Rate (SHIBOR) at each moment of the derivative's contract period can be designated as the risk-free interest rate trajectory. In one embodiment, the implied volatility trajectory of the underlying asset corresponding to the derivative's price can be obtained; in another embodiment, the volatility trajectory can instead be defined as the historical volatility trajectory or a forecast volatility trajectory of the derivative's underlying asset. In one embodiment, if sensitivity data of the derivative's price are available, the sensitivity data of the derivative can be obtained. In one embodiment, other information variables that influence or are correlated with the hedging error of the derivative's hedging portfolio, the price of the derivative, the prices of the hedge assets, the cash flows of the hedge assets, the prices of the underlying assets or the cash flows of the underlying assets can also be obtained through the preset financial information data interface; which information variables are included can be set as needed and is not limited here. This continues until all financial derivatives of the preset type on all dates within the preset time period have been processed.
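A compact way to picture the resulting environment data set is one record per contract, holding the trajectories listed above. In the sketch below, `DerivativeTrajectory` and the `data_api` calls (`list_derivatives`, `fetch_trajectory`) are hypothetical names standing in for the preset financial information data interface; the field names simply mirror the quantities described in this embodiment.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class DerivativeTrajectory:
    """One environment-data record per derivative contract."""
    contract_terms: Dict[str, float]            # e.g. strike price, time to maturity
    derivative_prices: List[float]              # contract prices for t = 0, ..., T
    underlying_prices: List[float]              # underlying asset prices
    underlying_cash_flows: List[float]          # underlying asset cash flows
    hedge_asset_prices: List[List[float]]       # G_t = (G_t,1, ..., G_t,n)
    hedge_asset_cash_flows: List[List[float]]   # I_t = (I_t,1, ..., I_t,n)
    delivery_amounts: List[float]               # payments owed under the contract
    risk_free_rates: List[float]                # e.g. SHIBOR over the contract period
    volatilities: List[float] = field(default_factory=list)  # implied / historical / forecast, optional

def build_environment_dataset(data_api, asset_code, dates):
    # `data_api` is a hypothetical wrapper around the preset financial information data interface.
    dataset = []
    for day in dates:
        for contract in data_api.list_derivatives(asset_code, day):   # hypothetical call
            dataset.append(data_api.fetch_trajectory(contract))       # hypothetical call
    return dataset
```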
In this embodiment, the environment data set is constructed from real data obtained through the preset financial information data interface, so there is no need to build parametric models for the underlying asset prices, underlying asset cash flows, hedge asset prices, hedge asset cash flows, risk-free interest rate or other information variables, which effectively avoids the model error and the model-parameter estimation error of such parametric models.
In addition, an embodiment of the present application further provides a data processing apparatus, the apparatus comprising:
an interaction module, configured to use a policy network to be trained, a value function network to be trained and a tail risk network to be trained to interact with an environment multiple times to obtain a feedback sample set, wherein the policy network is configured to output policy information based on state information at a target time, the policy information is used to generate an adjustment action performed on the financial derivative hedging portfolio at the target time, the feedback sample set includes multiple samples, each sample includes sample data items used to calculate loss values, at least one sample data item is calculated based on reward information, the reward information is calculated based on the hedging error of the financial derivative hedging portfolio and the tail risk of the hedging error, the tail risk is calculated according to risk information, and the risk information is predicted by the tail risk network based on initial state information corresponding to the financial derivative and/or state information corresponding to the target time;
a training module, configured to use the feedback sample set to perform at least one round of training on each of the policy network, the value function network and the tail risk network, and, after each network has been trained for at least one round and a preset training end condition is detected to be met, obtain the trained policy network for adjusting the financial derivative hedging portfolio based on the policy network.
In one embodiment, the interaction module is further configured to:
obtain trajectory data corresponding to a target financial derivative in a preset environment data set, the trajectory data including the price trajectory and delivery amount trajectory of the target financial derivative within the contract period, the risk-free interest rate trajectory within the contract period, and the price trajectory and cash flow trajectory of each asset within the contract period;
use the policy network to be trained, the value function network to be trained and the tail risk network to be trained to interact at least once with the environment defined by the trajectory data corresponding to the target financial derivative, to obtain a sub-feedback sample set corresponding to the target financial derivative, the sub-feedback sample set including multiple samples, each sample including sample data items used to calculate loss values; and
generate the feedback sample set from the sub-feedback sample sets corresponding to multiple financial derivatives with different initial state information (a sketch of this aggregation step follows below).
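A minimal sketch of this aggregation step, assuming a rollout helper that performs the per-contract interaction (one possible form of such a helper, `rollout_one_contract`, is sketched later after the interaction steps):

```python
def build_feedback_sample_set(trajectories, rollout):
    # `rollout` is any callable mapping one contract's trajectory data to its
    # (first samples, second samples) sub-feedback sample set.
    first_samples, second_samples = [], []
    for trajectory in trajectories:        # contracts with different initial state information
        firsts, seconds = rollout(trajectory)
        first_samples.extend(firsts)
        second_samples.extend(seconds)
    return first_samples, second_samples
```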
In one embodiment, the value function network is configured to output a state value based on the state information input at the target time, at least one sample data item is calculated based on the state value, and the interaction module is further configured to:
determine, according to the trajectory data corresponding to the target financial derivative, the initial state information corresponding to the target financial derivative, the initial state information being the state information corresponding to the start time of the contract period;
take the start time as the target time, input the state information corresponding to the target time into the policy network to obtain the policy information at the target time, and generate, according to the policy information, the adjustment action performed on the financial derivative hedging portfolio at the target time;
calculate, according to the adjustment action and the trajectory data corresponding to the target financial derivative, the state information at the time immediately following the target time and the hedging error of the hedging portfolio at the time immediately following the target time after the adjustment action is performed;
input the state information at the target time into the value function network to be trained to obtain the state value corresponding to the target time;
calculate the reward information after the adjustment action is performed at the target time, wherein, when the time immediately following the target time is the end time of the contract period, the reward information is calculated based on the hedging error of the target financial derivative's hedging portfolio at the time immediately following the target time and the tail risk of the hedging error, the tail risk being calculated according to risk information obtained by inputting the initial state information corresponding to the target financial derivative and/or the state information corresponding to the target time into the tail risk network to be trained;
take the time immediately following the target time as the new target time, and return to the step of inputting the state information corresponding to the target time into the policy network to obtain the policy information at the target time, until the time immediately following the target time is the end time of the contract period;
generate, based on the state information, adjustment action, reward information and state value corresponding to each time within the contract period, a first sample corresponding to each time, the first sample including sample data used to calculate the policy loss value corresponding to the policy network and sample data used to calculate the evaluation loss value corresponding to the value function network;
generate, based on the initial state information of the target financial derivative and the hedging error corresponding to the end time of the contract period, a second sample including sample data used to calculate the risk information prediction loss value corresponding to the tail risk network; and
generate the sub-feedback sample set corresponding to the target financial derivative from the first samples and the second sample (a rollout sketch of this interaction procedure follows below).
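A rough Python sketch of this interaction with the environment defined by one contract's trajectory data is given below. The helpers `initial_state_from`, `sample_action`, `env_step` and `compute_reward` are hypothetical placeholders for the state-transition, action-sampling and reward rules described above; they are not names taken from the embodiments.

```python
def rollout_one_contract(trajectory, policy_net, value_net, tail_risk_net,
                         initial_state_from, sample_action, env_step, compute_reward):
    # One interaction over the contract period t = 0, ..., T of a single derivative.
    first_samples = []
    T = len(trajectory.derivative_prices) - 1
    state = initial_state_from(trajectory)                 # initial state information at t = 0
    hedging_error = 0.0
    for t in range(T):
        policy_info = policy_net(state)                    # policy information at the target time
        action = sample_action(policy_info)                # adjustment to the hedging portfolio
        next_state, hedging_error = env_step(state, action, trajectory, t)
        state_value = value_net(state)                     # state value at the target time
        terminal = (t + 1 == T)                            # next time is the end of the contract period
        reward = compute_reward(hedging_error, terminal, tail_risk_net, trajectory, state)
        first_samples.append((state, action, reward, state_value))
        state = next_state
    # Second sample: initial state information plus the terminal hedging error.
    second_samples = [(initial_state_from(trajectory), hedging_error)]
    return first_samples, second_samples
```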
In one embodiment, when the risk information is calculated by inputting the initial state information corresponding to the target financial derivative into the tail risk network to be trained, the second sample includes the initial state information of the target financial derivative and the hedging error corresponding to the end time of the contract period.
In one embodiment, when the risk information is calculated by respectively inputting the initial state information corresponding to the target financial derivative and the state information corresponding to the target time into the tail risk network to be trained, the second sample includes the initial state information of the target financial derivative, the state information corresponding to the target time and the hedging error corresponding to the end time of the contract period.
In one embodiment, when the risk information is calculated by inputting the state information corresponding to the target financial derivative at the target time into the tail risk network to be trained, the second sample includes the state information corresponding to the target financial derivative at the target time and the hedging error corresponding to the end time of the contract period.
In one embodiment, the data processing apparatus further includes:
a collection module, configured to collect, for any one of the financial derivatives that take a preset asset as the underlying asset within a preset time period and have different initial state information, the trajectory data corresponding to the financial derivative based on a preset financial information data interface, and to obtain the environment data set according to the trajectory data corresponding to the financial derivatives having different initial state information.
In one embodiment, the feedback sample set includes multiple first samples and second samples, the first samples including sample data used to calculate the policy loss value corresponding to the policy network and sample data used to calculate the evaluation loss value corresponding to the value function network, and the second samples including sample data used to calculate the risk information prediction loss value corresponding to the tail risk network;
the training module is further configured to:
calculate, using the first samples, the policy loss value corresponding to the policy network and the evaluation loss value corresponding to the value function network;
calculate, according to the policy loss value and the evaluation loss value, the gradient values corresponding to the network parameters in the policy network and the value function network, and update the network parameters according to the gradient values;
calculate, using the second samples, the risk information prediction loss value corresponding to the tail risk network, calculate, according to the risk information prediction loss value, the gradient values corresponding to the network parameters in the tail risk network, and update the network parameters according to the gradient values; and
complete one round of training of the policy network, the value function network and the tail risk network after the network parameters in the policy network, the value function network and the tail risk network have each been updated at least once using the feedback sample set.
In one embodiment, the second sample includes the initial state information corresponding to a target financial derivative and the hedging error of the target financial derivative's hedging portfolio at the contract termination time of the target financial derivative;
the training module is further configured to:
input the initial state information in the second sample into the tail risk network to predict the risk information corresponding to the second sample;
substitute the risk information corresponding to the second sample and the hedging error in the second sample into a preset risk information prediction loss function to obtain the risk information prediction loss value corresponding to the second sample (one illustrative form of such a loss is sketched after this list); and
calculate, according to the risk information prediction loss values corresponding to the multiple second samples in the feedback sample set, the gradient values corresponding to the network parameters in the tail risk network, and update the network parameters according to the gradient values.
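The concrete form of the preset risk information prediction loss function is left to the formulas of the individual embodiments and is not reproduced in this text. Purely as an illustration, a quantile (pinball) loss is one form such a loss could take when the tail risk is characterised by a quantile level α1 of the terminal hedging error; this choice is an assumption, not the loss used in the embodiments.

```python
import torch

def risk_prediction_loss(predicted_risk, hedging_errors, alpha1=0.95):
    # Pinball loss: its minimiser is the alpha1-quantile of the hedging error distribution.
    diff = hedging_errors - predicted_risk
    return torch.mean(torch.maximum(alpha1 * diff, (alpha1 - 1.0) * diff))
```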
In one embodiment, the state information corresponding to a financial derivative at a target time includes the amount of cash in the financial derivative's hedging portfolio at the target time, the quantity of each hedge asset, the price of each hedge asset, the price of each underlying asset of the financial derivative, the remaining time to maturity of the financial derivative, the parameters agreed in the financial derivative's contract that affect the price of the financial derivative or the ratios of those parameters to the initial prices of the respective underlying assets of the financial derivative, the hedging error of the hedging portfolio, and the risk-free interest rate at the target time. The state information corresponding to the financial derivative at the target time may further include the volatility of each underlying asset of the financial derivative at that time.
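For concreteness, the sketch below assembles these components into a single flat vector for the networks; the ordering, the use of NumPy, and any normalisation (for example, dividing the contract parameters by the initial underlying prices) are illustrative choices rather than requirements of the embodiment.

```python
import numpy as np

def build_state_vector(cash, hedge_quantities, hedge_prices, underlying_prices,
                       time_to_maturity, contract_params, hedging_error, risk_free_rate,
                       volatilities=None):
    # Concatenate the state components listed above into one vector.
    parts = [np.atleast_1d(cash), np.asarray(hedge_quantities), np.asarray(hedge_prices),
             np.asarray(underlying_prices), np.atleast_1d(time_to_maturity),
             np.asarray(contract_params), np.atleast_1d(hedging_error),
             np.atleast_1d(risk_free_rate)]
    if volatilities is not None:
        parts.append(np.asarray(volatilities))   # optional volatility of each underlying asset
    return np.concatenate(parts).astype(np.float32)
```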
In one embodiment, the policy network includes two independent multi-layer feedforward neural networks, each feedforward neural network including one input layer, several hidden layers and one output layer.
In one embodiment, the value function network includes two independent multi-layer feedforward neural networks, each feedforward neural network including one input layer, several hidden layers and one output layer; the number of input layer nodes is the same as that of the policy network, each hidden layer contains several nodes, and the output layer has only one node, which outputs one value function variable used to evaluate the value of the input state information.
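A minimal sketch of such feedforward networks follows; the layer sizes are illustrative, and the assumption that the two independent policy networks produce, for example, the mean and standard deviation of the policy distribution is one possible reading rather than a detail stated in the text.

```python
import torch.nn as nn

def make_mlp(in_dim, hidden_dims, out_dim):
    # One input layer, several hidden layers, one output layer.
    layers, prev = [], in_dim
    for h in hidden_dims:
        layers += [nn.Linear(prev, h), nn.ReLU()]
        prev = h
    layers.append(nn.Linear(prev, out_dim))
    return nn.Sequential(*layers)

state_dim, action_dim, hidden = 12, 1, [64, 64]          # illustrative sizes only
policy_net_a = make_mlp(state_dim, hidden, action_dim)   # one of the two independent policy MLPs
policy_net_b = make_mlp(state_dim, hidden, action_dim)   # the other independent policy MLP
value_net_a  = make_mlp(state_dim, hidden, 1)            # value MLP: single output node
value_net_b  = make_mlp(state_dim, hidden, 1)            # second independent value MLP
```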
In addition, an embodiment of the present application further provides a data processing device. FIG. 2 is a schematic diagram of the device structure of the hardware operating environment involved in the embodiments of the present application. It should be noted that the data processing device in the embodiments of the present application may be a smartphone, a personal computer, a server or another device, which is not specifically limited here.
As shown in FIG. 2, the data processing device may include: a processor 1001 (for example, a CPU), a network interface 1004, a user interface 1003, a memory 1005 and a communication bus 1002. The communication bus 1002 is used to implement connection and communication between these components. The user interface 1003 may include a display and an input unit such as a keyboard, and may optionally further include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a Wi-Fi interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory, such as a disk memory; optionally, the memory 1005 may also be a storage device independent of the aforementioned processor 1001.
Those skilled in the art will understand that the device structure shown in FIG. 2 does not constitute a limitation on the data processing device, which may include more or fewer components than shown, or combine certain components, or have a different arrangement of components.
As shown in FIG. 2, the memory 1005, as a computer storage medium, may include an operating system, a network communication module, a user interface module and a data processing program. The operating system is a program that manages and controls the hardware and software resources of the device and supports the running of the data processing program and other software or programs. In the device shown in FIG. 2, the user interface 1003 is mainly used for data communication with a client, the network interface 1004 is mainly used for establishing a communication connection with a server, and the processor 1001 may be used to call the data processing program stored in the memory 1005 and perform the following operations:
using a policy network to be trained, a value function network to be trained and a tail risk network to be trained to interact with an environment multiple times to obtain a feedback sample set, wherein the policy network is configured to output policy information based on state information at a target time, the policy information is used to generate an adjustment action performed on the financial derivative hedging portfolio at the target time, the feedback sample set includes multiple samples, each sample includes sample data items used to calculate loss values, at least one sample data item is calculated based on reward information, the reward information is calculated based on the hedging error of the financial derivative hedging portfolio and the tail risk of the hedging error, the tail risk is calculated according to risk information, and the risk information is predicted by the tail risk network based on initial state information corresponding to the financial derivative and/or state information corresponding to the target time;
using the feedback sample set to perform at least one round of training on each of the policy network, the value function network and the tail risk network; and
after each network has been trained for at least one round and a preset training end condition is detected to be met, obtaining the trained policy network for adjusting the financial derivative hedging portfolio based on the policy network.
In one embodiment, the operation of using the policy network, value function network and tail risk network to be trained to interact with the environment multiple times to obtain the feedback sample set includes:
obtaining trajectory data corresponding to a target financial derivative in a preset environment data set, the trajectory data including the price trajectory and delivery amount trajectory of the target financial derivative within the contract period, the risk-free interest rate trajectory within the contract period, and the price trajectory and cash flow trajectory of each asset within the contract period;
using the policy network to be trained, the value function network to be trained and the tail risk network to be trained to interact at least once with the environment defined by the trajectory data corresponding to the target financial derivative, to obtain a sub-feedback sample set corresponding to the target financial derivative, the sub-feedback sample set including multiple samples, each sample including sample data items used to calculate loss values; and
generating the feedback sample set from the sub-feedback sample sets corresponding to multiple financial derivatives with different initial state information.
In one embodiment, the value function network is configured to output a state value based on the state information input at the target time, at least one sample data item is calculated based on the state value, and the operation of using the policy network, value function network and tail risk network to be trained to interact multiple times with the environment defined by the trajectory data corresponding to the target financial derivative to obtain the sub-feedback sample set corresponding to the target financial derivative includes:
determining, according to the trajectory data corresponding to the target financial derivative, the initial state information corresponding to the target financial derivative, the initial state information being the state information corresponding to the start time of the contract period;
taking the start time as the target time, inputting the state information corresponding to the target time into the policy network to obtain the policy information at the target time, and generating, according to the policy information, the adjustment action performed on the financial derivative hedging portfolio at the target time;
calculating, according to the adjustment action and the trajectory data corresponding to the target financial derivative, the state information at the time immediately following the target time and the hedging error of the hedging portfolio at the time immediately following the target time after the adjustment action is performed;
inputting the state information at the target time into the value function network to be trained to obtain the state value corresponding to the target time;
calculating the reward information after the adjustment action is performed at the target time, wherein, when the time immediately following the target time is the end time of the contract period, the reward information is calculated based on the hedging error of the target financial derivative's hedging portfolio at the time immediately following the target time and the tail risk of the hedging error, the tail risk being calculated according to risk information obtained by inputting the initial state information corresponding to the target financial derivative and/or the state information corresponding to the target time into the tail risk network to be trained;
taking the time immediately following the target time as the new target time, and returning to the step of inputting the state information corresponding to the target time into the policy network to obtain the policy information at the target time, until the time immediately following the target time is the end time of the contract period;
generating, based on the state information, adjustment action, reward information and state value corresponding to each time within the contract period, a first sample corresponding to each time, the first sample including sample data used to calculate the policy loss value corresponding to the policy network and sample data used to calculate the evaluation loss value corresponding to the value function network;
generating, based on the initial state information of the target financial derivative and the hedging error corresponding to the end time of the contract period, a second sample including sample data used to calculate the risk information prediction loss value corresponding to the tail risk network; and
generating the sub-feedback sample set corresponding to the target financial derivative from the first samples and the second sample.
In one embodiment, before the operation of obtaining the trajectory data corresponding to a target financial derivative in the preset environment data set, the processor 1001 may further be configured to call the data processing program stored in the memory 1005 to perform the following operations:
for any one of the financial derivatives that take a preset asset as the underlying asset within a preset time period and have different initial state information, collecting the trajectory data corresponding to the financial derivative based on a preset financial information data interface; and
obtaining the environment data set according to the trajectory data corresponding to the financial derivatives having different initial state information.
In one embodiment, the feedback sample set includes multiple first samples and second samples, the first samples including sample data used to calculate the policy loss value corresponding to the policy network and sample data used to calculate the evaluation loss value corresponding to the value function network, and the second samples including sample data used to calculate the risk information prediction loss value corresponding to the tail risk network;
the operation of using the feedback sample set to perform one round of training on each of the policy network, the value function network and the tail risk network includes:
calculating, using the first samples, the policy loss value corresponding to the policy network and the evaluation loss value corresponding to the value function network;
calculating, according to the policy loss value and the evaluation loss value, the gradient values corresponding to the network parameters in the policy network and the value function network, and updating the network parameters according to the gradient values;
calculating, using the second samples, the risk information prediction loss value corresponding to the tail risk network, calculating, according to the risk information prediction loss value, the gradient values corresponding to the network parameters in the tail risk network, and updating the network parameters according to the gradient values; and
completing one round of training of the policy network, the value function network and the tail risk network after the network parameters in the policy network, the value function network and the tail risk network have each been updated at least once using the feedback sample set.
In one embodiment, the second sample includes the initial state information corresponding to a target financial derivative and the hedging error of the target financial derivative's hedging portfolio at the contract termination time of the target financial derivative;
the operation of calculating, using the second samples, the risk information prediction loss value corresponding to the tail risk network, calculating, according to the risk information prediction loss value, the gradient values corresponding to the network parameters in the tail risk network, and updating the network parameters according to the gradient values includes:
inputting the initial state information in the second sample into the tail risk network to predict the risk information corresponding to the second sample;
substituting the risk information corresponding to the second sample and the hedging error in the second sample into a preset risk information prediction loss function to obtain the risk information prediction loss value corresponding to the second sample; and
calculating, according to the risk information prediction loss values corresponding to the multiple second samples in the feedback sample set, the gradient values corresponding to the network parameters in the tail risk network, and updating the network parameters according to the gradient values.
In one embodiment, the state information corresponding to a financial derivative at a target time includes the amount of cash in the financial derivative's hedging portfolio at the target time, the quantity of each hedge asset, the price of each hedge asset, the price of each underlying asset of the financial derivative, the remaining time to maturity of the financial derivative, the parameters agreed in the financial derivative's contract that affect the price of the financial derivative or the ratios of those parameters to the initial prices of the respective underlying assets of the financial derivative, the hedging error of the hedging portfolio, and the risk-free interest rate at the target time. The state information corresponding to the financial derivative at the target time may further include the volatility of each underlying asset of the financial derivative at that time.
In addition, an embodiment of the present application further provides a computer-readable storage medium on which a data processing program is stored; when the data processing program is executed by a processor, the steps of the data processing method described above are implemented.
For the embodiments of the data processing device and the computer-readable storage medium of the present application, reference may be made to the embodiments of the data processing method of the present application, and details are not repeated here.
It should be noted that, in this document, the terms "comprise", "include" or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or apparatus that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such process, method, article or apparatus. In the absence of further limitations, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article or apparatus that includes the element.
The serial numbers of the above embodiments of the present application are for description only and do not represent the relative merits of the embodiments.
Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, or the part that contributes to the prior art, can be embodied in the form of a software product; the computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk or an optical disc) and includes several instructions for enabling a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the methods described in the embodiments of the present application.
The above are only preferred embodiments of the present application and do not limit the patent scope of the present application. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present application, or any direct or indirect application in other related technical fields, is likewise included within the patent protection scope of the present application.
Claims (15)
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410179678.4 | 2024-02-18 | ||
| CN202410179678.4A CN117764735A (en) | 2024-02-18 | 2024-02-18 | Data processing method, device, equipment and computer readable storage medium |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025171818A1 (en) | 2025-08-21 |
Family
ID=90322284
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2025/077841 Pending WO2025171818A1 (en) | 2024-02-18 | 2025-02-18 | Data processing method, apparatus and device, and computer-readable storage medium |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN117764735A (en) |
| WO (1) | WO2025171818A1 (en) |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN117764735A (en) * | 2024-02-18 | 2024-03-26 | 北京大学深圳研究生院 | Data processing method, device, equipment and computer readable storage medium |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20210081787A1 (en) * | 2019-09-12 | 2021-03-18 | Beijing University Of Posts And Telecommunications | Method and apparatus for task scheduling based on deep reinforcement learning, and device |
| CN113794682A (en) * | 2021-08-06 | 2021-12-14 | 成都墨甲信息科技有限公司 | Industrial Internet of things intrusion detection intelligent agent training method, device and equipment |
| CN115760428A (en) * | 2022-10-31 | 2023-03-07 | 交叉信息核心技术研究院(西安)有限公司 | Decision maker establishing method, system, equipment and medium based on deterministic strategy |
| CN115965879A (en) * | 2022-12-12 | 2023-04-14 | 四川观想科技股份有限公司 | Unmanned training method for incomplete information scene in sparse high-dimensional state |
| CN117764735A (en) * | 2024-02-18 | 2024-03-26 | 北京大学深圳研究生院 | Data processing method, device, equipment and computer readable storage medium |
Also Published As
| Publication number | Publication date |
|---|---|
| CN117764735A (en) | 2024-03-26 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| Parot et al. | Using Artificial Neural Networks to forecast Exchange Rate, including VAR‐VECM residual analysis and prediction linear combination | |
| US12100017B2 (en) | Unified artificial intelligence model for multiple customer value variable prediction | |
| US8650110B2 (en) | Counterfactual testing of finances using financial objects | |
| JP2023171598A (en) | System for optimizing securities trading execution | |
| EP4109377A1 (en) | System, method and apparatus for modeling loan transitions | |
| US11763049B1 (en) | Systems and methods for time series simulation | |
| US11055772B1 (en) | Instant lending decisions | |
| CA3171885A1 (en) | Systems, computer-implemented methods and computer programs for capital management | |
| Guo et al. | Revenue maximizing markets for zero-day exploits | |
| CN112232949B (en) | Method and device for predicting lending risk based on blockchain | |
| JP2024541469A (en) | SYSTEM AND METHOD FOR AUTOMATED STAKING MODEL | |
| CN111951008A (en) | A risk prediction method, apparatus, electronic device and readable storage medium | |
| WO2025171818A1 (en) | Data processing method, apparatus and device, and computer-readable storage medium | |
| US20250156297A1 (en) | Systems and methods for monitoring provider user activity | |
| US12450608B2 (en) | Transaction evaluation based on a machine learning projection of future account status | |
| US20240104645A1 (en) | System, method and apparatus for optimization of financing programs | |
| US20240152584A1 (en) | Authentication data aggregation | |
| CN114912947A (en) | A decision method and model training method based on causal inference | |
| US20240169355A1 (en) | Settlement card having locked-in card specific merchant and rule-based authorization for each transaction | |
| Cao et al. | Estimating price impact via deep reinforcement learning | |
| CN118096229A (en) | Customer resource determination method, model training method, electronic device and storage medium | |
| US20230072534A1 (en) | Proxy system configured to improve message-to-execution ratio of distributed system | |
| Lin et al. | Investigating the robustness and generalizability of deep reinforcement learning based optimal trade execution systems | |
| US20140074753A1 (en) | Adjustment tool for managing and tracking a collection of assets | |
| KR102575858B1 (en) | Apparatus and methods for portfolio management |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 25754542; Country of ref document: EP; Kind code of ref document: A1 |