WO2019198815A1 - Machine learning system, machine learning method, and program - Google Patents
Machine learning system, machine learning method, and program
- Publication number
- WO2019198815A1 (PCT/JP2019/015973; application JP2019015973W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- variable
- dual
- node
- unit
- machine learning
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/20—Ensemble learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/11—Complex mathematical operations for solving equations, e.g. nonlinear equations, general mathematical optimization problems
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/18—Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Definitions
- the present invention relates to a machine learning technique.
- Fig. 1 illustrates a network system to which this consensus building algorithm is applied.
- A large number of sensors A-101 (e.g., microphones, cameras, thermometers) are arranged in a distributed manner.
- the data obtained through the sensor A-101 is accumulated in each of a plurality of node units A-102 (for example, servers and PCs) arranged in a distributed manner.
- A centralized framework requires a central server on which all data must be stored; with a consensus-building algorithm, no central server is needed, and it is sufficient for the data to be accumulated at the distributed node units.
- the data is not limited to data obtained from the sensor A-101, and may be character data obtained from an input device or the Web.
- The node units A-102 of the network system in FIG. 1 are loosely coupled by edges A-103; the network need not be fully connected, as long as the node units A-102 are in a relationship in which some data can be exchanged directly or indirectly.
- There is no restriction on the connection structure of the edges A-103; it may be a tree structure or a ring structure. However, all node units A-102 must be connected directly or indirectly; that is, the connection structure of the edges A-103 must not be partitioned into disconnected components.
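- As a minimal illustration of this connectivity requirement (not part of the patent; the graph and labels below are hypothetical), the following sketch checks that an edge structure is not a divided structure, i.e., that every node unit can reach every other node unit directly or indirectly:

```python
from collections import deque

def is_connected(num_nodes, edges):
    """Return True if the undirected graph (0..num_nodes-1, edges) is connected."""
    adjacency = {v: [] for v in range(num_nodes)}
    for i, j in edges:
        adjacency[i].append(j)
        adjacency[j].append(i)
    seen = {0}
    queue = deque([0])
    while queue:
        v = queue.popleft()
        for nb in adjacency[v]:
            if nb not in seen:
                seen.add(nb)
                queue.append(nb)
    return len(seen) == num_nodes

# A ring of 4 node units is acceptable; two disconnected pairs are not.
print(is_connected(4, [(0, 1), (1, 2), (2, 3), (3, 0)]))  # True
print(is_connected(4, [(0, 1), (2, 3)]))                  # False
```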
- Under the conditions of the network system in FIG. 1, the mapping is optimized by exchanging between nodes the variables generated in the course of learning the mapping, rather than the data itself; this allows the learned mapping to be shared while shortening response time and reducing the amount of transmitted data.
- consensus-type edge computing based on convex optimization technology is described.
- the cost function is limited to a convex function
- consensus-type edge computing technology already exists.
- the first is a method based on the Distributed ADMM method (Alternating Direction Method of Multipliers) (for example, see Non-Patent Document 1)
- The second is a method based on the PDMM method (Primal-Dual Method of Multipliers) (for example, see Non-Patent Document 2).
- A DNN has a non-convex cost function and therefore does not fit into this framework.
- <Consensus-building algorithm of the Distributed ADMM method> The consensus-building algorithm of the Distributed ADMM method is described below.
- the cost function is limited to the closed true convex function (Closed Proper Convex function).
- the Distributed ADMM-based consensus building algorithm tries to optimize the whole while updating the main and dual variables.
- A cost function is formed in which the main variables are shared between nodes and consensus is built so that they match.
- FIG. 2 shows a situation where all three types of nodes are connected.
- The variables consist of three main variables w_1, w_2, w_3, six kinds of dual variables λ_i|j, and six main variables w_1^<1>, w_1^<2>, w_2^<1>, w_2^<2>, w_3^<1>, w_3^<2> exchanged between the nodes.
- Let λ_i|j denote the dual variable for the j-th node held at the i-th node.
- As an example of the mapping, the least-squares problem with teacher data is treated.
- s is teacher data.
- x is input data (sometimes referred to as "observation data").
- w is a weight vector.
- λ is a dual variable. The dimension of the dual variable λ is assumed to be the same as the dimension of the weight vector w.
- The meaning of these symbols with subscripts is the same as that of the symbols without subscripts.
- For example, s_i denotes teacher data, just as s does.
- Let Ω_B = {1, ..., B}, where B is a predetermined positive integer (B ≥ 1).
- the teacher data s i has the following configuration.
- x_i = [x_i,1, ..., x_i,B]^T
- w_i = [w_i,1, ..., w_i,B]^T
- λ_i|j = [λ_i|j,1, ..., λ_i|j,B]^T
- the cost for optimizing the mapping can be expressed by the following true convex closed function f i .
- f i be the cost function at the i-th node, and search for w that minimizes this.
- The hat over s_i,b indicates an estimate obtained using the w_i,b obtained after or during learning.
- The subscript p denotes the L_p norm.
- w is called the main variable.
- w_j,b^<i> denotes the main variable of the j-th node accumulated at the i-th node. As described later, w_j,b^<i> is updated appropriately.
- This expression means that the main variable w is optimized so as to minimize the cost function under the condition that the dual variable λ is maximized (the intent being that the term multiplied by λ becomes 0 when the main variables of connected nodes match).
- The condition that the main variable w matches between connected nodes can be expressed, for example, as a relation in which multiplying by the matrix A and summing yields 0.
- The matrix A can be described using I, −I, and 0 as follows, where I is the identity matrix and 0 is the zero matrix.
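- As an illustrative sketch of the consensus condition just described (an assumed block layout, not necessarily the patent's exact matrix), the following builds a matrix A from identity blocks I, −I, and zero blocks so that A w = 0 holds exactly when the main variables of connected nodes match:

```python
import numpy as np

def consensus_matrix(num_nodes, edges, dim):
    """One block row per edge (i, j): +I in block i, -I in block j, 0 elsewhere."""
    A = np.zeros((len(edges) * dim, num_nodes * dim))
    I = np.eye(dim)
    for row, (i, j) in enumerate(edges):
        A[row * dim:(row + 1) * dim, i * dim:(i + 1) * dim] = I
        A[row * dim:(row + 1) * dim, j * dim:(j + 1) * dim] = -I
    return A

edges = [(0, 1), (1, 2), (0, 2)]              # FIG. 2-like fully connected 3 nodes
A = consensus_matrix(3, edges, dim=2)
w_equal = np.tile(np.array([1.0, -2.0]), 3)   # identical main variables at all nodes
print(np.allclose(A @ w_equal, 0))            # True: consensus constraint satisfied
```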
- As shown in Non-Patent Document 3, pp. 154-155, if the cost function f is a proper convex function, the optimal solution of the original problem of "minimizing over the main variable while maximizing over the dual variable" coincides with the optimal solution of the problem of "maximizing over the dual variable while minimizing over the main variable".
- The augmented Lagrangian method exists as a method for solving this problem (for example, see Non-Patent Document 2). This method adds penalty terms to stabilize the optimization process.
- ⁇ i is a predetermined positive number.
- w * refers to the point optimized with respect to w.
- t represents the index of the number of updates.
- The (t+1)-th update of the dual variable is performed as follows.
- T is the number of repetitions, and is a predetermined positive integer.
- The symbol T represents either the number of repetitions or the transposition of a matrix or vector, depending on context.
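- The following is a hedged sketch of alternating main-variable and dual-variable updates with a penalty term for the per-node least-squares cost; it follows a textbook global-consensus ADMM iteration and is not claimed to reproduce the Distributed ADMM of Non-Patent Document 1 (the penalty coefficient rho and the synthetic data are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, rho, V = 3, 1.0, 4
w_true = rng.normal(size=dim)
X = [rng.normal(size=(20, dim)) for _ in range(V)]          # per-node input data
s = [Xi @ w_true + 0.01 * rng.normal(size=20) for Xi in X]  # per-node teacher data

w = [np.zeros(dim) for _ in range(V)]   # main variables
u = [np.zeros(dim) for _ in range(V)]   # scaled dual variables
z = np.zeros(dim)                        # consensus variable

for t in range(100):
    # main-variable update: closed form of argmin f_i(w) + (rho/2)||w - z + u_i||^2
    w = [np.linalg.solve(X[i].T @ X[i] + rho * np.eye(dim),
                         X[i].T @ s[i] + rho * (z - u[i])) for i in range(V)]
    z = np.mean([w[i] + u[i] for i in range(V)], axis=0)     # consensus update
    u = [u[i] + w[i] - z for i in range(V)]                  # dual-variable update

print(np.round(z, 3), np.round(w_true, 3))  # the consensus estimate approaches w_true
```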
- the disadvantages of the consensus building algorithm of the Distributed ADMM method are the following (1) to (4).
- (1) The convergence rate is low.
- (4) The transmission volume is somewhat high; that is, when there is no statistical bias in the main-variable data, the amount of data to be transmitted tends to be large.
- <PDMM-based consensus-building algorithm> Next, the consensus-building algorithm of the PDMM method will be described.
- a major difference from Distributed ADMM is that it shares not only main variables but also dual variables. As will be described later, a consensus is formed on a pair of dual variables.
- The dual variables in FIG. 2 form three pairs: (λ_1|2, λ_2|1), (λ_2|3, λ_3|2), and (λ_1|3, λ_3|1).
- the dual variable physically corresponds to the difference between the main variables at the time of each update.
- the cost function is limited to a closed true convex function.
- the cost is as follows.
- The variables of the V nodes are stacked and expressed jointly.
- Since the convex conjugate f* is known to be a closed proper convex function when f is a closed proper convex function, the problem is written in this form (see, for example, Non-Patent Document 3).
- The indicator function is also a kind of convex function.
- The matrix P is the operation (a permutation P) of exchanging each pair of dual variables between the i-th and j-th nodes, as shown below; the indicator function outputs 0 when the paired dual variables match (meaning a lower cost).
- the problem to be solved consists of two types of convex functions.
- As a technique for solving such a problem, there is the operator splitting method (see, for example, Non-Patent Document 4). This is a technique that solves a problem that is difficult to optimize globally at once by optimizing its parts alternately.
- The PDMM-based consensus-building algorithm using P-R (Peaceman-Rachford) splitting is as follows.
- The exchange of the main variable between nodes may be intermittent (it need not occur at every update) and may be executed independently (asynchronously) for each edge.
- the disadvantages of this PDMM-based consensus building algorithm are (1) to (4) below.
- (1) In some cases, when the statistical properties of the data sets stored at the nodes differ, non-convergence was observed; there is no theoretical guarantee of convergence for heterogeneous data sets.
- (2) In principle, it cannot be applied to non-convex function optimization (including DNN).
- (3) Because main variables are exchanged, the confidentiality of information is somewhat low; since the main variable may deeply reflect the characteristics of the data, the statistical characteristics of the data could be obtained by intercepting the main variable. (4) The transmission volume is high; since both the main variable and the dual variable are sent, the amount of transmitted data tends to be large.
- w ∈ R^m (m is an integer of 1 or more).
- Although the input of G_i is taken here to be a vector, it may be a matrix or a tensor; since a matrix or tensor can be expressed as a vector, the input of G_i is hereafter treated as a vector.
- The optimal solution w* of this problem (also called the stationary point of w) is obtained where the subdifferential of the cost G contains the zero vector.
- ∂G_i denotes the subdifferential of G_i.
- N (i) represents the set of indexes of the node group connected to the i-th node.
- The problem of consensus building in edge computing is the problem of global optimization performed by exchanging latent variables and their auxiliary variables (for example, dual variables) between nodes, instead of exchanging the huge amount of data stored in each node.
- the cost expressed by the convex function at the i-th node is expressed as follows.
- T represents a transpose
- The subscript p denotes the L_p norm.
- Ω_U(i) = {1, ..., U(i)}.
- This cost F_1,i is a cost function for the problem of optimizing the latent variable p_i (p_i ∈ R^p′) so that, given the U(i) data pairs (v_i,u, s_i,u) consisting of a p′-dimensional input vector v_i,u ∈ R^p′ and teacher data s_i,u, the teacher data s_i,u is output.
- The indices indicate that the data and the number U(i) differ from node to node; in general, the data stored at each node is different, but it may be the same.
- a set of input data and teacher data is referred to as learning data.
- m = p′V.
- parallel computing by multiple servers in the same location can be formulated as the same problem as the consensus building problem. For example, it corresponds to a case where models such as speech recognition and image recognition are learned at high speed by performing parallel calculation using a large number of computers.
- the data set stored in each server is basically the same.
- The stationary point that minimizes the cost is searched for while constraining the value of the latent variable p to be the same between nodes.
- A_i|j is a p′ × p′ real matrix representing consensus formation; any matrix can be used as this matrix A_i|j.
- I is the identity matrix
- Lagrange's undetermined multiplier method is often used.
- the Lagrangian function L is defined as follows.
- F 1, i * represents a convex conjugate function of the convex function F 1, i .
- F 1, i is a convex function
- F 1, i * is also a convex function
- Equation (2-4) represents the problem of finding the dual variable ⁇ i, j that minimizes the convex function F 1, i * .
- Equation (2-4) should be solved; however, in the form of Equation (2-4), updating the latent variables asynchronously at each node cannot be realized. Therefore, a lifting process is performed so that the dual variable λ_i,j belongs to each node; specifically, the dual variable λ_i,j is duplicated into a copy held at each of the two nodes.
- FIG. 10A and 10B are diagrams illustrating a state of lifting processing of dual variables.
- FIG. 10A shows that the variable ⁇ i, j controls the error in consensus building between node i and node j.
- The i-th latent variable and the j-th latent variable would have to be updated while being synchronized. Therefore, in order to update the i-th and j-th latent variables even asynchronously, the non-directional edge in FIG. 10A is rewritten as two directional edges as shown in FIG. 10B. Since these two edges are generated from one non-directional edge, a matching restriction between them is necessary, but this resolves the synchronization problem in the variable update.
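- A minimal sketch of this lifting step is shown below (the dictionary layout is an illustrative assumption, not the patent's data model): each non-directional edge is replaced by two directional edges, each carrying its own dual variable, with a matching constraint to be imposed between the pair:

```python
import numpy as np

def lift_edges(undirected_edges, dim):
    """Return one dual variable per directed edge, initialised to zero."""
    duals = {}
    for i, j in undirected_edges:
        duals[(i, j)] = np.zeros(dim)   # copy held at node i (lambda_{i|j})
        duals[(j, i)] = np.zeros(dim)   # copy held at node j (lambda_{j|i})
    return duals

duals = lift_edges([(0, 1), (1, 2)], dim=3)
print(sorted(duals.keys()))   # [(0, 1), (1, 0), (1, 2), (2, 1)]
# The matching restriction asks that duals[(i, j)] and duals[(j, i)] agree.
```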
- Expression (2-5) is rewritten as follows by stacking the variable group over the V nodes into vectors or matrices. Of course, the meaning of the formula does not change; it represents the same problem.
- The variable group consists of p_i and λ_i|j for all nodes.
- The indicator function F_2 is defined as follows.
- Applying the matrix P to the vector λ is the operation of swapping the dual variables λ_i|j and λ_j|i across each edge.
- p ∈ R^m is a latent variable.
- The dual variable is likewise an element of R^m.
- A = [a_1, ..., a_n]^T ∈ R^{n×m} represents input data.
- For F_2, the L_1 norm may be used as a regularization term.
- F 1 and F 2 are not limited to this.
- s is teacher data of desired output information.
- The P-R (Peaceman-Rachford) type monotone operator splitting and the D-R (Douglas-Rachford) type monotone operator splitting are expressed by Equation (3-1) and Equation (3-2), respectively.
- From Equations (3-1) and (3-2), it can be seen that the D-R type monotone operator splitting is obtained by applying an averaging operator to the P-R type monotone operator splitting.
- The variable update algorithm, a recursive variable update rule obtained from Equations (3-1) and (3-2), will be described using the consensus-building problem in edge computing as an example.
- This consensus building problem is expressed by replacing ⁇ in the following equation (2-7) with w.
- The resolvent operator and the Cayley operator of F_2 are as follows.
- As shown in Non-Patent Document 2, when the above indicator function F_2 is used, the Cayley operator of F_2 corresponds to the permutation matrix P.
- Equation (3-5) corresponds to the P-R type monotone operator splitting of Equation (3-1).
- Equation (3-6) corresponds to the D-R type monotone operator splitting of Equation (3-2).
- t is a variable representing the number of updates.
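- The following sketch illustrates the relation between Equations (3-1) and (3-2) on two simple convex costs (an illustrative choice, not the patent's F_1* and F_2): the P-R iteration applies the two Cayley operators in turn, and the D-R iteration is obtained by averaging that step with the identity:

```python
import numpy as np

a, b = np.array([3.0, -1.0]), np.array([1.0, 5.0])

def prox_G1(z):                       # argmin_x ||x - a||^2 + 0.5*||x - z||^2
    return (2.0 * a + z) / 3.0

def prox_G2(z):                       # argmin_x ||x - b||_1 + 0.5*||x - z||^2
    return b + np.sign(z - b) * np.maximum(np.abs(z - b) - 1.0, 0.0)

def cayley(prox, z):                  # Cayley operator C = 2R - I
    return 2.0 * prox(z) - z

def iterate(steps, theta):
    """theta = 1.0 reproduces the P-R form; theta = 0.5 the D-R (averaged) form."""
    z = np.zeros(2)
    for _ in range(steps):
        z = (1.0 - theta) * z + theta * cayley(prox_G2, cayley(prox_G1, z))
    return prox_G1(z)                 # recover the primal point from the fixed point

print(np.round(iterate(200, theta=1.0), 3))   # P-R
print(np.round(iterate(200, theta=0.5), 3))   # D-R
# Both approach the minimiser of G1 + G2, which is [2.5, -0.5] for this data.
```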
- Equation (3-3) includes updating the latent variable p and the dual variable w. As a method for updating these variables, a method described in Non-Patent Document 2 will be described. Equation (3-3) is transformed as follows.
- F 1 * includes the optimization calculation of the latent variable p
- The latent variable p is optimized first, and after fixing p to the optimized value, the dual variable w is derived.
- When the latent variable p is optimized sequentially (that is, p^{t+1} is calculated from p^t), a penalty term for p is added to the cost included in F_1*.
- The third term inside argmin on the right side of Equation (3-8) is a penalty term, whose coefficient is greater than 0.
- By introducing the penalty term, sequential optimization of the latent variable p becomes possible.
- equation (3-4) can be calculated as follows.
- ⁇ can be calculated without using w.
- Non-Patent Documents 1 and 2 described above cannot perform machine learning when the cost function is a non-convex function. This is the second problem.
- The conventional variable update rule is constructed based on monotone operator splitting using a resolvent operator or a Cayley operator.
- This conventional variable update rule has the problem that, in some cases, it takes a long time to converge to the optimal solution; in other words, learning the latent variables takes time. This is the third problem.
- an object of the present invention is to provide a machine learning technique capable of performing machine learning even when the cost function is not a convex function.
- One aspect of the present invention includes a plurality of node units that machine-learn a mapping using a common main variable based on their respective input data while transmitting and receiving information to and from each other; instead of the non-convex cost function that originally corresponds to the machine learning, machine learning is performed so as to minimize a surrogate convex function that is an upper bound of that cost function.
- The surrogate convex function is expressed using a first-order gradient with respect to the main variable of the cost function, or using first-order and second-order gradients with respect to the main variable of the cost function.
- By using a surrogate convex function that is an upper bound of the cost function instead of the cost function itself, machine learning can be performed even when the cost function is not a convex function.
- FIG. 2 is a block diagram showing a configuration of a latent variable learning device 100.
- FIG. 3 is a flowchart showing the operation of the latent variable learning device 100.
- 3 is a block diagram illustrating a configuration of a model learning unit 120.
- FIG. 3 is a flowchart showing the operation of a model learning unit 120.
- FIG. 2 is a block diagram showing a configuration of a latent variable learning system 20.
- FIG. 2 is a block diagram showing a configuration of a latent variable learning device 200.
- FIG. 3 is a block diagram illustrating a configuration of a model learning unit 220.
- FIG. 5 is a flowchart showing an operation of a model learning unit 220.
- 3 is a block diagram illustrating a configuration of a model learning unit 230.
- FIG. 5 is a flowchart showing the operation of a model learning unit 230.
- The set of node units connected to node unit i is defined as Ω_N(i).
- The dual variable for node unit j held at node unit i is denoted λ_i|j, and its b-th element is denoted λ_i|j,b.
- Let f_i be the cost function corresponding to node unit i used in machine learning.
- z_i|j is introduced as an auxiliary variable of the dual variable λ_i|j, and its b-th element is denoted z_i|j,b; the dimensions of z_i|j are the same as those of λ_i|j.
- y_i|j is introduced as a further auxiliary variable of the dual variable λ_i|j, and its b-th element is denoted y_i|j,b.
- T is the total number of updates, and is a positive integer, but is not a constant when online learning is assumed.
- The variable group consists of the main variable, the dual variable, and the dual auxiliary variables.
- The symbol "←" does not represent set membership; it means that the calculation result on the right-hand side is assigned to the left-hand side.
- P-R splitting is an optimization method for which fast convergence is guaranteed.
- The optimization z_b ← C_2 C_1 z_b by P-R splitting is expressed as follows.
- A proof that the second Cayley operator C_2 corresponds to the permutation matrix P is given in Non-Patent Document 2. The update of the above dual variable using the resolvent operator can be transformed as follows.
- f* can then be expressed as a function of the dual variable.
- Q = X X^T.
- "Node j ← Node i (·)" means that the i-th node unit transmits the information "·" to the node unit j.
- When step (5) is replaced with the following calculation, the method becomes Douglas-Rachford (D-R) splitting, which trades somewhat slower convergence for high stability of the convergence process. Usually 1/2 is used for the averaging coefficient.
- The above updates of the main variable and the dual variable are written for a general form of the cost function.
- For this cost function, the consensus-building algorithm is as follows.
- When step (5) is replaced with the following calculation, the method likewise becomes Douglas-Rachford (D-R) splitting, which trades somewhat slower convergence for high stability of the convergence process. Usually 1/2 is used for the averaging coefficient.
- the dual auxiliary variable y is exchanged between the nodes, but the nature of the variable can be said to be equivalent to the dual variable ⁇ .
- the dual variable is calculated as follows.
- The main variable w often reflects the statistical properties of the data, whereas the statistical properties of the dual variables depend on the data only after a Legendre transformation. Since the Legendre transform cannot be inverted back to the main variable without knowing the structure of the function, transmitting a dual variable or its auxiliary variable between nodes constitutes highly confidential data communication.
- The machine learning system comprises a plurality of node units 1, ..., V; a plurality of sensor units 1S1, ..., 1Sγ, 2S1, ..., 2Sγ, ..., VS1, ..., VSγ; the Internet 0N; and LANs 1N, ..., VN.
- Each of the plurality of node units 1,..., V is an information processing device such as a server or a PC.
- each of the node units 1,..., V is provided with a learning unit 10 and a transmission / reception unit 11, and a storage unit 12 is provided in the learning unit 10. It is assumed that the storage unit 12 stores data necessary for processing and calculation described below.
- The sensor units 1S1, ..., 1Sγ, 2S1, ..., 2Sγ, ..., VS1, ..., VSγ are a large number of sensors (e.g., microphones, cameras, thermometers) arranged in a distributed manner in a variety of environments (e.g., homes, factories, fields, and buildings).
- The sensor units 1S1, ..., 1Sγ are the sensors corresponding to node unit 1; that is, they acquire the data used for machine learning in the learning unit 10 of node unit 1, and each of them is connected to node unit 1 via the LAN 1N. The same applies to the other sensor units.
- The data obtained through the sensor units 1S1, ..., 1Sγ, 2S1, ..., 2Sγ, ..., VS1, ..., VSγ are accumulated in the plurality of node units 1, ..., V (e.g., servers or PCs).
- Although the sensor units are here connected to the node units 1, ..., V via the LANs 1N, ..., VN, the sensor units 1S1, ..., VSγ may instead be connected to the node units 1, ..., V via the Internet 0N, or by other connecting means; in short, it is sufficient that the data obtained by the sensor units 1S1, ..., VSγ can be used by the node units 1, ..., V.
- the node parts 1,..., V are connected via the Internet 0 N.
- The node units 1, ..., V need not be fully connected; it suffices that they are in a relationship in which some data can be exchanged directly or indirectly.
- There is no restriction on the edge connection structure; it may be a tree structure or a ring structure. In short, all the node units 1, ..., V must be connected directly or indirectly; that is, the edge connection structure must not be partitioned into disconnected components.
- The plurality of node units 1, ..., V perform the machine learning process described below while receiving data at the transmission/reception unit 11, transmitting data from the transmission/reception unit 11, reading data from the storage unit 12, and writing data to the storage unit 12.
- the machine learning process is realized, for example, when the plurality of node units 1,..., V perform the processes from step S1 to step S8 in FIG.
- The node unit i (more specifically, the learning unit 10 of the node unit i) uses the data obtained by the sensor units corresponding to the node unit i to update the dual variable based on the following equation (step S2). That is, the node unit i calculates λ_i|j,b^(t+1) defined by the following expression for each i ∈ Ω_V, j ∈ Ω_N(i), b ∈ Ω_B.
- The node unit i (more specifically, the learning unit 10 of the node unit i) updates the main variable based on the following equation (step S3). That is, the node unit i calculates w_i,b^(t+1) defined by the following expression for each i ∈ Ω_V, b ∈ Ω_B.
- the process of step S3 is performed when a main variable is required. In other words, the process of step S3 may not be performed every time.
- The node unit i (more specifically, the learning unit 10 of the node unit i) updates the dual auxiliary variable based on the following equation (step S4). That is, the node unit i calculates y_i|j,b^(t+1) defined by the following expression for each i ∈ Ω_V, j ∈ Ω_N(i), b ∈ Ω_B.
- The process of step S4 need not be performed except when the dual auxiliary variable y_i|j,b^(t+1) is to be transmitted to another node unit in step S5.
- The node unit i (more specifically, the transmission/reception unit 11 of the node unit i) transmits the dual auxiliary variable y_i|j,b^(t+1) to the node unit j (step S5).
- That is, the node unit i transmits the dual auxiliary variable y_i|j,b^(t+1) to each node unit j ∈ Ω_N(i) connected to it.
- The transmission/reception unit 11 of the node unit i operates so as to input the dual auxiliary variable received from the node unit j to the learning unit 10 of the node unit i as the dual auxiliary variable y_j|i,b^(t+1).
- The process of step S5 need not be performed every time and may be asynchronous between nodes; for example, all the node units i may transmit the dual auxiliary variable y_i|j,b^(t+1) only when t is an even number.
- It suffices that at least one node unit i and at least one node unit j exchange the dual auxiliary variables y_i|j,b^(t+1) and y_j|i,b^(t+1).
- The node unit i (more specifically, the learning unit 10 of the node unit i) updates the dual auxiliary variable based on the following equation (step S6). That is, for each i ∈ Ω_V, j ∈ Ω_N(i), b ∈ Ω_B, the node unit i sets z_i|j,b^(t+1) ← y_j|i,b^(t+1).
- In the machine learning system including a plurality of node units that each store input data and machine-learn a mapping using the main variable w based on the teacher data and the input data while transmitting and receiving information to and from each other, the plurality of node units i perform machine learning such that if the dual variable used in the machine learning is optimized, the main variable is also optimized, and they transmit and receive not the main variable but the dual variable. Since the main variable is thus not transmitted between the node units, machine learning with higher secrecy than before can be performed.
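- The sketch below mirrors the message-passing structure of steps S2 to S6 for illustrative per-node costs f_i(w) = 0.5·||w − a_i||², using a Douglas-Rachford-style consensus iteration in which only per-edge auxiliary variables (playing the role of the dual auxiliary variables) cross the network; it is a hedged stand-in and does not reproduce the patent's exact update equations:

```python
import numpy as np

rng = np.random.default_rng(1)
edges = [(0, 1), (1, 2), (0, 2)]                   # three node units, fully connected
neighbors = {i: [] for i in range(3)}
for i, j in edges:
    neighbors[i].append(j)
    neighbors[j].append(i)
a = [rng.normal(size=2) for _ in range(3)]         # stands in for each node's local data

# one auxiliary variable per directed edge (i|j), held at node i
z = {(i, j): np.zeros(2) for i in neighbors for j in neighbors[i]}

def local_prox(z):
    """Each node i minimises f_i(w) + 0.5 * sum_j ||w - z_{i|j}||^2 locally."""
    w, out = {}, {}
    for i, nbrs in neighbors.items():
        w[i] = (a[i] + sum(z[(i, j)] for j in nbrs)) / (1 + len(nbrs))
        for j in nbrs:
            out[(i, j)] = w[i]
    return w, out

theta = 0.5                                         # 1/2: D-R; 1.0 would be P-R
for t in range(200):
    w, p = local_prox(z)                            # local update (no raw data leaves the node)
    x = {k: 2 * p[k] - z[k] for k in z}             # reflection (Cayley-type step) of the local block
    swapped = {(i, j): x[(j, i)] for (i, j) in x}   # exchange: node i receives x_{j|i} from node j
    z = {k: (1 - theta) * z[k] + theta * swapped[k] for k in z}

print({i: np.round(w[i], 3) for i in w})
print(np.round(np.mean(a, axis=0), 3))              # all w_i approach the consensus minimiser
```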
- <Modifications> [Modification 1] When the following cost function is used instead of the general cost function,
- step S2 the node unit i updates the dual variable based on the following equation.
- step S3 the node unit i updates the main variable based on the following equation.
- In step S5, the node unit i may transmit and receive the dual variable λ_i|j,b^(t+1) instead of the dual auxiliary variable.
- In that case, step S6 is not performed; the steps other than step S6 are performed in the same manner as described above. [Modification 4] Below, a technique for reducing the transmission volume is described.
- the update difference approaches 0 as the optimization process converges. If irreversible compression processing is performed using this property, the transmission capacity should be further reduced.
- v_i|j,b^(t+1) is the update difference information.
- v_i|j,b^(t+1) tends to be biased toward values close to 0 as the optimization converges.
- y_i|j,b^(t+1) is the information that is originally desired to be exchanged between the nodes.
- m_i|j,b^(export) is reference information that has already been obtained.
- each element constituting v is compared with the threshold value thr, and if it is equal to or less than the threshold value, processing is performed to make it zero.
- thr is a predetermined number larger than 0 and is a very small number such as 1 × 10^−5.
- u is generated by assigning symbols according to appearance probability; for example, u is generated by assigning a short symbol to a value with high appearance probability.
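- A hedged sketch of this transmission-volume reduction is shown below; the run-length coding of zeros is an illustrative stand-in for the flooring-based or entropy-coding-based compression mentioned later, and thr is chosen as in the text:

```python
import numpy as np

def floor_small(v, thr=1e-5):
    """Zero out elements whose magnitude is at or below the threshold thr."""
    out = v.copy()
    out[np.abs(out) <= thr] = 0.0
    return out

def encode(v):
    """Run-length encode zeros: ('Z', count) for zero runs, ('V', value) otherwise."""
    symbols, zero_run = [], 0
    for x in v:
        if x == 0.0:
            zero_run += 1
        else:
            if zero_run:
                symbols.append(("Z", zero_run))
                zero_run = 0
            symbols.append(("V", float(x)))
    if zero_run:
        symbols.append(("Z", zero_run))
    return symbols

v = np.array([0.3, 2e-6, -1e-7, 0.0, 0.0, -0.12, 4e-6])
print(encode(floor_small(v)))   # [('V', 0.3), ('Z', 4), ('V', -0.12), ('Z', 1)]
```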
- the node unit i may perform the processing from step S9 to step S16 described below instead of the processing from step S5 to step S6. Since other processes are the same as those in the above-described embodiment, a duplicate description is omitted here.
- FIG. 8 shows a flowchart of an example of a machine learning method in which transmission capacity is reduced.
- The node unit i (more specifically, the learning unit 10 of the node unit i) updates the reference data based on the following equation (step S10). That is, the node unit i updates the reference data for each i ∈ Ω_V, j ∈ Ω_N(i), b ∈ Ω_B based on the following formula (step S10).
- The node unit i (more specifically, the transmission/reception unit 11 of the node unit i) transmits m_i|j,b to the node unit j (step S11).
- That is, the transmission/reception unit 11 of the node unit i transmits m_i|j,b to each node unit j ∈ Ω_N(i).
- The processing shown in the following equation need not be performed for all node units i and j; it may be performed only for some node units i and some node units j.
- It suffices that at least one node unit i transmits m_i|j,b to at least one node unit j.
- The node unit i (more specifically, the learning unit 10 of the node unit i), for each i ∈ Ω_V, j ∈ Ω_N(i), b ∈ Ω_B, updates the corresponding (t+1)-th value using the information obtained by this exchange, based on the following equation (step S12).
- The node unit i calculates an update difference based on the following equation (step S13). That is, the node unit i calculates v_i|j,b^(t+1).
- The node unit i (more specifically, the learning unit 10 of the node unit i) performs data compression for each i ∈ Ω_V, j ∈ Ω_N(i), b ∈ Ω_B based on the following equation to generate a code u_i|j,b^(t+1) (step S14).
- This data compression may be irreversible.
- C means a predetermined compression process such as a flooring-based compression process or an entropy coding-based compression process.
- The node unit i (more specifically, the transmission/reception unit 11 of the node unit i) transmits u_i|j,b^(t+1) to the node unit j (step S15).
- That is, the node unit i transmits u_i|j,b^(t+1) to each node unit j ∈ Ω_N(i) connected to it.
- The transmission/reception unit 11 operates to input the code u_j|i,b^(t+1) received from the node unit j to the learning unit 10 of the node unit i.
- The processing shown in the following equation need not be performed for all node units i and j; it may be performed only for some node units i and some node units j.
- It suffices that at least one node unit i transmits the code u_i|j,b^(t+1) to at least one node unit j and receives the code u_j|i,b^(t+1) output by that node unit j (step S15).
- The node unit i (more specifically, the learning unit 10 of the node unit i), for each i ∈ Ω_V, j ∈ Ω_N(i), b ∈ Ω_B, updates the dual auxiliary variable z_i|j,b^(t+1) based on the following formulas (step S16).
- Instead, the node unit i may update on a D-R-splitting basis as exemplified below; usually 1/2 is used for the averaging coefficient.
- Alternatively, the node unit i may perform the D-R-splitting-based update exemplified below; usually 1/2 is used for the averaging coefficient.
- the machine learning system may be realized by a single computer.
- the processing contents of the functions that each unit of the machine learning system should have are described by a program.
- By executing this program on a computer, the processing of the machine learning system is realized on the computer.
- the program describing the processing contents can be recorded on a computer-readable recording medium.
- a computer-readable recording medium for example, any recording medium such as a magnetic recording device, an optical disk, a magneto-optical recording medium, and a semiconductor memory may be used.
- each unit may be configured by executing a predetermined program on a computer, or at least a part of these processing may be realized by hardware.
- The set of node units connected to node unit i is defined as Ω_N(i).
- The dual variable for node unit j held at node unit i is denoted λ_i|j, and its b-th element is denoted λ_i|j,b.
- z_i|j is introduced as an auxiliary variable of the dual variable λ_i|j, and its b-th element is denoted z_i|j,b; the dimensions of z_i|j are the same as those of λ_i|j.
- y_i|j is introduced as a further auxiliary variable of the dual variable λ_i|j, and its b-th element is denoted y_i|j,b.
- The b-th element of the main variable w_i of the node unit i is denoted w_i,b, and its value after the t-th update is denoted w_i,b^(t).
- Let f_i be the cost function corresponding to the node unit i used in machine learning.
- Let ∇f_i(w_i,b^(t)) be the first-order gradient of the cost function f_i with respect to w_i,b^(t).
- T is the total number of updates and is a positive integer; however, it is not a constant when online learning is assumed.
- Let γ be a positive number.
- f_i is then called a 1/γ-smooth function.
- x and y are arbitrary vectors.
- ⟨x, y⟩ represents the inner product of x and y.
- The smaller γ is, the larger the gradient of the convex function on the right side becomes (the function becomes sharper), which satisfies the condition that it is an upper bound of the cost. For this reason, γ may be a small number; however, if γ is too small, the update step is small, so it takes time to reach the optimal solution.
- the solid line represents the original cost function f i .
- the cost function f i is written as a non-convex function.
- the alternate long and short dash line represents the surrogate convex function g i that is the upper bound.
- f_i(w_i,b^(t)) is the function output at update time t and can be calculated.
- ∇f_i(w_i,b^(t)) is the gradient of the function at update time t and, when a DNN is assumed, can be calculated within the framework of stochastic optimization.
- The surrogate function may be other than the above.
- The surrogate convex function g_i described above represents an upper bound of the original cost and has the advantage that only the first-order gradient of the cost function needs to be obtained. It is also an advantage that it can be applied to non-convex functions.
- the following proxy function may be used.
- the underlined part of the following formula is the surrogate function.
- The concept of the surrogate convex function is a convex function that does not contradict the minimization of the original cost function and can be constructed with the idea of upper bounds and approximation. Specifically, it is a convex function that can be calculated using the first-order, and possibly the second-order, gradient with respect to the main variable of the original cost function.
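- The following sketch constructs a first-order surrogate g(w) = f(w_t) + ∇f(w_t)(w − w_t) + (1/(2γ))·||w − w_t||² for an illustrative non-convex cost whose curvature is bounded by 1/γ, checks numerically that it is an upper bound, and shows that repeatedly minimizing the surrogate decreases the cost (the cost function and γ below are assumptions for illustration):

```python
import numpy as np

def f(w):  return np.cos(w) + 0.05 * w**2          # illustrative non-convex cost
def df(w): return -np.sin(w) + 0.1 * w             # its first-order gradient

gamma = 1.0 / 1.5    # 1/gamma = 1.5 exceeds the curvature bound |f''| <= 1.1

def surrogate(w, w_t):
    """Upper-bounding convex quadratic built from f(w_t) and the gradient at w_t."""
    return f(w_t) + df(w_t) * (w - w_t) + (w - w_t) ** 2 / (2.0 * gamma)

w_t = 3.0
grid = np.linspace(-6, 6, 2001)
print(np.all(surrogate(grid, w_t) >= f(grid) - 1e-12))   # True: g upper-bounds f

# Minimising the surrogate at each update is a gradient step of size gamma,
# and the cost is non-increasing (majorise-minimise behaviour).
for t in range(30):
    w_t = w_t - gamma * df(w_t)       # argmin_w surrogate(w, w_t)
print(round(w_t, 3), round(f(w_t), 3))
```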
- the optimal solution of w is derived by solving the convex conjugate function g * of the function g.
- This process can be written separately for each node.
- The symbol "←" does not represent set membership; it means that the calculation result on the right-hand side is assigned to the left-hand side.
- Quadratic PDMM: when the computation is split into operations for each node, a variable optimization algorithm using the surrogate convex function on a P-R splitting basis is obtained. This algorithm is also called Quadratic PDMM.
- the Quadratic PDMM is as follows.
- When step (5) is replaced with the following calculation, the method becomes Douglas-Rachford (D-R) splitting, which trades somewhat slower convergence for high stability of the convergence process. Usually 1/2 is used for the averaging coefficient.
- Since the surrogate convex function changes at every update time, the secrecy can be said to be high.
- g denotes a 1/γ-strongly convex function.
- The difference between g and the quadratic term with coefficient 1/(2γ) being a convex function is equivalent to g being a 1/γ-strongly convex function (see, for example, Section 2.2.5 of Non-Patent Document 3).
- Here that difference is a convex function, so g is a 1/γ-strongly convex function.
- g* is a γ-smooth function when g is a 1/γ-strongly convex function; the following properties then hold.
- The proximity mapping (proximal mapping) is defined as follows.
- G is composed of g*, which is a γ-smooth function, and an indicator function with very strong convexity. Since the indicator function is included in G, G can be said to be a strongly convex function. Assuming G is a strongly convex function,
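- As a small illustration of the proximity mapping defined above (the specific functions are illustrative choices, e.g., the L_1 norm mentioned earlier as a possible regularization term), the following shows two closed-form proximal mappings:

```python
import numpy as np

def prox_l1(v, tau):
    """Proximal mapping of tau*||.||_1: the soft-thresholding operation."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def prox_quadratic(v, a, tau):
    """Proximal mapping of tau*0.5*||w - a||^2 (closed form)."""
    return (v + tau * a) / (1.0 + tau)

v = np.array([1.5, -0.2, 0.05, -3.0])
print(prox_l1(v, 0.5))                     # [ 1.  -0.   0.  -2.5]
print(prox_quadratic(v, np.zeros(4), 1.0)) # shrinks v halfway towards a = 0
```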
- The machine learning system comprises a plurality of node units 1, ..., V; a plurality of sensor units 1S1, ..., 1Sγ, 2S1, ..., 2Sγ, ..., VS1, ..., VSγ; the Internet 0N; and LANs 1N, ..., VN.
- Each of the plurality of node units 1,..., V is an information processing device such as a server or a PC.
- each of the node units 1,..., V is provided with a learning unit 10 and a transmission / reception unit 11, and a storage unit 12 is provided in the learning unit 10. It is assumed that the storage unit 12 stores data necessary for processing and calculation described below.
- The sensor units 1S1, ..., 1Sγ, 2S1, ..., 2Sγ, ..., VS1, ..., VSγ are a large number of sensors (e.g., microphones, cameras, thermometers) arranged in a distributed manner in a variety of environments (e.g., homes, factories, fields, and buildings).
- The sensor units 1S1, ..., 1Sγ are the sensors corresponding to node unit 1; that is, they acquire the data used for machine learning in the learning unit 10 of node unit 1, and each of them is connected to node unit 1 via the LAN 1N. The same applies to the other sensor units.
- The data obtained through the sensor units 1S1, ..., 1Sγ, 2S1, ..., 2Sγ, ..., VS1, ..., VSγ are accumulated in the plurality of node units 1, ..., V (e.g., servers or PCs).
- Although the sensor units are here connected to the node units 1, ..., V via the LANs 1N, ..., VN, the sensor units 1S1, ..., VSγ may instead be connected to the node units 1, ..., V via the Internet 0N, or by other connecting means; in short, it is sufficient that the data obtained by the sensor units 1S1, ..., VSγ can be used by the node units 1, ..., V.
- the node parts 1,..., V are connected via the Internet 0 N.
- The node units 1, ..., V need not be fully connected; it suffices that they are in a relationship in which some data can be exchanged directly or indirectly.
- There is no restriction on the edge connection structure; it may be a tree structure or a ring structure. In short, all the node units 1, ..., V must be connected directly or indirectly; that is, the edge connection structure must not be partitioned into disconnected components.
- The plurality of node units 1, ..., V perform the machine learning process described below while receiving data at the transmission/reception unit 11, transmitting data from the transmission/reception unit 11, reading data from the storage unit 12, and writing data to the storage unit 12.
- the machine learning process is realized, for example, when the plurality of node units 1,..., V perform the processes from step S1 to step S8 in FIG.
- The node unit i (more specifically, the learning unit 10 of the node unit i) uses the data obtained by the sensor units corresponding to the node unit i to update the dual variable based on the following equation (step S2). That is, the node unit i calculates λ_i|j,b^(t+1) defined by the following expression for each i ∈ Ω_V, j ∈ Ω_N(i), b ∈ Ω_B.
- The node unit i (more specifically, the learning unit 10 of the node unit i) updates the main variable based on the following equation (step S3). That is, the node unit i calculates w_i,b^(t+1) defined by the following expression for each i ∈ Ω_V, b ∈ Ω_B.
- The node unit i (more specifically, the learning unit 10 of the node unit i) updates the dual auxiliary variable based on the following equation (step S4). That is, the node unit i calculates y_i|j,b^(t+1) defined by the following expression for each i ∈ Ω_V, j ∈ Ω_N(i), b ∈ Ω_B.
- The process of step S4 need not be performed except when the dual auxiliary variable y_i|j,b^(t+1) is to be transmitted to another node unit in step S5.
- The node unit i (more specifically, the transmission/reception unit 11 of the node unit i) transmits the dual auxiliary variable y_i|j,b^(t+1) to the node unit j (step S5).
- That is, the node unit i transmits the dual auxiliary variable y_i|j,b^(t+1) to each node unit j ∈ Ω_N(i) connected to it.
- The transmission/reception unit 11 of the node unit i operates so as to input the dual auxiliary variable received from the node unit j to the learning unit 10 of the node unit i as the dual auxiliary variable y_j|i,b^(t+1).
- The process of step S5 need not be performed every time and may be asynchronous between nodes; for example, all the node units i may transmit the dual auxiliary variable y_i|j,b^(t+1) only when t is an even number.
- It suffices that at least one node unit i and at least one node unit j exchange the dual auxiliary variables y_i|j,b^(t+1) and y_j|i,b^(t+1).
- The node unit i (more specifically, the learning unit 10 of the node unit i) updates the dual auxiliary variable based on the following equation (step S6). That is, for each i ∈ Ω_V, j ∈ Ω_N(i), b ∈ Ω_B, the node unit i sets z_i|j,b^(t+1) ← y_j|i,b^(t+1).
- In the machine learning system including a plurality of node units that each store input data and machine-learn a mapping using the main variable based on the teacher data and the input data while transmitting and receiving information to and from each other, machine learning is performed so as to minimize the surrogate convex function that is an upper bound of the cost function.
- a machine learning method including a step in which a plurality of node units perform machine learning using a single main variable that is common based on respective input data while transmitting and receiving information to and from each other.
- The plurality of node units perform machine learning so as to minimize the surrogate convex function that is an upper bound of the cost function, instead of the non-convex cost function that originally corresponds to the machine learning.
- The surrogate convex function is expressed using a first-order gradient with respect to the main variable of the cost function, or using first-order and second-order gradients with respect to the main variable of the cost function.
- In step S5, the node unit i may transmit and receive the dual variable λ_i|j,b^(t+1) instead of the dual auxiliary variable.
- In that case, step S6 is not performed; the steps other than step S6 are performed in the same manner as described above. [Modification 3] Below, a technique for reducing the transmission volume is described.
- the update difference approaches 0 as the optimization process converges. If irreversible compression processing is performed using this property, the transmission capacity should be further reduced.
- v_i|j,b^(t+1) is the update difference information.
- v_i|j,b^(t+1) tends to be biased toward values close to 0 as the optimization converges.
- y_i|j,b^(t+1) is the information that is originally desired to be exchanged between the nodes.
- m_i|j,b^(export) is reference information that has already been obtained.
- each element constituting v is compared with the threshold value thr, and if it is equal to or less than the threshold value, processing is performed to make it zero.
- thr is a predetermined number larger than 0 and is a very small number such as 1 × 10^−5.
- u is generated by assigning symbols according to appearance probability; for example, u is generated by assigning a short symbol to a value with high appearance probability.
- T_interval is a predetermined positive integer.
- the node unit i may perform the processing from step S9 to step S16 described below instead of the processing from step S5 to step S6. Since other processes are the same as those in the above-described embodiment, a duplicate description is omitted here.
- FIG. 8 shows a flowchart of an example of a machine learning method in which transmission capacity is reduced.
- The node unit i (more specifically, the learning unit 10 of the node unit i) updates the reference data based on the following equation (step S10). That is, the node unit i updates the reference data for each i ∈ Ω_V, j ∈ Ω_N(i), b ∈ Ω_B based on the following formula (step S10).
- The node unit i (more specifically, the transmission/reception unit 11 of the node unit i) transmits m_i|j,b to the node unit j (step S11).
- That is, the transmission/reception unit 11 of the node unit i transmits m_i|j,b to each node unit j ∈ Ω_N(i).
- The processing shown in the following equation need not be performed for all node units i and j; it may be performed only for some node units i and some node units j.
- It suffices that at least one node unit i transmits m_i|j,b to at least one node unit j.
- The node unit i (more specifically, the learning unit 10 of the node unit i), for each i ∈ Ω_V, j ∈ Ω_N(i), b ∈ Ω_B, updates the corresponding (t+1)-th value using the information obtained by this exchange, based on the following equation (step S12).
- The node unit i calculates an update difference based on the following equation (step S13). That is, the node unit i calculates v_i|j,b^(t+1).
- The node unit i (more specifically, the learning unit 10 of the node unit i) performs data compression for each i ∈ Ω_V, j ∈ Ω_N(i), b ∈ Ω_B based on the following equation to generate a code u_i|j,b^(t+1) (step S14).
- This data compression may be irreversible.
- C means a predetermined compression process such as a flooring-based compression process or an entropy coding-based compression process.
- The node unit i (more specifically, the transmission/reception unit 11 of the node unit i) transmits u_i|j,b^(t+1) to the node unit j (step S15).
- That is, the node unit i transmits u_i|j,b^(t+1) to each node unit j ∈ Ω_N(i) connected to it.
- The transmission/reception unit 11 operates to input the code u_j|i,b^(t+1) received from the node unit j to the learning unit 10 of the node unit i.
- The processing shown in the following equation need not be performed for all node units i and j; it may be performed only for some node units i and some node units j.
- It suffices that at least one node unit i transmits the code u_i|j,b^(t+1) to at least one node unit j and receives the code u_j|i,b^(t+1) output by that node unit j (step S15).
- The node unit i (more specifically, the learning unit 10 of the node unit i), for each i ∈ Ω_V, j ∈ Ω_N(i), b ∈ Ω_B, updates the dual auxiliary variable z_i|j,b^(t+1) based on the following formulas (step S16).
- Instead, the node unit i may update on a D-R-splitting basis as exemplified below; usually 1/2 is used for the averaging coefficient.
- Alternatively, the node unit i may perform the D-R-splitting-based update exemplified below; usually 1/2 is used for the averaging coefficient.
- the machine learning system may be realized by a single computer.
- the processing contents of the functions that each unit of the machine learning system should have are described by a program.
- By executing this program on a computer, the processing of the machine learning system is realized on the computer.
- the program describing the processing contents can be recorded on a computer-readable recording medium.
- a computer-readable recording medium for example, any recording medium such as a magnetic recording device, an optical disk, a magneto-optical recording medium, and a semiconductor memory may be used.
- each unit may be configured by executing a predetermined program on a computer, or at least a part of these processing may be realized by hardware.
- _ (Underscore) represents a subscript.
- x^y_z represents that y_z is a superscript of x.
- x_y_z represents that y_z is a subscript of x.
- A Bregman resolvent operator or a Bregman Cayley operator is used instead of the resolvent operator or the Cayley operator; that is, the variable update rule is constructed based on monotone operator splitting using the Bregman resolvent operator or the Bregman Cayley operator. Details are described below.
- Any strictly convex function can be used as the function D.
- the Euclidean distance is also a kind of Bregman divergence. It may also be written as Bregman divergence B D to clarify that Bregman divergence B is defined using function D.
- The strictly convex function D is assumed to have a certain continuity. Specifically, the strictly convex function D has the properties of strong convexity (SC) and Lipschitz smoothness (LS). These properties can be expressed as follows. (Property 1) Using a second-order Taylor expansion, the following approximation holds for the function D around the point w.
- The Hessian matrix H_D(w) (H_D(w) ∈ R^{m×m}) is a positive definite matrix.
- The Hessian matrices K_D and M_D (K_D ∈ R^{m×m}, M_D ∈ R^{m×m}) are positive definite matrices.
- the following relationship holds with the Hessian matrix H D (w).
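- A minimal numerical sketch of the Bregman divergence B_D(w_1||w_2) = D(w_1) − D(w_2) − ⟨∂D(w_2), w_1 − w_2⟩ is shown below; the quadratic D is an illustrative choice, chosen so that the divergence reduces to a quadratic distance (the squared Euclidean distance, up to a factor, when the Hessian is the identity):

```python
import numpy as np

def bregman(D, grad_D, w1, w2):
    """B_D(w1 || w2) = D(w1) - D(w2) - <grad D(w2), w1 - w2>."""
    return D(w1) - D(w2) - grad_D(w2) @ (w1 - w2)

# With the strictly convex D(w) = 0.5 * w^T H w (H positive definite), the
# Bregman divergence reduces to the quadratic distance 0.5*(w1-w2)^T H (w1-w2).
H = np.array([[2.0, 0.5], [0.5, 1.0]])
D = lambda w: 0.5 * w @ H @ w
grad_D = lambda w: H @ w

w1, w2 = np.array([1.0, -1.0]), np.array([0.5, 2.0])
print(bregman(D, grad_D, w1, w2))
print(0.5 * (w1 - w2) @ H @ (w1 - w2))    # same value
```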
- the Resolvent operator and the Cayley operator are generalized using Bregman divergence.
- The generalized resolvent operator and Cayley operator are called the Bregman resolvent operator and the Bregman Cayley operator, respectively.
- the Bregman Resolvent operator is described in Reference Non-Patent Document 2.
- Reference Non-Patent Document 2 H. H. Bauschke, J. M. Borwein, and P. L. Combettes, “Bregman Monotone Optimization Algorithms”, SIAM Journal on Control and Optimization, Vol. 42, Issue 2, pp. 596-636, 2003.
- The Bregman resolvent operator R_n and the Bregman Cayley operator C_n are given by the following equations.
- When the Bregman divergence B corresponds to the Euclidean distance, the Bregman resolvent operator R_n and the Bregman Cayley operator C_n reduce to the ordinary resolvent operator and Cayley operator, respectively. This will be described specifically.
- the subdifferentiation of function D is
- The Hessian matrices H_G_1(w), K_G_1, and M_G_1 have the following relationship.
- (Theorem 1) When the function G_1 is SC and LS, the Bregman resolvent operator R_1 has the following convergence rate.
- The coefficients σ_max,1 and σ_min,1 included in Equation (4-8) are given by the following equations.
- λ_max represents the maximum eigenvalue, and λ_min represents the minimum eigenvalue.
- The LS property (upper bound) and SC property (lower bound) of the function G_2 can be expressed as follows using an arbitrary vector h ∈ R^m.
- The Hessian matrices H_G_2(w), K_G_2, and M_G_2 have the following relationship.
- The coefficients σ_max,2 and σ_min,2 are given by Equations (4-18) and (4-19).
- <<3: Variable update rules and convergence rates of generalized P-R type and generalized D-R type monotone operator splitting>>
- The P-R type monotone operator splitting and the D-R type monotone operator splitting are derived by transforming Equation (1-2) using the Bregman resolvent operator and the Bregman Cayley operator.
- The P-R type and D-R type monotone operator splittings described here correspond to generalizations of the P-R type and D-R type monotone operator splittings using the Bregman divergence.
- Hereafter, the generalized P-R type monotone operator splitting and the generalized D-R type monotone operator splitting are used.
- [Generalized P-R type monotone operator splitting] Equation (1-2) is transformed as follows.
- Equation (4-22) is obtained by replacing the Cayley operator in Equation (3-1) with the Bregman Cayley operator.
- Equation (4-37) is obtained by replacing the Cayley operator in Equation (3-2) with the Bregman Cayley operator.
- Ideally, the inverse of the Hessian matrix H_D of the function D matches the inverse of the Hessian matrix H_G_i of the function G_i.
- Substituting Equation (4-48) into Equation (4-1), the defining equation of the Bregman divergence, yields the following.
- The real number appearing here is greater than 0 and corresponds to a parameter that determines the learning step size; it must be chosen so that the eigenvalues of the Hessian matrix H_D^(2GD) are greater than 0 and at most 1.
- In Equation (4-50), the arithmetic mean is used in (Case 2), but a geometric mean may be used instead of the arithmetic mean.
- Equation (4-50) (Equation (4-50')) shows that the Hessian matrix H_D is designed separately for Case 1 and Case 2.
- In Case 2, it is ideal to satisfy Equation (4-47) for both functions G_1 and G_2, but in practice this is difficult to ensure; therefore, Case 2 is designed as shown in Equation (4-50) (Equation (4-50')).
- The reason this design is preferable is that the following mathematical properties hold for the Hessian matrix. This will be described specifically. First, the inverse of the Hessian matrix H_D is calculated.
- The real number appearing here is greater than 0 and corresponds to a parameter that determines the learning step size; it must be chosen so that the eigenvalues of the Hessian matrix H_D^(AGD) are greater than 0 and at most 1.
- In Equation (4-51), the arithmetic mean is used in (Case 2), but a geometric mean may be used instead of the arithmetic mean.
- L_G_1 and L_G_2 are basically assumed to be diagonal matrices and are designed as follows, following the RMSProp method.
- L_G_1 and L_G_2 need not necessarily be diagonal matrices.
- the design is as follows.
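- The sketch below only illustrates the idea just described: forming diagonal curvature estimates in an RMSProp-like manner and combining them by an arithmetic mean (or, as noted, a geometric mean). It does not reproduce Equations (4-50) or (4-51); the decay rate, epsilon, and the gradients used are illustrative assumptions:

```python
import numpy as np

def rmsprop_diag(grad, state, beta=0.9, eps=1e-8):
    """Diagonal curvature estimate from a running average of squared gradients."""
    state = beta * state + (1.0 - beta) * grad**2
    return np.sqrt(state) + eps, state

rng = np.random.default_rng(0)
state1 = np.zeros(3)
state2 = np.zeros(3)
for _ in range(100):                          # stand-ins for gradients of G_1 and G_2
    L1, state1 = rmsprop_diag(rng.normal(size=3), state1)
    L2, state2 = rmsprop_diag(rng.normal(size=3) * 5.0, state2)

H_arith = np.diag((L1 + L2) / 2.0)            # arithmetic mean of the diagonal estimates
H_geom = np.diag(np.sqrt(L1 * L2))            # geometric mean as the noted alternative
print(np.round(np.diag(H_arith), 3), np.round(np.diag(H_geom), 3))
```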
- The recursive variable update rules corresponding to Equations (3-3) to (3-6) are as follows.
- Equation (4-54) corresponds to the generalized P-R type monotone operator splitting, and Equation (4-55) corresponds to the generalized D-R type monotone operator splitting.
- The Hessian matrix H_D(z) and its inverse H_D^{-1}(z) are both block diagonal matrices, as follows.
- Expression (4-52) includes updating of the latent variable p and the dual variable w, and two methods described below can be considered for handling the latent variable p and the dual variable w.
- Method 1 In this method, as in the conventional method, p is optimized using a penalty term and then w is optimized.
- variable update rules of equations (4-52) to (4-55) are specifically expressed as follows.
- Equations (4-54) and (4-55) are split into Equations (4-63), (4-64), and (4-65) so that the variables can be updated asynchronously at each node.
- the algorithm shown in FIG. 12 is obtained.
- The Hessian matrix H_D in the algorithm is preferably designed using Equation (4-50) or Equation (4-51).
- Method 2 In this method, the dual variable w is updated, and the latent variable p is updated as necessary.
- variable update rule is specifically expressed from the equations (4-68), (4-69), (4-54), and (4-55) as follows.
- the algorithm shown in FIG. 13 is obtained.
- The Hessian matrix H_D in the algorithm is preferably designed using Equation (4-50) or Equation (4-51).
- v i, u and s i, u are a set of input data and teacher data at node i
- p i is a latent variable at node i.
- p * is the optimum value of the latent variable p
- p i t is the value of the latent variable at node i obtained by t updates
- m = p′V.
- FIG. 15 shows the experimental results.
- B-MOS (GD) is the conventional method (algorithm in FIG. 11)
- B-MOS (AGD) is the method of the present application when the Hessian matrix is designed by the acceleration gradient method (AGD) (algorithm in FIG. 13)
- MOS (2GD) indicates the method of the present application (algorithm in FIG. 13) when a Hessian matrix is designed by the Newton method.
- D-ADMM represents another conventional method different from B-MOS (GD).
- the method of the present application has a higher convergence rate than the conventional method.
- When the Hessian matrix is designed according to the Newton method, the highest convergence rate is obtained.
- <<7: Expression using the derivative>> If, at the beginning of <<1: Definition of the Bregman resolvent and Bregman Cayley operators>>, the function D is assumed to be a differentiable strictly convex function, the description holds with "subdifferential of the function D" read as "derivative of the function D". In the following, the main expressions are shown with the subdifferential replaced by the derivative.
- The Bregman divergence B is then defined as follows.
- ∇ represents the operation of differentiating a function.
- (Property 1) and (Property 2) of the strictly convex function D can be expressed as follows. (Property 1) Using a second-order Taylor expansion, the following approximation holds for the function D around the point w.
- The Bregman resolvent operator R_n and the Bregman Cayley operator C_n are then given by the following equations.
- When the Bregman divergence B corresponds to the Euclidean distance, the Bregman resolvent operator R_n and the Bregman Cayley operator C_n reduce to the ordinary resolvent operator and Cayley operator, respectively, as in the case of using the subdifferential.
- The generalized P-R type monotone operator splitting is as follows, as in the case of the subdifferential.
- The generalized D-R type monotone operator splitting is as follows, as in the case of the subdifferential.
- Expressing the function D by the quadratic form of Equation (4-48) and substituting it into Equation (4-1)*, the defining equation of the Bregman divergence, yields the following.
- The Bregman Cayley operator C_2 satisfies the following expression.
- The function G_i is assumed to be differentiable; that is, the derivative ∇G_i of the function G_i is assumed to exist. Even when the function G_i is not differentiable, it can be made differentiable by, for example, applying processing such as smoothing to it, so assuming that the function G_i is differentiable poses no problem.
- In order to ensure that the function D is strongly convex, D is updated at an arbitrary timing only when the condition expressed by the following formula is satisfied; if the condition is not satisfied, D is not updated.
- the function D may be expressed by the following equation.
- FIG. 16 is a block diagram illustrating a configuration of the latent variable learning device 100.
- FIG. 17 is a flowchart showing the operation of the latent variable learning device 100.
- the latent variable learning device 100 includes a model learning unit 120 and a recording unit 190.
- The latent variable learning device 100 learns a latent variable w ∈ R^m (m is an integer of 1 or more) of the model to be machine-learned, using learning data.
- a model is a function that takes input data as input and output data as output.
- Learning data means the data used to learn the latent variables of the model.
- the model learning unit 120 learns the latent variable w by using a predetermined procedure using the learning data.
- the procedure will be specifically described below. (Procedure 1)
- a learning procedure using Bregman divergence will be described.
- the model learning unit 120 calculates setup data used when learning the latent variable w using the learning data.
- A cost function G(w): R^m → R for optimizing the latent variable w, calculated using the learning data, where G(w) = Σ_i G_i(w) is defined using functions G_i (N is an integer of 2 or more and i is an integer satisfying 1 ≤ i ≤ N, that is, i is an index).
- A Bregman divergence B_D(w_1||w_2) defined using a function D (where D: R^m → R is a strictly convex function).
- The model learning unit 120 learns the latent variable w so that the Bregman divergence B_D(w||w*) between w and the stationary point w* of w that minimizes the cost function G(w) becomes small.
- The function D is a strictly convex function that can be designed arbitrarily.
- the model learning unit 120 calculates setup data used when learning the latent variable w using the learning data.
- The Bregman resolvent operators R_i (1 ≤ i ≤ N) defined using the function D and the functions G_i, and the Bregman-Cayley operators C_i (1 ≤ i ≤ N) defined using the Bregman resolvent operators R_i, are calculated as setup data.
- The value of the latent variable w is calculated recursively using the Bregman resolvent operators R_i (1 ≤ i ≤ N) and the Bregman-Cayley operators C_i (1 ≤ i ≤ N).
- t is a variable used for counting the number of updates (hereinafter also referred to as a counter)
- Using the Bregman resolvent operators R_i (1 ≤ i ≤ N) and the Bregman-Cayley operators C_i (1 ≤ i ≤ N), the model learning unit 120 recursively calculates w^{t+1}, the (t+1)-th update result of the latent variable w.
- the counter t takes an integer value of 0 or more.
- FIG. 18 is a block diagram illustrating a configuration of the model learning unit 120.
- FIG. 19 is a flowchart showing the operation of the model learning unit 120.
- The model learning unit 120 includes an initialization unit 121, a latent variable calculation unit 122, a first auxiliary variable calculation unit 123, a second auxiliary variable calculation unit 124, a third auxiliary variable calculation unit 125, a counter update unit 126, and an end condition determination unit 127.
- The operation of the model learning unit 120 will be described with reference to FIG. 19.
- auxiliary variables x, y, z ⁇ R m of the latent variable w are used.
- The latent variable calculation unit 122 calculates w^{t+1}, the (t+1)-th update result of the latent variable w, from z^t, the t-th update result of the auxiliary variable z, using Equation (5-1).
- The first auxiliary variable calculation unit 123 calculates x^{t+1}, the (t+1)-th update result of the auxiliary variable x, from z^t used in S122 and w^{t+1} calculated in S122, according to Equation (5-2).
- The second auxiliary variable calculation unit 124 calculates y^{t+1}, the (t+1)-th update result of the auxiliary variable y, from x^{t+1} calculated in S123, using Equation (5-3).
- the third auxiliary variable calculation unit 125 calculates z t + 1 that is the t + 1th update result of the auxiliary variable z. This will be specifically described.
- The third auxiliary variable calculation unit 125 calculates z^{t+1}, the (t+1)-th update result of the auxiliary variable z, from x^{t+1} calculated in S123 and y^{t+1} calculated in S124, according to Equation (5-4).
- Alternatively, the third auxiliary variable calculation unit 125 calculates z^{t+1}, the (t+1)-th update result of the auxiliary variable z, from z^t used in S122, x^{t+1} calculated in S123, and y^{t+1} calculated in S124, using Equation (5-5) (where α is a real number satisfying 0 < α < 1).
- The counter update unit 126 increments the counter t by 1; specifically, t ← t + 1.
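The concrete update Equations (5-1) to (5-5) are not reproduced in this text. The following sketch illustrates only the structure of the S122 to S125 recursion, using ordinary Euclidean proximal steps as stand-ins for the Bregman resolvent operators; all function names, parameters, and the example cost are our assumptions, not the patent's.

```python
import numpy as np

def prox(grad_g, v, rho=1.0, iters=50, lr=0.1):
    # Inexact proximal step argmin_w g(w) + (rho/2)||w - v||^2 by gradient descent;
    # used here as a Euclidean stand-in for a Bregman resolvent operator.
    w = v.copy()
    for _ in range(iters):
        w -= lr * (grad_g(w) + rho * (w - v))
    return w

def model_learning_unit_120_sketch(grad_g1, grad_g2, dim, alpha=0.5, T=100):
    # Relaxed Peaceman-Rachford / Douglas-Rachford style recursion mirroring the
    # roles of S122 (w), S123 (x), S124 (y), and S125 (z).
    z = np.zeros(dim)
    for _ in range(T):
        w = prox(grad_g1, z)                  # latent variable update (role of S122)
        x = 2.0 * w - z                       # first auxiliary variable (role of S123)
        y = 2.0 * prox(grad_g2, x) - x        # second auxiliary variable (role of S124)
        z = (1.0 - alpha) * z + alpha * y     # averaged third auxiliary variable (role of S125)
    return w

# Example: G1(w) = 0.5||w - a||^2 and G2(w) = 0.5||w - b||^2; the iterate approaches (a + b)/2.
a, b = np.array([1.0, 0.0]), np.array([3.0, 2.0])
w_hat = model_learning_unit_120_sketch(lambda w: w - a, lambda w: w - b, dim=2)
```

Setting alpha = 1 removes the averaging and corresponds to the Peaceman-Rachford style variant of the third auxiliary variable update.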
- the function D may be a function that satisfies the following conditions.
- The function D is a function whose Hessian matrix is close to the inverse of the Hessian matrix of the function G_i(w).
- In other words, the function D is a function such that the product of its Hessian matrix and the Hessian matrix of the function G_i(w) is close to the identity matrix.
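Stated in symbols (our notation, not an equation from the original), the condition on D amounts to

\[ H_D \;\approx\; \bigl(\nabla^{2} G_i(w)\bigr)^{-1}, \qquad\text{equivalently}\qquad H_D\,\nabla^{2} G_i(w) \;\approx\; I . \]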
- the Hessian matrix of the function D may be calculated as follows according to the Newton method or the acceleration gradient method.
- The function G_1 is a strictly convex function, that is, strongly convex (SC) and Lipschitz smooth (LS).
- Equations (5-6) and (5-8) follow the Newton method.
- Equations (5-7) and (5-9) follow the accelerated gradient method.
- The matrices L_{G_1} and L_{G_2} are matrices given by Equations (5-10) and (5-11).
- The matrices L_{G_1} and L_{G_2} are matrices given by Equations (5-12) and (5-13).
- The matrix L_{G_1} in Equations (5-10) and (5-12) is a matrix calculated using the gradient of the function G_1(w). Similarly, the matrix L_{G_2} in Equations (5-11) and (5-13) is a matrix calculated using the gradient of the function G_2(w).
- In the above, the function G_1(w) and the function G_2(w) are assumed to be strictly convex functions; however, they do not necessarily have to be strictly convex. In other words, even when the function G_1(w) or the function G_2(w) can merely be assumed to be a strictly convex function, the Hessian matrix of the function D can be calculated by Equations (5-6) to (5-9). More specifically: [Case 1]: When the function G_1 is a strictly convex function (strongly convex (SC) and Lipschitz smooth (LS)), or can be assumed to be a strictly convex function (strongly convex (SC) and Lipschitz smooth (LS)).
- the Hessian matrix of the function D can be calculated by the equations (5-6) and (5-7).
- [Case 2]: When each of the functions G_1 and G_2 is a strictly convex function (strongly convex (SC) and Lipschitz smooth (LS)), or can be assumed to be a strictly convex function (strongly convex (SC) and Lipschitz smooth (LS)).
- the Hessian matrix of the function D can be calculated by the equations (5-8) and (5-9).
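As a rough, runnable illustration of the two design ideas above, namely making the Hessian of D close to the inverse Hessian of G (Newton-style) or constraining it to a diagonal matrix so that its inverse is cheap (accelerated-gradient-style); the actual Equations (5-6) to (5-13) are not reproduced, and the function names and the regularization constant eps are our assumptions:

```python
import numpy as np

def hessian_d_newton_style(H_G, eps=1e-6):
    # Newton-style design: H_D close to the inverse of the Hessian of G.
    n = H_G.shape[0]
    return np.linalg.inv(H_G + eps * np.eye(n))

def hessian_d_accelerated_style(H_G, eps=1e-6):
    # Accelerated-gradient-style design: H_D constrained to be diagonal,
    # matching only the diagonal curvature of G, so its inverse is cheap.
    return np.diag(1.0 / (np.diag(H_G) + eps))
```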
- the sub-differentiation is used.
- a differentiation may be used instead of the sub-differentiation.
- the following equation may be used instead of equations (5-1) to (5-5).
- When using the derivative, ∇D may be updated under the condition expressed by the following equation.
- ∇D is updated at an arbitrary timing only when the condition expressed by the following formula is satisfied.
- According to the invention of this embodiment, it is possible to learn the latent variables of a model that is a target of machine learning at high speed.
- Because the variable update rule is based on monotone operator splitting using the Bregman resolvent operators and the Bregman-Cayley operators, the latent variable can be updated so that convergence to the stationary point (optimal solution) is fast.
- Also, because it is based on monotone operator splitting using the Bregman resolvent and Bregman-Cayley operators, the convex function used to define the Bregman divergence can be designed appropriately, so that a variable update rule with fast convergence to the stationary point can be constructed.
- FIG. 20 is a block diagram illustrating a configuration of the latent variable learning system 20.
- As shown in FIG. 20, the latent variable learning system 20 includes V latent variable learning devices 200_1, ..., 200_V (V is an integer of 1 or more).
- Each latent variable learning device 200 i is connected to the network 900 and communicates with the latent variable learning device 200 j (j ⁇ i) as necessary.
- As the network 900, for example, the Internet can be used.
- FIG. 21 is a block diagram showing a configuration of the latent variable learning apparatus 200.
- the latent variable learning device 200 includes a model learning unit 220, a communication unit 280, and a recording unit 290.
- The latent variable learning device 200_i learns a latent variable p_i ∈ R^{p'} (p' is an integer of 1 or more) of a model that is a target of machine learning, using the learning data.
- the learning data may be common to all the latent variable learning devices 200 i , or may be different for each latent variable learning device 200 i .
- FIG. 22 is a block diagram illustrating a configuration of the model learning unit 220.
- FIG. 23 is a flowchart showing the operation of the model learning unit 220.
- The model learning unit 220 includes an initialization unit 221, a latent variable calculation unit 222, a first dual variable calculation unit 223, a synchronization variable update unit 224, a second dual variable calculation unit 225, a counter update unit 226, and an end condition determination unit 227.
- The operation of the model learning unit 220 will be described with reference to FIG. 23. Before entering the specific description, some symbols are explained; these symbols have been used in the discussion so far, and the following serves as a summary.
- N (i) represents an index set of the latent variable learning device 200 group communicating with the latent variable learning device 200 i .
- The function F_{1,i}(p_i): R^{p'} → R is a cost function for optimizing the latent variable p_i, calculated using the learning data (where F_{1,i}(p_i) is a closed proper convex function).
- The variables λ_{i|j} are variables of the latent variable learning device 200_i for storing data transmitted from the latent variable learning device 200_j (where j ∈ N(i)). These variables are also referred to as synchronization variables.
- t is a variable for counting the number of updates of the variables (hereinafter also referred to as a counter), and T (T is an integer of 1 or more) is a predetermined number of updates; T represents the upper limit on the number of repetitions of S222 to S225 described later.
- The latent variable learning system 20 includes the latent variable learning devices S_i (i ∈ ~V), and each latent variable learning device S_i learns the latent variable p_i according to a predetermined procedure using the learning data. The procedure is described below with reference to FIG. 23.
- the initialization unit 221 calculates setup data.
- The cost function F_{1,i}(p_i) is, for example, as follows.
- The latent variable calculation unit 222 calculates the update of the latent variable p_i by the following equation, using the dual variables z_{i|j}.
- The first dual variable calculation unit 223 calculates the dual variables λ_{i|j} by the following equation.
- H_{D_{i|j}} represents the Hessian matrix of the function D_{i|j}.
- The synchronization variable update unit 224 uses the communication unit 280 to store, in the corresponding synchronization variables, the values obtained by the learning of the latent variable learning devices S_j (j ∈ N(i)).
- α is a real number satisfying 0 < α < 1.
- Equation (6-1) corresponds to the case where the generalized P-R type monotone operator splitting is used.
- Equation (6-2) corresponds to the case where the generalized D-R type monotone operator splitting is used.
- The counter update unit 226 increments the counter t by 1; specifically, t ← t + 1.
- If the end condition is satisfied, the end condition determination unit 227 outputs the value of the latent variable p_i at that time and terminates the process. Otherwise, the process returns to S222; that is, the model learning unit 220 repeats the calculations of S222 to S225. Note that the function D_{i|j} used in updating z_{i|j} is preferably designed using Equations (5-6) to (5-9). (Modification 1)
- The above procedure corresponds to the algorithm of FIG. 12.
- Here, a procedure corresponding to the algorithm of FIG. 13 is described.
- In this procedure, since there is no need to exchange latent variables, the variable ζ_i is not used.
- In this modification, the latent variable learning device 200 includes a model learning unit 230 instead of the model learning unit 220.
- the model learning unit 230 will be described with reference to FIGS.
- FIG. 24 is a block diagram illustrating a configuration of the model learning unit 230.
- FIG. 25 is a flowchart showing the operation of the model learning unit 230.
- The model learning unit 230 includes an initialization unit 231, a first dual variable calculation unit 232, a latent variable calculation unit 233, a synchronization variable update unit 234, a second dual variable calculation unit 235, a counter update unit 236, and an end condition determination unit 237.
- The operation of the model learning unit 230 will be described with reference to FIG. 25.
- For j ∈ N(i), the first dual variable calculation unit 232 calculates the dual variables z_{i|j} by the following equation.
- H_{D_{i|j}} represents the Hessian matrix of the function D_{i|j}, and H_{F_{1,i}} represents the Hessian matrix of the function F_{1,i}.
- ∂F_{1,i} represents the subdifferential of the function F_{1,i}.
- The latent variable calculation unit 233 calculates p_i^{t+1}, the (t+1)-th update result of the latent variable p_i, by the following equation.
- w_{i|j}^t represents the t-th update result of the dual variable w_{i|j}.
- The synchronization variable update unit 234 uses the communication unit 280 to store, in the corresponding synchronization variables, the values obtained by the learning of the latent variable learning devices S_j (j ∈ N(i)).
- α is a real number satisfying 0 < α < 1.
- Equation (6-1) corresponds to the case where the generalized P-R type monotone operator splitting is used.
- Equation (6-2) corresponds to the case where the generalized D-R type monotone operator splitting is used.
- The counter update unit 236 increments the counter t by 1; specifically, t ← t + 1.
- If the end condition is satisfied, the end condition determination unit 237 outputs the value of the latent variable p_i at that time and terminates the process. Otherwise, the process returns to S232; that is, the model learning unit 230 repeats the calculations of S232 to S235. Note that the function D_{i|j} used in updating z_{i|j} is preferably designed using Equations (5-6) to (5-9).
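As an illustration of the message-passing pattern only (which variables stay local and which are exchanged, cf. S232 to S235 above and steps (c) and (d) of the claims), the following toy sketch keeps each node's data local and exchanges per-edge auxiliary messages with alpha-averaging on reception. The local update is a plain gradient step on a quadratic cost with a consensus penalty, used as a simplified stand-in for the dual-variable update equations, which are not reproduced here; all names and numbers are our assumptions.

```python
import numpy as np

def run_node_exchange_sketch(a, edges, T=400, eta=0.05, alpha=0.5):
    V = len(a)
    neighbors = {i: [] for i in range(V)}
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
    w = [float(v) for v in a]                                   # local estimates; raw data a_i is never sent
    z = {(i, j): 0.0 for i in range(V) for j in neighbors[i]}   # received messages z_i|j
    for _ in range(T):
        y = {}
        for i in range(V):
            # Local step on 0.5*(w_i - a_i)^2 plus a pull toward the received messages.
            grad = (w[i] - a[i]) + sum(w[i] - z[(i, j)] for j in neighbors[i])
            w[i] -= eta * grad
            for j in neighbors[i]:
                y[(i, j)] = w[i]                                # outgoing auxiliary message y_i|j
        for i in range(V):
            for j in neighbors[i]:
                # Reception with averaging: z_i|j <- (1 - alpha) z_i|j + alpha * y_j|i.
                z[(i, j)] = (1.0 - alpha) * z[(i, j)] + alpha * y[(j, i)]
    return w

# Example: three nodes in a line settle near [1.5, 2.0, 2.5], the minimizer of
# sum_i 0.5*(w_i - a_i)^2 + 0.5 * sum_edges (w_i - w_j)^2 for a = [1, 2, 3].
print(run_node_exchange_sketch([1.0, 2.0, 3.0], edges=[(0, 1), (1, 2)]))
```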
- According to the invention of this embodiment, it is possible to learn the latent variables of a model that is a target of machine learning at high speed.
- Because the variable update rule is based on monotone operator splitting using the Bregman resolvent operators and the Bregman-Cayley operators, the latent variable can be updated so that convergence to the stationary point (optimal solution) is fast.
- Also, because it is based on monotone operator splitting using the Bregman resolvent and Bregman-Cayley operators, the convex function used to define the Bregman divergence can be designed appropriately, so that a variable update rule with fast convergence to the stationary point can be constructed.
- The device of the present invention has, for example, as a single hardware entity, an input unit to which a keyboard or the like can be connected, an output unit to which a liquid crystal display or the like can be connected, a communication unit to which a communication device (for example, a communication cable) capable of communicating with the outside of the hardware entity can be connected, a CPU (Central Processing Unit, which may include a cache memory, registers, and the like), a RAM and a ROM as memories, an external storage device such as a hard disk, and a bus that connects the input unit, the output unit, the communication unit, the CPU, the RAM, the ROM, and the external storage device so that data can be exchanged among them.
- If necessary, the hardware entity may be provided with a device (drive) that can read from and write to a recording medium such as a CD-ROM.
- A physical entity provided with such hardware resources includes a general-purpose computer.
- The external storage device of the hardware entity stores the programs necessary for realizing the functions described above and the data necessary for processing the programs (the storage is not limited to the external storage device; for example, the programs may be stored in a ROM, which is a read-only storage device). Data obtained by the processing of these programs is stored in the RAM, the external storage device, or the like as appropriate.
- Each program stored in the external storage device, the ROM, or the like, and the data necessary for processing each program, are read into memory as necessary and interpreted, executed, and processed by the CPU as appropriate.
- As a result, the CPU realizes predetermined functions (the components described above as units, means, and the like).
- When the processing functions of the hardware entity (the device of the present invention) described in the above embodiments are realized by a computer, the processing contents of the functions that the hardware entity should have are described by a program. By executing this program on the computer, the processing functions of the hardware entity are realized on the computer.
- the program describing the processing contents can be recorded on a computer-readable recording medium.
- a computer-readable recording medium for example, any recording medium such as a magnetic recording device, an optical disk, a magneto-optical recording medium, and a semiconductor memory may be used.
- As the magnetic recording device, a hard disk device, a flexible disk, a magnetic tape, or the like can be used; as the optical disk, a DVD (Digital Versatile Disc), a DVD-RAM (Random Access Memory), a CD-ROM (Compact Disc Read Only Memory), a CD-R (Recordable)/RW (ReWritable), or the like can be used; as the magneto-optical recording medium, an MO (Magneto-Optical disc) or the like can be used; and as the semiconductor memory, an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) or the like can be used.
- This program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Furthermore, the program may be distributed by storing it in a storage device of a server computer and transferring it from the server computer to another computer via a network.
- a computer that executes such a program first stores a program recorded on a portable recording medium or a program transferred from a server computer in its own storage device. When executing the process, this computer reads the program stored in its own storage device and executes the process according to the read program.
- The computer may also read the program directly from the portable recording medium and execute processing according to the program, or may sequentially execute processing according to the received program each time the program is transferred from the server computer to the computer.
- The above-described processing may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized only by execution instructions and acquisition of results.
- The program in this embodiment includes information that is used for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has a property that defines the processing of the computer).
- In this embodiment, the hardware entity is configured by executing a predetermined program on the computer; however, at least a part of these processing contents may be realized by hardware.
Abstract
Description
《Consensus-Building Algorithm of the Distributed ADMM Method》
The consensus-building algorithm of the distributed ADMM method is described below.
(1) Primal variable update
《PDMM-Based Consensus-Building Algorithm》
Next, a PDMM-based consensus-building algorithm is described. The major difference from distributed ADMM is that not only the primal variable but also the dual variables are shared. As described later, consensus is formed on pairs of dual variables.
(1) Primal variable update
(Problem)
The cost function G is assumed to be additively split into a function G1 and a function G2, where both G1 and G2 are closed proper convex functions (hereinafter, a closed proper convex function is simply referred to as a convex function).
《Specific Problem 1: Consensus-Building Problem in Edge Computing》
Consider a graph G(v, e) in which V nodes are arbitrarily connected (where v is the set of nodes and e is the set of edges).
《Specific Problem 2: Generic Model Generation Problem in Image/Speech/Language Recognition Tasks》
As a method for generating a generic model for recognition tasks such as image, speech, and language recognition, it is known to be useful to learn a latent variable p using a cost of the following form (Reference Non-Patent Literature 1).
(Reference Non-Patent Literature 1: V. Vapnik, "Principles of Risk Minimization for Learning Theory", Advances in Neural Information Processing Systems 4 (NIPS1991), pp.831-838, 1992.)
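The cost referred to above is given by an equation that is not reproduced in this text. A typical empirical-risk form of the kind meant here (following Reference Non-Patent Literature 1; the symbols are our assumptions) is

\[ \min_{p} \;\; \frac{1}{|\mathcal{D}|} \sum_{d \in \mathcal{D}} \ell(d;\, p), \]

where 𝒟 is the set of training samples and ℓ(d; p) is the loss incurred on a single sample d under the latent variable p.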
《Conventional Method for Finding the Optimal Solution w*》
Here, conventional solution methods are briefly described. As mentioned above, the problem to be solved is Equation (1-1). The optimal solution w* of the problem expressed by Equation (1-1) (a stationary point of w) is obtained when, as Equation (1-2) shows, the subdifferential of the cost G contains the zero vector. Note that when Gi is a convex function, the subdifferential ∂Gi of Gi is a monotone operator.
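Equations (1-1) and (1-2) themselves are not reproduced in this text. With the cost additively split as described above, a reading consistent with the surrounding description (our notation) is

\[ w^{*} \;=\; \operatorname*{arg\,min}_{w}\; G(w) \;=\; \operatorname*{arg\,min}_{w}\; \bigl(G_1(w) + G_2(w)\bigr), \]

\[ 0 \;\in\; \partial G(w^{*}) \;=\; \partial G_1(w^{*}) + \partial G_2(w^{*}) \quad\text{(under standard regularity conditions)}. \]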
A consensus-building algorithm with enhanced confidentiality is described below. This consensus-building algorithm is a modification of the conventional PDMM.
(1) Dual variable update (here, the cost is described in a general form, without restricting it to the least-squares problem)
(1) Dual variable update
As shown in FIG. 5, the machine learning system includes, for example, a plurality of node units 1, ..., V, a plurality of sensor units 1S1, ..., 1Sα, 2S1, ..., 2Sβ, VS1, ..., VSγ, the Internet 0N, and LANs 1N, ..., VN. Each of the node units 1, ..., V is an information processing device such as a server or a PC. As shown in FIG. 6, each of the node units 1, ..., V is provided with a learning unit 10 and a transmission/reception unit 11, and a storage unit 12 is provided in the learning unit 10. It is assumed that the storage unit 12 stores the data necessary for the processing and calculation described below.
《Modifications》
[Modification 1]
When the following cost function is used instead of the general-form cost function:
With i = 1, ..., V, Di data items are accumulated in node unit i; Di may vary over time, so online optimization can also be supported. In this case, the cost function can be written as follows.
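The cost function itself is given by an equation not reproduced in this text. A form consistent with the description, including the possibly time-varying Di of the online setting, would be (our notation; x_{i,d} is a hypothetical symbol for the d-th data item stored in node unit i)

\[ G(w) \;=\; \sum_{i=1}^{V} f_i(w), \qquad f_i(w) \;=\; \sum_{d=1}^{D_i} \ell\bigl(x_{i,d};\, w\bigr). \]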
In step S5, node unit i may transmit and receive the dual variable λi|j,b^(t+1) instead of the dual auxiliary variable yi|j,b^(t+1).
[Modification 4]
A technique for compressing the transmission capacity is described below.
(1) Dual variable update (here, the cost is described in a general form, without restricting it to the least-squares problem)
(4) Update of the reference data
(7) Calculation of the update difference
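As an illustration of the general idea behind steps (4) and (7) above, transmitting only a quantized difference from shared reference data instead of the full variable, the following sketch is written under our own assumptions; the patent's concrete quantization and update procedure are not reproduced.

```python
import numpy as np

def encode_update(value, reference, step=0.01):
    # (7) Compute the update difference from the shared reference and quantize it;
    # small integer codes are cheaper to transmit than full-precision vectors.
    return np.round((value - reference) / step).astype(np.int32)

def decode_update(code, reference, step=0.01):
    # The receiver reconstructs the value from the same reference data.
    return reference + step * code.astype(np.float64)

# Example round trip with hypothetical numbers; after each exchange both sides
# refresh the reference data (step (4) above) so the next difference stays small.
reference = np.zeros(3)
y_new = np.array([0.053, -0.021, 0.004])
code = encode_update(y_new, reference)
y_received = decode_update(code, reference)
reference = y_received
```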
《Program and Recording Medium》
The machine learning system may be realized by a single computer. In this case, the processing contents of the functions that each unit of the machine learning system should have are described by a program, and by executing this program on a computer, the processing of the machine learning system is realized on the computer.
A consensus-building algorithm that can also be applied to the optimization of non-convex functions is described below.
(1) Dual variable update
As shown in FIG. 5, the machine learning system includes, for example, a plurality of node units 1, ..., V, a plurality of sensor units 1S1, ..., 1Sα, 2S1, ..., 2Sβ, VS1, ..., VSγ, the Internet 0N, and LANs 1N, ..., VN. Each of the node units 1, ..., V is an information processing device such as a server or a PC. As shown in FIG. 6, each of the node units 1, ..., V is provided with a learning unit 10 and a transmission/reception unit 11, and a storage unit 12 is provided in the learning unit 10. It is assumed that the storage unit 12 stores the data necessary for the processing and calculation described below.
Instead of the non-convex cost function that originally corresponds to the machine learning, the plurality of node units perform the machine learning so as to minimize a surrogate convex function that is an upper bound of the cost function.
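The surrogate itself is specified by equations not reproduced here. A typical first-order surrogate of the kind described (the claims also allow a variant that additionally uses the second-order gradient; the form below is our assumption) is

\[ \hat{f}_i\bigl(w;\, w^{(t)}\bigr) \;=\; f_i\bigl(w^{(t)}\bigr) \;+\; \nabla f_i\bigl(w^{(t)}\bigr)^{\top}\bigl(w - w^{(t)}\bigr) \;+\; \frac{\eta}{2}\,\bigl\|w - w^{(t)}\bigr\|^{2} \;\;\ge\;\; f_i(w), \]

where η is chosen large enough that the quadratic term dominates the curvature of f_i, so that minimizing the convex upper bound also decreases the original non-convex cost.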
《Modifications》
[Modification 1]
With i = 1, ..., V, Di data items are accumulated in node unit i; Di may vary over time, so online optimization can also be supported. In this case, the cost function can be written as follows.
In step S5, node unit i may transmit and receive the dual variable λi|j,b^(t+1) instead of the dual auxiliary variable yi|j,b^(t+1).
[Modification 3]
A technique for compressing the transmission capacity is described below.
(1) Dual variable update
(7) Calculation of the update difference
《Program and Recording Medium》
The machine learning system may be realized by a single computer. In this case, the processing contents of the functions that each unit of the machine learning system should have are described by a program, and by executing this program on a computer, the processing of the machine learning system is realized on the computer.
In the present application, instead of using the resolvent operator and the Cayley operator, the Bregman resolvent operator and the Bregman-Cayley operator are used. That is, a variable update rule is constructed based on monotone operator splitting using the Bregman resolvent operator and the Bregman-Cayley operator. This is described in detail below.
《1: Definition of the Bregman Resolvent Operator and the Bregman-Cayley Operator》
First, the Bregman divergence B is described. The Bregman divergence B is defined as follows using a continuous and differentiable strictly convex function D (hereinafter, a continuous and differentiable strictly convex function is simply referred to as a strictly convex function).
(Property 1) Using a second-order Taylor expansion, the following approximation holds for the function D around a point w.
(Reference Non-Patent Literature 2: H. H. Bauschke, J. M. Borwein, and P. L. Combettes, "Bregman Monotone Optimization Algorithms", SIAM Journal on Control and Optimization, Vol.42, Issue 2, pp.596-636, 2003.)
The Bregman resolvent operator Rn and the Bregman-Cayley operator Cn are given by the following equations.
Here, the convergence rates of the Bregman resolvent operator and the Bregman-Cayley operator are described for two cases.
[Case 1]: The function G1 is a strictly convex function, that is, strongly convex (SC) and Lipschitz smooth (LS).
(Property 1) Using a second-order Taylor expansion, the following approximation holds for the function G1 around a point w.
(Theorem 1) When the function G1 is SC and LS, the Bregman resolvent operator R1 has the following convergence rate.
[Case 2]: The functions G1 and G2 are both strictly convex functions, that is, both strongly convex (SC) and Lipschitz smooth (LS).
(Property 1) Using a second-order Taylor expansion, the following approximation holds for the function G2 around a point w.
(Theorem 3) When the function G2 is SC and LS, the Bregman resolvent operator R2 has the following convergence rate.
《3: Variable Update Rules and Convergence Rates of the Generalized P-R Type and Generalized D-R Type Monotone Operator Splitting》
By transforming Equation (1-2) using the Bregman resolvent operator and the Bregman-Cayley operator, a P-R type monotone operator splitting and a D-R type monotone operator splitting are derived. The P-R type and D-R type monotone operator splittings described here correspond to generalizations, by means of the Bregman divergence, of the P-R type and D-R type monotone operator splittings. Hereinafter, they are referred to as the generalized P-R type monotone operator splitting and the generalized D-R type monotone operator splitting, respectively.
[Generalized P-R type monotone operator splitting]
Equation (1-2) is transformed as follows.
[Generalized D-R type monotone operator splitting]
The generalized D-R type monotone operator splitting is obtained by adding an averaging operator to Equation (4-22).
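Written as fixed-point iterations in the Bregman-Cayley operators (the patent's Equation (4-22) is not reproduced here, and the ordering of the two operators is our assumption), the two splittings take the form

\[ z^{(t+1)} \;=\; C_1 C_2\, z^{(t)} \quad\text{(generalized P-R type)}, \qquad z^{(t+1)} \;=\; (1-\alpha)\, z^{(t)} \;+\; \alpha\, C_1 C_2\, z^{(t)} \quad\text{(generalized D-R type, } 0 < \alpha < 1\text{)}. \]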
《4: Bregman Divergence Design for a High Convergence Rate》
The optimal solution can be obtained by using the variable update rules based on the generalized P-R type or generalized D-R type monotone operator splitting. Here, a design of the Bregman divergence (a design of the function D) that allows the optimal solution to be obtained faster is described. Specifically, two design methods are described.
When designing the Bregman divergence according to Equation (4-49), in order to satisfy Equation (4-47) as faithfully as possible, the Hessian matrix H_D should be designed to follow Equation (4-50). This design leads to the Newton method, which is well known as an optimization method with quadratic convergence. Instead of Equation (4-50), an approximate Hessian computation method such as the BFGS method may be used.
[Bregman divergence design following the accelerated gradient method]
When the Bregman divergence design following the Newton method is used, the actual variable update rule requires computing the inverse of the Hessian matrix H_D, which is very costly. To overcome this computational cost problem, the reproduction of Equation (4-47) is somewhat sacrificed in order to lower the cost of computing the inverse of the Hessian matrix H_D. As such a design, a Bregman divergence design following the accelerated gradient method is described. Briefly, in this design the Hessian matrix H_D is constrained to be a diagonal matrix and is then designed so as to satisfy Equation (4-47) as closely as possible. This design leads to the accelerated gradient method, which is known for superlinear convergence.
Here, taking the consensus-building problem in edge computing as an example, a variable update algorithm is described, which is a recursive variable update rule using the monotone operator splitting generalized by means of the Bregman divergence.
[Method 1]
In this method, as in the conventional method, p is optimized using a penalty term, and then w is optimized.
[Method 2]
In this method, the dual variable w is updated, and the latent variable p is updated as needed.
《6: Experiments and Results》
To confirm the effect of the variable update rules described above, experiments on the convergence rate were conducted. In the experiments, distributed computers (edge computing systems) with the two types of graph configurations shown in FIG. 14 were used. A different, randomly generated data set was placed at each node of the distributed computers, and the convergence rate of the latent variables was measured.
《7: Expressions Using Differentiation》
Since it is assumed at the beginning of 《1: Definition of the Bregman Resolvent Operator and the Bregman-Cayley Operator》 that the function D is a differentiable strictly convex function, the discussion from 《1: Definition of the Bregman Resolvent Operator and the Bregman-Cayley Operator》 through 《5: Application to the Consensus-Building Problem in Edge Computing》 still holds when the subdifferential of the function D is replaced by the derivative of the function D. Specifically, the explanations in those sections hold when the phrase "subdifferential of the function D" is read as "derivative of the function D". The main expressions with the subdifferential replaced by the derivative are shown below.
(Property 1) Using a second-order Taylor expansion, the following approximation holds for the function D around a point w.
The two Bregman divergence design methods described in 《4: Bregman Divergence Design for a High Convergence Rate》 discussed how to design the function D, which determines the metric of the Bregman divergence, when D is restricted to a quadratic form. Here, a design of the function D that enables even faster convergence by using higher-order convexity of second order or higher is described. In the following, the design of the derivative ∇D of the function D is described; since it is ∇D that is used in the variable updates, describing the design in terms of ∇D does not lose generality.
The latent variable learning device 100 is described below with reference to FIGS. 16 and 17. FIG. 16 is a block diagram showing the configuration of the latent variable learning device 100. FIG. 17 is a flowchart showing the operation of the latent variable learning device 100. As shown in FIG. 16, the latent variable learning device 100 includes a model learning unit 120 and a recording unit 190.
(Procedure 1)
Here, a learning procedure using the Bregman divergence is described.
(Procedure 2)
Here, as described in <Technical Background>, a learning procedure using a latent variable update rule constructed by means of monotone operator splitting is described. In <Technical Background>, the case of N = 2 was described in detail, but a learning procedure as described below can be constructed for any N for which a latent variable update rule can be constructed using monotone operator splitting. For example, for N = 3, it can be proved mathematically that the monotone operator splitting can be transformed in the same way as for N = 2, so a similar learning procedure can be constructed.
To increase the rate of convergence to the stationary point w* of w that minimizes the cost function G(w), the function D may be chosen to satisfy the following condition.
(Condition) The function D is a function whose Hessian matrix is close to the inverse of the Hessian matrix of the function Gi(w).
[Case 1]: When the function G1 is a strictly convex function, that is, strongly convex (SC) and Lipschitz smooth (LS).
Note that the real number ε is chosen so that the eigenvalues of the Hessian matrix H_D are greater than 0 and at most 1.
[Case 2]: When the functions G1 and G2 are both strictly convex functions, that is, both strongly convex (SC) and Lipschitz smooth (LS).
Note that the real number ε is chosen so that the eigenvalues of the Hessian matrix H_D are greater than 0 and at most 1.
[Case 1]: When the function G1 is a strictly convex function (strongly convex (SC) and Lipschitz smooth (LS)), or can be assumed to be a strictly convex function (strongly convex (SC) and Lipschitz smooth (LS)).
[Case 2]: When each of the functions G1 and G2 is a strictly convex function (strongly convex (SC) and Lipschitz smooth (LS)), or can be assumed to be a strictly convex function (strongly convex (SC) and Lipschitz smooth (LS)).
Here, embodiments corresponding to the two variable update algorithms (the algorithms of FIG. 12 and FIG. 13) described in 《5: Application to the Consensus-Building Problem in Edge Computing》 of <Technical Background> are described.
(Modification 1)
The above procedure corresponds to the algorithm of FIG. 12. Here, a procedure corresponding to the algorithm of FIG. 13 is described. In this procedure, since there is no need to exchange latent variables, the variable ζi is not used.
The device of the present invention has, for example, as a single hardware entity, an input unit to which a keyboard or the like can be connected, an output unit to which a liquid crystal display or the like can be connected, a communication unit to which a communication device (for example, a communication cable) capable of communicating with the outside of the hardware entity can be connected, a CPU (Central Processing Unit, which may include a cache memory, registers, and the like), a RAM and a ROM as memories, an external storage device such as a hard disk, and a bus that connects the input unit, the output unit, the communication unit, the CPU, the RAM, the ROM, and the external storage device so that data can be exchanged among them. If necessary, the hardware entity may also be provided with a device (drive) that can read from and write to a recording medium such as a CD-ROM. A physical entity provided with such hardware resources includes a general-purpose computer.
Claims (12)
- In a machine learning system including a plurality of node units that, while transmitting and receiving information to and from one another, machine-learn a mapping using a single common primal variable based on their respective input data,
the machine learning is performed so as to minimize a surrogate convex function that is an upper bound of the cost function, instead of the non-convex cost function that originally corresponds to the machine learning, and
the surrogate convex function is expressed by an expression of the first-order gradient of the cost function with respect to the primal variable, or by expressions of the first-order gradient and the second-order gradient of the cost function with respect to the primal variable,
the machine learning system. - In the machine learning system according to claim 1,
let V be a predetermined positive integer of 2 or more, let the plurality of node units be node units 1, ..., V, let ~V = {1, ..., V}, let B be a predetermined positive integer, let b = 1, ..., B and ~B = {1, ..., B}, let ~N(i) be the set of node units connected to node unit i, let λi|j,b be the b-th vector constituting the dual variable λi|j of node unit i with respect to node unit j, let λi|j,b^(t+1) be the dual variable λi|j,b after the (t+1)-th update, let zi|j,b be the b-th vector constituting the dual auxiliary variable zi|j of the dual variable λi|j, let zi|j,b^(t+1) be the dual auxiliary variable zi|j,b after the (t+1)-th update, let yi|j,b be the b-th vector constituting the dual auxiliary variable yi|j of the dual variable λi|j, let yi|j,b^(t+1) be the dual auxiliary variable yi|j,b after the (t+1)-th update, let wi,b be the b-th element of the primal variable wi of node unit i, let wi,b^(t) be wi,b after the t-th update, let fi be the cost function corresponding to node unit i used in the machine learning, let ∇fi(wi,b^(t)) be the first-order gradient of the cost function fi with respect to wi,b^(t), let Ai|j be defined by the following equation, let I be the identity matrix, let O be the zero matrix, let σ1 be a predetermined positive number, let η be a positive number, and let T be a positive integer; then,
for t = 0, ..., T-1, (a) a process in which node unit i updates the dual variables based on the following equations, and
(b) a process in which node unit i updates the primal variable based on the following equation,
are performed, and for some or all of t = 0, ..., T-1, in addition to the processes (a) and (b) above,
(c) a process in which, with i ∈ ~V, j ∈ ~N(i), b ∈ ~B, at least one node unit i transmits the dual auxiliary variable yi|j,b^(t+1) to at least one node unit j, and
(d) a process in which, with i ∈ ~V, j ∈ ~N(i), b ∈ ~B, a node unit i that has received the dual auxiliary variable yj|i,b^(t+1) sets zi|j,b^(t+1) = yj|i,b^(t+1),
are performed,
the machine learning system. - In a machine learning method including a step in which a plurality of node units, while transmitting and receiving information to and from one another, machine-learn a mapping using a single common primal variable based on their respective input data,
the plurality of node units perform the machine learning so as to minimize a surrogate convex function that is an upper bound of the cost function, instead of the non-convex cost function that originally corresponds to the machine learning, and
the surrogate convex function is expressed by an expression of the first-order gradient of the cost function with respect to the primal variable, or by expressions of the first-order gradient and the second-order gradient of the cost function with respect to the primal variable,
the machine learning method. - In the machine learning method according to claim 3,
let V be a predetermined positive integer of 2 or more, let the plurality of node units be node units 1, ..., V, let ~V = {1, ..., V}, let B be a predetermined positive integer, let b = 1, ..., B and ~B = {1, ..., B}, let ~N(i) be the set of node units connected to node unit i, let λi|j,b be the b-th vector constituting the dual variable λi|j of node unit i with respect to node unit j, let λi|j,b^(t+1) be the dual variable λi|j,b after the (t+1)-th update, let zi|j,b be the b-th vector constituting the dual auxiliary variable zi|j of the dual variable λi|j, let zi|j,b^(t+1) be the dual auxiliary variable zi|j,b after the (t+1)-th update, let yi|j,b be the b-th vector constituting the dual auxiliary variable yi|j of the dual variable λi|j, let yi|j,b^(t+1) be the dual auxiliary variable yi|j,b after the (t+1)-th update, let wi,b be the b-th element of the primal variable wi of node unit i, let wi,b^(t) be wi,b after the t-th update, let fi be the cost function corresponding to node unit i used in the machine learning, let ∇fi(wi,b^(t)) be the first-order gradient of the cost function fi with respect to wi,b^(t), let Ai|j be defined by the following equation, let I be the identity matrix, let O be the zero matrix, let σ1 be a predetermined positive number, let η be a positive number, and let T be a positive integer; then,
for t = 0, ..., T-1, (a) a step in which node unit i updates the dual variables based on the following equations, and
(b) a step in which node unit i updates the primal variable based on the following equation,
are performed, and for some or all of t = 0, ..., T-1, in addition to the steps (a) and (b) above,
(c) a step in which, with i ∈ ~V, j ∈ ~N(i), b ∈ ~B, at least one node unit i transmits the dual auxiliary variable yi|j,b^(t+1) to at least one node unit j, and
(d) a step in which, with i ∈ ~V, j ∈ ~N(i), b ∈ ~B, a node unit i that has received the dual auxiliary variable yj|i,b^(t+1) sets zi|j,b^(t+1) = yj|i,b^(t+1),
are performed,
the machine learning method. - A program for causing a computer to function as each unit of the machine learning system according to claim 1 or 2.
- In a machine learning system including a plurality of node units that perform machine learning for obtaining a learning result for a single common primal variable,
each of the plurality of node units includes:
a learning unit that uses the input data input to the node unit to update the dual variables of the node unit such that, if the dual variables are optimized, the primal variable is also optimized; and
a transmission/reception unit that performs: transmission processing of transmitting, to node units other than the node unit, not the primal variable but the dual variables obtained by the learning unit of the node unit, or dual auxiliary variables that are auxiliary variables of the dual variables; and
reception processing of receiving, not the primal variable but the dual variables obtained by the learning units of the node units other than the node unit, or dual auxiliary variables that are auxiliary variables of the dual variables,
wherein
the learning unit
updates the dual variables, when the transmission/reception unit has performed the reception processing, using the received dual variables or the received dual auxiliary variables that are auxiliary variables of the dual variables,
the machine learning system. - In the machine learning system according to claim 6,
the update processing of the dual variables in the learning unit
is performed by virtue of the update equation for optimizing the primal variable being expressed by an equation that does not include a term of the primal variable before the update and that includes the dual variables,
the machine learning system. - In the machine learning system according to claim 6,
let V be a predetermined positive integer of 2 or more, let the plurality of node units be node units 1, ..., V, let ~V = {1, ..., V}, let B be a predetermined positive integer, let b = 1, ..., B and ~B = {1, ..., B}, let ~N(i) be the set of node units connected to node unit i, let λi|j,b be the b-th vector constituting the dual variable λi|j of node unit i with respect to node unit j, let λi|j,b^(t+1) be the dual variable λi|j,b after the (t+1)-th update, let fi be the cost function corresponding to node unit i used in the machine learning, let zi|j,b be the b-th vector constituting the dual auxiliary variable zi|j of the dual variable λi|j, let zi|j,b^(t+1) be the dual auxiliary variable zi|j,b after the (t+1)-th update, let yi|j,b be the b-th vector constituting the dual auxiliary variable yi|j of the dual variable λi|j, let yi|j,b^(t+1) be the dual auxiliary variable yi|j,b after the (t+1)-th update, let wi,b be the b-th element of the primal variable wi of node unit i, let wi,b* be the b-th element of the optimal solution of wi, let Ai|j be defined by the following equation, let I be the identity matrix, let O be the zero matrix, let σ1 be a predetermined positive number, and let T be a positive integer; then,
for t = 0, ..., T-1, (a) a process in which node unit i updates the dual variables based on the following equations, and
(b) a process in which node unit i updates the dual auxiliary variables based on the following equations,
are performed, and for some or all of t = 0, ..., T-1, in addition to the processes (a) and (b) above,
(c) a process in which, with i ∈ ~V, j ∈ ~N(i), b ∈ ~B, at least one node unit i transmits the dual auxiliary variable yi|j,b^(t+1) to at least one node unit j, and
(d) a process in which, with i ∈ ~V, j ∈ ~N(i), b ∈ ~B, a node unit i that has received the dual auxiliary variable yj|i,b^(t+1) sets zi|j,b^(t+1) = yj|i,b^(t+1),
are performed,
the machine learning system. - In a machine learning method in which a plurality of node units perform machine learning for obtaining a learning result for a single common primal variable,
the machine learning method includes:
a learning step in which the learning unit of each of the plurality of node units uses the input data input to the node unit to update the dual variables of the node unit such that, if the dual variables are optimized, the primal variable is also optimized; and
a transmission/reception step in which the transmission/reception unit of the node unit performs: transmission processing of transmitting, to node units other than the node unit, not the primal variable but the dual variables obtained by the learning unit of the node unit, or dual auxiliary variables that are auxiliary variables of the dual variables; and
reception processing of receiving, not the primal variable but the dual variables obtained by the learning units of the node units other than the node unit, or dual auxiliary variables that are auxiliary variables of the dual variables,
wherein
the learning step
updates the dual variables, when the reception processing has been performed in the transmission/reception step, using the received dual variables or the received dual auxiliary variables that are auxiliary variables of the dual variables,
the machine learning method. - In the machine learning method according to claim 9,
the update processing of the dual variables in the learning step
is performed by virtue of the update equation for optimizing the primal variable being expressed by an equation that does not include a term of the primal variable before the update and that includes the dual variables,
the machine learning method. - In the machine learning method according to claim 9,
let V be a predetermined positive integer of 2 or more, let the plurality of node units be node units 1, ..., V, let ~V = {1, ..., V}, let B be a predetermined positive integer, let b = 1, ..., B and ~B = {1, ..., B}, let ~N(i) be the set of node units connected to node unit i, let λi|j,b be the b-th vector constituting the dual variable λi|j of node unit i with respect to node unit j, let λi|j,b^(t+1) be the dual variable λi|j,b after the (t+1)-th update, let fi be the cost function corresponding to node unit i used in the machine learning, let zi|j,b be the b-th vector constituting the dual auxiliary variable zi|j of the dual variable λi|j, let zi|j,b^(t+1) be the dual auxiliary variable zi|j,b after the (t+1)-th update, let yi|j,b be the b-th vector constituting the dual auxiliary variable yi|j of the dual variable λi|j, let yi|j,b^(t+1) be the dual auxiliary variable yi|j,b after the (t+1)-th update, let wi,b be the b-th element of the primal variable wi of node unit i, let wi,b* be the b-th element of the optimal solution of wi, let Ai|j be defined by the following equation, let I be the identity matrix, let O be the zero matrix, let σ1 be a predetermined positive number, and let T be a positive integer; then,
for t = 0, ..., T-1, (a) a step in which node unit i updates the dual variables based on the following equations, and
(b) a step in which node unit i updates the dual auxiliary variables based on the following equations,
are performed, and for some or all of t = 0, ..., T-1, in addition to the steps (a) and (b) above,
(c) a step in which, with i ∈ ~V, j ∈ ~N(i), b ∈ ~B, at least one node unit i transmits the dual auxiliary variable yi|j,b^(t+1) to at least one node unit j, and
(d) a step in which, with i ∈ ~V, j ∈ ~N(i), b ∈ ~B, a node unit i that has received the dual auxiliary variable yj|i,b^(t+1) sets zi|j,b^(t+1) = yj|i,b^(t+1),
are performed,
the machine learning method. - A program for causing a computer to function as each unit of the machine learning system according to any one of claims 6 to 8.
Priority Applications (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US17/047,028 US12450524B2 (en) | 2018-04-12 | 2019-04-12 | Machine learning system, machine learning method, and program |
| JP2020513461A JP7091446B2 (ja) | 2018-04-12 | 2019-04-12 | 機械学習システム、機械学習方法、プログラム |
| JP2022037711A JP7309004B2 (ja) | 2018-04-12 | 2022-03-11 | 潜在変数学習装置、潜在変数学習方法、プログラム |
Applications Claiming Priority (10)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2018076815 | 2018-04-12 | ||
| JP2018-076817 | 2018-04-12 | ||
| JP2018076816 | 2018-04-12 | ||
| JP2018076817 | 2018-04-12 | ||
| JP2018-076815 | 2018-04-12 | ||
| JP2018-076814 | 2018-04-12 | ||
| JP2018076814 | 2018-04-12 | ||
| JP2018-076816 | 2018-04-12 | ||
| JP2018-202397 | 2018-10-29 | ||
| JP2018202397 | 2018-10-29 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2019198815A1 true WO2019198815A1 (ja) | 2019-10-17 |
Family
ID=68164056
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/JP2019/015973 Ceased WO2019198815A1 (ja) | 2018-04-12 | 2019-04-12 | 機械学習システム、機械学習方法、プログラム |
Country Status (3)
| Country | Link |
|---|---|
| US (1) | US12450524B2 (ja) |
| JP (2) | JP7091446B2 (ja) |
| WO (1) | WO2019198815A1 (ja) |
Cited By (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JPWO2022249436A1 (ja) * | 2021-05-28 | 2022-12-01 | ||
| JP2022184274A (ja) * | 2021-06-01 | 2022-12-13 | 日本電信電話株式会社 | 変数最適化システム |
| WO2022269680A1 (ja) * | 2021-06-21 | 2022-12-29 | 日本電信電話株式会社 | 変数最適化システム |
| JPWO2023228317A1 (ja) * | 2022-05-25 | 2023-11-30 |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US11132514B1 (en) * | 2020-03-16 | 2021-09-28 | Hong Kong Applied Science and Technology Research Institute Company Limited | Apparatus and method for applying image encoding recognition in natural language processing |
Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20140279771A1 (en) * | 2013-03-12 | 2014-09-18 | Oracle International Corporation | Novel Quadratic Regularization For Neural Network With Skip-Layer Connections |
Family Cites Families (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP2723231A4 (en) | 2011-06-23 | 2015-02-25 | Sync Rx Ltd | LUMINAL BACKGROUND CLEANING |
| US20150302317A1 (en) | 2014-04-22 | 2015-10-22 | Microsoft Corporation | Non-greedy machine learning for high accuracy |
-
2019
- 2019-04-12 WO PCT/JP2019/015973 patent/WO2019198815A1/ja not_active Ceased
- 2019-04-12 US US17/047,028 patent/US12450524B2/en active Active
- 2019-04-12 JP JP2020513461A patent/JP7091446B2/ja active Active
-
2022
- 2022-03-11 JP JP2022037711A patent/JP7309004B2/ja active Active
Patent Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20140279771A1 (en) * | 2013-03-12 | 2014-09-18 | Oracle International Corporation | Novel Quadratic Regularization For Neural Network With Skip-Layer Connections |
Non-Patent Citations (7)
| Title |
|---|
| CHANG, T. H. ET AL.: "Distributed Constrained Optimization by Consensus-Based Primal-Dual Perturbation Method", IEEE TRANSACTIONS ON AUTOMATIC CONTROL, vol. 59, no. 6, 26 February 2014 (2014-02-26), pages 1524 - 1538, XP011548974, ISSN: 0018-9286, DOI: 10.1109/TAC.2014.2308612 * |
| DUCHI, J. C. ET AL.: "Dual Averaging for Distributed Optimization: Convergence Analysis and Network Scaling", IEEE TRANSACTIONS ON AUTOMATIC CONTROL, vol. 57, no. 3, 30 June 2011 (2011-06-30), pages 592 - 606, XP011428520, ISSN: 0018-9286, DOI: 10.1109/TAC.2011.2161027 * |
| FEINBERG, V.: "Non-convex First Order Methods", 20 June 2017 (2017-06-20), XP055645593, Retrieved from the Internet <URL:https://vlad17.github.io/2017/06/20/nonconvex-first-order-methods.html> [retrieved on 20190619] * |
| MU LI ET AL.: "Scaling Distributed Machine Learning with the Parameter Server", PROCEEDINGS OF THE 11TH USENIX SYMPOSIUM ON OPERATING SYSTEMS DESIGN AND IMPLEMENTATION (OSDI'14), 8 October 2014 (2014-10-08), pages 583 - 598, XP061024674, ISBN: 978-1-931971-16-4, Retrieved from the Internet <URL:https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-li_mu.pdf> [retrieved on 20190311] * |
| RICHTARIK, P. ET AL.: "Modern Convex Optimization Methods for Large-Scale Empirical Risk Minimization (Part I: Primal Methods)", INTERNATIONAL CONFERENCE ON MACHINE LEARNING (ICML 2015), 6 July 2015 (2015-07-06), pages 1, 4, 8, 13, 20, 22, 28, 30, 36, 39, 41, 44, 46, 47, 51 , 54 , 57 - 63 , 68, Retrieved from the Internet <URL:https://icml.cc/2015/tutorials/2015_ICML_ConvexOptimization_I.pdf> [retrieved on 20190620] * |
| SHERSON, T. ET AL.: "Derivation and Analysis of the Primal-Dual Method of Multipliers Based on Monotone Operator Theory", TRANSACTIONS ON SIGNAL AND INFORMATION PROCESSING OVER NETWORKS, 6 November 2017 (2017-11-06), pages 1 - 13, XP081281630, Retrieved from the Internet <URL:https://arxiv.org/pdf/1706.02654v3.pdf> [retrieved on 20190618] * |
| TSIANOS, K. I. ET AL.: "Consensus-Based Distributed Optimization: Practical Issues and Applications in Large-Scale Machine Learning", PROCEEDINGS OF THE 50TH ANNUAL ALLERTON CONFERENCE ON COMMUNICATION, CONTROL, AND COMPUTING (ALLERTON 2012, 5 October 2012 (2012-10-05), pages 1543 - 1550, XP032345182, ISBN: 978-1-4673-4539-2, DOI: 10.1109/Allerton.2012.6483403 * |
Cited By (9)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JPWO2022249436A1 (ja) * | 2021-05-28 | 2022-12-01 | ||
| JP7548425B2 (ja) | 2021-05-28 | 2024-09-10 | 日本電信電話株式会社 | 変数最適化システム |
| JP2022184274A (ja) * | 2021-06-01 | 2022-12-13 | 日本電信電話株式会社 | 変数最適化システム |
| JP7522076B2 (ja) | 2021-06-01 | 2024-07-24 | 日本電信電話株式会社 | 変数最適化システム |
| WO2022269680A1 (ja) * | 2021-06-21 | 2022-12-29 | 日本電信電話株式会社 | 変数最適化システム |
| JPWO2022269680A1 (ja) * | 2021-06-21 | 2022-12-29 | ||
| JP7533788B2 (ja) | 2021-06-21 | 2024-08-14 | 日本電信電話株式会社 | 変数最適化システム |
| JPWO2023228317A1 (ja) * | 2022-05-25 | 2023-11-30 | ||
| JP7732591B2 (ja) | 2022-05-25 | 2025-09-02 | Ntt株式会社 | 連合学習システム、モデル学習装置、連合学習方法、モデル学習プログラム |
Also Published As
| Publication number | Publication date |
|---|---|
| JPWO2019198815A1 (ja) | 2021-06-17 |
| JP7091446B2 (ja) | 2022-06-27 |
| JP7309004B2 (ja) | 2023-07-14 |
| JP2022066553A (ja) | 2022-04-28 |
| US20210158226A1 (en) | 2021-05-27 |
| US12450524B2 (en) | 2025-10-21 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| JP7091446B2 (ja) | 機械学習システム、機械学習方法、プログラム | |
| CN113241126B (zh) | 用于训练确定分子结合力的预测模型的方法和装置 | |
| JP7179835B2 (ja) | モデル生成装置、モデル生成方法、プログラム | |
| Li et al. | Robust stochastic configuration networks with maximum correntropy criterion for uncertain data regression | |
| De Bie | Clifford algebras, Fourier transforms, and quantum mechanics | |
| Jiang et al. | Modeling and prediction of the transmission dynamics of COVID-19 based on the SINDy-LM method | |
| CN114830152A (zh) | 用于工业规模的结构数字孪生的基于组件的降阶建模方法和系统 | |
| Zhang et al. | Design and analysis of recurrent neural network models with non‐linear activation functions for solving time‐varying quadratic programming problems | |
| Köfinger et al. | Empirical optimization of molecular simulation force fields by Bayesian inference | |
| CN113254716B (zh) | 视频片段检索方法、装置、电子设备和可读存储介质 | |
| Bayer et al. | Sorting-free Hill-based stability analysis of periodic solutions through Koopman analysis | |
| Pan | Distributed optimization and statistical learning for large-scale penalized expectile regression | |
| CN113239799B (zh) | 训练方法、识别方法、装置、电子设备和可读存储介质 | |
| Calleja et al. | Constructive approaches to QP-time-dependent KAM theory for Lagrangian tori in Hamiltonian systems | |
| Rebala et al. | Principal component analysis | |
| Xu | Dynamics of some coupled nonlinear Schrödinger systems in | |
| Bevanda et al. | Koopman-equivariant gaussian processes | |
| WO2022269680A1 (ja) | 変数最適化システム | |
| CN116830114A (zh) | 一种用于自注意力神经网络的归一化方案 | |
| JP7655446B2 (ja) | 解析装置及びプログラム | |
| Liang et al. | A learning-based approach to event-triggered guaranteed cost control for completely unknown nonlinear systems | |
| JP2022184274A (ja) | 変数最適化システム | |
| Channark et al. | Generalized and optimal sequence of weights on a progressive‐iterative approximation method with memory for least square fitting | |
| WO2022249436A1 (ja) | 変数最適化システム | |
| Li et al. | Subspace identification of closed-loop EIV system based on instrumental variables using orthoprojection |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 19786073 Country of ref document: EP Kind code of ref document: A1 |
|
| ENP | Entry into the national phase |
Ref document number: 2020513461 Country of ref document: JP Kind code of ref document: A |
|
| NENP | Non-entry into the national phase |
Ref country code: DE |
|
| 122 | Ep: pct application non-entry in european phase |
Ref document number: 19786073 Country of ref document: EP Kind code of ref document: A1 |
|
| WWG | Wipo information: grant in national office |
Ref document number: 17047028 Country of ref document: US |