
WO2018212711A1 - Predictive analysis methods and systems - Google Patents

Predictive analysis methods and systems

Info

Publication number
WO2018212711A1
WO2018212711A1 (PCT/SG2018/050234; SG2018050234W)
Authority
WO
WIPO (PCT)
Prior art keywords
vector
hidden layer
pooling
embedding
layer stack
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/SG2018/050234
Other languages
French (fr)
Inventor
Xiangnan HE
Tat-Seng CHUA
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Singapore
Original Assignee
National University of Singapore
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Singapore filed Critical National University of Singapore
Publication of WO2018212711A1 publication Critical patent/WO2018212711A1/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00: Digital computers in general; Data processing equipment in general
    • G06F 15/76: Architectures of general purpose stored program computers
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/048: Activation functions
    • G06N 3/0495: Quantised networks; Sparse networks; Compressed networks
    • G06N 3/0499: Feedforward networks
    • G06N 3/08: Learning methods
    • G06N 3/082: Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G06N 3/09: Supervised learning

Definitions

  • the present disclosure relates to predictive analysis using machine learning and more specifically to modeling interactions between predictor variables in predictive analysis.
  • Predictive analytics is one of the most important techniques for many information retrieval (IR) and data mining (DM) tasks, ranging from recommendation systems and targeted advertising to search ranking and sentiment analysis.
  • a predictive task is formulated as estimating a function that maps predictor variables to some target, for example, a real-valued target for regression or a categorical target for classification. Distinct from the continuous predictor variables that are naturally found in images and audio, such as raw features, the predictor variables for web applications are mostly discrete and categorical. For example, in online advertising, we need to predict how likely it is (target) that a user (first predictor variable) of a particular occupation (second predictor variable) will click on an ad (third predictor variable).
  • the generated feature vector can be of very high dimension but sparse.
  • many successful solutions in both industry and academia largely rely on manually crafting combinatorial features, i.e., constructing new features by combining multiple predictor variables, also known as cross features.
  • top data scientists are usually masters of crafting combinatorial features, which play a key role in their winning formulas.
  • the power of such features comes at a high cost, since it requires heavy engineering efforts and useful domain knowledge to design effective features.
  • these solutions can be difficult to generalize to new problems or domains.
  • a predictive analytical method comprises: receiving a set of predictor variables as an input feature vector comprising a plurality of features; projecting each feature of the feature vector onto a dense vector representation to obtain a set of embedding vectors representing the input feature vector in an embedding space; converting the set of embedding vectors into a bi-interaction pooling vector that encodes second-order interactions between features of the feature vector in the embedding space; inputting the bi-interaction pooling vector into a hidden layer stack, the hidden layer stack comprising at least one hidden layer of neural network nodes; and transforming an output vector of the hidden layer stack into a prediction score.
  • Embodiments of the present invention have the following features: a Bi-Interaction pooling operation is introduced in neural network modeling, and a new neural network view of FM is presented. Based on this view, a novel neural factorization machines (NFM) model that deepens FM under the neural network framework for learning higher-order and non-linear feature interactions is provided.
  • At least one layer of the hidden layer stack has a non-linear activation function.
  • the non-linear activation function may be a sigmoid, hyperbolic tangent, or rectifier function.
  • converting the set of embedding vectors comprises performing a pooling operation on the set of embedding vectors.
  • converting the set of embedding vectors into the bi-interaction pooling vector comprises calculating an element-wise product of embedding vectors of the set of embedding vectors.
  • the input feature vector is a sparse vector.
  • the input feature vector may be one-hot encoded.
  • a supervised machine learning method comprises: receiving training data comprising a plurality of sets of predictor variables and target values; for each set of predictor variables: projecting each feature of the feature vector onto a dense vector representation to obtain a set of embedding vectors representing the input feature vector in an embedding space; converting the set of embedding vectors into a bi-interaction pooling vector that encodes second-order interactions between features of the feature vector in the embedding space by performing a pooling operation on the set of embedding vectors; inputting the bi-interaction pooling vector into a hidden layer stack, the hidden layer stack comprising at least one hidden layer of neural network nodes; and transforming an output vector of the hidden layer stack into a prediction score; and optimizing parameters of the pooling operation and the hidden layer stack by optimizing an objective function.
  • optimizing the objective function comprises carrying out stochastic gradient descent.
  • a data processing system comprises a processor and a data storage device.
  • the data storage device stores computer executable instructions operable by the processor to: receive a set of predictor variables as an input feature vector comprising a plurality of features; project each feature of the feature vector onto a dense vector representation to obtain a set of embedding vectors representing the input feature vector in an embedding space; convert the set of embedding vectors into a bi-interaction pooling vector that encodes second-order interactions between features of the feature vector in the embedding space; input the bi-interaction pooling vector into a hidden layer stack, the hidden layer stack comprising at least one hidden layer of neural network nodes; and transform an output vector of the hidden layer stack into a prediction score.
  • At least one layer of the hidden layer stack has a non-linear activation function.
  • the non-linear activation function may be a sigmoid, hyperbolic tangent, or rectifier function.
  • the data storage device comprises instructions operable by the processor to convert the set of embedding vectors into a bi-interaction pooling vector by performing a pooling operation on the set of embedding vectors.
  • the data storage device comprises instructions operable by the processor to convert the set of embedding vectors into the bi-interaction pooling vector by calculating an element-wise product of embedding vectors of the set of embedding vectors.
  • the input feature vector is a sparse vector and / or is one-hot encoded.
  • the data storage device further comprises instructions to perform a ranking operation using the prediction score.
  • the data storage device stores computer executable instructions operable by the processor to perform a machine learning method comprising: receiving training data comprising a plurality of sets of predictor variables and target values; for each set of predictor variables: projecting each feature of the feature vector onto a dense vector representation to obtain a set of embedding vectors representing the input feature vector in an embedding space; converting the set of embedding vectors into a bi-interaction pooling vector that encodes second-order interactions between features of the feature vector in the embedding space by performing a pooling operation on the set of embedding vectors; inputting the bi- interaction pooling vector into a hidden layer stack, the hidden layer stack comprising at least one hidden layer of neural network nodes; and transforming an output vector of the hidden layer stack into a prediction score; and optimizing parameters of the pooling operation and the hidden layer stack by optimizing an objective function.
  • optimizing the objective function comprises carrying out stochastic gradient descent.
  • a non-transitory computer- readable medium has stored thereon program instructions for causing at least one processor to perform operations of a method disclosed above.
  • Figure 1 is a block diagram showing a neural factorization machines model architecture according to an embodiment of the present invention
  • Figure 2 is a block diagram showing a technical architecture of a data processing system according to an embodiment of the present invention
  • Figure 3 is a flowchart showing a method of predictive analysis using neural factorization machines according to an embodiment of the present invention
  • Figure 4 is a flowchart showing a method of learning the parameters of a neural factorization machines system according to an embodiment of the present invention
  • Figures 5A and 5B are graphs showing a comparison of validation error with respect to dropout on the bi-interaction layer of embodiments of the present invention
  • Figures 6A and 6B are graphs showing training and validation error of each epoch to illustrate the effect of dropout in the bi-interaction layer of embodiments of the present invention
  • Figures 7A and 7B are graphs showing training and validation error of each epoch to illustrate the effect of batch normalization in the bi-interaction layer of embodiments of the present invention
  • Figures 8A and 8B are graphs showing root mean square error for different activation functions in a first hidden layer in embodiments of the present invention.
  • Figures 9A and 9B are graphs showing the effect of pre-training in embodiments of the present invention.
  • Figures 10A and 10B are graphs showing a performance comparison for root mean square error with respect to different numbers of latent factors for embodiments of the present disclosure and state of the art methods.
  • Neural Factorization Machines which enhances FMs by modeling higher-order and non-linear feature interactions.
  • the third term f(x) is the core component of NFM for modeling feature interactions, which is a multi-layered feed-forward neural network as shown in Figure 1. In what follows, we elaborate the design of f(x) layer by layer.
  • Figure 1 illustrates a neural factorization machines model architecture according to an embodiment of the present invention.
  • the input to the model 100 is a sparse input feature vector 110 which comprises a plurality of input feature values 112.
  • the input feature vector 110 is input into an embedding layer 120.
  • the embedding layer 120 is a fully connected layer that projects each feature 112 to a dense vector representation.
  • the embedding layer 120 comprises a plurality of embedding vectors 122, each corresponding to a feature 112 of the input feature vector 110.
  • let v_i ∈ ℝ^k be the embedding vector for the i-th feature.
  • the set of embedding vectors is then fed into a bi-interaction layer 130.
  • the Bi-Interaction layer is a pooling operation that converts the set of embedding vectors V_x into one vector: f_BI(V_x) = Σ_{i=1}^{n} Σ_{j=i+1}^{n} x_i v_i ⊙ x_j v_j, where ⊙ denotes the element-wise product of two vectors.
  • the output of Bi-Interaction pooling is a k-dimension vector that encodes the second-order interactions between features in the embedding space.
  • above the Bi-Interaction pooling layer 130 is a stack of standard fully connected layers 140.
  • the stack 140 comprises a plurality of hidden layers 142 which are capable of learning higher-order interactions between features. Formally, they are defined as: z_1 = σ_1(W_1 f_BI(V_x) + b_1), z_2 = σ_2(W_2 z_1 + b_2), ..., z_L = σ_L(W_L z_{L-1} + b_L),
  • where L denotes the number of hidden layers,
  • and W_l, b_l and σ_l denote the weight matrix, bias vector and activation function for the l-th layer, respectively.
  • the output of the hidden layer stack 140 is transformed into a prediction score 150.
  • the output vector of the last hidden layer, z_L, is transformed to the final prediction score 150 using the projection f(x) = hᵀ z_L.
  • vector h denotes the neuron weights of the prediction layer.
  • stochastic gradient descent is a universal solver for optimizing a neural network model.
  • the gradient of the NFM model with respect to a model parameter can be obtained with the standard chain rule.
  • as the Bi-Interaction pooling layer is a new operation proposed by this work, we give its derivative as follows: ∂f_BI(V_x)/∂v_i = x_i Σ_{j=1}^{n} x_j v_j − x_i² v_i, which can be computed in O(kN_x) time, the same time complexity as computing the Bi-Interaction operation.
  • Dropout is a regularization technique for neural networks to prevent overfitting. The idea is to randomly drop neurons (along with their connections) of the neural network during training. That is, in each parameter update, only the part of the model parameters that contributes to the prediction of ŷ(x) will be updated. Through this process, it can prevent complex co-adaptations of neurons on training data. It is important to note that in the testing phase, dropout must be disabled and the whole network is used for estimating ŷ(x). As such, dropout can also be seen as performing model averaging with smaller neural networks.
  • in NFM, to avoid feature embeddings co-adapting to each other and overfitting the data, we propose to adopt dropout on the Bi-Interaction layer. Specifically, after obtaining f_BI(V_x), which is a k-dimensional vector of latent factors, we randomly drop p percent of the latent factors, where p is termed the dropout ratio. Since NFM with no hidden layer degrades to the FM model, this can be seen as a new way to regularize FM. Moreover, we also apply dropout on each hidden layer of NFM to prevent the learning of higher-order feature interactions from co-adaptations and overfitting.
  • One difficulty of training multi-layered neural networks is caused by covariate shift.
  • Ioffe and Szegedy [S. Ioffe and C. Szegedy, 'Batch normalization: Accelerating deep network training by reducing internal covariate shift', In ICML, pages 448-456, 2015] proposed batch normalization (BN), which normalizes layer inputs to a zero-mean unit-variance Gaussian distribution for each training mini-batch. It has been shown that BN leads to faster convergence and better performance in several computer vision tasks.
  • σ_B² = (1/|B|) Σ_{x_i ∈ B} (x_i − μ_B)² denotes the mini-batch variance
  • γ and β are trainable parameters (vectors) to scale and shift the normalized value to restore the representation power of the network.
  • μ_B and σ_B are estimated from the whole training data.
  • in NFM, to avoid the update of feature embeddings changing the input distribution to the hidden layers or prediction layer, we perform BN on the output of the Bi-Interaction pooling. For each successive hidden layer, the BN is also applied.
  • FIG. 2 is a block diagram showing a technical architecture 200 of a data processing system according to an embodiment of the present invention.
  • the methods of optimizing model parameters and predictive analysis using neural factorization machines according to embodiments of the present invention are implemented on a computer, or a number of computers, each having a data- processing unit.
  • the block diagram as shown in Figure 2 illustrates a technical architecture 200 of a computer which is suitable for implementing one or more embodiments herein.
  • the technical architecture 200 includes a processor 222 (which may be referred to as a central processor unit or CPU) that is in communication with memory devices including secondary storage 224 (such as disk drives), read only memory (ROM) 226, random access memory (RAM) 228.
  • the processor 222 may be implemented as one or more CPU chips.
  • the technical architecture 200 may further comprise input/output (I/O) devices 230, and network connectivity devices 232.
  • the technical architecture 200 further comprises activity table storage 240 which may be implemented as a hard disk drive or other type of storage device.
  • the secondary storage 224 is typically comprised of one or more disk drives or tape drives and is used for non-volatile storage of data and as an overflow data storage device if RAM 228 is not large enough to hold all working data. Secondary storage 224 may be used to store programs which are loaded into RAM 228 when such programs are selected for execution.
  • the secondary storage 224 has an input vector receiving module 224a, an embedding layer module 224b, a bi-interaction pooling module 224c, a hidden layer stack module 224d, a prediction score calculation module 224e and an optimization module 224f comprising non-transitory instructions operative by the processor 222 to perform various operations of the methods of the present disclosure.
  • the modules 224a-224f are distinct modules which perform respective functions implemented by the data processing system. It will be appreciated that the boundaries between these modules are exemplary only, and that alternative embodiments may merge modules or impose an alternative decomposition of functionality of modules. For example, the modules discussed herein may be decomposed into sub-modules to be executed as multiple computer processes, and, optionally, on multiple computers. Moreover, alternative embodiments may combine multiple instances of a particular module or sub-module.
  • modules 224a-224f may alternatively be implemented as one or more hardware modules (such as field-programmable gate array(s) or application-specific integrated circuit(s)) comprising circuitry which implements equivalent functionality to that implemented in software.
  • the ROM 226 is used to store instructions and perhaps data which are read during program execution.
  • the secondary storage 224, the RAM 228, and/or the ROM 226 may be referred to in some contexts as computer readable storage media and/or non-transitory computer readable media.
  • the I/O devices 230 may include printers, video monitors, liquid crystal displays (LCDs), plasma displays, touch screen displays, keyboards, keypads, switches, dials, mice, track balls, voice recognizers, card readers, paper tape readers, or other well-known input devices.
  • the network connectivity devices 232 may take the form of modems, modem banks, Ethernet cards, universal serial bus (USB) interface cards, serial interfaces, token ring cards, fiber distributed data interface (FDDI) cards, wireless local area network (WLAN) cards, radio transceiver cards that promote radio communications using protocols such as code division multiple access (CDMA), global system for mobile communications (GSM), long-term evolution (LTE), worldwide interoperability for microwave access (WiMAX), near field communications (NFC), radio frequency identity (RFID), and/or other air interface protocol radio transceiver cards, and other well-known network devices.
  • These network connectivity devices 232 may enable the processor 222 to communicate with the Internet or one or more intranets.
  • the processor 222 might receive information from the network, or might output information to the network in the course of performing the method operations described herein. Such information, which is often represented as a sequence of instructions to be executed using processor 222, may be received from and outputted to the network, for example, in the form of a computer data signal embodied in a carrier wave.
  • the processor 222 executes instructions, codes, computer programs, scripts which it accesses from hard disk, floppy disk, optical disk (these various disk based systems may all be considered secondary storage 224), flash drive, ROM 226, RAM 228, or the network connectivity devices 232. While only one processor 222 is shown, multiple processors may be present. Thus, while instructions may be discussed as executed by a processor, the instructions may be executed simultaneously, serially, or otherwise executed by one or multiple processors.
  • the technical architecture 200 is described with reference to a computer, it should be appreciated that the technical architecture may be formed by two or more computers in communication with each other that collaborate to perform a task.
  • an application may be partitioned in such a way as to permit concurrent and/or parallel processing of the instructions of the application.
  • the data processed by the application may be partitioned in such a way as to permit concurrent and/or parallel processing of different portions of a data set by the two or more computers.
  • virtualization software may be employed by the technical architecture 200 to provide the functionality of a number of servers that is not directly bound to the number of computers in the technical architecture 200.
  • the functionality disclosed above may be provided by executing the application and/or applications in a cloud computing environment.
  • Cloud computing may comprise providing computing services via a network connection using dynamically scalable computing resources.
  • a cloud computing environment may be established by an enterprise and/or may be hired on an as-needed basis from a third party provider.
  • Figure 3 is a flowchart showing a method of predictive analysis using neural factorization machines according to an embodiment of the present invention. The method 300 is carried out on the data processing system 200 shown in Figure 2.
  • the input vector receiving module 224a of the data processing system 200 receives predictor variables as an input feature vector.
  • the input feature vector may be a sparse vector 110 encoded by one-hot encoding.
  • in step 304, the embedding layer module 224b of the data processing system 200 projects the input feature vector onto an embedding space to obtain a set of embedding vectors 122.
  • in step 306, the bi-interaction pooling module 224c of the data processing system 200 converts the set of embedding vectors 122 into a bi-interaction pooling vector that encodes second order interactions between features of the feature vector in the embedding space.
  • the bi-interaction pooling vector is input into the hidden layer stack 140 by the hidden layer stack module 224d of the data processing system 200.
  • Each of the layers 142 of the hidden layer stack 140 comprises a plurality of nodes which perform calculations and transfer information from Layer 1 to Layer L.
  • in step 310, the prediction score calculation module 224e of the data processing system 200 transforms the output vector of the hidden layer stack 140 into a prediction score 150.
  • Figure 4 is a flowchart showing a method of learning the parameters of a neural factorization machines system according to an embodiment of the present invention.
  • the method 400 shown in Figure 4 may be carried out on the data processing system 200 shown in Figure 2.
  • the data processing system 200 receives training data which comprises a plurality of sets of predictor variables, with each set forming an input feature vector.
  • the input feature vectors may be sparse vectors encoded by one-hot encoding.
  • the data processing system 200 calculates a prediction score for each set of predictor variables using the training data received in step 402.
  • the prediction scores calculated for the training data are parameterized by model variables. Step 404 is carried out according to the method 300 shown in Figure 3.
  • in step 406, the optimization module 224f of the data processing system 200 optimizes the parameters of the model.
  • the optimization carried out in step 406 may comprise minimizing an objective function.
  • Embodiments of the NFM systems and methods can be used as the ranking engine for recommender systems. We now discuss how to apply NFM to build an E-commerce recommendation system.
  • Frappe is a context-aware app discovery tool.
  • This dataset is constructed by Baltrunas et al. [L. Baltrunas, K. Church, A. Karatzoglou, and N. Oliver, 'Frappe: Understanding the usage and perception of mobile app recommendations in-the- wild', CoRR, abs/1505.03014, 2015]. It contains 96,203 app usage logs of users under different contexts. Besides user ID and app ID, each log contains 8 context variables, including weather, city and daytime (e.g., morning or afternoon). We converted each log (i.e., user ID, app ID and all context variables) to a feature vector using one-hot encoding, resulting in 5,382 features in total. A target value of 1 means the user has used the app under the context.
  • MovieLens is the full version of the latest MovieLens data published by GroupLens [F. M. Harper and J. A. Konstan, 'The MovieLens datasets: History and context', ACM Transactions on Interactive Intelligent Systems, 5:19:1-19:19, 2015].
  • the tagging part of the data includes 668,953 tag applications of 17,045 users on 23,743 items with 49,657 distinct tags.
  • we converted each tag application (i.e., user ID, movie ID and tag) to a feature vector using one-hot encoding.
  • a target value of 1 means the user has assigned the tag to the movie.
  • DeepCross [Y. Shan, T. R. Hoens, J. Jiao, H. Wang, D. Yu, and J. Mao, 'Deep crossing: Webscale modeling without manually crafted combinatorial features', In KDD, pages 255-262.].
  • Figures 5A and 5B show the validation error of NFM-0 with respect to the dropout ratio on the Bi-Interaction layer and L2 regularization on feature embeddings for the Frappe and MovieLens datasets respectively.
  • the performance of linear regression (LR) is also shown for benchmarking the performance of prediction that does not consider feature interactions.
  • using a dropout ratio of 0.3 leads to the lowest validation error of 0.3562, which is significantly better than that of L2 regularization (0.3799).
  • Figures 6A and 6B show training and validation error of each epoch of NFM-0 with and without dropout for Frappe and MovieLens datasets respectively. Both datasets show that with a dropout ratio of 0.3, although the training error is higher, the validation error becomes lower. This demonstrates the ability of dropout in preventing overfitting and as such, better generalization can be achieved.
  • Figures 7A and 7B show the training error of each epoch of NFM-0 with and without batch normalization on the Bi-Interaction layer for the Frappe and MovieLens datasets respectively.
  • the dropout is enabled with a ratio of 0.3, and the learning rate is set to 0.02. Focusing on the training error, we can see that batch normalization (BN) leads to a faster convergence; on Frappe, when BN is applied, the training error of epoch 20 is even lower than that of epoch 60 without BN; and the validation error indicates that the lower training error is not overfitting.
  • the hidden layers of NFM play a pivotal role in capturing higher order interactions between features.
  • NFM-0 we set the size of the hidden layer the same as the embedding size.
  • Figures 8A and 8B show the validation error of NFM with respect to different activation functions and dropout ratios for the hidden layer for Frappe and MovieLens datasets respectively.
  • the performance of LibFM and NFM-0 are also shown for benchmarking purposes.
  • NFM's performance is improved by a large margin — compared to NFM-0, which has a similar performance to LibFM, the relative improvement is 11.3% and 5.2% for Frappe and MovieLens, respectively.
  • among the different non-linear activation functions, there is no obvious winner.
  • when the identity function is used, so that the hidden layer performs only a linear transformation, NFM does not perform as well. This provides evidence of the necessity of learning higher-order feature interactions with non-linear functions.
  • Figures 9A and 9B show the state of each epoch of NFM-1 with and without pre- training for Frappe and MovieLens datasets respectively.
  • NFM exhibits extremely fast convergence— on both datasets, with 5 epochs only, the performance is on par with 40 epochs of NFM that is trained from scratch (with BN enabled).
  • pre-training does not improve NFM's final performance, and a random initialization can achieve a result that is slightly better than that with pre-training.
  • This demonstrates the robustness of NFM, which is relatively insensitive to parameter initialization.
  • NFM is much easier to train and optimize, which is due largely to the informative and effective Bi-lnteraction pooling operation.
  • Figures 10A and 10B are graphs showing the test RMSE with respect to different numbers of latent factors (i.e., embedding sizes), where Wide&Deep and DeepCross are pre-trained with FM to better explore the two methods, for the Frappe and MovieLens datasets respectively.
  • Table 3 shows the concrete scores obtained on factors 128 and 256, and the number of model parameters of each method. The scores of Wide&Deep and DeepCross without pre-training are also shown.
  • Table 3: Test error and number of trainable parameters for different methods on latent factors 128 and 256. M denotes "million"; * and ** denote statistical significance for p < 0.05 and p < 0.01, respectively.
  • NFM consistently achieves the best performance on both datasets with the fewest model parameters besides FM.
  • the performance of NFM is followed by Wide&Deep, which uses a 3-layer MLP to learn feature interactions. We have also tried deeper layers for Wide&Deep; however, the performance was not improved. This further verifies the utility of using the informative Bi-Interaction pooling at the low level.
  • HOFM shows a slight improvement over FM, with a 1.45% and 1.04% average improvement on Frappe and MovieLens, respectively.
  • DeepCross is the deepest method among all baselines that utilizes a 10-layer network.
  • on Frappe, DeepCross only achieves performance comparable to the shallow FM model, while it underperforms FM significantly on MovieLens.
  • the reasons are optimization difficulties and overfitting (as evidenced by the worse performance on factors 128 and 256).
  • NFM is more expressive than FM by modelling higher-order and non-linear feature interactions with only k² more parameters.
  • HOFM Compared to HOFM, NFM models higher-order interactions in a non-linear way with much fewer parameters.
  • NFM captures more informative interactions at the lower level and does not require a deep structure to predict well.
  • the key to NFM's architecture is the newly proposed Bi-Interaction operation, based on which we allow a neural network model to learn more informative feature interactions at the lower level.
  • Extensive experiments on two real-world datasets show that with one hidden layer only, NFM significantly outperforms FM, higher-order FM, and the state-of-the-art deep learning approaches Wide&Deep and DeepCross. The work bridges the gap between linear models and deep learning.
  • Linear models, such as various factorization methods, have been shown to be effective for information retrieval (IR) and data mining (DM) tasks and are easy to interpret.
  • their limited expressiveness may hinder the performance when modeling real-world data with complex inherent patterns.
  • although deep learning models have exhibited great expressive power and yielded immense success in speech processing and computer vision, their performance is still unsatisfactory for IR tasks, such as collaborative filtering.
  • data in IR and DM tasks are naturally sparse and, to date, there is still a lack of effective deep learning solutions for prediction with sparse data.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Methods and systems for predictive analysis are disclosed. A predictive analysis method comprises: receiving a set of predictor variables as an input feature vector comprising a plurality of features; projecting each feature of the feature vector onto a dense vector representation to obtain a set of embedding vectors representing the input feature vector in an embedding space; converting the set of embedding vectors into a bi-interaction pooling vector that encodes second-order interactions between features of the feature vector in the embedding space; inputting the bi-interaction pooling vector into a hidden layer stack, the hidden layer stack comprising at least one hidden layer of neural network nodes; and transforming an output vector of the hidden layer stack into a prediction score.

Description

PREDICTIVE ANALYSIS METHODS AND SYSTEMS
FIELD OF THE INVENTION

The present disclosure relates to predictive analysis using machine learning and more specifically to modeling interactions between predictor variables in predictive analysis.
BACKGROUND OF THE INVENTION
Predictive analytics is one of the most important techniques for many information retrieval (IR) and data mining (DM) tasks, ranging from recommendation systems and targeted advertising to search ranking and sentiment analysis. Typically, a predictive task is formulated as estimating a function that maps predictor variables to some target, for example, a real-valued target for regression or a categorical target for classification. Distinct from the continuous predictor variables that are naturally found in images and audio, such as raw features, the predictor variables for web applications are mostly discrete and categorical. For example, in online advertising, we need to predict how likely it is (target) that a user (first predictor variable) of a particular occupation (second predictor variable) will click on an ad (third predictor variable). To build predictive models with these categorical predictor variables, a common solution is to convert them to a set of binary features (also known as a feature vector) via one-hot encoding. Thereafter, standard machine learning (ML) techniques such as logistic regression and support vector machines can be applied.
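As a minimal illustration of the one-hot encoding step described above, the following Python sketch encodes a few categorical fields into one sparse binary vector; the field names and vocabulary sizes are hypothetical, not values taken from the patent.

```python
# Minimal sketch: one-hot encoding categorical predictor variables into a
# single sparse binary feature vector. Field names and sizes are hypothetical.
import numpy as np

FIELDS = {"user_id": 1000, "occupation": 50, "ad_id": 5000}

def one_hot_encode(instance):
    """Concatenate one one-hot block per categorical field."""
    blocks = []
    for field, size in FIELDS.items():
        block = np.zeros(size, dtype=np.float32)
        block[instance[field]] = 1.0
        blocks.append(block)
    return np.concatenate(blocks)

x = one_hot_encode({"user_id": 42, "occupation": 7, "ad_id": 1234})
print(x.shape, int(x.sum()))  # (6050,) 3 -- high-dimensional but sparse
```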
Depending on the possible values of categorical predictor variables, the generated feature vector can be of very high dimension but sparse. To build effective ML models with such sparse data, it is crucial to account for the interactions between features. Many successful solutions in both industry and academia largely rely on manually crafting combinatorial features, i.e., constructing new features by combining multiple predictor variables, also known as cross features. For example, we can cross variable occupation = {banker, doctor} with gender = {M, F} and get a new occupation_gender = {banker_M, banker_F, doctor_M, doctor_F}. It is well known that top data scientists are usually masters of crafting combinatorial features, which play a key role in their winning formulas. However, the power of such features comes at a high cost, since it requires heavy engineering efforts and useful domain knowledge to design effective features. Thus these solutions can be difficult to generalize to new problems or domains.
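To make the cross-feature example above concrete, here is a minimal, purely illustrative sketch of crossing the occupation and gender variables:

```python
# Minimal sketch: manually crafting a combinatorial (cross) feature by
# combining two categorical predictor variables, as in the example above.
from itertools import product

occupation = ["banker", "doctor"]
gender = ["M", "F"]

# The crossed variable takes one value per (occupation, gender) combination.
occupation_gender = [f"{o}_{g}" for o, g in product(occupation, gender)]
print(occupation_gender)  # ['banker_M', 'banker_F', 'doctor_M', 'doctor_F']
```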
Instead of augmenting feature vectors manually, another solution is to design an ML model that learns feature interactions from raw data automatically. A popular approach is factorization machines (FMs), which embed features into a latent space and model the interactions between features via the inner product of their embedding vectors. While FM has yielded great promise in many prediction tasks, we argue that its performance can be limited by its linearity, as well as by its modeling of only pairwise (i.e., second-order) feature interactions. Specifically, for real-world data that have a complex and non-linear underlying structure, FM may not be expressive enough. Although higher-order FMs have been proposed by Rendle [S. Rendle, 'Factorization machines', In ICDM, pages 995-1000, 2010], they still belong to the family of linear models and are claimed to be difficult to estimate.
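For reference, a minimal NumPy sketch of the generic FM prediction (standard FM formulation, not code from the patent; the sizes and random values are arbitrary): each pairwise interaction is weighted by the inner product of the two features' embedding vectors.

```python
# Minimal sketch of a factorization machine (FM) prediction: linear terms plus
# pairwise interactions weighted by inner products of feature embeddings.
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 4                       # number of features, embedding size
x = rng.random(n)                 # real-valued feature vector
V = rng.normal(size=(n, k))       # one k-dimensional embedding per feature
w0, w = 0.1, rng.normal(size=n)   # global bias and per-feature weights

pairwise = sum(np.dot(V[i], V[j]) * x[i] * x[j]
               for i in range(n) for j in range(i + 1, n))
y_fm = w0 + w @ x + pairwise
print(y_fm)
```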
SUMMARY OF THE INVENTION

According to a first aspect of the present disclosure, a predictive analytical method comprises: receiving a set of predictor variables as an input feature vector comprising a plurality of features; projecting each feature of the feature vector onto a dense vector representation to obtain a set of embedding vectors representing the input feature vector in an embedding space; converting the set of embedding vectors into a bi-interaction pooling vector that encodes second-order interactions between features of the feature vector in the embedding space; inputting the bi-interaction pooling vector into a hidden layer stack, the hidden layer stack comprising at least one hidden layer of neural network nodes; and transforming an output vector of the hidden layer stack into a prediction score.
Embodiments of the present invention have the following features: a Bi-Interaction pooling operation is introduced in neural network modeling, and a new neural network view of FM is presented. Based on this view, a novel neural factorization machines (NFM) model that deepens FM under the neural network framework for learning higher-order and non-linear feature interactions is provided.
In an embodiment, at least one layer of the hidden layer stack has a non-linear activation function. The non-linear activation function may be a sigmoid, hyperbolic tangent, or rectifier function.
In an embodiment, converting the set of embedding vectors comprises performing a pooling operation on the set of embedding vectors.
In an embodiment, converting the set of embedding vectors into the bi-interaction pooling vector comprises calculating an element-wise product of embedding vectors of the set of embedding vectors. In an embodiment, the input feature vector is a sparse vector. The input feature vector may be one-hot encoded.
The analytical method may form part of a ranking method. According to a second aspect of the present disclosure, a supervised machine learning method comprises: receiving training data comprising a plurality of sets of predictor variables and target values; for each set of predictor variables: projecting each feature of the feature vector onto a dense vector representation to obtain a set of embedding vectors representing the input feature vector in an embedding space; converting the set of embedding vectors into a bi-interaction pooling vector that encodes second-order interactions between features of the feature vector in the embedding space by performing a pooling operation on the set of embedding vectors; inputting the bi-interaction pooling vector into a hidden layer stack, the hidden layer stack comprising at least one hidden layer of neural network nodes; and transforming an output vector of the hidden layer stack into a prediction score; and optimizing parameters of the pooling operation and the hidden layer stack by optimizing an objective function. In an embodiment, optimizing the objective function comprises carrying out stochastic gradient descent.
According to a third aspect of the present disclosure, a data processing system comprises a processor and a data storage device. The data storage device stores computer executable instructions operable by the processor to: receive a set of predictor variables as an input feature vector comprising a plurality of features; project each feature of the feature vector onto a dense vector representation to obtain a set of embedding vectors representing the input feature vector in an embedding space; convert the set of embedding vectors into a bi-interaction pooling vector that encodes second-order interactions between features of the feature vector in the embedding space; input the bi-interaction pooling vector into a hidden layer stack, the hidden layer stack comprising at least one hidden layer of neural network nodes; and transform an output vector of the hidden layer stack into a prediction score.
In an embodiment, at least one layer of the hidden layer stack has a non-linear activation function. The non-linear activation function may be a sigmoid, hyperbolic tangent, or rectifier function.
In an embodiment, the data storage device comprises instructions operable by the processor to convert the set of embedding vectors into a bi-interaction pooling vector by performing a pooling operation on the set of embedding vectors. In an embodiment, the data storage device comprises instructions operable by the processor to convert the set of embedding vectors into the bi-interaction pooling vector by calculating an element-wise product of embedding vectors of the set of embedding vectors. In an embodiment, the input feature vector is a sparse vector and / or is one-hot encoded.
In an embodiment, the data storage device further comprises instructions to perform a ranking operation using the prediction score. In an embodiment, the data storage device stores computer executable instructions operable by the processor to perform a machine learning method comprising: receiving training data comprising a plurality of sets of predictor variables and target values; for each set of predictor variables: projecting each feature of the feature vector onto a dense vector representation to obtain a set of embedding vectors representing the input feature vector in an embedding space; converting the set of embedding vectors into a bi-interaction pooling vector that encodes second-order interactions between features of the feature vector in the embedding space by performing a pooling operation on the set of embedding vectors; inputting the bi- interaction pooling vector into a hidden layer stack, the hidden layer stack comprising at least one hidden layer of neural network nodes; and transforming an output vector of the hidden layer stack into a prediction score; and optimizing parameters of the pooling operation and the hidden layer stack by optimizing an objective function.
In an embodiment, optimizing the objective function comprises carrying out stochastic gradient descent.
According to a yet further aspect, there is provided a non-transitory computer- readable medium. The computer-readable medium has stored thereon program instructions for causing at least one processor to perform operations of a method disclosed above.
BRIEF DESCRIPTION OF THE DRAWINGS
In the following, embodiments of the present invention will be described as non-limiting examples with reference to the accompanying drawings in which:
Figure 1 is a block diagram showing a neural factorization machines model architecture according to an embodiment of the present invention;
Figure 2 is a block diagram showing a technical architecture of a data processing system according to an embodiment of the present invention; Figure 3 is a flowchart showing a method of predictive analysis using neural factorization machines according to an embodiment of the present invention;
Figure 4 is a flowchart showing a method of learning the parameters of a neural factorization machines system according to an embodiment of the present invention;
Figures 5A and 5B are graphs showing a comparison of validation error with respect to dropout on the bi-interaction layer of embodiments of the present invention; Figures 6A and 6B are graphs showing training and validation error of each epoch to illustrate the effect of dropout in the bi-interaction layer of embodiments of the present invention;
Figures 7A and 7B are graphs showing training and validation error of each epoch to illustrate the effect of batch normalization in the bi-interaction layer of embodiments of the present invention;
Figures 8A and 8B are graphs showing root mean square error for different activation functions in a first hidden layer in embodiments of the present invention;
Figures 9A and 9B are graphs showing the effect of pre-training in embodiments of the present invention; and
Figures 10A and 10B are graphs showing a performance comparison for root mean square error with respect to different numbers of latent factors for embodiments of the present disclosure and state of the art methods.
DETAILED DESCRIPTION

In the present disclosure, we propose a novel model for predictive analytics with sparse inputs named Neural Factorization Machines (NFMs), which enhances FMs by modeling higher-order and non-linear feature interactions. By devising a new operation in neural network modeling — Bilinear Interaction (Bi-Interaction) pooling — we subsume FM under the neural network framework for the first time. Through stacking non-linear layers above the Bi-Interaction layer, we are able to deepen the shallow linear FM, modeling higher-order and non-linear feature interactions effectively to improve FM's expressiveness. In contrast to traditional deep learning methods that simply concatenate or average embedding vectors in the low level, our use of Bi-Interaction pooling encodes more informative feature interactions, greatly facilitating the following "deep" layers to learn meaningful information. Compared to the state-of-the-art deep learning methods — the 3-layer Wide&Deep [H.-T. Cheng, L. Koc, J. Harmsen, et al., 'Wide & deep learning for recommender systems', In DLRS, pages 7-10, 2016] and the 10-layer DeepCross [Y. Shan, T. R. Hoens, J. Jiao, H. Wang, D. Yu, and J. Mao, 'Deep crossing: Webscale modeling without manually crafted combinatorial features', In KDD, pages 255-262] — our 1-layer NFM shows consistent improvements with a much simpler structure and fewer model parameters.
1. The Neural Factorization Machines Model
Similar to factorization machines, NFM is a general machine learner working with any real-valued feature vector. Given a sparse vector x ∈ ℝⁿ as input, where a feature value x_i = 0 means the i-th feature does not exist in the instance, NFM estimates the target as:

ŷ_NFM(x) = w_0 + Σ_{i=1}^{n} w_i x_i + f(x),

where the first and second terms are the linear regression part similar to that for FM, which models the global bias of the data and the weights of the features. The third term, f(x), is the core component of NFM for modeling feature interactions; it is a multi-layered feed-forward neural network as shown in Figure 1. In what follows, we elaborate the design of f(x) layer by layer.
Figure 1 illustrates a neural factorization machines model architecture according to an embodiment of the present invention. For clarity purposes, the linear regression part has been omitted from the figure. The input to the model 100 is a sparse input feature vector 110 which comprises a plurality of input feature values 112. The input feature vector 110 is input into an embedding layer 120. The embedding layer 120 is a fully connected layer that projects each feature 112 to a dense vector representation. The embedding layer 120 comprises a plurality of embedding vectors 122, each corresponding to a feature 112 of the input feature vector 110. Formally, let v_i ∈ ℝ^k be the embedding vector for the i-th feature. After embedding, we obtain a set of embedding vectors V_x = {x_1 v_1, ..., x_n v_n} to represent the input feature vector x. Note that we have rescaled each embedding vector by its input feature value, rather than simply performing an embedding table lookup, so as to account for real-valued features.
The set of embedding vectors is then fed into a bi-interaction layer 130, the Bi-Interaction layer, which is a pooling operation that converts the set of embedding vectors V_x into one vector:

f_BI(V_x) = Σ_{i=1}^{n} Σ_{j=i+1}^{n} x_i v_i ⊙ x_j v_j,

where ⊙ denotes the element-wise product of two vectors. Clearly, the output of Bi-Interaction pooling is a k-dimension vector that encodes the second-order interactions between features in the embedding space.
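A minimal NumPy sketch of Bi-Interaction pooling computed directly from this definition (the shapes are assumptions for illustration: x holds the n feature values and V the n embedding vectors of size k):

```python
# Minimal sketch: Bi-Interaction pooling as the sum of element-wise products
# of all pairs of (feature value * embedding vector), giving a k-dim vector.
import numpy as np

def bi_interaction_naive(x, V):
    """x: (n,) feature values; V: (n, k) feature embeddings."""
    n, k = V.shape
    out = np.zeros(k)
    for i in range(n):
        for j in range(i + 1, n):
            out += (x[i] * V[i]) * (x[j] * V[j])  # element-wise product
    return out

rng = np.random.default_rng(0)
x, V = rng.random(6), rng.normal(size=(6, 4))
print(bi_interaction_naive(x, V))  # one k-dimensional pooled vector
```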
It is notable that the proposal of Bi-Interaction pooling does not introduce any extra model parameters and, more importantly, it can be efficiently computed in linear time. This property is the same as for average/max pooling and concatenation, which are rather simple but commonly used in neural network approaches. To show the linear time complexity of evaluating Bi-Interaction pooling, we reformulate the equation as:

f_BI(V_x) = ½ [ (Σ_{i=1}^{n} x_i v_i)² − Σ_{i=1}^{n} (x_i v_i)² ],

where we use the symbol v² to denote the element-wise product v ⊙ v. By considering the sparsity of x, we can actually perform Bi-Interaction pooling in O(kN_x) time, where N_x denotes the number of non-zero entries in x. This is a very appealing property, meaning that the benefit of Bi-Interaction pooling in modeling pairwise feature interactions does not involve any additional cost.
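A minimal sketch of this reformulated, linear-time computation (again with assumed shapes), checked against the direct pairwise definition:

```python
# Minimal sketch: Bi-Interaction pooling via the reformulated identity
# 0.5 * ((sum_i x_i v_i)^2 - sum_i (x_i v_i)^2), touching only non-zeros of x.
import numpy as np

def bi_interaction_fast(x, V):
    nz = np.nonzero(x)[0]                  # only non-zero features contribute
    xv = x[nz, None] * V[nz]               # (N_x, k) rescaled embeddings
    s = xv.sum(axis=0)
    return 0.5 * (s ** 2 - (xv ** 2).sum(axis=0))

rng = np.random.default_rng(1)
x = np.zeros(100)
x[[3, 17, 58]] = 1.0                       # sparse, one-hot style input
V = rng.normal(size=(100, 8))

naive = sum((x[i] * V[i]) * (x[j] * V[j])
            for i in range(len(x)) for j in range(i + 1, len(x)))
assert np.allclose(bi_interaction_fast(x, V), naive)
```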
Above the Bi-Interaction pooling layer 130 is a stack of standard fully connected layers 140. The stack 140 comprises a plurality of hidden layers 142 which are capable of learning higher-order interactions between features. Formally, they are defined as:

z_1 = σ_1(W_1 f_BI(V_x) + b_1), z_2 = σ_2(W_2 z_1 + b_2), ..., z_L = σ_L(W_L z_{L-1} + b_L),

where L denotes the number of hidden layers, and W_l, b_l and σ_l denote the weight matrix, bias vector and activation function for the l-th layer, respectively. This is advantageous over higher-order FM, which only supports the learning of higher-order interactions in a linear way.
Finally, the output of the hidden layer stack 140 is transformed into a prediction score 150. The output vector of the last hidden layer, z_L, is transformed to the final prediction score 150 using the following projection:

f(x) = hᵀ z_L,

where vector h denotes the neuron weights of the prediction layer.
To summarize, we give the formulation of NFM's predictive model as:

ŷ_NFM(x) = w_0 + Σ_{i=1}^{n} w_i x_i + hᵀ σ_L(W_L(⋯ σ_1(W_1 f_BI(V_x) + b_1) ⋯) + b_L).
Compared to FM, the additional model parameters of NFM are mainly {W_l, b_l}, which are used for learning higher-order interactions between features. It is clear that by setting L to zero, which means the output of Bi-Interaction pooling is directly projected to the prediction score, we can exactly recover the FM model.
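A minimal PyTorch sketch of the NFM predictive model just described; the hidden-layer sizes, ReLU activation, dropout ratio and the exact placement of batch normalization are illustrative assumptions rather than prescriptions from the patent.

```python
# Minimal sketch of an NFM-style model: embedding, Bi-Interaction pooling,
# a hidden layer stack, and a final projection added to the linear part.
import torch
import torch.nn as nn

class NFM(nn.Module):
    def __init__(self, n_features, k, hidden=(64,), dropout=0.3):
        super().__init__()
        self.w0 = nn.Parameter(torch.zeros(1))                    # global bias
        self.w = nn.Parameter(torch.zeros(n_features))            # linear weights
        self.V = nn.Parameter(torch.randn(n_features, k) * 0.01)  # embeddings
        self.bn = nn.BatchNorm1d(k)        # BN on the Bi-Interaction output
        self.drop = nn.Dropout(dropout)    # dropout on the Bi-Interaction output
        layers, d = [], k
        for h in hidden:                   # hidden layer stack (L layers)
            layers += [nn.Linear(d, h), nn.ReLU(), nn.Dropout(dropout)]
            d = h
        self.mlp = nn.Sequential(*layers)
        self.h = nn.Linear(d, 1, bias=False)   # prediction-layer weights h

    def forward(self, x):                      # x: (batch, n_features)
        xv = x.unsqueeze(-1) * self.V          # (batch, n, k) rescaled embeddings
        s = xv.sum(dim=1)
        bi = 0.5 * (s * s - (xv * xv).sum(dim=1))  # Bi-Interaction pooling
        z = self.mlp(self.drop(self.bn(bi)))       # hidden layer stack
        f = self.h(z).squeeze(-1)                  # h^T z_L
        return self.w0 + x @ self.w + f            # linear part + f(x)
```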
2. Learning

To optimize the model parameters of NFM, one needs to specify an objective function to optimize, which can be tailored to different tasks. For example, for regression with real-valued targets, one can optimize the squared loss; for binary classification with 0/1 labels, one can optimize the log loss.
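A minimal training sketch (hypothetical data and hyperparameters), reusing the NFM module sketched above: it minimizes the squared loss with mini-batch stochastic gradient descent, the solver discussed next.

```python
# Minimal sketch: optimize an NFM model for regression with the squared loss
# using SGD on synthetic, sparse one-hot style mini-batches.
import torch

n_features, batch = 5382, 128
model = NFM(n_features, k=64)                       # module sketched above
opt = torch.optim.SGD(model.parameters(), lr=0.02)

for step in range(100):
    x = torch.zeros(batch, n_features)              # hypothetical mini-batch
    idx = torch.randint(0, n_features, (batch, 3))
    x.scatter_(1, idx, 1.0)                         # a few active features per row
    y = torch.randn(batch)                          # hypothetical targets

    loss = ((model(x) - y) ** 2).mean()             # squared loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```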
For any smooth objective function, stochastic gradient descent (SGD) is a universal solver for optimizing a neural network model. The gradient of the NFM model with respect to a model parameter can be obtained with the standard chain rule. As the Bi-Interaction pooling layer is a new operation proposed by this work, we give its derivative as follows:
∂f_BI(V_x)/∂v_i = x_i Σ_{j=1}^{n} x_j v_j − x_i² v_i,
which can be computed in O(kN_x) time, the same time complexity as computing the Bi-Interaction operation.
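This derivative can be sanity-checked numerically; a minimal sketch (not part of the patent text) compares the stated gradient with a finite-difference estimate for one embedding entry:

```python
# Minimal sketch: verify d f_BI(V_x) / d v_i = x_i * sum_j(x_j v_j) - x_i^2 v_i
# (element-wise) against a forward finite difference.
import numpy as np

def f_bi(x, V):
    xv = x[:, None] * V
    s = xv.sum(axis=0)
    return 0.5 * (s ** 2 - (xv ** 2).sum(axis=0))

rng = np.random.default_rng(2)
n, k, i, m = 5, 3, 2, 1                   # check feature i, output dimension m
x, V = rng.random(n), rng.normal(size=(n, k))

analytic = x[i] * (x[:, None] * V).sum(axis=0) - x[i] ** 2 * V[i]

eps = 1e-6
V_eps = V.copy()
V_eps[i, m] += eps
numeric = (f_bi(x, V_eps)[m] - f_bi(x, V)[m]) / eps
assert abs(analytic[m] - numeric) < 1e-4
```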
While neural network models have strong representation ability, they are also easy to overfit to the training data. Dropout is a regularization technique for neural networks to prevent overfitting. The idea is to randomly drop neurons (along with their connections) of the neural network during training. That is, in each parameter update, only the part of the model parameters that contributes to the prediction of ŷ(x) will be updated. Through this process, it can prevent complex co-adaptations of neurons on training data. It is important to note that in the testing phase, dropout must be disabled and the whole network is used for estimating ŷ(x). As such, dropout can also be seen as performing model averaging with smaller neural networks.
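Before turning to how NFM uses dropout below, a minimal sketch of the mechanism as commonly implemented (inverted dropout; the rescaling by 1/(1 - p) is a standard implementation choice, not something stated in the patent):

```python
# Minimal sketch: randomly dropping a fraction p of a layer's outputs during
# training; at test time the vector is passed through unchanged.
import numpy as np

def dropout(v, p, training, rng=np.random.default_rng(3)):
    if not training:
        return v
    mask = rng.random(v.shape) >= p   # keep each entry with probability 1 - p
    return v * mask / (1.0 - p)       # rescale to preserve the expected value
```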
In NFM, to avoid feature embeddings co-adapting to each other and overfitting the data, we propose to adopt dropout on the Bi-Interaction layer. Specifically, after obtaining f_BI(V_x), which is a k-dimensional vector of latent factors, we randomly drop p percent of the latent factors, where p is termed the dropout ratio. Since NFM with no hidden layer degrades to the FM model, this can be seen as a new way to regularize FM. Moreover, we also apply dropout on each hidden layer of NFM to prevent the learning of higher-order feature interactions from co-adaptations and overfitting.

One difficulty of training multi-layered neural networks is caused by covariate shift. It means that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. As a result, the later layer needs to adapt to these changes (which are often noisy) when updating its parameters, which adversely slows down the training. To address the problem, Ioffe and Szegedy [S. Ioffe and C. Szegedy, 'Batch normalization: Accelerating deep network training by reducing internal covariate shift', In ICML, pages 448-456, 2015] proposed batch normalization (BN), which normalizes layer inputs to a zero-mean unit-variance Gaussian distribution for each training mini-batch. It has been shown that BN leads to faster convergence and better performance in several computer vision tasks.
Formally, let the input vector to a layer be $\mathbf{x}_i \in \mathbb{R}^d$ and let the set of all input vectors to the layer in the mini-batch be $\mathcal{B} = \{\mathbf{x}_i\}$; then BN normalizes $\mathbf{x}_i$ as:

$$BN(\mathbf{x}_i) = \boldsymbol{\gamma} \odot \frac{\mathbf{x}_i - \boldsymbol{\mu}_\mathcal{B}}{\boldsymbol{\sigma}_\mathcal{B}} + \boldsymbol{\beta},$$

where

$$\boldsymbol{\mu}_\mathcal{B} = \frac{1}{|\mathcal{B}|}\sum_{\mathbf{x}_i \in \mathcal{B}} \mathbf{x}_i$$

denotes the mini-batch mean,

$$\boldsymbol{\sigma}_\mathcal{B}^2 = \frac{1}{|\mathcal{B}|}\sum_{\mathbf{x}_i \in \mathcal{B}} (\mathbf{x}_i - \boldsymbol{\mu}_\mathcal{B})^2$$

denotes the mini-batch variance, and $\boldsymbol{\gamma}$ and $\boldsymbol{\beta}$ are trainable parameters (vectors) to scale and shift the normalized value so as to restore the representation power of the network. Note that BN also needs to be applied in testing, where $\boldsymbol{\mu}_\mathcal{B}$ and $\boldsymbol{\sigma}_\mathcal{B}$ are estimated from the whole training data. In NFM, to avoid the update of feature embeddings changing the input distribution to the hidden layers or the prediction layer, we perform BN on the output of the Bi-Interaction pooling. BN is also applied to each successive hidden layer.
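A compact NumPy sketch of the mini-batch statistics above follows; the small epsilon inside the square root is a standard numerical-stability term assumed here, and at test time running statistics estimated from the training data would replace the per-batch ones.

```python
import numpy as np

def batch_norm(X, gamma, beta, eps=1e-5):
    """Batch normalization over a mini-batch X of shape (batch, d):
    normalize each dimension to zero mean / unit variance, then scale by
    gamma and shift by beta (both trainable, shape (d,))."""
    mu = X.mean(axis=0)                        # mini-batch mean, (d,)
    var = X.var(axis=0)                        # mini-batch variance, (d,)
    X_hat = (X - mu) / np.sqrt(var + eps)
    return gamma * X_hat + beta

rng = np.random.default_rng(3)
X = rng.normal(loc=5.0, scale=2.0, size=(128, 8))   # e.g. Bi-Interaction outputs
out = batch_norm(X, gamma=np.ones(8), beta=np.zeros(8))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))
```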
3. Implementation of a Neural Factorization Machines System

Figure 2 is a block diagram showing a technical architecture 200 of a data processing system according to an embodiment of the present invention. Typically, the methods of optimizing model parameters and predictive analysis using neural factorization machines according to embodiments of the present invention are implemented on a computer, or a number of computers, each having a data-processing unit. The block diagram as shown in Figure 2 illustrates a technical architecture 200 of a computer which is suitable for implementing one or more embodiments herein. The technical architecture 200 includes a processor 222 (which may be referred to as a central processor unit or CPU) that is in communication with memory devices including secondary storage 224 (such as disk drives), read only memory (ROM) 226 and random access memory (RAM) 228. The processor 222 may be implemented as one or more CPU chips. The technical architecture 200 may further comprise input/output (I/O) devices 230 and network connectivity devices 232. The technical architecture 200 further comprises activity table storage 240, which may be implemented as a hard disk drive or other type of storage device.
The secondary storage 224 is typically comprised of one or more disk drives or tape drives and is used for non-volatile storage of data and as an over-flow data storage device if RAM 228 is not large enough to hold all working data. Secondary storage 224 may be used to store programs which are loaded into RAM 228 when such programs are selected for execution. In this embodiment, the secondary storage 224 has an input vector receiving module 224a, an embedding layer module 224b, a bi-interaction pooling module 224c, a hidden layer stack module 224d, a prediction score calculation module 224e and an optimization module 224f comprising non-transitory instructions operative by the processor 222 to perform various operations of the methods of the present disclosure. As depicted in Figure 2, the modules 224a-224f are distinct modules which perform respective functions implemented by the data processing system. It will be appreciated that the boundaries between these modules are exemplary only, and that alternative embodiments may merge modules or impose an alternative decomposition of functionality of modules. For example, the modules discussed herein may be decomposed into sub-modules to be executed as multiple computer processes, and, optionally, on multiple computers. Moreover, alternative embodiments may combine multiple instances of a particular module or sub-module. It will also be appreciated that, while a software implementation of the modules 224a-224f is described herein, these may alternatively be implemented as one or more hardware modules (such as field-programmable gate array(s) or application-specific integrated circuit(s)) comprising circuitry which implements equivalent functionality to that implemented in software. The ROM 226 is used to store instructions and perhaps data which are read during program execution. The secondary storage 224, the RAM 228, and/or the ROM 226 may be referred to in some contexts as computer readable storage media and/or non-transitory computer readable media.
The I/O devices 230 may include printers, video monitors, liquid crystal displays (LCDs), plasma displays, touch screen displays, keyboards, keypads, switches, dials, mice, track balls, voice recognizers, card readers, paper tape readers, or other well-known input devices.
The network connectivity devices 232 may take the form of modems, modem banks, Ethernet cards, universal serial bus (USB) interface cards, serial interfaces, token ring cards, fiber distributed data interface (FDDI) cards, wireless local area network (WLAN) cards, radio transceiver cards that promote radio communications using protocols such as code division multiple access (CDMA), global system for mobile communications (GSM), long-term evolution (LTE), worldwide interoperability for microwave access (WiMAX), near field communications (NFC), radio frequency identity (RFID), and/or other air interface protocol radio transceiver cards, and other well-known network devices. These network connectivity devices 232 may enable the processor 222 to communicate with the Internet or one or more intranets. With such a network connection, it is contemplated that the processor 222 might receive information from the network, or might output information to the network in the course of performing the method operations described herein. Such information, which is often represented as a sequence of instructions to be executed using processor 222, may be received from and outputted to the network, for example, in the form of a computer data signal embodied in a carrier wave. The processor 222 executes instructions, codes, computer programs and scripts which it accesses from hard disk, floppy disk, optical disk (these various disk based systems may all be considered secondary storage 224), flash drive, ROM 226, RAM 228, or the network connectivity devices 232. While only one processor 222 is shown, multiple processors may be present. Thus, while instructions may be discussed as executed by a processor, the instructions may be executed simultaneously, serially, or otherwise executed by one or multiple processors.
It is understood that by programming and/or loading executable instructions onto the technical architecture 200, at least one of the CPU 222, the RAM 228 and the ROM 226 is changed, transforming the technical architecture 200 in part into a specific purpose machine or apparatus having the novel functionality taught by the present disclosure. It is fundamental to the electrical engineering and software engineering arts that functionality that can be implemented by loading executable software into a computer can be converted to a hardware implementation by well-known design rules.
Although the technical architecture 200 is described with reference to a computer, it should be appreciated that the technical architecture may be formed by two or more computers in communication with each other that collaborate to perform a task. For example, but not by way of limitation, an application may be partitioned in such a way as to permit concurrent and/or parallel processing of the instructions of the application. Alternatively, the data processed by the application may be partitioned in such a way as to permit concurrent and/or parallel processing of different portions of a data set by the two or more computers. In an embodiment, virtualization software may be employed by the technical architecture 200 to provide the functionality of a number of servers that is not directly bound to the number of computers in the technical architecture 200. In an embodiment, the functionality disclosed above may be provided by executing the application and/or applications in a cloud computing environment. Cloud computing may comprise providing computing services via a network connection using dynamically scalable computing resources. A cloud computing environment may be established by an enterprise and/or may be hired on an as-needed basis from a third party provider.

Figure 3 is a flowchart showing a method of predictive analysis using neural factorization machines according to an embodiment of the present invention. The method 300 is carried out on the data processing system 200 shown in Figure 2. In step 302, the input vector receiving module 224a of the data processing system 200 receives predictor variables as an input feature vector. The input feature vector may be a sparse vector 112 encoded by one-hot encoding.
In step 304, the embedding layer module 224b of the data processing system 200 projects the input feature vector onto an embedding space to obtain a set of embedding vectors 122.
In step 306, the bi-interaction pooling module 224c of the data processing system 200 converts the set of embedding vectors 122 into a bi-interaction pooling vector that encodes second order interactions between features of the feature vector in the embedding space.
In step 308, the bi-interaction pooling vector is input into the hidden layer stack 140 by the hidden layer stack module 224d of the data processing system 200. Each of the layers 142 of the hidden layer stack 140 comprises a plurality of nodes which perform calculations and transfer information from Layer 1 to Layer L.
In step 310, the prediction score calculation module 224e of the data processing system 200 transforms the output vector of the hidden layer stack 140 into a prediction score 150.
Figure 4 is a flowchart showing a method of learning the parameters of a neural factorization machines system according to an embodiment of the present invention. The method 400 shown in Figure 4 may be carried out on the data processing system 200 shown in Figure 2.
In step 402, the data processing system 200 receives training data which comprises a plurality of sets of predictor variables, with each set forming an input feature vector. The input feature vectors may be sparse vectors encoded by one-hot encoding. In step 404, the data processing system 200 calculates a prediction score for each set of predictor variables using the training data received in step 402. The prediction scores calculated for the training data are parameterized by model variables. Step 404 is carried out according to the method 300 shown in Figure 3.
In step 406, the optimization module 224f of the data processing system 200 optimizes the parameters of the model. The optimization carried out in step 406 may comprise minimizing an objective function.
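To make the optimization step concrete, the following sketch shows the two objective functions mentioned earlier (squared loss for regression, log loss for binary classification) together with a plain SGD parameter update; the function names are illustrative and the gradient is assumed to have been obtained via the chain rule as described above.

```python
import numpy as np

def squared_loss(y, y_hat):
    """Regression objective for real-valued targets."""
    return float(np.sum((y - y_hat) ** 2))

def log_loss(y, y_hat, eps=1e-12):
    """Binary-classification objective for 0/1 targets, with y_hat in (0, 1)."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return float(-np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))

def sgd_step(param, grad, lr=0.01):
    """One stochastic gradient descent update for any model parameter."""
    return param - lr * grad

# Illustrative use on toy targets; mapping +/-1 targets to 0/1 is for the log loss demo only.
y = np.array([1.0, -1.0, 1.0])
y_hat = np.array([0.7, -0.4, 0.9])
print(squared_loss(y, y_hat), log_loss((y + 1) / 2, (y_hat + 1) / 2))
```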
4. Applications of Neural Factorization Machines
4.1. Applications to Recommendation Systems

Embodiments of the NFM systems and methods can be used as the ranking engine for recommender systems. We now discuss how to apply NFM to build an E-commerce recommendation system.
In practical E-commerce systems, we typically have three types of data with which to build a recommendation service: 1) users' interaction histories on products, such as purchasing, rating and clicking histories; 2) user profiles, such as demographics like age, gender, hometown and income level; and 3) product properties, such as categories, prices, descriptive tags and product images. For each interaction, we convert it to a training instance whose basic features include the user ID and product ID; this provides the basic collaborative filtering system. To incorporate the side information of user profiles and product properties, we need to do feature engineering based on the types of side information. For categorical variables like gender (male or female) and hometown (Shanghai, Beijing or other cities), we can append them to the feature vector via one-hot encoding. For real-valued variables like user age and product price, we can append them to the feature vector as they are, or discretize the real-valued feature into a categorical variable.
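A minimal sketch of this feature construction is shown below; the field names, vocabulary sizes and the choice to append the raw price are illustrative assumptions only.

```python
import numpy as np

def build_feature_vector(user_id, item_id, gender, price,
                         n_users, n_items, genders=("male", "female")):
    """Concatenate one-hot blocks for IDs and categorical fields, with
    real-valued fields appended as-is (field names are illustrative)."""
    u = np.zeros(n_users);  u[user_id] = 1.0
    i = np.zeros(n_items);  i[item_id] = 1.0
    g = np.zeros(len(genders)); g[genders.index(gender)] = 1.0
    return np.concatenate([u, i, g, [price]])

x = build_feature_vector(user_id=3, item_id=42, gender="female", price=19.9,
                         n_users=10, n_items=100)
print(x.shape, x.sum())
```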
To build a visual-aware recommender system, for example one accounting for 2D product images that carry rich semantics, we can use a deep learning method to first extract a representation vector for an image. For example, we can use ResNet [K. He, X. Zhang, S. Ren, and J. Sun, 'Deep Residual Learning for Image Recognition', In CVPR, pages 770-778, 2016] to extract a 4096-dimension feature representation (denoted f) for the image, and then convert it to an embedding vector (e.g. by using Wf, where W denotes the conversion matrix to be learned from data). We can then apply NFM on the original feature embedding vectors and the generated image embedding vector (i.e. the embedding layer has one more embedding vector, Wf, representing the product image embedding). In the following section, we show how to deploy NFM in two recommendation scenarios: context-aware recommendation and personalized tag recommendation.
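A small sketch of the image-embedding step is given below; the extraction of the 4096-dimensional feature by a CNN such as ResNet is not shown, and the random vectors merely stand in for the extracted feature f and the learned projection matrix W.

```python
import numpy as np

def project_image_feature(f, W):
    """Map a pre-extracted image representation f (e.g. 4096-d) to a
    k-dimensional embedding via a learned projection matrix W (k x 4096)."""
    return W @ f

rng = np.random.default_rng(5)
f = rng.normal(size=4096)                      # stand-in for a CNN feature vector
W = rng.normal(scale=0.01, size=(64, 4096))    # projection to a 64-d embedding
e_image = project_image_feature(f, W)
print(e_image.shape)                           # (64,)
```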
5.1 Experiment Setting

Datasets.
We experimented with two publicly accessible datasets, Frappe and MovieLens:
Frappe is a context-aware app discovery tool. This dataset was constructed by Baltrunas et al. [L. Baltrunas, K. Church, A. Karatzoglou, and N. Oliver, 'Frappe: Understanding the usage and perception of mobile app recommendations in-the-wild', CoRR, abs/1505.03014, 2015]. It contains 96,203 app usage logs of users under different contexts. Besides user ID and app ID, each log contains 8 context variables, including weather, city and daytime (e.g., morning or afternoon). We converted each log (i.e., user ID, app ID and all context variables) to a feature vector using one-hot encoding, resulting in 5,382 features in total. A target value of 1 means the user has used the app under the context.
MovieLens is the full version of the latest MovieLens data published by GroupLens [F. M. Harper and J. A. Konstan, 'The movielens datasets: History and context', ACM Transactions on Interactive Intelligent Systems, 5:19:1-19:19, 2015]. As this work concerns higher-order interactions between features, we study the task of personalized tag recommendation rather than collaborative filtering, which considers second-order interactions only. The tagging part of the data includes 668,953 tag applications of 17,045 users on 23,743 items with 49,657 distinct tags. We converted each tag application (i.e., user ID, movie ID and tag) to a feature vector, resulting in 90,445 features in total. A target value of 1 means the user has assigned the tag to the movie.
As both original datasets contain positive instances only (i.e., all instances have a target value of 1), we sampled two negative instances to pair with each positive instance to ensure the generalization of the predictive model. For each log of Frappe, we randomly sampled two apps that the user has not used in the context; for each tag application of MovieLens, we randomly sampled two tags that the user has not assigned to the movie. Each negative instance is assigned a target value of -1. Table 1 summarizes the statistics of the final evaluation datasets.
Table 1. Statistics of the evaluation datasets
Dataset     Instance#   Feature#   User#    Item#
Frappe      288,609     5,382      957      4,082
MovieLens   2,006,859   90,445     17,045   23,743
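As a concrete illustration of the negative sampling described above (two negatives per positive, labelled -1), a minimal sketch follows; the item names and the fixed random seed are illustrative assumptions.

```python
import random

def sample_negatives(positive_items, all_items, n_neg=2, seed=0):
    """For one positive instance, sample n_neg items the user has not
    interacted with in this context and label them -1 (positives keep 1)."""
    rng = random.Random(seed)
    candidates = sorted(set(all_items) - set(positive_items))
    return [(item, -1) for item in rng.sample(candidates, n_neg)]

used_apps = {"facebook", "maps"}
catalogue = {"facebook", "maps", "camera", "mail", "weather", "music"}
print(sample_negatives(used_apps, catalogue, n_neg=2))
```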
Evaluation criteria
We randomly split each dataset into training (70%), validation (20%), and test (10%) sets. The validation set was used for tuning hyper-parameters, and the final performance comparison was conducted on the test set. We consider the regression task in this work and evaluate the prediction performance with root mean square error (RMSE). We rounded the prediction of each model to 1 or -1 if it was out of this range. The one-sample paired t-test was performed to judge statistical significance where necessary.
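The evaluation metric just described can be sketched as follows; clipping out-of-range predictions back to [-1, 1] before computing the RMSE mirrors the rounding rule above.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error, with predictions outside [-1, 1] rounded
    back to the nearest valid target as described above."""
    y_pred = np.clip(y_pred, -1.0, 1.0)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

print(rmse(np.array([1.0, -1.0, 1.0]), np.array([0.8, -1.3, 0.2])))
```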
We compared with the following competitive embedding-based models that are specifically designed for prediction with sparse inputs:
LibFM [S. Rendle, 'Factorization machines with libfm', ACM Transactions on Intelligent Systems and Technology, 3:57:1-57:22, 2012]. This is the official implementation (http://www.libfm.org) of FM released by Rendle.

HOFM. This is the third-party implementation (https://github.com/geffy/tffm) of higher-order FM. We experimented with order size 3, since the MovieLens data concerns the ternary relationship between users, movies and tags.
Wide&Deep [H.-T. Cheng, L. Koc, J. Harmsen, et al. 'Wide & deep learning for recommender systems', In DLRS, pages 7-10, 2016]. We used the same network structure as reported in their paper, which has three layers with size 1024, 512 and 256, respectively.
DeepCross [Y. Shan, T. R. Hoens, J. Jiao, H. Wang, D. Yu, and J. Mao, 'Deep crossing: Web-scale modeling without manually crafted combinatorial features', In KDD, pages 255-262, 2016]. We used the same structure as reported in their paper, which stacks 5 residual units (each unit has two layers) with the hidden dimensions 512, 512, 256, 128 and 64, respectively.
For the network structure of NFM, we use one hidden layer with the rectified linear unit (ReLU) as the activation function, since the baselines DeepCross and Wide&Deep also choose ReLU in their original papers.
To fairly compare the models' capability, we learned all models by optimizing the squared loss. To prevent overfitting, we tuned the L2 regularization for the linear models LibFM and HOFM, and the dropout ratio for the neural network models Wide&Deep, DeepCross and NFM. Besides LibFM, which optimized FM with vanilla SGD, all other methods were optimized with mini-batch Adagrad [J. Duchi, E. Hazan, and Y. Singer, 'Adaptive subgradient methods for online learning and stochastic optimization', Journal of Machine Learning Research, 12:2121-2159, 2011], where the batch size was set to 128 for Frappe and 4,096 for MovieLens. For all methods, an early stopping strategy was used, where we stopped training if the RMSE on the validation set increased for 4 successive epochs. Unless otherwise specified, the embedding size is set to 64 by default.
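The early stopping rule above can be sketched as a simple check on the validation-error history; the function name and the example history are illustrative only.

```python
def should_stop(val_rmse_history, patience=4):
    """Early stopping rule used above: stop once the validation RMSE has
    increased for `patience` successive epochs."""
    if len(val_rmse_history) <= patience:
        return False
    recent = val_rmse_history[-(patience + 1):]
    return all(recent[i] < recent[i + 1] for i in range(patience))

print(should_stop([0.40, 0.38, 0.37, 0.375, 0.38, 0.39, 0.40]))  # True
```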
5.2 Study of Bi-Interaction Pooling

We empirically study the Bi-Interaction pooling operation. To avoid other components (e.g., hidden layers) affecting the analysis, we study the NFM-0 model that directly projects the output of Bi-Interaction pooling to the prediction score with no hidden layer. As discussed above, NFM-0 is identical to FM, as the trainable h does not impact the model's expressiveness. We first compare dropout with traditional L2 regularization for preventing model overfitting, and then explore the impact of batch normalization.
Figures 5A and 5B show the validation error of NFM-0 with respect to the dropout ratio on the Bi-Interaction layer and the L2 regularization on feature embeddings for the Frappe and MovieLens datasets respectively. The performance of linear regression (LR) is also shown to benchmark the performance of prediction that does not consider feature interactions. First, LR leads to very poor performance, highlighting the importance of modelling interactions between sparse features for prediction. Second, we see that both L2 regularization and dropout can effectively prevent overfitting and improve NFM-0's generalization to unseen data. Between the two strategies, dropout offers better performance. Specifically, on Frappe, using a dropout ratio of 0.3 leads to the lowest validation error of 0.3562, which is significantly better than that of L2 regularization (0.3799). One reason might be that enforcing L2 regularization only suppresses the values of parameters numerically in each update, while using dropout can be seen as ensembling multiple sub-models, which can be more effective. Considering the generality of FM, which subsumes many factorization models, we believe this is an interesting new finding, meaning that dropout can also be an effective strategy to address overfitting of linear latent-factor models.
Figures 6A and 6B show training and validation error of each epoch of NFM-0 with and without dropout for Frappe and MovieLens datasets respectively. Both datasets show that with a dropout ratio of 0.3, although the training error is higher, the validation error becomes lower. This demonstrates the ability of dropout in preventing overfitting and as such, better generalization can be achieved.
Figures 7A and 7B show the training error of each epoch of NFM-0 with and without batch normalization on the Bi-Interaction layer for the Frappe and MovieLens datasets respectively. Dropout is enabled with a ratio of 0.3, and the learning rate is set to 0.02. Focusing on the training error, we can see that batch normalization (BN) leads to faster convergence; on Frappe, when BN is applied, the training error of epoch 20 is even lower than that of epoch 60 without BN, and the validation error indicates that the lower training error is not a result of overfitting. It has been shown [K. He, X. Zhang, S. Ren, and J. Sun, 'Deep residual learning for image recognition', In CVPR, pages 770-778, 2016; and S. Ioffe and C. Szegedy, 'Batch normalization: Accelerating deep network training by reducing internal covariate shift', In ICML, pages 448-456, 2015] that by addressing the internal covariate shift with BN, the model's generalization ability can be improved. Our result also verifies this point, where using BN leads to a slight improvement (although the improvement is not statistically significant). Furthermore, we notice that BN makes the learning less stable, as evidenced by the larger performance fluctuation of the 'Dropout+BN' lines. This is caused by our use of dropout and BN together, as randomly dropping neurons can change the input distribution normalized by BN.
5.3 Impact of Hidden Layers
The hidden layers of NFM play a pivotal role in capturing higher-order interactions between features. To explore their impact, we first add one hidden layer above the Bi-Interaction layer and slightly overuse the name NFM to indicate this specific model (referred to as NFM-1 below). To ensure the same model capability as NFM-0, we set the size of the hidden layer to be the same as the embedding size.
Figures 8A and 8B show the validation error of NFM with respect to different activation functions and dropout ratios for the hidden layer for the Frappe and MovieLens datasets respectively. The performance of LibFM and NFM-0 is also shown for benchmarking purposes. First and foremost, we observe that by using non-linear activations, NFM's performance is improved by a large margin: compared to NFM-0, which performs similarly to LibFM, the relative improvement is 11.3% and 5.2% for Frappe and MovieLens, respectively. This highlights the importance of modelling higher-order feature interactions for quality prediction. Among the different non-linear activation functions, there is no obvious winner. Second, when we use the identity function as the activation function, i.e., the hidden layer performs a linear transformation, NFM does not perform as well. This provides evidence of the necessity of learning higher-order feature interactions with non-linear functions.
Table 2: NFM with respect to different number of hidden layers
To see whether a deeper NFM can further improve the performance, we stack more hidden layers above the Bi-Interaction layer. The activation function is ReLU, which has been shown to have good performance for deep networks. As it is computationally expensive to tune the size and dropout ratio for each hidden layer individually, we use the same setting for all layers and tune them in the same way as NFM-1. As can be seen from Table 2, when we stack more layers, the performance is not further improved; the best performance is obtained when we use one hidden layer only. We have also explored other designs for the hidden layers, such as the tower structure and residual units; however, the performance is still not improved. We think the reason is that the Bi-Interaction layer has already encoded informative second-order feature interactions, based on which a simple non-linear function is sufficient to capture higher-order interactions. To verify this, we replaced the Bi-Interaction layer with concatenation (which leads to the same architecture as Wide&Deep) and found that the performance can be gradually improved with more hidden layers (up to three); however, the best performance achievable is still inferior to that of NFM-1. This demonstrates the value of using a more informative operation for the low-level layers, which can ease the burden of the higher-level layers in learning meaningful information. As a result, a deep structure is not necessarily required.
It is known that parameter initialization can greatly affect the convergence and performance of deep neural networks (DNNs), since gradient-based methods can only find local optima for DNNs. As has been shown above, initializing with feature embeddings learned by FM can significantly enhance Wide&Deep and DeepCross.
Figures 9A and 9B show the state of each epoch of NFM-1 with and without pre-training for the Frappe and MovieLens datasets respectively. First, we can see that by using FM embeddings as pre-training, NFM exhibits extremely fast convergence: on both datasets, with only 5 epochs, the performance is on par with 40 epochs of NFM trained from scratch (with BN enabled). Second, we find that pre-training does not improve NFM's final performance, and a random initialization can achieve a result that is slightly better than that with pre-training. This demonstrates the robustness of NFM, which is relatively insensitive to parameter initialization. In contrast to the huge impact of pre-training on Wide&Deep and DeepCross (cf. Figures 5A and 5B), which improves both their convergence and final performance, we draw the conclusion that NFM is much easier to train and optimize, which is due largely to the informative and effective Bi-Interaction pooling operation.
5.4 Performance Comparison

We now compare with state-of-the-art methods. For NFM, we use one hidden layer with the rectified linear unit (ReLU) as the activation function, since the baselines DeepCross and Wide&Deep also choose ReLU in their original papers. Note that the most important hyper-parameter for NFM is the dropout ratio; we use 0.5 for the Bi-Interaction layer and tune the value for the hidden layer.
Figures 10A and 10B are graphs showing the test RMSE with respect to different numbers of latent factors (i.e., embedding sizes) for the Frappe and MovieLens datasets respectively, where Wide&Deep and DeepCross are pre-trained with FM to better explore the potential of the two methods.
Table 3 shows the concrete scores obtained on factors 128 and 256, and the number of model parameters of each method. The scores of Wide&Deep and DeepCross without pre-training are also shown.
Table 3: Test error and number of trainable parameters for different methods on latent factors 128 and 256. M denotes "million"; * and ** denote statistical significance for p < 0.05 and p < 0.01, respectively.
We have the following three key observations.
First and foremost, NFM consistently achieves the best performance on both datasets with the fewest model parameters of all methods apart from FM. This demonstrates the effectiveness and rationality of NFM in modelling higher-order and non-linear feature interactions for prediction with sparse data. The next best performance is achieved by Wide&Deep, which uses a 3-layer MLP to learn feature interactions. We also tried deeper layers for Wide&Deep; however, the performance was not improved. This further verifies the utility of using the informative Bi-Interaction pooling at the low level. Second, we observe that HOFM shows a slight improvement over FM, with 1.45% and 1.04% average improvement on Frappe and MovieLens, respectively. This sheds light on the limitation of FM in modelling only second-order feature interactions, and thus the usefulness of modelling higher-order interactions. Meanwhile, the large performance gap between HOFM and NFM reflects the value of modelling higher-order interactions in a non-linear way, since HOFM models higher-order interactions linearly and uses many more parameters than NFM.
Lastly, the relatively weak performance of DeepCross reveals that deeper models are not always better, as DeepCross is the deepest method among all baselines, utilizing a 10-layer network. On Frappe, DeepCross only achieves performance comparable with the shallow FM model, while it underperforms FM significantly on MovieLens. We believe that the reasons are optimization difficulties and overfitting (as evidenced by the worse performance on factors 128 and 256). To conclude the performance study, we summarize the key advantages of our NFM over the baseline methods:
- NFM is more expressive than FM by modelling higher-order and non-linear feature interactions with only k² more parameters.
- Compared to HOFM, NFM models higher-order interactions in a non-linear way with much fewer parameters.
- Compared to existing deep learning methods Wide&Deep and DeepCross, NFM captures more informative interactions at the lower level and does not require a deep structure to predict well.
6. Conclusions

In this disclosure, we proposed a novel neural network model, NFM, which brings together the effectiveness of linear factorization machines and the strong representation ability of non-linear neural networks for sparse data prediction. The key to NFM's architecture is the newly proposed Bi-Interaction operation, based on which we allow a neural network model to learn more informative feature interactions at the lower level. Extensive experiments on two real-world datasets show that with one hidden layer only, NFM significantly outperforms FM, higher-order FM, and the state-of-the-art deep learning approaches Wide&Deep and DeepCross.

The work bridges the gap between linear models and deep learning. Linear models, such as various factorization methods, have been shown to be effective for information retrieval (IR) and data mining (DM) tasks and are easy to interpret. However, their limited expressiveness may hinder performance when modelling real-world data with complex inherent patterns. While deep learning models have exhibited great expressive power and yielded immense success in speech processing and computer vision, their performance is still unsatisfactory for IR tasks such as collaborative filtering. In our view, one reason is that most data in IR and DM tasks are naturally sparse, and to date there is still a lack of effective deep learning solutions for prediction with sparse data. By connecting neural networks with FM, one of the most powerful linear models for supervised learning, we are able to design a simple yet effective deep learning solution for sparse data prediction.
Whilst the foregoing description has described exemplary embodiments, it will be understood by those skilled in the art that many variations of the embodiments can be made within the scope and spirit of the present invention.

Claims

1. A predictive analytical method comprising:
receiving a set of predictor variables as an input feature vector comprising a plurality of features;
projecting each feature of the feature vector onto a dense vector
representation to obtain a set of embedding vectors representing the input feature vector in an embedding space;
converting the set of embedding vectors into a bi-interaction pooling vector that encodes second-order interactions between features of the feature vector in the embedding space;
inputting the bi-interaction pooling vector into a hidden layer stack, the hidden layer stack comprising at least one hidden layer of neural network nodes; and
transforming an output vector of the hidden layer stack into a prediction score.
2. A method according to claim 1, wherein at least one layer of the hidden layer stack has a non-linear activation function.
3. A method according to claim 2, wherein the non-linear activation function is a sigmoid, hyperbolic tangent, or rectifier function.
4. A method according to any preceding claim wherein converting the set of embedding vectors comprises performing a pooling operation on the set of embedding vectors.
5. A method according to any preceding claim wherein converting the set of embedding vectors into the bi-interaction pooling vector comprises calculating an element-wise product of embedding vectors of the set of embedding vectors.
6. A method according to any preceding claim wherein the input feature vector is a sparse vector.
7. A method according to claim 6 wherein the input feature vector is one-hot encoded.
8. A ranking method comprising the predictive analytical method of any preceding claim.
9. A supervised machine learning method comprising:
receiving training data comprising a plurality of sets of predictor variables and target values;
for each set of predictor variables:
projecting each feature of the feature vector onto a dense vector representation to obtain a set of embedding vectors representing the input feature vector in an embedding space;
converting the set of embedding vectors into a bi-interaction pooling vector that encodes second-order interactions between features of the feature vector in the embedding space by performing a pooling operation on the set of embedding vectors;
inputting the bi-interaction pooling vector into a hidden layer stack, the hidden layer stack comprising at least one hidden layer of neural network nodes; and transforming an output vector of the hidden layer stack into a prediction score; and
optimizing parameters of the pooling operation and the hidden layer stack by optimizing an objective function.
10. A method according to claim 9, wherein optimizing the objective function comprises carrying out stochastic gradient descent.
11. A computer readable medium carrying processor executable instructions which when executed on a processor cause the processor to carry out a method according to any one of claims 1 to 10.
12. A data processing system comprising a processor and a data storage device, the data storage device storing computer executable instructions operable by the processor to:
receive a set of predictor variables as an input feature vector comprising a plurality of features; project each feature of the feature vector onto a dense vector representation to obtain a set of embedding vectors representing the input feature vector in an embedding space;
convert the set of embedding vectors into a bi-interaction pooling vector that encodes second-order interactions between features of the feature vector in the embedding space;
input the bi-interaction pooling vector into a hidden layer stack, the hidden layer stack comprising at least one hidden layer of neural network nodes; and
transform an output vector of the hidden layer stack into a prediction score.
13. A data processing system according to claim 12, wherein at least one layer of the hidden layer stack has a non-linear activation function.
14. A data processing system according to claim 13, wherein the non-linear activation function is a sigmoid, hyperbolic tangent, or rectifier function.
15. A data processing system according to any one of claims 12 to 14, wherein the data storage device comprises instructions operable by the processor to convert the set of embedding vectors into a bi-interaction pooling vector by performing a pooling operation on the set of embedding vectors.
16. A data processing system according to any one of claims 12 to 15, wherein the data storage device comprises instructions operable by the processor to convert the set of embedding vectors into the bi-interaction pooling vector by calculating an element-wise product of embedding vectors of the set of embedding vectors.
17. A data processing system according to any one of claims 12 to 16, wherein the input feature vector is a sparse vector and/or is one-hot encoded.
18. A data processing system according to any one of claims 12 to 17, the data storage device further comprising instructions to perform a ranking operation using the prediction score.
19. A data processing system according to any one of claims 12 to 18, the data storage device storing computer executable instructions operable by the processor to perform a machine learning method comprising:
receiving training data comprising a plurality of sets of predictor variables and target values;
for each set of predictor variables:
projecting each feature of the feature vector onto a dense vector representation to obtain a set of embedding vectors representing the input feature vector in an embedding space;
converting the set of embedding vectors into a bi-interaction pooling vector that encodes second-order interactions between features of the feature vector in the embedding space by performing a pooling operation on the set of embedding vectors;
inputting the bi-interaction pooling vector into a hidden layer stack, the hidden layer stack comprising at least one hidden layer of neural network nodes; and transforming an output vector of the hidden layer stack into a prediction score; and
optimizing parameters of the pooling operation and the hidden layer stack by optimizing an objective function.
20. A data processing system according to claim 19, wherein optimizing the objective function comprises carrying out stochastic gradient descent.
PCT/SG2018/050234 2017-05-19 2018-05-15 Predictive analysis methods and systems Ceased WO2018212711A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
SG10201704120Q 2017-05-19
SG10201704120Q 2017-05-19

Publications (1)

Publication Number Publication Date
WO2018212711A1 true WO2018212711A1 (en) 2018-11-22

Family

ID=64274541

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/SG2018/050234 Ceased WO2018212711A1 (en) 2017-05-19 2018-05-15 Predictive analysis methods and systems

Country Status (1)

Country Link
WO (1) WO2018212711A1 (en)

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110245310A (en) * 2019-03-06 2019-09-17 腾讯科技(深圳)有限公司 A kind of behavior analysis method of object, device and storage medium
CN110490637A (en) * 2019-07-15 2019-11-22 北京三快在线科技有限公司 Recommended method, device, electronic equipment and the readable storage medium storing program for executing of commodity group
CN110689937A (en) * 2019-09-05 2020-01-14 郑州金域临床检验中心有限公司 Coding model training method, system and equipment and detection item coding method
CN110728541A (en) * 2019-10-11 2020-01-24 广州市丰申网络科技有限公司 Information stream media advertisement creative recommendation method and device
CN111340522A (en) * 2019-12-30 2020-06-26 支付宝实验室(新加坡)有限公司 Resource recommendation method, device, server and storage medium
CN111429175A (en) * 2020-03-18 2020-07-17 电子科技大学 Method for predicting click conversion under sparse characteristic scene
CN111553766A (en) * 2020-04-28 2020-08-18 苏州市职业大学 Commodity recommendation method, commodity recommendation device, commodity recommendation equipment and commodity recommendation medium
CN111737578A (en) * 2020-06-22 2020-10-02 陕西师范大学 Recommendation method and system
CN111967949A (en) * 2020-09-22 2020-11-20 武汉博晟安全技术股份有限公司 Leaky-Conv & Cross-based safety course recommendation engine sorting algorithm
CN113111575A (en) * 2021-03-30 2021-07-13 西安交通大学 Combustion engine degradation evaluation method based on depth feature coding and Gaussian mixture model
CN113255977A (en) * 2021-05-13 2021-08-13 江西鑫铂瑞科技有限公司 Intelligent factory production equipment fault prediction method and system based on industrial internet
CN113724092A (en) * 2021-08-20 2021-11-30 同盾科技有限公司 Cross-feature federated marketing modeling method and device based on FM and deep learning
CN113887694A (en) * 2020-07-01 2022-01-04 复旦大学 A CTR Prediction Model Based on Feature Representation with Attention Mechanism
CN113918764A (en) * 2020-12-31 2022-01-11 浙江大学 A movie recommendation system based on cross-modal fusion
CN113988178A (en) * 2021-10-27 2022-01-28 广东电网有限责任公司 Method and device for detecting electricity stealing users of low-voltage distribution network
CN114297477A (en) * 2021-12-09 2022-04-08 中国科学技术大学 Intelligent financial management method to automatically identify potential customers
CN115206458A (en) * 2022-06-21 2022-10-18 北京诺道认知医学科技有限公司 Method and device for predicting plasma concentration of cyclosporin a
CN117556753A (en) * 2024-01-11 2024-02-13 联和存储科技(江苏)有限公司 Method, device, equipment and storage medium for analyzing energy consumption of storage chip
CN118964626A (en) * 2024-10-17 2024-11-15 烟台大学 A data anomaly detection method and system based on quantum graph federated learning
CN120064889A (en) * 2025-04-25 2025-05-30 阳谷新太平洋电缆有限公司 Cable fault big data early warning system

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106469335A (en) * 2016-08-31 2017-03-01 北京百度网讯科技有限公司 A kind of film box office Forecasting Methodology and device

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
CHENG H.-T. ET AL.: "Wide & Deep Learning for Recommender Systems", PROCEEDINGS OF THE 1 ST WORKSHOP ON DEEP LEARNING FOR RECOMMENDER SYSTEMS, 15 September 2016 (2016-09-15), pages 7 - 10, XP058280488, [retrieved on 20180711] *
QU Y. ET AL.: "Product-based Neural Network for User Response Prediction", IEEE 16TH INTERNATIONAL CONFERENCE ON DATA MINING, 15 December 2016 (2016-12-15), pages 1149 - 1154, XP033056098, [retrieved on 20180711] *
RENDLE S .: "Factorization Machines", IEEE INTERNATIONAL CONFERENCE ON DATA MINING, 17 December 2010 (2010-12-17), pages 995 - 1000, XP055562823, [retrieved on 20180709] *
ZHANG W. ET AL.: "Deep Learning over Multi-field Categorical Data - A Case Study on User Response Prediction", 38TH EUROPEAN CONFERENCE ON INFORMATION RETRIEVAL, 23 March 2016 (2016-03-23), pages 45 - 57, XP047359078, [retrieved on 20180711] *

Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110245310A (en) * 2019-03-06 2019-09-17 腾讯科技(深圳)有限公司 A kind of behavior analysis method of object, device and storage medium
CN110245310B (en) * 2019-03-06 2023-10-13 腾讯科技(深圳)有限公司 Object behavior analysis method, device and storage medium
CN110490637A (en) * 2019-07-15 2019-11-22 北京三快在线科技有限公司 Recommended method, device, electronic equipment and the readable storage medium storing program for executing of commodity group
CN110689937A (en) * 2019-09-05 2020-01-14 郑州金域临床检验中心有限公司 Coding model training method, system and equipment and detection item coding method
CN110728541A (en) * 2019-10-11 2020-01-24 广州市丰申网络科技有限公司 Information stream media advertisement creative recommendation method and device
CN110728541B (en) * 2019-10-11 2024-01-23 广州市丰申网络科技有限公司 Information streaming media advertising creative recommendation method and device
CN111340522A (en) * 2019-12-30 2020-06-26 支付宝实验室(新加坡)有限公司 Resource recommendation method, device, server and storage medium
CN111340522B (en) * 2019-12-30 2024-03-08 支付宝实验室(新加坡)有限公司 Resource recommendation method, device, server and storage medium
CN111429175B (en) * 2020-03-18 2022-05-27 电子科技大学 Method for predicting click conversion under sparse characteristic scene
CN111429175A (en) * 2020-03-18 2020-07-17 电子科技大学 Method for predicting click conversion under sparse characteristic scene
CN111553766A (en) * 2020-04-28 2020-08-18 苏州市职业大学 Commodity recommendation method, commodity recommendation device, commodity recommendation equipment and commodity recommendation medium
CN111553766B (en) * 2020-04-28 2023-09-15 苏州市职业大学 Commodity recommendation method, device, equipment and medium
CN111737578A (en) * 2020-06-22 2020-10-02 陕西师范大学 Recommendation method and system
CN111737578B (en) * 2020-06-22 2024-04-02 陕西师范大学 Recommendation method and system
CN113887694A (en) * 2020-07-01 2022-01-04 复旦大学 A CTR Prediction Model Based on Feature Representation with Attention Mechanism
CN111967949A (en) * 2020-09-22 2020-11-20 武汉博晟安全技术股份有限公司 Leaky-Conv & Cross-based safety course recommendation engine sorting algorithm
CN113918764A (en) * 2020-12-31 2022-01-11 浙江大学 A movie recommendation system based on cross-modal fusion
CN113111575A (en) * 2021-03-30 2021-07-13 西安交通大学 Combustion engine degradation evaluation method based on depth feature coding and Gaussian mixture model
CN113111575B (en) * 2021-03-30 2023-03-31 西安交通大学 Combustion engine degradation evaluation method based on depth feature coding and Gaussian mixture model
CN113255977A (en) * 2021-05-13 2021-08-13 江西鑫铂瑞科技有限公司 Intelligent factory production equipment fault prediction method and system based on industrial internet
CN113724092A (en) * 2021-08-20 2021-11-30 同盾科技有限公司 Cross-feature federated marketing modeling method and device based on FM and deep learning
CN113724092B (en) * 2021-08-20 2024-06-07 同盾科技有限公司 Cross-feature federal marketing modeling method and device based on FM and deep learning
CN113988178A (en) * 2021-10-27 2022-01-28 广东电网有限责任公司 Method and device for detecting electricity stealing users of low-voltage distribution network
CN114297477A (en) * 2021-12-09 2022-04-08 中国科学技术大学 Intelligent financial management method to automatically identify potential customers
CN115206458A (en) * 2022-06-21 2022-10-18 北京诺道认知医学科技有限公司 Method and device for predicting plasma concentration of cyclosporin a
CN117556753A (en) * 2024-01-11 2024-02-13 联和存储科技(江苏)有限公司 Method, device, equipment and storage medium for analyzing energy consumption of storage chip
CN117556753B (en) * 2024-01-11 2024-03-19 联和存储科技(江苏)有限公司 Method, device, equipment and storage medium for analyzing energy consumption of storage chip
CN118964626A (en) * 2024-10-17 2024-11-15 烟台大学 A data anomaly detection method and system based on quantum graph federated learning
CN120064889A (en) * 2025-04-25 2025-05-30 阳谷新太平洋电缆有限公司 Cable fault big data early warning system
CN120064889B (en) * 2025-04-25 2025-07-15 阳谷新太平洋电缆有限公司 A cable fault big data early warning system

Similar Documents

Publication Publication Date Title
WO2018212711A1 (en) Predictive analysis methods and systems
He et al. Neural factorization machines for sparse predictive analytics
Xiao et al. Graph neural networks in node classification: survey and evaluation
Zhou et al. Edge intelligence: Paving the last mile of artificial intelligence with edge computing
US10248664B1 (en) Zero-shot sketch-based image retrieval techniques using neural networks for sketch-image recognition and retrieval
Lu et al. Brain intelligence: go beyond artificial intelligence
CA2997797C (en) Bilstm-siamese network based classifier for identifying target class of queries and providing responses thereof
Li et al. A novel double incremental learning algorithm for time series prediction
US12321841B2 (en) Unsupervised cross-domain data augmentation for long-document based prediction and explanation
Chowdhury et al. Few-shot class-incremental learning for 3d point cloud objects
US20250225398A1 (en) Data processing method and related apparatus
US12493819B2 (en) Utilizing machine learning models to generate initiative plans
Chowdhury et al. Qsfvqa: A time efficient, scalable and optimized vqa framework
Sharma et al. Transfer learning and its application in computer vision: A review
CN116561591A (en) Training method for semantic feature extraction model of scientific and technological literature, feature extraction method and device
Ma et al. Acceleration algorithms in gnns: A survey
US11907673B1 (en) Enhancing chatbot recognition of user intent through graph analysis
Zhang Distributed SVM face recognition based on Hadoop
Zhang et al. Review on deep learning in feature selection
EP4586148A1 (en) Performance optimization predictions related to an entity dataset based on a modified version of a predefined feature set for a candidate machine learning model
He et al. Position: Beyond Euclidean--Foundation Models Should Embrace Non-Euclidean Geometries
Liu et al. GSL-Mash: Enhancing Mashup Creation Service Recommendations Through Graph Structure Learning
Fei et al. Active learning methods with deep Gaussian processes
Hanbali et al. Advanced machine learning and deep learning approaches for fraud detection in mobile money transactions
Lambert et al. Flexible recurrent neural networks

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18801924

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18801924

Country of ref document: EP

Kind code of ref document: A1