
WO2018212711A1 - Predictive analysis methods and systems - Google Patents

Predictive analysis methods and systems

Info

Publication number
WO2018212711A1
WO2018212711A1 (PCT/SG2018/050234; SG2018050234W)
Authority
WO
WIPO (PCT)
Prior art keywords
vector
hidden layer
pooling
embedding
layer stack
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/SG2018/050234
Other languages
French (fr)
Inventor
Xiangnan HE
Tat-Seng CHUA
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Singapore
Original Assignee
National University of Singapore
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Singapore filed Critical National University of Singapore
Publication of WO2018212711A1 publication Critical patent/WO2018212711A1/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00: Digital computers in general; Data processing equipment in general
    • G06F 15/76: Architectures of general purpose stored program computers
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/048: Activation functions
    • G06N 3/0495: Quantised networks; Sparse networks; Compressed networks
    • G06N 3/0499: Feedforward networks
    • G06N 3/08: Learning methods
    • G06N 3/082: Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G06N 3/09: Supervised learning

Definitions

  • the present disclosure relates to predictive analysis using machine learning and more specifically to modeling interactions between predictor variables in predictive analysis.
  • Predictive analytics is one of the most important techniques for many information retrieval (IR) and data mining (DM) tasks, ranging from recommendation systems and targeted advertising to search ranking and sentiment analysis.
  • a predictive task is formulated as estimating a function that maps predictor variables to some target, for example, a real-valued target for regression or a categorical target for classification. Distinct from the continuous predictor variables that are naturally found in images and audio, such as raw features, the predictor variables for web applications are mostly discrete and categorical. For example, in online advertising, we need to predict how likely it is (target) that a user (first predictor variable) of a particular occupation (second predictor variable) will click on an ad (third predictor variable).
  • the generated feature vector can be of very high dimension but sparse.
  • many successful solutions in both industry and academia largely rely on manually crafting combinatorial features, i.e., constructing new features by combining multiple predictor variables, also known as cross features.
  • top data scientists are usually masters of crafting combinatorial features, which play a key role in their winning formulas.
  • the power of such features comes at a high cost, since it requires heavy engineering efforts and useful domain knowledge to design effective features.
  • these solutions can be difficult to generalize to new problems or domains.
  • a predictive analytical method comprises: receiving a set of predictor variables as an input feature vector comprising a plurality of features; projecting each feature of the feature vector onto a dense vector representation to obtain a set of embedding vectors representing the input feature vector in an embedding space; converting the set of embedding vectors into a bi-interaction pooling vector that encodes second-order interactions between features of the feature vector in the embedding space; inputting the bi-interaction pooling vector into a hidden layer stack, the hidden layer stack comprising at least one hidden layer of neural network nodes; and transforming an output vector of the hidden layer stack into a prediction score.
  • Embodiments of the present invention have the following features: a Bi-Interaction pooling operation is introduced in neural network modeling, and a new neural network view of FM is presented. Based on this view, a novel neural factorization machines (NFM) model that deepens FM under the neural network framework for learning higher-order and non-linear feature interactions is provided.
  • At least one layer of the hidden layer stack has a non-linear activation function.
  • the non-linear activation function may be a sigmoid, hyperbolic tangent, or rectifier function.
  • converting the set of embedding vectors comprises performing a pooling operation on the set of embedding vectors.
  • converting the set of embedding vectors into the bi-interaction pooling vector comprises calculating an element-wise product of embedding vectors of the set of embedding vectors.
  • the input feature vector is a sparse vector.
  • the input feature vector may be one-hot encoded.
  • a supervised machine learning method comprises: receiving training data comprising a plurality of sets of predictor variables and target values; for each set of predictor variables: projecting each feature of the feature vector onto a dense vector representation to obtain a set of embedding vectors representing the input feature vector in an embedding space; converting the set of embedding vectors into a bi-interaction pooling vector that encodes second-order interactions between features of the feature vector in the embedding space by performing a pooling operation on the set of embedding vectors; inputting the bi-interaction pooling vector into a hidden layer stack, the hidden layer stack comprising at least one hidden layer of neural network nodes; and transforming an output vector of the hidden layer stack into a prediction score; and optimizing parameters of the pooling operation and the hidden layer stack by optimizing an objective function.
  • optimizing the objective function comprises carrying out stochastic gradient descent.
  • a data processing system comprises a processor and a data storage device.
  • the data storage device stores computer executable instructions operable by the processor to: receive a set of predictor variables as an input feature vector comprising a plurality of features; project each feature of the feature vector onto a dense vector representation to obtain a set of embedding vectors representing the input feature vector in an embedding space; convert the set of embedding vectors into a bi-interaction pooling vector that encodes second-order interactions between features of the feature vector in the embedding space; input the bi-interaction pooling vector into a hidden layer stack, the hidden layer stack comprising at least one hidden layer of neural network nodes; and transform an output vector of the hidden layer stack into a prediction score.
  • At least one layer of the hidden layer stack has a non-linear activation function.
  • the non-linear activation function may be a sigmoid, hyperbolic tangent, or rectifier function.
  • the data storage device comprises instructions operable by the processor to convert the set of embedding vectors into a bi-interaction pooling vector by performing a pooling operation on the set of embedding vectors.
  • the data storage device comprises instructions operable by the processor to convert the set of embedding vectors into the bi-interaction pooling vector by calculating an element-wise product of embedding vectors of the set of embedding vectors.
  • the input feature vector is a sparse vector and / or is one-hot encoded.
  • the data storage device further comprises instructions to perform a ranking operation using the prediction score.
  • the data storage device stores computer executable instructions operable by the processor to perform a machine learning method comprising: receiving training data comprising a plurality of sets of predictor variables and target values; for each set of predictor variables: projecting each feature of the feature vector onto a dense vector representation to obtain a set of embedding vectors representing the input feature vector in an embedding space; converting the set of embedding vectors into a bi-interaction pooling vector that encodes second-order interactions between features of the feature vector in the embedding space by performing a pooling operation on the set of embedding vectors; inputting the bi- interaction pooling vector into a hidden layer stack, the hidden layer stack comprising at least one hidden layer of neural network nodes; and transforming an output vector of the hidden layer stack into a prediction score; and optimizing parameters of the pooling operation and the hidden layer stack by optimizing an objective function.
  • optimizing the objective function comprises carrying out stochastic gradient descent.
  • a non-transitory computer- readable medium has stored thereon program instructions for causing at least one processor to perform operations of a method disclosed above.
  • Figure 1 is a block diagram showing a neural factorization machines model architecture according to an embodiment of the present invention
  • Figure 2 is a block diagram showing a technical architecture of a data processing system according to an embodiment of the present invention
  • Figure 3 is a flowchart showing a method of predictive analysis using neural factorization machines according to an embodiment of the present invention
  • Figure 4 is a flowchart showing a method of learning the parameters of a neural factorization machines system according to an embodiment of the present invention
  • Figures 5A and 5B are graphs showing a comparison of validation error with respect to dropout on the bi-interaction layer of embodiments of the present invention
  • Figures 6A and 6B are graphs showing training and validation error of each epoch to illustrate the effect of dropout in the bi-interaction layer of embodiments of the present invention
  • Figures 7A and 7B are graphs showing training and validation error of each epoch to illustrate the effect of batch normalization in the bi-interaction layer of embodiments of the present invention
  • Figures 8A and 8B are graphs showing root mean square error for different activation functions in a first hidden layer in embodiments of the present invention.
  • Figures 9A and 9B are graphs showing the effect of pre-training in embodiments of the present invention.
  • Figures 10A and 10B are graphs showing a performance comparison for root mean square error with respect to different numbers of latent factors for embodiments of the present disclosure and state of the art methods.
  • Neural Factorization Machines which enhances FMs by modeling higher-order and non-linear feature interactions.
  • the third term f(x) is the core component of NFM for modeling feature interactions, which is a multi-layered feed-forward neural network as shown in Figure 1. In what follows, we elaborate the design of f(x) layer by layer.
  • Figure 1 illustrates a neural factorization machines model architecture according to an embodiment of the present invention.
  • the input to the model 100 is a sparse input feature vector 110 which comprises a plurality of input feature values 112.
  • the input feature vector 110 is input into an embedding layer 120.
  • the embedding layer 120 is a fully connected layer that projects each feature 112 to a dense vector representation.
  • the embedding layer 120 comprises a plurality of embedding vectors 122, each corresponding to a feature 112 of the input feature vector 110.
  • let v_i ∈ ℝ^k be the embedding vector for the i-th feature.
  • the set of embedding vectors is then fed into a bi-interaction layer 130.
  • the Bi-Interaction layer is a pooling operation that converts the set of embedding vectors V_x into one vector: f_BI(V_x) = Σ_{i=1}^{n} Σ_{j=i+1}^{n} x_i v_i ⊙ x_j v_j, where ⊙ denotes the element-wise product of two vectors.
  • the output of Bi-Interaction pooling is a k-dimension vector that encodes the second-order interactions between features in the embedding space.
  • above the Bi-Interaction pooling layer 130 is a stack of standard fully connected layers 140.
  • the stack 140 comprises a plurality of hidden layers 142 which are capable of learning higher-order interactions between features. Formally, they are defined as: z_1 = σ_1(W_1 f_BI(V_x) + b_1), z_2 = σ_2(W_2 z_1 + b_2), ..., z_L = σ_L(W_L z_{L-1} + b_L),
  • where L denotes the number of hidden layers,
  • and W_l, b_l and σ_l denote the weight matrix, bias vector and activation function for the l-th layer, respectively.
  • the output of the hidden layer stack 140 is transformed into a prediction score 150.
  • the output vector of the last hidden layer, z_L, is transformed to the final prediction score 150 using the projection f(x) = hᵀ z_L.
  • vector h denotes the neuron weights of the prediction layer.
  • stochastic gradient descent is a universal solver for optimizing a neural network model.
  • the gradient of the NFM model with respect to a model parameter can be obtained with the standard chain rule.
  • as the Bi-Interaction pooling layer is a new operation proposed by this work, we give its derivative as follows: ∂f_BI(V_x)/∂v_i = x_i Σ_{j=1}^{n} x_j v_j − x_i² v_i, which can be computed in O(kN_x) time, the same time complexity as computing the Bi-Interaction operation.
  • Dropout is a regularization technique for neural networks to prevent overfitting. The idea is to randomly drop neurons (along with their connections) of the neural network during training. That is, in each parameter update, only the part of the model parameters that contributes to the prediction of ŷ(x) will be updated. Through this process, it can prevent complex co-adaptations of neurons on training data. It is important to note that in the testing phase, dropout must be disabled and the whole network is used for estimating ŷ(x). As such, dropout can also be seen as performing model averaging with smaller neural networks.
  • in NFM, to avoid feature embeddings co-adapting to each other and overfitting the data, we propose to adopt dropout on the Bi-Interaction layer. Specifically, after obtaining f_BI(V_x), which is a k-dimensional vector of latent factors, we randomly drop p percent of the latent factors, where p is termed the dropout ratio. Since NFM with no hidden layer degrades to the FM model, this can be seen as a new way to regularize FM. Moreover, we also apply dropout on each hidden layer of NFM to prevent the learning of higher-order feature interactions from co-adaptations and overfitting.
  • One difficulty of training multi-layered neural networks is caused by covariate shift.
  • Ioffe and Szegedy [S. Ioffe and C. Szegedy, 'Batch normalization: Accelerating deep network training by reducing internal covariate shift', In ICML, pages 448-456, 2015] proposed batch normalization (BN), which normalizes layer inputs to a zero-mean unit-variance Gaussian distribution for each training mini-batch. It has been shown that BN leads to faster convergence and better performance in several computer vision tasks.
  • σ_B² = (1/|B|) Σ_{x_i ∈ B} (x_i − μ_B)² denotes the mini-batch variance
  • γ and β are trainable parameters (vectors) to scale and shift the normalized value to restore the representation power of the network.
  • μ_B and σ_B are estimated from the whole training data.
  • in NFM, to avoid the update of feature embeddings changing the input distribution to the hidden layers or prediction layer, we perform BN on the output of the Bi-Interaction pooling. For each successive hidden layer, the BN is also applied.
  • FIG. 2 is a block diagram showing a technical architecture 200 of a data processing system according to an embodiment of the present invention.
  • the methods of optimizing model parameters and predictive analysis using neural factorization machines according to embodiments of the present invention are implemented on a computer, or a number of computers, each having a data- processing unit.
  • the block diagram as shown in Figure 2 illustrates a technical architecture 200 of a computer which is suitable for implementing one or more embodiments herein.
  • the technical architecture 200 includes a processor 222 (which may be referred to as a central processor unit or CPU) that is in communication with memory devices including secondary storage 224 (such as disk drives), read only memory (ROM) 226, random access memory (RAM) 228.
  • the processor 222 may be implemented as one or more CPU chips.
  • the technical architecture 200 may further comprise input/output (I/O) devices 230, and network connectivity devices 232.
  • the technical architecture 200 further comprises activity table storage 240 which may be implemented as a hard disk drive or other type of storage device.
  • the secondary storage 224 is typically comprised of one or more disk drives or tape drives and is used for non-volatile storage of data and as an overflow data storage device if RAM 228 is not large enough to hold all working data. Secondary storage 224 may be used to store programs which are loaded into RAM 228 when such programs are selected for execution.
  • the secondary storage 224 has an input vector receiving module 224a, an embedding layer module 224b, a bi-interaction pooling module 224c, a hidden layer stack module 224d, a prediction score calculation module 224e and an optimization module 224f comprising non-transitory instructions operative by the processor 222 to perform various operations of the methods of the present disclosure.
  • the modules 224a-224f are distinct modules which perform respective functions implemented by the data processing system. It will be appreciated that the boundaries between these modules are exemplary only, and that alternative embodiments may merge modules or impose an alternative decomposition of functionality of modules. For example, the modules discussed herein may be decomposed into sub-modules to be executed as multiple computer processes, and, optionally, on multiple computers. Moreover, alternative embodiments may combine multiple instances of a particular module or sub-module.
  • modules 224a-224f may alternatively be implemented as one or more hardware modules (such as field-programmable gate array(s) or application-specific integrated circuit(s)) comprising circuitry which implements equivalent functionality to that implemented in software.
  • the ROM 226 is used to store instructions and perhaps data which are read during program execution.
  • the secondary storage 224, the RAM 228, and/or the ROM 226 may be referred to in some contexts as computer readable storage media and/or non-transitory computer readable media.
  • the I/O devices 230 may include printers, video monitors, liquid crystal displays (LCDs), plasma displays, touch screen displays, keyboards, keypads, switches, dials, mice, track balls, voice recognizers, card readers, paper tape readers, or other well-known input devices.
  • the network connectivity devices 232 may take the form of modems, modem banks, Ethernet cards, universal serial bus (USB) interface cards, serial interfaces, token ring cards, fiber distributed data interface (FDDI) cards, wireless local area network (WLAN) cards, radio transceiver cards that promote radio communications using protocols such as code division multiple access (CDMA), global system for mobile communications (GSM), long-term evolution (LTE), worldwide interoperability for microwave access (WiMAX), near field communications (NFC), radio frequency identity (RFID), and/or other air interface protocol radio transceiver cards, and other well-known network devices.
  • These network connectivity devices 232 may enable the processor 222 to communicate with the Internet or one or more intranets.
  • the processor 222 might receive information from the network, or might output information to the network in the course of performing the method operations described herein. Such information, which is often represented as a sequence of instructions to be executed using processor 222, may be received from and outputted to the network, for example, in the form of a computer data signal embodied in a carrier wave.
  • the processor 222 executes instructions, codes, computer programs, scripts which it accesses from hard disk, floppy disk, optical disk (these various disk based systems may all be considered secondary storage 224), flash drive, ROM 226, RAM 228, or the network connectivity devices 232. While only one processor 222 is shown, multiple processors may be present. Thus, while instructions may be discussed as executed by a processor, the instructions may be executed simultaneously, serially, or otherwise executed by one or multiple processors.
  • the technical architecture 200 is described with reference to a computer, it should be appreciated that the technical architecture may be formed by two or more computers in communication with each other that collaborate to perform a task.
  • an application may be partitioned in such a way as to permit concurrent and/or parallel processing of the instructions of the application.
  • the data processed by the application may be partitioned in such a way as to permit concurrent and/or parallel processing of different portions of a data set by the two or more computers.
  • virtualization software may be employed by the technical architecture 200 to provide the functionality of a number of servers that is not directly bound to the number of computers in the technical architecture 200.
  • the functionality disclosed above may be provided by executing the application and/or applications in a cloud computing environment.
  • Cloud computing may comprise providing computing services via a network connection using dynamically scalable computing resources.
  • a cloud computing environment may be established by an enterprise and/or may be hired on an as-needed basis from a third party provider.
  • Figure 3 is a flowchart showing a method of predictive analysis using neural factorization machines according to an embodiment of the present invention. The method 300 is carried out on the data processing system 200 shown in Figure 2.
  • the input vector receiving module 224a of the data processing system 200 receives predictor variables as an input feature vector.
  • the input feature vector may be a sparse vector 110 encoded by one-hot encoding.
  • in step 304, the embedding layer module 224b of the data processing system 200 projects the input feature vector onto an embedding space to obtain a set of embedding vectors 122.
  • in step 306, the bi-interaction pooling module 224c of the data processing system 200 converts the set of embedding vectors 122 into a bi-interaction pooling vector that encodes second order interactions between features of the feature vector in the embedding space.
  • the bi-interaction pooling vector is input into the hidden layer stack 140 by the hidden layer stack module 224d of the data processing system 200.
  • Each of the layers 142 of the hidden layer stack 140 comprises a plurality of nodes which perform calculations and transfer information from Layer 1 to Layer L.
  • in step 310, the prediction score calculation module 224e of the data processing system 200 transforms the output vector of the hidden layer stack 140 into a prediction score 150.
  • Figure 4 is a flowchart showing a method of learning the parameters of a neural factorization machines system according to an embodiment of the present invention.
  • the method 400 shown in Figure 4 may be carried out on the data processing system 200 shown in Figure 2.
  • the data processing system 200 receives training data which comprises a plurality of sets of predictor variables, with each set forming an input feature vector.
  • the input feature vectors may be sparse vectors encoded by one-hot encoding.
  • the data processing system 200 calculates a prediction score for each set of predictor variables using the training data received in step 402.
  • the prediction scores calculated for the training data are parameterized by model variables. Step 404 is carried out according to the method 300 shown in Figure 3.
  • in step 406, the optimization module 224f of the data processing system 200 optimizes the parameters of the model.
  • the optimization carried out in step 406 may comprise minimizing an objective function.
  • Embodiments of the NFM systems and methods can be used as the ranking engine for recommender systems. We now discuss how to apply NFM to build an E-commerce recommendation system.
  • Frappe is a context-aware app discovery tool.
  • This dataset is constructed by Baltrunas et al. [L. Baltrunas, K. Church, A. Karatzoglou, and N. Oliver, 'Frappe: Understanding the usage and perception of mobile app recommendations in-the- wild', CoRR, abs/1505.03014, 2015]. It contains 96,203 app usage logs of users under different contexts. Besides user ID and app ID, each log contains 8 context variables, including weather, city and daytime (e.g., morning or afternoon). We converted each log (i.e., user ID, app ID and all context variables) to a feature vector using one-hot encoding, resulting in 5,382 features in total. A target value of 1 means the user has used the app under the context.
  • MovieLens is the full version of the latest MovieLens data published by GroupLens [F. M. Harper and J. A. Konstan, 'The MovieLens datasets: History and context', ACM Transactions on Interactive Intelligent Systems, 5:19:1-19:19, 2015].
  • the tagging part of the data includes 668,953 tag applications of 17,045 users on 23,743 items with 49,657 distinct tags.
  • we converted each tag application (i.e., user ID, movie ID and tag) to a feature vector using one-hot encoding.
  • a target value of 1 means the user has assigned the tag to the movie.
  • DeepCross [Y. Shan, T. R. Hoens, J. Jiao, H. Wang, D. Yu, and J. Mao, 'Deep crossing: Webscale modeling without manually crafted combinatorial features', In KDD, pages 255-262.].
  • Figures 5A and 5B show the validation error of NFM-0 with respect to the dropout ratio on the Bi-Interaction layer and L2 regularization on feature embeddings for the Frappe and MovieLens datasets respectively.
  • the performance of linear regression (LR) is also shown for benchmarking the performance of prediction that does not consider feature interactions.
  • using a dropout ratio of 0.3 leads to the lowest validation error of 0.3562, which is significantly better than that of L2 regularization (0.3799).
  • Figures 6A and 6B show training and validation error of each epoch of NFM-0 with and without dropout for Frappe and MovieLens datasets respectively. Both datasets show that with a dropout ratio of 0.3, although the training error is higher, the validation error becomes lower. This demonstrates the ability of dropout in preventing overfitting and as such, better generalization can be achieved.
  • Figures 7A and 7B show the training error of each epoch of NFM-0 with and without batch normalization on the Bi-Interaction layer for the Frappe and MovieLens datasets respectively.
  • the dropout is enabled with a ratio of 0.3, and the learning rate is set to 0.02. Focusing on the training error, we can see that batch normalization (BN) leads to a faster convergence; on Frappe, when BN is applied, the training error of epoch 20 is even lower than that of epoch 60 without BN; and the validation error indicates that the lower training error is not overfitting.
  • the hidden layers of NFM play a pivotal role in capturing higher order interactions between features.
  • NFM-0 we set the size of the hidden layer the same as the embedding size.
  • Figures 8A and 8B show the validation error of NFM with respect to different activation functions and dropout ratios for the hidden layer for Frappe and MovieLens datasets respectively.
  • the performance of LibFM and NFM-0 are also shown for benchmarking purposes.
  • NFM's performance is improved by a large margin — compared to NFM-0, which has a similar performance to LibFM, the relative improvement is 11.3% and 5.2% for Frappe and MovieLens, respectively.
  • among the different non-linear activation functions, there is no obvious winner.
  • when the identity function is used, so that the hidden layer performs only a linear transformation, NFM does not perform as well. This provides evidence of the necessity of learning higher-order feature interactions with non-linear functions.
  • Figures 9A and 9B show the state of each epoch of NFM-1 with and without pre- training for Frappe and MovieLens datasets respectively.
  • NFM exhibits extremely fast convergence— on both datasets, with 5 epochs only, the performance is on par with 40 epochs of NFM that is trained from scratch (with BN enabled).
  • pre-training does not improve NFM's final performance, and a random initialization can achieve a result that is slightly better than that with pre-training.
  • This demonstrates the robustness of NFM, which is relatively insensitive to parameter initialization.
  • NFM is much easier to train and optimize, which is due largely to the informative and effective Bi-lnteraction pooling operation.
  • Figures 10A and 10B are graphs showing the test RMSE with respect to different numbers of latent factors (i.e., embedding sizes), where Wide&Deep and DeepCross are pre-trained with FM to better explore the two methods, for the Frappe and MovieLens datasets respectively.
  • Table 3 shows the concrete scores obtained on factors 128 and 256, and the number of model parameters of each method. The scores of Wide&Deep and DeepCross without pre-training are also shown.
  • Table 3: Test error and number of trainable parameters for different methods on latent factors 128 and 256. M denotes "million"; * and ** denote statistical significance for p < 0.05 and p < 0.01, respectively.
  • NFM consistently achieves the best performance on both datasets with the fewest model parameters besides FM.
  • the performance of NFM is followed by Wide&Deep, which uses a 3-layer MLP to learn feature interactions. We have also tried deeper layers for Wide&Deep; however, the performance was not improved. This further verifies the utility of using the informative Bi-Interaction pooling at the low level.
  • HOFM shows a slight improvement over FM, with a 1.45% and 1.04% average improvement on Frappe and MovieLens, respectively.
  • DeepCross is the deepest method among all baselines that utilizes a 10-layer network.
  • on Frappe, DeepCross only achieves performance comparable to the shallow FM model, while it underperforms FM significantly on MovieLens.
  • the reasons are optimization difficulties and overfitting (as evidenced by the worse performance on factors 128 and 256).
  • NFM is more expressive than FM by modelling higher-order and non-linear feature interactions with only k² more parameters.
  • HOFM Compared to HOFM, NFM models higher-order interactions in a non-linear way with much fewer parameters.
  • NFM captures more informative interactions at the lower level and does not require a deep structure to predict well.
  • the key to NFM's architecture is the newly proposed Bi-Interaction operation, based on which we allow a neural network model to learn more informative feature interactions at the lower level.
  • Extensive experiments on two real-world datasets show that with one hidden layer only, NFM significantly outperforms FM, higher-order FM, and the state-of-the-art deep learning approaches Wide&Deep and DeepCross. The work bridges the gap between linear models and deep learning.
  • Linear models, such as various factorization methods, have been shown to be effective for information retrieval (IR) and data mining (DM) tasks and are easy to interpret.
  • their limited expressiveness may hinder the performance when modeling real-world data with complex inherent patterns.
  • although deep learning models have exhibited great expressive power and yielded immense success in speech processing and computer vision, their performance is still unsatisfactory for IR tasks, such as collaborative filtering.
  • data in IR and DM tasks are naturally sparse and, to date, there is still a lack of effective deep learning solutions for prediction with sparse data.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Methods and systems for predictive analysis are disclosed. A predictive analysis method comprises: receiving a set of predictor variables as an input feature vector comprising a plurality of features; projecting each feature of the feature vector onto a dense vector representation to obtain a set of embedding vectors representing the input feature vector in an embedding space; converting the set of embedding vectors into a bi-interaction pooling vector that encodes second-order interactions between features of the feature vector in the embedding space; inputting the bi-interaction pooling vector into a hidden layer stack, the hidden layer stack comprising at least one hidden layer of neural network nodes; and transforming an output vector of the hidden layer stack into a prediction score.

Description

PREDICTIVE ANALYSIS METHODS AND SYSTEMS
FIELD OF THE INVENTION

The present disclosure relates to predictive analysis using machine learning and more specifically to modeling interactions between predictor variables in predictive analysis.
BACKGROUND OF THE INVENTION
Predictive analytics is one of the most important techniques for many information retrieval (IR) and data mining (DM) tasks, ranging from recommendation systems and targeted advertising to search ranking and sentiment analysis. Typically, a predictive task is formulated as estimating a function that maps predictor variables to some target, for example, a real-valued target for regression or a categorical target for classification. Distinct from the continuous predictor variables that are naturally found in images and audio, such as raw features, the predictor variables for web applications are mostly discrete and categorical. For example, in online advertising, we need to predict how likely it is (target) that a user (first predictor variable) of a particular occupation (second predictor variable) will click on an ad (third predictor variable). To build predictive models with these categorical predictor variables, a common solution is to convert them to a set of binary features (also known as a feature vector) via one-hot encoding. Thereafter, standard machine learning (ML) techniques such as logistic regression and support vector machines can be applied.
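As a minimal illustration of the one-hot encoding step described above, the following Python sketch encodes a few categorical fields into one sparse binary vector; the field names and vocabulary sizes are hypothetical, not values taken from the patent.

```python
# Minimal sketch: one-hot encoding categorical predictor variables into a
# single sparse binary feature vector. Field names and sizes are hypothetical.
import numpy as np

FIELDS = {"user_id": 1000, "occupation": 50, "ad_id": 5000}

def one_hot_encode(instance):
    """Concatenate one one-hot block per categorical field."""
    blocks = []
    for field, size in FIELDS.items():
        block = np.zeros(size, dtype=np.float32)
        block[instance[field]] = 1.0
        blocks.append(block)
    return np.concatenate(blocks)

x = one_hot_encode({"user_id": 42, "occupation": 7, "ad_id": 1234})
print(x.shape, int(x.sum()))  # (6050,) 3 -- high-dimensional but sparse
```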
Depending on the possible values of categorical predictor variables, the generated feature vector can be of very high dimension but sparse. To build effective ML models with such sparse data, it is crucial to account for the interactions between features. Many successful solutions in both industry and academia largely rely on manually crafting combinatorial features, i.e., constructing new features by combining multiple predictor variables, also known as cross features. For example, we can cross variable occupation = {banker, doctor} with gender = {M, F} and get a new occupation_gender = {banker_M, banker_F, doctor_M, doctor_F}. It is well known that top data scientists are usually masters of crafting combinatorial features, which play a key role in their winning formulas. However, the power of such features comes at a high cost, since it requires heavy engineering efforts and useful domain knowledge to design effective features. Thus these solutions can be difficult to generalize to new problems or domains.
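To make the cross-feature example above concrete, here is a minimal, purely illustrative sketch of crossing the occupation and gender variables:

```python
# Minimal sketch: manually crafting a combinatorial (cross) feature by
# combining two categorical predictor variables, as in the example above.
from itertools import product

occupation = ["banker", "doctor"]
gender = ["M", "F"]

# The crossed variable takes one value per (occupation, gender) combination.
occupation_gender = [f"{o}_{g}" for o, g in product(occupation, gender)]
print(occupation_gender)  # ['banker_M', 'banker_F', 'doctor_M', 'doctor_F']
```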
Instead of augmenting feature vectors manually, another solution is to design an ML model that learns feature interactions from raw data automatically. A popular approach is factorization machines (FMs), which embed features into a latent space and model the interactions between features via the inner product of their embedding vectors. While FM has yielded great promise in many prediction tasks, we argue that its performance can be limited by its linearity, as well as by its modeling of only pairwise (i.e., second-order) feature interactions. Specifically, for real-world data that have a complex and non-linear underlying structure, FM may not be expressive enough. Although higher-order FMs have been proposed by Rendle [S. Rendle, 'Factorization machines', In ICDM, pages 995-1000, 2010], they still belong to the family of linear models and are claimed to be difficult to estimate.
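For reference, a minimal NumPy sketch of the generic FM prediction (standard FM formulation, not code from the patent; the sizes and random values are arbitrary): each pairwise interaction is weighted by the inner product of the two features' embedding vectors.

```python
# Minimal sketch of a factorization machine (FM) prediction: linear terms plus
# pairwise interactions weighted by inner products of feature embeddings.
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 4                       # number of features, embedding size
x = rng.random(n)                 # real-valued feature vector
V = rng.normal(size=(n, k))       # one k-dimensional embedding per feature
w0, w = 0.1, rng.normal(size=n)   # global bias and per-feature weights

pairwise = sum(np.dot(V[i], V[j]) * x[i] * x[j]
               for i in range(n) for j in range(i + 1, n))
y_fm = w0 + w @ x + pairwise
print(y_fm)
```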
SUMMARY OF THE INVENTION

According to a first aspect of the present disclosure, a predictive analytical method comprises: receiving a set of predictor variables as an input feature vector comprising a plurality of features; projecting each feature of the feature vector onto a dense vector representation to obtain a set of embedding vectors representing the input feature vector in an embedding space; converting the set of embedding vectors into a bi-interaction pooling vector that encodes second-order interactions between features of the feature vector in the embedding space; inputting the bi-interaction pooling vector into a hidden layer stack, the hidden layer stack comprising at least one hidden layer of neural network nodes; and transforming an output vector of the hidden layer stack into a prediction score.
Embodiments of the present invention have the following features: a Bi-Interaction pooling operation is introduced in neural network modeling, and a new neural network view of FM is presented. Based on this view, a novel neural factorization machines (NFM) model that deepens FM under the neural network framework for learning higher-order and non-linear feature interactions is provided.
In an embodiment, at least one layer of the hidden layer stack has a non-linear activation function. The non-linear activation function may be a sigmoid, hyperbolic tangent, or rectifier function.
In an embodiment, converting the set of embedding vectors comprises performing a pooling operation on the set of embedding vectors.
In an embodiment, converting the set of embedding vectors into the bi-interaction pooling vector comprises calculating an element-wise product of embedding vectors of the set of embedding vectors. In an embodiment, the input feature vector is a sparse vector. The input feature vector may be one-hot encoded.
The analytical method may form part of a ranking method. According to a second aspect of the present disclosure, a supervised machine learning method comprises: receiving training data comprising a plurality of sets of predictor variables and target values; for each set of predictor variables: projecting each feature of the feature vector onto a dense vector representation to obtain a set of embedding vectors representing the input feature vector in an embedding space; converting the set of embedding vectors into a bi-interaction pooling vector that encodes second-order interactions between features of the feature vector in the embedding space by performing a pooling operation on the set of embedding vectors; inputting the bi-interaction pooling vector into a hidden layer stack, the hidden layer stack comprising at least one hidden layer of neural network nodes; and transforming an output vector of the hidden layer stack into a prediction score; and optimizing parameters of the pooling operation and the hidden layer stack by optimizing an objective function. In an embodiment, optimizing the objective function comprises carrying out stochastic gradient descent.
According to a third aspect of the present disclosure, a data processing system comprises a processor and a data storage device. The data storage device stores computer executable instructions operable by the processor to: receive a set of predictor variables as an input feature vector comprising a plurality of features; project each feature of the feature vector onto a dense vector representation to obtain a set of embedding vectors representing the input feature vector in an embedding space; convert the set of embedding vectors into a bi-interaction pooling vector that encodes second-order interactions between features of the feature vector in the embedding space; input the bi-interaction pooling vector into a hidden layer stack, the hidden layer stack comprising at least one hidden layer of neural network nodes; and transform an output vector of the hidden layer stack into a prediction score.
In an embodiment, at least one layer of the hidden layer stack has a non-linear activation function. The non-linear activation function may be a sigmoid, hyperbolic tangent, or rectifier function.
In an embodiment, the data storage device comprises instructions operable by the processor to convert the set of embedding vectors into a bi-interaction pooling vector by performing a pooling operation on the set of embedding vectors. In an embodiment, the data storage device comprises instructions operable by the processor to convert the set of embedding vectors into the bi-interaction pooling vector by calculating an element-wise product of embedding vectors of the set of embedding vectors. In an embodiment, the input feature vector is a sparse vector and / or is one-hot encoded.
In an embodiment, the data storage device further comprises instructions to perform a ranking operation using the prediction score. In an embodiment, the data storage device stores computer executable instructions operable by the processor to perform a machine learning method comprising: receiving training data comprising a plurality of sets of predictor variables and target values; for each set of predictor variables: projecting each feature of the feature vector onto a dense vector representation to obtain a set of embedding vectors representing the input feature vector in an embedding space; converting the set of embedding vectors into a bi-interaction pooling vector that encodes second-order interactions between features of the feature vector in the embedding space by performing a pooling operation on the set of embedding vectors; inputting the bi- interaction pooling vector into a hidden layer stack, the hidden layer stack comprising at least one hidden layer of neural network nodes; and transforming an output vector of the hidden layer stack into a prediction score; and optimizing parameters of the pooling operation and the hidden layer stack by optimizing an objective function.
In an embodiment, optimizing the objective function comprises carrying out stochastic gradient descent.
According to a yet further aspect, there is provided a non-transitory computer- readable medium. The computer-readable medium has stored thereon program instructions for causing at least one processor to perform operations of a method disclosed above.
BRIEF DESCRIPTION OF THE DRAWINGS
In the following, embodiments of the present invention will be described as non-limiting examples with reference to the accompanying drawings in which:
Figure 1 is a block diagram showing a neural factorization machines model architecture according to an embodiment of the present invention;
Figure 2 is a block diagram showing a technical architecture of a data processing system according to an embodiment of the present invention; Figure 3 is a flowchart showing a method of predictive analysis using neural factorization machines according to an embodiment of the present invention;
Figure 4 is a flowchart showing a method of learning the parameters of a neural factorization machines system according to an embodiment of the present invention;
Figures 5A and 5B are graphs showing a comparison of validation error with respect to dropout on the bi-interaction layer of embodiments of the present invention; Figures 6A and 6B are graphs showing training and validation error of each epoch to illustrate the effect of dropout in the bi-interaction layer of embodiments of the present invention;
Figures 7A and 7B are graphs showing training and validation error of each epoch to illustrate the effect of batch normalization in the bi-interaction layer of embodiments of the present invention;
Figures 8A and 8B are graphs showing root mean square error for different activation functions in a first hidden layer in embodiments of the present invention;
Figures 9A and 9B are graphs showing the effect of pre-training in embodiments of the present invention; and
Figures 10A and 10B are graphs showing a performance comparison for root mean square error with respect to different numbers of latent factors for embodiments of the present disclosure and state of the art methods.
DETAILED DESCRIPTION

In the present disclosure, we propose a novel model for predictive analytics with sparse inputs named Neural Factorization Machines (NFMs), which enhances FMs by modeling higher-order and non-linear feature interactions. By devising a new operation in neural network modeling — Bilinear Interaction (Bi-Interaction) pooling — we subsume FM under the neural network framework for the first time. Through stacking non-linear layers above the Bi-Interaction layer, we are able to deepen the shallow linear FM, modeling higher-order and non-linear feature interactions effectively to improve FM's expressiveness. In contrast to traditional deep learning methods that simply concatenate or average embedding vectors in the low level, our use of Bi-Interaction pooling encodes more informative feature interactions, greatly facilitating the following "deep" layers to learn meaningful information. Compared to the state-of-the-art deep learning methods — the 3-layer Wide&Deep [H.-T. Cheng, L. Koc, J. Harmsen, et al., 'Wide & deep learning for recommender systems', In DLRS, pages 7-10, 2016] and the 10-layer DeepCross [Y. Shan, T. R. Hoens, J. Jiao, H. Wang, D. Yu, and J. Mao, 'Deep crossing: Webscale modeling without manually crafted combinatorial features', In KDD, pages 255-262] — our 1-layer NFM shows consistent improvements with a much simpler structure and fewer model parameters.
1. The Neural Factorization Machines Model
Similar to factorization machines, NFM is a general machine learner working with any real-valued feature vector. Given a sparse vector x ∈ ℝⁿ as input, where a feature value x_i = 0 means the i-th feature does not exist in the instance, NFM estimates the target as:

ŷ_NFM(x) = w_0 + Σ_{i=1}^{n} w_i x_i + f(x),

where the first and second terms are the linear regression part similar to that for FM, which models the global bias of the data and the weights of the features. The third term, f(x), is the core component of NFM for modeling feature interactions; it is a multi-layered feed-forward neural network as shown in Figure 1. In what follows, we elaborate the design of f(x) layer by layer.
Figure 1 illustrates a neural factorization machines model architecture according to an embodiment of the present invention. For clarity purposes, the linear regression part has been omitted from the figure. The input to the model 100 is a sparse input feature vector 110 which comprises a plurality of input feature values 112. The input feature vector 110 is input into an embedding layer 120. The embedding layer 120 is a fully connected layer that projects each feature 112 to a dense vector representation. The embedding layer 120 comprises a plurality of embedding vectors 122, each corresponding to a feature 112 of the input feature vector 110. Formally, let v_i ∈ ℝ^k be the embedding vector for the i-th feature. After embedding, we obtain a set of embedding vectors V_x = {x_1 v_1, ..., x_n v_n} to represent the input feature vector x. Note that we have rescaled each embedding vector by its input feature value, rather than simply performing an embedding table lookup, so as to account for real-valued features.
The set of embedding vectors is then fed into a bi-interaction layer 130, the Bi-Interaction layer, which is a pooling operation that converts the set of embedding vectors V_x into one vector:

f_BI(V_x) = Σ_{i=1}^{n} Σ_{j=i+1}^{n} x_i v_i ⊙ x_j v_j,

where ⊙ denotes the element-wise product of two vectors. Clearly, the output of Bi-Interaction pooling is a k-dimension vector that encodes the second-order interactions between features in the embedding space.
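A minimal NumPy sketch of Bi-Interaction pooling computed directly from this definition (the shapes are assumptions for illustration: x holds the n feature values and V the n embedding vectors of size k):

```python
# Minimal sketch: Bi-Interaction pooling as the sum of element-wise products
# of all pairs of (feature value * embedding vector), giving a k-dim vector.
import numpy as np

def bi_interaction_naive(x, V):
    """x: (n,) feature values; V: (n, k) feature embeddings."""
    n, k = V.shape
    out = np.zeros(k)
    for i in range(n):
        for j in range(i + 1, n):
            out += (x[i] * V[i]) * (x[j] * V[j])  # element-wise product
    return out

rng = np.random.default_rng(0)
x, V = rng.random(6), rng.normal(size=(6, 4))
print(bi_interaction_naive(x, V))  # one k-dimensional pooled vector
```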
It is notable that the proposal of Bi-Interaction pooling does not introduce any extra model parameters and, more importantly, it can be efficiently computed in linear time. This property is the same as for average/max pooling and concatenation, which are rather simple but commonly used in neural network approaches. To show the linear time complexity of evaluating Bi-Interaction pooling, we reformulate the equation as:

f_BI(V_x) = ½ [ (Σ_{i=1}^{n} x_i v_i)² − Σ_{i=1}^{n} (x_i v_i)² ],

where we use the symbol v² to denote the element-wise product v ⊙ v. By considering the sparsity of x, we can actually perform Bi-Interaction pooling in O(kN_x) time, where N_x denotes the number of non-zero entries in x. This is a very appealing property, meaning that the benefit of Bi-Interaction pooling in modeling pairwise feature interactions does not involve any additional cost.
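A minimal sketch of this reformulated, linear-time computation (again with assumed shapes), checked against the direct pairwise definition:

```python
# Minimal sketch: Bi-Interaction pooling via the reformulated identity
# 0.5 * ((sum_i x_i v_i)^2 - sum_i (x_i v_i)^2), touching only non-zeros of x.
import numpy as np

def bi_interaction_fast(x, V):
    nz = np.nonzero(x)[0]                  # only non-zero features contribute
    xv = x[nz, None] * V[nz]               # (N_x, k) rescaled embeddings
    s = xv.sum(axis=0)
    return 0.5 * (s ** 2 - (xv ** 2).sum(axis=0))

rng = np.random.default_rng(1)
x = np.zeros(100)
x[[3, 17, 58]] = 1.0                       # sparse, one-hot style input
V = rng.normal(size=(100, 8))

naive = sum((x[i] * V[i]) * (x[j] * V[j])
            for i in range(len(x)) for j in range(i + 1, len(x)))
assert np.allclose(bi_interaction_fast(x, V), naive)
```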
Above the Bi-Interaction pooling layer 130 is a stack of standard fully connected layers 140. The stack 140 comprises a plurality of hidden layers 142 which are capable of learning higher-order interactions between features. Formally, they are defined as:

z_1 = σ_1(W_1 f_BI(V_x) + b_1), z_2 = σ_2(W_2 z_1 + b_2), ..., z_L = σ_L(W_L z_{L-1} + b_L),

where L denotes the number of hidden layers, and W_l, b_l and σ_l denote the weight matrix, bias vector and activation function for the l-th layer, respectively. This is advantageous over higher-order FM, which only supports the learning of higher-order interactions in a linear way.
Finally, the output of the hidden layer stack 140 is transformed into a prediction score 150. The output vector of the last hidden layer, z_L, is transformed to the final prediction score 150 using the following projection:

f(x) = hᵀ z_L,

where vector h denotes the neuron weights of the prediction layer.
To summarize, we give the formulation of NFM's predictive model as:

ŷ_NFM(x) = w_0 + Σ_{i=1}^{n} w_i x_i + hᵀ σ_L(W_L(⋯ σ_1(W_1 f_BI(V_x) + b_1) ⋯) + b_L).
Compared to FM, the additional model parameters of NFM are mainly {W_l, b_l}, which are used for learning higher-order interactions between features. It is clear that by setting L to zero, which means the output of Bi-Interaction pooling is directly projected to the prediction score, we can exactly recover the FM model.
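A minimal PyTorch sketch of the NFM predictive model just described; the hidden-layer sizes, ReLU activation, dropout ratio and the exact placement of batch normalization are illustrative assumptions rather than prescriptions from the patent.

```python
# Minimal sketch of an NFM-style model: embedding, Bi-Interaction pooling,
# a hidden layer stack, and a final projection added to the linear part.
import torch
import torch.nn as nn

class NFM(nn.Module):
    def __init__(self, n_features, k, hidden=(64,), dropout=0.3):
        super().__init__()
        self.w0 = nn.Parameter(torch.zeros(1))                    # global bias
        self.w = nn.Parameter(torch.zeros(n_features))            # linear weights
        self.V = nn.Parameter(torch.randn(n_features, k) * 0.01)  # embeddings
        self.bn = nn.BatchNorm1d(k)        # BN on the Bi-Interaction output
        self.drop = nn.Dropout(dropout)    # dropout on the Bi-Interaction output
        layers, d = [], k
        for h in hidden:                   # hidden layer stack (L layers)
            layers += [nn.Linear(d, h), nn.ReLU(), nn.Dropout(dropout)]
            d = h
        self.mlp = nn.Sequential(*layers)
        self.h = nn.Linear(d, 1, bias=False)   # prediction-layer weights h

    def forward(self, x):                      # x: (batch, n_features)
        xv = x.unsqueeze(-1) * self.V          # (batch, n, k) rescaled embeddings
        s = xv.sum(dim=1)
        bi = 0.5 * (s * s - (xv * xv).sum(dim=1))  # Bi-Interaction pooling
        z = self.mlp(self.drop(self.bn(bi)))       # hidden layer stack
        f = self.h(z).squeeze(-1)                  # h^T z_L
        return self.w0 + x @ self.w + f            # linear part + f(x)
```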
2. Learning

To optimize the model parameters of NFM, one needs to specify an objective function to optimize, which can be tailored to different tasks. For example, for regression with real-valued targets, one can optimize the squared loss; for binary classification with 0/1 labels, one can optimize the log loss.
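A minimal training sketch (hypothetical data and hyperparameters), reusing the NFM module sketched above: it minimizes the squared loss with mini-batch stochastic gradient descent, the solver discussed next.

```python
# Minimal sketch: optimize an NFM model for regression with the squared loss
# using SGD on synthetic, sparse one-hot style mini-batches.
import torch

n_features, batch = 5382, 128
model = NFM(n_features, k=64)                       # module sketched above
opt = torch.optim.SGD(model.parameters(), lr=0.02)

for step in range(100):
    x = torch.zeros(batch, n_features)              # hypothetical mini-batch
    idx = torch.randint(0, n_features, (batch, 3))
    x.scatter_(1, idx, 1.0)                         # a few active features per row
    y = torch.randn(batch)                          # hypothetical targets

    loss = ((model(x) - y) ** 2).mean()             # squared loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```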
For any smooth objective function, stochastic gradient descent (SGD) is a universal solver for optimizing a neural network model. The gradient of the NFM model with respect to a model parameter can be obtained with the standard chain rule. As the Bi-Interaction pooling layer is a new operation proposed by this work, we give its derivative as follows:
∂f_BI(V_x)/∂v_i = x_i Σ_{j=1}^{n} x_j v_j − x_i² v_i,
which can be computed in O(kN_x) time, the same time complexity as computing the Bi-Interaction operation.
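This derivative can be sanity-checked numerically; a minimal sketch (not part of the patent text) compares the stated gradient with a finite-difference estimate for one embedding entry:

```python
# Minimal sketch: verify d f_BI(V_x) / d v_i = x_i * sum_j(x_j v_j) - x_i^2 v_i
# (element-wise) against a forward finite difference.
import numpy as np

def f_bi(x, V):
    xv = x[:, None] * V
    s = xv.sum(axis=0)
    return 0.5 * (s ** 2 - (xv ** 2).sum(axis=0))

rng = np.random.default_rng(2)
n, k, i, m = 5, 3, 2, 1                   # check feature i, output dimension m
x, V = rng.random(n), rng.normal(size=(n, k))

analytic = x[i] * (x[:, None] * V).sum(axis=0) - x[i] ** 2 * V[i]

eps = 1e-6
V_eps = V.copy()
V_eps[i, m] += eps
numeric = (f_bi(x, V_eps)[m] - f_bi(x, V)[m]) / eps
assert abs(analytic[m] - numeric) < 1e-4
```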
While neural network models have strong representation ability, they are also easy to overfit to the training data. Dropout is a regularization technique for neural networks to prevent overfitting. The idea is to randomly drop neurons (along with their connections) of the neural network during training. That is, in each parameter update, only the part of the model parameters that contributes to the prediction of ŷ(x) will be updated. Through this process, it can prevent complex co-adaptations of neurons on training data. It is important to note that in the testing phase, dropout must be disabled and the whole network is used for estimating ŷ(x). As such, dropout can also be seen as performing model averaging with smaller neural networks.
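Before turning to how NFM uses dropout below, a minimal sketch of the mechanism as commonly implemented (inverted dropout; the rescaling by 1/(1 - p) is a standard implementation choice, not something stated in the patent):

```python
# Minimal sketch: randomly dropping a fraction p of a layer's outputs during
# training; at test time the vector is passed through unchanged.
import numpy as np

def dropout(v, p, training, rng=np.random.default_rng(3)):
    if not training:
        return v
    mask = rng.random(v.shape) >= p   # keep each entry with probability 1 - p
    return v * mask / (1.0 - p)       # rescale to preserve the expected value
```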
In NFM, to avoid feature embeddings co-adapting to each other and overfitting the data, we propose to adopt dropout on the Bi-Interaction layer. Specifically, after obtaining f_BI(V_x), which is a k-dimensional vector of latent factors, we randomly drop p percent of the latent factors, where p is termed the dropout ratio. Since NFM with no hidden layer degrades to the FM model, this can be seen as a new way to regularize FM. Moreover, we also apply dropout on each hidden layer of NFM to prevent the learning of higher-order feature interactions from co-adaptations and overfitting.

One difficulty of training multi-layered neural networks is caused by covariate shift. It means that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. As a result, the later layer needs to adapt to these changes (which are often noisy) when updating its parameters, which adversely slows down the training. To address the problem, Ioffe and Szegedy [S. Ioffe and C. Szegedy, 'Batch normalization: Accelerating deep network training by reducing internal covariate shift', In ICML, pages 448-456, 2015] proposed batch normalization (BN), which normalizes layer inputs to a zero-mean unit-variance Gaussian distribution for each training mini-batch. It has been shown that BN leads to faster convergence and better performance in several computer vision tasks.
Formally, let the input vector to a layer be $\mathbf{x}_i \in \mathbb{R}^d$ and let the set of all input vectors to the layer in the mini-batch be $\mathcal{B} = \{\mathbf{x}_i\}$; then BN normalizes $\mathbf{x}_i$ as:

$$BN(\mathbf{x}_i) = \boldsymbol{\gamma} \odot \frac{\mathbf{x}_i - \boldsymbol{\mu}_\mathcal{B}}{\boldsymbol{\sigma}_\mathcal{B}} + \boldsymbol{\beta},$$

where

$$\boldsymbol{\mu}_\mathcal{B} = \frac{1}{|\mathcal{B}|}\sum_{\mathbf{x}_i \in \mathcal{B}} \mathbf{x}_i$$

denotes the mini-batch mean,

$$\boldsymbol{\sigma}_\mathcal{B}^2 = \frac{1}{|\mathcal{B}|}\sum_{\mathbf{x}_i \in \mathcal{B}} (\mathbf{x}_i - \boldsymbol{\mu}_\mathcal{B})^2$$

denotes the mini-batch variance, and $\boldsymbol{\gamma}$ and $\boldsymbol{\beta}$ are trainable parameters (vectors) to scale and shift the normalized value so as to restore the representation power of the network. Note that BN also needs to be applied in testing, where $\boldsymbol{\mu}_\mathcal{B}$ and $\boldsymbol{\sigma}_\mathcal{B}$ are estimated from the whole training data. In NFM, to avoid the update of feature embeddings changing the input distribution to the hidden layers or the prediction layer, we perform BN on the output of the Bi-Interaction pooling. BN is also applied to each successive hidden layer.
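A compact NumPy sketch of the mini-batch statistics above follows; the small epsilon inside the square root is a standard numerical-stability term assumed here, and at test time running statistics estimated from the training data would replace the per-batch ones.

```python
import numpy as np

def batch_norm(X, gamma, beta, eps=1e-5):
    """Batch normalization over a mini-batch X of shape (batch, d):
    normalize each dimension to zero mean / unit variance, then scale by
    gamma and shift by beta (both trainable, shape (d,))."""
    mu = X.mean(axis=0)                        # mini-batch mean, (d,)
    var = X.var(axis=0)                        # mini-batch variance, (d,)
    X_hat = (X - mu) / np.sqrt(var + eps)
    return gamma * X_hat + beta

rng = np.random.default_rng(3)
X = rng.normal(loc=5.0, scale=2.0, size=(128, 8))   # e.g. Bi-Interaction outputs
out = batch_norm(X, gamma=np.ones(8), beta=np.zeros(8))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))
```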
3. Implementation of a Neural Factorization Machines System

Figure 2 is a block diagram showing a technical architecture 200 of a data processing system according to an embodiment of the present invention. Typically, the methods of optimizing model parameters and predictive analysis using neural factorization machines according to embodiments of the present invention are implemented on a computer, or a number of computers, each having a data-processing unit. The block diagram as shown in Figure 2 illustrates a technical architecture 200 of a computer which is suitable for implementing one or more embodiments herein. The technical architecture 200 includes a processor 222 (which may be referred to as a central processor unit or CPU) that is in communication with memory devices including secondary storage 224 (such as disk drives), read only memory (ROM) 226 and random access memory (RAM) 228. The processor 222 may be implemented as one or more CPU chips. The technical architecture 200 may further comprise input/output (I/O) devices 230 and network connectivity devices 232. The technical architecture 200 further comprises activity table storage 240, which may be implemented as a hard disk drive or other type of storage device.
The secondary storage 224 is typically comprised of one or more disk drives or tape drives and is used for non-volatile storage of data and as an over-flow data storage device if RAM 228 is not large enough to hold all working data. Secondary storage 224 may be used to store programs which are loaded into RAM 228 when such programs are selected for execution. In this embodiment, the secondary storage 224 has an input vector receiving module 224a, an embedding layer module 224b, a bi-interaction pooling module 224c, a hidden layer stack module 224d, a prediction score calculation module 224e and an optimization module 224f comprising non-transitory instructions operative by the processor 222 to perform various operations of the methods of the present disclosure. As depicted in Figure 2, the modules 224a-224f are distinct modules which perform respective functions implemented by the data processing system. It will be appreciated that the boundaries between these modules are exemplary only, and that alternative embodiments may merge modules or impose an alternative decomposition of functionality of modules. For example, the modules discussed herein may be decomposed into sub-modules to be executed as multiple computer processes, and, optionally, on multiple computers. Moreover, alternative embodiments may combine multiple instances of a particular module or sub-module. It will also be appreciated that, while a software implementation of the modules 224a-224f is described herein, these may alternatively be implemented as one or more hardware modules (such as field-programmable gate array(s) or application-specific integrated circuit(s)) comprising circuitry which implements equivalent functionality to that implemented in software. The ROM 226 is used to store instructions and perhaps data which are read during program execution. The secondary storage 224, the RAM 228, and/or the ROM 226 may be referred to in some contexts as computer readable storage media and/or non-transitory computer readable media.
The I/O devices 230 may include printers, video monitors, liquid crystal displays (LCDs), plasma displays, touch screen displays, keyboards, keypads, switches, dials, mice, track balls, voice recognizers, card readers, paper tape readers, or other well-known input devices.
The network connectivity devices 232 may take the form of modems, modem banks, Ethernet cards, universal serial bus (USB) interface cards, serial interfaces, token ring cards, fiber distributed data interface (FDDI) cards, wireless local area network (WLAN) cards, radio transceiver cards that promote radio communications using protocols such as code division multiple access (CDMA), global system for mobile communications (GSM), long-term evolution (LTE), worldwide interoperability for microwave access (WiMAX), near field communications (NFC), radio frequency identity (RFID), and/or other air interface protocol radio transceiver cards, and other well-known network devices. These network connectivity devices 232 may enable the processor 222 to communicate with the Internet or one or more intranets. With such a network connection, it is contemplated that the processor 222 might receive information from the network, or might output information to the network in the course of performing the method operations described herein. Such information, which is often represented as a sequence of instructions to be executed using processor 222, may be received from and outputted to the network, for example, in the form of a computer data signal embodied in a carrier wave. The processor 222 executes instructions, codes, computer programs and scripts which it accesses from hard disk, floppy disk, optical disk (these various disk based systems may all be considered secondary storage 224), flash drive, ROM 226, RAM 228, or the network connectivity devices 232. While only one processor 222 is shown, multiple processors may be present. Thus, while instructions may be discussed as executed by a processor, the instructions may be executed simultaneously, serially, or otherwise executed by one or multiple processors.
It is understood that by programming and/or loading executable instructions onto the technical architecture 200, at least one of the CPU 222, the RAM 228 and the ROM 226 is changed, transforming the technical architecture 200 in part into a specific purpose machine or apparatus having the novel functionality taught by the present disclosure. It is fundamental to the electrical engineering and software engineering arts that functionality that can be implemented by loading executable software into a computer can be converted to a hardware implementation by well-known design rules.
Although the technical architecture 200 is described with reference to a computer, it should be appreciated that the technical architecture may be formed by two or more computers in communication with each other that collaborate to perform a task. For example, but not by way of limitation, an application may be partitioned in such a way as to permit concurrent and/or parallel processing of the instructions of the application. Alternatively, the data processed by the application may be partitioned in such a way as to permit concurrent and/or parallel processing of different portions of a data set by the two or more computers. In an embodiment, virtualization software may be employed by the technical architecture 200 to provide the functionality of a number of servers that is not directly bound to the number of computers in the technical architecture 200. In an embodiment, the functionality disclosed above may be provided by executing the application and/or applications in a cloud computing environment. Cloud computing may comprise providing computing services via a network connection using dynamically scalable computing resources. A cloud computing environment may be established by an enterprise and/or may be hired on an as-needed basis from a third party provider.

Figure 3 is a flowchart showing a method of predictive analysis using neural factorization machines according to an embodiment of the present invention. The method 300 is carried out on the data processing system 200 shown in Figure 2. In step 302, the input vector receiving module 224a of the data processing system 200 receives predictor variables as an input feature vector. The input feature vector may be a sparse vector 112 encoded by one-hot encoding.
In step 304, the embedding layer module 224b of the data processing system 200 projects the input feature vector onto an embedding space to obtain a set of embedding vectors 122.
In step 306, the bi-interaction pooling module 224c of the data processing system 200 converts the set of embedding vectors 122 into a bi-interaction pooling vector that encodes second order interactions between features of the feature vector in the embedding space.
In step 308, the bi-interaction pooling vector is input into the hidden layer stack 140 by the hidden layer stack module 224d of the data processing system 200. Each of the layers 142 of the hidden layer stack 140 comprises a plurality of nodes which perform calculations and transfer information from Layer 1 to Layer L.
In step 310, the prediction score calculation module 224e of the data processing system 200 transforms the output vector of the hidden layer stack 140 into a prediction score 150.
Figure 4 is a flowchart showing a method of learning the parameters of a neural factorization machines system according to an embodiment of the present invention. The method 400 shown in Figure 4 may be carried out on the data processing system 200 shown in Figure 2.
In step 402, the data processing system 200 receives training data which comprises a plurality of sets of predictor variables, with each set forming an input feature vector. The input feature vectors may be sparse vectors encoded by one-hot encoding. In step 404, the data processing system 200 calculates a prediction score for each set of predictor variables using the training data received in step 402. The prediction scores calculated for the training data are parameterized by model variables. Step 404 is carried out according to the method 300 shown in Figure 3.
In step 406, the optimization module 224f of the data processing system 200 optimizes the parameters of the model. The optimization carried out in step 406 may comprise minimizing an objective function.
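To make the optimization step concrete, the following sketch shows the two objective functions mentioned earlier (squared loss for regression, log loss for binary classification) together with a plain SGD parameter update; the function names are illustrative and the gradient is assumed to have been obtained via the chain rule as described above.

```python
import numpy as np

def squared_loss(y, y_hat):
    """Regression objective for real-valued targets."""
    return float(np.sum((y - y_hat) ** 2))

def log_loss(y, y_hat, eps=1e-12):
    """Binary-classification objective for 0/1 targets, with y_hat in (0, 1)."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return float(-np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))

def sgd_step(param, grad, lr=0.01):
    """One stochastic gradient descent update for any model parameter."""
    return param - lr * grad

# Illustrative use on toy targets; mapping +/-1 targets to 0/1 is for the log loss demo only.
y = np.array([1.0, -1.0, 1.0])
y_hat = np.array([0.7, -0.4, 0.9])
print(squared_loss(y, y_hat), log_loss((y + 1) / 2, (y_hat + 1) / 2))
```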
4. Applications of Neural Factorization Machines
4.1. Applications to Recommendation Systems

Embodiments of the NFM systems and methods can be used as the ranking engine for recommender systems. We now discuss how to apply NFM to build an E-commerce recommendation system.
In practical E-commerce systems, we typically have three types of data with which to build a recommendation service: 1) users' interaction histories on products, such as purchasing, rating and clicking histories; 2) user profiles, such as demographics like age, gender, hometown and income level; and 3) product properties, such as categories, prices, descriptive tags and product images. For each interaction, we convert it to a training instance whose basic features include the user ID and product ID; this provides the basic collaborative filtering system. To incorporate the side information of user profiles and product properties, we need to do feature engineering based on the types of side information. For categorical variables like gender (male or female) and hometown (Shanghai, Beijing or other cities), we can append them to the feature vector via one-hot encoding. For real-valued variables like user age and product price, we can append them to the feature vector as they are, or discretize the real-valued feature into a categorical variable.
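A minimal sketch of this feature construction is shown below; the field names, vocabulary sizes and the choice to append the raw price are illustrative assumptions only.

```python
import numpy as np

def build_feature_vector(user_id, item_id, gender, price,
                         n_users, n_items, genders=("male", "female")):
    """Concatenate one-hot blocks for IDs and categorical fields, with
    real-valued fields appended as-is (field names are illustrative)."""
    u = np.zeros(n_users);  u[user_id] = 1.0
    i = np.zeros(n_items);  i[item_id] = 1.0
    g = np.zeros(len(genders)); g[genders.index(gender)] = 1.0
    return np.concatenate([u, i, g, [price]])

x = build_feature_vector(user_id=3, item_id=42, gender="female", price=19.9,
                         n_users=10, n_items=100)
print(x.shape, x.sum())
```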
To build a visual-aware recommender system, for example one accounting for 2D product images that carry rich semantics, we can use a deep learning method to first extract a representation vector for an image. For example, we can use ResNet [K. He, X. Zhang, S. Ren, and J. Sun, 'Deep Residual Learning for Image Recognition', In CVPR, pages 770-778, 2016] to extract a 4096-dimension feature representation (denoted f) for the image, and then convert it to an embedding vector (e.g. by using Wf, where W denotes the conversion matrix to be learned from data). We can then apply NFM on the original feature embedding vectors and the generated image embedding vector (i.e. the embedding layer has one more embedding vector, Wf, representing the product image embedding). In the following section, we show how to deploy NFM in two recommendation scenarios: context-aware recommendation and personalized tag recommendation.
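A small sketch of the image-embedding step is given below; the extraction of the 4096-dimensional feature by a CNN such as ResNet is not shown, and the random vectors merely stand in for the extracted feature f and the learned projection matrix W.

```python
import numpy as np

def project_image_feature(f, W):
    """Map a pre-extracted image representation f (e.g. 4096-d) to a
    k-dimensional embedding via a learned projection matrix W (k x 4096)."""
    return W @ f

rng = np.random.default_rng(5)
f = rng.normal(size=4096)                      # stand-in for a CNN feature vector
W = rng.normal(scale=0.01, size=(64, 4096))    # projection to a 64-d embedding
e_image = project_image_feature(f, W)
print(e_image.shape)                           # (64,)
```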
5.1 Experiment Setting

Datasets.
We experimented with two publicly accessible datasets, Frappe and MovieLens:
Frappe is a context-aware app discovery tool. This dataset was constructed by Baltrunas et al. [L. Baltrunas, K. Church, A. Karatzoglou, and N. Oliver, 'Frappe: Understanding the usage and perception of mobile app recommendations in-the-wild', CoRR, abs/1505.03014, 2015]. It contains 96,203 app usage logs of users under different contexts. Besides user ID and app ID, each log contains 8 context variables, including weather, city and daytime (e.g., morning or afternoon). We converted each log (i.e., user ID, app ID and all context variables) to a feature vector using one-hot encoding, resulting in 5,382 features in total. A target value of 1 means the user has used the app under the context.
MovieLens is the full version of the latest MovieLens data published by GroupLens [F. M. Harper and J. A. Konstan, 'The movielens datasets: History and context', ACM Transactions on Interactive Intelligent Systems, 5:19:1-19:19, 2015]. As this work concerns higher-order interactions between features, we study the task of personalized tag recommendation rather than collaborative filtering, which considers second-order interactions only. The tagging part of the data includes 668,953 tag applications of 17,045 users on 23,743 items with 49,657 distinct tags. We converted each tag application (i.e., user ID, movie ID and tag) to a feature vector, resulting in 90,445 features in total. A target value of 1 means the user has assigned the tag to the movie.
As both original datasets contain positive instances only (i.e., all instances have a target value of 1), we sampled two negative instances to pair with each positive instance to ensure the generalization of the predictive model. For each log of Frappe, we randomly sampled two apps that the user has not used in the context; for each tag application of MovieLens, we randomly sampled two tags that the user has not assigned to the movie. Each negative instance is assigned a target value of -1. Table 1 summarizes the statistics of the final evaluation datasets.
Table 1. Statistics of the evaluation datasets
Dataset     Instance#   Feature#   User#    Item#
Frappe      288,609     5,382      957      4,082
MovieLens   2,006,859   90,445     17,045   23,743
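As a concrete illustration of the negative sampling described above (two negatives per positive, labelled -1), a minimal sketch follows; the item names and the fixed random seed are illustrative assumptions.

```python
import random

def sample_negatives(positive_items, all_items, n_neg=2, seed=0):
    """For one positive instance, sample n_neg items the user has not
    interacted with in this context and label them -1 (positives keep 1)."""
    rng = random.Random(seed)
    candidates = sorted(set(all_items) - set(positive_items))
    return [(item, -1) for item in rng.sample(candidates, n_neg)]

used_apps = {"facebook", "maps"}
catalogue = {"facebook", "maps", "camera", "mail", "weather", "music"}
print(sample_negatives(used_apps, catalogue, n_neg=2))
```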
Evaluation criteria
We randomly split each dataset into training (70%), validation (20%), and test (10%) sets. The validation set was used for tuning hyper-parameters, and the final performance comparison was conducted on the test set. We consider the regression task in this work and evaluate the prediction performance with root mean square error (RMSE). We rounded the prediction of each model to 1 or -1 if it was out of this range. The one-sample paired t-test was performed to judge statistical significance where necessary.
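The evaluation metric just described can be sketched as follows; clipping out-of-range predictions back to [-1, 1] before computing the RMSE mirrors the rounding rule above.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error, with predictions outside [-1, 1] rounded
    back to the nearest valid target as described above."""
    y_pred = np.clip(y_pred, -1.0, 1.0)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

print(rmse(np.array([1.0, -1.0, 1.0]), np.array([0.8, -1.3, 0.2])))
```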
We compared with the following competitive embedding-based models that are specifically designed for prediction with sparse inputs:
LibFM [S. Rendle, 'Factorization machines with libfm', ACM Transactions on Intelligent Systems and Technology, 3:57:1-57:22, 2012]. This is the official implementation (http://www.libfm.org) of FM released by Rendle.

HOFM. This is the third-party implementation (https://github.com/geffy/tffm) of higher-order FM. We experimented with order size 3, since the MovieLens data concerns the ternary relationship between users, movies and tags.
Wide&Deep [H.-T. Cheng, L. Koc, J. Harmsen, et al. 'Wide & deep learning for recommender systems', In DLRS, pages 7-10, 2016]. We used the same network structure as reported in their paper, which has three layers with size 1024, 512 and 256, respectively.
DeepCross [Y. Shan, T. R. Hoens, J. Jiao, H. Wang, D. Yu, and J. Mao, 'Deep crossing: Web-scale modeling without manually crafted combinatorial features', In KDD, pages 255-262, 2016]. We used the same structure as reported in their paper, which stacks 5 residual units (each unit has two layers) with the hidden dimensions 512, 512, 256, 128 and 64, respectively.
For the network structure of NFM, we use one hidden layer with the rectified linear unit (ReLU) as the activation function, since the baselines DeepCross and Wide&Deep also choose ReLU in their original papers.
To fairly compare the models' capability, we learned all models by optimizing the squared loss. To prevent overfitting, we tuned the L2 regularization for the linear models LibFM and HOFM, and the dropout ratio for the neural network models Wide&Deep, DeepCross and NFM. Besides LibFM, which optimized FM with vanilla SGD, all other methods were optimized with mini-batch Adagrad [J. Duchi, E. Hazan, and Y. Singer, 'Adaptive subgradient methods for online learning and stochastic optimization', Journal of Machine Learning Research, 12:2121-2159, 2011], where the batch size was set to 128 for Frappe and 4,096 for MovieLens. For all methods, an early stopping strategy was used, where we stopped training if the RMSE on the validation set increased for 4 successive epochs. Unless otherwise specified, the embedding size is set to 64 by default.
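The early stopping rule above can be sketched as a simple check on the validation-error history; the function name and the example history are illustrative only.

```python
def should_stop(val_rmse_history, patience=4):
    """Early stopping rule used above: stop once the validation RMSE has
    increased for `patience` successive epochs."""
    if len(val_rmse_history) <= patience:
        return False
    recent = val_rmse_history[-(patience + 1):]
    return all(recent[i] < recent[i + 1] for i in range(patience))

print(should_stop([0.40, 0.38, 0.37, 0.375, 0.38, 0.39, 0.40]))  # True
```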
5.2 Study of Bi-Interaction Pooling

We empirically study the Bi-Interaction pooling operation. To avoid other components (e.g., hidden layers) affecting the analysis, we study the NFM-0 model that directly projects the output of Bi-Interaction pooling to the prediction score with no hidden layer. As discussed above, NFM-0 is identical to FM, as the trainable h does not impact the model's expressiveness. We first compare dropout with traditional L2 regularization for preventing model overfitting, and then explore the impact of batch normalization.
Figures 5A and 5B show the validation error of NFM-0 with respect to the dropout ratio on the Bi-Interaction layer and the L2 regularization on feature embeddings for the Frappe and MovieLens datasets respectively. The performance of linear regression (LR) is also shown to benchmark the performance of prediction that does not consider feature interactions. First, LR leads to very poor performance, highlighting the importance of modelling interactions between sparse features for prediction. Second, we see that both L2 regularization and dropout can effectively prevent overfitting and improve NFM-0's generalization to unseen data. Between the two strategies, dropout offers better performance. Specifically, on Frappe, using a dropout ratio of 0.3 leads to the lowest validation error of 0.3562, which is significantly better than that of L2 regularization (0.3799). One reason might be that enforcing L2 regularization only suppresses the values of parameters numerically in each update, while using dropout can be seen as ensembling multiple sub-models, which can be more effective. Considering the generality of FM, which subsumes many factorization models, we believe this is an interesting new finding, meaning that dropout can also be an effective strategy to address overfitting of linear latent-factor models.
Figures 6A and 6B show training and validation error of each epoch of NFM-0 with and without dropout for Frappe and MovieLens datasets respectively. Both datasets show that with a dropout ratio of 0.3, although the training error is higher, the validation error becomes lower. This demonstrates the ability of dropout in preventing overfitting and as such, better generalization can be achieved.
Figures 7A and 7B show the training error of each epoch of NFM-0 with and without batch normalization on the Bi-Interaction layer for the Frappe and MovieLens datasets respectively. Dropout is enabled with a ratio of 0.3, and the learning rate is set to 0.02. Focusing on the training error, we can see that batch normalization (BN) leads to faster convergence; on Frappe, when BN is applied, the training error of epoch 20 is even lower than that of epoch 60 without BN, and the validation error indicates that the lower training error is not a result of overfitting. It has been shown [K. He, X. Zhang, S. Ren, and J. Sun, 'Deep residual learning for image recognition', In CVPR, pages 770-778, 2016; and S. Ioffe and C. Szegedy, 'Batch normalization: Accelerating deep network training by reducing internal covariate shift', In ICML, pages 448-456, 2015] that by addressing the internal covariate shift with BN, the model's generalization ability can be improved. Our result also verifies this point, where using BN leads to a slight improvement (although the improvement is not statistically significant). Furthermore, we notice that BN makes the learning less stable, as evidenced by the larger performance fluctuation of the 'Dropout+BN' lines. This is caused by our use of dropout and BN together, as randomly dropping neurons can change the input distribution normalized by BN.
5.3 Impact of Hidden Layers
The hidden layers of NFM play a pivotal role in capturing higher-order interactions between features. To explore their impact, we first add one hidden layer above the Bi-Interaction layer and slightly overuse the name NFM to indicate this specific model (referred to as NFM-1 below). To ensure the same model capability as NFM-0, we set the size of the hidden layer to be the same as the embedding size.
Figures 8A and 8B show the validation error of NFM with respect to different activation functions and dropout ratios for the hidden layer for the Frappe and MovieLens datasets respectively. The performance of LibFM and NFM-0 is also shown for benchmarking purposes. First and foremost, we observe that by using non-linear activations, NFM's performance is improved by a large margin: compared to NFM-0, which performs similarly to LibFM, the relative improvement is 11.3% and 5.2% for Frappe and MovieLens, respectively. This highlights the importance of modelling higher-order feature interactions for quality prediction. Among the different non-linear activation functions, there is no obvious winner. Second, when we use the identity function as the activation function, i.e., the hidden layer performs a linear transformation, NFM does not perform as well. This provides evidence of the necessity of learning higher-order feature interactions with non-linear functions.
Table 2: NFM with respect to different number of hidden layers
To see whether a deeper NFM can further improve the performance, we stack more hidden layers above the Bi-Interaction layer. The activation function is ReLU, which has been shown to have good performance for deep networks. As it is computationally expensive to tune the size and dropout ratio for each hidden layer individually, we use the same setting for all layers and tune them in the same way as NFM-1. As can be seen from Table 2, when we stack more layers, the performance is not further improved; the best performance is obtained when we use one hidden layer only. We have also explored other designs for the hidden layers, such as the tower structure and residual units; however, the performance is still not improved. We think the reason is that the Bi-Interaction layer has already encoded informative second-order feature interactions, based on which a simple non-linear function is sufficient to capture higher-order interactions. To verify this, we replaced the Bi-Interaction layer with concatenation (which leads to the same architecture as Wide&Deep) and found that the performance can be gradually improved with more hidden layers (up to three); however, the best performance achievable is still inferior to that of NFM-1. This demonstrates the value of using a more informative operation for the low-level layers, which can ease the burden of the higher-level layers in learning meaningful information. As a result, a deep structure is not necessarily required.
It is known that parameter initialization can greatly affect the convergence and performance of deep neural networks (DNNs), since gradient-based methods can only find local optima for DNNs. As has been shown above, initializing with feature embeddings learned by FM can significantly enhance Wide&Deep and DeepCross.
Figures 9A and 9B show the state of each epoch of NFM-1 with and without pre-training for the Frappe and MovieLens datasets respectively. First, we can see that by using FM embeddings as pre-training, NFM exhibits extremely fast convergence: on both datasets, with only 5 epochs, the performance is on par with 40 epochs of NFM trained from scratch (with BN enabled). Second, we find that pre-training does not improve NFM's final performance, and a random initialization can achieve a result that is slightly better than that with pre-training. This demonstrates the robustness of NFM, which is relatively insensitive to parameter initialization. In contrast to the huge impact of pre-training on Wide&Deep and DeepCross (cf. Figures 5A and 5B), which improves both their convergence and final performance, we draw the conclusion that NFM is much easier to train and optimize, which is due largely to the informative and effective Bi-Interaction pooling operation.
5.4 Performance Comparison

We now compare with state-of-the-art methods. For NFM, we use one hidden layer with the rectified linear unit (ReLU) as the activation function, since the baselines DeepCross and Wide&Deep also choose ReLU in their original papers. Note that the most important hyper-parameter for NFM is the dropout ratio; we use 0.5 for the Bi-Interaction layer and tune the value for the hidden layer.
Figures 10A and 10B are graphs showing the test RMSE with respect to different numbers of latent factors (i.e., embedding sizes) for the Frappe and MovieLens datasets respectively, where Wide&Deep and DeepCross are pre-trained with FM to better explore the potential of the two methods.
Table 3 shows the concrete scores obtained on factors 128 and 256, and the number of model parameters of each method. The scores of Wide&Deep and DeepCross without pre-training are also shown.
Table 3: Test error and number of trainable parameters for different methods on latent factors 128 and 256. M denotes "million"; * and ** denote statistical significance for p < 0.05 and p < 0.01, respectively.
We have the following three key observations.
First and foremost, NFM consistently achieves the best performance on both datasets with the fewest model parameters of all methods apart from FM. This demonstrates the effectiveness and rationality of NFM in modelling higher-order and non-linear feature interactions for prediction with sparse data. The next best performance is achieved by Wide&Deep, which uses a 3-layer MLP to learn feature interactions. We also tried deeper layers for Wide&Deep; however, the performance was not improved. This further verifies the utility of using the informative Bi-Interaction pooling at the low level. Second, we observe that HOFM shows a slight improvement over FM, with 1.45% and 1.04% average improvement on Frappe and MovieLens, respectively. This sheds light on the limitation of FM in modelling only second-order feature interactions, and thus the usefulness of modelling higher-order interactions. Meanwhile, the large performance gap between HOFM and NFM reflects the value of modelling higher-order interactions in a non-linear way, since HOFM models higher-order interactions linearly and uses many more parameters than NFM.
Lastly, the relatively weak performance of DeepCross reveals that deeper models are not always better, as DeepCross is the deepest method among all baselines, utilizing a 10-layer network. On Frappe, DeepCross only achieves performance comparable with the shallow FM model, while it underperforms FM significantly on MovieLens. We believe that the reasons are optimization difficulties and overfitting (as evidenced by the worse performance on factors 128 and 256). To conclude the performance study, we summarize the key advantages of our NFM over the baseline methods:
- NFM is more expressive than FM by modelling higher-order and non-linear feature interactions with only k² more parameters.
- Compared to HOFM, NFM models higher-order interactions in a non-linear way with much fewer parameters.
- Compared to existing deep learning methods Wide&Deep and DeepCross, NFM captures more informative interactions at the lower level and does not require a deep structure to predict well.
6. Conclusions

In this disclosure, we proposed a novel neural network model, NFM, which brings together the effectiveness of linear factorization machines and the strong representation ability of non-linear neural networks for sparse data prediction. The key to NFM's architecture is the newly proposed Bi-Interaction operation, based on which we allow a neural network model to learn more informative feature interactions at the lower level. Extensive experiments on two real-world datasets show that with one hidden layer only, NFM significantly outperforms FM, higher-order FM, and the state-of-the-art deep learning approaches Wide&Deep and DeepCross.

The work bridges the gap between linear models and deep learning. Linear models, such as various factorization methods, have been shown to be effective for information retrieval (IR) and data mining (DM) tasks and are easy to interpret. However, their limited expressiveness may hinder performance when modelling real-world data with complex inherent patterns. While deep learning models have exhibited great expressive power and yielded immense success in speech processing and computer vision, their performance is still unsatisfactory for IR tasks such as collaborative filtering. In our view, one reason is that most data in IR and DM tasks are naturally sparse, and to date there is still a lack of effective deep learning solutions for prediction with sparse data. By connecting neural networks with FM, one of the most powerful linear models for supervised learning, we are able to design a simple yet effective deep learning solution for sparse data prediction.
Whilst the foregoing description has described exemplary embodiments, it will be understood by those skilled in the art that many variations of the embodiments can be made within the scope and spirit of the present invention.

Claims

1. A predictive analytical method comprising:
receiving a set of predictor variables as an input feature vector comprising a plurality of features;
projecting each feature of the feature vector onto a dense vector
representation to obtain a set of embedding vectors representing the input feature vector in an embedding space;
converting the set of embedding vectors into a bi-interaction pooling vector that encodes second-order interactions between features of the feature vector in the embedding space;
inputting the bi-interaction pooling vector into a hidden layer stack, the hidden layer stack comprising at least one hidden layer of neural network nodes; and
transforming an output vector of the hidden layer stack into a prediction score.
2. A method according to claim 1, wherein at least one layer of the hidden layer stack has a non-linear activation function.
3. A method according to claim 2, wherein the non-linear activation function is a sigmoid, hyperbolic tangent, or rectifier function.
4. A method according to any preceding claim wherein converting the set of embedding vectors comprises performing a pooling operation on the set of embedding vectors.
5. A method according to any preceding claim wherein converting the set of embedding vectors into the bi-interaction pooling vector comprises calculating an element-wise product of embedding vectors of the set of embedding vectors.
6. A method according to any preceding claim wherein the input feature vector is a sparse vector.
7. A method according to claim 6 wherein the input feature vector is one-hot encoded.
8. A ranking method comprising the predictive analytical method of any preceding claim.
9. A supervised machine learning method comprising:
receiving training data comprising a plurality of sets of predictor variables and target values;
for each set of predictor variables:
projecting each feature of the feature vector onto a dense vector representation to obtain a set of embedding vectors representing the input feature vector in an embedding space;
converting the set of embedding vectors into a bi-interaction pooling vector that encodes second-order interactions between features of the feature vector in the embedding space by performing a pooling operation on the set of embedding vectors;
inputting the bi-interaction pooling vector into a hidden layer stack, the hidden layer stack comprising at least one hidden layer of neural network nodes; and transforming an output vector of the hidden layer stack into a prediction score; and
optimizing parameters of the pooling operation and the hidden layer stack by optimizing an objective function.
10. A method according to claim 9, wherein optimizing the objective function comprises carrying out stochastic gradient descent.
11. A computer readable medium carrying processor executable instructions which when executed on a processor cause the processor to carry out a method according to any one of claims 1 to 10.
12. A data processing system comprising a processor and a data storage device, the data storage device storing computer executable instructions operable by the processor to:
receive a set of predictor variables as an input feature vector comprising a plurality of features; project each feature of the feature vector onto a dense vector representation to obtain a set of embedding vectors representing the input feature vector in an embedding space;
convert the set of embedding vectors into a bi-interaction pooling vector that encodes second-order interactions between features of the feature vector in the embedding space;
input the bi-interaction pooling vector into a hidden layer stack, the hidden layer stack comprising at least one hidden layer of neural network nodes; and
transform an output vector of the hidden layer stack into a prediction score.
13. A data processing system according to claim 12, wherein at least one layer of the hidden layer stack has a non-linear activation function.
14. A data processing system according to claim 13, wherein the non-linear activation function is a sigmoid, hyperbolic tangent, or rectifier function.
15. A data processing system according to any one of claims 12 to 14, wherein the data storage device comprises instructions operable by the processor to convert the set of embedding vectors into a bi-interaction pooling vector by performing a pooling operation on the set of embedding vectors.
16. A data processing system according to any one of claims 12 to 15, wherein the data storage device comprises instructions operable by the processor to convert the set of embedding vectors into the bi-interaction pooling vector by calculating an element-wise product of embedding vectors of the set of embedding vectors.
17. A data processing system according to any one of claims 12 to 16, wherein the input feature vector is a sparse vector and/or is one-hot encoded.
18. A data processing system according to any one of claims 12 to 17, the data storage device further comprising instructions to perform a ranking operation using the prediction score.
19. A data processing system according to any one of claims 12 to 18, the data storage device storing computer executable instructions operable by the processor to perform a machine learning method comprising:
receiving training data comprising a plurality of sets of predictor variables and target values;
for each set of predictor variables:
projecting each feature of the feature vector onto a dense vector representation to obtain a set of embedding vectors representing the input feature vector in an embedding space;
converting the set of embedding vectors into a bi-interaction pooling vector that encodes second-order interactions between features of the feature vector in the embedding space by performing a pooling operation on the set of embedding vectors;
inputting the bi-interaction pooling vector into a hidden layer stack, the hidden layer stack comprising at least one hidden layer of neural network nodes; and transforming an output vector of the hidden layer stack into a prediction score; and
optimizing parameters of the pooling operation and the hidden layer stack by optimizing an objective function.
20. A data processing system according to claim 19, wherein optimizing the objective function comprises carrying out stochastic gradient descent.
PCT/SG2018/050234 2017-05-19 2018-05-15 Predictive analysis methods and systems Ceased WO2018212711A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
SG10201704120Q 2017-05-19
SG10201704120Q 2017-05-19

Publications (1)

Publication Number Publication Date
WO2018212711A1 true WO2018212711A1 (en) 2018-11-22

Family

ID=64274541

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/SG2018/050234 Ceased WO2018212711A1 (en) 2017-05-19 2018-05-15 Predictive analysis methods and systems

Country Status (1)

Country Link
WO (1) WO2018212711A1 (en)

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110245310A (en) * 2019-03-06 2019-09-17 腾讯科技(深圳)有限公司 A kind of behavior analysis method of object, device and storage medium
CN110490637A (en) * 2019-07-15 2019-11-22 北京三快在线科技有限公司 Recommended method, device, electronic equipment and the readable storage medium storing program for executing of commodity group
CN110689937A (en) * 2019-09-05 2020-01-14 郑州金域临床检验中心有限公司 Coding model training method, system and equipment and detection item coding method
CN110728541A (en) * 2019-10-11 2020-01-24 广州市丰申网络科技有限公司 Information stream media advertisement creative recommendation method and device
CN111340522A (en) * 2019-12-30 2020-06-26 支付宝实验室(新加坡)有限公司 Resource recommendation method, device, server and storage medium
CN111429175A (en) * 2020-03-18 2020-07-17 电子科技大学 Method for predicting click conversion under sparse characteristic scene
CN111553766A (en) * 2020-04-28 2020-08-18 苏州市职业大学 Commodity recommendation method, commodity recommendation device, commodity recommendation equipment and commodity recommendation medium
CN111737578A (en) * 2020-06-22 2020-10-02 陕西师范大学 Recommendation method and system
CN111967949A (en) * 2020-09-22 2020-11-20 武汉博晟安全技术股份有限公司 Leaky-Conv & Cross-based safety course recommendation engine sorting algorithm
CN113111575A (en) * 2021-03-30 2021-07-13 西安交通大学 Combustion engine degradation evaluation method based on depth feature coding and Gaussian mixture model
CN113255977A (en) * 2021-05-13 2021-08-13 江西鑫铂瑞科技有限公司 Intelligent factory production equipment fault prediction method and system based on industrial internet
CN113724092A (en) * 2021-08-20 2021-11-30 同盾科技有限公司 Cross-feature federated marketing modeling method and device based on FM and deep learning
CN113887694A (en) * 2020-07-01 2022-01-04 复旦大学 A CTR Prediction Model Based on Feature Representation with Attention Mechanism
CN113918764A (en) * 2020-12-31 2022-01-11 浙江大学 A movie recommendation system based on cross-modal fusion
CN113988178A (en) * 2021-10-27 2022-01-28 广东电网有限责任公司 Method and device for detecting electricity stealing users of low-voltage distribution network
CN114297477A (en) * 2021-12-09 2022-04-08 中国科学技术大学 Intelligent financial management method to automatically identify potential customers
CN115206458A (en) * 2022-06-21 2022-10-18 北京诺道认知医学科技有限公司 Method and device for predicting plasma concentration of cyclosporin a
CN117556753A (en) * 2024-01-11 2024-02-13 联和存储科技(江苏)有限公司 Method, device, equipment and storage medium for analyzing energy consumption of storage chip
CN118964626A (en) * 2024-10-17 2024-11-15 烟台大学 A data anomaly detection method and system based on quantum graph federated learning
CN120064889A (en) * 2025-04-25 2025-05-30 阳谷新太平洋电缆有限公司 Cable fault big data early warning system

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106469335A (en) * 2016-08-31 2017-03-01 北京百度网讯科技有限公司 A kind of film box office Forecasting Methodology and device

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
CHENG H.-T. ET AL.: "Wide & Deep Learning for Recommender Systems", PROCEEDINGS OF THE 1 ST WORKSHOP ON DEEP LEARNING FOR RECOMMENDER SYSTEMS, 15 September 2016 (2016-09-15), pages 7 - 10, XP058280488, [retrieved on 20180711] *
QU Y. ET AL.: "Product-based Neural Network for User Response Prediction", IEEE 16TH INTERNATIONAL CONFERENCE ON DATA MINING, 15 December 2016 (2016-12-15), pages 1149 - 1154, XP033056098, [retrieved on 20180711] *
RENDLE S .: "Factorization Machines", IEEE INTERNATIONAL CONFERENCE ON DATA MINING, 17 December 2010 (2010-12-17), pages 995 - 1000, XP055562823, [retrieved on 20180709] *
ZHANG W. ET AL.: "Deep Learning over Multi-field Categorical Data - A Case Study on User Response Prediction", 38TH EUROPEAN CONFERENCE ON INFORMATION RETRIEVAL, 23 March 2016 (2016-03-23), pages 45 - 57, XP047359078, [retrieved on 20180711] *

Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110245310A (en) * 2019-03-06 2019-09-17 腾讯科技(深圳)有限公司 A kind of behavior analysis method of object, device and storage medium
CN110245310B (en) * 2019-03-06 2023-10-13 腾讯科技(深圳)有限公司 Object behavior analysis method, device and storage medium
CN110490637A (en) * 2019-07-15 2019-11-22 北京三快在线科技有限公司 Recommended method, device, electronic equipment and the readable storage medium storing program for executing of commodity group
CN110689937A (en) * 2019-09-05 2020-01-14 郑州金域临床检验中心有限公司 Coding model training method, system and equipment and detection item coding method
CN110728541A (en) * 2019-10-11 2020-01-24 广州市丰申网络科技有限公司 Information stream media advertisement creative recommendation method and device
CN110728541B (en) * 2019-10-11 2024-01-23 广州市丰申网络科技有限公司 Information streaming media advertising creative recommendation method and device
CN111340522A (en) * 2019-12-30 2020-06-26 支付宝实验室(新加坡)有限公司 Resource recommendation method, device, server and storage medium
CN111340522B (en) * 2019-12-30 2024-03-08 支付宝实验室(新加坡)有限公司 Resource recommendation method, device, server and storage medium
CN111429175B (en) * 2020-03-18 2022-05-27 电子科技大学 Method for predicting click conversion under sparse characteristic scene
CN111429175A (en) * 2020-03-18 2020-07-17 电子科技大学 Method for predicting click conversion under sparse characteristic scene
CN111553766A (en) * 2020-04-28 2020-08-18 苏州市职业大学 Commodity recommendation method, commodity recommendation device, commodity recommendation equipment and commodity recommendation medium
CN111553766B (en) * 2020-04-28 2023-09-15 苏州市职业大学 Commodity recommendation method, device, equipment and medium
CN111737578A (en) * 2020-06-22 2020-10-02 陕西师范大学 Recommendation method and system
CN111737578B (en) * 2020-06-22 2024-04-02 陕西师范大学 Recommendation method and system
CN113887694A (en) * 2020-07-01 2022-01-04 复旦大学 A CTR Prediction Model Based on Feature Representation with Attention Mechanism
CN111967949A (en) * 2020-09-22 2020-11-20 武汉博晟安全技术股份有限公司 Leaky-Conv & Cross-based safety course recommendation engine sorting algorithm
CN113918764A (en) * 2020-12-31 2022-01-11 浙江大学 A movie recommendation system based on cross-modal fusion
CN113111575A (en) * 2021-03-30 2021-07-13 西安交通大学 Combustion engine degradation evaluation method based on depth feature coding and Gaussian mixture model
CN113111575B (en) * 2021-03-30 2023-03-31 西安交通大学 Combustion engine degradation evaluation method based on depth feature coding and Gaussian mixture model
CN113255977A (en) * 2021-05-13 2021-08-13 江西鑫铂瑞科技有限公司 Intelligent factory production equipment fault prediction method and system based on industrial internet
CN113724092A (en) * 2021-08-20 2021-11-30 同盾科技有限公司 Cross-feature federated marketing modeling method and device based on FM and deep learning
CN113724092B (en) * 2021-08-20 2024-06-07 同盾科技有限公司 Cross-feature federal marketing modeling method and device based on FM and deep learning
CN113988178A (en) * 2021-10-27 2022-01-28 广东电网有限责任公司 Method and device for detecting electricity stealing users of low-voltage distribution network
CN114297477A (en) * 2021-12-09 2022-04-08 中国科学技术大学 Intelligent financial management method to automatically identify potential customers
CN115206458A (en) * 2022-06-21 2022-10-18 北京诺道认知医学科技有限公司 Method and device for predicting plasma concentration of cyclosporin a
CN117556753A (en) * 2024-01-11 2024-02-13 联和存储科技(江苏)有限公司 Method, device, equipment and storage medium for analyzing energy consumption of storage chip
CN117556753B (en) * 2024-01-11 2024-03-19 联和存储科技(江苏)有限公司 Method, device, equipment and storage medium for analyzing energy consumption of storage chip
CN118964626A (en) * 2024-10-17 2024-11-15 烟台大学 A data anomaly detection method and system based on quantum graph federated learning
CN120064889A (en) * 2025-04-25 2025-05-30 阳谷新太平洋电缆有限公司 Cable fault big data early warning system
CN120064889B (en) * 2025-04-25 2025-07-15 阳谷新太平洋电缆有限公司 A cable fault big data early warning system

Similar Documents

Publication Publication Date Title
WO2018212711A1 (en) Predictive analysis methods and systems
He et al. Neural factorization machines for sparse predictive analytics
Xiao et al. Graph neural networks in node classification: survey and evaluation
Zhou et al. Edge intelligence: Paving the last mile of artificial intelligence with edge computing
US10248664B1 (en) Zero-shot sketch-based image retrieval techniques using neural networks for sketch-image recognition and retrieval
Lu et al. Brain intelligence: go beyond artificial intelligence
CA2997797C (en) Bilstm-siamese network based classifier for identifying target class of queries and providing responses thereof
Li et al. A novel double incremental learning algorithm for time series prediction
US12321841B2 (en) Unsupervised cross-domain data augmentation for long-document based prediction and explanation
Chowdhury et al. Few-shot class-incremental learning for 3d point cloud objects
US20250225398A1 (en) Data processing method and related apparatus
US12493819B2 (en) Utilizing machine learning models to generate initiative plans
Chowdhury et al. Qsfvqa: A time efficient, scalable and optimized vqa framework
Sharma et al. Transfer learning and its application in computer vision: A review
CN116561591A (en) Training method for semantic feature extraction model of scientific and technological literature, feature extraction method and device
Ma et al. Acceleration algorithms in gnns: A survey
US11907673B1 (en) Enhancing chatbot recognition of user intent through graph analysis
Zhang Distributed SVM face recognition based on Hadoop
Zhang et al. Review on deep learning in feature selection
EP4586148A1 (en) Performance optimization predictions related to an entity dataset based on a modified version of a predefined feature set for a candidate machine learning model
He et al. Position: Beyond Euclidean--Foundation Models Should Embrace Non-Euclidean Geometries
Liu et al. GSL-Mash: Enhancing Mashup Creation Service Recommendations Through Graph Structure Learning
Fei et al. Active learning methods with deep Gaussian processes
Hanbali et al. Advanced machine learning and deep learning approaches for fraud detection in mobile money transactions
Lambert et al. Flexible recurrent neural networks

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18801924

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18801924

Country of ref document: EP

Kind code of ref document: A1