CN108804577A

CN108804577A - Method for estimating the interest degree of information tags

Info

Publication number: CN108804577A
Application number: CN201810505164.8A
Authority: CN
Inventors: 常剑; 孙宇; 张洪刚; 徐彬; 高珊
Original assignee: Beijing University of Posts and Telecommunications; China Unicom Online Information Technology Co Ltd
Current assignee: Beijing University of Posts and Telecommunications; China Unicom Online Information Technology Co Ltd
Priority date: 2018-05-24
Filing date: 2018-05-24
Publication date: 2018-11-13
Anticipated expiration: 2038-05-24
Also published as: CN108804577B

Abstract

The invention discloses a method for estimating the interest degree of information tags. The method includes: creating and maintaining a tagged candidate information base; obtaining a user attribute-information tag interest degree vector from user demographic information; obtaining and preprocessing the historical behavior data of multiple users within a preset time period, and training a deep learning model on it to obtain a trained deep learning model; obtaining and preprocessing the current user's historical behavior data, and using the trained model to compute the current user's user behavior-information tag interest degree vector; and computing a user-information tag interest degree vector from the current user's user attribute-information tag interest degree vector and user behavior-information tag interest degree vector, from which the several information tags the user is most interested in are determined. The invention solves the cold-start problem of user interest estimation, avoids the low information quality that often results from selecting information directly from the Internet, reduces the computation required for user interest estimation, and is applicable to scenarios in which each sample carries multiple tags.

Description

A Method for Predicting the Interest Degree of Information Tags

Technical Field

The invention relates to a method for estimating the interest degree of information tags and belongs to the field of computing technology.

Background

With the rapid development of the Internet, the amount of information online is enormous and growing explosively, but its quality is uneven. If user interest estimation is performed directly on all acquired information, poor-quality information is likely to be pushed to users, degrading the user experience. Estimating user interest over all information also increases the computational load of the algorithm and wastes computing resources.

Although users browse different information items, the tags attached to those items can usually be grouped into a few major categories, and a user's interest in a given information tag lasts far longer than the user's interest in any single item. For example, a user who has read an item tagged "Finance" will rarely re-read that same item, but remains interested in other items tagged "Finance". Finding the information tags a user is interested in, by estimating the user's interest degree in information tags, is therefore of great significance for research on and applications of personalized information push.

A common problem in the user interest estimation methods currently used in practice is the cold-start problem: how to estimate a user's interest when the user has not yet browsed any information.

One prior-art approach performs the estimation with a recurrent neural network. The information tags corresponding to the items a user has browsed are fed into the recurrent neural network in sequence, both to train the network and to predict the information tags the user is interested in. This approach can exploit the temporal features of the user's historical behavior, so it performs well when training samples are sufficient and the network parameters are well tuned. However, it has the following defects:

1. User interest cannot be estimated when the user has not browsed any information.

2. The user's demographic information, such as gender, age, and region, cannot be used.

Another prior-art approach obtains the keywords of each information item with a method based on TF-IDF (term frequency-inverse document frequency), and derives the user's interest degree in each keyword through statistical analysis of the keywords in the items the user has browsed.

TF-IDF is a statistical method. For a given information item, the frequency with which a word appears in the item reflects the word's importance: the more often the word appears in the item, the more important it is to that item, but its importance decreases as its frequency across all items increases. That is, if a word appears frequently in the current item and rarely in other items, the word is considered to represent the item well and is taken as a keyword of the item. Statistical analysis of the keywords in the items a user has browsed yields the user's interest degree in each keyword, which can then be used for keyword-based personalized information push. However, this approach has the following defects:

1. For every word in every item, the number of occurrences in the current item and in all items must be counted, which is computationally expensive.

2. The keywords obtained are spread too widely, and each keyword may cover only a very narrow domain, which makes it hard to control the range of information the user is estimated to be interested in. For example, if TF-IDF yields "Lin Daiyu" as the keyword of an item the user browsed and the user's interest is estimated from that keyword, subsequent pushes are likely to concentrate on items containing "Lin Daiyu" and fail to extend to "Dream of the Red Chamber" or "Chinese literature", which harms both interest estimation and push quality.

3. Even if the keywords of an item score highly in user interest estimation, the item may still fail to interest the user because of its poor quality.

4. User interest cannot be estimated when the user has not browsed any information.

5. All items a user has browsed are given equal weight in interest estimation, so the effect of browsing order on the estimate at the current moment is not reflected, even though recently browsed items tend to have a greater influence on the current estimate.

6. The user's demographic information, such as gender, age, and region, cannot be used.
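For reference, the TF-IDF weighting described above can be sketched as follows. This is a minimal illustration, assuming pre-tokenized items and the plain variant idf = log(N / df); the function name `tf_idf` and the sample documents are hypothetical, and production systems often use smoothed variants.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: score} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            # Term frequency in this document times inverse document frequency.
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

# Hypothetical tokenized items.
docs = [
    ["lin", "daiyu", "garden", "poem", "lin", "daiyu"],
    ["market", "stock", "poem"],
    ["market", "bank", "stock"],
]
scores = tf_idf(docs)
top_keyword = max(scores[0], key=scores[0].get)
```

Words that are frequent in the first item and absent elsewhere ("lin", "daiyu") score highest, while "poem", which also occurs in another item, scores lower. The sketch also makes the first defect visible: every word of every item has to be counted.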

A further prior-art approach estimates user interest with gradient boosting decision trees. A gradient boosting decision tree (GBDT) is a machine learning method in which multiple regression trees, built iteratively, make a joint decision. A GBDT consists of multiple regression trees; each regression tree is fitted by learning from the results and residuals of all preceding trees, where the residual is the true value minus the predicted value. The results of all regression trees are summed to give the final GBDT result. This approach can use both the user's demographic information and the tag information of the items the user has browsed. However, it has the following defects:

1. GBDT is inherently suited to regression problems, or to binary classification by applying a threshold. In estimating user interest in information tags, the tag library contains many tags and each item often carries more than one tag. Each GBDT computation yields the user's estimated interest degree for only one tag, so obtaining the user's interest degree for every tag requires running GBDT separately for each tag. The computation is m times that of a binary-classification GBDT (m being the total number of tags in the information tag library), which is expensive.

2. All items a user has browsed are given equal weight in interest estimation, so the effect of browsing order on the estimate at the current moment is not reflected, even though recently browsed items tend to have a greater influence on the current estimate.
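The residual-fitting idea behind GBDT described above can be illustrated with a small sketch. It assumes squared-error loss, a single numeric feature, and one-split trees (stumps); the names `fit_stump` and `fit_gbdt`, the learning rate, and the toy data are hypothetical and are not taken from the source.

```python
def fit_stump(xs, targets):
    """Fit a one-split regression tree (stump) that minimizes squared error."""
    best = None
    # Skip the largest value so the right-hand side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def fit_gbdt(xs, ys, n_trees=20, learning_rate=0.5):
    """Each new stump is fitted to the residuals left by the previous ones."""
    base = sum(ys) / len(ys)
    trees = []
    predictions = [base] * len(xs)
    for _ in range(n_trees):
        # Residual = true value minus current prediction.
        residuals = [y - p for y, p in zip(ys, predictions)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        predictions = [p + learning_rate * tree(x)
                       for p, x in zip(predictions, xs)]
    # The final result is the sum of all tree outputs on top of the base value.
    return lambda x: base + learning_rate * sum(tree(x) for tree in trees)

def mean_squared_error(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Toy data: interest in ONE tag as a function of one numeric feature.
xs = [1, 2, 3, 4, 5, 6]
ys = [0.1, 0.2, 0.8, 0.9, 0.3, 0.2]
error_one_tree = mean_squared_error(fit_gbdt(xs, ys, n_trees=1), xs, ys)
error_many_trees = mean_squared_error(fit_gbdt(xs, ys, n_trees=20), xs, ys)
```

Note that `fit_gbdt` returns a single scalar per input. Covering all m tags would mean fitting m such models, which is the m-fold cost described in the first defect.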

Summary of the Invention

In view of the above defects, the invention provides a method for estimating the interest degree of information tags. By building a user attribute-information tag interest degree vector, it solves the cold-start problem of user interest estimation; by building a tagged candidate information base, it avoids the low information quality that often results from selecting information directly from the Internet and reduces the computation required for user interest estimation. The method is applicable to scenarios in which each sample carries multiple tags.

To achieve the above object, the invention is implemented through the following technical solution:

The invention provides a method for estimating the interest degree of information tags, the method comprising:

creating and maintaining a tagged candidate information base;

obtaining a user attribute-information tag interest degree vector from user demographic information;

obtaining and preprocessing the historical behavior data of multiple users within a preset time period, and inputting it into a deep learning model for training to obtain a trained deep learning model;

obtaining and preprocessing the current user's historical behavior data, and using the trained deep learning model to compute the current user's user behavior-information tag interest degree vector;

computing a user-information tag interest degree vector from the current user's user attribute-information tag interest degree vector and user behavior-information tag interest degree vector, and finally determining the several information tags the user is most interested in.

Further, the step of creating and maintaining a tagged candidate information base comprises:

selecting from a preset information tag library the one or more tags that best match the content of an information item as the tags of that item, and adding the tagged item to the tagged candidate information base; representing each item in the candidate information base, according to its information tags, as an m-dimensional information vector, where m is the total number of tags in the preset information tag library; when the item carries tag T_j, the j-th dimension of the m-dimensional information vector is 1, and otherwise the j-th dimension is 0;

regularly maintaining the tagged candidate information base by adding new items and removing items that are no longer timely.

Further, the user demographic information includes, but is not limited to, one or more of the obtainable gender, age, and/or region information by which users are divided into several groups.

Further, obtaining the user attribute-information tag interest degree vector from user demographic information comprises:

The user attribute-information tag interest degree H_ij of the i-th group G_i for the j-th information tag T_j is:

The value of H_ij lies in [0, 1].

Further, obtaining and preprocessing the historical behavior data of multiple users within a preset time period, and inputting it into a deep learning model for training to obtain a trained deep learning model, comprises:

obtaining the information vector corresponding to each item browsed in the historical behavior data of multiple users within the preset time period, and inputting the information vectors into a recurrent neural network model in the chronological order in which the items were browsed, to train the recurrent neural network model and obtain the trained deep learning model.

Further, obtaining and preprocessing the current user's historical behavior data, and using the trained deep learning model to compute the current user's user behavior-information tag interest degree vector, comprises:

obtaining the information vector corresponding to each item browsed in the current user's historical behavior data, and arranging the vectors in chronological order;

inputting the information vectors corresponding to the items in the current user's historical behavior data into the trained deep learning model one by one in chronological order; once all of them have been input, the m-dimensional prediction vector produced by the trained deep learning model is the current user's user behavior-information tag interest degree vector.

Further, computing the user-information tag interest degree vector from the current user's user attribute-information tag interest degree vector and user behavior-information tag interest degree vector, and finally determining the several information tags the user is most interested in, comprises:

from the computed user attribute-information tag interest degree vector and user behavior-information tag interest degree vector of the current user, computing the current user's m-dimensional interest degree vector for the information tags according to the following formula:

V(user, information tag) = (1 - w) * V(user attribute, information tag) + w * V(user behavior, information tag)

where V(user, information tag) is the current user's m-dimensional interest degree vector for the information tags; V(user behavior, information tag) is the user behavior-information tag interest degree vector; V(user attribute, information tag) is the current user's user attribute-information tag interest degree vector; and w is the weight of V(user behavior, information tag) in computing the current user's interest degree vector for the information tags. The value of w must always lie in [0, 1].

Further, w is computed as follows:

w = tanh(a * number of information items browsed by the current user within the preset time period T)

where tanh is the hyperbolic tangent function and a is a constant greater than 0.
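The weighted combination and the tanh weight above can be sketched in a few lines. This is an illustrative sketch only: the value a = 0.1, the function names, and the three-tag example vectors are assumptions chosen for the demonstration, not values given in the source.

```python
import math

def behavior_weight(n_browsed, a=0.1):
    """w = tanh(a * n_browsed); a > 0 is a tunable constant (0.1 is assumed)."""
    return math.tanh(a * n_browsed)

def combine(attribute_vec, behavior_vec, n_browsed, a=0.1):
    """V(user) = (1 - w) * V(user attribute) + w * V(user behavior), per dimension."""
    w = behavior_weight(n_browsed, a)
    return [(1 - w) * h + w * b for h, b in zip(attribute_vec, behavior_vec)]

def top_tags(interest_vec, tag_names, k=3):
    """Return the k tag names with the highest interest degree."""
    ranked = sorted(zip(tag_names, interest_vec), key=lambda p: p[1], reverse=True)
    return [name for name, _ in ranked[:k]]

# Hypothetical vectors over three tags.
tags = ["finance", "sports", "travel"]
attribute_vec = [0.6, 0.1, 0.3]   # shared by the user's demographic group
behavior_vec = [0.1, 0.8, 0.2]    # predicted from the user's own history

cold_start = combine(attribute_vec, behavior_vec, n_browsed=0)
active_user = combine(attribute_vec, behavior_vec, n_browsed=50)
```

With no browsing history w is 0, so the result equals the group vector. After 50 items w is close to 1 and the behavior vector dominates, which shifts the top tag from "finance" to "sports" in this example.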

The method for estimating the interest degree of information tags provided by the invention innovatively proposes combining the user attribute-information tag interest degree vector with the user behavior-information tag interest degree vector to obtain the user-information tag interest degree vector. When a user has not yet browsed any information, the user's demographic information is used to find the interest degree of the user's group in the information tags, which avoids the cold-start problem. As the user browses more information, the weight of the user behavior-information tag interest degree vector gradually increases. Because that vector is computed by a deep learning model based on a recurrent neural network, which exploits the temporal features of the user's historical behavior data, the method outperforms commonly used non-deep-learning models when training samples are sufficient and the network parameters are well tuned.

The user attribute-information tag interest degree vector compensates for the cold-start problem that arises when a user has not browsed any information; a deep learning model based on a recurrent neural network alone cannot estimate interest degree in that situation.

The invention builds and maintains a tagged candidate information base, which avoids the low information quality that often results from selecting information directly from the Internet; because the information is screened, the computation required for user interest estimation is also reduced. Once tagged, the screened items can be used in the subsequent estimation of the user's interest degree in information tags.

The invention makes full use of the user's demographic information, the user's historical behavior data, and the temporal order of the items in that data. Building the user attribute-information tag interest degree vector solves the cold-start problem of user interest estimation. Building the user behavior-information tag interest degree vector takes advantage of the strengths of deep learning models.

Building a tagged candidate information base avoids the low information quality that often results from selecting information directly from the Internet, and because the information is screened, the computation required for user interest estimation is reduced.

Items in the candidate information base carry one or more information tags. Many other interest estimation methods support only one tag per sample, whereas in this method the user behavior-information tag interest degree vector is computed by a deep learning model based on a recurrent neural network, whose input and output can directly be vectors carrying multi-tag information. The method is therefore applicable to scenarios in which each sample carries multiple tags.

The hyperbolic tangent function tanh used in computing the user-information tag interest degree vector maps numbers in [0, +∞) to numbers in [0, 1). This is exactly what the method needs in order to combine the user attribute-information tag interest degree vector with the user behavior-information tag interest degree vector while keeping the estimate in every dimension of the resulting user-information tag interest degree vector within [0, 1], so that it reflects the user's interest degree.

Brief Description of the Drawings

FIG. 1 is a flow chart of Embodiment 1 of the method for estimating the interest degree of information tags provided by the invention.

Detailed Description

The technical solution of the invention is described in detail below. It should be noted that the technical solution is not limited to the implementations described in the embodiments; improvements and designs made by those skilled in the art on the basis of the invention, with reference to its technical solution, fall within the scope of protection of the invention.

Embodiment 1

Embodiment 1 of the invention provides a method for estimating the interest degree of information tags, comprising steps S110-S150:

Step S110: create and maintain a tagged candidate information base.

To ensure the quality of the information pushed to users while keeping the algorithm's computation within a reasonable range, a tagged candidate information base is created. Information on the network is screened and high-quality items are selected; for each item, the one or more tags that best match its content are chosen from a preset information tag library as the item's tags, and the tagged item is added to the tagged candidate information base for subsequent operations.

The tags in the preset information tag library are numbered T_1, T_2, ..., T_m, where m is the total number of tags in the library. For reference, in a specific implementation the preset information tag library may be [finance, sports, military, entertainment, life, education, health, technology, culture, travel, other], or may be set according to actual conditions. The tags should be relatively independent of one another, and the division into tags should not be too fine, so that no tag has too few corresponding items to support accurate results in subsequent operations.

Each item in the candidate information base is represented, according to its information tags, as an m-dimensional information vector, where m is the total number of tags in the preset information tag library. When the item carries tag T_j, the j-th dimension of the m-dimensional information vector is 1; otherwise the j-th dimension is 0. For example, if an item carries tags T_1, T_2, and T_5, its m-dimensional information vector is [1, 1, 0, 0, 1, 0, 0, ..., 0, 0].
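The information vector described above is a multi-hot encoding over the tag library. A minimal sketch, assuming the eleven-tag reference library listed above (so m = 11); the function name `encode_item` is hypothetical.

```python
# Reference tag library from the text, in order T_1 ... T_11.
TAG_LIBRARY = [
    "finance", "sports", "military", "entertainment", "life", "education",
    "health", "technology", "culture", "travel", "other",
]

def encode_item(item_tags, tag_library=TAG_LIBRARY):
    """Return the m-dimensional 0/1 vector for an item carrying item_tags."""
    position = {tag: j for j, tag in enumerate(tag_library)}
    vector = [0] * len(tag_library)
    for tag in item_tags:
        vector[position[tag]] = 1  # dimension j is 1 when the item carries T_j
    return vector

# An item carrying T_1, T_2 and T_5 (finance, sports, life).
vector = encode_item(["finance", "sports", "life"])
# → [1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
```

A tag that is not in the library raises `KeyError` in this sketch, which keeps items restricted to the preset library.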

Because information is time-sensitive, the tagged candidate information base must be maintained regularly: new items are added and items that are no longer timely are removed.

Step S120: obtain the user attribute-information tag interest degree vector from user demographic information.

User demographic information includes gender, age, region, and other obtainable information, by which users are divided into several groups. To avoid large errors caused by groups that are divided too finely and therefore contain too few samples, user age can be divided into several bands, for example: 20 and below, 21-30, 31-40, 41-50, 51-60, and 60 and above. For region, users can be divided by province when samples are plentiful; when samples are scarce, the data of several provinces can be merged, for example merging Heilongjiang, Jilin, and Liaoning into "Northeast".
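The grouping rules above can be sketched as a function that maps a user to a group key. The names `age_band` and `group_key` and the merge table are hypothetical. Because the listed bands overlap at age 60, this sketch assumes 60 falls into "51-60" and older users into "60 and above".

```python
# Upper bound (inclusive) and label of each age band from the text.
AGE_BANDS = [(20, "20 and below"), (30, "21-30"), (40, "31-40"),
             (50, "41-50"), (60, "51-60")]

# Optional merge of provinces into larger regions when samples are scarce.
REGION_MERGE = {"Heilongjiang": "Northeast", "Jilin": "Northeast",
                "Liaoning": "Northeast"}

def age_band(age):
    for upper, label in AGE_BANDS:
        if age <= upper:
            return label
    return "60 and above"  # assumption: used for ages above 60

def group_key(gender, age, province, merge_regions=True):
    """Return the (gender, age band, region) tuple that identifies a group."""
    region = REGION_MERGE.get(province, province) if merge_regions else province
    return (gender, age_band(age), region)

key_a = group_key("male", 35, "Beijing")
key_b = group_key("female", 25, "Jilin")
```

Users with the same key belong to the same group and would share one user attribute-information tag interest degree vector.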

Users are divided into several groups according to their demographic information; for example, [male, 31-40, Beijing] is one group and [female, 21-30, Shanghai] is another. The groups are numbered G_1, G_2, ..., G_n. The interest degree H_ij of the i-th group G_i in the j-th information tag T_j is:

The value of H_ij lies in [0, 1]; the larger H_ij is, the greater the interest of the i-th group G_i in the j-th information tag T_j. For each group, an m-dimensional vector is obtained, called the group's user attribute-information tag interest degree vector and denoted V(user attribute, information tag). Its j-th dimension is the group's interest degree in the j-th information tag T_j. For example, the user attribute-information tag interest degree vector of the i-th group is [H_i1, H_i2, ..., H_im], and all users in the group share the same vector.

This method uses user demographic information to divide users into groups. The demographic information used includes, but is not limited to, the user's gender, age, and region, and the way groups are divided can be chosen according to the specific situation. The division examples given here are for reference only.

Step S130: obtain and preprocess the historical behavior data of multiple users within a preset time period, and input it into a deep learning model for training to obtain a trained deep learning model.

The historical behavior data of multiple users within a preset time period is obtained and preprocessed. The preset time period T and the number of users extracted can be set according to the application. For example, with T set to three months and the number of users set to N_user, the historical behavior data of N_user randomly selected users over the past three months is extracted from all current data. To protect user privacy and simplify data processing, the N_user users are numbered 1, 2, 3, ..., N_user. For each user, the information vector corresponding to each item browsed in the user's historical behavior data is obtained, and the vectors are arranged in chronological order, with vectors for items browsed earlier placed before those for items browsed later.
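The preprocessing above can be sketched as grouping a browse log by user, keeping only events inside the time window, renumbering users, and ordering each user's information vectors by time. The log format `(user_id, timestamp, item_id)`, the function name `preprocess`, and the toy data are assumptions; random selection of N_user users is omitted for brevity.

```python
from collections import defaultdict

def preprocess(browse_log, item_vectors, window_start, window_end):
    """Return {user_number: [information vectors in browsing order]}."""
    per_user = defaultdict(list)
    for user_id, timestamp, item_id in browse_log:
        # Keep only events inside the preset time period T.
        if window_start <= timestamp <= window_end and item_id in item_vectors:
            per_user[user_id].append((timestamp, item_id))
    sequences = {}
    # Replace real user ids with numbers 1, 2, 3, ... for privacy.
    for number, user_id in enumerate(sorted(per_user), start=1):
        events = sorted(per_user[user_id], key=lambda event: event[0])
        sequences[number] = [item_vectors[item_id] for _, item_id in events]
    return sequences

# Toy data: three items over a three-tag library, integer timestamps.
item_vectors = {"a": [1, 0, 0], "b": [0, 1, 0], "c": [0, 0, 1]}
browse_log = [("u2", 5, "c"), ("u1", 3, "b"), ("u1", 1, "a"), ("u2", 99, "a")]
sequences = preprocess(browse_log, item_vectors, window_start=0, window_end=10)
```

In this example the event at timestamp 99 falls outside the window and is dropped, and user "u1" becomes user 1 with items "a" then "b".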

本方法采用的深度学习模型是递归神经网络模型，该递归神经网络模型可以为RNN模型及其改进模型，如LSTM等。对于深度学习模型中的各参数先随机初始化，随后根据预处理后的历史行为数据中的用户编号，对于每个用户，按该用户浏览资讯的先后顺序将浏览的资讯对应的m维资讯向量输入递归神经网络模型中进行递归神经网络模型的训练。The deep learning model adopted in this method is a recursive neural network model, and the recurrent neural network model can be an RNN model and its improved models, such as LSTM. For each parameter in the deep learning model, first randomly initialize, and then according to the user number in the preprocessed historical behavior data, for each user, input the m-dimensional information vector corresponding to the browsed information according to the order in which the user browsed the information The recurrent neural network model is trained in the recurrent neural network model.

For the recurrent neural network model, the k-th input for each user is the information vector of the k-th item that user browsed in the preprocessed behavior data. The model then produces a predicted output vector, which is compared with the information vector of the (k+1)-th item the user browsed; the deviation of the network is computed, and the model parameters are corrected continually according to this deviation. Once the information vectors of all items browsed by one user in the preprocessed historical behavior data have been input in turn into the recurrent-neural-network-based deep learning model for training, the information vectors of the items browsed by the next user are input in chronological order into the model being trained, and training continues until the information vectors of the items browsed by all N_user users have been input. At this point the trained deep learning model is obtained. Because recurrent neural network models are already widely used in deep learning, this method does not describe their construction in detail.
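The next-item training scheme can be illustrated by how the input and target vectors are paired. The sketch below only builds the (k, k+1) pairs and evaluates one possible deviation measure; the text does not specify the deviation function, so the mean squared difference used here is an assumption, as are the function names.

```python
def next_item_pairs(sequence):
    """Pair the k-th information vector (input) with the (k+1)-th (target)."""
    return list(zip(sequence[:-1], sequence[1:]))


def squared_deviation(predicted, target):
    """Mean squared difference; an assumed choice of deviation measure."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


# One user's chronologically ordered information vectors (m = 3).
sequence = [[1, 0, 1], [0, 0, 1], [0, 1, 0]]
pairs = next_item_pairs(sequence)

# Deviation between a hypothetical model output and the first target.
deviation = squared_deviation([0.5, 0.0, 1.0], pairs[0][1])
```

A sequence of n items yields n - 1 training pairs, so a user with a single browsed item contributes no pairs.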

The trained deep learning model estimates a user's interest in information tags when that user's preprocessed historical behavior data is input into it in the chronological order in which the items were browsed. The model outputs an m-dimensional prediction vector whose j-th component is the model's predicted degree of interest of the user in information tag T_j; the higher the j-th component, the more likely the user is to be interested in T_j.

Step S140: Acquire and preprocess the historical behavior data of the current user, and use the trained deep learning model to calculate the current user's user behavior-information tag interest degree vector.

When the trained deep learning model is used to estimate the current user's interest in information tags, the information vector corresponding to each information item browsed in the current user's historical behavior data is obtained, and the vectors are arranged in chronological order, so that the vector of an item browsed earlier precedes the vector of an item browsed later.

The information vectors of the items in the current user's historical behavior data are input into the trained deep learning model in chronological order. Once all of them have been input, the m-dimensional prediction vector produced by the trained deep learning model is the current user's user behavior-information tag interest degree vector, denoted V(user behavior, information tag).
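The inference step can be sketched with a very small Elman-style recurrent network written in plain Python. The text does not fix the network architecture, so everything here is an assumption for illustration: the tanh hidden layer, the sigmoid output that keeps each component in (0, 1), the hidden size, and the placeholder weights, which stand in for trained parameters.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def predict_interest(sequence, w_in, w_rec, w_out, hidden_size):
    """Feed the vectors in chronological order and return the final output.

    Returns None when the sequence is empty (a user with no history).
    """
    hidden = [0.0] * hidden_size
    output = None
    for vector in sequence:
        pre = [a + b for a, b in zip(matvec(w_in, vector), matvec(w_rec, hidden))]
        hidden = [math.tanh(p) for p in pre]
        output = [sigmoid(o) for o in matvec(w_out, hidden)]
    return output


# Placeholder weights for m = 3 tags and 2 hidden units (not trained values).
w_in = [[0.5, -0.2, 0.1], [0.0, 0.3, -0.4]]
w_rec = [[0.1, 0.0], [0.0, 0.1]]
w_out = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

v_behavior = predict_interest([[1, 0, 1], [0, 0, 1]], w_in, w_rec, w_out, hidden_size=2)
```

Only the output after the last input is kept, which matches the description of taking the prediction vector once all information vectors have been input.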

In this method, the deep learning model used to calculate the user behavior-information tag interest degree vector may be a recurrent neural network or an improved variant of one. Improvements may include changing the number of neurons or the number of layers of the network, adding gating functions, and so on. Provided the structurally modified recurrent neural network model does not directly affect the way this method calculates the user behavior-information tag interest degree vector, it can be regarded as one implementation of this method.

Step S150: Calculate the user-information tag interest degree vector from the current user's user attribute-information tag interest degree vector and user behavior-information tag interest degree vector, and finally determine the several information tags the user is most interested in.

From the already calculated user attribute-information tag interest degree vector V(user attribute, information tag) and user behavior-information tag interest degree vector V(user behavior, information tag) of the current user, the current user's m-dimensional interest degree vector over the information tags can be calculated, denoted V(user, information tag). It is calculated as follows:

V(user, information tag) = (1 - w) * V(user attribute, information tag) + w * V(user behavior, information tag)
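The weighted combination stated in claim 7 can be written as a short component-wise blend. The function name and the sample vectors are illustrative assumptions; the weight w is passed in as a parameter.

```python
def combine_interest(v_attribute, v_behavior, w):
    """Return (1 - w) * v_attribute + w * v_behavior, component by component."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    if len(v_attribute) != len(v_behavior):
        raise ValueError("both vectors must have the same dimension m")
    return [(1.0 - w) * a + w * b for a, b in zip(v_attribute, v_behavior)]


# Illustrative two-tag vectors with equal weighting.
v_user = combine_interest([0.2, 0.6], [0.8, 0.0], w=0.5)

# With w = 0 (no behavior history) the result is the attribute-based vector.
v_new_user = combine_interest([0.2, 0.6], [0.8, 0.0], w=0.0)
```

Setting w to 0 reproduces the cold-start case in which only demographic information is available.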

In the above formula, w is the weight of V(user behavior, information tag) in the calculation of the current user's interest degree vector over the information tags. A new user has no historical behavior data, so w must be 0 when the current user has no historical behavior data. As a user's historical behavior data accumulates, V(user behavior, information tag) reflects the user's interests more accurately, so its weight should increase gradually with the amount of historical behavior data, while w always remains within [0, 1]. Based on these requirements, w is calculated as follows:

w = tanh(a * number of information items browsed by the current user within the preset time period T)

Here tanh is the hyperbolic tangent function and a is a constant greater than 0. For a fixed number of items browsed by the current user within the preset time period T, the larger a is, the larger w is. The value of a can be set according to the actual application; for reference, a may be set to 0.05.
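The weight formula from claim 8 can be checked numerically with a few lines of Python. The function name is an assumption; the default a = 0.05 is the reference value mentioned above.

```python
import math


def behavior_weight(items_browsed, a=0.05):
    """w = tanh(a * n), where n is the number of items browsed within period T."""
    if a <= 0:
        raise ValueError("a must be greater than 0")
    if items_browsed < 0:
        raise ValueError("items_browsed must be non-negative")
    return math.tanh(a * items_browsed)


w_new_user = behavior_weight(0)      # no history: weight is 0
w_light_user = behavior_weight(5)    # a little history
w_active_user = behavior_weight(20)  # tanh(1.0), roughly 0.76
```

Because tanh is increasing and bounded by 1 for non-negative arguments, the weight grows with the number of browsed items and stays within [0, 1], which is what the stated requirements ask for.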

The j-th component of the calculated interest degree vector V(user, information tag) is the final degree of interest of the current user in tag T_j as calculated by this method; the higher the j-th component, the more likely the user is to be interested in T_j. The vector V(user, information tag) fully reflects the current user's degree of interest in every tag in the information tag library. When information is pushed to users by tag, the several tags with the highest values in V(user, information tag) can be selected according to specific needs.

For example, suppose the information tag library is [finance, sports, military, entertainment, life, education, health, technology, culture, travel, other] and the calculated interest degree vector V(user, information tag) of the current user is [0.42, 0.08, 0.02, 0.20, 0.01, 0.06, 0.33, 0.41, 0.05, 0.19, 0.14]. The vector shows that the current user's estimated interest in the tags, from highest to lowest, is finance, technology, health, entertainment, travel, other, sports, education, culture, military, life. If information for the three tags the current user is most interested in is selected for personalized push, the tags of the pushed information are finance, technology, and health.
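The ranking in this example can be reproduced with a short sort. The tag names and values are taken from the example above; the function name `top_tags` is an assumption.

```python
TAGS = ["finance", "sports", "military", "entertainment", "life", "education",
        "health", "technology", "culture", "travel", "other"]
V_USER = [0.42, 0.08, 0.02, 0.20, 0.01, 0.06, 0.33, 0.41, 0.05, 0.19, 0.14]


def top_tags(tags, interest, k):
    """Return the k tags with the highest interest values, highest first."""
    ranked = sorted(zip(tags, interest), key=lambda pair: pair[1], reverse=True)
    return [tag for tag, _ in ranked[:k]]


top_three = top_tags(TAGS, V_USER, 3)
full_ranking = top_tags(TAGS, V_USER, len(TAGS))
```

Python's sort is stable, so tags with equal interest values would keep their order in the tag library.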

The value of H_ij lies in [0, 1].

The embodiment of the present invention provides a method for estimating the degree of interest in information tags. The method screens information from the network and adds tags to the screened information to obtain a candidate information base containing tags. For the current user whose interest in information tags is to be estimated, the user's demographic characteristics are obtained, organized, and analyzed, and are used to calculate the current user's user attribute-information tag interest degree vector. The method trains a deep learning model based on a recurrent neural network, obtains, organizes, and analyzes the current user's historical behavior data, inputs that data into the trained deep learning model, and uses the model to obtain the current user's user behavior-information tag interest degree vector. The user-information tag interest degree vector is then calculated from the current user's user attribute-information tag interest degree vector and user behavior-information tag interest degree vector according to the proposed method. The resulting user-information tag interest degree vector represents this method's estimate of the current user's interest in each information tag.

From the above description of the embodiments, those skilled in the art will clearly understand that the present invention can be implemented in hardware, or in software together with a necessary general-purpose hardware platform. On this understanding, the technical solution of the present invention can be embodied as a software product, which can be stored on a non-volatile storage medium (such as a CD-ROM, USB flash drive, or portable hard disk) and which includes instructions that cause a computer device (such as a personal computer, server, or network device) to execute the methods described in the embodiments of the present invention.

The above discloses only a few specific embodiments of the present invention. The present invention is not limited to these embodiments, and any variation that a person skilled in the art can conceive of shall fall within the scope of protection of the present invention.

Claims

1. A method for estimating interest in information tags, characterized in that the method comprises:

creating and maintaining a candidate information base containing tags;

obtaining a user attribute-information tag interest degree vector according to user demographic information;

acquiring and preprocessing the historical behavior data of multiple users within a preset time period, and inputting the data into a deep learning model for training to obtain a trained deep learning model;

acquiring and preprocessing the historical behavior data of the current user, and using the trained deep learning model to calculate the current user's user behavior-information tag interest degree vector;

calculating a user-information tag interest degree vector according to the current user's user attribute-information tag interest degree vector and user behavior-information tag interest degree vector, and finally determining the several information tags the user is most interested in.

2. The method according to claim 1, wherein the step of creating and maintaining a tagged candidate information base comprises:

selecting, from a preset information tag library, the one or more tags that best match the content of an information item as the tags of that item, and adding the tagged item to the candidate information base containing tags; representing each item in the candidate information base, according to its information tags, as an m-dimensional information vector, where m is the total number of tags in the preset information tag library; when the item carries tag T_j, the j-th component of the m-dimensional information vector is 1, and otherwise the j-th component is 0;

Regularly maintain the tagged candidate information base, add new information, and remove out-of-date information.

3. The method according to claim 1, wherein the user demographic information includes but is not limited to one or more kinds of obtainable information, such as gender, age and/or geographic location, by which users can be divided into several groups.

4. The method according to claim 1 or 3, wherein obtaining the user attribute-information tag interest degree vector according to user demographic information comprises:

The user attribute-information tag interest degree H_ij of the i-th group G_i for the j-th information tag T_j is:

The value of H_ij lies in [0, 1].

5. The method according to claim 1, wherein acquiring and preprocessing the historical behavior data of multiple users within a preset time period, and inputting the data into a deep learning model for training to obtain a trained deep learning model, comprises:

obtaining the information vector corresponding to each information item browsed in the historical behavior data of multiple users within the preset time period, and inputting the information vectors into a recurrent neural network model in the chronological order in which the items were browsed to train the recurrent neural network model, thereby obtaining a trained deep learning model.

6. The method according to claim 1 or 5, wherein acquiring and preprocessing the historical behavior data of the current user, and using the trained deep learning model to calculate the current user's user behavior-information tag interest degree vector, comprises:

obtaining the information vector corresponding to each information item browsed in the current user's historical behavior data, and arranging the vectors in chronological order;

inputting the information vectors of the items in the current user's historical behavior data into the trained deep learning model in chronological order; once all of them have been input, the m-dimensional prediction vector produced by the trained deep learning model is the current user's user behavior-information tag interest degree vector.

7. The method according to any one of claims 1-6, wherein calculating the user-information tag interest degree vector according to the current user's user attribute-information tag interest degree vector and user behavior-information tag interest degree vector, and finally determining the several information tags the user is most interested in, comprises:

calculating the current user's m-dimensional interest degree vector over the information tags from the calculated user attribute-information tag interest degree vector and user behavior-information tag interest degree vector of the current user, according to the following formula:

V(user, information tag) = (1 - w) * V(user attribute, information tag) + w * V(user behavior, information tag)

where V(user, information tag) is the current user's m-dimensional interest degree vector over the information tags; V(user behavior, information tag) is the current user's user behavior-information tag interest degree vector; V(user attribute, information tag) is the current user's user attribute-information tag interest degree vector; and w is the weight of V(user behavior, information tag) in the calculation of the current user's interest degree vector, the value of w always lying within [0, 1].

8. The method according to claim 7, wherein w is calculated as follows:

w = tanh(a * number of information items browsed by the current user within the preset time period T)

where tanh is the hyperbolic tangent function and a is a constant greater than 0.