
CN105205699A - User label and hotel label matching method and device based on hotel comments - Google Patents

User label and hotel label matching method and device based on hotel comments

Info

Publication number
CN105205699A
Authority
CN
China
Prior art keywords
hotel
user
emotional
label
sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510593613.5A
Other languages
Chinese (zh)
Inventor
林小俊
张猛
暴筱
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhonghui Information Technology Co Ltd
Original Assignee
Beijing Zhonghui Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhonghui Information Technology Co Ltd
Priority to CN201510593613.5A
Publication of CN105205699A
Legal status: Pending

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a method and device for matching user tags with hotel tags based on hotel reviews. The method includes: preparing a library of emotional sentence templates for the hotel industry; preparing final hotel tags for at least three hotels; obtaining from the Internet at least two user reviews written by a specific user about the same hotel or different hotels; comparing the emotional sentences in those reviews against the emotional sentence templates, selecting the sentences that match, identifying each as a dimension, and forming the specific user's user tag set from all identified dimensions; calculating a weight for each user tag, where a tag's weight is higher the more frequently it appears in all of the specific user's reviews and the less frequently it appears in all reviews by all users of all hotels; selecting the higher-weighted user tags as the specific user's final user tags; and recommending to the specific user the hotels whose final hotel tags have a high matching rate with the specific user's final user tags.

Description

Method and device for matching user tags and hotel tags based on hotel reviews

Technical field

The invention relates to an Internet information processing method, and in particular to a method and device for generating user portraits.

Background art

As times change, social change is inevitable. As the Internet enters the era of big data, it is reshaping the behavior of both enterprises and consumers. The Internet's relentless pace has disrupted the established logic of business evolution, forcing market participants to face unprecedented change and to adapt ever faster. This raises two questions: how to use big data to uncover potential business value, and how to apply big data technology concretely within an enterprise. Amid the discussion and innovation around big data applications, personalization has become an important practical outlet. Compared with traditional offline membership management, questionnaires, and shopping basket analysis, big data for the first time lets enterprises conveniently collect much broader user feedback over the Internet, providing a sufficient data basis for analyzing user behavior, consumption habits, and other important business information more accurately and quickly. As the understanding of individual users deepened, the concept of the "user portrait" emerged. It abstracts a complete picture of a user's information and can be regarded as the foundation on which enterprises apply big data.

A user portrait is a virtual representative of real users, derived from a thorough understanding of real data. An enterprise collects and analyzes data on consumers' social attributes, living habits, consumption behavior, differences of opinion, and other key information, divides consumers into types, extracts the typical characteristics of each type, and attaches a name, a photo, some demographic elements, and scenario descriptions. The result is a user portrait: a complete commercial picture of the user, and a basic way for enterprises to apply big data technology. User portraits give enterprises a sufficient information base and help them quickly identify precise user groups, user needs, and other broader feedback.

Big data processing depends on computation. A user portrait can be represented as a set of tags, where each tag is a symbolic representation of one user characteristic. Tagging user information gives computers a convenient way to process information about people programmatically, and even to "understand" people through algorithms and models.

A tag is usually a predefined, highly condensed feature identifier, for example an age-group tag (25 to 35 years old) or a region tag (Beijing). Tags have two important properties: (1) they are semantic, so people can easily understand what each tag means, which gives the user portrait model practical significance and lets it serve business needs such as judging user preferences; (2) they are short texts, each usually expressing a single meaning, so the tag itself needs little preprocessing such as text analysis, which makes it easy for machines to extract standardized information.

A user portrait tag has two parts: the tag and its weight. The tag represents content in which the user has an interest, preference, or need. The weight is an index of the user's interest or preference, or possibly of the user's degree of need; it can be understood simply as a confidence level.

The core work of user portraiture is to attach tags to users. A tag is usually a manually defined, highly condensed feature identifier such as age, gender, region, or user preference. Taken together, all of a user's tags outline a three-dimensional portrait of that user.

Specifically, building a user portrait requires two steps: collecting data and analyzing tags.

First, all relevant user data is collected and divided into two categories: static information and dynamic information. Static data is relatively stable user information such as gender, age, region, and occupation. Dynamic data is constantly changing behavioral information such as browsing web pages, searching for products, posting reviews, and contact channels.

Second, the data is analyzed to attach tags and indexes to the user. A tag indicates that the user has an interest in, preference for, or need for certain content; an index indicates the user's degree of interest, degree of need, purchase probability, and so on.

For example, Chinese Patent Application Publication No. 104750731A discloses a method for obtaining a complete user portrait, including: obtaining an incomplete user portrait matrix and randomly generating a user parameter matrix P and a tag matrix Q; calculating the portrait error of a first subset of users and updating the user parameter matrix and the tag parameter matrix, where the first change difference of the selected first subset of users is greater than that of the first remaining users, the first remaining users being the users other than the first subset, and the first change difference being the difference between a user's first predicted value at the (r-1)-th update and at the (r-2)-th update; and, after the R-th update of the user parameter matrix P and the tag parameter matrix Q, obtaining the complete user portrait matrix from the result of the matrix factorization.

As another example, Chinese Patent Application Publication No. 104268292A discloses a method for updating the tag lexicon of a portrait system, including: obtaining a user's portrait data, which includes the tags used to describe the user and the original texts published by the user; when the ratio of the number of tags to the number of original texts is less than a preset first threshold, segmenting all original texts published by the user into words to obtain multiple tag candidate words, and sending the candidates to a recommendation system; the recommendation system calculates the vector distance between each tag candidate word and each word in a preset word vector model file, adds to the tag lexicon every candidate that has a vector distance greater than a preset second threshold, and deletes every candidate that has no vector distance greater than the second threshold.

As a further example, Chinese Patent Application Publication No. 103577549A discloses a crowd portrait system and method based on microblog tags. It contains two modules, microblog tag recommendation and tag topic clustering, and the first module uses a three-step tag recommendation algorithm. The first step is homogeneity-based tag recommendation; the second step is co-occurrence-based tag expansion; the third step builds a semantic network on a Chinese knowledge graph and uses the network topology to measure the semantic similarity between tags, removing tags with the same or similar semantics so that the tags describing a user remain concise.

However, none of the user portrait techniques disclosed in these three patent documents is applied to the hotel industry, which is the field of the present invention.

In the hotel industry, current research on and applications of tag-based user portrait analysis concentrate on user attribute data and user behavior data. User attribute data includes age, gender, region, and so on; user behavior data includes the user's visit history, click history, and purchase history on the official website or mobile app. Research and applications based on review data are comparatively rare. The main difficulty is that review text is hard to analyze and understand: natural language processing and related techniques are needed to convert the unstructured data into structured data before common user tag analysis algorithms can be applied.

Therefore, providing a method for matching user tags and hotel tags based on hotel reviews has become an urgent problem in the industry.

Summary of the invention

The purpose of the present invention is to provide a method and device for matching user tags and hotel tags based on hotel reviews, which models hotels and users through tags so as to better establish associations between hotels and users.

Common user review analysis methods are based on structured data: user attribute data such as age, gender, and region, or user behavior data such as visit history, click history, and purchase history on the official website or mobile app. The present invention targets hotel review data, which has seen less research and application. It can not only determine whether a user's evaluation of a hotel is positive or negative, but also extract the dimensions being evaluated, and build hotel and user tags on that basis.

The invention first uses a focused crawler to obtain online review data from the major mainstream review websites (online travel agents, OTAs). Then, working from the large-scale review corpus, it compiles a hotel-industry emotional lexicon and a domain knowledge base in an automatic or semi-automatic manner. Finally, each sentence in a review is analyzed with natural language processing techniques such as word segmentation, part-of-speech tagging, and phrase-structure parsing; keywords or key sentence patterns are extracted as features; and sentiment classification is performed with a maximum entropy classifier. For sentences that express sentiment, the dimension is then inferred from domain keywords and the knowledge base. Each dimension reflects one angle from which people observe, understand, and describe a hotel or a user.

The invention uses dimensions to describe in detail what hotels and users in the hotel industry focus on, and uses these dimensions as the tag set. User tags reflect the aspects a user cares about, while hotel tags reflect the aspects a hotel is good at. Take recommending hotels to a user as an example: the more similar the tags the user cares about are to the tags the hotel is good at, that is, the higher the degree of matching, the more suitable the hotel is for recommendation to that user. Given the tag set, the next step is to calculate tag weights over all reviews by a given user or all reviews of a given hotel. The weight calculation is based mainly on how often a tag appears in the reviews. Hotel tags differ from user tags in that, to reflect how good a hotel is in some aspect, the sentiment polarity of the reviews corresponding to the tag must also be considered. For a given tag, the more positive reviews there are, the better the hotel is considered to be in that aspect.
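The matching idea in the preceding paragraph can be illustrated with a minimal sketch. The text does not define how the degree of matching is computed, so the sketch assumes it is the share of the user's tags that also appear among the hotel's tags; `match_rate`, `recommend`, and all tag and hotel names are hypothetical.

```python
def match_rate(user_tags, hotel_tags):
    """Assumed definition: fraction of the user's tags found among the hotel's tags."""
    if not user_tags:
        return 0.0
    return len(user_tags & hotel_tags) / len(user_tags)


def recommend(user_tags, hotels, top_n=3):
    """Rank hotels by match rate (ties broken by name) and return the top_n names.

    hotels maps a hotel name to its set of final hotel tags.
    """
    ranked = sorted(hotels, key=lambda name: (-match_rate(user_tags, hotels[name]), name))
    return ranked[:top_n]


# Illustrative data only.
user = {"cleanliness", "transport"}
hotels = {
    "A": {"cleanliness", "transport", "pool"},
    "B": {"transport"},
    "C": {"pool"},
    "D": {"cleanliness"},
}
top_three = recommend(user, hotels)
```

Other overlap measures (for example a weighted overlap that uses the tag weights) would fit the description equally well; the set ratio is only the simplest choice.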

In the present invention, a dimension is a type of sentence sentiment that expresses an evaluation of one aspect of a hotel, such as its hygiene level, transportation convenience, surrounding environment index, or room size. There may be many dimensions: for example, dimension 1 indicates that the hygiene level is grade A; ... dimension 12 indicates that transportation convenience is grade B; ... dimension 53 indicates that the surrounding environment index is grade C; ... dimension 104 indicates that the room size is grade D; and so on.

In the present invention, the different attributes of words refer to the classification of words into attributes such as evaluation object words, evaluation attribute words, and emotional words.

According to one aspect of the present invention, a method for matching user tags and hotel tags based on hotel reviews is provided, including: (1) preparing a hotel-industry emotional sentence template library containing at least 100 emotional sentence templates; (2) preparing final hotel tags for at least three hotels; (3) obtaining from the Internet at least two user reviews written by a specific user about the same hotel or different hotels; (4) comparing the emotional sentences in all of the specific user's reviews one by one against the at least 100 emotional sentence templates, selecting the emotional sentences that match a template, identifying the selected sentences as different dimensions according to the type of emotion expressed, and forming the specific user's user tag set from all identified dimensions; (5) calculating the weight of each user tag in the specific user's user tag set, where a tag's weight is higher the more frequently it appears in all of the specific user's reviews and the less frequently it appears in all reviews by all users of all hotels; (6) selecting from the specific user's user tag set the user tags whose weight exceeds a first set threshold as the specific user's final user tags; and (7) recommending to the specific user at least the three hotels whose final hotel tags have the highest matching rates with the specific user's final user tags.
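Steps (5) and (6) describe a weight that rises with a tag's frequency in the user's own reviews and falls with its frequency across all reviews, which resembles TF-IDF weighting. The sketch below assumes a smoothed TF-IDF form; the text states only the monotonic relationship, not an exact formula, so the formula, `user_tag_weights`, `final_tags`, and the sample tags are all assumptions.

```python
import math
from collections import Counter


def user_tag_weights(user_reviews, all_reviews):
    """Assumed TF-IDF-style weights.

    user_reviews: list of tag lists, one per review by the specific user.
    all_reviews:  list of tag lists, one per review by any user of any hotel.
    """
    tf = Counter(tag for review in user_reviews for tag in review)
    total = sum(tf.values())
    # Document frequency: number of reviews (by anyone) that mention the tag.
    df = Counter(tag for review in all_reviews for tag in set(review))
    n = len(all_reviews)
    return {
        tag: (count / total) * math.log((1 + n) / (1 + df[tag]))
        for tag, count in tf.items()
    }


def final_tags(weights, threshold):
    """Step (6): keep tags whose weight exceeds the first set threshold."""
    return {tag for tag, weight in weights.items() if weight > threshold}


# Illustrative data only.
user_reviews = [["cleanliness", "transport"], ["cleanliness"]]
all_reviews = user_reviews + [["transport"], ["transport"], ["transport", "room_size"]]
weights = user_tag_weights(user_reviews, all_reviews)
selected = final_tags(weights, threshold=0.1)
```

In this toy data "transport" appears in most reviews overall, so its weight is low even though the user mentions it, while "cleanliness" is distinctive for this user and is kept.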

Depending on the specific conditions of use, preparing final hotel tags for at least three hotels may be preparing at least 10, at least 100, or at least 500 final hotel tags.

Optionally, the review data may be obtained in advance from review websites by another device or manually, and kept ready for use.

Optionally, the hotel-industry semantic dictionary may be compiled in advance by another device or manually, and kept ready for use.

Optionally, the hotel-industry emotional sentence template library may be compiled in advance by another device or manually, and kept ready for use.

Optionally, the seed semantic dictionary may be compiled in advance by another device or manually, and kept ready for use.

Optionally, preparing the final hotel tags for at least three hotels in step (2) includes: (2.1) obtaining from the Internet user reviews of each of at least three hotels, with reviews from at least three users for each hotel; (2.2) comparing the emotional sentences in all user reviews of a specific hotel one by one against the at least 100 emotional sentence templates, selecting the emotional sentences that match a template, identifying the selected sentences as different dimensions according to the type of emotion expressed, and forming the specific hotel's hotel tag set from all identified dimensions; (2.3) calculating the weight of each hotel tag in the specific hotel's hotel tag set, where a tag's weight is higher the more frequently it appears in all user reviews of that hotel and the less frequently it appears in all user reviews of all hotels; (2.4) selecting from the hotel tag set the hotel tags whose weight exceeds a second set threshold as the specific hotel's final hotel tags; and (2.5) repeating steps (2.2) to (2.4) until the final hotel tags of all hotels are obtained.

Optionally, preparing the hotel-industry emotional sentence template library in step (1) may include selecting, by frequency of occurrence, at least 100 emotional sentences from at least 10,000 hotel user reviews obtained from the Internet to serve as emotional sentence templates.

Optionally, the method further includes selecting, by frequency of occurrence, at least 1,000 words commonly used in the hotel industry from the at least 10,000 hotel user reviews in order to build a hotel-industry semantic dictionary.
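Frequency-based selection of this kind can be sketched with a simple counter. The helper name `most_frequent` and the sample items are hypothetical; in practice the items would be segmented words or candidate sentences, and `k` would be at least 1,000 or at least 100 respectively.

```python
from collections import Counter


def most_frequent(items, k):
    """Return the k most frequent items, most frequent first."""
    return [item for item, _ in Counter(items).most_common(k)]


# Illustrative data only.
common = most_frequent(["a", "b", "a", "c", "a", "b"], 2)
```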

Optionally, step (1) further includes, before preparing the hotel-industry emotional sentence template library, a step of building a hotel-industry semantic dictionary, and comparing the emotional sentences in all of the specific user's reviews one by one against the at least 100 emotional sentence templates in step (4) includes: (4.1) segmenting a specific emotional sentence into the corresponding common hotel-industry words in the hotel-industry semantic dictionary; (4.2) comparing the sentence against the at least 100 emotional sentence templates according to the attribute of each word in the sentence, to determine whether it matches any one of the templates; and (4.3) repeating steps (4.1) and (4.2) until all emotional sentences that match one of the at least 100 emotional sentence templates have been selected.
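Steps (4.1) and (4.2) can be read as mapping each word to its attribute and then checking the resulting sequence against the templates. The sketch below assumes a template is a tuple of attribute labels and that a match means the template occurs as a contiguous run in the sentence; the labels `OBJ`, `DEG`, `SENT`, the function names, and the English stand-in words are hypothetical.

```python
def to_attributes(words, lexicon):
    """Replace each known word with its attribute label; keep unknown words as-is."""
    return [lexicon.get(word, word) for word in words]


def matches_any(words, lexicon, templates):
    """True if any template occurs as a contiguous run of attribute labels."""
    seq = to_attributes(words, lexicon)
    for template in templates:
        n = len(template)
        for start in range(len(seq) - n + 1):
            if tuple(seq[start:start + n]) == template:
                return True
    return False


# Illustrative lexicon and template only.
lexicon = {"room": "OBJ", "price": "ATTR", "very": "DEG", "clean": "SENT"}
templates = [("OBJ", "DEG", "SENT")]
hit = matches_any(["room", "very", "clean"], lexicon, templates)
miss = matches_any(["price", "very"], lexicon, templates)
```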

Optionally, comparing the emotional sentences in all user reviews of a specific hotel one by one against the at least 100 emotional sentence templates in step (2.2) includes: (2.2.1) segmenting a specific emotional sentence into the corresponding common hotel-industry words in the hotel-industry semantic dictionary; (2.2.2) comparing the sentence against the at least 100 emotional sentence templates according to the attribute of each word in the sentence, to determine whether it matches any one of the templates; and (2.2.3) repeating steps (2.2.1) and (2.2.2) until all emotional sentences that match one of the at least 100 emotional sentence templates have been selected.

Optionally, in step (3), the user reviews may be obtained from review websites by a focused crawler.

Optionally, in step (1), the hotel-industry emotional sentence template library may be obtained by extracting sentence pattern templates with a bootstrapping method based on user reviews.

Optionally, the steps of preparing the hotel-industry emotional sentence template library and building the hotel-industry semantic dictionary include: (1.1) obtaining review data and compiling the words of each emotional element into a seed dictionary; (1.2) segmenting the sentences of the review data into words, then determining the semantic class of each word and replacing the word with its semantic class label; (1.3) splitting the label-substituted review data into sentences and generating templates from the names of the semantic classes and the specific words each class contains; (1.4) applying the templates to the label-substituted review data to extract semantic words for each semantic class; (1.5) scoring each template according to its importance, generality, and accuracy; (1.6) selecting the highest-scoring templates, calculating the scores of the semantic words extracted by each template from the selected templates and their scores, and then selecting the highest-scoring semantic words to expand the semantic dictionary; and (1.7) iterating steps (1.2) to (1.6) until the selected semantic words are no longer correct, at which point the iteration terminates, yielding the final hotel-industry semantic dictionary, with the templates forming the hotel-industry emotional sentence template library.

Optionally, step (1.1) obtains online review data from review websites with a focused crawler, and the words of each semantic class are compiled into the seed dictionary by manually inspecting a small number of reviews.

Optionally, step (1.2) first performs word segmentation with a dictionary-based maximum matching method, and then applies a sequence labeling segmentation method to the ambiguous parts to obtain the correct segmentation. The sequence labeling method converts the word segmentation problem into a character classification problem: each character is assigned a position class label according to its position within a word, and the segmentation of the sentence is determined from the resulting label sequence.

Optionally, the position class labels include word-initial, word-medial, word-final, and single-character word, and a conditional random field model is used to perform the sequence labeling task.
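The two segmentation ideas above can be sketched as follows. The first function is a forward variant of dictionary-based maximum matching (the text does not say whether matching runs forward or backward). The second turns a position label sequence into words, using the common letter codes B, M, E, S for word-initial, word-medial, word-final, and single-character word; these codes are an assumption. The conditional random field that would predict the labels is not shown, and the sample dictionary is hypothetical.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy forward maximum matching: take the longest dictionary word at each position."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the loop makes progress.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def decode_bmes(chars, tags):
    """Rebuild words from per-character position labels (B, M, E, S)."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends at E or S
            words.append(current)
            current = ""
    if current:  # tolerate a label sequence that ends mid-word
        words.append(current)
    return words


# Illustrative data only.
segmented = forward_max_match("酒店房间很干净", {"酒店", "房间", "干净"})
decoded = decode_bmes("酒店很干净", ["B", "E", "S", "B", "E"])
```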

Optionally, the semantic classes in step (1.2) include evaluation object words, evaluation attribute words, emotional words, degree adverbs, ordinary adverbs, negation words, and parenthetical words.

Optionally, step (1.3) splits sentences at the three punctuation marks "。", "!", and "?", and limits templates to a minimum length of 3 words and a maximum length of 7 words.
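A minimal sketch of the sentence splitting and length limits in step (1.3). It assumes that candidate templates are contiguous token runs of 3 to 7 tokens, which the text does not state explicitly, and it accepts both full-width and ASCII forms of the exclamation and question marks. The function names and sample tokens are hypothetical.

```python
import re


def split_sentences(text):
    """Split at the three sentence-ending marks; drop empty pieces."""
    return [s.strip() for s in re.split(r"[。！？!?]", text) if s.strip()]


def candidate_templates(tokens, min_len=3, max_len=7):
    """All contiguous token runs whose length is between min_len and max_len."""
    out = []
    for n in range(min_len, max_len + 1):
        for start in range(len(tokens) - n + 1):
            out.append(tuple(tokens[start:start + n]))
    return out


# Illustrative data only.
sentences = split_sentences("房间很干净。服务很好!推荐?")
candidates = candidate_templates(["OBJ", "DEG", "SENT", "NEG"])
```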

Optionally, when extracting the semantic words of each semantic class in step (1.4), if the template corresponding to a review fragment differs from a template obtained in step (1.3) by only one word, that word is taken as an instance word of the corresponding semantic class.
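The one-word-difference rule can be sketched as a position-by-position comparison. The sketch assumes that the fragment and the template have the same length and that the differing template slot holds a semantic class label; `extract_instance` and the labels are hypothetical.

```python
def extract_instance(fragment, template, class_labels):
    """Return (class_label, word) if fragment differs from template in exactly one
    position and the template holds a semantic class label there; otherwise None."""
    if len(fragment) != len(template):
        return None
    diffs = [(f, t) for f, t in zip(fragment, template) if f != t]
    if len(diffs) == 1 and diffs[0][1] in class_labels:
        word, label = diffs[0]
        return label, word
    return None


# Illustrative data only: "spotless" is not yet in the dictionary.
labels = {"OBJ", "DEG", "SENT"}
found = extract_instance(("OBJ", "DEG", "spotless"), ("OBJ", "DEG", "SENT"), labels)
none_found = extract_instance(("lobby", "DEG", "spotless"), ("OBJ", "DEG", "SENT"), labels)
```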

Optionally, the templates are scored in step (1.5) as follows:

1) The importance and generality score S(pat_i) of a template is calculated as follows:

[formula not preserved in the source text]

where |pat_i| is the length of template pat_i in words, f(pat_i) is the frequency of template pat_i, and C(pat_i) is the set of templates in which pat_i is nested;

2) The accuracy score P(pat_i) of a template is calculated as follows:

$$P(pat_i) = \frac{\sum_{t \in SemLex,\ t \in T(pat_i)} f(t)}{\sum_{t \in T(pat_i)} f(t)},$$

where T(pat_i) is the set of semantic words extracted by template pat_i, f(t) is the frequency of semantic word t, and SemLex is the seed semantic dictionary;

3) The Sigmoid function is used to normalize S(pat_i) to (0,1), and the two scores are then fused into F(pat_i), calculated as follows:

$$F(pat_i) = \alpha \cdot \log_2 \frac{1}{1 + e^{-S(pat_i)}} + (1 - \alpha) \cdot \log_2 P(pat_i),$$

where α is the weight of the importance and generality score S(pat_i), with values in the range [0,1].
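The accuracy score P and the fused score F can be computed directly from their definitions. The sketch below takes S(pat_i) as an input because its formula is not given here. The function names and sample frequencies are hypothetical, and returning negative infinity when P is zero is a design choice of the sketch, since log2(0) is undefined.

```python
import math


def accuracy_score(extracted_freq, seed_lexicon):
    """P(pat_i): share of extracted-word frequency that falls on seed-dictionary words.

    extracted_freq maps each semantic word t in T(pat_i) to its frequency f(t).
    """
    total = sum(extracted_freq.values())
    if total == 0:
        return 0.0
    hits = sum(f for t, f in extracted_freq.items() if t in seed_lexicon)
    return hits / total


def fused_score(s, p, alpha=0.5):
    """F(pat_i) = alpha * log2(sigmoid(S)) + (1 - alpha) * log2(P)."""
    if p <= 0:
        return float("-inf")  # sketch-level guard; log2(0) is undefined
    sigmoid = 1.0 / (1.0 + math.exp(-s))
    return alpha * math.log2(sigmoid) + (1 - alpha) * math.log2(p)


# Illustrative data only.
p = accuracy_score({"clean": 3, "dirty": 1, "zzz": 4}, {"clean", "dirty"})
f = fused_score(0.0, p, alpha=0.5)
```

Both logarithms are at most zero, so F is never positive; a template scores higher the closer its sigmoid-normalized S and its P are to 1.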

Optionally, in step (1.6), the highest-scoring templates are the top 5 to 10% of templates by score, and the highest-scoring semantic words are the top 5 to 10% of semantic words by score.
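Selecting the top 5 to 10% by score can be sketched as below. Rounding up and keeping at least one item are assumptions of the sketch, as are the function name and the sample scores.

```python
import math


def top_fraction(scores, fraction=0.1):
    """Return the names of the highest-scoring items, keeping ceil(fraction * n), at least 1."""
    k = max(1, math.ceil(len(scores) * fraction))
    return sorted(scores, key=lambda name: (-scores[name], name))[:k]


# Illustrative data only: 20 templates with scores 0..19.
scores = {f"t{i}": i for i in range(20)}
best = top_fraction(scores, 0.1)
```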

可选择地,在步骤(1.7)之后,由人工进行确定语义词典中情感词的极性,以及情感词与评价对象词、评价属性词的搭配极性;人工确定过程中,将其所属模版对应的点评片段作为判定的依据。Optionally, after step (1.7), manually determine the polarity of the emotional word in the semantic dictionary, and the collocation polarity of the emotional word, the evaluation object word, and the evaluation attribute word; The review fragments are used as the basis for judgment.

Optionally, the step of performing sentiment analysis on reviews in the present invention comprises: obtaining review data and normalizing it; segmenting the sentences of the normalized review data into words; performing element analysis on the segmented sentences to identify the kinds of words that affect detection of the text's sentiment orientation; matching the review data, after element analysis, against the sentence-pattern template library; determining the antecedent of each anaphor in the sentences of the review data and restoring omitted subjects; and taking sentences that contain evaluation-object words, evaluation-attribute words, or sentiment words as candidate sentiment sentences, and using a maximum entropy model to classify the polarity of each candidate sentence, thereby obtaining the sentiment orientation of the sentence.

Optionally, the normalization uses a rule-based method to handle spelling errors in the review text, where each rule maps "a character string or word string containing a typo" to "the corresponding correct character string or word string". The rules are obtained in two ways: first, from existing knowledge, that is, common spelling errors already catalogued by others; second, by extracting similar characters or words based on the context of each character or word and confirming the correct string through manual verification.

Optionally, segmentation is first performed with a dictionary-based maximum matching method, and the ambiguous parts are then re-segmented with a sequence labeling method to obtain the correct result. The sequence labeling method converts the word segmentation problem into a character classification problem: each character is assigned a position category tag according to its position within a word, and the segmentation of the sentence is determined from the resulting tag sequence.

Optionally, the position category tags comprise word-initial, word-medial, word-final, and single-character word, and a conditional random field model is used to perform the sequence labeling task.

Optionally, the elements comprise the evaluation-object words, evaluation-attribute words, sentiment words, degree adverbs, ordinary adverbs, negation words, and parenthetical words in the review data, as well as words relating to cities and scenic spots; once the elements in a sentence have been identified, they are marked with the corresponding category labels.

Optionally, sentence-pattern templates are extracted by a review-based bootstrapping method, thereby building the sentence-pattern template library.

Optionally, if the current sentence contains neither an evaluation-object word nor an evaluation-attribute word, the evaluation-object or evaluation-attribute word mentioned last in the previous sentence is introduced into the current sentence; if the current sentence contains only evaluation-attribute words, an evaluation object appearing in the previous sentence is introduced into the current sentence.

Optionally, the maximum entropy model builds a conditional probability model to predict the sentiment category and estimate its probability; there are three sentiment categories, -1, 0, and 1, denoting negative, no sentiment, and positive, respectively.

According to another aspect of the present invention, there is provided a device for matching user labels and hotel labels based on hotel reviews, comprising: a hotel-industry sentiment sentence template library generation module, the template library comprising at least 100 sentiment sentence templates; a final hotel label generation module for generating final hotel labels for at least three hotels; a user review acquisition module that obtains from the Internet at least two user reviews written by a specific user about the same hotel or different hotels; a user label set generation module that compares the sentiment sentences of all of the specific user's reviews one by one against the at least 100 sentiment sentence templates, selects the sentiment sentences that match the templates, classifies the selected sentences into different dimensions according to the type of sentiment expressed, and forms the specific user's user label set from all identified dimensions; a user label weight calculation module that calculates the weight of each user label in the specific user's user label set, wherein a label's weight is higher the more frequently it appears in all of the specific user's reviews and the less frequently it appears in all reviews by all users of all hotels; a final user label generation module that selects, from the specific user's user label set, the user labels whose weight exceeds a first set threshold as the specific user's final user labels; and a hotel recommendation module that recommends to the specific user at least the three hotels whose final hotel labels have the highest matching rate with the specific user's final user labels.
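The recommendation module can be illustrated with a minimal Python sketch. The text does not define how the matching rate is computed, so the sketch assumes it is the share of the user's final labels that also appear among a hotel's final labels; the function names `match_rate` and `recommend` are hypothetical.

```python
def match_rate(user_labels, hotel_labels):
    """Share of the user's final labels that the hotel also carries.

    This definition is an assumption; the source does not give a formula.
    """
    if not user_labels:
        return 0.0
    return len(user_labels & hotel_labels) / len(user_labels)


def recommend(user_labels, hotels, top_n=3):
    """Return the names of the top_n hotels by match rate (ties broken by name)."""
    ranked = sorted(
        hotels,
        key=lambda name: (-match_rate(user_labels, hotels[name]), name),
    )
    return ranked[:top_n]


if __name__ == "__main__":
    user = {"clean", "quiet"}
    hotels = {
        "A": {"clean", "quiet"},
        "B": {"clean"},
        "C": set(),
        "D": {"quiet", "pool"},
    }
    print(recommend(user, hotels))
```

Any other set-similarity measure (for example, Jaccard similarity) could be substituted for `match_rate` without changing the ranking step.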

Optionally, the final hotel label generation module may obtain, through the user review acquisition module, user reviews for each of at least three hotels from the Internet, with reviews from at least three users for each hotel. The final hotel label generation module may further comprise: a hotel label set generation sub-module that compares the sentiment sentences of all user reviews of a specific hotel one by one against the at least 100 sentiment sentence templates, selects the sentiment sentences that match the templates, classifies the selected sentences into different dimensions according to the type of sentiment expressed, and forms the hotel label set of the specific hotel from all identified dimensions; and a hotel label weight calculation sub-module that calculates the weight of each hotel label in the hotel label set of the specific hotel, wherein a label's weight is higher the more frequently it appears in all user reviews of the same hotel and the less frequently it appears in all user reviews of all hotels. The final hotel label generation module selects, from the hotel label set, the hotel labels whose weight exceeds a second set threshold as the final hotel labels of the specific hotel.

Optionally, the hotel-industry sentiment sentence template library generation module may obtain at least 10,000 hotel user reviews from the Internet through the user review acquisition module and select from them, according to how frequently the sentences occur, at least 100 sentiment sentences as sentiment sentence templates.

Optionally, the device may further comprise a hotel-industry semantic dictionary generation module that selects, according to how frequently the words occur, at least 1,000 common hotel-industry words from the at least 10,000 hotel user reviews to build a hotel-industry semantic dictionary.

Optionally, the first set threshold and the second set threshold may each be chosen freely within the range 0 to 1. For example, the first set threshold may be 0.5 and the second set threshold 0.3.
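The weighting rule can be sketched in Python. The text states only the two monotonic requirements (more frequent in the user's own reviews, less frequent across all reviews) and that thresholds lie in 0 to 1; the specific formula below, the user-side share multiplied by one minus the global share, is an assumption chosen to satisfy both, and `label_weights` and `final_labels` are hypothetical names. Each review is represented as the set of labels found in it.

```python
def label_weights(user_reviews, all_reviews):
    """Weight each label by user_share * (1 - global_share), a value in [0, 1].

    user_reviews: list of label sets, one per review by the specific user.
    all_reviews:  non-empty list of label sets for all reviews (including the user's).
    The formula is an illustrative assumption, not taken from the source.
    """
    labels = set().union(*user_reviews) if user_reviews else set()
    weights = {}
    for label in labels:
        user_share = sum(label in r for r in user_reviews) / len(user_reviews)
        global_share = sum(label in r for r in all_reviews) / len(all_reviews)
        weights[label] = user_share * (1.0 - global_share)
    return weights


def final_labels(weights, threshold=0.5):
    """Keep the labels whose weight exceeds the set threshold."""
    return {label for label, w in weights.items() if w > threshold}


if __name__ == "__main__":
    mine = [{"quiet", "clean"}, {"quiet"}]
    everyone = mine + [{"clean"}, {"clean"}]
    w = label_weights(mine, everyone)
    print(w, final_labels(w, threshold=0.3))
```

The same two functions apply unchanged to hotel labels by passing the reviews of one hotel as the first argument.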

As an alternative, the present invention may use a bootstrapping-based method to build the hotel-industry semantic dictionary and the sentence-pattern template library.

Bootstrapping, also called self-expansion, is a semi-supervised machine learning method that can extract a semantic dictionary and templates at the same time. It rests on the observation that extraction templates can be used to extract new instances, and those instances can in turn be used to extract new templates. Its advantage is that it needs no annotated training corpus, only a small number of seeds. Initial seed words are first obtained through manual intervention; the seed words are used to obtain templates, the templates are used to obtain more seed words, and so on iteratively. Each iteration produces new annotated data: the best words are added to the corresponding semantic dictionary and the best templates are added to the template library. The model is re-learned from the new annotated data, which in turn produces new data, and the cycle repeats until convergence, yielding more seed words and templates. This is the basic bootstrapping algorithm (or process).

The semantic classes of the semantic dictionary include evaluation-object words, evaluation-attribute words, sentiment words, degree adverbs, ordinary adverbs, negation words, parenthetical words, and so on. Each semantic class contains a number of words, and a template is a sequence composed of semantic class names or specific words.

The specific implementation steps are as follows:

Step 1: Data preparation. Online review data are obtained from mainstream review websites such as Ctrip using a focused crawler.

Step 2: Seed dictionary construction. A small number of reviews (for example, 500) are inspected manually and the words of each semantic class are collected; the resulting semantic dictionary is denoted SemLex.

Step 3: Review word segmentation. Chinese word segmentation is a basic step in Chinese natural language processing, and the present invention combines dictionary-based segmentation with statistical segmentation. The dictionary-based maximum matching method is applied first, and the sequence labeling method is then applied to the parts where the segmentation is ambiguous.

In the dictionary-based maximum matching method, given a dictionary and a sequence of Chinese characters to be segmented, the longest matching dictionary word is found at each position in turn; where nothing matches, the character is treated as a single-character word, and this continues until the whole sequence has been processed. Depending on the scanning direction, the method is either forward maximum matching (left to right) or reverse maximum matching (right to left). For example, for the sequence "当原子结合成分子时" ("when atoms combine into molecules"), with both "成分" and "分子" in the dictionary, forward maximum matching yields "当|原子|结合|成分|子|时", whereas reverse maximum matching yields "当|原子|结合|成|分子|时". Neither direction by itself handles segmentation ambiguity well. The two can be combined into bidirectional maximum matching, in which the places where the forward and reverse results disagree are usually the places of potential ambiguity. Resolving an ambiguity generally requires the specific context.

Supervised sequence labeling can exploit rich contextual features, so the present invention uses it to resolve the ambiguous cases. The method converts the word segmentation problem into a character classification problem: each character is assigned a position category tag according to its position within a word, namely B (Begin, word-initial), M (Middle, word-medial), E (End, word-final), or S (Single, single-character word). Given the tag sequence, every character sequence whose tags match the regular expression "S" or "B(M)*E" is one word, so the sentence is easily segmented. To perform the sequence labeling task, the present invention uses a conditional random field (CRF) model, which is widely and successfully used in natural language processing. The features are: the previous character, the current character, the next character, the previous character together with the current character, and the current character together with the next character. The conditional random field model uses these features to predict the category tag of each character.
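The dictionary-based maximum matching described above can be sketched in Python as follows. The four-word toy lexicon and the `max_len` limit are assumptions for illustration; the actual dictionary comes from the annotated review corpus. On the example sequence the two scan directions disagree around "成分子", which is exactly the span that would be handed to the sequence labeling step.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation: take the longest lexicon word at each position."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, lexicon, max_len=4):
    """Greedy right-to-left segmentation: take the longest lexicon word ending at each position."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                words.append(piece)
                j -= size
                break
    return words[::-1]


if __name__ == "__main__":
    toy_lexicon = {"原子", "结合", "成分", "分子"}  # hypothetical toy dictionary
    sentence = "当原子结合成分子时"
    fwd = forward_max_match(sentence, toy_lexicon)
    bwd = backward_max_match(sentence, toy_lexicon)
    print("|".join(fwd))
    print("|".join(bwd))
    print("ambiguous" if fwd != bwd else "consistent")
```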

Both the dictionary used by the maximum matching method and the training corpus for the supervised conditional random field model come from 100,000 hotel reviews manually annotated for the present invention.

Step 4: Semantic class label replacement. For each segmented review, the semantic class of every word is determined and the word is replaced with its semantic class label; for example, "餐厅|的|价格|很|高" ("the restaurant's prices are very high") is replaced with "Obj|的|Attr|Dgr|Sent". "Start" and "End" labels are added at the beginning and end of each review, and punctuation marks other than "。", "!", and "?" are replaced with the "Punc" label.
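A minimal Python sketch of the label replacement in Step 4 follows. The word-to-class mapping `sem_lex` and the punctuation set are toy assumptions; only the label names Obj, Attr, Dgr, Sent, Punc, Start, and End come from the text.

```python
# Sentence-ending marks are kept as-is so that Step 5 can split on them.
SENTENCE_END = {"。", "!", "?", "！", "？"}


def replace_with_labels(tokens, sem_lex, punctuation):
    """Replace each segmented word with its semantic class label where one is known."""
    out = ["Start"]
    for tok in tokens:
        if tok in sem_lex:
            out.append(sem_lex[tok])
        elif tok in punctuation and tok not in SENTENCE_END:
            out.append("Punc")
        else:
            out.append(tok)  # words outside the dictionary stay as literal words
    out.append("End")
    return out


if __name__ == "__main__":
    sem_lex = {"餐厅": "Obj", "价格": "Attr", "很": "Dgr", "高": "Sent"}  # toy dictionary
    punctuation = {",", "，", "、", ";", "；"}
    print("|".join(replace_with_labels(["餐厅", "的", "价格", "很", "高"], sem_lex, punctuation)))
```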

Step 5: Template generation. Sentences are split at the three punctuation marks "。", "!", and "?". With the template length limited to a minimum of 3 words and a maximum of 7 words, the label-replaced reviews are scanned to generate templates.
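Step 5 can be sketched in Python under the assumption that "scanning" means enumerating every contiguous window of 3 to 7 tokens within each sentence and counting how often each window occurs. The text does not spell out the enumeration, and `generate_templates` is a hypothetical name.

```python
from collections import Counter

SENTENCE_END = {"。", "!", "?", "！", "？"}


def generate_templates(labels, min_len=3, max_len=7):
    """Count every contiguous window of min_len..max_len tokens inside each sentence."""
    counts = Counter()
    sentence = []
    # A trailing sentinel flushes the last sentence even if it has no end mark.
    for tok in list(labels) + ["。"]:
        if tok in SENTENCE_END:
            for n in range(min_len, max_len + 1):
                for i in range(len(sentence) - n + 1):
                    counts[tuple(sentence[i:i + n])] += 1
            sentence = []
        else:
            sentence.append(tok)
    return counts


if __name__ == "__main__":
    stream = ["Obj", "Dgr", "Sent", "。", "Obj", "Dgr", "Sent"]
    for template, freq in generate_templates(stream).items():
        print("|".join(template), freq)
```

The resulting counts play the role of the template frequency f(pat_i) used in the scoring step.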

Step 6: Template scoring. The present invention scores templates from two aspects: frequency is used to measure a template's importance and generality, and the hit rate in the semantic dictionary is used to measure its accuracy.

The importance-and-generality score S(pat_i) of template pat_i is calculated as follows:

where |pat_i| is the length of template pat_i in words, f(pat_i) is the frequency of pat_i, and C(pat_i) is the set of templates in which pat_i is nested.

The accuracy score P(pat_i) of template pat_i is calculated as follows:

$$P(pat_i) = \frac{\sum_{t \in SemLex,\ t \in T(pat_i)} f(t)}{\sum_{t \in T(pat_i)} f(t)}$$

where T(pat_i) is the set of semantic words extracted by template pat_i, and f(t) is the frequency of semantic word t.

A sigmoid function is used to normalize S(pat_i) to (0,1), and the two scores are then fused into F(pat_i), calculated as follows:

$$F(pat_i) = \alpha \cdot \log_2 \frac{1}{1 + e^{-S(pat_i)}} + (1 - \alpha) \cdot \log_2 P(pat_i)$$

Here α=0.4, because the present invention places more weight on template accuracy.
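The accuracy score and the fused score can be sketched in Python as follows. Because the formula for S(pat_i) is not reproduced in this text, the sketch takes S as an input rather than computing it; treating P = 0 as a score of negative infinity is an assumption, since log2(0) is undefined. The names `accuracy_score` and `fused_score` are hypothetical.

```python
import math


def accuracy_score(extracted_freq, seed_lexicon):
    """P(pat_i): frequency mass of extracted words already in SemLex over total mass.

    extracted_freq maps each semantic word t in T(pat_i) to its frequency f(t).
    """
    total = sum(extracted_freq.values())
    if total == 0:
        return 0.0
    hits = sum(f for t, f in extracted_freq.items() if t in seed_lexicon)
    return hits / total


def fused_score(s, p, alpha=0.4):
    """F(pat_i) = alpha * log2(sigmoid(S)) + (1 - alpha) * log2(P)."""
    if p <= 0:
        return float("-inf")  # assumed handling: a template with no hits ranks last
    sigmoid = 1.0 / (1.0 + math.exp(-s))
    return alpha * math.log2(sigmoid) + (1 - alpha) * math.log2(p)


if __name__ == "__main__":
    p = accuracy_score({"干净": 3, "便宜": 1}, seed_lexicon={"干净"})
    print(p, fused_score(s=2.0, p=p))
```

Both terms of F are non-positive, so scores closer to zero indicate better templates.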

Step 7: Template selection. The top 5% of templates by F(pat_i) are selected.

Step 8: Semantic word extraction. The selected templates are applied to the reviews after semantic class label replacement. When a review fragment differs from a selected template in exactly one word, that word is taken as an instance word of the corresponding semantic class.
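The one-word-difference rule of Step 8 can be sketched in Python. The sketch assumes the differing position in the template must hold a semantic class label (otherwise there is no class to assign the word to); class names other than Obj, Attr, Dgr, and Sent are hypothetical.

```python
# Obj, Attr, Dgr, Sent appear in the text; Adv, Neg, Ins are hypothetical names
# for the ordinary-adverb, negation, and parenthetical classes.
SEMANTIC_CLASSES = {"Obj", "Attr", "Sent", "Dgr", "Adv", "Neg", "Ins"}


def extract_instance(template, fragment):
    """Return (semantic_class, word) if fragment differs from template in exactly one slot."""
    if len(template) != len(fragment):
        return None
    diffs = [i for i, (a, b) in enumerate(zip(template, fragment)) if a != b]
    if len(diffs) != 1:
        return None
    i = diffs[0]
    if template[i] not in SEMANTIC_CLASSES:
        return None  # the mismatch is on a literal word, so no class can be assigned
    return template[i], fragment[i]


if __name__ == "__main__":
    print(extract_instance(("Obj", "Dgr", "Sent"), ("Obj", "Dgr", "宽敞")))
```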

Step 9: Semantic word scoring. The score of semantic word t_j is the sum of the accuracy scores of the templates that extract it:

$$P(t_j) = \sum_{k,\ t_j \in T(pat_k)} P(pat_k)$$

Step 10: Semantic dictionary expansion. The top 5% of semantic words by score are selected and added to the semantic dictionary.
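Steps 9 and 10 can be sketched in Python as follows. Keeping at least one word when 5% of the candidates rounds down to zero is an assumption, as are the names `score_words` and `top_fraction`.

```python
import math


def score_words(extracted, template_accuracy):
    """P(t_j): sum of P(pat_k) over every template pat_k that extracted t_j.

    extracted maps a template id to the set of words it extracted, T(pat_k).
    """
    scores = {}
    for pat, words in extracted.items():
        for word in words:
            scores[word] = scores.get(word, 0.0) + template_accuracy[pat]
    return scores


def top_fraction(scores, fraction=0.05):
    """Return the highest-scoring share of words (at least one, by assumption)."""
    k = max(1, math.ceil(len(scores) * fraction))
    return sorted(scores, key=scores.get, reverse=True)[:k]


if __name__ == "__main__":
    extracted = {"p1": {"宽敞", "干净"}, "p2": {"干净"}}
    accuracy = {"p1": 0.5, "p2": 0.25}
    scores = score_words(extracted, accuracy)
    print(scores, top_fraction(scores))
```

A word extracted by several accurate templates therefore outranks a word extracted by a single template.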

Steps 4 through 10 are iterated. Termination condition: the iteration stops when the selected semantic words are clearly incorrect.

Step 11: Polarity determination. The polarity of sentiment words, and the collocation polarity of sentiment words with evaluation-object words and evaluation-attribute words, are determined manually. During manual determination, the review fragments matched by the template to which a word belongs serve as the basis for judgment.

The results show that the present invention performs well in both precision and recall, producing a high-quality semantic dictionary and sentence-pattern template library.

As another alternative, the sentiment sentence templates are built and sentences are compared and analyzed in the present invention as follows.

The present invention first obtains online review data from the major mainstream review websites using a focused crawler. Then, from the large body of reviews, the semantic dictionary and the sentence-pattern library are compiled semi-automatically. Finally, each sentence in a review is segmented and otherwise processed and analyzed; on that basis, keywords or key sentence patterns are extracted as features, and sentiment classification is performed with a maximum entropy classifier. The steps are as follows:

Step 1: Text normalization.

Internet review text often contains spelling errors, which are handled with a rule-based method. Each rule maps "a character string or word string containing a typo" to "the corresponding correct character string or word string". The rules are obtained in two ways: first, from existing knowledge, that is, common spelling errors already catalogued by others; second, by extracting similar characters or words based on the context of each character or word and confirming them through manual verification. The method is simple and effective. The performance of this module depends on the number of spelling correction rules, and the rule base can be extended continually during system operation and maintenance.

Chinese text also mixes full-width and half-width punctuation; using the mapping between full-width and half-width symbols, punctuation marks are uniformly converted to half-width symbols.
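The normalization step can be sketched in Python. The sketch relies on the Unicode layout in which the full-width forms U+FF01 to U+FF5E sit at a fixed offset of 0xFEE0 from their ASCII counterparts, and U+3000 is the ideographic space. Note that this range also covers full-width letters and digits, so the sketch converts more than punctuation; the single typo rule shown is a made-up example, and the function names are hypothetical.

```python
def to_halfwidth(text):
    """Map full-width ASCII variants (U+FF01..U+FF5E) and U+3000 to half-width."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:            # ideographic space
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:
            out.append(chr(code - 0xFEE0))
        else:
            out.append(ch)            # e.g. "。" has no half-width ASCII form and is kept
    return "".join(out)


def apply_typo_rules(text, rules):
    """Apply each 'wrong string -> correct string' rule by plain substring replacement."""
    for wrong, right in rules.items():
        text = text.replace(wrong, right)
    return text


if __name__ == "__main__":
    rules = {"前抬": "前台"}  # hypothetical rule: a common miswriting of "front desk"
    print(apply_typo_rules(to_halfwidth("前抬服务很好！"), rules))
```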

Step 2: Review word segmentation.

Chinese word segmentation is a basic step in Chinese natural language processing, and the present invention combines dictionary-based segmentation with statistical segmentation. The dictionary-based maximum matching method is applied first, and the sequence labeling method is then applied to the parts where the segmentation is ambiguous.

In the dictionary-based maximum matching method, given a dictionary and a sequence of Chinese characters to be segmented, the longest matching dictionary word is found at each position in turn; where nothing matches, the character is treated as a single-character word, and this continues until the whole sequence has been processed. Depending on the scanning direction, the method is either forward maximum matching (left to right) or reverse maximum matching (right to left). For example, for the sequence "当原子结合成分子时" ("when atoms combine into molecules"), with both "成分" and "分子" in the dictionary, forward maximum matching yields "当|原子|结合|成分|子|时", whereas reverse maximum matching yields "当|原子|结合|成|分子|时". Neither direction by itself handles segmentation ambiguity well. The two can be combined into bidirectional maximum matching, in which the places where the forward and reverse results disagree are usually the places of potential ambiguity. Resolving an ambiguity generally requires the specific context.

Supervised sequence labeling can exploit rich contextual features, so the present invention uses it to resolve the ambiguous cases. The method converts the word segmentation problem into a character classification problem: each character is assigned a position category tag according to its position within a word, namely B (Begin, word-initial), M (Middle, word-medial), E (End, word-final), or S (Single, single-character word). Given the tag sequence, every character sequence whose tags match the regular expression "S" or "B(M)*E" is one word, so the sentence is easily segmented. To perform the sequence labeling task, the present invention uses a conditional random field (CRF) model, which is widely and successfully used in natural language processing. The features are: the previous character, the current character, the next character, the previous character together with the current character, and the current character together with the next character. The conditional random field model uses these features to predict the category tag of each character.
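The tag scheme and the five character features can be sketched in Python. The sketch covers only the two deterministic pieces, building the feature dictionary for a character and turning a predicted B/M/E/S tag sequence back into words; the CRF itself is not implemented, and the padding symbols `<BOS>` and `<EOS>` as well as the function names are assumptions.

```python
def char_features(chars, i):
    """The five features named in the text for the character at position i."""
    prev_ch = chars[i - 1] if i > 0 else "<BOS>"              # assumed padding symbol
    next_ch = chars[i + 1] if i + 1 < len(chars) else "<EOS>"  # assumed padding symbol
    cur = chars[i]
    return {
        "prev": prev_ch,
        "cur": cur,
        "next": next_ch,
        "prev+cur": prev_ch + cur,
        "cur+next": cur + next_ch,
    }


def decode_bmes(chars, tags):
    """Group characters into words: each 'S' or 'B(M)*E' run of tags is one word."""
    words, buf = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "S":
            if buf:                 # tolerate a malformed run by flushing it
                words.append(buf)
                buf = ""
            words.append(ch)
        elif tag == "B":
            if buf:
                words.append(buf)
            buf = ch
        elif tag == "M":
            buf += ch
        elif tag == "E":
            words.append(buf + ch)
            buf = ""
        else:
            raise ValueError(f"unknown tag: {tag!r}")
    if buf:
        words.append(buf)
    return words


if __name__ == "__main__":
    print("|".join(decode_bmes("当原子结合成分子时", "SBEBESBES")))
    print(char_features("原子", 0))
```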

Both the dictionary used by the maximum matching method and the training corpus for the supervised conditional random field model come from 100,000 hotel reviews manually annotated for the present invention.

Step 3: Element analysis.

Elements are the important factors that affect text sentiment analysis. They include the sentiment information elements mentioned above, such as the evaluation-object words, evaluation-attribute words, sentiment words, degree adverbs, ordinary adverbs, negation words, and parenthetical words in a review, as well as words of other categories such as cities and scenic spots. Element analysis identifies the elements in a sentence and marks each with its category label.

Step 4: Sentence-pattern matching.

After element analysis, a sentence is available in a semantically categorized form, that is, a sentence pattern. A sentence pattern reflects the shared context of the words or elements in it, and so has some disambiguating power. The existing sentence-pattern library plays a key role in sentence-pattern matching: it is the core resource of the present invention and reflects the sentence patterns commonly used to express sentiment in reviews in this domain. The present invention extracts sentence patterns with a review-based bootstrapping method.

Step 5: Anaphora resolution.

Anaphora and ellipsis are common linguistic phenomena. Anaphora usually expresses coreference, that is, two expressions that refer to the same object. There are many types of anaphora; the present invention mainly handles personal pronouns and demonstrative pronouns as anaphors. Ellipsis can be regarded as a zero anaphor, so anaphora and ellipsis are both treated as "anaphora" in the broad sense, and anaphora resolution means finding the antecedent of an anaphor or restoring an omitted subject. If the current sentence contains neither an evaluation-object word nor an evaluation-attribute word, the evaluation-object or evaluation-attribute word mentioned last in the previous sentence is introduced into the current sentence. If the current sentence contains only evaluation-attribute words, an evaluation object appearing in the previous sentence is introduced into the current sentence.
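The two carry-over rules for omitted objects can be sketched in Python. Sentences are represented as lists of (word, label) pairs after element analysis. Prepending the recovered word to the current sentence, and choosing the last-mentioned object when the previous sentence contains several, are assumptions; the text does not specify either, and the label "Adv" and the name `resolve_ellipsis` are hypothetical.

```python
def resolve_ellipsis(prev_sentence, cur_sentence):
    """Carry an omitted evaluation object/attribute over from the previous sentence."""
    labels = {label for _, label in cur_sentence}
    has_obj, has_attr = "Obj" in labels, "Attr" in labels

    if not has_obj and not has_attr:
        wanted = ("Obj", "Attr")   # rule 1: take the last-mentioned object or attribute
    elif has_attr and not has_obj:
        wanted = ("Obj",)          # rule 2: only the object is missing
    else:
        return list(cur_sentence)

    for token in reversed(prev_sentence):
        if token[1] in wanted:
            return [token] + list(cur_sentence)   # assumed placement: sentence start
    return list(cur_sentence)


if __name__ == "__main__":
    prev = [("房间", "Obj"), ("很", "Dgr"), ("大", "Sent")]
    print(resolve_ellipsis(prev, [("也", "Adv"), ("干净", "Sent")]))
    print(resolve_ellipsis(prev, [("价格", "Attr"), ("高", "Sent")]))
```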

Step 6: Sentiment analysis.

Sentences that contain evaluation-object words, evaluation-attribute words, or sentiment words are taken as candidate sentiment sentences. For each candidate, a maximum entropy model that combines rich contextual features classifies the polarity of the sentence, giving its sentiment orientation. In classification tasks, discriminative models often outperform generative models. A generative model estimates the joint probability distribution; in machine learning it is used to model the data directly, or as an intermediate step toward the conditional probability via Bayes' rule. A discriminative model instead models the conditional probability directly, which keeps training and prediction consistent and so separates the classes better. Among discriminative models, the maximum entropy model is widely used in natural language processing. For a classification problem that predicts a category y∈Y from given context information x∈X, the maximum entropy model builds a conditional probability model P(y|x) to predict each category y∈Y and estimate its probability. There are three categories: -1 (negative), 0 (no sentiment), and 1 (positive). The features include evaluation-object words, evaluation-attribute words, sentiment words, and their collocations, as well as negation words, sentence patterns, and other features.
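The prediction side of a maximum entropy classifier can be sketched in Python. In its standard form, P(y|x) is proportional to the exponential of the summed weights of the features active for (x, y), normalized over the three categories. The feature strings and the weight values below are made up for illustration; in practice the weights would be learned from annotated reviews, and the function name `maxent_probs` is hypothetical.

```python
import math


def maxent_probs(active_features, weights, labels=(-1, 0, 1)):
    """P(y|x) = exp(sum of weights[(feature, y)]) / Z(x) for each category y."""
    scores = {
        y: sum(weights.get((feat, y), 0.0) for feat in active_features)
        for y in labels
    }
    shift = max(scores.values())            # subtract the max for numerical stability
    expd = {y: math.exp(s - shift) for y, s in scores.items()}
    z = sum(expd.values())
    return {y: v / z for y, v in expd.items()}


if __name__ == "__main__":
    # Hypothetical hand-set weights; real weights come from training.
    weights = {
        ("Sent=干净", 1): 2.0,
        ("Neg+Sent=不+干净", -1): 3.0,
    }
    probs = maxent_probs(["Sent=干净", "Neg+Sent=不+干净"], weights)
    print(max(probs, key=probs.get), probs)
```

The collocation feature for "not clean" outweighs the bare sentiment-word feature here, which is the kind of interaction the negation and collocation features are meant to capture.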

本发明的有益效果是:本发明的方案可以有效利用酒店点评数据形成用户画像,并根据用户画像将最符合用户需求的酒店推荐给特定用户,这能够显著地节省用户在互联网上搜索酒店的时间和精力,还能够帮助酒店发现/克服自身的不足并进一步提高/优化自身的特色。The beneficial effects of the present invention are: the solution of the present invention can effectively use hotel review data to form user portraits, and recommend hotels that best meet user needs to specific users according to the user portraits, which can significantly save users the time spent searching for hotels on the Internet and energy, and can also help the hotel discover/overcome its own deficiencies and further improve/optimize its own characteristics.

附图说明Description of drawings

图1示出了本发明基于酒店点评的用户标签和酒店标签匹配方法的流程示意图。FIG. 1 shows a schematic flow chart of the method for matching user tags and hotel tags based on hotel reviews in the present invention.

Detailed Description

The invention is described in further detail below with reference to the drawing and embodiments; these descriptions do not limit the invention in any way. Unless otherwise stated, all scientific and technical terms used herein have the meanings commonly understood by a person of ordinary skill in the art to which the invention belongs and in related technical fields.

Referring to FIG. 1, according to one non-limiting embodiment of the invention, a hotel-review-based method for matching user labels and hotel labels is provided, comprising the following steps.

In step S1, about 50,000 hotel user reviews are obtained from the Internet, and about 5,000 words commonly used in the hotel industry are selected from them by frequency of occurrence to build a hotel-industry semantic dictionary.

In step S2, a hotel-industry sentiment sentence template library is prepared: about 500 sentiment sentences are selected by frequency of occurrence from the roughly 50,000 hotel user reviews obtained from the Internet and used as sentiment sentence templates.

In step S3, final hotel labels are prepared for about 200 hotels, as follows. From the roughly 50,000 hotel reviews obtained above, the user reviews for about 200 hotels are selected, with reviews from about 100 users for each hotel. The sentiment sentences in all user reviews of a specific hotel are compared one by one with the roughly 500 sentiment sentence templates, and the sentences that match a template are retained. The retained sentences are classified into different dimensions according to the type of sentiment they express, and all identified dimensions together form the hotel label set of that hotel. For example:

Hotel No. 1: dimension 1 (hygiene, grade A), dimension 11 (transport convenience, grade A), dimension 51 (surrounding environment index, grade A), dimension 101 (room size, grade A), and so on.

Hotel No. 2: dimension 2 (hygiene, grade B), dimension 12 (transport convenience, grade B), dimension 52 (surrounding environment index, grade B), dimension 102 (room size, grade B), and so on.

Hotel No. 3: dimension 3 (hygiene, grade C), dimension 13 (transport convenience, grade C), dimension 53 (surrounding environment index, grade C), dimension 103 (room size, grade C), and so on.

The weight of each hotel label in the hotel label set of the specific hotel is then calculated: the more often a label appears in all user reviews of the same hotel, and the less often it appears in all user reviews of all hotels, the higher its weight. Hotel labels whose weight exceeds a second set threshold, chosen here as 0.4, are selected from the hotel label set as the final hotel labels of that hotel. This step is repeated until final hotel labels have been obtained for all hotels.

Comparing the sentiment sentences in all user reviews of a specific hotel with the roughly 500 sentiment sentence templates may specifically include: segmenting a given sentiment sentence into words found in the hotel-industry semantic dictionary; comparing the sentence against the 500 templates according to the attributes of each of its words, to determine whether it matches any of them; and repeating this process until all sentiment sentences that match a template have been found.

In step S4, three user reviews by a specific user, covering three hotels, are obtained from the Internet.

In step S5, the sentiment sentences in all user reviews by the specific user are compared one by one with the roughly 500 sentiment sentence templates, and the sentences that match a template are retained. The retained sentences are classified into different dimensions according to the type of sentiment they express, and all identified dimensions together form the user label set of the specific user. For example, the user label set of the specific user includes dimension 1 (hygiene, grade A), dimension 12 (transport convenience, grade B), dimension 51 (surrounding environment index, grade A), dimension 103 (room size, grade C), and so on. The comparison specifically includes: segmenting a given sentiment sentence into words found in the hotel-industry semantic dictionary; comparing the sentence against the 500 templates according to the attributes of each of its words, to determine whether it matches any of them; and repeating this process until all sentiment sentences that match a template have been found.
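The segment-then-compare procedure can be sketched as below. The patent does not say how segmentation is performed or what a template looks like internally, so this sketch makes two assumptions: segmentation uses forward maximum matching against the dictionary, and a template is a sequence of word attributes (semantic classes). The dictionary entries, attribute names, and templates are all hypothetical.

```python
# Hypothetical dictionary: word -> attribute (semantic class).
DICTIONARY = {
    "房间": "OBJECT",     # "room"
    "很": "DEGREE",       # "very"
    "不": "NEGATION",     # "not"
    "干净": "SENTIMENT",  # "clean"
}

# Hypothetical templates: each is a sequence of attributes.
TEMPLATES = {
    ("OBJECT", "DEGREE", "SENTIMENT"),
    ("OBJECT", "NEGATION", "SENTIMENT"),
}


def segment(sentence, dictionary):
    """Forward maximum matching: greedily take the longest dictionary word."""
    max_len = max(len(word) for word in dictionary)
    words, i = [], 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if piece in dictionary:
                words.append(piece)
                i += size
                break
        else:
            i += 1  # no dictionary word starts here; skip one character
    return words


def matching_template(sentence, dictionary, templates):
    """Return the template matched by the sentence's attributes, or None."""
    words = segment(sentence, dictionary)
    attributes = tuple(dictionary[word] for word in words)
    return attributes if attributes in templates else None
```

For instance, `matching_template("房间很干净", DICTIONARY, TEMPLATES)` ("the room is very clean") segments into three dictionary words whose attributes line up with the first template, while a fragment with no object word matches nothing.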

In step S6, the weight of each user label in the specific user's user label set is calculated: the more often a label appears in all of the specific user's reviews, and the less often it appears in all reviews by all users of all hotels, the higher its weight.

In step S7, user labels whose weight exceeds a first set threshold, chosen here as 0.6, are selected from the specific user's user label set as that user's final user labels.

In step S8, the hotel whose final hotel labels have the highest matching rate with the specific user's final user labels is recommended to that user; in this non-limiting embodiment, Hotel No. 1 is recommended.
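The recommendation in step S8 can be checked against the example label sets of steps S3 and S5. The patent does not define the matching rate, so this sketch assumes it is the fraction of the user's final labels that also appear among a hotel's final labels, and it further assumes that all four example labels of each set survive thresholding. The function names are illustrative.

```python
def match_rate(user_labels, hotel_labels):
    """Fraction of the user's labels that the hotel also carries (assumed definition)."""
    if not user_labels:
        return 0.0
    return len(user_labels & hotel_labels) / len(user_labels)


def recommend(user_labels, hotels, top_n=1):
    """Return the names of the top_n hotels ranked by descending match rate."""
    ranked = sorted(
        hotels,
        key=lambda name: match_rate(user_labels, hotels[name]),
        reverse=True,
    )
    return ranked[:top_n]


# Dimension numbers taken from the example label sets in steps S3 and S5.
HOTELS = {
    "Hotel No. 1": {1, 11, 51, 101},
    "Hotel No. 2": {2, 12, 52, 102},
    "Hotel No. 3": {3, 13, 53, 103},
}
USER = {1, 12, 51, 103}

best = recommend(USER, HOTELS)
```

Under these assumptions Hotel No. 1 shares two of the user's four labels (dimensions 1 and 51), while the other two hotels share one each, which agrees with the embodiment's recommendation of Hotel No. 1.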

According to another non-limiting embodiment of the invention, a hotel-review-based device for matching user labels and hotel labels is provided, comprising: a hotel-industry sentiment sentence template library generation module, the library containing 1,000 sentiment sentence templates; a final hotel label generation module, which generates final hotel labels for 500 hotels; a user review acquisition module, which obtains from the Internet five user reviews by a specific user of different hotels; a user label set generation module, which compares the sentiment sentences in all user reviews by the specific user one by one with the 1,000 sentiment sentence templates, retains the sentences that match a template, classifies the retained sentences into different dimensions according to the type of sentiment they express, and forms the user label set of the specific user from all identified dimensions; a user label weight calculation module, which calculates the weight of each user label in the specific user's user label set, wherein the more often a label appears in all of the specific user's reviews and the less often it appears in all reviews by all users of all hotels, the higher its weight; a final user label generation module, which selects from the specific user's user label set the user labels whose weight exceeds a first set threshold as that user's final user labels; and a hotel recommendation module, which recommends to the specific user the ten hotels whose final hotel labels have the highest matching rates with that user's final user labels.

The final hotel label generation module obtains from the Internet, through the user review acquisition module, user reviews for 500 hotels, with reviews from 200 users for each hotel. The final hotel label generation module further includes: a hotel label set generation submodule, which compares the sentiment sentences in all user reviews of a specific hotel one by one with the 1,000 sentiment sentence templates, retains the sentences that match a template, classifies the retained sentences into different dimensions according to the type of sentiment they express, and forms the hotel label set of the specific hotel from all identified dimensions; and a hotel label weight calculation submodule, which calculates the weight of each hotel label in the hotel label set of the specific hotel, wherein the more often a label appears in all user reviews of the same hotel and the less often it appears in all user reviews of all hotels, the higher its weight. The final hotel label generation module selects from the hotel label set the hotel labels whose weight exceeds a second set threshold as the final hotel labels of the specific hotel.

The hotel-industry sentiment sentence template library generation module obtains 100,000 hotel user reviews from the Internet through the user review acquisition module and selects from them, by frequency of occurrence, 1,000 sentiment sentences to serve as sentiment sentence templates.

The device of the invention further includes a hotel-industry semantic dictionary generation module, which selects 10,000 words commonly used in the hotel industry from the 100,000 hotel user reviews by frequency of occurrence to build the hotel-industry semantic dictionary.

The invention is further described in detail below with reference to a specific example, which should not be construed as limiting the scope of protection of the invention.

A hotel-review-based method for matching user labels and hotel labels comprises the following steps:

Step 1: Obtain online review data from mainstream review sites such as Ctrip using a focused crawler.

Step 2: Filter out spam reviews, which include meaningless sentences.

Step 3: Build the hotel-industry semantic dictionary and the sentence-pattern template library.

Step 4: Perform sentiment analysis on the reviews.

Step 5: Label analysis.

For each sentence in a review that expresses sentiment, mine the opinion it expresses and represent that opinion with a label.

Step 6: Aggregate review fragments by label, and calculate the weight of each label for each user with the TF-IDF algorithm. TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical method for evaluating how important a word is to a document, and it is widely used in information retrieval and in text feature selection and weighting. The main idea of TF-IDF is that if a word appears frequently in one document and rarely in others, it discriminates well between classes and is well suited to characterizing that document.

TF-IDF is the product of TF and IDF. TF, the term frequency, is the frequency with which a given word appears in a document; it normalizes the raw word count to avoid a bias toward long documents. It is calculated as follows:

$$ tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}} $$

where $tf_{i,j}$ is the frequency of word i in document j, $n_{i,j}$ is the number of occurrences of word i in document j, and $\sum_k n_{k,j}$ is the total number of occurrences of all words in document j.

IDF, the inverse document frequency, measures the general importance of a word. It is calculated as follows:

$$ idf_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|} $$

where $idf_i$ is the inverse document frequency of word i in the corpus, $|D|$ is the total number of documents in the corpus, and $|\{j : t_i \in d_j\}|$ is the number of documents containing word i. If a word does not occur in the corpus, the denominator would be zero, so in practice $|\{j : t_i \in d_j\}| + 1$ is generally used as the denominator.

Given TF and IDF, TF-IDF is then calculated as follows:

$$ tfidf_{i,j} = tf_{i,j} \times idf_i $$

A high frequency of a word within a particular document, combined with a low document frequency of that word across the whole collection, yields a high TF-IDF weight. TF-IDF therefore tends to filter out common words and retain important ones.
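A minimal sketch of the TF, IDF, and TF-IDF formulas follows. It assumes each "document" is the list of labels aggregated for one user (or one hotel) and each "word" is a label; the data is invented for illustration. It uses the plain denominator, since every label being weighted occurs in at least one document, rather than the +1 variant.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Compute TF-IDF weights.

    documents: dict mapping document id -> list of terms (here, labels).
    Returns: dict mapping document id -> {term: tf-idf weight}.
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for terms in documents.values():
        doc_freq.update(set(terms))

    weights = {}
    for doc_id, terms in documents.items():
        counts = Counter(terms)
        total = sum(counts.values())
        weights[doc_id] = {
            # tf = n_ij / sum_k n_kj ; idf = log(|D| / document frequency)
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return weights


# Hypothetical label occurrences per user.
DOCS = {
    "user_a": ["clean", "clean", "quiet"],
    "user_b": ["clean", "transport"],
    "user_c": ["clean", "quiet", "quiet"],
}

WEIGHTS_BY_USER = tf_idf(DOCS)
```

In this toy data "clean" occurs for every user, so its IDF is log(3/3) = 0 and it gets zero weight everywhere, while "transport", which only one user mentions, gets the highest weight for that user.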

Step 7: For each hotel and each user, select labels according to their TF-IDF weights and preset thresholds to obtain the final hotel labels and user labels.
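The threshold selection can be sketched as a simple filter. The thresholds 0.6 (users) and 0.4 (hotels) are the values given in the first embodiment; the label names and weights below are invented, and the sketch assumes the weights have already been brought onto a scale on which those thresholds are meaningful, which the patent does not spell out.

```python
USER_THRESHOLD = 0.6   # first set threshold, from the first embodiment
HOTEL_THRESHOLD = 0.4  # second set threshold, from the first embodiment


def select_final_labels(label_weights, threshold):
    """Keep the labels whose weight is strictly greater than the threshold."""
    return {label for label, weight in label_weights.items() if weight > threshold}


# Hypothetical weights for one user or hotel.
EXAMPLE_WEIGHTS = {"clean": 0.72, "quiet": 0.6, "transport": 0.31}

final_user_labels = select_final_labels(EXAMPLE_WEIGHTS, USER_THRESHOLD)
final_hotel_labels = select_final_labels(EXAMPLE_WEIGHTS, HOTEL_THRESHOLD)
```

The comparison is strict because the text says "greater than" the threshold, so a label sitting exactly at 0.6 is not kept as a final user label.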

Although preferred embodiments of the invention have been described in detail herein, it is to be understood that the invention is not limited to the specific constructions described and shown here; those skilled in the art may make other modifications and variations without departing from the spirit and scope of the invention.

Claims (10)

1. A hotel-review-based method for matching user labels and hotel labels, comprising:
(1) preparing a hotel-industry sentiment sentence template library, the library comprising at least 100 sentiment sentence templates;
(2) preparing final hotel labels for at least three hotels;
(3) obtaining from the Internet at least two user reviews by a specific user of the same hotel or of different hotels;
(4) comparing the sentiment sentences in all user reviews by the specific user one by one with the at least 100 sentiment sentence templates, selecting the sentiment sentences that match the templates, classifying the selected sentences into different dimensions according to the type of sentiment expressed, and forming the user label set of the specific user from all identified dimensions;
(5) calculating the weight of each user label in the user label set of the specific user, wherein the more often a label appears in all user reviews by the specific user and the less often it appears in all user reviews by all users of all hotels, the higher the user label weight;
(6) selecting, from the user label set of the specific user, the user labels whose weight exceeds a first set threshold as the final user labels of the specific user; and
(7) recommending to the specific user at least the hotels whose final hotel labels rank in the top three by matching rate with the final user labels of the specific user.

2. The method of claim 1, wherein preparing final hotel labels for at least three hotels in step (2) comprises:
(2.1) obtaining from the Internet user reviews for at least three hotels, including reviews from at least three users for each hotel;
(2.2) comparing the sentiment sentences in all user reviews of a specific hotel one by one with the at least 100 sentiment sentence templates, selecting the sentiment sentences that match the templates, classifying the selected sentences into different dimensions according to the type of sentiment expressed, and forming the hotel label set of the specific hotel from all identified dimensions;
(2.3) calculating the weight of each hotel label in the hotel label set of the specific hotel, wherein the more often a label appears in all user reviews of the same hotel and the less often it appears in all user reviews of all hotels, the higher the hotel label weight;
(2.4) selecting, from the hotel label set, the hotel labels whose weight exceeds a second set threshold as the final hotel labels of the specific hotel; and
(2.5) repeating steps (2.2) to (2.4) until final hotel labels have been obtained for all hotels.

3. The method of claim 2, wherein step (1) further comprises, before preparing the hotel-industry sentiment sentence template library, a step of building a hotel-industry semantic dictionary, and wherein comparing the sentiment sentences in all user reviews by the specific user one by one with the at least 100 sentiment sentence templates in step (4) comprises:
(4.1) segmenting a given sentiment sentence into words found in the hotel-industry semantic dictionary;
(4.2) comparing the sentence against the at least 100 sentiment sentence templates according to the attributes of each of its words, to determine whether it matches any one of the templates; and
(4.3) repeating steps (4.1) and (4.2) until all sentiment sentences that match the at least 100 sentiment sentence templates have been selected.

4. The method of claim 3, wherein comparing the sentiment sentences in all user reviews of a specific hotel one by one with the at least 100 sentiment sentence templates in step (2.2) comprises:
(2.2.1) segmenting a given sentiment sentence into words found in the hotel-industry semantic dictionary;
(2.2.2) comparing the sentence against the at least 100 sentiment sentence templates according to the attributes of each of its words, to determine whether it matches any one of the templates; and
(2.2.3) repeating steps (2.2.1) and (2.2.2) until all sentiment sentences that match the at least 100 sentiment sentence templates have been selected.

5. The method of claim 4, wherein in step (3) the user reviews are obtained from review websites by a focused crawler.

6. The method of claim 5, wherein in step (1) the hotel-industry sentiment sentence template library is prepared by extracting sentence-pattern templates with a bootstrapping method based on user reviews.

7. The method of claim 6, wherein preparing the hotel-industry sentiment sentence template library and building the hotel-industry semantic dictionary comprise:
(1.1) obtaining review data, and forming a seed dictionary by collecting the words of each sentiment element;
(1.2) segmenting the sentences of the review data into words, then determining the semantic class of each word and replacing the word with its semantic class label;
(1.3) splitting the label-replaced review data into sentences, and generating templates from the names of the semantic classes and the specific words each class contains;
(1.4) applying the templates to the label-replaced review data to extract the semantic words of each semantic class;
(1.5) scoring each template according to its importance, generality, and accuracy;
(1.6) selecting the highest-scoring templates, calculating a score for the semantic words extracted by each template from the selected templates and their scores, and then selecting the highest-scoring semantic words to extend the semantic dictionary; and
(1.7) iterating steps (1.2) to (1.6) and terminating when the selected semantic words are no longer correct, which yields the final hotel-industry semantic dictionary, with the templates forming the hotel-industry sentiment sentence template library.

8. The method of claim 7, wherein in step (1.6) the highest-scoring templates are the top 5 to 10% of templates by score, and the highest-scoring semantic words are the top 5 to 10% of semantic words by score.

9. A hotel-review-based device for matching user labels and hotel labels, comprising:
a hotel-industry sentiment sentence template library generation module, the library comprising at least 100 sentiment sentence templates;
a final hotel label generation module, which generates final hotel labels for at least three hotels;
a user review acquisition module, which obtains from the Internet at least two user reviews by a specific user of the same hotel or of different hotels;
a user label set generation module, which compares the sentiment sentences in all user reviews by the specific user one by one with the at least 100 sentiment sentence templates, selects the sentiment sentences that match the templates, classifies the selected sentences into different dimensions according to the type of sentiment expressed, and forms the user label set of the specific user from all identified dimensions;
a user label weight calculation module, which calculates the weight of each user label in the user label set of the specific user, wherein the more often a label appears in all user reviews by the specific user and the less often it appears in all user reviews by all users of all hotels, the higher the user label weight;
a final user label generation module, which selects, from the user label set of the specific user, the user labels whose weight exceeds a first set threshold as the final user labels of the specific user; and
a hotel recommendation module, which recommends to the specific user at least the hotels whose final hotel labels rank in the top three by matching rate with the final user labels of the specific user.

10. The device of claim 9, wherein the final hotel label generation module obtains from the Internet, through the user review acquisition module, user reviews for at least three hotels, including reviews from at least three users for each hotel;
the final hotel label generation module further comprises:
a hotel label set generation submodule, which compares the sentiment sentences in all user reviews of a specific hotel one by one with the at least 100 sentiment sentence templates, selects the sentiment sentences that match the templates, classifies the selected sentences into different dimensions according to the type of sentiment expressed, and forms the hotel label set of the specific hotel from all identified dimensions; and
a hotel label weight calculation submodule, which calculates the weight of each hotel label in the hotel label set of the specific hotel, wherein the more often a label appears in all user reviews of the same hotel and the less often it appears in all user reviews of all hotels, the higher the hotel label weight;
wherein the final hotel label generation module selects, from the hotel label set, the hotel labels whose weight exceeds a second set threshold as the final hotel labels of the specific hotel.
CN201510593613.5A 2015-09-17 2015-09-17 User label and hotel label matching method and device based on hotel comments Pending CN105205699A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510593613.5A CN105205699A (en) 2015-09-17 2015-09-17 User label and hotel label matching method and device based on hotel comments

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510593613.5A CN105205699A (en) 2015-09-17 2015-09-17 User label and hotel label matching method and device based on hotel comments

Publications (1)

Publication Number Publication Date
CN105205699A true CN105205699A (en) 2015-12-30

Family

ID=54953364

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510593613.5A Pending CN105205699A (en) 2015-09-17 2015-09-17 User label and hotel label matching method and device based on hotel comments

Country Status (1)

Country Link
CN (1) CN105205699A (en)

Cited By (38)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105844424A (en) * 2016-05-30 2016-08-10 中国计量学院 Product quality problem discovery and risk assessment method based on network comments
CN106257455A (en) * 2016-07-08 2016-12-28 闽江学院 A kind of Bootstrapping algorithm based on dependence template extraction viewpoint evaluation object
CN106874435A (en) * 2017-01-25 2017-06-20 北京航空航天大学 User portrait construction method and device
CN106909659A (en) * 2017-02-27 2017-06-30 携程旅游网络技术(上海)有限公司 Hotel's sort method based on traffic convenience degree in OTA websites
WO2017120739A1 (en) * 2016-01-11 2017-07-20 程强 Method and system for analyzing restaurant reviews
CN107515932A (en) * 2017-08-28 2017-12-26 北京智诚律法科技有限公司 Artificial intelligence law consulting system based on typical problem storehouse
WO2018035698A1 (en) * 2016-08-23 2018-03-01 盛玉伟 Method and system for house appraisal
CN108256067A (en) * 2018-01-16 2018-07-06 平安好房(上海)电子商务有限公司 Calculate method, apparatus, equipment and the storage medium of source of houses similarity
CN108289121A (en) * 2018-01-02 2018-07-17 阿里巴巴集团控股有限公司 The method for pushing and device of marketing message
CN108470023A (en) * 2018-01-18 2018-08-31 阿里巴巴集团控股有限公司 The recommendation method and device of business function
CN108664469A (en) * 2018-05-07 2018-10-16 首都师范大学 A kind of emotional category determines method, apparatus and server
CN108959253A (en) * 2018-06-28 2018-12-07 北京嘀嘀无限科技发展有限公司 Extracting method, device and the readable storage medium storing program for executing of core phrase
CN109272337A (en) * 2017-07-17 2019-01-25 阿里巴巴集团控股有限公司 The generation method and relevant device of object tag
CN109325186A (en) * 2018-08-11 2019-02-12 桂林理工大学 A behavioral motivation inference method based on the fusion of user preference features and geographical features
CN109446310A (en) * 2018-10-30 2019-03-08 腾讯科技(武汉)有限公司 A kind of method for evaluating quality, device and the storage medium of question sentence template
WO2019062081A1 (en) * 2017-09-28 2019-04-04 平安科技(深圳)有限公司 Salesman profile formation method, electronic device and computer readable storage medium
CN110020149A (en) * 2017-11-30 2019-07-16 Tcl集团股份有限公司 Labeling processing method, device, terminal device and the medium of user information
CN110097394A (en) * 2019-03-27 2019-08-06 青岛高校信息产业股份有限公司 The latent objective recommended method of product and device
CN110147483A (en) * 2017-09-12 2019-08-20 阿里巴巴集团控股有限公司 A kind of title method for reconstructing and device
CN110263022A (en) * 2019-05-08 2019-09-20 深圳丝路天地电子商务有限公司 Hotel's data matching method and device
CN110287341A (en) * 2019-06-26 2019-09-27 腾讯科技(深圳)有限公司 A kind of data processing method, device and readable storage medium storing program for executing
CN110457502A (en) * 2019-08-21 2019-11-15 京东方科技集团股份有限公司 Constructing knowledge map method, human-computer interaction method, electronic equipment and storage medium
CN110633469A (en) * 2019-09-10 2019-12-31 陈绪平 Method for accurately understanding Chinese sentence meaning
CN110633370A (en) * 2019-09-19 2019-12-31 携程计算机技术(上海)有限公司 Method, system, electronic device and medium for generating OTA hotel label
CN110674260A (en) * 2019-09-27 2020-01-10 北京百度网讯科技有限公司 Training method and device of semantic similarity model, electronic equipment and storage medium
CN110781394A (en) * 2019-10-24 2020-02-11 西北工业大学 Personalized commodity description generation method based on multi-source crowd-sourcing data
WO2020076179A1 (en) * 2018-10-11 2020-04-16 Общество С Ограниченной Ответственностью "Глобус Медиа" Method for determining tags for hotels and device for the implementation thereof
CN111080162A (en) * 2019-12-27 2020-04-28 南昌众荟智盈信息技术有限公司 Automatic service task allocation method for improving automation level of hotel service process
CN111737400A (en) * 2020-06-15 2020-10-02 上海理想信息产业(集团)有限公司 Knowledge reasoning-based big data service tag expansion method and system
CN111768213A (en) * 2020-09-03 2020-10-13 耀方信息技术(上海)有限公司 User label weight evaluation method
CN112035750A (en) * 2020-09-17 2020-12-04 上海二三四五网络科技有限公司 Control method and device for user tag expansion
CN112948677A (en) * 2021-02-26 2021-06-11 上海携旅信息技术有限公司 Recommendation reason determination method, system, device and medium based on comment aesthetic feeling
CN113139838A (en) * 2021-05-10 2021-07-20 上海华客信息科技有限公司 Hotel service evaluation method, system, equipment and storage medium
CN113361920A (en) * 2021-06-04 2021-09-07 上海华客信息科技有限公司 Hotel service optimization index recommendation method, system, equipment and storage medium
CN113761228A (en) * 2021-01-15 2021-12-07 北京沃东天骏信息技术有限公司 Label generating method and device based on multiple tasks, electronic equipment and medium
CN114117057A (en) * 2020-08-25 2022-03-01 武汉Tcl集团工业研究院有限公司 Keyword extraction method of product feedback information and terminal equipment
CN114943462A (en) * 2022-06-02 2022-08-26 上海华客信息科技有限公司 Hotel group data processing method, system, equipment and storage medium
CN115034213A (en) * 2022-08-15 2022-09-09 苏州大学 Recognition method of prefix and suffix negation words based on joint learning

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103984984A (en) * 2014-06-11 2014-08-13 张劲松 Hotel room-reservation system and realizing method thereof
CN104281645A (en) * 2014-08-27 2015-01-14 北京理工大学 Method for identifying emotion key sentence on basis of lexical semantics and syntactic dependency
CN105095508A (en) * 2015-08-31 2015-11-25 北京奇艺世纪科技有限公司 Multimedia content recommendation method and multimedia content recommendation apparatus

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
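Lou Xiaofeng (娄小丰): "Research on a hotel recommendation algorithm based on multi-attribute scoring" (基于多属性打分的酒店推荐算法研究), China Master's Theses Full-text Database (《中国优秀硕士学位论文全文数据库》) *
Nie Hui, Du Jiazhong (聂卉, 杜嘉忠): "Research on extracting product feature labels with dependency-syntax templates" (依存句法模板下的商品特征标签抽取研究), New Technology of Library and Information Service (《现代图书情报技术》) *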

Cited By (50)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017120739A1 (en) * 2016-01-11 2017-07-20 程强 Method and system for analyzing restaurant reviews
CN105844424A (en) * 2016-05-30 2016-08-10 中国计量学院 Product quality problem discovery and risk assessment method based on network comments
CN106257455A (en) * 2016-07-08 2016-12-28 闽江学院 A kind of Bootstrapping algorithm based on dependence template extraction viewpoint evaluation object
WO2018035698A1 (en) * 2016-08-23 2018-03-01 盛玉伟 Method and system for house appraisal
CN106874435A (en) * 2017-01-25 2017-06-20 北京航空航天大学 User portrait construction method and device
CN106874435B (en) * 2017-01-25 2020-02-14 北京航空航天大学 User portrait construction method and device
CN106909659A (en) * 2017-02-27 2017-06-30 携程旅游网络技术(上海)有限公司 Hotel's sort method based on traffic convenience degree in OTA websites
CN109272337A (en) * 2017-07-17 2019-01-25 阿里巴巴集团控股有限公司 The generation method and relevant device of object tag
CN107515932A (en) * 2017-08-28 2017-12-26 北京智诚律法科技有限公司 Artificial intelligence law consulting system based on typical problem storehouse
CN110147483B (en) * 2017-09-12 2023-09-29 阿里巴巴集团控股有限公司 Title reconstruction method and device
CN110147483A (en) * 2017-09-12 2019-08-20 阿里巴巴集团控股有限公司 A kind of title method for reconstructing and device
WO2019062081A1 (en) * 2017-09-28 2019-04-04 平安科技(深圳)有限公司 Salesman profile formation method, electronic device and computer readable storage medium
CN110020149A (en) * 2017-11-30 2019-07-16 Tcl集团股份有限公司 Labeling processing method, device, terminal device and the medium of user information
CN108289121B (en) * 2018-01-02 2020-09-29 阿里巴巴集团控股有限公司 Marketing information pushing method and device
CN108289121A (en) * 2018-01-02 2018-07-17 阿里巴巴集团控股有限公司 The method for pushing and device of marketing message
CN108256067A (en) * 2018-01-16 2018-07-06 平安好房(上海)电子商务有限公司 Calculate method, apparatus, equipment and the storage medium of source of houses similarity
CN108470023A (en) * 2018-01-18 2018-08-31 阿里巴巴集团控股有限公司 The recommendation method and device of business function
CN108664469A (en) * 2018-05-07 2018-10-16 首都师范大学 A kind of emotional category determines method, apparatus and server
CN108664469B (en) * 2018-05-07 2021-11-19 首都师范大学 Emotion category determination method and device and server
CN108959253A (en) * 2018-06-28 2018-12-07 北京嘀嘀无限科技发展有限公司 Extracting method, device and the readable storage medium storing program for executing of core phrase
CN109325186B (en) * 2018-08-11 2021-08-17 桂林理工大学 A Behavioral Motivation Inference Algorithm Fusion of User Preferences and Geographical Features
CN109325186A (en) * 2018-08-11 2019-02-12 桂林理工大学 A behavioral motivation inference method based on the fusion of user preference features and geographical features
WO2020076179A1 (en) * 2018-10-11 2020-04-16 Общество С Ограниченной Ответственностью "Глобус Медиа" Method for determining tags for hotels and device for the implementation thereof
CN109446310B (en) * 2018-10-30 2020-11-03 腾讯科技(武汉)有限公司 Question template quality evaluation method and device and storage medium
CN109446310A (en) * 2018-10-30 2019-03-08 腾讯科技(武汉)有限公司 A kind of method for evaluating quality, device and the storage medium of question sentence template
CN110097394A (en) * 2019-03-27 2019-08-06 青岛高校信息产业股份有限公司 The latent objective recommended method of product and device
CN110263022B (en) * 2019-05-08 2023-03-14 深圳丝路天地电子商务有限公司 Hotel data matching method and device
CN110263022A (en) * 2019-05-08 2019-09-20 深圳丝路天地电子商务有限公司 Hotel's data matching method and device
CN110287341B (en) * 2019-06-26 2024-08-20 腾讯科技(深圳)有限公司 Data processing method, device and readable storage medium
CN110287341A (en) * 2019-06-26 2019-09-27 腾讯科技(深圳)有限公司 A kind of data processing method, device and readable storage medium storing program for executing
CN110457502A (en) * 2019-08-21 2019-11-15 京东方科技集团股份有限公司 Constructing knowledge map method, human-computer interaction method, electronic equipment and storage medium
CN110633469A (en) * 2019-09-10 2019-12-31 陈绪平 Method for accurately understanding Chinese sentence meaning
CN110633370B (en) * 2019-09-19 2023-07-04 携程计算机技术(上海)有限公司 OTA hotel label generation method, system, electronic device and medium
CN110633370A (en) * 2019-09-19 2019-12-31 携程计算机技术(上海)有限公司 Method, system, electronic device and medium for generating OTA hotel label
CN110674260A (en) * 2019-09-27 2020-01-10 北京百度网讯科技有限公司 Training method and device of semantic similarity model, electronic equipment and storage medium
CN110781394A (en) * 2019-10-24 2020-02-11 西北工业大学 Personalized commodity description generation method based on multi-source crowd-sourcing data
WO2021077973A1 (en) * 2019-10-24 2021-04-29 西北工业大学 Personalised product description generating method based on multi-source crowd intelligence data
CN111080162A (en) * 2019-12-27 2020-04-28 南昌众荟智盈信息技术有限公司 Automatic service task allocation method for improving automation level of hotel service process
CN111737400A (en) * 2020-06-15 2020-10-02 上海理想信息产业(集团)有限公司 Knowledge reasoning-based big data service tag expansion method and system
CN111737400B (en) * 2020-06-15 2023-06-20 上海理想信息产业(集团)有限公司 Knowledge reasoning-based big data service label expansion method and system
CN114117057A (en) * 2020-08-25 2022-03-01 武汉Tcl集团工业研究院有限公司 Keyword extraction method of product feedback information and terminal equipment
CN111768213A (en) * 2020-09-03 2020-10-13 耀方信息技术(上海)有限公司 User label weight evaluation method
CN112035750A (en) * 2020-09-17 2020-12-04 上海二三四五网络科技有限公司 Control method and device for user tag expansion
CN113761228A (en) * 2021-01-15 2021-12-07 北京沃东天骏信息技术有限公司 Label generating method and device based on multiple tasks, electronic equipment and medium
CN112948677A (en) * 2021-02-26 2021-06-11 上海携旅信息技术有限公司 Recommendation reason determination method, system, device and medium based on comment aesthetic feeling
CN112948677B (en) * 2021-02-26 2023-11-03 上海携旅信息技术有限公司 Recommendation reason determining method, system, equipment and medium based on comment aesthetic feeling
CN113139838A (en) * 2021-05-10 2021-07-20 上海华客信息科技有限公司 Hotel service evaluation method, system, equipment and storage medium
CN113361920A (en) * 2021-06-04 2021-09-07 上海华客信息科技有限公司 Hotel service optimization index recommendation method, system, equipment and storage medium
CN114943462A (en) * 2022-06-02 2022-08-26 上海华客信息科技有限公司 Hotel group data processing method, system, equipment and storage medium
CN115034213A (en) * 2022-08-15 2022-09-09 苏州大学 Recognition method of prefix and suffix negation words based on joint learning

Similar Documents

Publication Publication Date Title
CN105205699A (en) User label and hotel label matching method and device based on hotel comments
CN109189942B (en) Method and device for constructing knowledge graph of patent data
CN108573411B (en) Mixed recommendation method based on deep emotion analysis and multi-source recommendation view fusion of user comments
CN106776711B (en) Chinese medical knowledge map construction method based on deep learning
Li et al. Twiner: named entity recognition in targeted twitter stream
Zubrinic et al. The automatic creation of concept maps from documents written using morphologically rich languages
CN106407236B (en) A sentiment orientation detection method for review data
CN111680173A (en) A CMR Model for Unified Retrieval of Cross-Media Information
CN112182145B (en) Text similarity determination method, device, equipment and storage medium
CN112989208B (en) Information recommendation method and device, electronic equipment and storage medium
CN114065758A (en) Document keyword extraction method based on hypergraph random walk
CN108388660A (en) A kind of improved electric business product pain spot analysis method
CN110888991A (en) Sectional semantic annotation method in weak annotation environment
CN111353044B (en) Comment-based emotion analysis method and system
CN106407235A (en) A semantic dictionary establishing method based on comment data
CN111444713B (en) Method and device for extracting entity relationship in news event
Xiong et al. Affective impression: Sentiment-awareness POI suggestion via embedding in heterogeneous LBSNs
Gong et al. Phrase-based hashtag recommendation for microblog posts
Nguyen et al. Analyzing customer experience in hotel services using topic modeling
Jia et al. A Chinese unknown word recognition method for micro-blog short text based on improved FP-growth
CN118838993A (en) Method for constructing keyword library and related products thereof
CN107908749A (en) A kind of personage's searching system and method based on search engine
Hussain et al. A technique for perceiving abusive bangla comments
Al-Sultany et al. Enriching tweets for topic modeling via linking to the wikipedia
CN113821718A (en) Article information pushing method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100088 Madian East Road, Haidian District, No. 17,, golden floor, International Building, 18

Applicant after: Beijing Zhong Hui Information Technology Limited by Share Ltd

Address before: 100088 Madian East Road, Haidian District, No. 17,, golden floor, International Building, 18

Applicant before: BEIJING ZHONGHUI INFORMATION TECHNOLOGY CO., LTD.

COR Change of bibliographic data
RJ01 Rejection of invention patent application after publication

Application publication date: 20151230
