
WO2019184217A1 - Hotspot event classification method and apparatus, and storage medium - Google Patents

Hotspot event classification method and apparatus, and storage medium

Info

Publication number
WO2019184217A1
WO2019184217A1 (PCT application No. PCT/CN2018/102083)
Authority
WO
WIPO (PCT)
Prior art keywords
event
information
word
preset
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2018/102083
Other languages
French (fr)
Chinese (zh)
Inventor
王健宗
吴天博
黄章成
肖京
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Publication of WO2019184217A1
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Definitions

  • the present application relates to the field of information technology, and in particular, to a hotspot event classification method and apparatus, and a computer-readable storage medium.
  • the present application provides a hotspot event classification method and apparatus, and a computer-readable storage medium, the main purpose of which is to improve the speed and accuracy of hotspot event classification on social media.
  • the present application provides a method for classifying hotspot events, comprising:
  • an obtaining step: obtaining, in real time, information texts published by a first preset number of users from a predetermined server;
  • a word segmentation step: segmenting the information texts by using a predetermined word segmentation rule to obtain the word segments corresponding to each information text;
  • a classification step: determining whether the hotspot event indicator value is greater than a preset threshold, and if so, acquiring an information vector of the information text corresponding to the feature word in a preset vectorization manner, and inputting the information vector into a pre-trained event classification model to determine the event type corresponding to the information text.
  • the present application further provides an electronic device, comprising a memory and a processor, wherein a hotspot event classification program is stored on the memory, and when the hotspot event classification program is executed by the processor, the following steps are implemented:
  • a word segmentation step: segmenting the information texts by using a predetermined word segmentation rule to obtain the word segments corresponding to each information text;
  • a calculating step: calculating a hotspot event indicator value corresponding to the feature word according to a preset calculation formula;
  • a classification step: determining whether the hotspot event indicator value is greater than a preset threshold, and if so, acquiring an information vector of the information text corresponding to the feature word in a preset vectorization manner, and inputting the information vector into a pre-trained event classification model to determine the event type corresponding to the information text.
  • the present application further provides a computer-readable storage medium that includes a hotspot event classification program which, when executed by a processor, implements any step of the hotspot event classification method described above.
  • the hotspot event classification method, electronic device, and computer-readable storage medium proposed by the application obtain the information texts published by social accounts on a server, segment the information texts to extract feature words, determine the maximum-probability event topic corresponding to each feature word, calculate the event indicator value corresponding to the feature word with a preset calculation formula, and finally vectorize the information text corresponding to a feature word whose event indicator value exceeds the preset threshold and input it into an event classification model, thereby accurately determining the event type of the information text and improving the classification speed.
  • FIG. 1 is a schematic diagram of a preferred embodiment of an electronic device of the present application.
  • FIG. 2 is a schematic block diagram of a preferred embodiment of the hotspot event classification program of FIG. 1;
  • FIG. 1 is a schematic diagram of a preferred embodiment of an electronic device 1 of the present application.
  • FIG. 2 is a schematic block diagram of a preferred embodiment of the hotspot event classification program 10.
  • FIG. 3 is a flowchart of a preferred embodiment of the hotspot event classification method.
  • the determining module 130 is configured to extract the preset feature words from the word segments and determine the event topic corresponding to each feature word by using a predetermined probability algorithm.
  • the feature words are pre-labeled and stored in the thesaurus 15.
  • the predetermined probability algorithm includes calculating a final probability P_3 from a first selection probability P_1 and a second selection probability P_2.
  • a second preset number of implicit event topics are added between the feature words and the event topics; the implicit event topics are virtual and have no real meaning.
  • FIG. 3 is a flowchart of a preferred embodiment of the hotspot event classification method of the present application.
  • if the accuracy is greater than the preset value, the training is completed; if the accuracy is less than or equal to the preset value, the number of sample data items is increased and the step of dividing the sample data into a training set and a verification set is repeated. Assume, for example, that the preset value is 98%: if the verification accuracy is greater than 98%, the training is completed; if the accuracy is less than 98%, 20,000 sample data items are added and the sample data is again divided into a training set and a verification set.
  • the hotspot event classification method proposed by the above embodiment obtains the information texts published by users from the server, performs word segmentation on the information texts, extracts the feature words among the word segments, and then determines the maximum-probability event topic of each feature word by using a predetermined probability algorithm.
  • a calculating step: calculating a hotspot event indicator value corresponding to the feature word according to a preset calculation formula;
  • the preset calculation formula is as follows:
  • v represents the speed of event development;
  • a represents the hotspot event indicator value;
  • t represents a time point;
  • T represents the time interval;
  • i is an integer;
  • t_i represents the time point at which the i-th feature word appears;
  • X_i represents the number of times the i-th feature word appears.
  • the predetermined word segmentation rule comprises:
  • the predetermined probability algorithm comprises:
  • P_1 represents the first selection probability;
  • P_2 represents the second selection probability;
  • P_3 represents the final probability.
  • the preset vectorization manner includes:
  • the event classification model is a long short-term memory (LSTM) network model, and the training steps of the event classification model are as follows:
  • the technical solution of the present application, in essence or in the part that contributes to the prior art, may be embodied in the form of a software product stored in a storage medium (such as the ROM/RAM, magnetic disk, or optical disk described above) and including a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods described in the various embodiments of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Provided are a hotspot event classification method and apparatus, and a storage medium. The method comprises: acquiring, in real time from a pre-determined social server, information text issued by a first pre-set number of users, and performing word segmentation on the information text according to a pre-determined word segmentation rule to obtain segmented words corresponding to each piece of information text. Then, the method comprises: extracting a feature word in the segmented words, using a pre-determined probability algorithm to determine an event subject corresponding to the feature word, then calculating, according to a pre-set calculation formula, a hotspot event index value corresponding to the feature word, and judging whether the hotspot event index value is greater than a pre-set threshold; and if the hotspot event index value is greater than the pre-set threshold, then acquiring an information vector of the information text corresponding to the feature word in a pre-set vectorization manner, and inputting the information vector into a pre-trained event classification model so as to determine an event type corresponding to the information text. By means of the present application, an event type of a hotspot event can be rapidly and accurately analyzed.

Description

Hotspot Event Classification Method, Apparatus, and Storage Medium

This application claims priority to Chinese Patent Application No. 201810252849.6, filed with the Chinese Patent Office on March 26, 2018 and entitled "Hotspot Event Classification Method, Apparatus, and Storage Medium", the entire contents of which are incorporated herein by reference.

Technical Field

The present application relates to the field of information technology, and in particular, to a hotspot event classification method and apparatus, and a computer-readable storage medium.

Background

With the development of network technology, social media is used ever more widely, and the number of events discussed on social media grows daily. Faced with this surge of events, quickly identifying the event types on social media, understanding the areas and hot topics that social media users care about, and making corresponding decisions has become a difficult problem for administrators.

At present, existing methods for classifying social media hotspot events are imperfect; a classification method is needed that can accurately and quickly determine the event type of a hotspot event at an early stage of its development.

发明内容Summary of the invention

鉴于以上内容,本申请提供一种热点事件分类方法、装置及计算机可读存储介质,其主要目的在于提高社交媒体上热点事件分类的速度及准确性。In view of the above, the present application provides a hot event classification method, apparatus, and computer readable storage medium, the main purpose of which is to improve the speed and accuracy of hot event classification on social media.

为实现上述目的,本申请提供一种热点事件分类方法,该方法包括:To achieve the above objective, the present application provides a method for classifying hotspot events, including:

获取步骤:实时从预先确定的服务器中获取第一预设数量用户发布的信息文本;Obtaining step: obtaining, in real time, a first preset number of information texts published by the user from a predetermined server;

分词步骤:利用预先确定的分词规则对上述信息文本进行分词,获得各个信息文本对应的分词;Word segmentation step: segmenting the above information text by using a predetermined word segmentation rule to obtain a word segment corresponding to each information text;

确定步骤:提取出分词中预设的特征词,利用预先确定的概率算法确定该特征词对应的事件主题;Determining step: extracting a feature word preset in the word segment, and determining a event theme corresponding to the feature word by using a predetermined probability algorithm;

计算步骤:根据预设的计算公式,计算出该特征词对应的热点事件指标值;Calculating step: calculating a hot event indicator value corresponding to the feature word according to a preset calculation formula;

分类步骤:判断热点事件指标值是否大于预设阈值,若热点事件指标值大于预设阈值,则利用预设的向量化方式获取该特征词对应的信息文本的信息向量,将所述信息向量输入预先训练的事件分类模型中,确定出该信息文本对应的事件类型。The classification step is: determining whether the hot event indicator value is greater than a preset threshold, and if the hot event indicator value is greater than a preset threshold, acquiring an information vector of the information text corresponding to the feature word by using a preset vectorization manner, and inputting the information vector In the pre-trained event classification model, the event type corresponding to the information text is determined.

In addition, the present application further provides an electronic device, comprising a memory and a processor, wherein a hotspot event classification program is stored on the memory, and when the hotspot event classification program is executed by the processor, the following steps are implemented:

an obtaining step: obtaining, in real time, information texts published by a first preset number of users from a predetermined server;

a word segmentation step: segmenting the information texts by using a predetermined word segmentation rule to obtain the word segments corresponding to each information text;

a determining step: extracting the preset feature words from the word segments, and determining the event topic corresponding to each feature word by using a predetermined probability algorithm;

a calculating step: calculating a hotspot event indicator value corresponding to the feature word according to a preset calculation formula;

a classification step: determining whether the hotspot event indicator value is greater than a preset threshold, and if so, acquiring an information vector of the information text corresponding to the feature word in a preset vectorization manner, and inputting the information vector into a pre-trained event classification model to determine the event type corresponding to the information text.

In addition, to achieve the above objective, the present application further provides a computer-readable storage medium that includes a hotspot event classification program which, when executed by a processor, implements any step of the hotspot event classification method described above.

With the hotspot event classification method, electronic device, and computer-readable storage medium proposed by the present application, the information texts published by social accounts on a server are obtained and segmented, feature words are extracted, the maximum-probability event topic corresponding to each feature word is determined, the event indicator value corresponding to the feature word is calculated with a preset calculation formula, and finally the information text corresponding to a feature word whose event indicator value exceeds the preset threshold is vectorized and input into an event classification model, so that the event type of the information text is determined accurately and the speed of event classification is improved.

Brief Description of the Drawings

FIG. 1 is a schematic diagram of a preferred embodiment of an electronic device of the present application;

FIG. 2 is a schematic block diagram of a preferred embodiment of the hotspot event classification program of FIG. 1;

FIG. 3 is a flowchart of a preferred embodiment of the hotspot event classification method of the present application;

FIG. 4 is a flowchart of the training of the event classification model of the present application.

The implementation, functional features, and advantages of the present application will be further described with reference to the accompanying drawings in conjunction with the embodiments.

Detailed Description

It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit it.

FIG. 1 is a schematic diagram of a preferred embodiment of an electronic device 1 of the present application.

In this embodiment, the electronic device 1 may be a server, a smartphone, a tablet computer, a personal computer, a portable computer, or another electronic device with computing capability.

The electronic device 1 includes a memory 11, a processor 12, a network interface 13, a communication bus 14, and a thesaurus 15. The network interface 13 may optionally include a standard wired interface and a wireless interface (such as a Wi-Fi interface). The communication bus 14 is used to implement connection and communication among these components.

The memory 11 includes at least one type of readable storage medium. The at least one type of readable storage medium may be a non-volatile storage medium such as a flash memory, a hard disk, a multimedia card, or a card-type memory. In some embodiments, the memory 11 may be an internal storage unit of the electronic device 1, such as a hard disk of the electronic device 1. In other embodiments, the memory 11 may also be an external storage unit of the electronic device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the electronic device 1.

In this embodiment, the memory 11 may be used to store the application software installed on the electronic device 1 and various types of data, such as the hotspot event classification program 10 and the thesaurus 15. The thesaurus 15 stores all the characters and words involved in the word segmentation process as well as the annotated feature words.

In some embodiments, the processor 12 may be a central processing unit (CPU), a microprocessor, or another data processing chip, and is configured to run the program code stored in the memory 11 or to process data, for example to execute the computer program code of the hotspot event classification program 10 or to train the event classification model.

FIG. 1 shows only the electronic device 1 with the components 11-15 and the hotspot event classification program 10, but it should be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead.

Optionally, the electronic device 1 may further include a display, which may also be referred to as a display screen or a display unit. In some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an organic light-emitting diode (OLED) touch display, or the like. The display is used to show the information processed in the electronic device 1 and to present a visual work interface, for example the event type of an information text.

Optionally, the electronic device 1 may further include a user interface. The user interface may include an input unit such as a keyboard and a voice output device such as a loudspeaker or a headset, and may optionally also include a standard wired interface and a wireless interface.

The electronic device 1 may further include a radio frequency (RF) circuit, sensors, an audio circuit, and the like, which are not described in detail here.

In the embodiment of the electronic device 1 shown in FIG. 1, the memory 11, as a computer storage medium, stores the program code of the hotspot event classification program 10. When the processor 12 executes the program code of the hotspot event classification program 10, the following steps are implemented:

an obtaining step: obtaining, in real time, information texts published by a first preset number of users from a predetermined server;

a word segmentation step: segmenting the information texts by using a predetermined word segmentation rule to obtain the word segments corresponding to each information text;

a determining step: extracting the preset feature words from the word segments, and determining the event topic corresponding to each feature word by using a predetermined probability algorithm;

a calculating step: calculating a hotspot event indicator value corresponding to the feature word according to a preset calculation formula;

a classification step: determining whether the hotspot event indicator value is greater than a preset threshold, and if so, acquiring an information vector of the information text corresponding to the feature word in a preset vectorization manner, and inputting the information vector into a pre-trained event classification model to determine the event type corresponding to the information text.

For the specific principles, refer to the schematic block diagram of a preferred embodiment of the hotspot event classification program 10 in FIG. 2 and the flowchart of a preferred embodiment of the hotspot event classification method in FIG. 3, described below.

FIG. 2 is a schematic block diagram of a preferred embodiment of the hotspot event classification program 10 of FIG. 1. A module referred to in this application is a series of computer program instruction segments capable of performing a particular function.

In this embodiment, the hotspot event classification program 10 includes an obtaining module 110, a word segmentation module 120, a determining module 130, a calculating module 140, a judging module 150, and a classification module 160. The functions or operation steps implemented by the modules 110-160 are similar to those described above and are not detailed again here; by way of example:

The obtaining module 110 is configured to obtain, in real time, the information texts published by a first preset number of users from a predetermined server. The predetermined server may be a social server such as a WeChat server, a Weibo server, or a QQ server. A user refers to a social account on the social server, and the first preset number of users may refer to some of the social accounts on the social server or to all of them.

The word segmentation module 120 is configured to segment the information texts by using a predetermined word segmentation rule to obtain the word segments corresponding to each information text. The predetermined word segmentation rule includes: splitting each obtained information text into short sentences according to preset types of punctuation marks, such as ",", "。", "!", ";", "?", and so on; and segmenting each short sentence, based on the words stored in the thesaurus 15, according to the longest-word-first principle. The longest-word-first principle means finding, in the thesaurus 15, the longest word that matches the short sentence and taking it as one word segment of that short sentence.

The determining module 130 is configured to extract the preset feature words from the word segments and determine the event topic corresponding to each feature word by using a predetermined probability algorithm. The feature words are annotated in advance and stored in the thesaurus 15. The predetermined probability algorithm includes calculating a final probability P_3 from a first selection probability P_1 and a second selection probability P_2. A second preset number of implicit event topics are added between the feature words and the event topics; the implicit event topics are virtual and have no real meaning. The first selection probability P_1 is calculated as follows: according to a predetermined mapping between implicit event topics and feature words, determine the first number X_1 of feature words contained in each implicit event topic and the second number X_2 of implicit event topics to which each feature word belongs, and determine, from X_1 and X_2, the first selection probability of each feature word for each implicit event topic, P_1 = 1/(X_1*X_2). The second selection probability P_2 is calculated as follows: according to a predetermined mapping between implicit event topics and event topics, determine the third number X_3 of implicit event topics contained in each event topic and the fourth number X_4 of event topics to which each implicit event topic belongs, and determine, from X_3 and X_4, the second selection probability of each implicit event topic for each event topic, P_2 = 1/(X_3*X_4). P_1 and P_2 are substituted into a predetermined probability calculation formula to obtain the final probability P_3 of each feature word for each event topic. The predetermined probability calculation formula is P_3 = P_1*P_2.

The calculating module 140 is configured to calculate the hotspot event indicator value corresponding to the feature word according to a preset calculation formula. The preset calculation formula is as follows:

(Formulas for v and a: presented as images PCTCN2018102083-appb-000001 and PCTCN2018102083-appb-000002 in the original publication.)

where v represents the speed of event development; a represents the hotspot event indicator value, that is, the "acceleration" of event development; t represents a time point; T represents the time interval; i is an integer; t_i represents the time point at which the i-th feature word appears; and X_i represents the number of times the i-th feature word appears.

The judging module 150 is configured to determine whether the hotspot event indicator value is greater than a preset threshold. The preset threshold is set in advance; when the hotspot event indicator value exceeds it, the "acceleration" of the development of the event topic has gone beyond a certain range, and the type of the event should be analyzed immediately.

The classification module 160 is configured to, when the hotspot event indicator value is greater than the preset threshold, acquire the information vector of the information text corresponding to the feature word in a preset vectorization manner, and input the information vector into a pre-trained event classification model to determine the event type corresponding to the information text. The preset vectorization manner includes: encoding the user information of the information text with an autoencoder to generate a user information vector; encoding the information text with a predetermined word vector model to generate the text information vector of the information text; and concatenating the user information vector and the text information vector to generate the information vector corresponding to the information text.

The event classification model is a long short-term memory (LSTM) network model. FIG. 4 is a flowchart of the training of the event classification model of the present application; the training steps of the event classification model are as follows:

obtain a third preset number of information texts and generate the information vector corresponding to each information text; determine the event type corresponding to each information vector according to a predetermined mapping between information texts and event types; and use the mapping data between information vectors and event types as sample data;

divide the sample data into a training set with a first proportion and a verification set with a second proportion, where the first proportion is greater than the second proportion;

train the event classification model with the sample data in the training set, and after training, verify the accuracy of the event classification model with the sample data in the verification set;

if the accuracy is greater than a preset value, the training is completed; if the accuracy is less than or equal to the preset value, increase the number of sample data items and return to the step of dividing the sample data into a training set and a verification set.

FIG. 3 is a flowchart of a preferred embodiment of the hotspot event classification method of the present application.

In this embodiment, when the processor 12 executes the computer program of the hotspot event classification program 10 stored in the memory 11, the hotspot event classification method is implemented through steps S10-S60:

Step S10: the obtaining module 110 obtains, in real time, the information texts published by a first preset number of users from a predetermined server. The predetermined server may be a social server such as a WeChat server, a Weibo server, or a QQ server. A user refers to a social account on the social server, and the first preset number of users may refer to some of the social accounts on the social server or to all of them. For example, the information texts published by the WeChat account of a salesperson A_1 in WeChat Moments or in friend group chats are obtained from the WeChat server.

Step S20: based on the obtained information texts, the word segmentation module 120 segments the information texts by using the predetermined word segmentation rule to obtain the word segments corresponding to each information text. Word segmentation refers to dividing an information text into characters or words. For example, for the information text "B_1 successfully developed the C_1 product" (where B_1 may be a company or department and C_1 may be a product name), the segmentation result is the sequence of individual words "B_1", "successfully", "developed", "C_1", "product". The predetermined word segmentation rule includes: splitting each obtained information text into short sentences according to preset types of punctuation marks, such as ",", "。", "!", ";", "?", and so on. For example, the information from the start of the information text (the first character) to the first punctuation mark of a preset type forms one short sentence, the information between the first and second punctuation marks of a preset type forms another short sentence, and so on, until the whole information text is split into short sentences. It should be understood, however, that if there is no punctuation mark of a preset type at the end of the text, the information from the last such punctuation mark to the end of the text (the last character) forms one short sentence. Each short sentence is then segmented, based on the words stored in the thesaurus 15, according to the longest-word-first principle, which means finding in the thesaurus 15 the longest word matching the short sentence as one word segment of that short sentence. Suppose the first character of a short sentence T_1 to be segmented is a: starting from the first character a, the longest word R_1 beginning with a is found in the thesaurus 15, R_1 matching part of T_1; R_1 is then removed from T_1, leaving the part T_2, and the same method is applied to T_2 until all the characters and words of T_1 have been found in the thesaurus 15, giving the result "R_1/R_2/…".
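
As a purely illustrative sketch of the rule just described (the lexicon contents, function names, and example sentence below are assumptions for demonstration, not part of the original disclosure), the punctuation split and longest-word-first matching could be implemented roughly as follows:

```python
import re

# Hypothetical thesaurus (thesaurus 15); in practice it would hold all characters,
# words, and annotated feature words used during segmentation.
LEXICON = {"B1", "成功", "研制", "出", "了", "C1", "产品"}
MAX_WORD_LEN = max(len(w) for w in LEXICON)

def split_short_sentences(text: str) -> list[str]:
    """Split an information text into short sentences at preset punctuation marks."""
    return [p for p in re.split(r"[，。！；？,.!;?]", text) if p]

def segment(short_sentence: str) -> list[str]:
    """Longest-word-first (forward maximum matching) against the lexicon."""
    segments, i = [], 0
    while i < len(short_sentence):
        match = short_sentence[i]  # fall back to a single character
        for length in range(min(MAX_WORD_LEN, len(short_sentence) - i), 0, -1):
            candidate = short_sentence[i:i + length]
            if candidate in LEXICON:
                match = candidate
                break
        segments.append(match)
        i += len(match)
    return segments

text = "B1成功研制出了C1产品"
print([w for s in split_short_sentences(text) for w in segment(s)])
# -> ['B1', '成功', '研制', '出', '了', 'C1', '产品']
```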

Step S30: if the word segments of an information text contain a feature word stored in the thesaurus 15, the determining module 130 determines the event topic corresponding to the feature word by using a predetermined probability algorithm. It should be understood that the word segments of an information text may contain no feature word, or may contain one or more feature words. The feature words are annotated in advance and stored in the thesaurus 15.

The predetermined probability algorithm includes: adding a second preset number of implicit event topics between the feature words and the event topics; the implicit event topics are virtual and have no real meaning. For example, 50 implicit event topics k_1, k_2, ..., k_50 are added between the feature words and the event topics. According to a predetermined mapping between implicit event topics and feature words, the first number X_1 of feature words contained in each implicit event topic and the second number X_2 of implicit event topics to which each feature word belongs are determined, and the first selection probability of each feature word for each implicit event topic is determined from the first number X_1 and the second number X_2 as P_1 = 1/(X_1*X_2). For example, if the second number of implicit event topics to which a feature word Y belongs is 5, and one of those implicit event topics, k_7, contains a first number of 7 feature words, then the first selection probability of the feature word Y for the implicit event topic k_7 is 1/35. According to a predetermined mapping between implicit event topics and event topics, the third number X_3 of implicit event topics contained in each event topic and the fourth number X_4 of event topics to which each implicit event topic belongs are determined, and the second selection probability of each implicit event topic for each event topic is determined from the third number X_3 and the fourth number X_4 as P_2 = 1/(X_3*X_4). For example, if the fourth number of event topics to which the implicit event topic k_7 belongs is 4, and one of those event topics, Z, contains a third number of 5 implicit event topics, then the second selection probability of the implicit event topic k_7 for the event topic Z is 1/20. The first selection probability P_1 and the second selection probability P_2 are substituted into a predetermined probability calculation formula to compute the distribution of the final probability P_3 of each feature word over the event topics. The predetermined probability calculation formula is P_3 = P_1*P_2. For example, if the first selection probability P_1 of the feature word Y for the implicit event topic k_7 is 1/35 and the second selection probability P_2 of the implicit event topic k_7 for the event topic Z is 1/20, then the final probability P_3 of the feature word Y for the event topic Z is 1/700. In the same way, the final probabilities P_3 of the feature word Y for the other event topics, and of the other feature words of the information text for each event topic, are calculated. Finally, the event topic with the maximum probability for each feature word is taken as the event topic corresponding to that feature word.
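
A minimal sketch of the two-level selection probabilities described above might look as follows; the topic names, mapping dictionaries, and counts are invented purely to reproduce the 1/35, 1/20, and 1/700 example and are not taken from the original disclosure:

```python
# Hypothetical mappings: implicit event topics -> feature words, and
# event topics -> implicit event topics (all assumed for illustration).
topic_to_features = {
    "k7": ["Y", "f2", "f3", "f4", "f5", "f6", "f7"],   # X1 = 7 feature words
}
feature_to_topics = {"Y": ["k3", "k5", "k7", "k9", "k12"]}  # X2 = 5 implicit topics

event_to_topics = {"Z": ["k1", "k7", "k20", "k31", "k44"]}  # X3 = 5 implicit topics
topic_to_events = {"k7": ["Z", "Z2", "Z3", "Z4"]}           # X4 = 4 event topics

def first_selection_probability(feature: str, implicit_topic: str) -> float:
    """P1 = 1 / (X1 * X2)."""
    x1 = len(topic_to_features[implicit_topic])
    x2 = len(feature_to_topics[feature])
    return 1.0 / (x1 * x2)

def second_selection_probability(implicit_topic: str, event_topic: str) -> float:
    """P2 = 1 / (X3 * X4)."""
    x3 = len(event_to_topics[event_topic])
    x4 = len(topic_to_events[implicit_topic])
    return 1.0 / (x3 * x4)

p1 = first_selection_probability("Y", "k7")   # 1/35
p2 = second_selection_probability("k7", "Z")  # 1/20
p3 = p1 * p2                                  # final probability P3 = 1/700
print(p1, p2, p3)
```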

Step S40: the calculating module 140 calculates the hotspot event indicator value corresponding to each feature word according to a preset calculation formula. The preset calculation formula is as follows:

(Formulas for v and a: presented as images PCTCN2018102083-appb-000003 and PCTCN2018102083-appb-000004 in the original publication.)

where v represents the speed of event development; a represents the hotspot event indicator value, that is, the "acceleration" of event development; t represents a time point; T represents the time interval; i is an integer; t_i represents the time point at which the i-th feature word appears; and X_i represents the number of times the i-th feature word appears. In this way, the hotspot event indicator values of the event topics corresponding to all feature words are calculated; the larger the hotspot indicator value, the faster the development trend of the event represented by the event topic.
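
Because the formulas for v and a appear only as images in this text, the sketch below relies on an assumed reading in which v is the number of occurrences of the feature word per time interval (its "speed") and a is the change in v between consecutive intervals (its "acceleration"); the actual formulas in the original filing may differ.

```python
# Assumed interpretation (the original formulas are images and may differ):
# v(t) = number of occurrence times t_i with t - T < t_i <= t, divided by T
# a(t) = (v(t) - v(t - T)) / T, the "acceleration" used as the indicator value
def speed(occurrence_times: list[float], t: float, T: float) -> float:
    count = sum(1 for t_i in occurrence_times if t - T < t_i <= t)
    return count / T

def hotspot_indicator(occurrence_times: list[float], t: float, T: float) -> float:
    return (speed(occurrence_times, t, T) - speed(occurrence_times, t - T, T)) / T

# occurrence_times would hold the time points t_i at which the feature word appeared
print(hotspot_indicator([1.0, 2.5, 2.6, 2.8, 2.9], t=3.0, T=1.0))
```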

Step S50: the judging module 150 determines whether the hotspot event indicator value is greater than a preset threshold. The preset threshold is set in advance; when the hotspot event indicator value exceeds it, the "acceleration" of the development of the event topic has gone beyond a certain range, and the type of the event should be analyzed immediately.

Step S60: if the hotspot event indicator value is greater than the preset threshold, the classification module 160 acquires the information vector of the information text corresponding to the feature word in a preset vectorization manner, and inputs the information vector into a pre-trained event classification model to determine the event type corresponding to the information text. The preset vectorization manner includes: encoding the user information of the information text with an autoencoder, such as an Auto-Encoder, to generate a user information vector. The Auto-Encoder is an unsupervised learning algorithm mainly used for dimensionality reduction or feature extraction. The information text is then encoded with a predetermined word vector model to generate the text information vector of the information text. The predetermined word vector model may be a Word2Vec model or a Doc2Vec model; for example, the information text is encoded with the Word2Vec model to generate its text information vector. Finally, the user information vector and the text information vector are concatenated to generate the information vector corresponding to the information text.
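
As an illustrative sketch of this vectorization step, the example below uses gensim's Word2Vec as an assumed word-vector implementation and a single linear projection standing in for the autoencoder; the dimensions and corpus are also assumptions, not details from the original disclosure.

```python
import numpy as np
from gensim.models import Word2Vec  # assumed choice of word-vector library

def encode_user(user_profile: np.ndarray, encoder_weights: np.ndarray) -> np.ndarray:
    """Stand-in for the autoencoder's encoder half: one linear projection + nonlinearity."""
    return np.tanh(encoder_weights @ user_profile)

def encode_text(tokens: list[str], w2v: Word2Vec) -> np.ndarray:
    """Average the word vectors of the tokens to obtain a text information vector."""
    vectors = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(w2v.vector_size)

def information_vector(user_profile, encoder_weights, tokens, w2v) -> np.ndarray:
    """Concatenate the user information vector and the text information vector."""
    return np.concatenate([encode_user(user_profile, encoder_weights),
                           encode_text(tokens, w2v)])

# Toy usage: a 16-dimensional user profile compressed to 8 dimensions, and a
# Word2Vec model trained on the segmented corpus (all sizes are assumptions).
corpus = [["B1", "成功", "研制", "C1", "产品"]]
w2v = Word2Vec(corpus, vector_size=32, min_count=1)
vec = information_vector(np.random.rand(16), np.random.rand(8, 16), corpus[0], w2v)
print(vec.shape)  # (40,) = 8 user dimensions + 32 text dimensions
```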

The event classification model is an LSTM model. FIG. 4 is a flowchart of the training of the event classification model of the present application; the training steps of the event classification model are as follows:

Obtain a third preset number of information texts and generate the information vector corresponding to each information text; determine the event type corresponding to each information vector according to a predetermined mapping between information texts and event types; and use the mapping data between information vectors and event types as sample data. For example, 100,000 information texts are obtained from the Weibo server, the event type of each information text is annotated, 100,000 corresponding information vectors are generated from the information texts, the event type of each information text is determined according to the predetermined mapping between information texts and event types, and the mapping between the information vectors and the corresponding event types is used as sample data.

Divide the sample data into a training set with a first proportion and a verification set with a second proportion, where the first proportion is greater than the second proportion. For example, 80% of the sample data, i.e. 80,000 samples, are randomly selected as the training set, and the remaining 20%, i.e. 20,000 samples, are used as the verification set.

Train the event classification model with the sample data in the training set, and after training, verify the accuracy of the event classification model with the sample data in the verification set. For example, the sample data of the 80,000 users in the training set are input into the LSTM model for training to generate the event classification model, and the sample data of the 20,000 users in the verification set are input into the generated event classification model for accuracy verification.

If the accuracy is greater than a preset value, the training is completed; if the accuracy is less than or equal to the preset value, increase the number of sample data items and return to the step of dividing the sample data into a training set and a verification set. Assume, for example, that the preset value is 98%: if the verification accuracy is greater than 98%, the training is completed; if the accuracy is less than 98%, 20,000 more samples are added, and the step of dividing the sample data into a training set and a verification set is repeated.
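
A condensed sketch of this training loop is given below, using Keras as an assumed framework; the layer sizes, number of event types, epoch count, and sampling callback are illustrative, since the original specifies only an LSTM model, a training/verification split, and an accuracy check against a preset value.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

NUM_EVENT_TYPES = 10  # assumed number of event-type labels
VECTOR_DIM = 40       # assumed information-vector dimension

def build_model() -> keras.Model:
    model = keras.Sequential([
        keras.Input(shape=(1, VECTOR_DIM)),  # one information vector per sample
        layers.LSTM(64),
        layers.Dense(NUM_EVENT_TYPES, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

def train_until_accurate(vectors, labels, preset_value=0.98, get_more_samples=None):
    """Split 80/20, train, verify; if accuracy <= preset value, add samples and repeat."""
    # vectors: array of shape (num_samples, 1, VECTOR_DIM); labels: integer event-type ids
    while True:
        idx = np.random.permutation(len(vectors))
        split = int(0.8 * len(vectors))                # first proportion: 80%
        train_idx, val_idx = idx[:split], idx[split:]  # second proportion: 20%
        model = build_model()
        model.fit(vectors[train_idx], labels[train_idx], epochs=5, verbose=0)
        _, accuracy = model.evaluate(vectors[val_idx], labels[val_idx], verbose=0)
        if accuracy > preset_value or get_more_samples is None:
            return model
        new_vectors, new_labels = get_more_samples()   # e.g. 20,000 extra samples
        vectors = np.concatenate([vectors, new_vectors])
        labels = np.concatenate([labels, new_labels])
```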

With the hotspot event classification method proposed in the above embodiment, the information texts published by users are obtained from the server, word segmentation is performed on the information texts, the feature words among the word segments are extracted, the maximum-probability event topic of each feature word is determined by a predetermined probability algorithm, the hotspot event indicator value of each feature word is calculated with a preset calculation formula, and the information text corresponding to a feature word whose hotspot event indicator value exceeds the preset value is vectorized and input into the event classification model to determine the event type, thereby improving the efficiency of event classification and shortening the analysis time.

In addition, an embodiment of the present application further provides a computer-readable storage medium that includes a hotspot event classification program 10. When the hotspot event classification program 10 is executed by a processor, the following operations are implemented:

an obtaining step: obtaining, in real time, information texts published by a first preset number of users from a predetermined server;

a word segmentation step: segmenting the information texts by using a predetermined word segmentation rule to obtain the word segments corresponding to each information text;

a determining step: extracting the preset feature words from the word segments, and determining the event topic corresponding to each feature word by using a predetermined probability algorithm;

a calculating step: calculating a hotspot event indicator value corresponding to the feature word according to a preset calculation formula;

a classification step: determining whether the hotspot event indicator value is greater than a preset threshold, and if so, acquiring an information vector of the information text corresponding to the feature word in a preset vectorization manner, and inputting the information vector into a pre-trained event classification model to determine the event type corresponding to the information text.

Preferably, the preset calculation formula is as follows:

(Formulas for v and a: presented as images PCTCN2018102083-appb-000005 and PCTCN2018102083-appb-000006 in the original publication.)

where v represents the speed of event development; a represents the hotspot event indicator value; t represents a time point; T represents the time interval; i is an integer; t_i represents the time point at which the i-th feature word appears; and X_i represents the number of times the i-th feature word appears.

Preferably, the predetermined word segmentation rule includes:

splitting each obtained information text into short sentences according to preset types of punctuation marks;

segmenting each short sentence, based on the words stored in the thesaurus, according to the longest-word-first principle.

Preferably, the predetermined probability algorithm includes:

adding a second preset number of implicit event topics between the feature words and the event topics;

determining, according to a predetermined mapping between implicit event topics and feature words, the first number X_1 of feature words contained in each implicit event topic and the second number X_2 of implicit event topics to which each feature word belongs, and determining, from the first number X_1 and the second number X_2, the first selection probability of each feature word for each implicit event topic, P_1 = 1/(X_1*X_2);

determining, according to a predetermined mapping between implicit event topics and event topics, the third number X_3 of implicit event topics contained in each event topic and the fourth number X_4 of event topics to which each implicit event topic belongs, and determining, from the third number X_3 and the fourth number X_4, the second selection probability of each implicit event topic for each event topic, P_2 = 1/(X_3*X_4);

substituting the first selection probability P_1 and the second selection probability P_2 into a predetermined probability calculation formula to compute the distribution of the final probability P_3 of each feature word over the event topics.

Preferably, the predetermined probability calculation formula is as follows:

P_3 = P_1*P_2

where P_1 represents the first selection probability, P_2 represents the second selection probability, and P_3 represents the final probability.

Preferably, the preset vectorization manner includes:

encoding the user information of the information text with an autoencoder to generate a user information vector;

encoding the information text with a predetermined word vector model to generate the text information vector of the information text;

concatenating the user information vector and the text information vector to generate the information vector corresponding to the information text.

Preferably, the event classification model is a long short-term memory (LSTM) network model, and the training steps of the event classification model are as follows:

obtaining a third preset number of information texts, generating the information vector corresponding to each information text, determining the event type corresponding to each information vector according to a predetermined mapping between information texts and event types, and using the mapping data between information vectors and event types as sample data;

dividing the sample data into a training set with a first proportion and a verification set with a second proportion, where the first proportion is greater than the second proportion;

training the event classification model with the sample data in the training set, and after training, verifying the accuracy of the event classification model with the sample data in the verification set;

if the accuracy is greater than a preset value, completing the training; if the accuracy is less than or equal to the preset value, increasing the number of sample data items and returning to the step of dividing the sample data into a training set and a verification set.

The specific implementation of the computer readable storage medium of the present application is substantially the same as that of the hotspot event classification method described above, and is not repeated here.

The serial numbers of the above embodiments of the present application are for description only and do not indicate the relative merits of the embodiments.

Through the description of the above embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or by hardware alone, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium as described above (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods described in the various embodiments of the present application.

The above are only preferred embodiments of the present application and do not thereby limit the scope of the patent; any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present application, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present application.

Claims (20)

1. A hotspot event classification method, applied to an electronic device, the method comprising:
an obtaining step: obtaining, in real time, a first preset number of information texts published by users from a predetermined server;
a word segmentation step: segmenting the information texts by using a predetermined word segmentation rule to obtain the word segments corresponding to each information text;
a determining step: extracting the preset feature words from the word segments, and determining, by using a predetermined probability algorithm, the event topic corresponding to each feature word;
a calculating step: calculating, according to a preset calculation formula, the hot event indicator value corresponding to the feature word;
a classifying step: determining whether the hot event indicator value is greater than a preset threshold; if the hot event indicator value is greater than the preset threshold, obtaining, by using a preset vectorization manner, the information vector of the information text corresponding to the feature word, inputting the information vector into a pre-trained event classification model, and determining the event type corresponding to the information text.

2. The hotspot event classification method according to claim 1, wherein the preset calculation formula is as follows:
[The formulas for v and a appear in the original publication as the images PCTCN2018102083-appb-100001 and PCTCN2018102083-appb-100002.]
where v denotes the speed of event development, a denotes the hot event indicator value, t denotes a time point, T denotes a time interval, i is an integer, ti denotes the time point at which the i-th feature word appears, and Xi denotes the number of times the i-th feature word appears.
3. The hotspot event classification method according to claim 1, wherein the predetermined word segmentation rule comprises:
splitting each obtained information text into short sentences according to punctuation marks of preset types;
segmenting each short sentence, according to the words stored in a lexicon, by using the longest-word-first principle.

4. The hotspot event classification method according to claim 1, wherein the predetermined probability algorithm comprises:
adding a second preset number of implicit event topics between the feature words and the event topic texts;
determining, according to a predetermined mapping relationship between implicit event topics and feature words, a first quantity X1 of feature words contained in each implicit event topic and a second quantity X2 of implicit event topics to which each feature word belongs, and determining, from the first quantity X1 and the second quantity X2, a first selection probability P1 = 1/(X1*X2) of each feature word for each implicit event topic;
determining, according to a predetermined mapping relationship between implicit event topics and event topics, a third quantity X3 of implicit event topics contained in each event topic and a fourth quantity X4 of event topics to which each implicit event topic belongs, and determining, from the third quantity X3 and the fourth quantity X4, a second selection probability P2 = 1/(X3*X4) of each implicit event topic for each event topic;
substituting the first selection probability P1 and the second selection probability P2 into a predetermined probability calculation formula to calculate the distribution of the final probability P3 of each feature word over the event topics.

5. The hotspot event classification method according to claim 4, wherein the predetermined probability calculation formula is as follows:
P3 = P1 * P2
where P1 denotes the first selection probability, P2 denotes the second selection probability, and P3 denotes the final probability.

6. The hotspot event classification method according to claim 1, wherein the preset vectorization manner comprises:
encoding the user information of the information text with an autoencoder to generate a user information vector;
performing word vector encoding on the information text with a predetermined word vector model to generate a text information vector of the information text;
concatenating the user information vector and the text information vector to generate the information vector corresponding to the information text.
7. The hotspot event classification method according to claim 1, wherein the event classification model is a long short-term memory (LSTM) network model, and the event classification model is trained as follows:
obtaining a third preset number of information texts and generating the information vector corresponding to each information text; determining, according to a predetermined mapping relationship between information texts and event types, the event type corresponding to each information vector; and taking the mapping data between information vectors and event types as sample data;
dividing the sample data into a training set of a first proportion and a validation set of a second proportion, wherein the first proportion is greater than the second proportion;
training the event classification model with the sample data in the training set, and, after training, verifying the accuracy of the event classification model with the sample data in the validation set;
if the accuracy is greater than a preset value, the training is complete; if the accuracy is less than or equal to the preset value, increasing the amount of sample data and then returning to the step of dividing the sample data into a training set and a validation set.

8. An electronic device, comprising a memory and a processor, wherein the memory stores a hotspot event classification program which, when executed by the processor, implements the following steps:
an obtaining step: obtaining, in real time, a first preset number of information texts published by users from a predetermined social server;
a word segmentation step: segmenting the information texts by using a predetermined word segmentation rule to obtain the word segments corresponding to each information text;
a determining step: extracting the preset feature words from the word segments, and determining, by using a predetermined probability algorithm, the event topic corresponding to each feature word;
a calculating step: calculating, according to a preset calculation formula, the hot event indicator value corresponding to the feature word;
a classifying step: determining whether the hot event indicator value is greater than a preset threshold; if the hot event indicator value is greater than the preset threshold, obtaining, by using a preset vectorization manner, the information vector of the information text corresponding to the feature word, inputting the information vector into a pre-trained event classification model, and determining the event type corresponding to the information text.

9. The electronic device according to claim 8, wherein the preset calculation formula is as follows:
[The formulas for v and a appear in the original publication as the images PCTCN2018102083-appb-100003 and PCTCN2018102083-appb-100004.]
where v denotes the speed of event development, a denotes the hot event indicator value, t denotes a time point, T denotes a time interval, i is an integer, ti denotes the time point at which the i-th feature word appears, and Xi denotes the number of times the i-th feature word appears.
10. The electronic device according to claim 8, wherein the predetermined word segmentation rule comprises:
splitting each obtained information text into short sentences according to punctuation marks of preset types;
segmenting each short sentence, according to the words stored in a lexicon, by using the longest-word-first principle.

11. The electronic device according to claim 8, wherein the predetermined probability algorithm comprises:
adding a second preset number of implicit event topics between the feature words and the event topic texts;
determining, according to a predetermined mapping relationship between implicit event topics and feature words, a first quantity X1 of feature words contained in each implicit event topic and a second quantity X2 of implicit event topics to which each feature word belongs, and determining, from the first quantity X1 and the second quantity X2, a first selection probability P1 = 1/(X1*X2) of each feature word for each implicit event topic;
determining, according to a predetermined mapping relationship between implicit event topics and event topics, a third quantity X3 of implicit event topics contained in each event topic and a fourth quantity X4 of event topics to which each implicit event topic belongs, and determining, from the third quantity X3 and the fourth quantity X4, a second selection probability P2 = 1/(X3*X4) of each implicit event topic for each event topic;
substituting the first selection probability P1 and the second selection probability P2 into a predetermined probability calculation formula to calculate the distribution of the final probability P3 of each feature word over the event topics.

12. The electronic device according to claim 11, wherein the predetermined probability calculation formula is as follows:
P3 = P1 * P2
where P1 denotes the first selection probability, P2 denotes the second selection probability, and P3 denotes the final probability.

13. The electronic device according to claim 8, wherein the preset vectorization manner comprises:
encoding the user information of the information text with an autoencoder to generate a user information vector;
performing word vector encoding on the information text with a predetermined word vector model to generate a text information vector of the information text;
concatenating the user information vector and the text information vector to generate the information vector corresponding to the information text.
14. The electronic device according to claim 8, wherein the event classification model is a long short-term memory (LSTM) network model, and the event classification model is trained as follows:
obtaining a third preset number of information texts and generating the information vector corresponding to each information text; determining, according to a predetermined mapping relationship between information texts and event types, the event type corresponding to each information vector; and taking the mapping data between information vectors and event types as sample data;
dividing the sample data into a training set of a first proportion and a validation set of a second proportion, wherein the first proportion is greater than the second proportion;
training the event classification model with the sample data in the training set, and, after training, verifying the accuracy of the event classification model with the sample data in the validation set;
if the accuracy is greater than a preset value, the training is complete; if the accuracy is less than or equal to the preset value, increasing the amount of sample data and then returning to the step of dividing the sample data into a training set and a validation set.

15. A computer readable storage medium, wherein the computer readable storage medium stores a hotspot event classification program which, when executed by a processor, implements the following steps:
an obtaining step: obtaining, in real time, a first preset number of information texts published by users from a predetermined social server;
a word segmentation step: segmenting the information texts by using a predetermined word segmentation rule to obtain the word segments corresponding to each information text;
a determining step: extracting the preset feature words from the word segments, and determining, by using a predetermined probability algorithm, the event topic corresponding to each feature word;
a calculating step: calculating, according to a preset calculation formula, the hot event indicator value corresponding to the feature word;
a classifying step: determining whether the hot event indicator value is greater than a preset threshold; if the hot event indicator value is greater than the preset threshold, obtaining, by using a preset vectorization manner, the information vector of the information text corresponding to the feature word, inputting the information vector into a pre-trained event classification model, and determining the event type corresponding to the information text.

16. The computer readable storage medium according to claim 15, wherein the preset calculation formula is as follows:
[The formulas for v and a appear in the original publication as the images PCTCN2018102083-appb-100005 and PCTCN2018102083-appb-100006.]
where v denotes the speed of event development, a denotes the hot event indicator value, t denotes a time point, T denotes a time interval, i is an integer, ti denotes the time point at which the i-th feature word appears, and Xi denotes the number of times the i-th feature word appears.
17. The computer readable storage medium according to claim 15, wherein the predetermined word segmentation rule comprises:
splitting each obtained information text into short sentences according to punctuation marks of preset types;
segmenting each short sentence, according to the words stored in a lexicon, by using the longest-word-first principle.

18. The computer readable storage medium according to claim 15, wherein the predetermined probability algorithm comprises:
adding a second preset number of implicit event topics between the feature words and the event topic texts;
determining, according to a predetermined mapping relationship between implicit event topics and feature words, a first quantity X1 of feature words contained in each implicit event topic and a second quantity X2 of implicit event topics to which each feature word belongs, and determining, from the first quantity X1 and the second quantity X2, a first selection probability P1 = 1/(X1*X2) of each feature word for each implicit event topic;
determining, according to a predetermined mapping relationship between implicit event topics and event topics, a third quantity X3 of implicit event topics contained in each event topic and a fourth quantity X4 of event topics to which each implicit event topic belongs, and determining, from the third quantity X3 and the fourth quantity X4, a second selection probability P2 = 1/(X3*X4) of each implicit event topic for each event topic;
substituting the first selection probability P1 and the second selection probability P2 into a predetermined probability calculation formula to calculate the distribution of the final probability P3 of each feature word over the event topics.

19. The computer readable storage medium according to claim 18, wherein the predetermined probability calculation formula is as follows:
P3 = P1 * P2
where P1 denotes the first selection probability, P2 denotes the second selection probability, and P3 denotes the final probability.
20. The computer readable storage medium according to claim 15, wherein the event classification model is a long short-term memory (LSTM) network model, and the event classification model is trained as follows:
obtaining a third preset number of information texts and generating the information vector corresponding to each information text; determining, according to a predetermined mapping relationship between information texts and event types, the event type corresponding to each information vector; and taking the mapping data between information vectors and event types as sample data;
dividing the sample data into a training set of a first proportion and a validation set of a second proportion, wherein the first proportion is greater than the second proportion;
training the event classification model with the sample data in the training set, and, after training, verifying the accuracy of the event classification model with the sample data in the validation set;
if the accuracy is greater than a preset value, the training is complete; if the accuracy is less than or equal to the preset value, increasing the amount of sample data and then returning to the step of dividing the sample data into a training set and a validation set.
PCT/CN2018/102083 2018-03-26 2018-08-24 Hotspot event classification method and apparatus, and storage medium Ceased WO2019184217A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810252849.6A CN108595519A (en) 2018-03-26 2018-03-26 Focus incident sorting technique, device and storage medium
CN201810252849.6 2018-03-26

Publications (1)

Publication Number Publication Date
WO2019184217A1 (en)

Family

ID=63623682

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/102083 Ceased WO2019184217A1 (en) 2018-03-26 2018-08-24 Hotspot event classification method and apparatus, and storage medium

Country Status (2)

Country Link
CN (1) CN108595519A (en)
WO (1) WO2019184217A1 (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110232149B (en) * 2019-05-09 2022-03-01 北京邮电大学 Hot event detection method and system
CN110414006B (en) * 2019-07-31 2023-09-08 京东方科技集团股份有限公司 Text subject annotation method, device, electronic equipment and storage medium
CN110458296B (en) * 2019-08-02 2023-08-29 腾讯科技(深圳)有限公司 Method and device for marking target event, storage medium and electronic device
CN110956021B (en) * 2019-11-14 2025-01-10 微民保险代理有限公司 A method, device, system and server for generating original articles
CN111078883A (en) * 2019-12-13 2020-04-28 北京明略软件系统有限公司 Risk index analysis method and device, electronic equipment and storage medium
CN111177319B (en) * 2019-12-24 2024-08-27 中国建设银行股份有限公司 Method and device for determining risk event, electronic equipment and storage medium
CN113065329A (en) * 2020-01-02 2021-07-02 广州越秀金融科技有限公司 Data processing method and device
CN111275327B (en) * 2020-01-19 2024-06-07 深圳前海微众银行股份有限公司 Resource allocation method, device, equipment and storage medium
CN111369148A (en) * 2020-03-05 2020-07-03 广州快盈信息技术服务有限公司 Object index monitoring method, electronic device and storage medium
CN112100374B (en) * 2020-08-28 2024-12-27 清华大学 Text clustering method, device, electronic device and storage medium
CN113342979B (en) * 2021-06-24 2023-12-05 中国平安人寿保险股份有限公司 Hot topic identification method, computer device and storage medium
CN113434273B (en) * 2021-06-29 2022-12-23 平安科技(深圳)有限公司 Data processing method, device, system and storage medium
CN113743746B (en) * 2021-08-17 2024-11-19 携程旅游网络技术(上海)有限公司 Model training method, event dispatching processing method, equipment and medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106095928B (en) * 2016-06-12 2019-10-29 国家计算机网络与信息安全管理中心 A kind of event type recognition methods and device
CN107220648B (en) * 2017-04-11 2018-06-22 平安科技(深圳)有限公司 The character identifying method and server of Claims Resolution document
CN107644012B (en) * 2017-08-29 2019-03-01 平安科技(深圳)有限公司 Electronic device, problem identification confirmation method and computer readable storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160071024A1 (en) * 2014-02-25 2016-03-10 Sri International Dynamic hybrid models for multimodal analysis
CN104965867A (en) * 2015-06-08 2015-10-07 南京师范大学 Text event classification method based on CHI feature selection
CN105335476A (en) * 2015-10-08 2016-02-17 北京邮电大学 Method and device for classifying hot event
CN106570164A (en) * 2016-11-07 2017-04-19 中国农业大学 Integrated foodstuff safety text classification method based on deep learning
CN107797983A (en) * 2017-04-07 2018-03-13 平安科技(深圳)有限公司 Microblog data processing method, device, computer equipment and storage medium

Cited By (43)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111222032A (en) * 2019-12-17 2020-06-02 中国平安人寿保险股份有限公司 Public opinion analysis method and related equipment
CN111222032B (en) * 2019-12-17 2024-04-30 中国平安人寿保险股份有限公司 Public opinion analysis method and related equipment
CN111291562A (en) * 2020-01-17 2020-06-16 中国石油集团安全环保技术研究院有限公司 Intelligent Semantic Recognition Method Based on HSE
CN111291562B (en) * 2020-01-17 2024-05-03 中国石油天然气集团有限公司 Intelligent semantic recognition method based on HSE
CN111324811A (en) * 2020-02-20 2020-06-23 北京奇艺世纪科技有限公司 Hot content confirmation method and device
CN111324811B (en) * 2020-02-20 2024-04-12 北京奇艺世纪科技有限公司 Hot content confirmation method and device
CN111274782A (en) * 2020-02-25 2020-06-12 平安科技(深圳)有限公司 Text auditing method and device, computer equipment and readable storage medium
CN111274782B (en) * 2020-02-25 2023-10-20 平安科技(深圳)有限公司 Text auditing method and device, computer equipment and readable storage medium
CN111506727A (en) * 2020-04-16 2020-08-07 腾讯科技(深圳)有限公司 Text content category acquisition method and device, computer equipment and storage medium
CN111506727B (en) * 2020-04-16 2023-10-03 腾讯科技(深圳)有限公司 Text content category acquisition method, apparatus, computer device and storage medium
CN111552790A (en) * 2020-04-27 2020-08-18 北京学之途网络科技有限公司 Method and device for identifying article list brushing
CN111552790B (en) * 2020-04-27 2024-03-08 北京明略昭辉科技有限公司 Method and device for identifying article form
CN111858725A (en) * 2020-04-30 2020-10-30 北京嘀嘀无限科技发展有限公司 Method and system for determining event attributes
CN111967601A (en) * 2020-06-30 2020-11-20 北京百度网讯科技有限公司 Event relation generation method, event relation rule generation method and device
CN111967601B (en) * 2020-06-30 2024-02-20 北京百度网讯科技有限公司 Method for generating event relationships, methods and devices for generating event relationship rules
CN114386394A (en) * 2020-10-16 2022-04-22 电科云(北京)科技有限公司 Prediction model training method, prediction method and prediction device for platform public opinion data theme
CN112135334A (en) * 2020-10-27 2020-12-25 上海连尚网络科技有限公司 A method and device for determining a hotspot type of a wireless access point
CN112667791A (en) * 2020-12-23 2021-04-16 深圳壹账通智能科技有限公司 Latent event prediction method, device, equipment and storage medium
CN112765349A (en) * 2021-01-12 2021-05-07 深圳前海微众银行股份有限公司 Industry classification method, apparatus, system and computer readable storage medium
CN114792096A (en) * 2021-01-26 2022-07-26 腾讯科技(深圳)有限公司 A kind of classification method and device of content publishing subject
CN114792096B (en) * 2021-01-26 2025-07-08 腾讯科技(深圳)有限公司 Content release main body classification method and device
CN112926308A (en) * 2021-02-25 2021-06-08 北京百度网讯科技有限公司 Method, apparatus, device, storage medium and program product for matching text
CN112926308B (en) * 2021-02-25 2024-01-12 北京百度网讯科技有限公司 Methods, devices, equipment, storage media and program products for matching text
CN113127576A (en) * 2021-04-15 2021-07-16 微梦创科网络科技(中国)有限公司 Hotspot discovery method and system based on user content consumption analysis
CN113127576B (en) * 2021-04-15 2024-05-24 微梦创科网络科技(中国)有限公司 Hot spot discovery method and system based on user content consumption analysis
CN113392213B (en) * 2021-04-19 2024-05-31 合肥讯飞数码科技有限公司 Event extraction method, electronic equipment and storage device
CN113392213A (en) * 2021-04-19 2021-09-14 合肥讯飞数码科技有限公司 Event extraction method, electronic device and storage device
CN113220999A (en) * 2021-05-14 2021-08-06 北京百度网讯科技有限公司 User feature generation method and device, electronic equipment and storage medium
CN113822069A (en) * 2021-09-17 2021-12-21 国家计算机网络与信息安全管理中心 Emergency early warning method and device based on meta-knowledge and electronic device
CN113822069B (en) * 2021-09-17 2024-03-12 国家计算机网络与信息安全管理中心 Sudden event early warning method and device based on meta-knowledge and electronic device
CN114492926A (en) * 2021-12-20 2022-05-13 华能煤炭技术研究有限公司 A method and system for text analysis and prediction of coal mine safety hazards
CN114461948B (en) * 2021-12-24 2024-12-10 天翼云科技有限公司 Web cache setting optimization method and electronic device
CN114461948A (en) * 2021-12-24 2022-05-10 天翼云科技有限公司 Web cache setting optimization method and electronic device
CN114297099A (en) * 2021-12-29 2022-04-08 中国电信股份有限公司 Data cache optimization method and device, nonvolatile storage medium and electronic equipment
WO2023125589A1 (en) * 2021-12-29 2023-07-06 北京辰安科技股份有限公司 Emergency monitoring method and apparatus
CN114764440A (en) * 2022-04-15 2022-07-19 中南林业科技大学 Main event duplicate removal method based on graph node selection and optimization
CN114861805A (en) * 2022-05-18 2022-08-05 湖南快乐阳光互动娱乐传媒有限公司 Hot event classification model construction method, hot event classification method and device
CN115409105A (en) * 2022-08-26 2022-11-29 中国人民解放军战略支援部队信息工程大学 Method for constructing user portrait based on Android external storage space file operation
CN116542238A (en) * 2023-07-07 2023-08-04 和元达信息科技有限公司 Event heat trend determining method and system based on small program
CN116542238B (en) * 2023-07-07 2024-03-15 和元达信息科技有限公司 Event heat trend determining method and system based on small program
CN117271857A (en) * 2023-09-22 2023-12-22 中国工商银行股份有限公司 Information display methods, devices, equipment, storage media and program products
CN118041707A (en) * 2024-04-15 2024-05-14 深圳市奇兔软件技术有限公司 Identity verification method based on computer network
CN118474427A (en) * 2024-07-09 2024-08-09 中译文娱科技(青岛)有限公司 Network public opinion detection method and system

Also Published As

Publication number Publication date
CN108595519A (en) 2018-09-28

Similar Documents

Publication Publication Date Title
WO2019184217A1 (en) Hotspot event classification method and apparatus, and storage medium
CN111814465B (en) Machine learning-based information extraction method, device, computer equipment, and medium
US8543375B2 (en) Multi-mode input method editor
CN110457680B (en) Entity disambiguation method, device, computer equipment and storage medium
WO2019214149A1 (en) Text key information identification method, electronic device, and readable storage medium
US8983826B2 (en) Method and system for extracting shadow entities from emails
CN112395391A (en) Concept graph construction method and device, computer equipment and storage medium
CN113094478B (en) Expression reply method, device, equipment and storage medium
CN113127621A (en) Dialogue module pushing method, device, equipment and storage medium
CN110795942B (en) Keyword determination method and device based on semantic recognition and storage medium
CN114416976A (en) Text annotation method, device and electronic equipment
CN111538830B (en) French searching method, device, computer equipment and storage medium
CN115481599A (en) Document processing method and device, electronic equipment and storage medium
CN109753646B (en) Article attribute identification method and electronic equipment
CN110807322B (en) Method, device, server and storage medium for identifying new words based on information entropy
CN117556050B (en) Data classification and classification method and device, electronic equipment and storage medium
CN114138951B (en) Question and answer processing method, device, electronic device and storage medium
CN114842982B (en) Knowledge expression method, device and system for medical information system
CN114780678B (en) Text retrieval method, device, equipment and storage medium
CN117743577A (en) Text classification method, device, electronic device and storage medium
CN111783447B (en) Sensitive word detection method, device and equipment based on ngram distance and storage medium
CN115563515A (en) Text similarity detection method, device and equipment and storage medium
KR20220024251A (en) Method and apparatus for building event library, electronic device, and computer-readable medium
CN114461771A (en) Question and answer method, apparatus, electronic device and readable storage medium
CN115879452A (en) New word discovery method, device, equipment and computer readable storage medium

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 03.02.2021)

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18912789

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 18912789

Country of ref document: EP

Kind code of ref document: A1