CN106168946A

CN106168946A - A method for identifying username abbreviation

Info

Publication number: CN106168946A
Application number: CN201610474472.XA
Authority: CN
Inventors: 亚静; 王玉斌; 柳厅文; 时金桥; 李全刚
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2016-06-24
Filing date: 2016-06-24
Publication date: 2016-11-30

Abstract

The invention provides a method for identifying abbreviation between usernames. The steps are: 1) filter the characters of two or more usernames, keeping only English letters and digits; 2) split each filtered username into several consecutive segments and take the first character of each segment to form a new string; 3) obtain the longest abbreviation length from the new strings, and if this length is greater than or equal to a given threshold ΔL, judge that the usernames exhibit username abbreviation. The retained English letters are uniformly converted to lowercase or uppercase; each segment is a word or a single character; the segments are obtained by segmenting against a specified dictionary; and a dynamic programming algorithm is used to obtain the longest abbreviation length from the new strings.

Description

A Method for Identifying Username Abbreviation

Technical Field

The invention relates to the field of computers, and in particular to a method for identifying username abbreviation.

Background

In recent years the Internet has developed rapidly and reached into every aspect of social life: people watch news and videos on portals such as Sina, Sohu, and Tencent, and exchange information on social networks such as Weibo, Tieba, and online communities. To use these services, people register accounts and choose usernames. A username is a string, chosen by the user at registration and conforming to the site's rules, that identifies the user. It usually consists of English letters, digits, and a few special characters such as the underscore.

When a user registers on a website, usernames on that site must be unique and the user's usual username may already be taken, or the user may avoid the usual username for other reasons such as protecting privacy. To keep the name easy to remember, the user then abbreviates some of the words in the usual username to form a new username and registers it. This is the username abbreviation phenomenon. Identifying abbreviation between usernames matters for Internet research. For example, when mining social network data for user behavior analysis or personalized recommendation, researchers sometimes need to link the same user or similar users across different social networks, and some linking methods rely on identifying abbreviation in usernames to measure how similar two usernames are.

Identifying username abbreviation means: given two usernames, decide whether some fragment of one username is an abbreviation of some fragment of the other. The general approach is enumeration: enumerate a substring s₁ of one username and a substring s₂ of the other, and check whether s₁ is an abbreviated form of s₂. If such s₁ and s₂ exist, the two usernames are considered to exhibit abbreviation, with s₁ abbreviating s₂; otherwise they are considered not to.

However, enumeration has to go through all substrings of the usernames, which is expensive in time and hard to apply to large-scale, real-time username abbreviation identification.

Summary of the Invention

In view of these shortcomings, the invention provides a method for identifying username abbreviation that does not enumerate the substrings of the usernames, which reduces computation and time overhead and identifies username abbreviation more efficiently.

To solve the above technical problem, the invention adopts the following technical solution:

A method for identifying username abbreviation, comprising the steps of:

1) filtering the characters of two or more usernames, keeping only English letters and digits;

2) splitting each filtered username into several consecutive segments, and taking the first character of each segment to form a new string for each username;

3) obtaining the longest abbreviation length from the new strings; if this length is greater than or equal to a given threshold ΔL, the usernames are judged to exhibit username abbreviation.

Further, the retained English letters are uniformly converted to lowercase or uppercase.

Further, each segment is a word or a single character.

Further, the segments are obtained by segmenting against a specified dictionary.

Further, the dictionary includes personal names, place names, names of things, coined words, or other specified words, where the specified words include nouns, verbs, adjectives, and adverbs.

Further, a dynamic programming algorithm is used to obtain the longest abbreviation length from the new strings.

Further, the threshold ΔL is the minimum length of the abbreviated form to be identified.

Further, when pinyin abbreviations of Chinese personal names are to be identified, ΔL ≥ 2.

Further, when abbreviations of English personal names are to be identified, ΔL = 2.

The beneficial effects of the invention are as follows. The method does not need to enumerate all substrings of the usernames and automatically identifies whether abbreviation exists between usernames. Because the usernames are segmented and abbreviated before they are compared, the strings handled during identification are shorter than in the prior art, which reduces computation. Identification itself reduces to checking the value of the longest abbreviation length, which is obtained with a dynamic programming algorithm; compared with enumerating substrings one by one as in the prior art, this avoids a large amount of repeated computation.

Brief Description of the Drawings

FIG. 1 is a flowchart of the method for identifying username abbreviation provided in the embodiment.

Detailed Description

To make the above features and advantages of the invention easier to understand, embodiments are described in detail below with reference to the accompanying drawing.

This embodiment provides a method for identifying username abbreviation, shown in FIG. 1. Given two usernames a and b, the method decides whether username abbreviation exists between a and b through the following steps:

1. Username preprocessing

A username can usually contain English letters, digits, and some special characters such as the underscore. This step removes the special characters, which in this invention means every character other than English letters and digits, so that only English letters and digits remain. The English letters are then uniformly converted to lowercase or uppercase; this embodiment uses lowercase.
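
The preprocessing step can be sketched in a few lines of Python. This is a minimal illustration, not part of the original text: the function name `preprocess` is hypothetical, and "English letters and digits" is assumed to mean ASCII `a-z`, `A-Z`, and `0-9`.

```python
import string

# Assumption: "English letters and digits" means the ASCII ranges only.
_ALLOWED = set(string.ascii_letters + string.digits)


def preprocess(username: str) -> str:
    """Drop every character that is not an ASCII letter or digit,
    then convert the remaining letters to lowercase."""
    return "".join(ch for ch in username if ch in _ALLOWED).lower()
```

An explicit allow-list is used instead of `str.isalnum()` because `isalnum()` also accepts letters and digits outside the ASCII range, which this step is meant to remove.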

2. Segmenting and abbreviating the username

Using a dictionary W of meaningful words supplied by a designator, the username is split into several consecutive segments, each of which is either a word in W or a single character, while keeping the number of segments as small as possible. The designator of W may be the user or someone else. Meaningful words are, for example, personal names, place names, names of things, or other concepts that are meaningful to the designator; they may be nouns, verbs, adjectives, adverbs, or words of other parts of speech, or words coined by the designator.

A preprocessed username can be segmented as follows. For convenience, let the given username be u with length n, and let S_i denote the segmentation of the substring of u that runs from the i-th character to the n-th character. The steps are:

(1) Initialize i = n+1 and S_{n+1} = {}.

(2) Let i = i-1 and examine each word w in the dictionary W in turn, where m is the length of w. If w = u_i u_{i+1} … u_{i+m-1}, that is, w occurs in u starting at the i-th character, and S_i is either unassigned or satisfies |S_{i+m}| + 1 < |S_i|, then set S_i = w ∪ S_{i+m}. If S_i is still unassigned after all words in W have been checked, set S_i = u[i] ∪ S_{i+1}, where u[i] denotes the i-th character of u.

(3) Repeat step (2) until i = 1; S_1 is then the segmentation of the username u.

After segmentation, the first character of each segment is taken, in order, to form a new string that serves as the abbreviated form of the original username.
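
The segmentation steps (1) to (3) and the abbreviation step can be sketched as follows. This is an illustrative sketch under stated assumptions: the names `segment`, `abbreviate`, and `TOY_PINYIN` are hypothetical, the dictionary is a tiny hand-picked sample rather than a real pinyin list, indices are 0-based, and when no dictionary word starts at position i the single character is followed by the segmentation of the suffix starting at the next position.

```python
# Hypothetical toy dictionary W; a real one would hold every pinyin
# string of length >= 2 (or whatever words the designator chooses).
TOY_PINYIN = ("zhang", "zhan", "guo", "gu", "xin", "xi", "xia",
              "dian", "di", "an", "wan", "wa", "te", "er")


def segment(username, dictionary):
    """Split `username` into dictionary words and single characters.

    best[i] holds the segmentation of username[i:], filled in from
    right to left as in steps (1)-(3), using 0-based indices.
    """
    n = len(username)
    best = [None] * (n + 1)
    best[n] = []  # step (1): the empty suffix has an empty segmentation
    for i in range(n - 1, -1, -1):  # steps (2)-(3): i = n-1, ..., 0
        for word in dictionary:
            m = len(word)
            # The word must be non-empty and occur in username at i.
            if word and username.startswith(word, i):
                if best[i] is None or len(best[i + m]) + 1 < len(best[i]):
                    best[i] = [word] + best[i + m]
        if best[i] is None:  # no dictionary word starts at position i
            best[i] = [username[i]] + best[i + 1]
    return best[0]


def abbreviate(segments):
    """Join the first character of every segment."""
    return "".join(seg[0] for seg in segments)
```

With this toy dictionary, `segment("zgxxidian123", TOY_PINYIN)` returns `['z', 'g', 'x', 'xi', 'dian', '1', '2', '3']`, and `abbreviate` of that list gives `zgxxd123`. Because a single character is only tried when no word matches at a position, a word is preferred over a single character whenever both would give the same number of segments.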

3. Computing the longest abbreviation length

Let X_a and X_b be the segmentations of usernames a and b obtained in step 2, and let Y_a and Y_b be their abbreviated forms. The goal is the longest abbreviation length m of a and b. The longest abbreviation length is the length of the longest common part of the two abbreviated forms that satisfies the following condition: for every character of the common part, of the two segments it corresponds to in the two segmentations, one is a single character and the other is a word. A dynamic programming algorithm was designed specifically to obtain m; its formula is:

$m = \underset{1 \leq i \leq |Y_a|,\ 1 \leq j \leq |Y_b|}{\max} f(i, j)$

Here Y_a[i] denotes the i-th character of the string Y_a, Y_b[j] the j-th character of the string Y_b, |X_a[i]| the length of the i-th string in the set X_a, |X_b[j]| the length of the j-th string in the set X_b, |Y_a| the length of the string Y_a, and |Y_b| the length of the string Y_b.
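
The definition of f(i, j) is not reproduced in this text, so the following Python sketch rests on an assumption: f(i, j) is taken to be the length of the longest qualifying common run that ends at Y_a[i] and Y_b[j]. It equals f(i-1, j-1) + 1 when the two characters are equal and exactly one of X_a[i], X_b[j] is a single character, and 0 otherwise. The function name `longest_abbreviation_length` is hypothetical, and "common part" is read as a contiguous run, which agrees with the values of m reported in the two examples of this document.

```python
def longest_abbreviation_length(xa, xb):
    """Return m = max f(i, j) for two segmentations.

    xa, xb: lists of non-empty segments (X_a and X_b). The abbreviated
    forms Y_a and Y_b are the first characters of the segments.
    Assumed recurrence: f(i, j) = f(i-1, j-1) + 1 if the initials match
    and exactly one of the two segments is a single character, else 0.
    """
    ya = [seg[0] for seg in xa]
    yb = [seg[0] for seg in xb]
    prev = [0] * (len(yb) + 1)  # row f(i-1, .), with f(., 0) = 0
    best = 0
    for i in range(1, len(ya) + 1):
        cur = [0] * (len(yb) + 1)
        for j in range(1, len(yb) + 1):
            same_initial = ya[i - 1] == yb[j - 1]
            # True when one segment has length 1 and the other is longer.
            char_vs_word = (len(xa[i - 1]) == 1) != (len(xb[j - 1]) == 1)
            if same_initial and char_vs_word:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best
```

The table has |Y_a| × |Y_b| cells and each cell takes constant time, so the comparison works on the short abbreviated forms and never enumerates substrings of the original usernames.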

4. Identifying username abbreviation

Let the given threshold be ΔL. If m ≥ ΔL, abbreviation exists between usernames a and b; otherwise it does not.

The following two examples apply the method to concrete scenarios to show that it is practical.

Example 1:

This example identifies whether pinyin name abbreviation exists between usernames. A Chinese name has at least two characters, for example Zhang Wei or Shi Xiaoming. Taking Zhang Wei, the name in pinyin is written Zhang Wei or Wei Zhang. Individual pinyin abbreviations vary, but statistically the abbreviation most likely takes the initial letters of the name, giving zw or wz. For Shi Xiaoming, written in pinyin as Shi Xiaoming or Xiaoming Shi, the abbreviation is likely sxm or xms. From this analysis, W can be taken as the set of pinyin strings of length at least 2, with ΔL = 2. Abbreviations of English personal names are handled the same way: the middle name is used less often, so an English name consists of at least a first name and a last name. For example, Sheldon Lee Cooper is commonly written Sheldon Cooper and abbreviated sc, so the minimum length of the abbreviated form to be identified can be set to 2, that is, ΔL = 2.

Given two usernames a = zgxxidian123 and b = zhangguoxin012, this example uses the method to identify whether abbreviation exists between them. Step 2 gives the segmentations X_a = {z, g, x, xi, dian, 1, 2, 3} and X_b = {zhang, guo, xin, 0, 1, 2}, with abbreviated forms Y_a = zgxxd123 and Y_b = zgx012. Step 3 then gives the longest abbreviation length m = 3: the common part is zgx, whose characters correspond to the single characters z, g, x in X_a and to the words zhang, guo, xin in X_b. Since ΔL = 2 from the preceding paragraph, m ≥ ΔL, so abbreviation exists between usernames a and b.

Example 2:

This example also identifies whether pinyin name abbreviation exists between usernames. As in Example 1, W is the set of pinyin strings of length at least 2, and ΔL = 2.

Given two usernames a = wanxia68 and b = wanter_123, step 2 gives the segmentations X_a = {wan, xia, 6, 8} and X_b = {wan, te, r, 1, 2, 3}, with abbreviated forms Y_a = wx68 and Y_b = wtr123. Step 3 then gives the longest abbreviation length m = 0: the only character the two abbreviated forms share, w, corresponds to the word wan in both segmentations, so the condition is not met. Since m < ΔL, no abbreviation exists between usernames a and b.

The method identifies automatically, by algorithm, whether usernames exhibit abbreviation. It does not enumerate all substrings of the usernames as the prior art does, and it is simple and practical. Because the usernames are segmented and abbreviated before they are compared, the strings handled during identification are shorter than in the prior art, which reduces computation. Identification itself reduces to checking the value of the longest abbreviation length, which is obtained with a dynamic programming algorithm; compared with enumerating substrings one by one as in the prior art, this avoids a large amount of repeated computation.

Finally, although the invention has been disclosed above by way of embodiments, these embodiments are not intended to limit it. A person of ordinary skill in the art may modify or substitute them without departing from the spirit and scope of the invention, so the scope of protection of the invention is defined by the claims.

Claims

1. A method for identifying username abbreviation, comprising the steps of:

1) filtering the characters of two or more usernames, keeping only English letters and digits;

2) splitting each filtered username into several consecutive segments, and taking the first character of each segment to form a new string for each username;

3) obtaining the longest abbreviation length from the new strings; if this length is greater than or equal to a given threshold ΔL, judging that username abbreviation exists between the usernames.

2. The method according to claim 1, characterized in that the retained English letters are uniformly converted to lowercase or uppercase.

3. The method according to claim 1, characterized in that each segment is a word or a single character.

4. The method according to claim 1, characterized in that the segments are obtained by segmenting against a specified dictionary.

5. The method according to claim 4, characterized in that the dictionary includes personal names, place names, names of things, coined words, or other specified words, where the specified words include nouns, verbs, adjectives, and adverbs.

6. The method according to claim 1, characterized in that a dynamic programming algorithm is used to obtain the longest abbreviation length from the new strings.

7. The method according to claim 1, characterized in that the threshold ΔL is the minimum length of the abbreviated form to be identified.

8. The method according to claim 7, characterized in that, when pinyin abbreviations of Chinese personal names are to be identified, ΔL ≥ 2.

9. The method according to claim 7, characterized in that, when abbreviations of English personal names are to be identified, ΔL = 2.