HK1213705B - Document classification using multiscale text fingerprints - Google Patents
- Publication number: HK1213705B (HK16101454.9A)
- Authority: HK (Hong Kong)
- Prior art keywords: text, fingerprint, tags, target, electronic document
Description
Background Art
The present invention relates to methods and systems for classifying electronic documents, and in particular to systems and methods for filtering unsolicited electronic communications (spam) and detecting fraudulent online documents.
Unsolicited electronic communications, also known as spam, make up a significant portion of worldwide communication traffic and affect both computer and telephone messaging services. Spam takes many forms, from unsolicited email to spam messages disguised as user comments on Internet sites such as blogs and social networking sites. Spam consumes valuable hardware resources, reduces productivity, and is considered annoying and intrusive by many users of communication services and/or the Internet.
Online fraud, especially in the form of phishing and identity theft, poses a growing threat to Internet users worldwide. Sensitive identity information (e.g., user names, IDs, passwords, national identification numbers, medical records, and bank and credit card details) obtained fraudulently by international criminal networks operating on the Internet is used to withdraw private funds and/or is sold on to third parties. Besides causing direct financial losses to individuals, online fraud has a range of harmful side effects, such as increased security costs for companies, higher retail prices and banking fees, declining stock values, lower wages, and decreased tax revenue.
In an exemplary phishing attempt, a fake website (also called a clone) masquerades as a genuine webpage belonging to an online retailer or a financial institution and asks the user to enter personal information (e.g., a username or password) or financial information (e.g., a credit card number, account number, or security code). Once the unsuspecting user submits the information, the fake website can harvest it. The user may also be directed to another webpage that installs malicious software on the user's computer. Such malware (e.g., viruses, Trojans) can continue to steal personal information by recording the keys pressed by the user while visiting certain webpages, and can turn the user's computer into a platform for launching further phishing and spam attacks.
In the case of email spam or email fraud, software running on a user's or an email service provider's computer system may be used to classify email messages as spam/non-spam (or fraudulent/legitimate), and even to distinguish between kinds of messages, for example product offers, adult content, and Nigerian fraud. Spam/fraudulent messages can then be directed to a special folder or deleted. Similarly, software running on a content provider's computer system may be used to intercept spam/fraudulent messages posted to a website hosted by that content provider, and either prevent the messages from being displayed or warn users of the website that the messages may be fraudulent or spam.
Several approaches have been proposed for identifying spam and/or online fraud, including matching a message's originating address against lists of known offending or trusted addresses (techniques termed blacklisting and whitelisting, respectively), searching for certain words or word patterns (e.g., refinancing, stocks), and analyzing message headers. Feature extraction/matching methods are sometimes used in conjunction with automated data classification methods (e.g., Bayesian filtering, neural networks).
Some proposed methods use hashing to produce compact representations of electronic text messages. Such representations allow efficient inter-message comparisons for spam or fraud detection purposes.
Spammers and online fraudsters attempt to evade detection by using various obfuscation methods, such as misspelling certain words, embedding spam and/or fraudulent content into larger blocks of text disguised as legitimate documents, and changing the form and/or content of messages from one distribution wave to another. Anti-spam and anti-fraud methods that use hashing are typically vulnerable to such obfuscation, because small changes to a text can produce substantially different hashes. Successful detection may therefore benefit from methods and systems capable of recognizing polymorphic spam and fraud.
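The sensitivity of whole-message hashes to small edits can be illustrated with a short Python sketch. The choice of SHA-256 and the two sample strings are assumptions made only for illustration; the description does not name a specific hash function here.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Return the SHA-256 digest of text as a hexadecimal string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Hypothetical messages: the second differs by a single obfuscated character.
original = "Refinance your mortgage today at the lowest rates"
obfuscated = "Ref1nance your mortgage today at the lowest rates"

h1 = sha256_hex(original)
h2 = sha256_hex(obfuscated)

# Count the hex positions at which the two digests happen to agree.
matching = sum(a == b for a, b in zip(h1, h2))
print("digests differ:", h1 != h2)
print(matching, "of", len(h1), "hex digits match")
```

Because the digest of the whole message changes unpredictably, an exact-match lookup against a database of known spam hashes would miss the obfuscated variant.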
Summary of the Invention
According to one aspect, a client computer system comprises at least one processor configured to determine a text fingerprint of a target electronic document such that the length of the text fingerprint is constrained between a lower limit and an upper limit, wherein the lower and upper limits are predetermined. Determining the text fingerprint comprises: selecting a plurality of text tokens of the target electronic document; and, in response to selecting the plurality of text tokens, determining a fingerprint segment size according to the upper and lower limits and according to a count of the selected plurality of text tokens. Determining the text fingerprint further comprises: determining a plurality of fingerprint segments, each fingerprint segment of the plurality being determined according to a hash of a distinct text token of the selected plurality of text tokens, each fingerprint segment consisting of a sequence of characters whose length is chosen to equal the fingerprint segment size; and concatenating the plurality of fingerprint segments to form the text fingerprint.
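The steps of this aspect can be sketched in Python as follows. This is a minimal illustration under several assumptions that are not taken from the description: tokens are whitespace-separated words, each token is hashed with SHA-1, a segment is the leading hexadecimal characters of that hash, and the segment size is chosen as `upper // count`. The names `text_fingerprint`, `lower`, and `upper` are hypothetical.

```python
import hashlib


def text_fingerprint(text: str, lower: int, upper: int) -> str:
    """Build a fingerprint whose length lies between lower and upper."""
    # Assumption: text tokens are whitespace-separated words.
    tokens = text.split()
    if not tokens:
        raise ValueError("no text tokens selected")
    count = len(tokens)

    # Assumption: the segment size is the largest size that keeps
    # count * size <= upper, capped at the SHA-1 hex digest length (40).
    size = min(40, upper // count)
    if size < 1 or count * size < lower:
        raise ValueError("length bounds cannot be met at this token count")

    # One segment per token: the first `size` hex characters of its hash.
    segments = [
        hashlib.sha1(token.encode("utf-8")).hexdigest()[:size]
        for token in tokens
    ]
    return "".join(segments)


fp = text_fingerprint("cheap meds online no prescription needed order now", 20, 30)
fp_variant = text_fingerprint("cheap m3ds online no prescription needed order now", 20, 30)
print(len(fp), fp)
print(len(fp_variant), fp_variant)
```

In this sketch a misspelled token alters only its own segment, so the two fingerprints remain identical everywhere outside that segment. A whole-message hash does not have this property.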
According to another aspect, a server computer system comprises at least one processor configured to perform transactions with a plurality of client systems, wherein a transaction comprises: receiving a text fingerprint from a client system of the plurality of client systems, the text fingerprint determined for a target electronic document such that the length of the text fingerprint is constrained between a lower limit and an upper limit, wherein the lower and upper limits are predetermined; and sending to the client system a target label indicating a document category to which the target electronic document belongs. Determining the text fingerprint comprises: selecting a plurality of text tokens of the target electronic document; and, in response to selecting the plurality of text tokens, determining a fingerprint segment size according to the upper and lower limits and according to a count of the selected plurality of text tokens. Determining the text fingerprint further comprises: determining a plurality of fingerprint segments, each fingerprint segment of the plurality being determined according to a hash of a distinct text token of the selected plurality of text tokens, each fingerprint segment consisting of a sequence of characters whose length is chosen to equal the fingerprint segment size; and concatenating the plurality of fingerprint segments to form the text fingerprint. Determining the target label comprises: retrieving a reference fingerprint from a database of reference fingerprints, the reference fingerprint determined for a reference electronic document belonging to the category, the reference fingerprint selected according to its length such that the length of the reference fingerprint is between the lower limit and the upper limit; and determining whether the target electronic document belongs to the category according to a result of comparing the text fingerprint with the reference fingerprint.
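The server-side determination of the target label can be sketched in Python as follows. This passage does not specify how fingerprints are compared, so the sketch assumes a similarity ratio from the standard library's `difflib.SequenceMatcher` and an arbitrary threshold of 0.75. The function name `classify`, the in-memory list standing in for the reference database, and the sample fingerprints are all hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple


def classify(
    target_fp: str,
    references: List[Tuple[str, str]],  # (category, reference fingerprint)
    lower: int,
    upper: int,
    threshold: float = 0.75,  # assumed similarity cut-off
) -> Optional[str]:
    """Return the category of the first sufficiently similar reference."""
    for category, ref_fp in references:
        # Select only reference fingerprints whose length is within bounds.
        if not (lower <= len(ref_fp) <= upper):
            continue
        # Assumption: similarity is measured by SequenceMatcher's ratio.
        if SequenceMatcher(None, target_fp, ref_fp).ratio() >= threshold:
            return category
    return None


# Hypothetical reference database: one in-range and one out-of-range entry.
reference_db = [("spam", "abcdef0123456789abcd"), ("spam", "ffff")]

print(classify("abcdef0123456789abce", reference_db, 16, 24))  # near match
print(classify("00000000000000000000", reference_db, 16, 24))  # no match
```

Filtering references by length before comparing keeps the comparison between fingerprints of similar size, which is what the shared length bounds make possible.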
According to another aspect, a method comprises employing at least one processor of a client computer system to determine a text fingerprint of a target electronic document such that the length of the text fingerprint is constrained between a lower limit and an upper limit, wherein the lower and upper limits are predetermined. Determining the text fingerprint comprises: selecting a plurality of text tokens of the target electronic document; and, in response to selecting the plurality of text tokens, determining a fingerprint segment size according to the upper and lower limits and according to a count of the selected plurality of text tokens. Determining the text fingerprint further comprises: determining a plurality of fingerprint segments, each fingerprint segment of the plurality being determined according to a hash of a distinct text token of the selected plurality of text tokens, each fingerprint segment consisting of a sequence of characters whose length is chosen to equal the fingerprint segment size; and concatenating the plurality of fingerprint segments to form the text fingerprint.
According to another aspect, a method comprises employing at least one processor of a server computer system, configured to perform transactions with a plurality of client systems, to: receive a text fingerprint from a client system of the plurality of client systems, the text fingerprint determined for a target electronic document such that the length of the text fingerprint is constrained between a lower limit and an upper limit, wherein the lower and upper limits are predetermined; and send to the client system a target label determined for the target electronic document, the target label indicating a document category to which the target electronic document belongs. Determining the text fingerprint comprises: selecting a plurality of text tokens of the target electronic document; and, in response to selecting the plurality of text tokens, determining a fingerprint segment size according to the upper and lower limits and according to a count of the selected plurality of text tokens. Determining the text fingerprint further comprises: determining a plurality of fingerprint segments, each fingerprint segment of the plurality being determined according to a hash of a distinct text token of the selected plurality of text tokens, each fingerprint segment consisting of a sequence of characters whose length is chosen to equal the fingerprint segment size; and concatenating the plurality of fingerprint segments to form the text fingerprint. Determining the target label comprises: retrieving a reference fingerprint from a database of reference fingerprints, the reference fingerprint determined for a reference electronic document belonging to the category, the reference fingerprint selected according to its length such that the length of the reference fingerprint is between the lower limit and the upper limit; and determining whether the target electronic document belongs to the category according to a result of comparing the text fingerprint with the reference fingerprint.
Brief Description of the Drawings
The foregoing aspects and advantages of the present invention will become better understood upon reading the following detailed description and upon reference to the drawings, where:
FIG. 1 shows an exemplary anti-spam/anti-fraud system comprising a security server protecting a plurality of client systems, according to some embodiments of the present invention.
FIG. 2-A shows an exemplary hardware configuration of a client computer system according to some embodiments of the present invention.
FIG. 2-B shows an exemplary hardware configuration of a security server computer system according to some embodiments of the present invention.
FIG. 2-C shows an exemplary hardware configuration of a content server computer system according to some embodiments of the present invention.
FIG. 3-A shows an exemplary spam email message comprising a text block, according to some embodiments of the present invention.
FIG. 3-B shows an exemplary spam blog comment comprising a text block, according to some embodiments of the present invention.
FIG. 3-C illustrates an exemplary fraudulent webpage comprising a plurality of text blocks, according to some embodiments of the present invention.
FIG. 4-A illustrates an exemplary spam/fraud detection transaction between a client computer and the security server, according to some embodiments of the present invention.
FIG. 4-B illustrates an exemplary spam/fraud detection transaction between the content server and the security server, according to some embodiments of the present invention.
FIG. 5 shows an exemplary target indicator of a target electronic document, the indicator comprising a text fingerprint and other spam/fraud-identifying data, according to some embodiments of the present invention.
FIG. 6 shows a diagram of an exemplary set of applications executing on a client system according to some embodiments of the present invention.
FIG. 7 illustrates an exemplary sequence of steps performed by the fingerprint calculator of FIG. 6 according to some embodiments of the present invention.
FIG. 8 shows an exemplary determination of a text fingerprint of a target text block according to some embodiments of the present invention.
FIG. 9 shows a plurality of fingerprints determined for a target text block at various magnification and reduction factors, according to some embodiments of the present invention.
FIG. 10 illustrates an exemplary sequence of steps performed by the fingerprint calculator to determine a reduced fingerprint, according to some embodiments of the present invention.
FIG. 11 shows exemplary applications executing on the security server according to some embodiments of the present invention.
FIG. 12 shows a diagram of an exemplary document classifier executing on the security server according to some embodiments of the present invention.
FIG. 13 shows spam detection rates obtained in a computer experiment comprising analyzing a stream of actual spam messages, the analysis performed according to some embodiments of the present invention, compared with detection rates obtained by conventional methods.
Detailed Description
In the following description, it is understood that all recited connections between structures can be direct operative connections or indirect operative connections through intermediary structures. A set of elements includes one or more elements. Any recitation of an element is understood to refer to at least one element. A plurality of elements includes at least two elements. Unless otherwise required, any described method steps need not necessarily be performed in the particular order illustrated. A first element (e.g., data) derived from a second element encompasses a first element equal to the second element, as well as a first element generated by processing the second element and optionally other data. Making a determination or decision according to a parameter encompasses making the determination or decision according to the parameter and optionally according to other data. Unless otherwise specified, an indicator of some quantity/data may be the quantity/data itself, or an indicator different from the quantity/data itself. Unless otherwise specified, a hash is an output of a hash function. Unless otherwise specified, a hash function is a mathematical transformation mapping a sequence of symbols (e.g., characters, bits) to a number or bit string. Computer-readable media encompass non-transitory media such as magnetic, optical, and semiconductor storage media (e.g., hard drives, optical disks, flash memory, DRAM), as well as communication links such as conductive cables and fiber optic links. According to some embodiments, the present invention provides, inter alia, computer systems comprising hardware (e.g., one or more processors) programmed to perform the methods described herein, as well as computer-readable media encoding instructions to perform the methods described herein.
The following description illustrates embodiments of the invention by way of example and not necessarily by way of limitation.
FIG. 1 shows an exemplary anti-spam/anti-fraud system 10 according to some embodiments of the present invention. System 10 includes a content server 12, a sender system 13, a security server 14, and a plurality of client systems 16a-c, all connected by a communication network 18. Network 18 may be a wide-area network such as the Internet, while parts of network 18 may also include a local area network (LAN).
In some embodiments, content server 12 is configured to receive user-contributed content (e.g., articles, blog entries, media uploads, comments, etc.) from a plurality of users, and to organize, format, and distribute such content to third parties (e.g., client systems 16a-c). An exemplary embodiment of content server 12 is an email server providing electronic message delivery to client systems 16a-c. Another embodiment of content server 12 is a computer system hosting a blog or a social networking site. In some embodiments, user-contributed content circulates over network 18 in the form of electronic documents, also referred to as target documents in the following description. Electronic documents include webpages (e.g., HTML documents) and electronic messages such as email and short message service (SMS) messages, among others. A part of the user-contributed data received at server 12 may comprise unsolicited and/or fraudulent messages and documents.
In some embodiments, sender system 13 comprises a computer system that sends unsolicited communications, such as spam email messages, to client systems 16a-c. Such messages may be received at server 12 and then forwarded to client systems 16a-c. Alternatively, messages received at server 12 may be made available (e.g., through a web interface) for retrieval by client systems 16a-c. In other embodiments, sender system 13 may send unsolicited communications, such as spam blog comments or spam posted to a social networking site, to content server 12. Client systems 16a-c may then retrieve such communications via a protocol such as the Hypertext Transfer Protocol (HTTP).
Security server 14 may include one or more computer systems performing a classification of electronic documents, as shown in detail below. Performing such classification may include identifying unsolicited messages (spam) and/or fraudulent electronic documents such as phishing messages and webpages. In some embodiments, performing the classification includes collaborative spam/fraud detection transactions carried out between security server 14 and content server 12, and/or between security server 14 and client systems 16a-b.
Client systems 16a-c may include end-user computers, each having a processor, memory, and storage, and running an operating system (e.g., Windows XP or Linux). Some client computer systems 16a-c may be mobile computing and/or telecommunication devices such as tablet PCs, mobile telephones, and personal digital assistants (PDAs), as well as household devices such as TVs or music players, among others. In some embodiments, client systems 16a-c may represent individual customers, or several client systems may belong to the same customer. Client systems 16a-c may access electronic documents (e.g., email messages) either by receiving them from sender system 13 and storing them in a local inbox, or by retrieving them over network 18, for instance from a website served by content server 12.
FIG. 2-A shows an exemplary hardware configuration of a client system 16, such as systems 16a-c of FIG. 1. FIG. 2-A shows a computer system for illustrative purposes; the hardware configuration of other devices, such as mobile telephones, may differ. In some embodiments, client system 16 comprises a processor 20, a memory unit 22, a set of input devices 24, a set of output devices 26, a set of storage devices 28, and a communication interface controller 30, all connected by a set of buses 34.
In some embodiments, processor 20 comprises a physical device (e.g., a multi-core integrated circuit) configured to execute computational and/or logical operations with a set of signals and/or data. In some embodiments, such logical operations are delivered to processor 20 in the form of a sequence of processor instructions (e.g., machine code or another type of software). Memory unit 22 may comprise volatile computer-readable media (e.g., RAM) storing data/signals accessed or generated by processor 20 in the course of carrying out instructions. Input devices 24 may include computer keyboards, mice, and microphones, among others, including the respective hardware interfaces and/or adapters that allow a user to introduce data and/or instructions into system 16. Output devices 26 may include display devices such as monitors and speakers, among others, as well as hardware interfaces/adapters such as graphics cards, allowing system 16 to communicate data to a user. In some embodiments, input devices 24 and output devices 26 may share a common piece of hardware, as in the case of touch-screen devices. Storage devices 28 include computer-readable media enabling the non-volatile storage, reading, and writing of software instructions and/or data. Exemplary storage devices 28 include magnetic and optical disks and flash memory devices, as well as removable media such as CD and/or DVD disks and drives. Communication interface controller 30 enables system 16 to connect to network 18 and/or to other devices/computer systems. Buses 34 collectively represent the plurality of system, peripheral, and chipset buses, and/or all other circuitry enabling the inter-communication of devices 20-30 of client system 16. For example, buses 34 may comprise the northbridge connecting processor 20 to memory 22, and/or the southbridge connecting processor 20 to devices 24-30, among others.
FIG. 2-B shows an exemplary hardware configuration of security server 14 according to some embodiments of the present invention. Security server 14 includes a processor 120 and a memory unit 122, and may further comprise a set of storage devices 128 and at least one communication interface controller 130, all interconnected via a set of buses 134. In some embodiments, the operation of processor 120, memory 122, and storage devices 128 may be similar to the operation of items 20, 22, and 28, respectively, as described above in relation to FIG. 2-A. Memory unit 122 stores data/signals accessed or generated by processor 120 in the course of carrying out instructions. Controller 130 enables security server 14 to connect to network 18, and to transmit data to and/or receive data from other systems connected to network 18.
FIG. 2-C shows an exemplary hardware configuration of content server 12 according to some embodiments of the present invention. Content server 12 includes a processor 220 and a memory unit 222, and may further comprise a set of storage devices 228 and at least one communication interface controller 230, all interconnected by a set of buses 234. In some embodiments, the operation of processor 220, memory 222, and storage devices 228 may be similar to the operation of items 20, 22, and 28, respectively, as described above. Memory unit 222 stores data/signals accessed or generated by processor 220 in the course of carrying out instructions. In some embodiments, interface controller 230 enables content server 12 to connect to network 18, and to transmit data to and/or receive data from other systems connected to network 18.
FIG. 3-A shows an exemplary target document 36a comprising a spam email, according to some embodiments of the present invention. Target document 36a may comprise a header and a payload. The header includes message routing data (e.g., an indicator of the sender and/or an indicator of the recipient) and/or other data, such as a timestamp and an indicator of a content type (e.g., a Multipurpose Internet Mail Extensions (MIME) type). The payload may include data displayed to the user as text and/or images. Software executing on content server 12 and/or on client systems 16a-c may process the payload to produce a target text block 38a of target document 36a. In some embodiments, target text block 38a comprises a sequence of signs and/or symbols intended to be interpreted as text. Text block 38a may include special characters such as punctuation, as well as character sequences representing network addresses, uniform resource locators (URLs), email addresses, pseudonyms, and aliases, among others. Target text block 38a may be embedded directly in target document 36a, for instance as a plain-text MIME part, or may comprise the result of processing a set of computer instructions embedded in document 36a. For example, target text block 38a may comprise the result of rendering a set of Hypertext Markup Language (HTML) instructions, or the result of executing a set of client-side or server-side script instructions (e.g., PHP, JavaScript) embedded in target document 36a. In another embodiment, target text block 38a may be embedded in an image, as in the case of image spam.
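For the simplest case mentioned above, a text block embedded as a plain-text MIME part, the extraction can be sketched with Python's standard `email` package. The sample message and the function name `extract_text_block` are hypothetical, and the sketch does not attempt HTML rendering, script execution, or text embedded in images.

```python
from email import message_from_string

# Hypothetical raw email with a header and a plain-text payload.
RAW_MESSAGE = """From: sender@example.com
To: user@example.com
Subject: Special offer
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

Buy now at http://example.com/offer
"""


def extract_text_block(raw: str) -> str:
    """Return the first text/plain part of a raw email, or an empty string."""
    message = message_from_string(raw)
    # walk() visits the message itself and, for multipart mail, every subpart.
    for part in message.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)  # bytes after transfer decoding
            charset = part.get_content_charset() or "utf-8"
            return payload.decode(charset, errors="replace").strip()
    return ""


print(extract_text_block(RAW_MESSAGE))
```

The returned string, including its URL, is the kind of character sequence from which text tokens would subsequently be selected.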
FIG. 3-B shows another exemplary target document 36b, comprising a comment posted on a webpage (e.g., a blog, an online news page, or a social networking page). In some embodiments, document 36b comprises the contents of a set of data fields (e.g., fields of a form embedded in an HTML document). Such form fields may be filled in remotely by a human operator and/or automatically by software executing on sender system 13. In some embodiments, the display of document 36b includes a text block 38b consisting of a sequence of characters and/or symbols intended to be interpreted as text by a user accessing the respective website. Text block 38b may include hyperlinks, special characters, emoticons, and images, among others.
FIG. 3-C illustrates another exemplary target document 36c, comprising a phishing webpage. Document 36c may be delivered as a set of HTML and/or server-side or client-side script instructions which, when executed, cause a document viewer (e.g., a web browser) to produce a set of images and/or a set of text blocks. Two such exemplary text blocks 38c-d are illustrated in FIG. 3-C. Text blocks 38c-d may include hyperlinks and email addresses.
FIG. 4-A shows an exemplary spam/fraud detection transaction between an exemplary client system 16 (e.g., one of client systems 16a-c of FIG. 1) and security server 14, according to some embodiments of the present invention. The exchange illustrated in FIG. 4-A occurs, for example, in an embodiment of system 10 configured to detect email spam. After receiving a target document 36 (e.g., an email message) from content server 12, client system 16 may determine a target indicator 40 of target document 36 and may send target indicator 40 to security server 14. Target indicator 40 comprises data that allows security server 14 to classify target document 36, for example to determine whether document 36 is spam. In response to receiving target indicator 40, security server 14 may send the respective client system 16 a target label 50 indicating whether document 36 is spam.
Another embodiment of a spam detection transaction is illustrated in FIG. 4-B; it occurs between content server 12 and security server 14. Such an exchange may occur, for example, to detect unsolicited communications posted to blogs and/or social networking websites, or to detect phishing webpages. Content server 12, which hosts and/or displays the respective website, may receive target document 36 (e.g., a blog comment). Content server 12 may process the respective communication to produce target indicator 40 of the respective document, and may send target indicator 40 to security server 14. In return, server 14 may determine a target label 50 indicating whether the respective document is spam or fraudulent, and send label 50 to content server 12.
FIG. 5 shows an exemplary target indicator 40 determined for an exemplary target document 36 (e.g., email message 36a of FIG. 3-A). In some embodiments, target indicator 40 is a data structure comprising a message identifier 41 (e.g., a hash index) uniquely associated with target document 36, and a text fingerprint 42 determined for a text block of document 36 (e.g., text block 38a of FIG. 3-A). Target indicator 40 may further comprise a sender indicator 44 identifying the sender of document 36, a routing indicator 46 identifying the network address (e.g., an IP address) from which document 36 was sent, and a timestamp 48 indicating the moment at which document 36 was sent and/or received. In some embodiments, target indicator 40 may include other spam-indicative and/or fraud-indicative features of document 36, for example a flag indicating whether document 36 contains images, a flag indicating whether document 36 contains hyperlinks, and a document layout indicator determined for document 36, among others.
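As an illustration only, target indicator 40 could be modeled as a small record type. The Python sketch below assumes field names and types that do not come from the text; only the reference numerals in the comments do.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class TargetIndicator:
    """Hypothetical container mirroring the fields of target indicator 40."""
    message_id: str                         # item 41, e.g. a hash index
    text_fingerprint: str                   # item 42
    sender: Optional[str] = None            # item 44
    routing_address: Optional[str] = None   # item 46, e.g. an IP address
    timestamp: Optional[float] = None       # item 48, seconds since the epoch (assumed unit)
    has_image: bool = False                 # optional spam/fraud-indicative flag
    has_hyperlink: bool = False             # optional spam/fraud-indicative flag
    layout: Optional[str] = None            # optional document layout indicator


indicator = TargetIndicator(message_id="3f2a", text_fingerprint="QxZr", sender="promo@example.com")
payload = asdict(indicator)  # plain dict, ready to serialize and send to the security server
```

Keeping the optional fields defaulted lets the same record serve both the email scenario and the web-comment scenario, where sender or routing data may be unavailable.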
FIG. 6 shows an exemplary set of components executing on client system 16, according to some embodiments of the present invention. The configuration illustrated in FIG. 6 is suited, for example, to detecting spam email messages received at client system 16. System 16 comprises a document digester 52 and a document display manager 54 connected to document digester 52. Document digester 52 may further comprise a fingerprint calculator 56. In some embodiments, document digester 52 receives target document 36 (e.g., an email message) and processes document 36 to produce target indicator 40. Processing document 36 may include parsing document 36 to identify distinct data fields and/or types, and to distinguish header data from payload data, among others. When document 36 is an email message, exemplary parsing may produce distinct data objects for the sender, IP address, subject, timestamp, and content of the respective message, among others. When the content of document 36 comprises data of multiple MIME types, parsing may produce a distinct data object for each MIME type (e.g., plain text, HTML, and image, among others). Document digester 52 may then formulate target indicator 40, for example by filling in the respective fields of target indicator 40 (e.g., sender, routing address, and timestamp, among others). A software component of client system 16 may then transmit target indicator 40 to security server 14 for analysis.
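The parsing performed by document digester 52 can be sketched with Python's standard `email` package. The function name `digest`, the returned dictionary keys, and the use of SHA-256 for the message identifier are assumptions made for this example; the fingerprint itself is computed in a separate step and is left out here.

```python
import hashlib
from email import message_from_string, policy

SAMPLE = (
    "From: promo@example.com\n"
    "To: user@example.org\n"
    "Subject: Special offer\n"
    "Date: Mon, 01 Jan 2024 10:00:00 +0000\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "\n"
    "Buy cheap meds now!\n"
)


def digest(raw: str) -> dict:
    """Split an email into header fields and a plain-text block (hypothetical helper)."""
    msg = message_from_string(raw, policy=policy.default)
    body = msg.get_body(preferencelist=("plain",))   # plain-text MIME part, if any
    date = msg["Date"]
    return {
        "message_id": hashlib.sha256(raw.encode("utf-8")).hexdigest(),
        "sender": str(msg["From"]) if msg["From"] is not None else None,
        "subject": str(msg["Subject"]) if msg["Subject"] is not None else None,
        "timestamp": date.datetime if date is not None else None,
        "received": [str(h) for h in msg.get_all("Received", [])],  # routing data
        "text": body.get_content() if body is not None else "",
    }


fields = digest(SAMPLE)
```

The `text` entry plays the role of the target text block handed to the fingerprint calculator.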
In some embodiments, document display manager 54 receives target document 36, translates it into a visual form, and displays it on an output device of client system 16. Some embodiments of display manager 54 may also allow a user of client system 16 to interact with the displayed content. Display manager 54 may be integrated with off-the-shelf document display software (e.g., web browsers, email readers, e-book readers, and media players, among others). Such integration may be achieved, for example, in the form of software plugins. Display manager 54 may be configured to assign target document 36 (e.g., an incoming email) to a document category (e.g., spam, legitimate, and/or various other categories and subcategories of documents). Such classification may be determined according to target label 50 received from security server 14. Display manager 54 may be further configured to group spam/fraudulent messages into separate folders and/or to display only legitimate messages to the user. Manager 54 may also label document 36 according to such classification. For example, document display manager 54 may display spam/fraudulent messages in a distinct color, or display next to each spam/fraudulent message a flag indicating the classification of the respective message (e.g., spam, phishing, etc.). Similarly, when document 36 is a fraudulent webpage, display manager 54 may block the user's access to the respective page and/or display a warning to the user.
In an embodiment configured to detect spam/fraud posted as comments on blogs and social networking sites, document digester 52 and display manager 54 may execute on content server 12 instead of on client systems 16a-c as shown in FIG. 6. Such software may be implemented on content server 12 in the form of server-side scripts, which may be further incorporated (e.g., as a plugin) into a larger script package, for example as an anti-spam/anti-fraud plugin for an online publishing platform such as WordPress. Once target document 36 is determined to be spam or fraudulent, display manager 54 may be configured to block the respective message, preventing it from being displayed within the respective website.
Fingerprint calculator 56 (FIG. 6) is configured to determine a text fingerprint of target document 36, which forms part of target indicator 40 (e.g., item 42 in FIG. 5). In some embodiments, a fingerprint determined for a target electronic document comprises a sequence of characters whose length is constrained between a predetermined lower bound and a predetermined upper bound (e.g., between 129 and 256 characters, inclusive). Keeping fingerprints within a predetermined length range may be desirable because it allows efficient comparison with a set of reference fingerprints in order to identify text blocks containing spam and/or fraud, as described in more detail below. In some embodiments, the characters forming the fingerprint may include alphanumeric characters, special characters, and symbols (e.g., *, /, $, etc.), among others. Other exemplary characters used to form text fingerprints include digits or other symbols used to represent numbers in various encodings, for example binary, hexadecimal, and Base64, among others.
FIG. 7 illustrates an exemplary sequence of steps performed by fingerprint calculator 56 to determine a text fingerprint. In a step 402, the fingerprint calculator may select a target text block of target document 36 for fingerprint calculation. In some embodiments, the target text block may consist of substantially all the text content of target document 36 (e.g., the plain-text MIME part of document 36). In some embodiments, the target text block may consist of a single paragraph of the text part of document 36. In an embodiment configured to filter web-based spam, the target text block may consist of the content of a blog comment, or of another kind of message (e.g., a wall post, a tweet, etc.) sent by a user and intended for posting on the respective website. In some embodiments, the target text block comprises the content of a section of an HTML document (e.g., a section delimited by DIV or SPAN tags).
In a step 404, fingerprint calculator 56 may divide the target text block into text tokens. FIG. 8 shows an exemplary segmentation of text block 38 into a plurality of text tokens 60a-c. In some embodiments, a text token is a sequence of characters/symbols separated from other text tokens by any of a set of delimiter characters/symbols. Exemplary delimiters for Western-language scripts include space, line break, tab, '\r', '\0', period, comma, colon, semicolon, parentheses and/or brackets, backward and/or forward slashes, double slashes, mathematical symbols (e.g., '+', '-', '*', '^'), punctuation marks (e.g., '!' and '?'), and special characters (e.g., '$' and '|'), among others. The exemplary tokens in FIG. 8 are individual words; other examples of text tokens include multi-word sequences, email addresses, and URLs, among others. To identify the individual tokens of text block 38, the fingerprint calculator may use any string tokenization algorithm known in the art. Some embodiments of fingerprint calculator 56 may treat certain tokens, such as common English words (e.g., 'a' and 'the'), as ineligible for fingerprint calculation. In some embodiments, tokens exceeding a predetermined maximum length are further divided into shorter tokens.
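Step 404 can be sketched in Python as follows. The delimiter set mirrors the examples listed above; the stop-word list and the maximum token length of 16 characters are hypothetical values chosen for illustration.

```python
import re

# Delimiters named in the text: whitespace (space, tab, line break, '\r'), NUL,
# sentence punctuation, brackets, slashes, math symbols and special characters.
DELIMITERS = re.compile(r"[\s\x00.,:;()\[\]\\/+\-*^!?$|]+")
STOP_WORDS = {"a", "the"}   # tokens treated as ineligible (illustrative list)
MAX_TOKEN_LEN = 16          # assumed maximum token length


def tokenize(text: str) -> list:
    """Split a text block into tokens, dropping stop words and chopping long tokens."""
    tokens = []
    for raw in DELIMITERS.split(text):
        if not raw or raw.lower() in STOP_WORDS:
            continue
        # Tokens longer than the maximum are divided into shorter pieces.
        for start in range(0, len(raw), MAX_TOKEN_LEN):
            tokens.append(raw[start:start + MAX_TOKEN_LEN])
    return tokens


tokens = tokenize("Buy the cheap meds now! Visit example.com/offer")
```

Because periods and slashes are delimiters in this sketch, URLs and email addresses are broken into their components; a tokenizer that keeps them whole would need to recognize them before splitting.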
In some embodiments, the length of the text fingerprint determined by calculator 56 is constrained to a predetermined range (e.g., between 129 and 256 characters, inclusive), regardless of the length or token count of the respective target text block. To calculate such a fingerprint, in a step 406, fingerprint calculator 56 may first determine a count of the text tokens of the target text block and compare the count to a predetermined upper threshold, which is determined according to the upper bound of the fingerprint length. When the token count exceeds the upper threshold (e.g., 256), in a step 408, calculator 56 may determine a reduced fingerprint, as described in detail below.
When the token count is below the upper threshold, in a step 410, the fingerprint calculator may compute a hash of each text token. FIG. 8 shows exemplary hashes 62a-c, determined for text tokens 60a-c, respectively. Hashes 62a-c are shown in hexadecimal notation. In some embodiments, such hashes are the result of applying a hash function to each token 60a-c. Many such hash functions and algorithms are known in the art. Simple hashing algorithms are fast, but typically produce a large number of collisions (situations in which distinct tokens have identical hashes). More sophisticated hashes (e.g., hashes computed with message digest algorithms such as MD5) are said to be collision-free, but are computationally expensive. Some embodiments of the present invention compute hashes 62a-c using a hashing algorithm that offers a trade-off between computation speed and collision avoidance. An example of such an algorithm is attributed to Robert Sedgewick and is known in the art as RSHash. Pseudocode for RSHash is shown below:
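A minimal Python sketch of RSHash as it is commonly published follows. The 32-bit wraparound, the UTF-8 encoding of tokens, and the final mask down to 30 bits are assumptions made for this sketch rather than details taken from the text.

```python
def rs_hash(token: str, bits: int = 30) -> int:
    """RSHash of a token, truncated to `bits` bits (truncation by masking is assumed)."""
    a = 63689
    b = 378551
    h = 0
    for byte in token.encode("utf-8"):
        h = (h * a + byte) & 0xFFFFFFFF   # emulate unsigned 32-bit overflow
        a = (a * b) & 0xFFFFFFFF          # the multiplier changes at every step
    return h & ((1 << bits) - 1)
```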
where a and b denote integers, e.g., a = 63,689 and b = 378,551.
The size (number of bits) of hashes 62a-c may affect the likelihood of collisions, and therefore the spam detection rate. In general, using small hashes increases the likelihood of collisions. Larger hashes are generally less prone to collisions, but are more expensive in terms of computation speed and memory. Some embodiments of fingerprint calculator 56 compute items 62a-c as 30-bit hashes.
Fingerprint calculator 56 may now determine the actual text fingerprint of the target text block. FIG. 8 further illustrates an exemplary fingerprint 42 determined for target text block 38. Text fingerprint 42 comprises a sequence of characters determined according to hashes 62a-c computed in step 410. In some embodiments, for each token 60a-c, fingerprint calculator 56 determines a fingerprint fragment, illustrated as items 64a-c in FIG. 8. In some embodiments, such fragments are then concatenated to produce fingerprint 42.
Each fingerprint fragment 64a-c may comprise a sequence of characters determined according to hash 62a-c of the respective token 60a-c. In some embodiments, all fingerprint fragments 64a-c have the same length: in the example of FIG. 8, each fragment 64a-c consists of two characters. The fragment length is chosen so that the resulting fingerprint has a length within the desired range (e.g., 129 to 256 characters). In some embodiments, the length of the fingerprint fragments is referred to as the magnification factor k. For example, fragments of length 1 are unscaled fragments (magnification factor 1), producing an unscaled fingerprint; fragments of length 2 are 2x magnified fragments (magnification factor 2), producing a 2x magnified fingerprint, and so on. FIG. 9 shows a plurality of text fingerprints 42a-c determined for text block 38 at various magnification factors k.
In a step 412 (FIG. 7), fingerprint calculator 56 determines a value of the magnification factor k that yields a fingerprint length within the desired predetermined range. For example, when the token count is greater than a lower threshold (determined according to the lower bound of the desired fingerprint length), the fingerprint calculator may decide to compute an unscaled fingerprint (k=1), since the unscaled fingerprint is already within the desired length range. When text block 38 has too few tokens, the fingerprint calculator may instead compute, for example, a 2x or 3x magnified fingerprint.
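One way to carry out step 412 is to pick the smallest k for which k characters per token reach the lower bound. The sketch below assumes the 129 to 256 character range used in the examples, one fragment per token, and a cap of k = 5 (a 30-bit hash at six bits per character); the function name is hypothetical.

```python
import math

MIN_LEN, MAX_LEN = 129, 256   # desired fingerprint length range, inclusive
MAX_K = 5                     # assumed cap: a 30-bit hash yields at most 5 Base64 characters


def choose_magnification(token_count: int) -> int:
    """Return the smallest fragment length k such that k * token_count >= MIN_LEN."""
    if token_count <= 0:
        raise ValueError("text block has no eligible tokens")
    if token_count > MAX_LEN:
        raise ValueError("too many tokens: a reduced fingerprint is needed instead")
    k = math.ceil(MIN_LEN / token_count)
    # For very short text blocks even MAX_K cannot reach MIN_LEN;
    # the fingerprint then stays shorter than the desired range.
    return min(k, MAX_K)
```

With these bounds, 200 tokens give k = 1, 65 tokens give k = 2 (130 characters), and 50 tokens give k = 3 (150 characters). Since the smallest qualifying k satisfies k * n <= 128 + n, the result never exceeds 256 characters for n <= 128.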
Next, in a step 414, fingerprint calculator 56 computes a fingerprint fragment for each token according to the hash of the respective token. To determine fragments 64a-c, fingerprint calculator 56 may use any encoding scheme known in the art (e.g., a binary or Base64 representation of hashes 62a-c). Such encoding schemes establish a one-to-one mapping between numbers and sequences of characters from a predetermined alphabet. For example, when a Base64 representation is used, each group of six consecutive bits of the hash may be mapped to one character.
In some embodiments, multiple fingerprint fragments may be determined for each hash by varying the number of characters used to represent the respective hash. To produce a fragment of length 1 (i.e., magnification factor 1), some embodiments use only the six least significant bits of the respective hash. A fragment of length 2 (i.e., magnification factor 2) may be produced using an additional six bits of the respective hash, and so on. In a Base64 representation, a 30-bit hash may thus yield fingerprint fragments up to 5 characters long, corresponding to five magnification factors. Table 1 shows exemplary fingerprint fragments computed at various magnification factors (for the exemplary text block 38 of FIG. 9).
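The mapping from a 30-bit hash to a fragment of k characters can be sketched as below. The standard Base64 alphabet and the ordering (least significant six-bit group first, so that shorter fragments are prefixes of longer ones) are assumptions; the text specifies only that a length-1 fragment uses the six least significant bits.

```python
# Standard Base64 alphabet (RFC 4648); the choice of alphabet is an assumption.
B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def fragment(hash30: int, k: int) -> str:
    """Encode the k lowest six-bit groups of a 30-bit hash as k characters."""
    if not 1 <= k <= 5:
        raise ValueError("a 30-bit hash supports magnification factors 1 to 5")
    chars = []
    for group in range(k):
        six_bits = (hash30 >> (6 * group)) & 0x3F   # group 0 holds the least significant bits
        chars.append(B64[six_bits])
    return "".join(chars)


def fingerprint(hashes, k: int) -> str:
    """Concatenate the length-k fragments of all token hashes."""
    return "".join(fragment(h, k) for h in hashes)
```

For example, `fragment(64, 1)` gives `"A"` and `fragment(64, 2)` gives `"AB"`, since the low six bits of 64 are zero and the next group holds the value 1.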
Table 1
In a step 416 (FIG. 7), fingerprint calculator 56 assembles text fingerprint 42, for example by concatenating the fragments computed in step 414.
Returning to step 406, when the token count is found to exceed the upper threshold, fingerprint calculator 56 determines a reduced fingerprint of the respective text block. In some embodiments, reduction comprises computing fingerprint 42 from only a subset of the tokens of text block 38. Selecting the subset may include pruning the plurality of text tokens determined in step 404 according to a hash selection criterion. An exemplary sequence of steps performing such a calculation is illustrated in FIG. 10. A step 422 selects a reduction factor for the fingerprint calculation. In some embodiments, a reduction factor denoted k indicates that, on average, only 1/k of the tokens of text block 38 are used for fingerprint calculation. Fingerprint calculator 56 may therefore select the reduction factor according to the token count determined in step 406 (FIG. 7). In some embodiments, the initial choice of reduction factor may fail to produce a fingerprint within the desired length range (see below); in such cases, steps 422 to 430 may be executed in a loop, by trial and error, until a fingerprint of suitable length is produced. For example, fingerprint calculator 56 may initially select a reduction factor k=2; when this value fails to produce a sufficiently short fingerprint, calculator 56 may select k=3, and so on.
Next, the fingerprint calculator may select tokens according to the hash selection criterion. When computing a reduced fingerprint, fingerprint calculator 56 may use the tokens already determined in step 404 (FIG. 7), or may compute new tokens from text block 38. In the example illustrated in FIG. 10, in a step 424, fingerprint calculator 56 determines a set of aggregate tokens of text block 38. In some embodiments, an aggregate token (illustrated as item 60d in FIG. 9) is determined by concatenating consecutive individual tokens. The count of tokens used to form an aggregate token may vary according to the reduction factor.
In a step 426, a hash is computed for each aggregate token (e.g., using the method described above). In a step 428, calculator 56 selects a subset of aggregate tokens according to the hash selection criterion. In some embodiments, for a reduction factor k, the selection criterion requires that all hashes determined for members of the selected subset be equal modulo k. For example, to determine a 2x reduced fingerprint, calculator 56 may consider only aggregate tokens whose hashes are equal modulo 2 (i.e., only odd hashes, or only even hashes). In some embodiments, the hash selection criterion comprises selecting only tokens whose hashes are divisible by the reduction factor k.
In a step 430, fingerprint calculator 56 may check whether the count of tokens selected in step 428 is within the desired fingerprint length range. If it is not, calculator 56 may return to step 422 and start over with another reduction factor k. When the count of selected tokens is within range, in a step 432, calculator 56 determines a fingerprint fragment according to the hash of each selected token. In a step 434, such fragments are assembled to produce fingerprint 42. FIG. 9 illustrates several reduced fingerprints 42d-h determined for text block 38. Table 2 shows exemplary fingerprint fragments determined at various reduction factors for the same text block 38 of FIG. 9.
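Steps 422 to 434 can be sketched as a trial-and-error loop. In this Python sketch, several details are assumptions rather than details from the text: aggregate tokens are formed from a sliding window of k consecutive tokens, the selection criterion keeps hashes divisible by k, each selected hash contributes one character taken from its six least significant bits, and the search stops at a hypothetical `max_k`.

```python
MIN_LEN, MAX_LEN = 129, 256
B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def rs_hash(token: str) -> int:
    """30-bit RSHash (32-bit wraparound and masking are assumptions)."""
    a, b, h = 63689, 378551, 0
    for byte in token.encode("utf-8"):
        h = (h * a + byte) & 0xFFFFFFFF
        a = (a * b) & 0xFFFFFFFF
    return h & 0x3FFFFFFF


def reduced_fingerprint(tokens, max_k=64):
    """Try reduction factors k = 2, 3, ... until the fingerprint length is in range.

    Returns (k, fingerprint), or None when no factor up to max_k fits.
    """
    for k in range(2, max_k + 1):                                   # step 422
        aggregates = ["".join(tokens[i:i + k])                      # step 424
                      for i in range(len(tokens) - k + 1)]
        hashes = [rs_hash(agg) for agg in aggregates]               # step 426
        selected = [h for h in hashes if h % k == 0]                # step 428
        if MIN_LEN <= len(selected) <= MAX_LEN:                     # step 430
            # Steps 432-434: one character per selected hash, concatenated.
            return k, "".join(B64[h & 0x3F] for h in selected)
    return None
```

Because selection depends only on the hash of the local token window, inserting or deleting a few words in a long text tends to change only the nearby fingerprint characters, which keeps the edit distance between near-duplicate documents small.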
Table 2
FIG. 11 shows exemplary components executing on security server 14 (see also FIG. 1), according to some embodiments of the present invention. Security server 14 comprises a document classifier 72 connected to a communication manager 74 and to a fingerprint database 70. Communication manager 74 manages spam/fraud detection transactions with client systems 16a-c, as described above in relation to FIGS. 4-A and 4-B. In some embodiments, document classifier 72 is configured to receive target indicator 40 via communication manager 74 and to determine target label 50 indicating a classification of target document 36.
In some embodiments, classifying target document 36 comprises assigning document 36 to a document category according to a comparison between a text fingerprint determined for document 36 and a set of reference fingerprints, each reference fingerprint indicative of a document category. For example, classifying document 36 may include determining whether document 36 is spam and/or fraudulent, and determining which subcategory of spam/fraud (e.g., product offers, phishing, or Nigerian fraud) document 36 belongs to. To classify document 36, document classifier 72 may use any method known in the art in conjunction with fingerprint comparison. Such methods include blacklisting and whitelisting, and pattern-matching algorithms, among others. For example, document classifier 72 may compute a plurality of individual scores, each score indicating membership of document 36 in a particular document category (e.g., spam) and each score determined by a distinct classification method (e.g., fingerprint comparison, blacklisting, etc.). Classifier 72 may then determine the classification of document 36 according to a composite score determined from the individual scores.
Document classifier 72 may further comprise a fingerprint comparator 78 (shown in FIG. 12), configured to classify target document 36 by comparing the fingerprint of the target document with a set of reference fingerprints stored in database 70. In some embodiments, fingerprint database 70 comprises a repository of text fingerprints determined for a collection of reference documents (e.g., email messages, webpages, and website comments, among others). Database 70 may include fingerprints of spam/fraud, as well as fingerprints of legitimate documents. For each reference fingerprint, database 70 may store an indicator of an association between the respective fingerprint and a document category (e.g., spam).
In some embodiments, all fingerprints in a subset of the reference fingerprints of database 70 have lengths within a predetermined range (e.g., between 129 and 256 characters). Moreover, this range coincides with the length range of the target fingerprints determined for target documents by fingerprint calculator 56 (FIG. 6). Such a configuration, in which all reference fingerprints have approximately the same size and a length approximately equal to that of the target fingerprint, may facilitate comparison between target and reference fingerprints for document classification purposes.
For each reference fingerprint, some embodiments of database 70 may store an indicator of the length of the text block for which the respective fingerprint was determined. Examples of such indicators include the string length of the respective text block, the fragment length used in determining the respective fingerprint, and the magnification/reduction factor, among others. Storing an indicator of text block length with each fingerprint may facilitate document comparison by enabling fingerprint comparator 78 to selectively retrieve reference fingerprints representing text blocks similar in length to the text block that produced target fingerprint 42.
To classify target document 36, classifier 72 may receive target indicator 40, extract target fingerprint 42 from indicator 40, and forward fingerprint 42 to fingerprint comparator 78. Comparator 78 may interface with database 70 to selectively retrieve a reference fingerprint 82 for comparison with target fingerprint 42. In some embodiments, fingerprint comparator 78 may preferentially retrieve reference fingerprints computed for text blocks of a length similar to that of the target text block.
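Selective retrieval by text-block length can be sketched with an in-memory list standing in for database 70. The record layout, the use of string length as the stored indicator, and the 25% tolerance are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReferenceFingerprint:
    fingerprint: str
    category: str       # e.g. "spam", "phishing", "legitimate"
    text_length: int    # stored length indicator: string length of the source text block


def candidates(database, target_text_length: int, tolerance: float = 0.25):
    """Return reference fingerprints whose text block length is close to the target's."""
    low = target_text_length * (1 - tolerance)
    high = target_text_length * (1 + tolerance)
    return [ref for ref in database if low <= ref.text_length <= high]


DB = [
    ReferenceFingerprint("QxZr", "spam", 1000),
    ReferenceFingerprint("Ab12", "legitimate", 5000),
    ReferenceFingerprint("zz9/", "phishing", 1200),
]
nearby = candidates(DB, target_text_length=1100)
```

Restricting the comparison to `nearby` avoids computing edit distances against fingerprints of documents whose size makes a match implausible.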
Document classifier 72 further determines the classification of target document 36 according to a comparison between target fingerprint 42 and reference fingerprints retrieved from database 70. In some embodiments, the comparison includes computing a similarity score indicative of a degree of similarity between fingerprints 42 and 82. For example, such a similarity score may be determined as:
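One expression consistent with the definitions that follow is a length-normalized edit distance; normalizing by the longer of the two fingerprints is an assumption:

```latex
S = 1 - \frac{d(f_T, f_R)}{\max\left(|f_T|, |f_R|\right)}
```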
其中fT及fR分别表示目标指纹及参考指纹,d(fT,fR)表示两个指纹之间的编辑距离(例如,莱文斯坦(Levenshtein)距离),且其中|fT|及|fR|分别表示目标指纹及参考指纹的长度。得分S可取0与1之间的任何值,接近1的值指示两个指纹之间的高类似度。在示范性实施例中,当得分S超过预定阈值T(例如,0.9)时,目标指纹42据称匹配于参考指纹82。当目标指纹42匹配于来自数据库70的至少一个参考指纹时,文档分类器72可根据相应参考指纹的文档类别指示符而分类目标文档,且可制定目标标签50以反映所述分类。例如,当目标指纹42匹配于针对垃圾邮件消息所确定的参考指纹时,目标文档36可被分类为垃圾邮件,且目标标签50可指示垃圾邮件分类。Where f T and f R represent the target fingerprint and reference fingerprint, respectively, d(f T , f R ) represents the edit distance (e.g., Levenshtein distance) between the two fingerprints, and where |f T | and |f R | represent the lengths of the target fingerprint and reference fingerprint, respectively. Score S can take any value between 0 and 1, with values close to 1 indicating a high degree of similarity between the two fingerprints. In an exemplary embodiment, when score S exceeds a predetermined threshold T (e.g., 0.9), target fingerprint 42 is said to match reference fingerprint 82. When target fingerprint 42 matches at least one reference fingerprint from database 70, document classifier 72 may classify the target document according to the document category indicator of the corresponding reference fingerprint and may formulate target label 50 to reflect the classification. For example, when target fingerprint 42 matches a reference fingerprint determined for a spam message, target document 36 may be classified as spam, and target label 50 may indicate the spam classification.
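A similarity score of this kind can be sketched as follows. The sketch assumes that the edit distance is normalized by the length of the longer fingerprint, which keeps S between 0 and 1 as stated; that normalization, the function names, and the tuple layout of the reference list are assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance, two rows at a time."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(f_t: str, f_r: str) -> float:
    """Score in [0, 1]; assumes normalization by the longer fingerprint."""
    longest = max(len(f_t), len(f_r))
    if longest == 0:
        return 1.0  # two empty fingerprints are treated as identical
    return 1.0 - levenshtein(f_t, f_r) / longest


def classify(target_fp, references, threshold=0.9):
    """Return the category of the first matching reference, or None."""
    for ref_fp, category in references:
        if similarity(target_fp, ref_fp) > threshold:
            return category
    return None


print(classify("a" * 20, [("a" * 19 + "b", "spam")]))
```

In the usage line, the two fingerprints differ by one substitution over 20 characters, so the score is 0.95 and exceeds the exemplary threshold of 0.9.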
The exemplary systems and methods described above allow for the detection of unsolicited communication (spam) in electronic messaging systems (e.g., email and user-contributed websites), as well as the detection of fraudulent electronic documents (e.g., phishing websites). In some embodiments, a text fingerprint is computed for each target document, the fingerprint comprising a character sequence determined according to a plurality of text tokens of the respective document. The fingerprint is then compared to reference fingerprints determined for a collection of documents that includes spam/fraudulent documents as well as legitimate ones. When the target fingerprint matches a reference fingerprint determined for a spam/fraudulent message, the target communication may be labeled as spam/fraud.
When a target communication is positively identified as spam/fraud, components of the anti-spam/anti-fraud system may modify the display of the respective document. For example, some embodiments may block the display of the respective document (e.g., not allow spam comments to be displayed on a website), may display the respective document in a separate location (e.g., a junk email folder or a separate browser window), and/or may display an alert.
In some embodiments, text tokens may include individual words or word sequences of the target text, as well as email addresses and/or network addresses (e.g., uniform resource locators (URLs)) included in the text part of the target document. Some embodiments of the present invention identify a plurality of such text tokens within the target document. A hash is computed for each token, and a fingerprint segment is determined according to the respective hash. In some embodiments, the fingerprint segments are then combined, for example by concatenation, to produce the text fingerprint of the respective document.
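The token-to-fingerprint pipeline described above can be sketched in a few lines. The sketch makes several assumptions that the description does not fix: tokens are whitespace-delimited and lower-cased, `zlib.crc32` truncated to 30 bits stands in for the hash function, and each segment consists of two characters drawn from the Base64 alphabet. All function names are hypothetical.

```python
import re
import string
import zlib

BASE64_ALPHABET = (string.ascii_uppercase + string.ascii_lowercase
                   + string.digits + "+/")


def tokenize(text):
    """Split on whitespace so that URLs and email addresses stay whole."""
    return re.findall(r"\S+", text.lower())


def token_hash(token):
    """Stand-in hash (assumption): CRC-32 truncated to 30 bits."""
    return zlib.crc32(token.encode("utf-8")) & 0x3FFFFFFF


def segment(hash30, n_chars=2):
    """Map successive 6-bit groups of the hash to Base64 characters."""
    return "".join(BASE64_ALPHABET[(hash30 >> (6 * i)) & 0x3F]
                   for i in range(n_chars))


def fingerprint(text, n_chars=2):
    """Concatenate one segment per token, in document order."""
    return "".join(segment(token_hash(t), n_chars) for t in tokenize(text))


print(fingerprint("Buy cheap pills at http://example.com now"))
```

The example text contains six tokens, so the resulting fingerprint is twelve characters long; any other deterministic hash could be substituted for the stand-in.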
Some electronic documents (e.g., email messages) can vary greatly in length. In some conventional anti-spam/anti-fraud systems, the length of the fingerprints determined for such documents varies accordingly. In contrast, in some embodiments of the present invention, the length of the text fingerprint is constrained to a predetermined range (e.g., between 129 and 256 characters), irrespective of the length of the target text block or document. Keeping all text fingerprints within predetermined length bounds may substantially improve the efficiency of message-to-message comparisons.
To determine fingerprints within the predetermined length range, some embodiments of the present invention use magnification and reduction methods. When a text block is relatively short, magnification is achieved by adjusting the length of the fingerprint segments so as to produce a fingerprint of the desired length. In an exemplary embodiment, every 6 bits of a 30-bit hash can be converted into one character (e.g., using a Base64 representation), so the respective hash can yield a fingerprint segment between 1 and 5 characters long.
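The conversion of a 30-bit hash into a segment of 1 to 5 characters can be sketched as below. Which 6-bit groups are used first, and the exact alphabet, are assumptions (the standard Base64 alphabet and most-significant-group-first order are used here); the function name `hash_to_segment` is hypothetical.

```python
import string

BASE64_ALPHABET = (string.ascii_uppercase + string.ascii_lowercase
                   + string.digits + "+/")


def hash_to_segment(hash30: int, n_chars: int) -> str:
    """Convert a 30-bit hash into a segment of 1 to 5 Base64 characters."""
    if not 1 <= n_chars <= 5:
        raise ValueError("a 30-bit hash yields between 1 and 5 characters")
    if not 0 <= hash30 < 2 ** 30:
        raise ValueError("hash must fit in 30 bits")
    chars = []
    for i in range(n_chars):
        shift = 24 - 6 * i  # most significant 6-bit group first (assumption)
        chars.append(BASE64_ALPHABET[(hash30 >> shift) & 0x3F])
    return "".join(chars)


# The same 40-token text block yields fingerprints of different lengths
# depending on the segment length chosen.
for n in (1, 3, 5):
    print(n, 40 * n)
```

Since every token contributes one segment, a longer segment multiplies the fingerprint length for a given number of tokens, which is how a short text block can be brought up to the desired range.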
For relatively long text blocks, some embodiments of the present invention achieve reduction by computing the fingerprint from a subset of tokens, the subset being selected according to a hash selection criterion. An exemplary hash selection criterion comprises selecting only tokens whose hashes are divisible by an integer k (e.g., 2, 3, or 6). For the given examples, such selection results in the fingerprint being computed from approximately 1/2, 1/3, or 1/6 of the available tokens, respectively. In some embodiments, reduction may further comprise applying such token selection to a plurality of aggregate tokens, wherein each aggregate token comprises a concatenation of several tokens (e.g., a word sequence of the respective electronic document).
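The hash selection criterion can be illustrated with a short sketch. It assumes `zlib.crc32` truncated to 30 bits as a stand-in hash, and it forms aggregate tokens with an overlapping sliding window joined by spaces; both choices, as well as the function names, are assumptions made for illustration only.

```python
import zlib


def token_hash(token: str) -> int:
    """Stand-in hash (assumption): CRC-32 truncated to 30 bits."""
    return zlib.crc32(token.encode("utf-8")) & 0x3FFFFFFF


def select_tokens(tokens, k):
    """Keep only tokens whose hash is divisible by k (about 1/k of them)."""
    return [t for t in tokens if token_hash(t) % k == 0]


def aggregate_tokens(tokens, n):
    """Build aggregate tokens from n consecutive tokens (sliding window)."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


words = [f"word{i}" for i in range(600)]
for k in (2, 3, 6):
    print(k, len(select_tokens(words, k)))
```

Because the selection depends only on each token's hash, the same token is kept or dropped consistently across documents, so two similar long messages tend to retain the same subset of tokens.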
Various hash functions may be used to determine fingerprint segments. In a computer experiment, various hash functions known in the art were applied to a set of 122,000 words extracted from email messages in various languages, with the purpose of determining the number of hash collisions (distinct words producing the same hash) that each hash function produces on real spam. The results, illustrated in Table 3, show that the hash function known in the art as RSHash produced the fewest collisions of all the hash functions tested.
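A collision count of the kind reported in this experiment can be sketched as follows. The RSHash implementation below follows the commonly published form of Robert Sedgewick's string hash, reduced to 32 bits; its constants and the 32-bit width should be treated as assumptions, as should the collision-counting convention (every distinct word beyond the first that maps to a given hash counts as one collision). The word list is a placeholder, not the data set of the experiment.

```python
from collections import defaultdict


def rs_hash(word: str) -> int:
    """RSHash as commonly published, kept to 32 bits (assumption)."""
    b = 378551
    a = 63689
    h = 0
    for byte in word.encode("utf-8"):
        h = (h * a + byte) & 0xFFFFFFFF
        a = (a * b) & 0xFFFFFFFF
    return h


def count_collisions(words, hash_fn):
    """Count distinct words that share a hash with an earlier distinct word."""
    buckets = defaultdict(set)
    for w in set(words):
        buckets[hash_fn(w)].add(w)
    return sum(len(ws) - 1 for ws in buckets.values() if len(ws) > 1)


sample = ["free", "offer", "click", "here", "free"]  # placeholder word list
print(count_collisions(sample, rs_hash))
print(count_collisions(sample, len))  # word length is a deliberately poor hash
```

Passing the hash function as a parameter allows several candidate functions to be compared on the same word list.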
Table 3
In another computer experiment, some embodiments of the present invention were used to analyze a collection of email messages consisting of all email received by a corporate server during one week, including both spam and legitimate messages. To determine text fingerprints between 129 and 256 characters long, 20.8% of the messages required no scaling, 18.5% required 2x reduction, 8.1% required 3x reduction, and 8.7% required 6x reduction. Within the same message collection, 14.8% of the messages required 2x magnification, 9.7% required 4x magnification, and 11.7% required 8x magnification. These results suggest that fingerprint lengths between 129 and 256 characters may be optimal for detecting email spam, because partitioning a real email stream into groups according to magnification and/or reduction factor yields relatively evenly populated groups. Such a situation is advantageous for fingerprint comparison, because all groups can be searched in approximately the same amount of time.
In another computer experiment, a continuous spam stream consisting of approximately 865,000 messages collected over 15 hours was divided into message sets, each set consisting of the messages received during a distinct 10-minute interval. Each message set was analyzed using a document classifier constructed according to some embodiments of the present invention (see, e.g., Figures 11 to 12). For each message set, fingerprint database 70 consisted of fingerprints determined for spam messages belonging to an earlier time interval. Figure 13 shows the spam detection rate obtained using equation [1] and a threshold T = 0.75 (solid line), compared with the spam detection rate obtained on the same message sets using a conventional spam detection method known in the art as fuzzy hashing (dashed line).
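The windowed evaluation described in this experiment can be sketched as follows. The sketch assumes that the reference set for each 10-minute window is the set of fingerprints from the immediately preceding window, that messages arrive as (timestamp in seconds, fingerprint) pairs, and that the matching rule is supplied by the caller; these choices and all names are hypothetical.

```python
from collections import defaultdict


def detection_rates(stream, matches, window_seconds=600):
    """Per-window detection rate, using the preceding window as reference.

    stream  -- iterable of (timestamp_seconds, fingerprint) pairs
    matches -- callable(target_fp, reference_fp) -> bool
    """
    windows = defaultdict(list)
    for timestamp, fp in stream:
        windows[int(timestamp // window_seconds)].append(fp)

    rates = {}
    for idx in sorted(windows):
        references = windows.get(idx - 1, [])
        if not references:
            continue  # nothing to compare against for this window
        current = windows[idx]
        detected = sum(1 for fp in current
                       if any(matches(fp, ref) for ref in references))
        rates[idx] = detected / len(current)
    return rates


stream = [(0, "aaa"), (10, "bbb"), (600, "aaa"), (650, "ccc"), (1200, "ccc")]
print(detection_rates(stream, lambda t, r: t == r))
```

In practice the `matches` callable would apply a similarity score and threshold such as the T = 0.75 used in the experiment; exact equality is used in the usage line only to keep the example small.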
It will be clear to one skilled in the art that the above embodiments may be altered in many ways without departing from the scope of the present invention. Accordingly, the scope of the invention should be determined by the appended claims and their legal equivalents.
Claims (22)
Applications Claiming Priority (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US13/790,636 | 2013-03-08 | ||
| US13/790,636 US8935783B2 (en) | 2013-03-08 | 2013-03-08 | Document classification using multiscale text fingerprints |
| PCT/RO2014/000007 WO2014137233A1 (en) | 2013-03-08 | 2014-02-04 | Document classification using multiscale text fingerprints |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| HK1213705A1 (en) | 2016-07-08 |
| HK1213705B (en) | 2019-10-18 |