CN116484126A - List extraction and visualization in web pages
- Publication number
- CN116484126A (CN202210040984.0A)
- Authority
- CN
- China
- Prior art keywords
- item
- tree
- list
- node
- anchor element
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G: PHYSICS
- G06: COMPUTING OR CALCULATING; COUNTING
- G06F: ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90: Details of database functions independent of the retrieved data types
- G06F16/95: Retrieval from the web
- G06F16/951: Indexing; Web crawling techniques
- G06F16/957: Browsing optimisation, e.g. caching or content distillation
- G06F16/9577: Optimising the visualization of content, e.g. distillation of HTML documents
- G06F16/958: Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Description
Background
A large number of web pages exist on the network, and these web pages contain a wide variety of information. In some scenarios, a network user may need to find web pages of interest on the network in order to obtain desired information. Search engine providers may offer search services to help users find web pages of interest. For example, in response to a search query from a user, a search service may return a search result page to the user, the search result page including information about web pages related to the search query, e.g., web page links, snippets, etc.
Summary
This Summary is provided to introduce a set of concepts that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Embodiments of the present disclosure propose methods, apparatuses, and computer program products for list extraction and visualization in web pages. At least one anchor element group in a target web page may be detected, the at least one anchor element group including a first anchor element group. Boundary detection may be performed on a plurality of anchor elements in the first anchor element group to obtain boundaries of a plurality of items respectively associated with the plurality of anchor elements, the plurality of items corresponding to a first original list in the target web page. Multiple sets of representative metadata respectively corresponding to the plurality of items may be obtained from the target web page by using the boundaries of the plurality of items. The multiple sets of representative metadata may be visualized as a structured list.
It should be noted that the above one or more aspects include the features described in detail below and specifically pointed out in the claims. The following description and the accompanying drawings set forth in detail certain illustrative features of the one or more aspects. These features are merely indicative of the various ways in which the principles of the various aspects may be implemented, and the present disclosure is intended to include all such aspects and their equivalents.
Brief Description of the Drawings
The disclosed aspects will be described below in conjunction with the accompanying drawings, which are provided to illustrate and not to limit the disclosed aspects.
FIG. 1 shows exemplary list web pages.
FIG. 2 shows an exemplary list web page.
FIG. 3 shows an existing exemplary search result page.
FIG. 4 shows an exemplary process of list extraction and visualization in a web page according to an embodiment.
FIG. 5 shows an exemplary process of anchor element group detection according to an embodiment.
FIG. 6 shows exemplary anchor element groups according to an embodiment.
FIG. 7 shows an exemplary process of boundary detection according to an embodiment.
FIGS. 8A to 8F show an example of iterative boundary expansion according to an embodiment.
FIGS. 9A to 9F show an example of iterative boundary expansion according to an embodiment.
FIG. 10 shows an exemplary boundary detection result according to an embodiment.
FIG. 11 shows an exemplary process of dominant list determination according to an embodiment.
FIG. 12 shows an exemplary process of representative metadata obtaining according to an embodiment.
FIG. 13 shows an exemplary search result page according to an embodiment.
FIG. 14 shows a flowchart of an exemplary method for list extraction and visualization in a web page according to an embodiment.
FIG. 15 shows an exemplary apparatus for list extraction and visualization in a web page according to an embodiment.
FIG. 16 shows an exemplary apparatus for list extraction and visualization in a web page according to an embodiment.
Detailed Description
The present disclosure will now be discussed with reference to various exemplary implementations. It should be understood that these implementations are discussed only to enable those skilled in the art to better understand and thereby implement embodiments of the present disclosure, and do not teach any limitation on the scope of the present disclosure.
Existing search services typically extract specific text from an original web page to form a snippet in text form, i.e., a text snippet, and display the text snippet in a search result page, so that a user who sees the text snippet can get a rough idea of what the original web page is about. A large number of list web pages exist on the network, where a list web page may refer to a web page whose main content is a list that includes a plurality of items. For such list web pages, existing search services still merely extract specific text from the list web page to form a text snippet, and the text in the text snippet may have been extracted from only a particular item in the list, e.g., from the first item in the list. Therefore, when a user sees the text snippet in a search result page, the user can learn only limited, partial information about the list web page.
Embodiments of the present disclosure propose performing list extraction and visualization in a web page, so as to extract list content from a target web page and organize the list content into a structured form. Herein, the target web page may be a list web page. Embodiments of the present disclosure may extract list content from the target web page and visualize at least a part of the extracted list content to form a snippet in list form, i.e., a list snippet. Compared with a text snippet, a list snippet can contain richer content about the original list in the target web page, so that a user can learn more comprehensive information about the original list from the list snippet. Since the list snippet is itself a structured list, embodiments of the present disclosure can present information about the original list to the user in a friendlier and more intuitive way. Embodiments of the present disclosure may present key or representative information of the items in the original list in the list snippet, thereby providing the information the user may expect both comprehensively and concisely. When the list snippet is presented in a search result page, the user can conveniently and comprehensively learn the content of the corresponding target web page without clicking the web page link.
The original lists in target web pages processed by embodiments of the present disclosure are not limited to lists having html list tags, but may cover any visually perceivable list. An "original list" as referred to in embodiments of the present disclosure is a visually perceivable list. A visually perceivable list may refer to, e.g., a list that contains a plurality of items having visually similar structures. Herein, an "item" may refer to a component that makes up a list, and may also be called an object, an entity, a data record, etc. A visually perceivable list may or may not have html list tags; thus, a visually perceivable list may have any html tags without restriction. Embodiments of the present disclosure are proposed at least for target web pages that include visually perceivable lists, and may process these target web pages at least from the perspective of visual perception, rather than simply relying on html list tags. Accordingly, embodiments of the present disclosure may be applied to any target web page that contains a visual list.
In one aspect, embodiments of the present disclosure may identify the original lists that a target web page may include at least by detecting anchor element groups in the target web page. The anchor elements in an anchor element group do not necessarily have html list tags. Since an anchor element may carry representative information of an item in an original list, detecting anchor element groups helps to discover the original lists in the target web page.
In one aspect, embodiments of the present disclosure may perform boundary detection on a plurality of anchor elements in an anchor element group, to determine the boundaries of a plurality of items in the original list that corresponds to the anchor element group in the target web page. Herein, determining the boundary of an item may refer to determining which specific elements the item includes; these elements together form the item. Boundary detection may include iterative boundary expansion. For each anchor element, iterative boundary expansion can find the elements that are likely to lie within the same item as the anchor element, so that the anchor element and the found elements define the boundary of the item. Boundary detection may also include a similarity check. The similarity check may be performed to judge whether the items determined by expanding from different anchor elements are indeed items of the same original list, e.g., whether these items indeed form one original list. At least through boundary detection according to embodiments of the present disclosure, the original list in the target web page and the individual items in the original list can be accurately identified.
In one aspect, if the target web page includes two or more original lists, embodiments of the present disclosure may determine a dominant list from these original lists. Herein, a dominant list may refer to, e.g., a list that occupies a dominant position in the web page, presents the main content, etc. Preferably, by determining the dominant list and performing subsequent processing only on the dominant list, embodiments of the present disclosure can include only information about the dominant list in the finally generated structured list, thereby avoiding interference from information about lists that are not the dominant list.
In one aspect, embodiments of the present disclosure may obtain, from the target web page, multiple sets of representative metadata for the different items in an original list. For example, the multiple sets of representative metadata of a plurality of items in the original list may be obtained from the target web page by using at least the boundaries of these items. In some implementations, the multiple sets of representative metadata may be important, representative metadata selected through ranking from initial metadata in the target web page.
In one aspect, embodiments of the present disclosure may visualize the obtained multiple sets of representative metadata to form a structured list. The structured list may serve as, e.g., a list snippet of the target web page.
Embodiments of the present disclosure may be applied in various application scenarios. For example, in a search service, embodiments of the present disclosure may generate a structured list for a target web page, so as to, e.g., build a list snippet for the target web page. Accordingly, the search service may present, in a search result page, the structured list generated according to embodiments of the present disclosure as a list snippet. It should be understood that embodiments of the present disclosure are not limited to search services, but may also be applied in any application scenario that requires list extraction and visualization for a target web page.
The target web pages processed by embodiments of the present disclosure may be various list web pages from various websites, online services, etc. FIG. 1 shows exemplary list web pages. List web page 12 is an exemplary article on the network, which may be located in, e.g., an academic website, an online question-and-answer community, etc. The article introduces the ten major festivals in China, e.g., "Spring Festival", "Mid-Autumn Festival", "Dragon Boat Festival", etc. The parts of the article that concern the introduced festivals form a visually perceivable list 122. For example, the part concerning "Spring Festival", the part concerning "Mid-Autumn Festival", the part concerning "Dragon Boat Festival", etc. respectively form a plurality of items in the list 122.
List web page 14 is a web page from, e.g., a book sales website, a reading community website, etc. Assuming that a plurality of options have been selected in the "Options" column on the left side of web page 14, introduction information of four recommended books that match the selected options is presented on the right side of web page 14. Taking the first book as an example, the introduction information of the book may include, e.g., a cover photo 144, a text introduction 146, etc. The introduction information of the four books forms a visually perceivable list 142. For example, the introduction information of each book forms one item in the list 142.
List web page 16 is a web page for an exemplary topic "Restaurant X" from a review forum, and includes discussion threads of a plurality of users about "Restaurant X". For example, web page 16 includes a plurality of display areas for users Tom, David, Jane, etc., respectively. Taking user Tom as an example, the display area for Tom includes, e.g., Tom's avatar, Tom's name, the posting time of Tom's comment, the specific content of Tom's comment, etc. The display area for Tom, the display area for David, the display area for Jane, etc. form a visually perceivable list 162, and these display areas respectively form a plurality of items in the list 162.
FIG. 2 shows an exemplary list web page 20. Web page 20 may come from, e.g., an online shopping website. Online shopping websites usually generate or provide a large number of web pages that include lists, e.g., best-seller web pages, most-popular-product web pages, product category web pages, web pages containing products searched for by users, etc. Web page 20 may be a web page for presenting, e.g., mobile phones that meet certain conditions. Assuming that a plurality of options have been selected in the "Options" column on the left side of web page 20, introduction information of a plurality of mobile phones matching the selected options is presented on the right side of web page 20. For example, introduction information of the mobile phone "M Phone A4" is presented in area 22, including, e.g., a picture 222 of the phone, the brief description "M Phone A4, 6.5 inches, 256G, black", a 5-star rating, the number of reviews "25900 reviews", the price "5500 RMB", etc. Similarly, introduction information of the mobile phone "M Phone A3", including at least a picture 242 of the phone, is presented in area 24; introduction information of the mobile phone "M Phone A2", including at least a picture 262 of the phone, is presented in area 26; and so on. The introduction information of these mobile phones forms a visually perceivable list 202, where the introduction information of each mobile phone forms one item in the list 202. In addition, web page 20 presents recommendations of related products in area 28, e.g., introduction information of a first related product including at least a picture 282 of the product, introduction information of a second related product including at least a picture 284 of the product, introduction information of a third related product including at least a picture 286 of the product, etc. The introduction information of these related products forms a visually perceivable list 204, where the introduction information of each related product forms one item in the list 204.
It should be understood that embodiments of the present disclosure are not limited to the exemplary list web pages shown in FIG. 1 and FIG. 2, but may cover various other types of list web pages from various other websites, online services, etc., e.g., list web pages from forum websites on topics in various fields, list web pages from product review websites, list web pages from news websites, list web pages from hotel or flight booking websites, and so on.
FIG. 3 shows an existing exemplary search result page 300. The search result page 300 may be presented to a user in a search service provided by a general-purpose search engine provider. Assume that the user has entered the query "M Phone" in the search box 310 to indicate a wish to obtain web page search results about M Phones. The search result area 320 in the search result page 300 includes a plurality of web page search results. For example, the search result for web page 20 in FIG. 2 is shown in area 330. As shown in area 330, the search result for web page 20 includes the text snippet "M Phone A4, 6.5 inches, 256G, black, 5 stars, 25900 reviews, 5500 RMB". This text snippet is generated using only the introduction information about "M Phone A4" in web page 20. Based on the text snippet in area 330, the user can learn only limited information about web page 20, e.g., only information about the mobile phone "M Phone A4", and cannot learn any information about the other mobile phones in the list 202 in web page 20. Moreover, such a text snippet is neither intuitive nor easy to read.
FIG. 4 shows an exemplary process 400 of list extraction and visualization in a web page according to an embodiment. Process 400 may be performed to implement list extraction and visualization for an original list in a target web page 402, to generate a structured list 404. The target web page 402 may be a list web page that contains a list, and the list in the target web page 402 may be referred to herein as an original list. If the target web page 402 includes two or more lists, process 400 may generate the structured list 404 for the dominant list in the target web page 402.
At 410, at least one anchor element group in the target web page 402 may be detected. Each anchor element group may include one or more anchor elements, and each anchor element group may correspond to one possible original list. For example, if the target web page 402 includes two or more original lists, two or more anchor element groups respectively corresponding to these original lists may be detected at 410. In one implementation, a plurality of anchor elements in the target web page 402 may first be identified, and the plurality of anchor elements may then be clustered into at least one anchor element group.
At 420, for each anchor element group, boundary detection may be performed on the plurality of anchor elements in the anchor element group to obtain the boundaries of a plurality of items respectively associated with the plurality of anchor elements. These items may form the original list in the target web page 402 that corresponds to the anchor element group. Boundary detection may include, e.g., iterative boundary expansion, a similarity check, etc., so as to accurately identify at least one original list in the target web page 402 and the individual items in each original list.
At 430, optionally, if the previous steps determine that the target web page 402 includes two or more original lists, a dominant list may be determined from these original lists. In one implementation, the dominant list may be determined by using at least visual features of these original lists.
At 440, for an original list in the target web page 402, multiple sets of representative metadata respectively corresponding to a plurality of items in the original list may be obtained from the target web page 402 by using at least the boundaries of the plurality of items. Optionally, the representative metadata obtaining at 440 may be performed for the dominant list in the target web page 402. Herein, representative metadata may refer to data that is contained in the original list and is to be presented in the structured list 404, e.g., images, text, etc. In one implementation, for each item, a set of initial metadata may first be obtained from the target web page 402, and a set of representative metadata that corresponds to the item and is to be presented in the structured list 404 may then be selected from the set of initial metadata.
At 450, the multiple sets of representative metadata obtained at 440 may be visualized as the structured list 404. In one implementation, the structured list 404 may be formed from the multiple sets of representative metadata according to a predetermined format or layout. The structured list 404 is a simplified version of the original list in the target web page 402, but still contains enough information for a user to intuitively and comprehensively understand the main content of the original list. The structured list 404 may serve as, e.g., a list snippet of the original list.
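As a minimal sketch of the visualization at 450, and not the layout used by any embodiment, the following Python function renders sets of representative metadata as a plain-text structured list. The field names `title`, `rating`, and `price`, the `max_items` limit, and the one-line-per-item layout are all assumptions introduced for illustration.

```python
def render_structured_list(items, max_items=3):
    """Render one numbered line per item from its representative metadata.

    `items` is a list of dicts; the keys used here are hypothetical.
    """
    lines = []
    for number, item in enumerate(items[:max_items], start=1):
        # Keep only the fields that are present, in a fixed (assumed) order.
        parts = [item[key] for key in ("title", "rating", "price") if item.get(key)]
        lines.append(f"{number}. " + " | ".join(parts))
    return "\n".join(lines)
```

A predetermined layout of this kind keeps every item in the same shape, which is what lets a reader compare items at a glance.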
It should be understood that all the steps in process 400 and their order are exemplary, and embodiments of the present disclosure also cover modifications made to process 400 in any manner. For example, although process 400 includes the step of dominant list determination at 430, this step may be omitted when the target web page 402 includes only one original list. For example, although FIG. 4 shows step 430 as being performed before step 440, step 430 may also be performed after step 440. In that case, multiple sets of representative metadata of each original list may first be obtained through step 440, and then, after the dominant list is determined through step 430, only the multiple sets of representative metadata of the dominant list may be provided to step 450.
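The flow of process 400 can be sketched in Python as below. This is an illustrative skeleton only: each step is passed in as a callable, and the function and parameter names are hypothetical. The sketch follows the ordering shown in FIG. 4 (430 before 440) and skips dominant list determination when only one original list is found.

```python
from typing import Callable, List, Sequence


def extract_structured_list(
    page: str,
    detect_groups: Callable[[str], List[list]],      # step 410
    detect_boundaries: Callable[[str, list], list],  # step 420
    pick_dominant: Callable[[Sequence[list]], int],  # step 430 (optional)
    get_metadata: Callable[[str, list], list],       # step 440
    visualize: Callable[[list], str],                # step 450
) -> str:
    # 410: one anchor element group per candidate original list.
    groups = detect_groups(page)
    # 420: each group yields the items (boundaries) of one original list.
    original_lists = [detect_boundaries(page, group) for group in groups]
    if not original_lists:
        return visualize([])
    # 430: only needed when the page holds two or more original lists.
    index = pick_dominant(original_lists) if len(original_lists) > 1 else 0
    # 440: representative metadata for the items of the chosen list.
    metadata = get_metadata(page, original_lists[index])
    # 450: render the structured list.
    return visualize(metadata)
```

Passing the steps as callables makes the alternative ordering easy to express: a caller could instead run the step 440 callable on every original list and apply the step 430 callable afterwards.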
FIG. 5 shows an exemplary process 500 of anchor element group detection according to an embodiment. Process 500 is an exemplary implementation of step 410 in FIG. 4.
At 510, a plurality of anchor elements may be identified in the target web page 502, for example from the html source file of the target web page 502. Each item in an original list may include multiple html elements, and the anchor element may be the html element among them that is most representative and most helpful for identifying the whole item. An anchor element identified at 510 is also referred to as an identified anchor element. In one implementation, anchor element constraints may be defined in advance, and the html elements in the target web page 502 that satisfy the anchor element constraints may be taken as the identified anchor elements. For example, the anchor element constraints may include at least one of: the html element has an image tag; the html element has a heading tag; the html element represents a date; and so on. In one case, each item in the original list may have a corresponding image, so html elements with an image tag in the html source file, e.g., an <img> tag, can serve as anchor elements to help identify the corresponding items. In one case, each item in the original list may have a corresponding heading, so html elements with a heading tag, e.g., an <h1> tag or an <h2> tag, can serve as anchor elements to help identify the corresponding items. In one case, each item in the original list may have a date, e.g., a post date, so html elements containing a string that represents a date can serve as anchor elements to help identify the corresponding items. In this case, the strings representing dates in the html source file may be identified through various techniques, e.g., regular-expression matching. It should be understood that embodiments of the present disclosure are not limited to the above exemplary anchor element constraints, and may cover any other type of anchor element constraint.
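The three exemplary constraints at 510 can be sketched as follows. This is a minimal illustration using Python's standard `html.parser`, not the disclosed implementation; the names `AnchorFinder` and `find_anchors`, the set of heading tags beyond <h1> and <h2>, and the ISO-like date pattern are all assumptions made for the example.

```python
import re
from html.parser import HTMLParser

# Assumed constraint details: the text only names <img>, <h1>, <h2> "etc."
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}  # tags with no end tag
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")  # assumed date format, e.g. 2021-10-05


class AnchorFinder(HTMLParser):
    """Collects (constraint, tag, detail) tuples for elements meeting a constraint."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._open_tags = []  # stack of currently open (non-void) tags

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.anchors.append(("image", tag, dict(attrs).get("src")))
        elif tag in HEADING_TAGS:
            self.anchors.append(("heading", tag, None))
        if tag not in VOID_TAGS:
            self._open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self._open_tags:
            # Pop up to and including the matching open tag.
            while self._open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        found = DATE_RE.search(data)
        if found and self._open_tags:
            # The innermost open element is the one that "represents a date".
            self.anchors.append(("date", self._open_tags[-1], found.group()))


def find_anchors(html_source):
    finder = AnchorFinder()
    finder.feed(html_source)
    finder.close()
    return finder.anchors
```

Each constraint produces its own stream of candidates; in practice the candidates found under one constraint would be passed on to property extraction and clustering separately from those found under another.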
At 520, a property set may be extracted from the target web page 502 for each of the identified anchor elements identified at 510. The property set of an identified anchor element may include one or more intrinsic properties of that element, e.g., its html tag attribute, its Cascading Style Sheets (CSS) classes, its XML Path Language (XPath) information, etc. The html tag attribute may indicate the html tag type of the identified anchor element. The CSS classes may indicate which CSS classes the identified anchor element has. The XPath information may indicate location information, node information, etc. of the identified anchor element, and may be obtained, for example, from the Document Object Model (DOM) tree corresponding to the html source file. It should be understood that embodiments of the present disclosure are not limited to the above exemplary properties of identified anchor elements, and may cover any other type of property. Through step 520, a plurality of property sets respectively corresponding to the identified anchor elements can be obtained.
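The property extraction at 520 can be illustrated with a short sketch. It assumes well-formed markup so that the standard `xml.etree.ElementTree` parser can stand in for a real DOM; the function name `property_sets` and the dictionary keys are hypothetical.

```python
import xml.etree.ElementTree as ET


def property_sets(root, wanted_tag):
    """Return one property set (tag, CSS classes, XPath) per element with wanted_tag."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parent_of = {child: parent for parent in root.iter() for child in parent}

    def xpath(node):
        steps = []
        while node is not None:
            parent = parent_of.get(node)
            if parent is None:
                steps.append(node.tag)  # root step carries no position
            else:
                same_tag = [c for c in parent if c.tag == node.tag]
                steps.append(f"{node.tag}[{same_tag.index(node) + 1}]")
            node = parent
        return "/" + "/".join(reversed(steps))

    return [
        {
            "tag": element.tag,
            "css_classes": frozenset(element.get("class", "").split()),
            "xpath": xpath(element),
        }
        for element in root.iter(wanted_tag)
    ]
```

For instance, parsing a small list with `ET.fromstring(...)` and calling `property_sets(root, "img")` yields one dictionary per image, whose `xpath` values differ only in the positional index of the enclosing list item.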
At 530, the identified anchor elements may be clustered into at least one anchor element group 504 based on their property sets. Each identified anchor element may be characterized by its property set, and the property sets of the identified anchor elements may be provided as input to a pre-trained clustering model. The clustering model is trained to cluster anchor elements into at least one anchor element group based on their property sets. For example, identified anchor elements with similar properties will be clustered into the same anchor element group. Each anchor element group includes a plurality of anchor elements with similar properties and may correspond to one possible original list, with the anchor elements respectively associated with different items in that possible original list.
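The grouping at 530 relies on a pre-trained clustering model whose details are not given here. Purely to illustrate what "similar properties" could mean, the sketch below swaps in a much simpler stand-in: exact-key grouping on the tag, the CSS classes, and the XPath with positional indices stripped. The function name `cluster_anchors`, the key choice, and the `min_group_size` parameter are assumptions, not the disclosed model.

```python
import re
from collections import defaultdict


def cluster_anchors(property_sets, min_group_size=2):
    """Group anchor indices whose property sets agree (a stand-in for the trained model)."""
    groups = defaultdict(list)
    for index, props in enumerate(property_sets):
        # /div/ul[1]/li[2]/img[1] -> /div/ul/li/img, so list siblings share a key.
        generic_path = re.sub(r"\[\d+\]", "", props["xpath"])
        key = (props["tag"], frozenset(props["css_classes"]), generic_path)
        groups[key].append(index)
    # A group with a single anchor cannot correspond to a list of several items.
    return [members for members in groups.values() if len(members) >= min_group_size]
```

A learned model would tolerate small differences (an extra class on one item, say) that this exact-match stand-in treats as a different group.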
It should be understood that all steps in process 500 are exemplary, and embodiments of the present disclosure also cover any modification of process 500. For example, process 500 may employ any combination of anchor element constraints, property sets containing any combination of properties, and so on.
FIG. 6 shows exemplary anchor element groups according to an embodiment. In FIG. 6, it is assumed that a first anchor element group and a second anchor element group are detected for the web page 20 in FIG. 2. The first anchor element group may include a plurality of anchor elements corresponding to image 222, image 242 and image 262 in the original list 202 in FIG. 2, where image 222, image 242 and image 262 may be clustered into the first anchor element group because they have similar properties. The second anchor element group may include a plurality of anchor elements corresponding to image 282, image 284, image 286, etc. in the original list 204 in FIG. 2, where image 282, image 284 and image 286 may be clustered into the second anchor element group because they have similar properties.
It should be understood that although FIG. 6 shows anchor element groups detected under the anchor element constraint "the html element has an image tag", embodiments of the present disclosure may also detect anchor element groups under other types of anchor element constraints. For example, for the web page 12 in FIG. 1, an anchor element group formed by the heading "Spring Festival", the heading "Mid-Autumn Festival", the heading "Dragon Boat Festival", etc. can be detected under the constraint "the html element has a heading tag". For example, for the web page 16 in FIG. 1, an anchor element group formed by the date "2021-10-05" in Tom's display area, the date "2021-10-05" in David's display area, the date "2021-10-06" in Jane's display area, etc. can be detected under the constraint "the html element represents a date".
FIG. 7 shows an exemplary process 700 for boundary detection according to an embodiment. Process 700 is an exemplary implementation of step 420 in FIG. 4. Process 700 may be used to perform boundary detection on an exemplary anchor element group 702 to obtain the boundaries of the items respectively associated with the anchor elements in the anchor element group 702, thereby identifying the original list 704 in the target web page that corresponds to the anchor element group 702, as well as each item in the original list 704. Process 700 may be performed based at least on the DOM tree corresponding to the target web page.
At 710, iterative boundary expansion may be performed on each anchor element in the anchor element group 702 to find elements that may lie within the same item as that anchor element. In one implementation, based on the DOM tree corresponding to the target web page, iterative boundary expansion may be performed synchronously with the anchor elements in the anchor element group 702 as respective starting points. Each anchor element serves as a starting point, and through iterative boundary expansion, a number of other elements can be determined and expanded to in turn, starting from that anchor element in the DOM tree. The anchor element together with the other determined elements forms a tree; this tree represents an item and is therefore also referred to as an item tree. The nodes of an item tree respectively correspond to elements, e.g., the anchor element and the elements determined through iterative boundary expansion. Each iteration step expands to a next node, and that next node is included in the item tree. The sequence of iteration steps forms a corresponding expansion path. Through the iterative boundary expansion at 710, a plurality of item trees respectively originating from the anchor elements in the anchor element group 702 can be obtained. These item trees respectively define the boundaries of a plurality of items.
Iterative boundary expansion may include several types of expansion, e.g., sibling node expansion, parent node expansion, etc. Sibling node expansion may be performed to expand from the current node to a sibling node of the current node in the DOM tree corresponding to the target web page. In one case, if the current node has multiple sibling nodes under the same parent node, expansion may proceed from the current node to those sibling nodes in order from nearest to farthest. In one case, sibling node expansion may follow a predetermined expansion direction, e.g., expanding rightward, expanding leftward, alternating between rightward and leftward, switching to leftward after several rightward expansions or once a predetermined condition is met, switching to rightward after several leftward expansions or once a predetermined condition is met, and so on. Parent node expansion may be performed, after all sibling nodes of the current node have been included in the same item tree, to expand to the parent node of the current node and include that parent node in the item tree. After expanding to the parent node, sibling node expansion may continue from the parent node, e.g., expanding to the sibling nodes of the parent node. In this way, expansion can proceed iteratively toward higher-level nodes. In addition, if a node is included in an item tree through iterative boundary expansion and that node has subordinate nodes of its own, e.g., child nodes, grandchild nodes, etc., all of its subordinate nodes may also be included in the item tree.
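The order in which a single item tree absorbs nodes can be sketched as a generator. This is an illustration under stated assumptions only: the `Node` class and `expansion_path` are hypothetical, the alternating right/left order is just one of the directions listed above, and subordinate nodes are treated as implicitly included with the node that owns them.

```python
class Node:
    """Minimal DOM-like node; children get their parent pointer set on construction."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self


def expansion_path(anchor):
    """Yield nodes in the order one item tree would expand to them."""
    current = anchor
    while current.parent is not None:
        siblings = current.parent.children
        position = siblings.index(current)
        left = siblings[:position][::-1]   # nearest left sibling first
        right = siblings[position + 1:]    # nearest right sibling first
        # Sibling expansion, near to far, alternating right then left.
        for offset in range(max(len(left), len(right))):
            if offset < len(right):
                yield right[offset]
            if offset < len(left):
                yield left[offset]
        # Parent expansion once every sibling has been included.
        current = current.parent
        yield current
```

Run without any stopping rule, the path climbs all the way to the document root; the overlap and similarity checks described next are what cut it off at the item boundary.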
Iterative boundary expansion may be performed synchronously across the different item trees corresponding to different anchor elements. For example, in each iteration step, one sibling node expansion or one parent node expansion is performed synchronously in all of these item trees. In one case, e.g., during a sibling node expansion step, if an item tree S currently has no sibling node left to expand to while other item trees do, the expansion of item tree S may be paused for the current step while the sibling node expansion of the current step is performed on the other item trees.
According to embodiments of the present disclosure, the boundary of each item may be expanded as far as possible, e.g., so that each item includes as many elements as possible through iterative boundary expansion. However, different items should not overlap in content, e.g., the same element or content should not be included in different items. In addition, different items should have similar structures, e.g., different items should share at least a predetermined proportion of similar elements or nodes.
Content overlap between two items may arise from node overlap between the two item trees corresponding to those items, e.g., a node shared by both item trees. Content overlap between different items can therefore be avoided by detecting node overlap during the iterative boundary expansion at 710.
At 720, it may be determined whether the iteration of the current step in the iterative boundary expansion causes node overlap between at least two item trees, e.g., whether the iteration of the current step causes the same node or nodes to be included in at least two item trees at once. The node overlap determination at 720 may be performed synchronously with the iterative boundary expansion at 710, e.g., node overlap is checked after every iteration step.
If it is determined at 720 that the iteration of the current step does not cause node overlap, process 700 may return to 710 and continue the iterative boundary expansion.
If it is determined at 720 that the iteration of the current step causes node overlap, the iterative boundary expansion is stopped at 730, and the nodes determined through the iteration of the current step are excluded from each item tree. For example, each item tree is rolled back or reset to its state at the iteration of the step preceding the current step.
By performing step 720 and step 730, embodiments of the present disclosure can prevent the resulting item trees from having node overlap, and thus prevent different items from having content overlap.
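Steps 710 to 730 can be sketched as a lockstep loop with rollback. The sketch assumes each item tree's expansion path has already been flattened into a list of steps, where each step is the set of node identifiers it would add (the expanded node plus its subordinate nodes); `expand_in_lockstep` and `has_overlap` are hypothetical names.

```python
def has_overlap(trees):
    """True if any node identifier appears in more than one item tree."""
    seen = set()
    for tree in trees:
        if seen & tree:
            return True
        seen |= tree
    return False


def expand_in_lockstep(anchors, paths):
    """Expand all item trees one step at a time; discard the step that causes overlap.

    anchors: one node id per item tree (must be non-empty).
    paths:   per item tree, a list of sets of node ids added at each step.
    Returns the item trees (as sets of node ids) and the number of steps kept.
    """
    trees = [{anchor} for anchor in anchors]
    completed = 0
    longest = max(len(path) for path in paths)
    while completed < longest:
        # A tree whose path is exhausted simply pauses for this step.
        added = [path[completed] if completed < len(path) else set() for path in paths]
        candidate = [tree | new for tree, new in zip(trees, added)]
        if has_overlap(candidate):
            break  # step 730: keep the state from before the current step
        trees = candidate
        completed += 1
    return trees, completed
```

Building the candidate state separately from `trees` makes the rollback trivial: on overlap the candidate is dropped and the previous state is returned unchanged.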
To determine whether different items have similar structures, process 700 may perform a similarity check on the item trees. In one case, the similarity check may be performed in response to determining that the number of nodes in at least one of the item trees exceeds a node number threshold. In one aspect, for example, the at least one item tree may be a predetermined number or a predetermined proportion of the item trees, so that performing the similarity check may require that the number of nodes in each item tree within that predetermined number or proportion of item trees exceeds the node number threshold. In another aspect, for example, performing the similarity check may require that the iterative boundary expansion at 710 has run for a predetermined number of iteration steps, i.e., each item already contains a predetermined number of elements or each item tree already contains a predetermined number of nodes. In one case, the similarity check may be performed synchronously with the iterative boundary expansion at 710, e.g., after every iteration step. In one case, the similarity check may be performed each time a predetermined number of iteration steps has been executed, e.g., each time a predetermined number of elements has been added to each item or a predetermined number of nodes has been added to each item tree. Embodiments of the present disclosure are not limited to the above exemplary timings for performing the similarity check.
At 740, the tree similarity between any two of the item trees may be calculated. Embodiments of the present disclosure are not limited to any particular technique for calculating tree similarity. Preferably, embodiments of the present disclosure propose a tree similarity calculation method obtained by improving the existing simple tree matching algorithm; the proposed method uses at least a weight based on CSS similarity and/or a minimum depth level to calculate tree similarity.
In one implementation, embodiments of the present disclosure may calculate tree similarity using at least a matching weight computed from the CSS similarity between the root nodes of two item trees. In a web page, the style presented through CSS is important information about the page layout. Therefore, computing a matching weight from the CSS similarity between the root nodes of two item trees, and using that matching weight when calculating the tree similarity between the two item trees, can effectively improve the accuracy of the tree similarity calculation. The matching weight may be computed from, e.g., the respective CSS classes of the two root nodes.
In one implementation, embodiments of the present disclosure may calculate tree similarity using the nodes of the two item trees that lie within a minimum depth level. Herein, the minimum depth level may be defined such that the number of visible nodes within the minimum depth level of an item tree reaches a predetermined proportion, e.g., 80% or any other proportion, of the number of all visible nodes in that item tree. Equivalently, the minimum depth level may be defined such that the number of visible nodes within any level shallower than the minimum depth level does not reach the predetermined proportion of the number of all visible nodes in that item tree. Herein, a visible node refers to a node that is visually visible in the web page, e.g., a node presenting an image, a node presenting text, etc.; compared with other nodes, visible nodes therefore matter more for determining the structural similarity between item trees. An item tree may have multiple levels: for example, if the root node of the item tree is at the level of depth 0, the child nodes of the root node are at the level of depth 1, and so on. The deeper a level is, the less it contributes to judging the structural similarity between two trees. Embodiments of the present disclosure therefore propose calculating tree similarity using only some of the levels of an item tree rather than all of them, which can effectively improve computational efficiency and save computing resources. Tree similarity may be calculated using the minimum depth level and the levels shallower than it. For example, if the minimum depth level is 3, the levels of depth 0, 1, 2 and 3 may be used to calculate tree similarity. Since the minimum depth level is determined with at least the number of visible nodes taken into account, e.g., the number of visible nodes within the minimum depth level should be no lower than the predetermined proportion of the number of all visible nodes in the item tree, an appropriately chosen predetermined proportion ensures that an accurate tree similarity can still be calculated even though the levels deeper than the minimum depth level are ignored. The predetermined proportion may take any value set in advance according to the needs of the actual application. It should be understood that although the above discusses calculating tree similarity using the nodes of the two item trees within the minimum depth level, embodiments of the present disclosure are not limited thereto, and may alternatively use the nodes at all levels of the two item trees.
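The definition of the minimum depth level can be made concrete with a few lines of code. The sketch assumes the visible nodes of an item tree have already been counted per depth; `min_depth_level` and its parameters are illustrative names, and the 0.8 default mirrors the 80% example in the text.

```python
def min_depth_level(visible_per_depth, ratio=0.8):
    """Smallest depth d such that depths 0..d hold at least `ratio` of all visible nodes.

    visible_per_depth[d] is the number of visible nodes at depth d (non-empty list).
    """
    total = sum(visible_per_depth)
    running = 0
    for depth, count in enumerate(visible_per_depth):
        running += count
        if running >= ratio * total:
            return depth
    return len(visible_per_depth) - 1  # only reachable through rounding effects
```

With visible-node counts of 0, 1, 3, 4 and 2 at depths 0 to 4, the running totals are 0, 1, 4, 8 and 10, so a ratio of 0.8 is first reached at depth 3, matching the example in which levels 0 to 3 are used.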
Suppose T and T' are two item trees. Root(T) denotes the root node of tree T, and Root(T') denotes the root node of tree T'. It should be understood that if T and T' have no actual root nodes, a virtual root node may be set for each of T and T', and the two virtual root nodes may have the same property configuration. For each of T and T', L_0, L_1, ..., L_n respectively denote the sets of subtrees at depths 0, 1, ..., n. L_i1, L_i2, ..., L_ik respectively denote the k subtrees at depth i, i.e., the subtrees in the subtree set L_i.
Suppose css_1 denotes the set of CSS classes of Root(T), where Root(T) may have 0, 1, or any other number of CSS classes. Suppose css_2 denotes the set of CSS classes of Root(T'), where Root(T') may have 0, 1, or any other number of CSS classes. In one implementation, the matching weight between T and T' may be calculated using the Jaccard coefficient. For example, the matching weight between T and T' may be calculated as:
where MatchWeight(·) is the function that calculates the matching weight, |css_1| denotes the number of CSS classes contained in css_1, |css_2| denotes the number of CSS classes contained in css_2, and |css_1∩css_2| denotes the number of CSS classes contained in both css_1 and css_2.
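Assuming the standard Jaccard coefficient and using only the three quantities defined above, the matching weight can be written as shown below. This is a reconstruction from those definitions, not a quotation; in particular, how the case of two roots with no CSS classes at all (a zero denominator) is handled is not stated in the text.

```latex
\mathrm{MatchWeight}(T, T')
  = \frac{|css_1 \cap css_2|}{|css_1 \cup css_2|}
  = \frac{|css_1 \cap css_2|}{|css_1| + |css_2| - |css_1 \cap css_2|}
```

For example, if Root(T) has the classes {card, big} and Root(T') has {card}, the weight would be 1 / (2 + 1 - 1) = 0.5.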
The tree similarity between T and T' may be calculated based on, for example, the procedure in Table 1 below.
Table 1
At step 1.1, the minimum depth level MinDepth may be determined. At step 1.2, a similarity check function SimilarityCheck(·) is defined for calculating the similarity measure of T and T' at the current level layer; this function may include the processing in the subsequent steps 1.3 to 1.18. At steps 1.3 and 1.4, if Root(T) and Root(T') are determined to have different html tags, SimilarityCheck(·) evaluates to 0. At steps 1.5 and 1.6, if the current level layer is determined to be greater than the minimum depth level MinDepth, SimilarityCheck(·) evaluates to 0, which avoids calculating tree similarity at levels beyond the minimum depth level. At step 1.8, m denotes the subtrees in the subtree set L_layer+1 of T corresponding to depth layer+1. At step 1.9, n denotes the subtrees in the subtree set L'_layer+1 of T' corresponding to depth layer+1. The procedure of Table 1 defines a similarity function M[i,j], which represents the maximum similarity between the first i subtrees of T and the first j subtrees of T'. At steps 1.10 and 1.11, M[i,0] and M[0,j] are each initialized to 0. Step 1.12 specifies a traversal of the subtrees in the subtree set m of T. Step 1.13 specifies a traversal of the subtrees in the subtree set n of T'. At steps 1.14 and 1.15, the similarity may be calculated with the similarity function M[i,j], for example using techniques such as dynamic programming. M[i,j] takes the best similarity among three candidates: M[i,j-1], M[i-1,j], and M[i-1,j-1]+W[i,j]. W[i,j] recursively calculates the similarity, at level layer+1, between the i-th subtree T_i of T and the j-th subtree T'_j of T'. In this way the whole tree structure is taken into account, not just the root nodes. At step 1.18, the result of SimilarityCheck(·) may be returned, expressed as MatchWeight(Root(T), Root(T'))*(M[m,n]+1), where M[m,n] is the best similarity of the subtrees of T and T', and the 1 accounts for the root node. It should be understood that the result returned at step 1.18 may indicate, e.g., the number of similar nodes. At step 1.19, the final tree similarity may further be calculated using all nodes whose depth is not greater than the minimum depth level, where TreeSimilarity(·) is the tree similarity function, |T| is the number of nodes in T whose depth is not greater than the minimum depth level, and |T'| is the number of nodes in T' whose depth is not greater than the minimum depth level. It should be understood that all steps in Table 1 are exemplary, and embodiments of the present disclosure also cover any modification of these steps.
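The walkthrough of steps 1.2 to 1.19 can be turned into a runnable sketch. Everything below is an illustration built from that walkthrough, not a reproduction of Table 1: the `Node` class is hypothetical, `m` and `n` are used as the numbers of child subtrees, the matching weight is assumed to be a plain Jaccard coefficient that returns 1 when neither root has a CSS class, and the final normalisation by the mean node count is an assumption, since the formula of step 1.19 is not given in the text.

```python
class Node:
    def __init__(self, tag, css=(), children=()):
        self.tag = tag
        self.css = frozenset(css)
        self.children = list(children)


def match_weight(a, b):
    union = a.css | b.css
    # Assumption: two roots without any CSS class count as a full match.
    return len(a.css & b.css) / len(union) if union else 1.0


def similarity_check(t1, t2, layer, min_depth):
    if t1.tag != t2.tag:      # steps 1.3-1.4: different html tags
        return 0.0
    if layer > min_depth:     # steps 1.5-1.6: deeper than MinDepth
        return 0.0
    m, n = len(t1.children), len(t2.children)
    # M[i][j]: best similarity between the first i subtrees of t1 and first j of t2.
    M = [[0.0] * (n + 1) for _ in range(m + 1)]   # steps 1.10-1.11
    for i in range(1, m + 1):                     # step 1.12
        for j in range(1, n + 1):                 # step 1.13
            w = similarity_check(t1.children[i - 1], t2.children[j - 1],
                                 layer + 1, min_depth)
            M[i][j] = max(M[i][j - 1], M[i - 1][j], M[i - 1][j - 1] + w)
    return match_weight(t1, t2) * (M[m][n] + 1)   # step 1.18


def count_nodes(tree, min_depth, layer=0):
    """Number of nodes whose depth is not greater than min_depth."""
    if layer > min_depth:
        return 0
    return 1 + sum(count_nodes(c, min_depth, layer + 1) for c in tree.children)


def tree_similarity(t1, t2, min_depth):
    matched = similarity_check(t1, t2, 0, min_depth)
    # Assumed normalisation: matched node count over the mean of |T| and |T'|.
    mean_size = (count_nodes(t1, min_depth) + count_nodes(t2, min_depth)) / 2
    return matched / mean_size
```

Two structurally identical items with the same root classes score 1.0 under this normalisation, while an item with a missing child or an extra root class scores lower.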
After the tree similarity between any two of the item trees has been calculated through step 740, the item trees may be divided into at least one tree set at 745 using at least a similarity threshold. Each tree set may include at least one item tree, and the item trees within the same tree set have tree similarities with one another that are not lower than the similarity threshold. Through step 745, item trees that are highly similar to one another are placed in the same tree set. At 750, it may be determined whether the number of item trees in the tree set containing the most item trees is below a tree number threshold. The tree set containing the most item trees serves as the target tree set for judging whether the iteration should stop. The tree number threshold may have a preset value, which may be used, e.g., to ensure that most of the item trees are included in the target tree set. If it is determined at 750 that the number of item trees in the target tree set is not below the tree number threshold, process 700 may return to 710 and continue the iterative boundary expansion. If it is determined at 750 that the number of item trees in the target tree set is below the tree number threshold, the iterative boundary expansion may be stopped at 760, and the nodes determined through the iterations of a predetermined number of previous steps are excluded from each of the item trees. In one implementation, the predetermined number of previous steps may be determined as follows: it is chosen such that, after the nodes determined through those iterations have been excluded from the item trees, the number of item trees in the target tree set obtained for the updated item trees, e.g., via the processing of steps 740 and 745, is not below the tree number threshold.
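Steps 745 and 750 can be sketched given a precomputed pairwise similarity matrix. The text does not say how the tree sets are formed beyond the pairwise threshold requirement, so the greedy first-fit assignment below is an assumption, as are the names `partition_trees` and `should_stop`.

```python
def partition_trees(similarity, threshold):
    """Greedily split tree indices into sets whose members are pairwise similar.

    similarity[i][j] is the tree similarity between item trees i and j.
    """
    tree_sets = []
    for i in range(len(similarity)):
        for members in tree_sets:
            # Join the first set in which tree i is similar to every member.
            if all(similarity[i][j] >= threshold for j in members):
                members.append(i)
                break
        else:
            tree_sets.append([i])
    return tree_sets


def should_stop(similarity, similarity_threshold, tree_count_threshold):
    """Step 750: stop when the largest (target) tree set is too small."""
    tree_sets = partition_trees(similarity, similarity_threshold)
    target = max(tree_sets, key=len)
    return len(target) < tree_count_threshold
```

When `should_stop` returns True, the caller would undo recent iteration steps one at a time and re-run the check until the target tree set is large enough again, which is one way to find the predetermined number of previous steps described at 760.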
By performing step 740, step 745, step 750 and step 760, embodiments of the present disclosure can carry out a similarity check on the item trees. The similarity check helps ensure that the resulting item trees are structurally similar. For example, when the item trees in the target tree set are provided as the result of the iterative expansion, those item trees are highly similar to one another, which ensures that different items have similar structures.
According to process 700, further iterative boundary expansion may optionally be performed at 770 in an attempt to find better item trees. In one implementation, after step 730 has been performed, if the iteration of the current step is determined to be a sibling node expansion, further iterative boundary expansion may be performed on the item trees in the direction opposite to that of the iteration of the current step. For example, for each item tree, if the iteration of the current step expanded rightward to a sibling node, an expansion leftward to a different sibling node may be attempted. In another implementation, after step 760 has been performed, the item trees may first be reset to their state at the iteration of a predetermined previous step. The predetermined previous step may be determined, for example, from the predetermined number of previous steps involved at 760. For example, if the predetermined number of previous steps is the previous 2 steps, the predetermined previous step may be the third previous step. Then, if the iteration of the step following the predetermined previous step is determined to be a sibling node expansion, further iterative boundary expansion of the item trees may be attempted in the direction opposite to that of the iteration of that following step. For example, for each item tree, if the item tree is reset to its state at the iteration of the third previous step, and the iteration of the second previous step expanded rightward to a sibling node, an expansion leftward to a different sibling node may be attempted instead.
It should be understood that, although not shown, process 700 may also include performing steps 720 to 730 and/or steps 740 to 760 for the further iterative boundary expansion at 770, so as to ensure that the resulting item trees have no node overlap and have similar structures, and thus that different items have no content overlap and have similar structures.
Through process 700, multiple item trees can be obtained that originate respectively from the multiple anchor elements in anchor element group 702. These item trees respectively define the boundaries of the corresponding items, so that the original list 704 formed by these items can finally be identified.
It should be understood that all steps in process 700 are exemplary, and embodiments of the present disclosure also cover any modification of process 700. For example, the processing for determining whether node overlap occurs and the processing for performing the similarity check are both optional: process 700 may include either or both of them, or may omit either or both of them. In addition, for example, process 700 may provide only the item trees in the target tree set as the iterative expansion result and use those item trees to form the original list 704.
FIGS. 8A to 8F illustrate an example of iterative boundary expansion according to an embodiment. They show an exemplary process of iterative boundary expansion in an exemplary DOM tree corresponding to a target web page.
The DOM tree may include multiple nodes, such as nodes 801 to 826 and other nodes not shown. Symbols such as "Div", "A", "Span", "P" and "Img" displayed in the boxes representing nodes indicate the HTML tags of the corresponding nodes. In addition, embodiments of the present disclosure propose assigning a "Text" tag to text strings that appear in the HTML source file, even though no such HTML tag exists, so that these text strings can also serve as nodes in the DOM tree, for example node 819 and node 820. These text strings may be visible elements that can be rendered, so assigning them a Text tag and corresponding nodes helps determine item boundaries more accurately. It should be understood that, for purposes of explanation, only a few exemplary node tags are shown in FIGS. 8A to 8F; any other types of node tags may exist in practice, and embodiments of the present disclosure are not limited by which specific tags the nodes in the DOM tree have. Furthermore, in FIGS. 8A to 8F, nodes included in an item tree through iterative boundary expansion are highlighted with shading, and the expansion paths are indicated with arrows.
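The idea of giving text strings their own "Text" nodes can be sketched with Python's standard `html.parser` module. This is only an illustration under stated assumptions: the `TreeBuilder` class, the dictionary node layout, the `VOID_TAGS` set and the sample markup are all hypothetical and are not taken from the disclosure.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (a small, incomplete, assumed subset).
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}


class TreeBuilder(HTMLParser):
    """Builds a simple node tree in which non-empty text becomes a 'Text' node."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "Root", "children": []}
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag.capitalize(), "children": []}
        self.stack[-1]["children"].append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)  # later nodes become children of this one

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:  # ignore whitespace-only strings
            self.stack[-1]["children"].append(
                {"tag": "Text", "text": text, "children": []}
            )


def preorder_tags(node):
    """Tags of all nodes in document order."""
    tags = [node["tag"]]
    for child in node["children"]:
        tags.extend(preorder_tags(child))
    return tags


builder = TreeBuilder()
builder.feed('<div><a><img src="p.png"></a><span>M Phone A4</span><p>5500RMB</p></div>')
builder.close()
tags = preorder_tags(builder.root)
```

In this sketch the two text strings appear as leaf nodes under `Span` and `P`, so a later boundary expansion can count them like any other node.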
In FIG. 8A, assume that node 818, node 821 and node 824 have been identified as anchor elements with the Img (image) tag. Iterative boundary expansion can then be performed synchronously starting from each of these nodes, so as to obtain the item trees that originate from them. Hereinafter, the item tree originating from node 818 is referred to as the first item tree, the item tree originating from node 821 as the second item tree, and the item tree originating from node 824 as the third item tree.
FIG. 8B shows the 1st iteration step. Since none of node 818, node 821 and node 824 has a sibling node, parent-node expansion is performed in the 1st iteration step. For example, the first item tree expands from node 818 to its parent node 806; the second item tree expands from node 821 to its parent node 810; and the third item tree expands from node 824 to its parent node 814.
FIG. 8C shows the 2nd to 4th iteration steps, in which sibling-node expansion is performed. Taking the first item tree as an example, node 806, determined by the 1st iteration step, is the current node and has sibling nodes 807, 808 and 809. The 2nd to 4th iteration steps therefore expand rightward, as shown by the arrows, to node 807, node 808 and node 809 in turn. Similarly, in the second item tree the 2nd to 4th iteration steps expand rightward to node 811, node 812 and node 813 in turn, and in the third item tree they expand rightward to node 815, node 816 and node 817 in turn. In addition, since node 808 has child node 819 and node 809 has child node 820, node 819 and node 820 may also be included in the first item tree. Similarly, child node 822 of node 812 and child node 823 of node 813 may be included in the second item tree, and child node 825 of node 816 and child node 826 of node 817 may be included in the third item tree.
FIG. 8D shows the 5th iteration step, in which parent-node expansion is performed. Taking the first item tree as an example, since all sibling nodes 807, 808 and 809 of node 806 have already been included in the first item tree by the 2nd to 4th iteration steps, the 5th iteration step expands, as shown by the arrow, to node 803, the parent of nodes 806 to 809. Similarly, the 5th iteration step expands to node 804 in the second item tree and to node 805 in the third item tree.
FIG. 8E shows the 6th iteration step, in which parent-node expansion is performed. Taking the first item tree as an example, node 803, determined by the 5th iteration step, has no sibling nodes, so the 6th iteration step expands, as shown by the arrow, to node 802, the parent of node 803. Similarly, the 6th iteration step expands to node 802 in both the second item tree and the third item tree.
In the 6th iteration step, node 802 would be included in the first, second and third item trees simultaneously, which results in node overlap. The iterative boundary expansion is therefore stopped, and node 802, determined by the 6th iteration step, is excluded from each of the first, second and third item trees. In FIG. 8F, dashed boxes show the finally obtained first item tree 830, second item tree 840 and third item tree 850, which correspond respectively to the first, second and third items of the original list in the target web page. Thus, through the iterative boundary expansion of FIGS. 8A to 8F, the original list in the target web page and the items in it can be identified. It should be understood that the similarity check described above in connection with FIG. 7 may also be performed on the final item trees shown in FIG. 8F. In addition, it should be noted that, as shown in FIG. 8F, each item tree has its own root node: the first item tree 830, second item tree 840 and third item tree 850 have root nodes 803, 804 and 805 respectively. The boundary of each item tree can therefore be indicated by the HTML tag of its root node; for example, the boundary of the first item tree can be indicated by the "Div" tag of root node 803.
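The walk through FIGS. 8A to 8F can be sketched as a small synchronous expansion loop. This is a minimal illustration, not the claimed method: the `Node` class, the functions `next_step` and `expand`, the rule "try the right sibling, then the left sibling, then the parent", and the sample tree with its node names are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from itertools import combinations


@dataclass(eq=False)  # identity comparison, so parent/child cycles are harmless
class Node:
    name: str
    tag: str
    children: list = field(default_factory=list)
    parent: "Node | None" = field(default=None, repr=False)

    def add(self, *kids):
        for kid in kids:
            kid.parent = self
            self.children.append(kid)
        return self


def subtree(node):
    """Names of all nodes in the subtree rooted at `node`."""
    names = {node.name}
    for child in node.children:
        names |= subtree(child)
    return names


def next_step(top):
    """One expansion step from the current span of adjacent sibling nodes.

    Assumed order: right sibling, then left sibling, then parent.
    Returns the new span, or None when there is nothing left to expand to.
    """
    parent = top[0].parent
    if parent is None:
        return None
    siblings = parent.children
    lo, hi = siblings.index(top[0]), siblings.index(top[-1])
    if hi + 1 < len(siblings):
        return top + [siblings[hi + 1]]
    if lo > 0:
        return [siblings[lo - 1]] + top
    return [parent]


def expand(anchors, max_steps=50):
    """Expand all item trees in lockstep; stop before any two of them overlap."""
    tops = [[anchor] for anchor in anchors]
    for _ in range(max_steps):
        candidate = [next_step(top) for top in tops]
        if any(span is None for span in candidate):
            break
        covered = [set().union(*(subtree(n) for n in span)) for span in candidate]
        if any(a & b for a, b in combinations(covered, 2)):
            break  # node overlap: discard this step and stop expanding
        tops = candidate
    return tops


def make_item(i):
    """A hypothetical item: Div > (A > Img, Span > Text, P > Text)."""
    img = Node(f"img{i}", "Img")
    item = Node(f"item{i}", "Div").add(
        Node(f"a{i}", "A").add(img),
        Node(f"span{i}", "Span").add(Node(f"title{i}", "Text")),
        Node(f"p{i}", "P").add(Node(f"price{i}", "Text")),
    )
    return item, img


items, anchors = zip(*(make_item(i) for i in (1, 2, 3)))
container = Node("list", "Div").add(*items)
tops = expand(list(anchors))
roots = [[n.name for n in top] for top in tops]
```

With this sample tree each anchor grows into exactly one item container, which mirrors the kind of result shown in FIG. 8F. In this sketch the overlap is detected when the containers try to absorb each other as siblings rather than when they reach a common parent; either way, the overlapping step is discarded.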
FIGS. 9A to 9F illustrate examples of iterative boundary expansion according to an embodiment. They show examples of performing iterative boundary expansion in different manners in an exemplary DOM tree corresponding to a target web page. The DOM tree may include multiple nodes, such as nodes 901 to 929 and other nodes not shown.
In FIG. 9A, assume that node 919, node 923 and node 927 have been identified as anchor elements with the Img (image) tag. Iterative boundary expansion can then be performed synchronously starting from each of these nodes, so as to obtain the item trees that originate from them. Hereinafter, the item tree originating from node 919 is referred to as the first item tree, the item tree originating from node 923 as the second item tree, and the item tree originating from node 927 as the third item tree.
FIG. 9B shows the 1st iteration step. Since none of node 919, node 923 and node 927 has a sibling node, parent-node expansion is performed in the 1st iteration step. For example, the first item tree expands from node 919 to its parent node 904; the second item tree expands from node 923 to its parent node 909; and the third item tree expands from node 927 to its parent node 914.
FIG. 9C shows the item trees finally obtained when, following the 1st iteration step of FIG. 9B, the subsequent iterative boundary expansion is performed in one expansion manner. As shown in FIG. 9C, the 2nd to 5th iteration steps perform sibling-node expansion leftward in turn, as indicated by the arrows. In the first item tree, the 2nd iteration step expands leftward to node 903; since there are no other sibling nodes, the 3rd to 5th iteration steps are suspended in the first item tree. In the second item tree, the 2nd to 5th iteration steps expand leftward to node 908, node 907, node 906 and node 905 in turn. In the third item tree, the 2nd to 5th iteration steps expand leftward to node 913, node 912, node 911 and node 910 in turn. Since node 902 is the common parent node of the first, second and third item trees, the finally obtained item trees do not include node 902, so as to avoid node overlap. In FIG. 9C, dashed boxes show the finally obtained first item tree 932, second item tree 934 and third item tree 936. Note that node 915, node 916, node 917, node 928 and node 929 are not included in any item tree. Further, assume that the similarity check described above in connection with FIG. 7 is performed on these finally obtained item trees, and that although the second item tree 934 and the third item tree 936 are highly similar, the tree similarity between the first item tree 932 and each of the second item tree 934 and the third item tree 936 is below the similarity threshold. The first item tree 932 can therefore be regarded as not meeting the requirements and be discarded, so the expansion manner of FIG. 9C ultimately outputs only two item trees, 934 and 936.
FIG. 9D shows the item trees finally obtained when, following the 1st iteration step of FIG. 9B, the subsequent iterative boundary expansion is performed in another expansion manner. As shown in FIG. 9D, the 2nd to 5th iteration steps perform sibling-node expansion rightward in turn, as indicated by the arrows. In the first item tree, the 2nd to 5th iteration steps expand rightward to node 905, node 906, node 907 and node 908 in turn. In the second item tree, the 2nd to 5th iteration steps expand rightward to node 910, node 911, node 912 and node 913 in turn. In the third item tree, the 2nd to 4th iteration steps expand rightward to node 915, node 916 and node 917 in turn; since there are no further sibling nodes, the 5th iteration step is suspended in the third item tree. In FIG. 9D, dashed boxes show the finally obtained first item tree 942, second item tree 944 and third item tree 946. Note that node 903 and node 918 are not included in any item tree. Further, assume that the similarity check described above in connection with FIG. 7 is performed on these finally obtained item trees, and that the tree similarity between every pair of them is not below the similarity threshold. Accordingly, the expansion manner of FIG. 9D ultimately outputs all three item trees 942, 944 and 946.
According to embodiments of the present disclosure, in order to find better item trees, a further iterative boundary expansion may be performed, for example according to step 770 in FIG. 7. Assume that the item trees shown in FIG. 9D are each reset to their state at the 4th iteration step, as shown in FIG. 9E. In FIG. 9E, the current expansion path of the first item tree includes, in order and as shown by the arrows, node 919, node 904, node 905, node 906 and node 907; that of the second item tree includes node 923, node 909, node 910, node 911 and node 912; and that of the third item tree includes node 927, node 914, node 915, node 916 and node 917. Whereas the 5th iteration step in FIG. 9D performed sibling-node expansion rightward, the 5th iteration step in FIG. 9F may attempt sibling-node expansion leftward, that is, in the direction opposite to that of the 5th iteration step in FIG. 9D. Accordingly, in the 5th iteration step in FIG. 9F, the expansion path of the first item tree further includes node 903, that of the second item tree further includes node 908, and that of the third item tree further includes node 913, as shown by the arrows. In FIG. 9F, dashed boxes show the finally obtained first item tree 952, second item tree 954 and third item tree 956. Further, assume that the similarity check described above in connection with FIG. 7 is performed on these finally obtained item trees, and that the tree similarity between every pair of them is not below the similarity threshold. Accordingly, the expansion manner of FIG. 9F ultimately outputs all three item trees 952, 954 and 956.
In embodiments of the present disclosure, the iterative boundary expansion may follow predetermined criteria, for example: the more item trees obtained, the better; the more nodes in each item tree, the better; the higher the tree similarity between different item trees, the better; and so on. Comparing FIG. 9C, FIG. 9D and FIG. 9F shows that the expansion manners of FIG. 9D and FIG. 9F are better than that of FIG. 9C, because they output a larger number of item trees. In addition, the expansion manner of FIG. 9F is better than that of FIG. 9D, because it can include more nodes (for example, node 903 and node 918) in the item trees, and because the item trees it produces are more similar to one another: for example, the tree similarity between the third item tree 956 and each of the first item tree 952 and the second item tree 954 will be higher than the tree similarity between the third item tree 946 and each of the first item tree 942 and the second item tree 944.
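One way to apply these criteria when comparing several expansion manners is to turn them into a sort key. The sketch below is an assumption made for illustration: the disclosure lists the criteria but does not say how they are weighted, so the lexicographic order (tree count, then total node count, then mean similarity), the `Candidate` class and all numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    tree_sizes: list        # number of nodes in each output item tree
    mean_similarity: float  # average pairwise tree similarity, in [0, 1]


def preference_key(c):
    # Assumed priority: more trees, then more nodes overall, then higher similarity.
    return (len(c.tree_sizes), sum(c.tree_sizes), c.mean_similarity)


# Made-up numbers that only mimic the qualitative outcome of FIGS. 9C, 9D and 9F.
candidates = [
    Candidate("left only (cf. FIG. 9C)", [7, 7], 0.95),
    Candidate("right only (cf. FIG. 9D)", [7, 7, 6], 0.80),
    Candidate("reset, then left (cf. FIG. 9F)", [7, 7, 7], 0.95),
]
best = max(candidates, key=preference_key)
```

A weighted sum of the three quantities would be an equally plausible design; the tuple key simply makes the priority between the criteria explicit.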
It should be understood that embodiments of the present disclosure may perform the further iterative boundary expansion in various ways. For example, instead of resetting the item trees shown in FIG. 9D to their state at the 4th iteration step as shown in FIG. 9E, the item trees may be reset to their state at any other predetermined previous iteration step, for example the 3rd iteration step. The reset item trees may then be expanded in the direction opposite to that of the iteration step that follows the predetermined previous iteration step. In addition, it should be understood that embodiments of the present disclosure may also perform the further iterative boundary expansion in several different ways and select, among them, the way that yields the best item trees.
FIG. 10 shows exemplary boundary detection results according to an embodiment. Assume that boundary detection has been performed for the target web page 20 in FIG. 2. As shown in FIG. 10, dashed box 1010 marks the item identified by boundary detection that corresponds to "M Phone A4", dashed box 1020 marks the item that corresponds to "M Phone A3", and dashed box 1030 marks the item that corresponds to "M Phone A2". The items marked by dashed boxes 1010, 1020 and 1030 together form the original list 202 in the target web page 20 shown in FIG. 2.
FIG. 11 illustrates an exemplary process 1100 for main list determination according to an embodiment. Process 1100 is an exemplary implementation of step 430 in FIG. 4. Assuming it has been determined that the target web page includes more than one original list, for example a first original list 1102, a second original list 1104 and so on, process 1100 may be performed to determine a main list from among these original lists.
At 1110, visual features of the first original list 1102 may be determined using at least the boundaries of the items in the first original list 1102. At 1120, visual features of the second original list 1104 may be determined using at least the boundaries of the items in the second original list 1104.
In embodiments of the present disclosure, the visual features of an original list are various visual characteristics that help determine whether the original list occupies a main position in the target web page, whether it is used to present the main content of the target web page, and so on. In one implementation, the visual features may include the minimum boundary distance between adjacent items in the original list, which indicates the visual distance between the two items. For example, the boundaries of two adjacent items can be used to compute the minimum boundary distance between them. In one implementation, the visual features may include a list position, which indicates whether the original list occupies a main position in the target web page and thus serves as its main content portion. For example, the list position may include the horizontal position of the original list in the target web page. The list position may also include the vertical position of the original list in the target web page, for example whether the list is located above the fold of the screen. In one implementation, the visual features may include item content richness, which indicates how visually rich the content of the items in the original list is. For example, the item content richness of an item may include quantities determined from the item's boundary, such as the size of the item and the number of nodes the item contains.
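Two of these features can be computed directly from rendered item boundaries. The sketch below assumes each boundary is available as an axis-aligned rectangle in page pixels; the `Box` class, the function names and the sample coordinates are hypothetical and not part of the disclosure.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Box:
    left: float
    top: float
    right: float
    bottom: float


def min_boundary_distance(a, b):
    """Shortest distance between two axis-aligned boxes (0 if they touch or overlap)."""
    dx = max(0.0, b.left - a.right, a.left - b.right)
    dy = max(0.0, b.top - a.bottom, a.top - b.bottom)
    return hypot(dx, dy)


def adjacent_gaps(boxes):
    """Minimum boundary distance for every pair of adjacent items, in list order."""
    return [min_boundary_distance(a, b) for a, b in zip(boxes, boxes[1:])]


def starts_above_the_fold(boxes, viewport_height):
    """True if the topmost item begins inside the first screen of the page."""
    return min(box.top for box in boxes) < viewport_height


# Three vertically stacked items, 10 px and 15 px apart.
items = [Box(0, 0, 300, 100), Box(0, 110, 300, 210), Box(0, 225, 300, 325)]
gaps = adjacent_gaps(items)
above = starts_above_the_fold(items, viewport_height=800)
```

How the per-pair gaps are aggregated for a whole list (minimum, maximum or average) is left open here, since the text does not specify it.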
At 1130, a main list may be determined from the first original list 1102 and the second original list 1104 based on the visual features of the first original list 1102 and the visual features of the second original list 1104. In one implementation, the main list may be determined using multiple heuristic rules defined over the visual features. For example, for the minimum boundary distance, a heuristic rule may be defined as to whether the visual distance between items in the list is small, on the basis that items in a main list are usually not far apart. For the list position, a heuristic rule may be defined as to whether the original list occupies a main position in the target web page, on the basis that a main list usually does. For the item content richness, a heuristic rule may be defined as to whether the original list has high item content richness, on the basis that a main list usually does. According to these heuristic rules, the original list that better satisfies them may be selected from the first original list 1102 and the second original list 1104 as the main list.
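A simple way to combine such rules is to count how many of them each original list satisfies and pick the list with the highest count. The following sketch works on precomputed features; the `ListFeatures` fields, the thresholds and the equal weighting of the rules are all assumptions chosen for illustration, not values from the disclosure.

```python
from dataclasses import dataclass


@dataclass
class ListFeatures:
    name: str
    max_item_gap: float        # largest gap between adjacent items, in pixels
    center_offset: float       # |list center - page center| / page width
    above_the_fold: bool       # list starts within the first screen
    avg_nodes_per_item: float  # a rough proxy for item content richness


def heuristic_score(f, max_gap=40.0, max_offset=0.25, min_nodes=5.0):
    """Number of heuristic rules the list satisfies (thresholds are hypothetical)."""
    rules = [
        f.max_item_gap <= max_gap,          # items are visually close together
        f.center_offset <= max_offset,      # list sits near the middle of the page
        f.above_the_fold,                   # list is visible without scrolling
        f.avg_nodes_per_item >= min_nodes,  # items carry rich content
    ]
    return sum(rules)


def pick_main_list(candidates):
    return max(candidates, key=heuristic_score)


product_list = ListFeatures("product list", 12.0, 0.10, True, 9.0)
sidebar_list = ListFeatures("sidebar list", 60.0, 0.42, True, 3.0)
main_list = pick_main_list([product_list, sidebar_list])
```

The same function handles more than two original lists without change, since it only ranks whatever candidates it is given.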
Taking the target web page 20 in FIG. 2 as an example, performing process 1100 can identify the original list 202, among the original list 202 and the original list 204, as the main list.
It should be understood that all steps in process 1100 are exemplary, and embodiments of the present disclosure also cover any modification of process 1100. For example, when there are more than two original lists, the main list may be determined in a manner similar to process 1100, using at least the visual features of those original lists. In addition, the visual features and heuristic rules given above are all exemplary; embodiments of the present disclosure may adopt any one or more of them, or any other types of visual features and heuristic rules.
FIG. 12 illustrates an exemplary process 1200 for obtaining representative metadata according to an embodiment. Process 1200 is an exemplary implementation of step 440 in FIG. 4. Assume that process 1200 is performed to obtain a set of representative metadata for a particular item 1202 in an original list.
At 1210, a set of leaf nodes may be identified in the item tree identified by the boundary of item 1202. For example, the leaf nodes in the item tree corresponding to item 1202 may be identified.
At 1220, a set of initial metadata corresponding to the identified set of leaf nodes may be extracted. For example, the initial metadata of each leaf node may be extracted. Taking the item marked by dashed box 1010 in FIG. 10 as an example, the initial metadata may include the picture in the item, the character string "M Phone A4, 6.5 inch, 256G, black", the icon of 5 solid stars, the character string "25900 reviews", the character string "5500RMB", and so on.
At 1230, a set of tags corresponding to the extracted set of initial metadata may be determined. For example, a corresponding tag is determined for each piece of initial metadata to indicate its specific meaning. The set of tags may be determined at 1230 in various ways.
In one implementation, the set of initial metadata may first be used to form a token sequence, in which each token corresponds to one piece of initial metadata. A feature set may then be computed for each token in the token sequence. The feature set may include various features that help determine tags, for example DOM tree features, XPath features, content features, language features and rendering features. DOM tree features may include, for example, the hierarchical depth, HTML tag and class ID of the node corresponding to the token. XPath features may include, for example, the name and CSS class of the node corresponding to the token. Content features may include, for example, the text vector of the token and whether its first letter is capitalized. Language features may include, for example, the language of the token and its Word2vec semantic feature vector. Rendering features may include, for example, various features involved in rendering the node corresponding to the token, such as position, length and width. It should be understood that embodiments of the present disclosure are not limited to the exemplary features given above, and may cover any other features or any combination of these features. The tag of each token may be generated by a pre-trained tagger model based on the feature sets of the tokens in the token sequence. By way of example, the tagger model may be a combined model formed by a discriminative model and a generative model, where the discriminative model may be, for example, a binary or multi-class classification model, and the generative model may be, for example, a sequence-to-sequence (Seq2seq) model. It should be understood that embodiments of the present disclosure are not limited to generating tags with the tagger model described above, and may generate tags in any other way.
Still taking the item marked by dashed box 1010 in FIG. 10 as an example, through step 1230, an "image" tag may be generated for the picture in the item, a "title" tag for the string "M mobile phone A4, 6.5 inches, 256G, black", a "rating" tag for the icon of five solid stars, a "review" tag for the string "25900 reviews", and a "price" tag for the string "5500RMB".
At 1240, the set of initial metadata may be sorted using the generated set of tags. In one implementation, a keyword ranking model may be pre-trained to rank a set of tags treated as keywords. For example, the keyword ranking model may be trained to rank tags according to criteria such as importance and representativeness. As an example, given an image tag, a title tag, a rating tag, a review tag, and a price tag, the sorting at 1240 may order these tags from high to low as, for example, image tag, title tag, price tag, rating tag, review tag. The initial metadata corresponding to these tags is accordingly sorted in the same order.
At 1250, one or more of the highest-ranked pieces of initial metadata may be selected as the set of representative metadata corresponding to item 1202.
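The sorting at 1240 and the selection at 1250 can be sketched as follows. This is a minimal illustration only: the fixed `TAG_PRIORITY` list stands in for the pre-trained keyword ranking model, and the `Metadata` class, the `top_k` parameter, and the file name `a4.jpg` are hypothetical names introduced for the example.

```python
from dataclasses import dataclass

# Hypothetical fixed order standing in for the trained keyword ranking model;
# it mirrors the example ordering given in the description.
TAG_PRIORITY = ["image", "title", "price", "rating", "review"]


@dataclass
class Metadata:
    tag: str    # tag produced by the tagging step
    value: str  # text or resource taken from a leaf node


def select_representative(items, top_k=3):
    """Sort initial metadata by tag priority and keep the top_k entries."""
    rank = {tag: i for i, tag in enumerate(TAG_PRIORITY)}
    # Unknown tags sort last; sorted() is stable, so ties keep document order.
    ordered = sorted(items, key=lambda m: rank.get(m.tag, len(rank)))
    return ordered[:top_k]


initial = [
    Metadata("image", "a4.jpg"),  # hypothetical image resource
    Metadata("title", "M mobile phone A4, 6.5 inches, 256G, black"),
    Metadata("rating", "5 stars"),
    Metadata("review", "25900 reviews"),
    Metadata("price", "5500RMB"),
]
representative = select_representative(initial)
```

With `top_k=3`, the selected metadata are the image, the title, and the price, which matches the content of item 1332 described for FIG. 13.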
By performing process 1200 for each item in the original list, multiple sets of representative metadata respectively corresponding to the items in the original list can be obtained. These sets of representative metadata may subsequently be used to generate a structured list.
It should be understood that all steps in process 1200 are exemplary, and embodiments of the present disclosure also cover any modification of process 1200.
According to embodiments of the present disclosure, after multiple sets of representative metadata respectively corresponding to the items in the original list have been obtained, the sets of representative metadata may be visualized as a structured list. Each set of representative metadata may form one new item in the structured list. It should be understood that embodiments of the present disclosure are not limited to any particular manner of visualizing the sets of representative metadata as a structured list. In one implementation, the format or layout of the structured list may be predefined to specify, for example, the arrangement of the items in the structured list (e.g., horizontal or vertical), the arrangement of the elements within each item, and the sizes of items and elements. In one implementation, the format or layout of the structured list may be similar to the structure and layout of the original list, except that the structured list may include fewer items or elements than the original list.
FIG. 13 illustrates an exemplary search results page 1300 according to an embodiment. Assume that a user has entered the query "M mobile phone" in search box 1310 to indicate a desire to obtain web search results about M mobile phones. Search results area 1320 in search results page 1300 includes multiple web search results. Unlike the search result for web page 20 of FIG. 2 shown in area 330 of FIG. 3, search results area 1320 in FIG. 13 includes an exemplary structured list 1330 generated for web page 20 of FIG. 2. Structured list 1330 is a simplified version of original list 202 in target web page 20 and can serve as a list summary of original list 202. Structured list 1330 still contains enough information for the user to intuitively and comprehensively understand the main content of original list 202. For example, structured list 1330 includes items 1332, 1334, and 1336, which respectively correspond to items 1010, 1020, and 1030 of the original list in target web page 20 (as shown in FIG. 10) and include the main representative content of the corresponding items in the original list. Item 1332, for example, includes the picture, brief description, and price of the mobile phone "M mobile phone A4" presented in area 22 of FIG. 2.
Therefore, by viewing structured list 1330 in search results area 1320, the user can intuitively and conveniently learn the main content of target web page 20 without, for example, clicking the link to target web page 20. It should be understood that search results page 1300 and structured list 1330 in FIG. 13 are merely exemplary, and embodiments of the present disclosure are not limited by this example.
FIG. 14 shows a flowchart of an exemplary method 1400 for list extraction and visualization in web pages according to an embodiment.
At 1410, at least one anchor element group in a target web page may be detected, the at least one anchor element group including a first anchor element group.
At 1420, boundary detection may be performed on multiple anchor elements in the first anchor element group to obtain the boundaries of multiple items respectively associated with the anchor elements, the items corresponding to a first original list in the target web page.
At 1430, multiple sets of representative metadata respectively corresponding to the items may be obtained from the target web page using the boundaries of the items.
At 1440, the multiple sets of representative metadata may be visualized as a structured list.
In one implementation, detecting the at least one anchor element group may include: identifying multiple html elements in the target web page that satisfy an anchor element constraint as multiple identified anchor elements; extracting, from the target web page, an attribute set of each of the identified anchor elements; and clustering the identified anchor elements into the at least one anchor element group based on the attribute sets of the identified anchor elements.
The anchor element constraint may include at least one of: the html element has an image tag; the html element has a title tag; and the html element represents a date. The attribute set of each identified anchor element may include at least one of the html tag attribute, the CSS class, and the XPath information of that identified anchor element.
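The detection of anchor element groups described above can be sketched as follows. This is an illustration under stated assumptions rather than the disclosed implementation: the `Element` class is a hypothetical stand-in for a parsed html element, the "title tag" constraint is read here as a heading element (`h1` to `h6`), the date constraint is omitted, and clustering is reduced to exact matching on a signature built from the tag, the CSS class, and the XPath with positional indices removed.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    tag: str        # html tag name, e.g. "img"
    css_class: str  # value of the class attribute
    xpath: str      # absolute XPath with positional indices


def is_anchor(el):
    """Simplified anchor constraint: image or heading elements only."""
    return el.tag == "img" or re.fullmatch(r"h[1-6]", el.tag) is not None


def signature(el):
    """Attribute set used for grouping; indices are stripped from the XPath."""
    return (el.tag, el.css_class, re.sub(r"\[\d+\]", "", el.xpath))


def detect_anchor_groups(elements, min_size=2):
    groups = defaultdict(list)
    for el in elements:
        if is_anchor(el):
            groups[signature(el)].append(el)
    # A repeated signature suggests list items; singletons are dropped.
    return [g for g in groups.values() if len(g) >= min_size]


page = [
    Element("img", "thumb", "/html/body/div[1]/ul/li[1]/img"),
    Element("img", "thumb", "/html/body/div[1]/ul/li[2]/img"),
    Element("img", "thumb", "/html/body/div[1]/ul/li[3]/img"),
    Element("img", "logo", "/html/body/header/img"),
    Element("p", "desc", "/html/body/div[1]/ul/li[1]/p"),
]
groups = detect_anchor_groups(page)
```

Here the three thumbnail images share one signature and form a single anchor element group, while the logo image is identified as an anchor element but does not form a group on its own.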
In one implementation, the boundary detection may include: based on a DOM tree corresponding to the target web page, synchronously performing iterative boundary expansion starting from each of the multiple anchor elements, to obtain multiple item trees respectively originating from the anchor elements, where each item tree represents one item and includes multiple nodes, and each node corresponds to an element determined through the iterative boundary expansion.
The iterative boundary expansion may include: for each item tree, in each iteration step, expanding to a next node and including the next node in the item tree.
The iterative boundary expansion may include at least one of: performing sibling node expansion, to expand from a current node to a sibling node of the current node; and performing parent node expansion, to expand to the parent node of the current node after all sibling nodes of the current node have been included in the item tree.
The boundary detection may include: determining whether the current iteration step causes a node overlap between the item tree and at least one other item tree of the multiple item trees; and in response to determining that the node overlap occurs, stopping the iterative boundary expansion and excluding, from each of the multiple item trees, the node determined through the current iteration step.
The method may further include: if the current iteration step is a sibling node expansion, performing further iterative boundary expansion on the multiple item trees in the direction opposite to that of the current iteration step.
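The synchronous expansion, the overlap check, and the direction reversal described above can be sketched as follows. This is a simplified illustration, not the disclosed implementation: `Node`, `ItemTree`, and `detect_boundaries` are hypothetical names, the similarity check is left out, and when no sibling remains in the preferred direction the sketch falls back to the other side before moving to the parent, which is an assumption made for brevity.

```python
class Node:
    def __init__(self, name, children=()):
        self.name = name
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self

    def subtree(self):
        """Return this node and all of its descendants as a set."""
        nodes = {self}
        for child in self.children:
            nodes |= child.subtree()
        return nodes


class ItemTree:
    def __init__(self, anchor):
        self.top = [anchor]             # contiguous run of siblings at the top level
        self.covered = anchor.subtree()

    def next_node(self, direction):
        """Pick the next node: a sibling if one is left, otherwise the parent."""
        parent = self.top[0].parent
        if parent is None:
            return None, None
        siblings = parent.children
        lo, hi = siblings.index(self.top[0]), siblings.index(self.top[-1])
        right = siblings[hi + 1] if hi + 1 < len(siblings) else None
        left = siblings[lo - 1] if lo > 0 else None
        first, second = (right, left) if direction > 0 else (left, right)
        sibling = first or second       # fall back to the other side (simplification)
        if sibling is not None:
            return sibling, "sibling"
        return parent, "parent"

    def include(self, node, kind):
        if kind == "parent":
            self.top = [node]
        elif node.parent.children.index(node) > node.parent.children.index(self.top[-1]):
            self.top.append(node)
        else:
            self.top.insert(0, node)
        self.covered |= node.subtree()


def detect_boundaries(anchors):
    trees = [ItemTree(a) for a in anchors]
    direction, reversed_once = 1, False
    while True:
        steps = [t.next_node(direction) for t in trees]
        if any(node is None for node, _ in steps):
            break                       # some tree has reached the document root
        # Coverage each tree would have after this step.
        proposed = [t.covered | node.subtree() for t, (node, _) in zip(trees, steps)]
        overlap = any(
            proposed[i] & proposed[j]
            for i in range(len(trees)) for j in range(i + 1, len(trees))
        )
        if overlap:
            # Discard this step; retry once in the opposite sibling direction.
            if not reversed_once and any(kind == "sibling" for _, kind in steps):
                direction, reversed_once = -direction, True
                continue
            break
        for t, (node, kind) in zip(trees, steps):
            t.include(node, kind)
    return trees


def make_item(i):
    return Node(f"li{i}", [Node(f"img{i}"), Node(f"p{i}")])


items = [make_item(i) for i in range(1, 4)]
page = Node("body", [Node("h1"), Node("ul", items)])
anchors = [li.children[0] for li in items]
trees = detect_boundaries(anchors)
roots = [t.top[0].name for t in trees]
```

In this example each tree grows from its image to the sibling paragraph and then to the enclosing `li` element. The next sibling expansion would make neighbouring trees overlap in both directions, so expansion stops with one `li` subtree per item.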
In one implementation, the boundary detection may include: performing a similarity check on the multiple item trees.
The similarity check may be performed in response to determining that the number of nodes in at least one of the multiple item trees exceeds a node number threshold.
The similarity check may include: calculating a tree similarity between any two item trees of the multiple item trees; dividing the multiple item trees into at least one tree set using at least a similarity threshold, where the item trees in each tree set have tree similarities to one another that are not lower than the similarity threshold; determining whether the number of item trees in the tree set that contains the most item trees is below a tree number threshold; and in response to determining that the number of item trees is below the tree number threshold, stopping the iterative boundary expansion and excluding, from each of the multiple item trees, the nodes determined through a predetermined number of previous iteration steps.
Calculating the tree similarity may include at least one of: calculating the tree similarity using at least a matching weight computed from the CSS similarity between the root nodes of the two item trees; and calculating the tree similarity using the nodes of the two item trees that lie within a minimum depth level, where the minimum depth level is defined such that the number of visible nodes within the minimum depth level of an item tree reaches a predetermined proportion of the number of all visible nodes in that item tree.
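The minimum depth level and the division into tree sets can be sketched as follows. The description does not fix a particular similarity metric in this passage, so the sketch substitutes a multiset Jaccard measure over `(depth, tag)` pairs and does not model the CSS-based matching weight; the greedy grouping, the function names, and the threshold values are likewise assumptions for illustration.

```python
from collections import Counter


def min_depth_level(visible_per_depth, ratio):
    """Smallest depth d such that depths 0..d hold at least `ratio` of visible nodes."""
    total = sum(visible_per_depth)
    running = 0
    for depth, count in enumerate(visible_per_depth):
        running += count
        if running >= ratio * total:
            return depth
    return len(visible_per_depth) - 1


def tree_similarity(a, b, max_depth):
    """Multiset Jaccard over (depth, tag) pairs within max_depth (a stand-in metric)."""
    ca = Counter(n for n in a if n[0] <= max_depth)
    cb = Counter(n for n in b if n[0] <= max_depth)
    union = sum((ca | cb).values())
    return sum((ca & cb).values()) / union if union else 1.0


def largest_similar_set(trees, max_depth, threshold):
    """Greedy grouping: a tree joins a set only if it is similar to every member."""
    sets = []
    for tree in trees:
        for group in sets:
            if all(tree_similarity(tree, other, max_depth) >= threshold for other in group):
                group.append(tree)
                break
        else:
            sets.append([tree])
    return max(sets, key=len, default=[])


# Each item tree is flattened to (depth, tag) pairs for this sketch.
item = [(0, "li"), (1, "img"), (1, "p"), (2, "span")]
odd = [(0, "li"), (1, "table"), (2, "tr"), (2, "tr")]

depth = min_depth_level([1, 2, 1], ratio=0.7)   # visible nodes per depth of `item`
biggest = largest_similar_set([item, item, odd], depth, threshold=0.8)
tree_number_threshold = 2                       # hypothetical value
keep_expanding = len(biggest) >= tree_number_threshold
```

If `keep_expanding` were false, the expansion would stop and the nodes added in the most recent steps would be removed from every item tree, as the similarity check describes.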
The method may further include: resetting the multiple item trees to their state at a predetermined previous iteration step; and if the iteration step following that predetermined previous iteration step is a sibling node expansion, performing further iterative boundary expansion on the multiple item trees in the direction opposite to that of the following iteration step.
In one implementation, obtaining the multiple sets of representative metadata may include, for each of the multiple items: identifying a set of leaf nodes in the item tree identified by the boundary of the item; extracting a set of initial metadata corresponding to the set of leaf nodes; determining a set of tags corresponding to the set of initial metadata; sorting the set of initial metadata using the set of tags; and selecting one or more of the highest-ranked pieces of initial metadata as the set of representative metadata corresponding to the item.
Determining the set of tags may include: forming a token sequence from the set of initial metadata, each token in the token sequence corresponding to one piece of initial metadata in the set; computing a feature set for each token in the token sequence; and generating a tag for each token by a pre-trained tagger model, based on the feature sets of the tokens in the token sequence.
In one implementation, the at least one anchor element group may include a second anchor element group. The method may further include: performing boundary detection on multiple anchor elements in the second anchor element group to obtain the boundaries of multiple items respectively associated with those anchor elements, the items corresponding to a second original list in the target web page. The method may further include, before obtaining the multiple sets of representative metadata or before visualizing the multiple sets of representative metadata as a structured list: determining visual features of the first original list and visual features of the second original list using the boundaries of the items in the first original list and the boundaries of the items in the second original list, respectively; and determining, based on the visual features of the first original list and the visual features of the second original list, that the first original list is the main list among the first original list and the second original list.
The visual features may include at least one of: a minimum boundary distance between adjacent items; a list position; and an item content richness.
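The three visual features can be computed from item bounding boxes roughly as follows. How the features are combined to pick the main list is not stated in this passage, so `pick_main_list` uses a hypothetical rule based on only two of them (richer items first, then the list nearer the top of the page). The `Box` class, the use of text length as a richness proxy, and the pixel values are assumptions, and each list is assumed to have at least two items.

```python
from dataclasses import dataclass


@dataclass
class Box:
    left: float
    top: float
    width: float
    height: float
    text_len: int = 0  # rough proxy for item content richness


def gap(a, b):
    """Distance between two axis-aligned boxes (0 if they touch or overlap)."""
    dx = max(b.left - (a.left + a.width), a.left - (b.left + b.width), 0)
    dy = max(b.top - (a.top + a.height), a.top - (b.top + b.height), 0)
    return (dx * dx + dy * dy) ** 0.5


def visual_features(items):
    return {
        "min_gap": min(gap(a, b) for a, b in zip(items, items[1:])),
        "top": min(box.top for box in items),
        "richness": sum(box.text_len for box in items) / len(items),
    }


def pick_main_list(lists):
    """Hypothetical rule: prefer richer items, then the list nearer the page top."""
    def score(name):
        features = visual_features(lists[name])
        return (features["richness"], -features["top"])
    return max(lists, key=score)


products = [Box(0, 100, 600, 180, 120), Box(0, 290, 600, 180, 110), Box(0, 480, 600, 180, 130)]
sidebar = [Box(620, 100, 200, 40, 15), Box(620, 150, 200, 40, 12)]
main = pick_main_list({"products": products, "sidebar": sidebar})
```

In this example the product list is chosen because its items carry far more content than the short sidebar entries.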
In one implementation, the structured list may be presented in a search results page provided by a search service.
It should be understood that method 1400 may also include any step or process for list extraction and visualization in web pages according to the embodiments of the present disclosure described above.
FIG. 15 shows an exemplary apparatus 1500 for list extraction and visualization in web pages according to an embodiment.
Apparatus 1500 may include: an anchor element group detection module 1510, configured to detect at least one anchor element group in a target web page, the at least one anchor element group including a first anchor element group; a boundary detection module 1520, configured to perform boundary detection on multiple anchor elements in the first anchor element group to obtain the boundaries of multiple items respectively associated with the anchor elements, the items corresponding to a first original list in the target web page; a representative metadata obtaining module 1530, configured to obtain, from the target web page and using the boundaries of the items, multiple sets of representative metadata respectively corresponding to the items; and a representative metadata visualization module 1540, configured to visualize the multiple sets of representative metadata as a structured list.
In addition, apparatus 1500 may also include any other module configured to perform any operation of the method for list extraction and visualization in web pages according to the embodiments of the present disclosure described above.
FIG. 16 shows an exemplary apparatus 1600 for list extraction and visualization in web pages according to an embodiment.
Apparatus 1600 may include at least one processor 1610. Apparatus 1600 may further include a memory 1620 connected to the at least one processor 1610. Memory 1620 may store computer-executable instructions that, when executed, cause the at least one processor 1610 to: detect at least one anchor element group in a target web page, the at least one anchor element group including a first anchor element group; perform boundary detection on multiple anchor elements in the first anchor element group to obtain the boundaries of multiple items respectively associated with the anchor elements, the items corresponding to a first original list in the target web page; obtain, from the target web page and using the boundaries of the items, multiple sets of representative metadata respectively corresponding to the items; and visualize the multiple sets of representative metadata as a structured list. In addition, the at least one processor 1610 may also be configured to perform any other operation of the method for list extraction and visualization in web pages according to the embodiments of the present disclosure described above.
Embodiments of the present disclosure provide a computer program product for list extraction and visualization in web pages. The computer program product includes a computer program that is executed by at least one processor to: detect at least one anchor element group in a target web page, the at least one anchor element group including a first anchor element group; perform boundary detection on multiple anchor elements in the first anchor element group to obtain the boundaries of multiple items respectively associated with the anchor elements, the items corresponding to a first original list in the target web page; obtain, from the target web page and using the boundaries of the items, multiple sets of representative metadata respectively corresponding to the items; and visualize the multiple sets of representative metadata as a structured list. In addition, the computer program may also be executed by the at least one processor to perform any other operation of the method for list extraction and visualization in web pages according to the embodiments of the present disclosure described above.
Embodiments of the present disclosure may be embodied in a non-transitory computer-readable medium. The non-transitory computer-readable medium may include instructions that, when executed, cause one or more processors to perform any step or process of the method for list extraction and visualization in web pages according to the embodiments of the present disclosure described above.
It should be understood that all operations in the methods described above are merely exemplary. The present disclosure is not limited to any operation in the methods or to the order of these operations, and covers all other equivalent variations under the same or similar concepts.
In addition, unless otherwise specified or clear from the context to be directed to a singular form, the articles "a" and "an" as used in this specification and the appended claims should generally be construed to mean "one" or "one or more".
It should also be understood that all modules in the apparatuses described above may be implemented in various ways. These modules may be implemented as hardware, software, or a combination thereof. Furthermore, any of these modules may be functionally further divided into sub-modules or combined together.
Processors have been described in connection with various apparatuses and methods. These processors may be implemented using electronic hardware, computer software, or any combination thereof. Whether such processors are implemented as hardware or software depends on the particular application and the overall design constraints imposed on the system. As examples, a processor, any portion of a processor, or any combination of processors presented in this disclosure may be implemented as a microprocessor, a microcontroller, a digital signal processor (DSP), a field programmable gate array (FPGA), a programmable logic device (PLD), a state machine, gate logic, discrete hardware circuits, or other suitable processing components configured to perform the various functions described in this disclosure. The functions of a processor, any portion of a processor, or any combination of processors presented in this disclosure may be implemented as software executed by a microprocessor, a microcontroller, a DSP, or another suitable platform.
Software should be construed broadly to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, threads of execution, procedures, functions, and the like. Software may reside on a computer-readable medium. The computer-readable medium may include, for example, a memory, which may be, for example, a magnetic storage device (e.g., a hard disk, a floppy disk, or a magnetic strip), an optical disk, a smart card, a flash memory device, random access memory (RAM), read-only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), a register, or a removable disk. Although the memory is shown as separate from the processor in various aspects presented in this disclosure, the memory may also be located inside the processor (e.g., as a cache or a register).
The above description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be apparent to those skilled in the art, and the general principles defined herein may be applied to other aspects. Accordingly, the claims are not intended to be limited to the aspects shown herein. All structural and functional equivalents to the elements of the various aspects described in this disclosure, whether known now or later to those skilled in the art, are intended to be covered by the claims.
Claims (20)
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210040984.0A CN116484126A (en) | 2022-01-14 | 2022-01-14 | List extraction and visualization in web pages |
| PCT/US2022/048129 WO2023136875A1 (en) | 2022-01-14 | 2022-10-28 | List extraction and visualization in web pages |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210040984.0A CN116484126A (en) | 2022-01-14 | 2022-01-14 | List extraction and visualization in web pages |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN116484126A true CN116484126A (en) | 2023-07-25 |
Family
ID=84361964
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202210040984.0A Pending CN116484126A (en) | 2022-01-14 | 2022-01-14 | List extraction and visualization in web pages |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN116484126A (en) |
| WO (1) | WO2023136875A1 (en) |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN103678510A (en) * | 2013-11-25 | 2014-03-26 | 北京奇虎科技有限公司 | Method and device for providing visualized label for webpage |
| CN104699841A (en) * | 2015-03-31 | 2015-06-10 | 北京奇虎科技有限公司 | Method and device for providing list summary information of search results |
| CN107918615A (en) * | 2016-10-09 | 2018-04-17 | 北京优朋普乐科技有限公司 | The search method and device of retrieval result are presented with tree-shaped drop-down list box |
| CN109086361A (en) * | 2018-07-20 | 2018-12-25 | 北京开普云信息科技有限公司 | A kind of automatic abstracting method of webpage article information and system based on mutual information between web page joint |
2022
- 2022-01-14: CN application CN202210040984.0A, published as CN116484126A (active, pending)
- 2022-10-28: WO application PCT/US2022/048129, published as WO2023136875A1 (not active, ceased)
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN116975167A (en) * | 2023-09-20 | 2023-10-31 | 联通在线信息科技有限公司 | Metadata grading method and system based on weighted Jaccard coefficient |
| CN116975167B (en) * | 2023-09-20 | 2024-02-27 | 联通在线信息科技有限公司 | Metadata grading method and system based on weighted Jaccard coefficient |
Also Published As
| Publication number | Publication date |
|---|---|
| WO2023136875A1 (en) | 2023-07-20 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||