RU2709647C1

RU2709647C1 - Method of associating a domain name with a characteristic of visiting a website

Info

Publication number: RU2709647C1
Application number: RU2018139988A
Authority: RU
Inventors: Дашунь ЧЖАН
Original assignee: Шанхай Яму Коммюникейшн Текнолоджи Ко., Лтд
Priority date: 2016-04-14
Filing date: 2016-08-17
Publication date: 2019-12-19
Also published as: JP2019514137A; RU2709647C9; CN105763633B; CN105763633A; GB2567749A; WO2017177590A1; JP6703621B2

Abstract

FIELD: computer engineering.

SUBSTANCE: the invention relates to computer engineering. The method comprises the steps of: simulating a user's website visiting characteristic with a web crawler to obtain a crawled set of DNS domain name queries; segmenting the DNS log to obtain n sets of domain name queries; and performing set-to-set matching between the crawled set of DNS domain name queries and the n sets of domain name queries obtained by segmenting the DNS log. If one of the n sets obtained by segmentation is equal to or contained in the crawled set of DNS domain name queries, the DNS log is taken to indicate that the user clicked on the domain name of the uniform resource locator (URL) requested by the web crawler during crawling.

EFFECT: easier analysis of viewing characteristics in the Internet.

6 cl, 2 dwg

Description

FIELD OF THE INVENTION

The disclosure relates to the field of Internet DNS domain name resolution and web crawler technology, and in particular to a method for associating a domain name with a website visiting characteristic.

BACKGROUND

DNS (Domain Name System) is a distributed database that provides a mapping between domain names and IP addresses on the Internet. DNS lets the user access the Internet more conveniently, without having to remember the numeric IP address strings that machines read directly. “DNS name resolution technology” means that, when visiting a website, the user first enters its domain name in the browser and presses the enter key. The browser then initiates a DNS query. Using DNS, the browser obtains the server IP address corresponding to the domain name and initiates an HTTP request to that IP address.

Web crawler technology refers to a program or script that automatically crawls web information according to certain rules. A web crawler simulates a user initiating an HTTP request to a website and records the DNS queries generated during this process.

The value of DNS data has always been overlooked: it has been regarded only as a correspondence between IP addresses and domain names, so at present no one on the market performs association using DNS data.

SUMMARY OF THE INVENTION

The disclosure provides a method for associating a domain name with a website visiting characteristic. By combining DNS log collection with web crawler technology, a user's Internet browsing behavior can also be analyzed from the DNS log.

In this disclosure, the method for associating a domain name with a website visiting characteristic includes the following steps: step S1, in which a user's website visit is simulated by a web crawler program so as to obtain all DNS domain name queries in the current HTTP request, i.e. the crawled set of DNS domain name queries; step S2, in which the DNS log is segmented to obtain n sets of domain name queries, where n is an integer greater than or equal to 1; and step S3, in which set-to-set matching is performed between the crawled set of DNS domain name queries from step S1 and the sets of domain name queries obtained by segmenting the DNS log in step S2, and, if one of the sets obtained by segmenting the DNS log is equal to or contained in the crawled set of DNS domain name queries, the DNS log is taken to indicate that the user clicked on the domain name of the URL requested by the web crawler program during crawling.

Preferably, in step S2, the DNS log is a DNS log recording the visiting characteristics for a given day.

Preferably, in step S2, the segmentation of the DNS log is a two-pass segmentation, i.e. first a segmentation based on the source IP address, and then a further segmentation based on the difference between timestamps.

Preferably, the segmentation of the DNS log based on the source IP address consists in obtaining consecutive DNS log entries that share the same source IP address within a period of time.

Preferably, the segmentation based on the difference between timestamps further segments the log, after it has been segmented by source IP address, according to the difference between the timestamps of the DNS log entries: if the difference between the timestamps of two DNS log entries is greater than a certain time period, the two entries are separated.

Preferably, the certain time period is three seconds.

With the method for associating a domain name with a website visiting characteristic according to the disclosure, a user's Internet browsing characteristics can also be analyzed from the DNS log.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic representation of a set of DNS domain name queries crawled by a web crawler program.

FIG. 2 is a flowchart of the method for associating a domain name with a website visiting characteristic according to the disclosure.

DETAILED DESCRIPTION OF THE INVENTION

The disclosure will now be described in detail with reference to the accompanying drawings and embodiments. The following embodiments are not intended to limit the invention. Changes and advantages that may be understood by those skilled in the art are included in the present disclosure without departing from the spirit and scope of the disclosure.

As indicated above, DNS (Domain Name System) is a distributed database that provides a mapping between domain names and IP addresses on the Internet. DNS lets the user access the Internet more conveniently, without having to remember the numeric IP address strings that machines read directly. When visiting a website, the user first enters its domain name in the browser and presses the enter key. The browser then initiates a DNS query. Using DNS, the browser obtains the server IP address corresponding to the domain name and initiates an HTTP request to that IP address. This is DNS name resolution technology.

A DNS log may be generated during the domain name resolution process described above. The DNS log can record the response content of each DNS query and can capture the domain name information of almost all user queries. The format of the DNS log is described as follows:

Thus, DNS log entries consist of a “Source IP address”, a “Domain name”, a “Timestamp”, a “Resolved IP” and a “Status code”. The method for associating a domain name with a website visiting characteristic according to the disclosure will now be described in detail with reference to FIG. 1.
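The five fields listed above could be modeled as a small record type. The following is a minimal sketch only: the original format table is not available in this text, so the `|` separator, the field order, and the sample resolved IP and status code values are assumptions made for illustration. The timestamp layout (YYYYMMDDHHMMSS) follows the example timestamps used later in the description.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class DnsLogEntry:
    """One DNS log entry with the five fields named in the description."""
    source_ip: str       # "Source IP address"
    domain: str          # "Domain name"
    timestamp: datetime  # "Timestamp"
    resolved_ip: str     # "Resolved IP"
    status_code: str     # "Status code"


def parse_entry(line: str, sep: str = "|") -> DnsLogEntry:
    """Parse one log line; the separator and field order are assumptions."""
    source_ip, domain, stamp, resolved_ip, status = (
        field.strip() for field in line.split(sep)
    )
    return DnsLogEntry(
        source_ip,
        domain,
        datetime.strptime(stamp, "%Y%m%d%H%M%S"),
        resolved_ip,
        status,
    )


# Hypothetical sample line: the resolved IP and status code are made up.
entry = parse_entry("2.2.2.2|www.baidu.com|20141211000005|10.0.0.1|0")
print(entry.domain, entry.timestamp.isoformat())
```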

First, a user's website visit is simulated by a web crawler program so as to obtain all DNS domain name queries in the current HTTP request, i.e. the crawled set of DNS domain name queries (step S1). For example, when a page is opened or a URL (link) is clicked, the crawler program can capture all DNS domain name queries in the current HTTP request. Because a user may also request other domain names in addition to the domain name of the current URL when clicking the URL, all DNS domain name queries generated after the click can be obtained with web crawler technology. Here, a uniform resource locator (URL) is a compact representation of the location of, and the method of access to, a resource available on the Internet, and is the standard address of a resource on the Internet. Each file on the Internet has a unique URL that contains information indicating the location of the file and how the browser should handle it.
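The description does not say how the crawler records the DNS queries. As one hedged illustration of step S1, the sketch below assumes the crawler already knows which resource URLs were fetched while loading a page and that each distinct hostname corresponds to one DNS query. The function name and the resource URLs are hypothetical.

```python
from urllib.parse import urlsplit


def crawled_domain_set(page_url, resource_urls):
    """Return the set of hostnames contacted while loading page_url."""
    hosts = set()
    for url in [page_url, *resource_urls]:
        host = urlsplit(url).hostname  # None if the URL has no network location
        if host:
            hosts.add(host)
    return hosts


# Hypothetical page and the resources it pulls in.
page = "http://www.a.com/doc/1234"
resources = [
    "http://www.b.com/style.css",
    "http://www.c.com/app.js",
    "http://www.a.com/logo.png",
]
domains = crawled_domain_set(page, resources)
print(sorted(domains))
```

A real crawler would more likely capture the DNS traffic itself (for example through a logging resolver), since scripts can trigger lookups that never appear as static resource URLs.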

For example, a user clicked on a specific URL (link), as shown below:

The crawler program can capture all DNS domain name queries generated after the URL is clicked, i.e. the set of DNS domain name queries, as shown in detail in FIG. 1.

Next, the DNS log is segmented to obtain n sets of domain name queries, where n is an integer greater than or equal to 1 (step S2). Here, the DNS log is a DNS log recording the visiting characteristics for a given day. The segmentation is a two-pass segmentation, i.e. first a segmentation based on the source IP address, and then a further segmentation based on the difference between timestamps.

1) Segmentation of the DNS log based on the source IP address means that consecutive DNS log entries are separated wherever the source IP address changes. In other words, segmentation based on the source IP address yields runs of consecutive DNS log entries that share the same source IP address within a period of time, as shown below:
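As a rough illustration of this first pass (not the patent's own listing), consecutive entries with the same source IP address can be grouped with `itertools.groupby`, which starts a new group whenever the key changes. The sketch assumes entries arrive in log order as `(source_ip, domain, timestamp)` tuples; the sample entries are hypothetical.

```python
from itertools import groupby


def split_by_source_ip(entries):
    """Split a log into runs of consecutive entries sharing one source IP."""
    return [list(run) for _, run in groupby(entries, key=lambda e: e[0])]


# Hypothetical log excerpt: (source IP, domain name, timestamp).
log = [
    ("1.1.1.1", "www.a.com", "20141211000001"),
    ("1.1.1.1", "www.b.com", "20141211000002"),
    ("2.2.2.2", "www.c.com", "20141211000005"),
    ("2.2.2.2", "www.d.com", "20141211000009"),
]
segments = split_by_source_ip(log)
for segment in segments:
    print(segment[0][0], [domain for _, domain, _ in segment])
```

Because `groupby` only merges adjacent items, an IP address that reappears later in the log starts a new segment, which matches the "consecutive entries" wording above.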

2) Segmentation based on the difference between timestamps means that, after the log has been segmented by source IP address, it is further segmented according to the difference between the timestamps of the DNS log entries. If the difference between the timestamps of two consecutive entries is greater than a certain time period, the two entries are separated (the reason being that the interval between them is large enough for them to be treated as two different visiting characteristics). The certain time period can be set as desired. In this embodiment, it is three seconds, i.e. the log is split wherever the interval between timestamps is greater than three seconds.
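The second pass can be sketched as follows, assuming the entries of one source IP address are already in time order and that timestamps use the YYYYMMDDHHMMSS layout seen in the examples of this description. The function name and the sample entries are hypothetical; the three-second default comes from this embodiment.

```python
from datetime import datetime, timedelta


def split_by_time_gap(entries, max_gap=timedelta(seconds=3)):
    """Split time-ordered (domain, timestamp) pairs where the gap exceeds max_gap."""
    segments = []
    previous = None
    for domain, stamp in entries:
        current = datetime.strptime(stamp, "%Y%m%d%H%M%S")
        # Start a new segment on the first entry or after a gap larger than max_gap.
        if previous is None or current - previous > max_gap:
            segments.append([])
        segments[-1].append(domain)
        previous = current
    return segments


# Hypothetical entries for a single source IP address.
entries = [
    ("www.a.com", "20141211000003"),
    ("www.b.com", "20141211000005"),  # 2 s after the previous entry: same segment
    ("www.c.com", "20141211000009"),  # 4 s after the previous entry: new segment
]
print(split_by_time_gap(entries))
```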

For example, the DNS log for source IP address 2.2.2.2 can be further segmented based on the difference between timestamps, as shown below. (The timestamp 20141211035932 denotes 03:59:32 on 11 December 2014.)

As shown above, the difference between second 05 in timestamp 20141211000005 and second 09 in timestamp 20141211000009 is four seconds (more than three seconds), so the log is split at that point.
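The timestamp arithmetic in this example can be checked with the standard `datetime` module. This is only a small sketch; the helper name `parse_dns_timestamp` is hypothetical, and the format string assumes the YYYYMMDDHHMMSS layout given in the parenthetical note above.

```python
from datetime import datetime


def parse_dns_timestamp(stamp):
    """Decode a 14-digit YYYYMMDDHHMMSS timestamp."""
    return datetime.strptime(stamp, "%Y%m%d%H%M%S")


parsed = parse_dns_timestamp("20141211035932")
gap = parse_dns_timestamp("20141211000009") - parse_dns_timestamp("20141211000005")
print(parsed.isoformat(), gap.total_seconds())
```

Subtracting parsed values rather than the raw digits avoids errors when two entries straddle a minute, hour, or day boundary.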

www.baidu.com, a.qq.com, b.baidu.com, ctanx.com, ctanx.com form part of the domain name query set in the DNS log.

Next, set-to-set matching is performed between the set of domain name queries crawled by the web crawler in step S1 and the sets of domain name queries obtained by segmenting the DNS log in step S2 (step S3). The matching rule is order-independent: [(a,b,c)=(b,c,a)=(a,c,b)].
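The matching rule above treats the queries as an unordered collection. One way to express that, offered as an assumption rather than the patent's stated implementation, is to compare `frozenset` values. Note that a set also collapses repeated domain names, which the source does not explicitly address.

```python
def same_query_set(left, right):
    """Compare two sequences of domain names, ignoring order and repeats."""
    return frozenset(left) == frozenset(right)


print(same_query_set(("a", "b", "c"), ("b", "c", "a")))
print(same_query_set(("a", "b", "c"), ("a", "c", "b")))
print(same_query_set(("a", "b", "c"), ("a", "b")))
```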

After matching, the DNS log is taken to indicate that the user clicked on the domain name (i.e. the domain name of the URL requested by the web crawler during crawling) if a set of domain name queries from the DNS log is contained in the set of domain name queries crawled by the web crawler, or if the two sets are equal. For example:

The URL crawled by the web crawler (representing the user's click) is www.a.com/doc/1234. Set A of all crawled domain name queries is “www.a.com, www.b.com, www.c.com, www.d.com, and www.e.com”.

Set B, a set of domain name queries obtained after segmenting the DNS log, is “www.a.com, www.b.com, www.e.com, and www.d.com”.

As shown above, since set B is contained in set A, set B of domain name queries is considered to reflect www.a.com/doc/1234, the user visiting characteristic represented by set A of domain names. Thus, users' Internet browsing characteristics can also be analyzed using the DNS log.
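The containment test in this example maps directly onto Python's subset operator. The sketch below uses the set A and set B values from the example; the function name is hypothetical, and rejecting an empty segment is an added assumption, since an empty set would otherwise match every crawled URL.

```python
def matches_crawled_set(log_segment, crawled_set):
    """True if the segment's domains equal or are contained in the crawled set."""
    segment = set(log_segment)
    # Assumption: an empty segment carries no evidence, so it never matches.
    return bool(segment) and segment <= set(crawled_set)


crawled_url = "www.a.com/doc/1234"
set_a = {"www.a.com", "www.b.com", "www.c.com", "www.d.com", "www.e.com"}
set_b = ["www.a.com", "www.b.com", "www.e.com", "www.d.com"]

if matches_crawled_set(set_b, set_a):
    print("DNS log segment reflects a visit to", crawled_url)
```

With many crawled URLs, each one would have its own crawled set, and a log segment would be tested against each in turn.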

The aspects described above are only preferred embodiments of the disclosure and are not intended to limit its scope. Any equivalent changes or modifications made in accordance with the claims of the disclosure fall within the technical scope of the disclosure.

Claims

1. A computer-implemented method for analyzing a user’s visit to a website, comprising the following steps:

step S1, in which a user’s visit to a website is simulated by a web crawler so as to obtain all domain name system (DNS) domain name queries in the current HTTP request, i.e. a crawled set of DNS domain name queries;

step S2, in which the DNS log is segmented to obtain n sets of domain name queries, where n is an integer greater than or equal to 1; and

step S3, in which set-to-set matching is performed between the crawled set of DNS domain name queries from step S1 and the n sets of domain name queries obtained by segmenting the DNS log in step S2, and, if one of the n sets of domain name queries obtained by segmenting the DNS log is equal to or contained in the crawled set of DNS domain name queries, the DNS log is taken to indicate that the user has navigated to the domain name of the uniform resource locator (URL) requested by the web crawler during crawling.

2. The method according to claim 1, wherein the DNS log in step S2 is a DNS log recording the visiting characteristics for a given day.

3. The method according to claim 1, wherein the segmentation of the DNS log in step S2 includes a twofold segmentation, i.e. first segmentation based on the source IP address, and then another segmentation based on the difference between timestamps.

4. The method according to claim 3, wherein the segmentation of the DNS log based on the source IP address consists in obtaining consecutive DNS log entries with the same source IP address within a period of time.

5. The method according to claim 4, wherein the segmentation based on the difference between timestamps consists in segmenting the log, after it has been segmented based on the source IP address, according to the difference between the timestamps of the DNS log entries, and, if the difference between the timestamps of two DNS log entries is greater than a certain time period, the two DNS log entries are separated.

6. The method according to claim 5, wherein the certain time period is three seconds.