WO2020044469A1 - Unauthorized web page detection device, control method for unauthorized web page detection device, and control program - Google Patents
Unauthorized web page detection device, control method for unauthorized web page detection device, and control program
- Publication number
- WO2020044469A1 (PCT/JP2018/031993)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- web page
- unauthorized
- html document
- detection device
- html
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
Definitions
- the present disclosure relates to an unauthorized Web page detection device, a control method of the unauthorized Web page detection device, and a control program.
- Patent Document 1 discloses a communication control device that prohibits access to a URL (Uniform Resource Locator) of a phishing site.
- the communication control device is provided on a communication path between the user's terminal and another device with which the user's terminal communicates, and compares the URL of the access destination content included in communication data transmitted by the terminal with the URLs included in a phishing site list, that is, a blacklist.
- When the URL of the access destination content matches a URL in the list, the communication control device prohibits access to the content.
- the purpose of the unauthorized Web page detection device, the control method of the unauthorized Web page detection device, and the control program is to make it possible to determine with high accuracy whether or not the Web page is an unauthorized Web page.
- the unauthorized Web page detection device includes: a storage unit that stores, for each of a plurality of unauthorized HTML (HyperText Markup Language) documents constituting each of a plurality of unauthorized Web pages, a feature vector based on a related state of a plurality of character strings in the HTML document;
- an acquisition unit that acquires an inspection target HTML document constituting an inspection target Web page; a vector calculation unit that calculates a feature vector of the inspection target HTML document; a similarity calculation unit that calculates a similarity between the feature vector of the inspection target HTML document and each of the feature vectors of the plurality of unauthorized HTML documents;
- a determination unit that determines whether the inspection target Web page is an unauthorized Web page based on the calculated similarities and a threshold value; and a determination result output unit that outputs the determination result of the determination unit.
- the storage unit further stores the feature vectors of a plurality of regular HTML documents constituting each of a plurality of regular Web pages, in association with the regular URL (Uniform Resource Locator) indicating each regular Web page.
- the acquisition unit further acquires the inspection target URL indicating the inspection target Web page, and, when the domain name in the inspection target URL does not match any of the domain names in the plurality of regular URLs, the similarity calculation unit preferably further calculates the similarity between the feature vector of the inspection target HTML document and each of the feature vectors of the plurality of regular HTML documents.
- the similarity calculation unit preferably does not calculate the similarity for an unauthorized HTML document when the difference between the size of that unauthorized HTML document and the size of the inspection target HTML document is equal to or larger than a predetermined value.
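The size-based skip described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the `max_size_diff` parameter, and the choice to measure size in UTF-8 bytes are all assumptions, since the text does not say how document size is measured.

```python
from typing import List


def should_compare(candidate_size: int, target_size: int, max_size_diff: int) -> bool:
    """Return True only when the size difference is below the predetermined value."""
    # A difference equal to or larger than the predetermined value means
    # the similarity is not calculated for this candidate.
    return abs(candidate_size - target_size) < max_size_diff


def candidates_to_compare(unauthorized_docs: List[str], target_doc: str,
                          max_size_diff: int) -> List[str]:
    """Keep only the stored documents whose size is close to the target's size."""
    # Assumption: size is the UTF-8 byte length of the document.
    target_size = len(target_doc.encode("utf-8"))
    return [
        doc for doc in unauthorized_docs
        if should_compare(len(doc.encode("utf-8")), target_size, max_size_diff)
    ]
```

Skipping documents of very different size reduces the number of similarity calculations per inspection; the appropriate cut-off would have to be chosen empirically.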
- the plurality of character strings include an HTML tag and a word.
- the plurality of character strings are preferably continuous character strings.
- in the method for controlling an unauthorized Web page detection device having a storage unit and an output unit, the unauthorized Web page detection device stores in the storage unit, for each of a plurality of unauthorized HTML documents constituting each of a plurality of unauthorized Web pages, a feature vector based on a related state of a plurality of character strings in the HTML document; acquires an inspection target HTML document constituting an inspection target Web page; calculates a feature vector of the inspection target HTML document; calculates a similarity between that feature vector and each of the feature vectors of the plurality of unauthorized HTML documents; determines, based on the calculated similarities and a threshold value, whether the inspection target Web page is an unauthorized Web page; and outputs the determination result to the output unit.
- the control program of the unauthorized Web page detection device having a storage unit and an output unit causes the unauthorized Web page detection device to: store in the storage unit, for each of a plurality of unauthorized HTML documents constituting each of a plurality of unauthorized Web pages, a feature vector based on a related state of a plurality of character strings in the HTML document; acquire an inspection target HTML document constituting an inspection target Web page; calculate a feature vector of the inspection target HTML document; calculate a similarity between that feature vector and each of the feature vectors of the plurality of unauthorized HTML documents; determine, based on the calculated similarities and a threshold value, whether the inspection target Web page is an unauthorized Web page; and output the determination result to the output unit.
- the unauthorized Web page detection device, the control method of the unauthorized Web page detection device, and the control program make it possible to determine with high accuracy whether or not the Web page is an unauthorized Web page.
- FIG. 1 is a diagram illustrating an example of a processing outline in an unauthorized Web page detection device.
- FIG. 2 is a diagram illustrating an example of a schematic configuration of a communication system 1.
- FIG. 3 is a diagram illustrating an example of a schematic configuration of an unauthorized Web page detection device 4.
- FIG. 4A is a diagram illustrating an example of a data structure of an unauthorized Web page table.
- FIG. 4B is a diagram illustrating an example of a data structure of a regular Web page table.
- FIG. 5 is a flowchart illustrating an example of an operation of the unauthorized Web page detection device 4. FIG. 6 is a flowchart illustrating an example of the initial processing. FIG. 7 is a flowchart illustrating an example of the inspection processing.
- FIG. 9 is a diagram illustrating an example of a feature vector processing outline.
- FIGS. 10(a) to 10(d) are diagrams illustrating an example of a screen displayed by the terminal 2.
- FIG. 1 is a diagram showing an example of a processing outline in the unauthorized Web page detection device.
- the unauthorized web page detection device stores a plurality of unauthorized HTML documents constituting each of a plurality of known unauthorized web pages.
- the fraudulent Web page is a Web page used in phishing scams, and the URL of a known fraudulent Web page is provided by an organization such as the Anti-Phishing Council, for example.
- the Web page includes an HTML document and an image described in the HTML document.
- the unauthorized Web page detection device calculates, for each of the plurality of unauthorized HTML documents, feature vectors 1 to n based on the associated state of a plurality of character strings in each HTML document.
- the character string is an HTML tag or a word.
- the related state of a plurality of character strings is a relation between the character strings, for example, an arrangement relation of a predetermined plurality of character strings in each HTML document.
- the plurality of character strings may include HTML tags and words, and may be continuous character strings.
- the feature vector is a vector having a plurality of dimensions, for example, 1 × 150. Each feature vector is calculated such that the feature vectors of two HTML documents having similar character string arrangements are more similar to each other than the feature vectors of two dissimilar HTML documents.
- the unauthorized Web page detection device acquires the inspection target HTML document included in the inspection target Web page.
- the inspection target Web page is a Web page to be inspected to determine whether or not it is a Web page used in phishing fraud, and is, for example, a Web page requested to access by a terminal different from the unauthorized Web page detection device.
- the unauthorized Web page detection device calculates the feature vector A for the inspection-target HTML document, similarly to the unauthorized HTML document.
- the unauthorized Web page detection device calculates similarities 1 to n between the calculated feature vector A and each of the feature vectors 1 to n.
- the fraudulent Web page detection device determines whether the inspection target Web page is a fraudulent Web page by comparing the calculated maximum value of the similarities 1 to n with a threshold value.
- When the maximum value is equal to or greater than the threshold value, the unauthorized Web page detection device determines that the inspection target Web page is an unauthorized Web page similar to the unauthorized Web page corresponding to the feature vector for which the maximum similarity was calculated.
- the unauthorized Web page detection device calculates a feature vector based on a related state of a plurality of character strings in each HTML document for each of a plurality of known unauthorized HTML documents and the inspection target HTML document.
- the unauthorized Web page detection device determines whether the inspection target Web page is an unauthorized Web page based on the similarity of the feature vectors.
- Unauthorized Web pages are often generated by a common tool, and a plurality of unauthorized Web pages generated by a common tool have common features attributable to the tool and are likely to be similar. Therefore, even if the URL of the inspection target Web page differs from the URL of any known unauthorized Web page, the unauthorized Web page detection device can use the feature vector of the HTML document to determine with high accuracy whether the inspection target Web page is an unauthorized Web page.
- FIG. 2 is a diagram illustrating an example of a schematic configuration of the communication system 1.
- the communication system 1 includes a terminal 2, a Web server 3, an unauthorized Web page detection device 4, and the like.
- the terminal 2, the Web server 3, and the unauthorized Web page detection device 4 are connected via a communication network 5 such as the Internet.
- the terminal 2 is a terminal used by the user for browsing the Web page.
- the terminal 2 communicates with the Web server 3 and the unauthorized Web page detection device 4 via the communication network 5 by a communication method such as TCP / IP (Transmission Control Protocol / Internet Protocol) and performs display according to the content of the communication.
- the Web server 3 is a server that transmits a Web page in response to a request from the terminal 2 and the unauthorized Web page detection device 4.
- the Web server 3 communicates with the terminal 2 and the unauthorized Web page detection device 4 via the communication network 5 by a communication method such as TCP / IP.
- When the terminal 2 accesses a Web page of the Web server 3 by specifying a URL, the terminal 2 transmits the same URL to the unauthorized Web page detection device 4.
- the unauthorized Web page detection device 4 specifies the transmitted URL, requests the Web server 3 to acquire an HTML document, and receives the HTML document from the Web server 3.
- the unauthorized Web page detection device 4 determines whether the received HTML document is an unauthorized HTML document, and transmits the determined result to the terminal 2.
- the terminal 2 displays either the Web page transmitted from the Web server 3 or a warning screen, according to the received determination result.
- FIG. 3 is a diagram showing an example of a schematic configuration of the unauthorized Web page detection device 4.
- As shown in FIG. 3, the unauthorized Web page detection device 4 includes a communication unit 41, a storage unit 42, and a processing unit 43.
- the communication unit 41 has a wired communication interface circuit such as a wired LAN or a wireless communication interface circuit such as a wireless LAN.
- the communication unit 41 communicates with the terminal 2, the Web server 3, and the like via the communication network 5 by a communication method such as TCP / IP.
- the communication unit 41 supplies data received from the terminal 2, the Web server 3, and the like to the processing unit 43.
- the communication unit 41 transmits the data supplied from the processing unit 43 to the terminal 2, the Web server 3, and the like.
- the communication unit 41 is an example of an output unit.
- the storage unit 42 has, for example, at least one of a semiconductor memory, a magnetic disk device, and an optical disk device.
- the storage unit 42 stores a driver program, an operating system program, an application program, data, and the like used for processing by the processing unit 43.
- the storage unit 42 stores a communication device driver program for controlling the communication unit 41 as a driver program. Further, the storage unit 42 stores a connection control program or the like according to a communication method such as TCP / IP as an operating system program. Further, the storage unit 42 stores a data processing program for transmitting and receiving various data and the like as an application program.
- the computer program may be installed in the storage unit 42 from a computer-readable portable recording medium such as a CD-ROM (Compact Disk Read Only Memory) or a DVD-ROM (Digital Versatile Disk Read Only Memory) using a known setup program.
- the storage unit 42 stores an unauthorized Web page table, a regular Web page table, and the like as data. The details of the unauthorized Web page table and the regular Web page table will be described later.
- the processing unit 43 has one or a plurality of processors and their peripheral circuits, and controls the overall operation of the unauthorized Web page detection device 4.
- the processing unit 43 is, for example, a CPU (Central Processing Unit).
- the processing unit 43 may be a DSP (Digital Signal Processor), an LSI (Large Scale Integration), an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array), or the like.
- the processing unit 43 controls operations of the communication unit 41 and the like so that various processes of the unauthorized Web page detection device 4 are executed in an appropriate procedure according to a program or the like stored in the storage unit 42.
- the processing unit 43 executes a process based on a program (a driver program, an operating system program, an application program, etc.) stored in the storage unit 42. Further, the processing unit 43 can execute a plurality of programs (such as application programs) in parallel.
- the processing unit 43 includes an acquisition unit 431, a preprocessing unit 432, a morphological analysis unit 433, a vector calculation unit 434, a similarity calculation unit 435, a determination unit 436, a determination result output unit 437, and the like.
- Each of these units included in the processing unit 43 is a functional module implemented by a program executed on a processor included in the processing unit 43.
- each of these units included in the processing unit 43 may be mounted on the unauthorized Web page detection device 4 as an independent integrated circuit, microprocessor, or firmware.
- FIG. 4A is a diagram showing an example of the data structure of the unauthorized Web page table.
- In the unauthorized Web page table, an ID for identifying an unauthorized Web page, a URL indicating the unauthorized Web page, an unauthorized HTML document included in the unauthorized Web page, a feature vector calculated based on the unauthorized HTML document, and the like are stored in association with each other.
- a plurality of unauthorized HTML documents are stored in the unauthorized Web page table, and the plurality of unauthorized HTML documents constitute each of the plurality of unauthorized Web pages.
- the feature vector may be stored in the storage unit 42 in association with an ID, a URL, and the like, separately from the unauthorized Web page table.
- the URL does not have to be included in the unauthorized Web page table.
- the storage unit 42 stores, for each of the plurality of unauthorized HTML documents constituting each of the plurality of unauthorized Web pages, the feature vector based on the related state of the plurality of character strings in the HTML document.
- FIG. 4B is a diagram showing an example of the data structure of the regular Web page table.
- In the regular Web page table, an ID for identifying a regular Web page, a regular URL indicating the regular Web page, a regular HTML document included in the regular Web page, a feature vector calculated based on the regular HTML document, and the like are stored in association with each other.
- the feature vector may be stored in the storage unit 42 in association with an ID, a regular URL, or the like, separately from the regular Web page table. Regardless of whether or not the feature vectors are stored in the regular Web page table, the storage unit 42 stores the feature vectors of the plurality of regular HTML documents constituting each of the plurality of regular Web pages in association with the regular URL indicating each regular Web page.
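The two tables described above could be modeled as below. This is only a sketch of the listed columns (ID, URL, HTML document, feature vector); the class name `WebPageRecord`, the field names, and the example URLs are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WebPageRecord:
    """One row of the unauthorized or regular Web page table (hypothetical layout)."""
    page_id: int
    # The text notes that the URL need not be included in the unauthorized table.
    url: Optional[str]
    html_document: str
    # Empty until the initial processing has calculated the vector.
    feature_vector: List[float] = field(default_factory=list)


# Placeholder rows; the URLs use reserved example domains.
unauthorized_table = [
    WebPageRecord(1, "http://phish.example/login", "<html><p>Sign in</p></html>"),
]
regular_table = [
    WebPageRecord(1, "https://bank.example/login", "<html><p>Sign in</p></html>"),
]
```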
- FIG. 5 is a flowchart showing an example of the operation of the unauthorized Web page detection device 4.
- the acquisition unit 431 reads out the unauthorized Web page table and the regular Web page table from the storage unit 42, and acquires a plurality of unauthorized HTML documents and a plurality of regular HTML documents, respectively (step S11).
- the unauthorized Web page detection device 4 executes an initial process (step S12).
- the vector calculation unit 434 of the unauthorized Web page detection device 4 calculates a feature vector for each of the plurality of unauthorized HTML documents and the plurality of regular HTML documents in the initial processing. Details of the initial processing will be described later.
- the processing in steps S11 and S12 is executed immediately after the unauthorized Web page detection device 4 is started.
- the acquisition unit 431 of the unauthorized Web page detection device 4 waits until a URL is received from the terminal 2 (Step S13).
- the terminal 2 transmits a Web page transmission request to the Web server 3 by specifying a URL, and transmits the same URL to the unauthorized Web page detection device 4.
- the acquisition unit 431 of the unauthorized Web page detection device 4 receives the URL transmitted from the terminal 2 via the communication unit 41, and acquires the URL as the inspection target URL indicating the inspection target Web page.
- the acquisition unit 431 specifies the acquired URL and transmits an HTML document transmission request to the Web server 3 via the communication unit 41 (step S14).
- the Web server 3 transmits the HTML document specified by the URL to the unauthorized Web page detection device 4.
- the acquisition unit 431 receives the HTML document from the Web server 3 via the communication unit 41, and acquires the HTML document as the inspection-target HTML document constituting the inspection-target Web page (Step S15).
- the determination unit 436 of the unauthorized Web page detection device 4 performs an inspection process on the inspection-target HTML document (step S16).
- the determining unit 436 determines whether or not the inspection target Web page including the inspection target HTML document is an unauthorized Web page in the inspection processing. Details of the inspection processing will be described later.
- the determination result output unit 437 outputs the determination result in the inspection process by transmitting it to the terminal 2 via the communication unit 41 (step S17).
- the determination result output unit 437 returns the processing to step S13, and repeats the processing from step S13 to step S17.
- When the terminal 2 receives the determination result, it identifies the content of the received determination result.
- the terminal 2 displays the Web page received from the Web server 3 when the determination result indicates a regular Web page, and displays a warning screen without displaying the Web page received from the Web server 3 when the determination result indicates an unauthorized Web page.
- the terminal 2 may receive and display the Web page from the Web server 3 before receiving the determination result indicating that the Web page is the unauthorized Web page from the unauthorized Web page detection device 4. In that case, the terminal 2 displays a warning screen instead of the displayed Web page.
- FIG. 6 is a flowchart showing an example of the initial processing. The initial processing is executed in step S12 of FIG. 5.
- the preprocessing unit 432 performs preprocessing on each of the plurality of unauthorized HTML documents and the plurality of regular HTML documents acquired in step S11 (step S21).
- the preprocessing unit 432 analyzes the contents of each HTML document based on HTML grammar rules as preprocessing, and deletes some characters in each HTML document based on the analysis result. For example, the preprocessing unit 432 deletes line feed codes, which are control characters indicating a line feed in each HTML document, blank characters before and after the line feed codes, comment character strings, JavaScript execution code, and the like. Further, the preprocessing unit 432 may delete the URL path described in an HTML tag of each HTML document, may delete part of an HTML tag, or may process other parts of the HTML document.
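The deletions listed above can be illustrated with a short sketch. Note the assumptions: the text says the preprocessing analyzes the document based on HTML grammar rules, whereas this example uses regular expressions as a simplification, and it removes whole `script` elements as one way of deleting JavaScript code. The function name `preprocess` is hypothetical.

```python
import re


def preprocess(html: str) -> str:
    """Delete comments, script elements, and line feeds with adjacent blanks."""
    # Remove comment character strings.
    html = re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)
    # Remove script elements together with the JavaScript code they contain.
    html = re.sub(r"<script\b.*?</script\s*>", "", html,
                  flags=re.DOTALL | re.IGNORECASE)
    # Remove line feed codes and the blank characters before and after them.
    html = re.sub(r"[ \t]*[\r\n]+[ \t]*", "", html)
    return html
```

A parser-based implementation would be more robust against malformed markup; the regular expressions here are only meant to show which parts of the document are dropped.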
- the morphological analysis unit 433 performs a morphological analysis process on each HTML document processed by the preprocessing unit 432 (step S22).
- the morphological analysis unit 433 performs morphological analysis on each HTML document, thereby converting the contents of each HTML document into a set of a plurality of character strings.
- the morphological analysis unit 433 performs a morphological analysis process using a known morphological analysis engine such as MeCab.
- the morphological analysis unit 433 performs processing such that, for example, an HTML tag such as <p> and a word other than the HTML tag are each treated as one character string.
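The idea that each HTML tag and each word becomes one character string can be sketched as follows. This is a stand-in, not the method in the text: the patent names a morphological analysis engine such as MeCab, while this example simply splits on whitespace, which only works for space-delimited languages. The names `TOKEN_PATTERN` and `tokenize` are hypothetical.

```python
import re
from typing import List

# Either a complete HTML tag, or a run of characters that is neither
# whitespace nor the start of a tag.
TOKEN_PATTERN = re.compile(r"<[^>]+>|[^<\s]+")


def tokenize(html: str) -> List[str]:
    """Split a preprocessed HTML document into tags and words."""
    return TOKEN_PATTERN.findall(html)
```

For example, `tokenize("<p>Hello world</p>")` yields the four strings `<p>`, `Hello`, `world`, and `</p>`.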
- the vector calculation unit 434 calculates, for each HTML document processed by the morphological analysis unit 433, a feature vector based on the associated state of a plurality of character strings in each HTML document (step S23).
- the vector calculation unit 434 calculates a feature vector by a learning device that is pre-trained so as to output a feature vector of the HTML document when an HTML document having a plurality of character strings is input.
- This learning device is pre-learned using an HTML document of an existing Web page by, for example, a neural network or the like, and is stored in the storage unit 42 in advance.
- the learning device is trained to output similar feature vectors for HTML documents in which the arrangement of character strings is similar, and to output dissimilar feature vectors for HTML documents in which the arrangement of character strings is not similar.
- the learning device executes this learning using a known method such as Doc2Vec.
- the HTML document used for the pre-learning is, for example, a Wikipedia HTML document.
- the vector calculation unit 434 may calculate the feature vector without using a learning device. In that case, the vector calculation unit 434 calculates a feature vector whose elements are the numbers of appearances in each document of a predetermined number, two or more, of character strings. The predetermined number of character strings are set in advance and stored in the storage unit 42. In this case, the related state of a plurality of character strings is the magnitude relation of the number of appearances of each character string, and for similar HTML documents, the magnitude relation of the number of appearances of each character string is similar.
- the vector calculation unit 434 calculates similar feature vectors for HTML documents in which the numbers of appearances of each character string are similar to each other, and calculates dissimilar feature vectors for HTML documents in which the numbers of appearances of each character string are not similar.
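The learner-free variant above, in which each element is the number of appearances of a preset character string, can be sketched like this. The function name `count_vector` and the example vocabulary are assumptions for illustration; the text does not specify which character strings are preset.

```python
from collections import Counter
from typing import List, Sequence


def count_vector(tokens: Sequence[str], vocabulary: Sequence[str]) -> List[int]:
    """Return one element per preset character string: its number of appearances."""
    counts = Counter(tokens)
    # Counter returns 0 for strings that never appear in the document.
    return [counts[term] for term in vocabulary]


# Hypothetical preset character strings (HTML tags and a word).
vocabulary = ["<p>", "<form>", "password"]
tokens = ["<form>", "<p>", "password", "</p>", "<p>", "</p>", "</form>"]
vector = count_vector(tokens, vocabulary)
```

Two documents with similar proportions of these strings produce vectors pointing in similar directions, which is what a cosine-type similarity compares.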
- the vector calculation unit 434 stores each of the calculated feature vectors in the unauthorized Web page table or the regular Web page table in association with the corresponding unauthorized HTML document or regular HTML document (step S24). Thus, the series of processing ends.
- FIG. 7 is a flowchart showing an example of the inspection processing.
- the inspection processing is executed in step S16 of FIG. 5.
- the preprocessing unit 432 performs preprocessing on the inspection target HTML document acquired in step S15 (step S31). This preprocessing is the same as the preprocessing described in step S21 except that the target is an HTML document to be inspected.
- the morphological analysis unit 433 performs a morphological analysis process on the inspection-target HTML document processed by the preprocessing unit 432 (step S32).
- This morphological analysis processing is the same as the morphological analysis processing described in step S22 except that the target is an HTML document to be inspected.
- the vector calculation unit 434 calculates the feature vector of the inspection-target HTML document processed by the morphological analysis unit 433 (step S33).
- This feature vector calculation process is the same as the feature vector calculation process described in step S23 except that the target is an inspection target HTML document.
- In this way, the vector calculation unit 434 calculates, for each of the plurality of unauthorized HTML documents, the plurality of regular HTML documents, and the inspection target HTML document, the feature vector based on the related state of the plurality of character strings in each HTML document.
- the similarity calculator 435 calculates the similarity between the feature vector of the inspection-target HTML document and each of the feature vectors of the plurality of unauthorized HTML documents stored in step S24 (step S34).
- the determination unit 436 determines whether the inspection target Web page is an unauthorized Web page based on the calculated similarities and the threshold (step S35).
- If the maximum value of the similarities is equal to or greater than the threshold value (step S35-Y), the determination unit 436 determines that the inspection target Web page is an unauthorized Web page similar to the unauthorized Web page corresponding to the feature vector for which the maximum similarity was calculated (step S36), and the series of processing ends.
- If the maximum value of the similarities is less than the threshold value (step S35-N), the determination unit 436 reads the regular Web page table and acquires a plurality of regular URLs (step S37).
- the determination unit 436 determines whether or not the domain name in the inspection target URL acquired in step S13 matches any of the domain names in the plurality of regular URLs acquired in step S37 (step S38).
- If the domain name matches any of them, the determination unit 436 determines that the inspection target Web page belongs to a regular Web site and is not an unauthorized Web page (step S39). Thus, the series of processing ends.
- If the domain name matches none of them, the determination unit 436 determines that the inspection target Web page does not belong to a regular Web site.
- the similarity calculation unit 435 calculates the similarity between the feature vector of the inspection target HTML document and each of the feature vectors of the plurality of regular HTML documents (step S40).
- the determination unit 436 determines whether or not the inspection target Web page is an unauthorized Web page by comparing the calculated maximum value of each similarity with the second threshold value (Step S41).
- the second threshold value may be the same value as the threshold value used in step S35 or a different value.
- the determination unit 436 has determined in step S38 that the inspection target Web page does not belong to a regular Web site. Therefore, when the maximum value of the similarity is equal to or larger than the second threshold value, the determination unit 436 determines that the inspection target Web page is an unauthorized Web page similar to a registered regular Web page (step S42).
- On the other hand, when the maximum value of the similarity is less than the second threshold value, the determination unit 436 determines that, although the inspection target Web page does not belong to a regular Web site, its content is not similar to any of the regular Web pages, and therefore that it is an unregistered regular Web page (step S43). Thus, the series of processing ends.
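The branching of steps S35 to S43 can be summarized in a short sketch. Several points are assumptions: the function and parameter names are hypothetical, the "domain name" is approximated by the full host name of the URL, and the similarities to the regular documents are passed in precomputed, whereas in the described flow they are calculated in step S40 only after the domain check fails.

```python
from typing import Sequence
from urllib.parse import urlparse


def domain_of(url: str) -> str:
    """Approximate the domain name by the lower-cased host name (assumption)."""
    return (urlparse(url).hostname or "").lower()


def inspect(target_url: str,
            sims_unauthorized: Sequence[float],
            regular_urls: Sequence[str],
            sims_regular: Sequence[float],
            threshold: float,
            second_threshold: float) -> str:
    # S35/S36: similar to a known unauthorized page.
    if sims_unauthorized and max(sims_unauthorized) >= threshold:
        return "unauthorized"
    # S37 to S39: the domain matches a registered regular URL.
    regular_domains = {domain_of(u) for u in regular_urls}
    if domain_of(target_url) in regular_domains:
        return "regular"
    # S40 to S42: looks like a regular page but is served from another domain.
    if sims_regular and max(sims_regular) >= second_threshold:
        return "unauthorized"
    # S43: not similar to anything registered.
    return "unregistered regular"
```

The second comparison is what catches a copy of a registered regular page hosted under an unknown domain, even when no similar unauthorized page has been registered yet.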
- FIG. 8A shows an example of input data to the morphological analysis unit 433, and FIG. 8B shows an example of output data of the morphological analysis unit 433.
- the input data to the morphological analysis unit 433 is an HTML document of an unauthorized Web page, a regular Web page, or the inspection target Web page from which the preprocessing unit 432 has deleted some characters such as line feed codes.
- the output data of the morphological analysis unit 433 is obtained by performing morphological analysis on the input data, grouping the resulting morphemes into word units, and enclosing each unit in double quotes. Note that the morphological analysis unit 433 may generate the output data by performing morphological analysis after removing the HTML tags from the input data, grouping the morphemes into words, and inserting each HTML tag, enclosed in double quotes, at its original position.
- FIG. 9 is a diagram showing an example of a feature vector processing outline.
- the storage unit 42 stores the unauthorized HTML documents 1 to n of the plurality of unauthorized Web pages 1 to n.
- the vector calculation unit 434 calculates feature vectors 1 to n for the unauthorized HTML documents 1 to n of the unauthorized Web pages 1 to n stored in the storage unit 42, respectively.
- the vector calculation unit 434 calculates the feature vector A for the inspection target HTML document of the inspection target Web page acquired by the acquisition unit 431.
- the similarity calculating section 435 calculates cosine similarities 1 to n of the feature vector A and each of the feature vectors 1 to n.
- the two feature vectors are similar when the cosine similarity is close to 1, and are not similar when the cosine similarity is close to -1.
- In this example, the similarity 1 is 0.9, the similarity 2 is 0.4, and the similarity n is -0.9.
- In step S35, the determination unit 436 determines whether or not the inspection target Web page is an unauthorized Web page by comparing the maximum value of the similarities 1 to n, here 0.9, with a threshold value.
- For example, when the threshold value is 0.8, the maximum value 0.9 of the similarities 1 to n is equal to or larger than the threshold value, and therefore the inspection target Web page is determined to be an unauthorized Web page corresponding to the unauthorized Web page 1.
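The comparison described above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function names `cosine_similarity` and `judge_from_similarities` are hypothetical, and the feature vectors are assumed to be plain lists of numbers.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors, in the range [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a zero vector is treated as dissimilar
    return dot / (norm_u * norm_v)


def judge_from_similarities(similarities, threshold):
    """Return (is_unauthorized, index of the most similar known page).

    The page is judged unauthorized when the maximum similarity is
    equal to or larger than the threshold, as in step S35.
    """
    if not similarities:
        return False, None
    best = max(range(len(similarities)), key=lambda i: similarities[i])
    return similarities[best] >= threshold, best


# Values from the example: similarities 0.9, 0.4, -0.9 and threshold 0.8.
result = judge_from_similarities([0.9, 0.4, -0.9], 0.8)
```

With the example values, `result` is `(True, 0)`, meaning the page matches the first registered unauthorized Web page.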
- FIGS. 10 (a) to 10 (d) are diagrams showing an example of a screen displayed by the terminal 2.
- When the user instructs the terminal 2 to start the Web browser, the terminal 2 starts and displays the Web browser.
- the display screen 60 of the Web browser includes a URL input area 61 and a display area 62.
- When the terminal 2 activates the Web browser, it also activates an application program that communicates with the unauthorized Web page detection device 4.
- The terminal 2 accesses the Web server 3 indicated by the specified URL and receives a Web page from the Web server 3. Further, the terminal 2 transmits the URL input to the Web browser to the unauthorized Web page detection device 4 according to the application program.
- the unauthorized Web page detection device 4 acquires the URL transmitted from the terminal 2 in step S13, executes the processes in steps S14 to S17, and transmits the determination result to the terminal 2.
- When the terminal 2 receives from the unauthorized Web page detection device 4 a determination result indicating that the Web page corresponding to the URL transmitted from the terminal 2 is a regular Web page, the Web page 81 received from the Web server 3 is displayed on the display screen 80.
- When the terminal 2 receives from the unauthorized Web page detection device 4 a determination result indicating that the Web page corresponding to the URL transmitted from the terminal 2 is an unauthorized Web page, a warning screen 90 is displayed.
- the data for the warning screen is stored in the terminal 2 in advance.
- On the warning screen 90, a character display 91 and an end button 92 are displayed.
- the character display 91 is a text warning that the Web page received from the Web server 3 may be a phishing page.
- the terminal 2 closes the warning screen 90.
- the unauthorized Web page detection device 4 calculates a feature vector based on a related state of a plurality of character strings in each HTML document for each of the plurality of known unauthorized HTML documents and the inspection-target HTML document.
- the fraudulent Web page detection device 4 determines whether or not the inspection target Web page is a fraudulent Web page based on the calculated similarity of the feature vectors.
- Unauthorized Web pages are often generated by a common tool, and a plurality of unauthorized Web pages generated by a common tool have common features attributable to the tool and are likely to be similar. For this reason, by using the feature vector of the HTML document, the unauthorized Web page detection device 4 can determine with high accuracy whether the inspection target Web page is an unauthorized Web page, even if its URL differs from the URLs of known unauthorized Web pages.
- The unauthorized Web page detection device 4 further calculates the similarity between the feature vector of the inspection target HTML document and each of the feature vectors of the plurality of regular HTML documents, and determines whether the inspection target HTML document is similar to a regular HTML document. Therefore, the unauthorized Web page detection device 4 can detect an unauthorized Web page that was created to resemble a regular Web page and has not yet been registered as an unauthorized Web page.
- the unauthorized Web page detection device 4 calculates a feature vector based on an associated state of a plurality of character strings including an HTML tag and a word.
- a plurality of malicious Web pages generated by a common tool are likely to have a specific association between the HTML tag and the word due to the tool.
- Because the unauthorized Web page detection device 4 determines whether or not the state of association between the HTML tags and the words is similar between the inspection target Web page and each unauthorized Web page, it can detect unauthorized Web pages with higher accuracy.
- the unauthorized Web page detection device 4 calculates the feature vector based on the related state of a plurality of continuous character strings. Web pages that tend to use similar HTML tags and / or word sets in consecutive character strings are likely to be similar Web pages. Therefore, the fraudulent Web page detection device 4 can detect a fraudulent Web page similar to a Web page registered as a fraudulent Web page with higher accuracy.
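The text here does not say how the feature vector is derived from consecutive character strings. As one possible illustration only, the sketch below swaps in a simple technique: it counts adjacent token pairs (bigrams) and lays the counts out over a shared vocabulary. The function names and the bigram choice are assumptions, not the method of the disclosure.

```python
from collections import Counter


def bigram_counts(tokens):
    """Count each pair of consecutive tokens (HTML tags and/or words)."""
    return Counter(zip(tokens, tokens[1:]))


def to_vectors(token_lists):
    """Build count vectors over the shared, sorted bigram vocabulary."""
    counts = [bigram_counts(tokens) for tokens in token_lists]
    vocab = sorted(set().union(*counts))
    vectors = [[c[bigram] for bigram in vocab] for c in counts]
    return vocab, vectors


# Two hypothetical token sequences that share the pair ("<form>", "password").
page_a = ["<form>", "password", "</form>"]
page_b = ["<form>", "password", "<input>"]
vocab, vectors = to_vectors([page_a, page_b])
```

Pages that use the same tag and word sequences produce vectors with overlapping non-zero entries, which is what a similarity measure such as cosine similarity then picks up.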
- the preprocessing unit 432 may calculate the size of each HTML document generated by the preprocessing in steps S21 and S31.
- The similarity calculation unit 435 calculates the difference between the calculated size of each of the plurality of unauthorized HTML documents and the calculated size of the inspection target HTML document. When the size difference is equal to or larger than a predetermined value, the similarity is not calculated for that unauthorized HTML document.
- The similarity calculation unit 435 calculates the difference between the calculated size of each of the plurality of regular HTML documents and the calculated size of the inspection target HTML document. When the size difference is equal to or larger than a predetermined value, the similarity is not calculated for that regular HTML document.
- the fraudulent Web page detection device 4 can speed up the inspection process without reducing the accuracy of determining the fraudulent Web page.
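The size-based skip described above can be sketched as a pre-filter that runs before any similarity is computed. The function name `candidates_by_size` and the use of a single integer size per document are assumptions made for illustration.

```python
def candidates_by_size(target_size, known_sizes, max_diff):
    """Return indices of known documents whose size is close to the target.

    A document is skipped when the absolute size difference is equal to
    or larger than max_diff (the predetermined value).
    """
    return [
        i for i, size in enumerate(known_sizes)
        if abs(size - target_size) < max_diff
    ]


# Differences are 100, 500, 100 and 200; only the first and third pass.
kept = candidates_by_size(1000, [900, 1500, 1100, 1200], 200)
```

Only the documents listed in `kept` would then go on to the similarity calculation.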
- the similarity calculation unit 435 may calculate the difference between the sizes of the HTML documents before the preprocessing unit 432 performs the preprocessing.
- the similarity calculation unit 435 may calculate the difference between the sizes of the HTML documents after the morphological analysis unit 433 has performed the morphological analysis processing.
- The morphological analysis unit 433 may perform the morphological analysis processing directly on the regular HTML document acquired in step S11 and the inspection target HTML document acquired in step S15.
- the vector calculation unit 434 may calculate a feature vector for an HTML document that has been preprocessed by the preprocessing unit 432, instead of the HTML document processed by the morphological analysis unit 433.
- The vector calculation unit 434 may calculate a feature vector for each of the regular HTML documents acquired in step S11 and the inspection target HTML document acquired in step S15, instead of the HTML document processed by the morphological analysis unit 433. For example, when the HTML document is written in a language such as English, in which words are separated by spaces, the vector calculation unit 434 may split the input HTML document at HTML tag boundaries and at the spaces between words, and calculate the feature vector based on the resulting plurality of character strings.
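For a space-delimited language, the splitting described above can be sketched with a regular expression. The pattern and the name `tokenize_html` are illustrative assumptions; the sketch treats each complete tag as one token and each whitespace-separated run of text as one token.

```python
import re

# Either a whole HTML tag ("<...>") or a run of characters that contains
# neither "<" nor whitespace.
TOKEN_RE = re.compile(r"<[^>]+>|[^<\s]+")


def tokenize_html(html):
    """Split an HTML string into tag tokens and word tokens, in order."""
    return TOKEN_RE.findall(html)


tokens = tokenize_html("<p>Sign in to your account</p>")
```

Here `tokens` is `['<p>', 'Sign', 'in', 'to', 'your', 'account', '</p>']`. A tag with attributes, such as `<a href="x">`, stays a single token because the first alternative matches up to the closing `>`.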
- The determination unit 436 may determine whether or not the number of unauthorized Web pages determined to have a similarity equal to or greater than the threshold in step S35 is equal to or greater than a predetermined number. For example, the determination unit 436 may determine that the inspection target Web page is an unauthorized Web page when the number of unauthorized Web pages whose similarity is equal to or higher than the threshold value is equal to or greater than the predetermined number, and may determine that the inspection target Web page is not an unauthorized Web page when that number is less than the predetermined number.
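This count-based variant can be sketched as follows. The function name `judge_by_match_count` and its parameter names are hypothetical; the similarities are assumed to have been computed already.

```python
def judge_by_match_count(similarities, threshold, min_matches):
    """Judge the page unauthorized when enough known pages are similar.

    Counts the known unauthorized pages whose similarity is equal to or
    greater than the threshold and compares that count with min_matches
    (the predetermined number).
    """
    matches = sum(1 for s in similarities if s >= threshold)
    return matches >= min_matches


two_hits = judge_by_match_count([0.9, 0.85, 0.4], 0.8, 2)
one_hit = judge_by_match_count([0.9, 0.4, -0.9], 0.8, 2)
```

Compared with checking only the maximum similarity, requiring several matches makes a single accidental resemblance less likely to trigger a detection, at the cost of missing pages that resemble only one registered page.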
- Steps S37 to S43 may be omitted, and the determination unit 436 may determine that the inspection target Web page is a regular Web page when the maximum value of the similarities calculated in step S34 is less than the threshold.
- The timing at which the determination unit 436 executes the processing of steps S37 to S38 may be moved to before the processing of step S31, and the processing may advance to step S40 in the case of No in step S35 (S35-N).
- the determination unit 436 performs the processing of steps S37 to S38 at the beginning of the inspection processing.
- the determination unit 436 determines that the inspection target Web page belongs to the legitimate Web site and is not an unauthorized Web page, and ends a series of processes, as in step S39.
- the determination unit 436 advances the processing to step S31.
- The storage unit 42 may further store, in association with each unauthorized HTML document in the unauthorized Web page table, URL information indicating which regular URL the unauthorized HTML document corresponds to, that is, the regular URL targeted by the phishing scam.
- The similarity calculation unit 435 further calculates the similarity between the feature vector of the inspection target HTML document and each of the feature vectors of the plurality of regular HTML documents. Then, for each unauthorized HTML document, the similarity calculation unit 435 calculates the average of the similarity of that unauthorized HTML document and the similarity of the regular HTML document associated with the regular URL indicated by its URL information.
- In step S35, the determination unit 436 determines whether or not the inspection target Web page is an unauthorized Web page by determining whether the maximum of the average values calculated by the similarity calculation unit 435 is equal to or greater than a threshold value.
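The averaging variant can be sketched as below. All names are hypothetical, and the mapping from each unauthorized document to its associated regular document is assumed to be given as a list of indices standing in for the stored URL information.

```python
def judge_by_averaged_similarity(unauth_sims, regular_sims, regular_index, threshold):
    """Return (is_unauthorized, index of the best unauthorized document).

    unauth_sims[i]   : similarity to unauthorized document i
    regular_sims[j]  : similarity to regular document j
    regular_index[i] : index of the regular document that unauthorized
                       document i imitates (stand-in for the URL information)
    """
    averages = [
        (u + regular_sims[regular_index[i]]) / 2
        for i, u in enumerate(unauth_sims)
    ]
    if not averages:
        return False, None
    best = max(range(len(averages)), key=lambda i: averages[i])
    return averages[best] >= threshold, best


# Averages are (0.75 + 0.25) / 2 = 0.5 and (0.5 + 1.0) / 2 = 0.75.
verdict = judge_by_averaged_similarity([0.75, 0.5], [0.25, 1.0], [0, 1], 0.625)
```

In this example the second unauthorized document wins even though its own similarity is lower, because the inspected page also closely resembles the regular page that document imitates.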
- the unauthorized Web page detection device 4 may acquire a URL of a new unauthorized Web page or a legitimate Web page during operation, and calculate a feature vector corresponding to each Web page.
- The acquisition unit 431 obtains the unauthorized HTML document or regular HTML document by specifying the acquired URL, and registers the acquired URL and HTML document in the unauthorized Web page table or the regular Web page table.
- the preprocessing unit 432, the morphological analysis unit 433, and the vector calculation unit 434 execute the initial process of step S12 on the newly acquired HTML document, and calculate a feature vector.
- the unauthorized Web page detection device 4 can calculate the similarity between the feature vector of the inspection target HTML document and the feature vector of the new HTML document without causing the existing learning device to learn the new HTML document.
- The unauthorized Web page detection device 4 can execute the determination using the new HTML document without re-training the learning device on all of the existing HTML documents and the new HTML document, so the processing load can be reduced.
Landscapes
- Engineering & Computer Science (AREA)
- Computer Security & Cryptography (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Computer Hardware Design (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to an unauthorized Web page detection device, a control method for an unauthorized Web page detection device, and a control program that make it possible to determine with high accuracy whether a Web page is an unauthorized Web page. The unauthorized Web page detection device comprises: a storage unit that stores feature vectors based on the association states of a plurality of character strings in each of a plurality of unauthorized HTML documents constituting each of a plurality of unauthorized Web pages; an acquisition unit that acquires an inspection target HTML document constituting an inspection target Web page; a vector calculation unit that calculates a feature vector of the inspection target HTML document; a similarity calculation unit that calculates similarities between the feature vector of the inspection target HTML document and each of the feature vectors of the plurality of unauthorized HTML documents; a determination unit that determines, based on the calculated similarities and a threshold, whether the inspection target Web page is an unauthorized Web page; and a determination result output unit that outputs the result of the determination made by the determination unit.
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/JP2018/031993 WO2020044469A1 (fr) | 2018-08-29 | 2018-08-29 | Dispositif de détection de page web illicite, procédé de commande de dispositif de détection de page web illicite, et programme de commande |
| JP2020539928A JP7182764B2 (ja) | 2018-08-29 | 2018-08-29 | 不正Webページ検出装置、不正Webページ検出装置の制御方法及び制御プログラム |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/JP2018/031993 WO2020044469A1 (fr) | 2018-08-29 | 2018-08-29 | Dispositif de détection de page web illicite, procédé de commande de dispositif de détection de page web illicite, et programme de commande |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2020044469A1 (fr) | 2020-03-05 |
Family
ID=69643425
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/JP2018/031993 Ceased WO2020044469A1 (fr) | 2018-08-29 | 2018-08-29 | Dispositif de détection de page web illicite, procédé de commande de dispositif de détection de page web illicite, et programme de commande |
Country Status (2)
| Country | Link |
|---|---|
| JP (1) | JP7182764B2 (fr) |
| WO (1) | WO2020044469A1 (fr) |
Cited By (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN111597107A (zh) * | 2020-04-22 | 2020-08-28 | 北京字节跳动网络技术有限公司 | 信息输出方法、装置和电子设备 |
| KR20220080703A (ko) * | 2020-12-07 | 2022-06-14 | 주식회사 앰진시큐러스 | 스크립트 내 키워드 기반 웹 사이트의 유사도 평가 방법 |
| JP7138279B1 (ja) * | 2022-02-17 | 2022-09-16 | 株式会社ファイブドライブ | 通信システム、ゲートウェイ装置、端末装置及びプログラム |
| KR102595595B1 (ko) * | 2023-07-24 | 2023-10-31 | (주)에잇스니핏 | 웹사이트의 구조 정보를 이용한 불법·유해정보 사이트차단 방법 및 장치 |
| JP2025017319A (ja) * | 2023-07-24 | 2025-02-05 | エイトスニペット カンパニー リミテッド | ウェブサイトソース分析を用いた違法・有害情報サイト遮断方法及び装置並びにコンピュータプログラム |
| WO2025115968A1 (fr) * | 2023-11-29 | 2025-06-05 | 日本電信電話株式会社 | Dispositif de détermination, procédé de détermination, et programme de détermination |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JPH07319897A (ja) * | 1994-05-20 | 1995-12-08 | Canon Inc | 情報処理方法及び装置 |
| US20130086677A1 (en) * | 2010-12-31 | 2013-04-04 | Huawei Technologies Co., Ltd. | Method and device for detecting phishing web page |
| US20160352772A1 (en) * | 2015-05-27 | 2016-12-01 | Cisco Technology, Inc. | Domain Classification And Routing Using Lexical and Semantic Processing |
| US20180013789A1 (en) * | 2016-07-11 | 2018-01-11 | Bitdefender IPR Management Ltd. | Systems and Methods for Detecting Online Fraud |
-
2018
- 2018-08-29 WO PCT/JP2018/031993 patent/WO2020044469A1/fr not_active Ceased
- 2018-08-29 JP JP2020539928A patent/JP7182764B2/ja active Active
Cited By (10)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN111597107A (zh) * | 2020-04-22 | 2020-08-28 | 北京字节跳动网络技术有限公司 | 信息输出方法、装置和电子设备 |
| CN111597107B (zh) * | 2020-04-22 | 2023-04-28 | 北京字节跳动网络技术有限公司 | 信息输出方法、装置和电子设备 |
| KR20220080703A (ko) * | 2020-12-07 | 2022-06-14 | 주식회사 앰진시큐러스 | 스크립트 내 키워드 기반 웹 사이트의 유사도 평가 방법 |
| WO2022124573A1 (fr) * | 2020-12-07 | 2022-06-16 | 주식회사 앰진시큐러스 | Procédé d'évaluation de similarité de site web sur la base d'une structure de menu et d'un mot-clé dans un script |
| KR102705181B1 (ko) * | 2020-12-07 | 2024-09-11 | 주식회사 앰진 | 스크립트 내 키워드 기반 웹 사이트의 유사도 평가 방법 |
| JP7138279B1 (ja) * | 2022-02-17 | 2022-09-16 | 株式会社ファイブドライブ | 通信システム、ゲートウェイ装置、端末装置及びプログラム |
| WO2023157191A1 (fr) * | 2022-02-17 | 2023-08-24 | 株式会社ファイブドライブ | Système de communication, dispositif de passerelle, dispositif terminal, et programme |
| KR102595595B1 (ko) * | 2023-07-24 | 2023-10-31 | (주)에잇스니핏 | 웹사이트의 구조 정보를 이용한 불법·유해정보 사이트차단 방법 및 장치 |
| JP2025017319A (ja) * | 2023-07-24 | 2025-02-05 | エイトスニペット カンパニー リミテッド | ウェブサイトソース分析を用いた違法・有害情報サイト遮断方法及び装置並びにコンピュータプログラム |
| WO2025115968A1 (fr) * | 2023-11-29 | 2025-06-05 | 日本電信電話株式会社 | Dispositif de détermination, procédé de détermination, et programme de détermination |
Also Published As
| Publication number | Publication date |
|---|---|
| JPWO2020044469A1 (ja) | 2021-08-26 |
| JP7182764B2 (ja) | 2022-12-05 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| JP7182764B2 (ja) | 不正Webページ検出装置、不正Webページ検出装置の制御方法及び制御プログラム | |
| US20090285444A1 (en) | Web-Based Content Detection in Images, Extraction and Recognition | |
| CN109768992B (zh) | 网页恶意扫描处理方法及装置、终端设备、可读存储介质 | |
| JP2010516007A5 (fr) | ||
| CN104156490A (zh) | 基于文字识别检测可疑钓鱼网页的方法及装置 | |
| CN107957872A (zh) | 一种完整网站源码获取方法及非法网站检测方法、系统 | |
| CN108566399A (zh) | 钓鱼网站识别方法及系统 | |
| JP6936459B1 (ja) | 商標使用検知装置、商標使用検知方法及び商標使用検知プログラム | |
| CN115801455B (zh) | 一种基于网站指纹的仿冒网站检测方法及装置 | |
| CN111597490A (zh) | Web指纹识别方法、装置、设备及计算机存储介质 | |
| CN108270754B (zh) | 一种钓鱼网站的检测方法及装置 | |
| CN113992390A (zh) | 一种钓鱼网站的检测方法及装置、存储介质 | |
| US20110282978A1 (en) | Browser plug-in | |
| KR100917458B1 (ko) | 추천검색어 제공 방법 및 시스템 | |
| CN107786529B (zh) | 网站的检测方法、装置及系统 | |
| CN103390128A (zh) | 页面的标注方法、装置与终端设备 | |
| CN112417003A (zh) | 基于网络搜索的近义词挖掘方法、装置、设备及存储介质 | |
| CN104978423A (zh) | 网站类型的检测方法及装置 | |
| CN109657472B (zh) | Sql注入漏洞检测方法、装置、设备及可读存储介质 | |
| CN109495471B (zh) | 一种对web攻击结果判定方法、装置、设备及可读存储介质 | |
| WO2014203573A1 (fr) | Système d'analyse d'informations numériques, procédé d'analyse d'informations numériques et programme d'analyse d'informations numériques | |
| WO2017054716A1 (fr) | Procédé pour reconnaître un navigateur détourné et navigateur | |
| CN111382383A (zh) | 网页内容敏感类型确定方法、装置、介质和计算机设备 | |
| US20130230248A1 (en) | Ensuring validity of the bookmark reference in a collaborative bookmarking system | |
| CN110120898B (zh) | 远程网页资源变更监测及有害性检测识别方法 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 18931290; Country of ref document: EP; Kind code of ref document: A1 |
| | ENP | Entry into the national phase | Ref document number: 2020539928; Country of ref document: JP; Kind code of ref document: A |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 18931290; Country of ref document: EP; Kind code of ref document: A1 |