Detailed Description
The following description provides specific details of various embodiments of the disclosure so that those skilled in the art may fully understand and practice the various embodiments of the disclosure. It should be understood that the technical solutions of the present disclosure may be practiced without some of these details. In some instances, well-known structures or functions have not been shown or described in detail to avoid obscuring the description of embodiments of the present disclosure with such unnecessary description. The terminology used in the present disclosure should be understood in its broadest reasonable manner, even though it is being used in conjunction with a particular embodiment of the present disclosure.
First, some terms related to the embodiments of the present application will be described so as to be easily understood by those skilled in the art.
Hypertext Markup Language (HTML) is a standard markup language for creating web pages. HTML is parsed and rendered by a browser. It includes a series of tags by which the format of documents on the network can be unified, allowing distributed internet resources to be connected into a logical whole. HTML text is descriptive text composed of HTML tags that can specify text, graphics, animations, sounds, tables, links, and so on.
JavaScript, or JS for short, is a lightweight, interpreted or just-in-time compiled programming language with first-class functions. Its interpreter is called a JavaScript engine and is part of the browser. JavaScript is widely used as a client-side scripting language and is usually used in HTML pages to add dynamic functionality to them.
The Document Object Model (DOM) is a platform-independent and language-independent model that can be used to represent HTML documents. The DOM defines the logical structure of a document and the manner in which a program accesses and manipulates the document. When a web page is loaded, the browser automatically creates a DOM for the current page. In the DOM, all parts of the document (e.g., elements, attributes, text, etc.) are organized into a logical tree structure (similar to a family tree); the end of each branch in the tree is called a node, and each node is an object. Through the DOM, JavaScript can access, modify, delete, or add any content in the HTML document.
The shadow document object model is also known as the Shadow DOM. The Shadow DOM is a web specification that allows a subtree of DOM elements to be attached when the document is rendered, while this subtree is kept out of the main DOM tree. It allows browser developers to encapsulate their own HTML tags, CSS styles, and specific JavaScript code, and it also lets developers create custom tags similar to native tags such as <input>, <video>, and <audio>. The Shadow DOM is used to encapsulate a DOM tree and to isolate the internal structure of a component from external code. By means of the Shadow DOM, an independent DOM subtree can be created whose style and behavior do not affect the external DOM and are not affected by external styles and behavior. When creating a Shadow DOM, one of two modes, open or closed, can be selected. The open mode allows access to the Shadow DOM through JavaScript. The closed mode does not allow external JavaScript to access the Shadow DOM.
The host element is the container element of a Shadow DOM. It is an element in the regular DOM to which the Shadow DOM is attached. The host element may be a custom Web component, such as a custom tag, a video tag, or any other custom element.
Fig. 1 illustrates an exemplary application scenario 100 in which a technical solution according to an embodiment of the present disclosure may be implemented. As shown in fig. 1, the application scenario includes a terminal 110 and a server 120, the terminal 110 being communicatively coupled with the server 120 via a network 130.
Terminal 110 may be an intelligent terminal device with web page access capabilities. By way of example, a browser application 140 may be running on terminal 110 to access a web page. The terminal 110 and browser application 140 support HTML and JS scripts.
The browser application 140 may send a request to the server 120 to access the target web page based on the web address entered by the user. The server 120 stores various web page resources corresponding to the web address, such as a hypertext markup language file for the target web page and a corresponding JavaScript file, which are provided by a web page provider or a content provider. Based on the request, the server 120 may return the hypertext markup language file for the target web page and the corresponding JavaScript file to the browser application 140. The browser application 140 can obtain the information of the web page corresponding to the web address from the hypertext markup language file and render it, so that the content of the web page (such as text content and multimedia content such as icons and pictures) can be displayed to the user, and the corresponding JavaScript file (e.g., the scripts it contains) can be executed.
The JavaScript file may include, for example, an anti-crawler mechanism. By executing the anti-crawler mechanism, the browser application 140 may: obtain at least part of the text data to be protected in the hypertext markup language file and a host element for the protected text data; determine whether the code of an application programming interface method for creating a shadow document object model is native code; in response to that code being native code, use the application programming interface method to create the at least part of the text data to be protected into a shadow document object model in a closed mode; and mount the shadow document object model on the host element specified by a user. The host element and the shadow document object model may then be rendered for presentation to a user.
It should be noted that the terminal 110 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc. The server 120 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDNs), and big data and artificial intelligence platforms. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited herein. The network 130 may be, for example, a Wide Area Network (WAN), a Local Area Network (LAN), a wireless network, a public telephone network, an intranet, or any other type of network known to those skilled in the art.
The scenario described above is merely one example in which embodiments of the present disclosure may be implemented and is not limiting.
Fig. 2 illustrates a schematic flow diagram of a text data protection method 200 according to one embodiment of the present disclosure. The text data protection method may be implemented, for example, at the terminal 110 as shown in fig. 1, and specifically may be implemented, for example, at a browser on the terminal 110. As shown in fig. 2, the method 200 includes the following steps.
At step 210, an anti-crawler mechanism is obtained. The anti-crawler mechanism may be, for example, a predefined anti-crawler policy, which may be expressed as an anti-crawler policy function. In some embodiments, the anti-crawler mechanism may be obtained from a server of a web page provider or content provider. As an example, a request to access a target web page may first be sent, for example, to a server. Then, a hypertext markup language file for the target web page and a corresponding JavaScript file are obtained, e.g., from the server, wherein the JavaScript file includes the anti-crawler mechanism.
At step 220, the anti-crawler mechanism is executed. Specifically, executing the anti-crawler mechanism may implement the following steps 2201-2203.
At step 2201, at least part of the text data to be protected in the hypertext markup language file and a host element for the text data after protection processing are obtained. In the case that the anti-crawler mechanism is an anti-crawler policy function, these two items may serve as the two parameters of the anti-crawler policy function. The at least part of the text data to be protected may be determined by the provider of the web page, indicating that this text data is prohibited from being obtained by crawler software. The text data to be protected may be part or all of the text data in the hypertext markup language file, which is not limited herein.
In some embodiments, the at least part of the text data to be protected in the hypertext markup language file and the host element for the text data after protection processing may be included in the JavaScript file that is transmitted by the server to the browser. In this case, both can be obtained directly from the JavaScript file on the browser side, which can effectively reduce the complexity of communication and save network resources, although this is not limiting.
At step 2202, it is determined whether code of an application programming interface method used to create the shadow document object model is native code. The determination of whether the code of the application programming interface method used to create the shadow document object model is native code may be made in any suitable manner.
FIG. 3 illustrates a schematic diagram of a method 300 of determining whether code of an application programming interface method for creating a shadow document object model is native code. As shown in fig. 3, the method 300 includes steps 310-340. In step 310, the code of the application programming interface method for creating a shadow document object model is parsed into a character string to obtain parsed content. As an example, the toString method may be used to parse the code of the application programming interface method used to create the shadow document object model, and similarly to parse the native code described below. In JavaScript, a string is a common data type, and toString is a method that converts a value of another data type into a string; it can be applied to numbers, arrays, objects, functions, and the like. In step 320, it is determined whether the parsed content is identical to the content of the native code. In response to the parsed content being identical to the content of the native code, the code of the application programming interface method used to create the shadow document object model is determined to be native code at step 330. In response to the parsed content being different from the content of the native code, it is determined at step 340 that the code of the application programming interface method used to create the shadow document object model is not native code.
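The comparison of steps 310-340 can be sketched in JavaScript as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: the helper names expectedNativeSource and isNativeCode are hypothetical, and the sketch assumes that the engine prints a native function as "function name() { [native code] }", so that removing whitespace yields a stable reference string.

```javascript
// Reference content of the native code with whitespace removed.
// Assumption: the engine prints native functions as "function <name>() { [native code] }".
function expectedNativeSource(name) {
  return "function" + name + "(){[nativecode]}";
}

// Keep a reference to the built-in parser so the check does not go through fn.toString.
const nativeToString = Function.prototype.toString;

// Steps 310-340: parse the method into a string and compare it with the native form.
function isNativeCode(fn, name) {
  if (typeof fn !== "function") return false;
  const parsed = nativeToString.call(fn).replace(/\s+/g, ""); // step 310
  return parsed === expectedNativeSource(name); // steps 320-340
}

// In a browser the method under test would be Element.prototype.attachShadow.
// Array.prototype.push stands in for a native method so the sketch also runs outside a browser.
console.log(isNativeCode(Array.prototype.push, "push")); // true
console.log(isNativeCode(function push() { return 0; }, "push")); // false
```

In a browser, a call such as isNativeCode(Element.prototype.attachShadow, "attachShadow") would play the role of the determination at step 2202.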
In some embodiments, when parsing the code of the application programming interface method for creating a shadow document object model into a character string, the application programming interface method by which the host element creates a shadow document object model may first be obtained. For example, this method may be obtained using attachShadow and valueOf. In JavaScript, attachShadow is a method for creating a shadow DOM for a specified element (the host element), and valueOf is a method that returns the primitive value of the specified object. Then, it is checked whether the parsing method applied to the application programming interface method has been tampered with. For example, in the case of parsing with the toString method, it is checked whether the toString method has been tampered with. If the parsing method has not been tampered with, it is used to parse the code of the application programming interface method for creating the shadow document object model into a character string so as to obtain the parsed content. For example, if the toString method passes the verification, i.e., there is no tampering, the toString method is applied to the application programming interface method to obtain its content as a character string, and the blank spaces and line feed characters in the character string are removed to obtain the parsed content. If the parsed content is the same as the character-string content of the native code, it is determined that the application programming interface method has not been tampered with. Otherwise, the application programming interface method has been tampered with, rendering is stopped, and a user behavior log is reported, for example, to a server or an administrator. This can enhance the security of the text data and of the anti-crawler mechanism.
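The two-stage check described above (first the parsing method, then the application programming interface method) might look like the following sketch. The names stripWhitespace, isToStringIntact, and checkApiMethod, the report callback, and the shape of the reported object are all hypothetical and are introduced only for illustration; the disclosure does not specify how the tampering check on toString is performed, so the self-inspection used here is an assumption.

```javascript
// Remove blank spaces and line feed characters before comparing.
function stripWhitespace(text) {
  return text.replace(/\s+/g, "");
}

// Check whether the parsing method (toString) itself still looks native.
function isToStringIntact() {
  const parse = Function.prototype.toString;
  // An own "toString" property on the parser would shadow the built-in behavior.
  if (Object.prototype.hasOwnProperty.call(parse, "toString")) return false;
  try {
    return stripWhitespace(parse.call(parse)) === "functiontoString(){[nativecode]}";
  } catch (err) {
    return false;
  }
}

// Returns true when `method` may be used; otherwise reports and returns false.
// `report` is a hypothetical callback, e.g. one that sends a log entry to a server.
function checkApiMethod(method, name, report) {
  if (!isToStringIntact()) {
    report({ reason: "parsing method tampered", method: name });
    return false;
  }
  const parsed =
    typeof method === "function"
      ? stripWhitespace(Function.prototype.toString.call(method))
      : "";
  if (parsed !== "function" + name + "(){[nativecode]}") {
    report({ reason: "API method tampered", method: name });
    return false; // the caller is expected to stop rendering
  }
  return true;
}
```

A replacement toString that deliberately returns the native-looking string would pass this self-inspection, so the sketch should be read as one possible layer of the check rather than a complete defense.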
At step 2203, in response to the code of the application programming interface method used to create the shadow document object model being native code, the at least part of the text data to be protected in the hypertext markup language file is created into a shadow document object model in a closed mode using the application programming interface method, and the shadow document object model in the closed mode is mounted on a host element specified by a user. The shadow document object model in the closed mode does not allow access by external JavaScript, so the text data to be protected can be effectively protected. Mounting the shadow document object model in the closed mode on the host element specified by the user is the basis for the subsequent rendering of the text data by the browser. In some embodiments, in response to the code of the application programming interface method used to create the shadow document object model not being native code, the identity information of the current browser and the host element information are collected and may be reported as crawler information, which can enhance the security of the text data.
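Step 2203 can be illustrated with the following sketch of an anti-crawler policy function. The function name protectText, the injected isNative and report callbacks, and the fields of the reported object are hypothetical; only attachShadow with mode "closed" and the textContent property are standard DOM features. The native-code check is passed in as a parameter so that the sketch does not depend on any particular way of performing it.

```javascript
// Hypothetical anti-crawler policy function with the two parameters described above:
// `text` is the data to protect and `host` is the element that will carry it.
function protectText(text, host, isNative, report) {
  const attach = host.attachShadow;
  if (!isNative(attach)) {
    // Not native code: collect browser identity and host element information, then report.
    report({
      userAgent: typeof navigator !== "undefined" ? navigator.userAgent : "unknown",
      host: host.tagName,
    });
    return null;
  }
  // Closed mode: host.shadowRoot stays null, so page scripts cannot reach the subtree via the host.
  const root = attach.call(host, { mode: "closed" });
  root.textContent = text; // the protected text lives only inside the shadow tree
  return root; // the caller keeps this reference private; it is the only handle to the subtree
}
```

In a browser, host would typically be an element obtained from the page (for example via document.querySelector), and isNative would be a function that compares the string form of attachShadow with the native form.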
At step 230, the host element and the shadow document object model are rendered. The rendering result obtained by rendering the host element and the shadow document object model can be presented in a browser. In some embodiments, the rendering result obtained by rendering the host element and the shadow document object model may first be checked. If the host element of the shadow document object model is not the host element of the text data to be protected, the mode of the shadow document object model is not the closed mode, or the content of the shadow document object model is not the at least part of the text data to be protected in the hypertext markup language file, the rendering result is deleted and is not presented. If the host element of the shadow document object model is the host element of the text data to be protected, the mode of the shadow document object model is the closed mode, and the content of the shadow document object model is the at least part of the text data to be protected in the hypertext markup language file, the rendering result is presented. Alternatively or additionally, if the host element of the shadow document object model is not the host element of the text data to be protected, the mode of the shadow document object model is not the closed mode, or the content of the shadow document object model is not the at least part of the text data to be protected in the hypertext markup language file, a crawler is considered to be cracking the anti-crawler mechanism or policy, and the creation and mounting of the shadow document object model may be terminated. If the content has already been mounted, the content in the host element is deleted immediately, so that the crawler cannot obtain any content. The identity information of the current browser system and the host element information are then collected, and the crawler information is reported. This can enhance the security of the text data.
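The three conditions checked before presentation (host element, mode, and content) can be sketched as follows. The function names verifyShadowRoot and presentOrDiscard and the report callback are hypothetical. The sketch assumes that the script has kept the reference returned when the shadow document object model was created, since a closed shadow root cannot be retrieved from the host element afterwards; host, mode, and textContent are standard properties of a shadow root.

```javascript
// `root` is the reference that was returned when the shadow DOM was created.
function verifyShadowRoot(root, expectedHost, expectedText) {
  return (
    root.host === expectedHost && // mounted on the intended host element
    root.mode === "closed" && // still in the closed mode
    root.textContent === expectedText // content is the protected text
  );
}

// Returns true when the result may be presented; otherwise deletes the
// mounted content, reports, and returns false.
function presentOrDiscard(root, expectedHost, expectedText, report) {
  if (verifyShadowRoot(root, expectedHost, expectedText)) return true;
  root.textContent = ""; // delete whatever was mounted so nothing can be read
  report({ host: expectedHost.tagName, mode: root.mode });
  return false;
}
```

Clearing textContent is only one way to delete the mounted content; removing the host element from the page would be another option.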
In the text data protection method claimed in the present disclosure, an anti-crawler mechanism is first obtained, and at least part of the text data to be protected in a hypertext markup language file and a host element for the text data after protection processing are obtained as inputs of the anti-crawler mechanism. When it is determined that the code of the application programming interface method for creating a shadow document object model is native code, the anti-crawler mechanism uses the application programming interface method to create the at least part of the text data to be protected into a shadow document object model in a closed mode, and mounts the shadow document object model in the closed mode on the host element specified by a user for rendering. In this way, whether the code of the application programming interface method for creating the shadow document object model is native code is monitored in real time, which prevents an external crawler from tampering with the application programming interface method and enhances security. When that code is native code, the at least part of the text data to be protected is dynamically created into a shadow document object model in a closed mode, which prevents a crawler from obtaining the real text data. Because the shadow document object model in the closed mode is mounted on the host element specified by the user for rendering, the target that a crawler is trying to crack can be accurately identified when a crawler is present.
Fig. 4 illustrates an exemplary block diagram of a text data protection device 400 according to one embodiment of the present disclosure. As shown in fig. 4, the text data protection device 400 includes an acquisition module 410, an execution module 420, and a rendering module 430.
The acquisition module 410 is configured to obtain an anti-crawler mechanism. The anti-crawler mechanism may be, for example, a predefined anti-crawler policy, which may be expressed as an anti-crawler policy function. In some embodiments, the anti-crawler mechanism may be obtained from a server of a web page provider or content provider. As an example, a request to access a target web page may first be sent, for example, to a server. Then, a hypertext markup language file for the target web page and a corresponding JavaScript file are obtained, e.g., from the server, wherein the JavaScript file includes the anti-crawler mechanism.
The execution module 420 is configured to execute the anti-crawler mechanism to implement the following steps: obtaining at least part of the text data to be protected in a hypertext markup language file and a host element for the text data after protection processing; determining whether the code of an application programming interface method for creating a shadow document object model is native code; in response to that code being native code, creating the at least part of the text data to be protected in the hypertext markup language file into a shadow document object model in a closed mode using the application programming interface method; and mounting the shadow document object model in the closed mode on the host element specified by a user.
In some embodiments, the at least part of the text data to be protected in the hypertext markup language file and the host element for the text data after protection processing may be included in the JavaScript file that is transmitted by the server to the browser. In this case, both can be obtained directly from the JavaScript file on the browser side, which can effectively reduce the complexity of communication and save network resources, although this is not limiting.
Rendering module 430 is configured to render the host element and shadow document object model. Rendering results obtained by rendering the host element and the shadow document object model can be presented in a browser. In some embodiments, the rendering result obtained by rendering the host element and the shadow document object model may be determined, and then, if the host element of the shadow document object model is not the host element of the text data to be protected, the mode of the shadow document object model is not a closed mode, or the content of the shadow document object model is not the at least part of the text data to be protected in the hypertext markup language file, the rendering result is deleted and the rendering result is not presented.
In the text data protection device claimed in the present disclosure, an anti-crawler mechanism is first obtained, and the execution module then obtains at least part of the text data to be protected in a hypertext markup language file and a host element for the text data after protection processing as inputs of the anti-crawler mechanism. When it is determined that the code of the application programming interface method for creating a shadow document object model is native code, the anti-crawler mechanism uses the application programming interface method to create the at least part of the text data to be protected into a shadow document object model in a closed mode, and mounts the shadow document object model in the closed mode on the host element specified by a user for rendering. In this way, whether the code of the application programming interface method for creating the shadow document object model is native code is monitored in real time, which prevents an external crawler from tampering with the application programming interface method and enhances security. When that code is native code, the at least part of the text data to be protected is dynamically created into a shadow document object model in a closed mode, which prevents a crawler from obtaining the real text data. Because the shadow document object model in the closed mode is mounted on the host element specified by the user for rendering, the target that a crawler is trying to crack can be accurately identified when a crawler is present.
FIG. 5 illustrates an example system 500 that includes an example computing device 510 that represents one or more systems and/or devices that can implement the various techniques described herein. Computing device 510 may be, for example, a server of a service provider, a device associated with a server, a system-on-chip, and/or any other suitable computing device or computing system. The text data protection device 400 described above with reference to fig. 4 may take the form of a computing device 510. Alternatively, the text data protection device 400 may be implemented as a computer program in the form of an application 516.
The example computing device 510 as illustrated includes a processing system 511, one or more computer-readable media 512, and one or more I/O interfaces 513 communicatively coupled to each other. Although not shown, computing device 510 may also include a system bus or other data and command transfer system that couples the various components to one another. A system bus may include any one or combination of different bus structures, such as a memory bus or memory controller, a peripheral bus, a universal serial bus, and/or a processor or local bus that utilizes any of a variety of bus architectures. Various other examples are also contemplated, such as control and data lines.
Processing system 511 represents functionality that performs one or more operations using hardware. Thus, the processing system 511 is illustrated as including hardware elements 514 that may be configured as processors, functional blocks, and the like. This may include implementation in hardware as application specific integrated circuits or other logic devices formed using one or more semiconductors. The hardware element 514 is not limited by the material from which it is formed or the processing mechanism employed therein. For example, the processor may be comprised of semiconductor(s) and/or transistors (e.g., electronic Integrated Circuits (ICs)). In such a context, the processor-executable instructions may be electronically-executable instructions.
Computer-readable medium 512 is illustrated as including memory/storage 515. Memory/storage 515 represents memory/storage capacity associated with one or more computer-readable media. Memory/storage 515 may include volatile media (such as Random Access Memory (RAM)) and/or nonvolatile media (such as Read Only Memory (ROM), flash memory, optical disks, magnetic disks, and so forth). Memory/storage 515 may include fixed media (e.g., RAM, ROM, a fixed hard drive, etc.) and removable media (e.g., flash memory, a removable hard drive, an optical disk, and so forth). The computer-readable medium 512 may be configured in a variety of other ways as described further below.
One or more I/O interfaces 513 represent functionality that allows a user to input commands and information to the computing device 510 using various input devices, and optionally also allows information to be presented to the user and/or other components or devices using various output devices. Examples of input devices include keyboards, cursor control devices (e.g., mice), microphones (e.g., for voice input), scanners, touch functions (e.g., capacitive or other sensors configured to detect physical touches), cameras (e.g., motion that does not involve touches may be detected as gestures using visible or invisible wavelengths such as infrared frequencies), and so forth. Examples of output devices include a display device (e.g., a monitor or projector), speakers, a printer, a network card, a haptic response device, and so forth. Accordingly, the computing device 510 may be configured in a variety of ways to support user interaction, as described further below.
Computing device 510 also includes application 516. Application 516 may be, for example, a software instance of the text data protection device 400 and implement the techniques described herein in combination with other elements in computing device 510.
Various techniques may be described herein in the general context of software, hardware elements, or program modules. Generally, these modules include routines, programs, objects, elements, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The terms "module," "functionality," and "component" as used herein generally represent software, firmware, hardware, or a combination thereof. The features of the techniques described herein are platform-independent, meaning that the techniques may be implemented on a variety of computing platforms having a variety of processors.
An implementation of the described modules and techniques may be stored on or transmitted across some form of computer readable media. Computer-readable media can include a variety of media that are accessible by computing device 510. By way of example, and not limitation, computer readable media may comprise "computer readable storage media" and "computer readable signal media".
"Computer-readable storage medium" refers to a medium and/or device capable of persistently storing information, and/or a tangible storage device, as opposed to mere signal transmission, a carrier wave, or a signal itself. Thus, computer-readable storage media refers to non-signal bearing media. Computer-readable storage media include hardware such as volatile and nonvolatile, removable and non-removable media and/or storage devices implemented in methods or techniques suitable for storage of information such as computer-readable instructions, data structures, program modules, logic elements/circuits, or other data. Examples of a computer-readable storage medium may include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical storage, hard disks, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or other storage devices, tangible media, or articles of manufacture adapted to store the desired information and which may be accessed by a computer.
"Computer-readable signal medium" refers to a signal bearing medium configured to transmit instructions to hardware of computing device 510, such as via a network. Signal media may typically be embodied in computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave, data signal, or other transport mechanism. Signal media also include any information delivery media. The term "modulated data signal" means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media.
As previously described, the hardware elements 514 and computer-readable media 512 represent instructions, modules, programmable device logic, and/or fixed device logic implemented in hardware that may be used in some embodiments to implement at least some aspects of the techniques described herein. The hardware elements may include integrated circuits or components of a system on a chip, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs), Complex Programmable Logic Devices (CPLDs), and other implementations in silicon or other hardware devices. In this context, the hardware elements may be implemented as processing devices that perform program tasks defined by instructions, modules, and/or logic embodied by the hardware elements, as well as hardware devices that store instructions for execution, such as the previously described computer-readable storage media.
Combinations of the foregoing may also be used to implement the various techniques and modules described herein. Accordingly, software, hardware, or program modules, and other program modules may be implemented as one or more instructions and/or logic embodied on some form of computer-readable storage medium and/or by one or more hardware elements 514. Computing device 510 may be configured to implement particular instructions and/or functions corresponding to software and/or hardware modules. Thus, for example, by using the computer-readable storage medium of the processing system and/or the hardware element 514, a module may be implemented at least in part in hardware as a module executable by the computing device 510 as software. The instructions and/or functions may be executable/operable by one or more articles of manufacture (e.g., one or more computing devices 510 and/or processing systems 511) to implement the techniques, modules, and examples described herein.
In various implementations, the computing device 510 may take on a variety of different configurations. For example, computing device 510 may be implemented as a computer-type device, including a personal computer, desktop computer, multi-screen computer, laptop computer, netbook, and the like. Computing device 510 may also be implemented as a mobile-type device, including mobile phones, portable music players, portable gaming devices, tablet computers, multi-screen computers, and the like. Computing device 510 may also be implemented as a television-type device, including devices that have or are connected to generally larger screens in casual viewing environments. Such devices include televisions, set-top boxes, game consoles, and the like.
The techniques described herein may be supported by these various configurations of computing device 510 and are not limited to the specific examples of techniques described herein. The functionality may also be implemented in whole or in part on the "cloud" 520 through the use of a distributed system, such as through the platform 522 as described below.
Cloud 520 includes and/or represents a platform 522 for resources 524. Platform 522 abstracts underlying functionality of hardware (e.g., servers) and software resources of cloud 520. The resources 524 may include applications and/or data that may be used when executing computer processing on a server remote from the computing device 510. The resources 524 may also include services provided over the internet and/or over subscriber networks such as cellular or Wi-Fi networks.
Platform 522 may abstract resources and functionality to connect computing device 510 with other computing devices. Platform 522 may also be used to abstract the scaling of resources so as to provide a level of scale corresponding to the encountered demand for the resources 524 implemented via platform 522. Thus, in an interconnected device embodiment, implementation of the functionality described herein may be distributed throughout system 500. For example, the functionality may be implemented in part on computing device 510 and in part by platform 522, which abstracts the functionality of cloud 520.
The present disclosure provides a computer-readable storage medium having stored thereon computer-readable instructions that, when executed, implement any of the methods described above.
The present disclosure provides a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. A processor of a computing device reads the computer instructions from the computer-readable storage medium and executes them, causing the computing device to perform any of the methods provided in the various alternative implementations described above.
It should be understood that for clarity, embodiments of the present disclosure have been described with reference to different functional units. However, it will be apparent that the functionality of each functional unit may be implemented in a single unit, in a plurality of units or as part of other functional units without departing from the present disclosure. For example, functionality illustrated to be performed by a single unit may be performed by multiple different units. Thus, references to specific functional units are only to be seen as references to suitable units for providing the described functionality rather than indicative of a strict logical or physical structure or organization. Thus, the present disclosure may be implemented in a single unit or may be physically and functionally distributed between different units and circuits.
It will be understood that, although the terms first, second, third, etc. may be used herein to describe various devices, elements, components or sections, these devices, elements, components or sections should not be limited by these terms. These terms are only used to distinguish one device, element, component, or section from another device, element, component, or section.
Although the present disclosure has been described in connection with some embodiments, it is not intended to be limited to the specific form set forth herein. Rather, the scope of the present disclosure is limited only by the appended claims. Additionally, although individual features may be included in different claims, these may possibly be advantageously combined, and the inclusion in different claims does not imply that a combination of features is not feasible and/or advantageous. The order of features in the claims does not imply any specific order in which the features must be performed. Furthermore, in the claims, the word "comprising" does not exclude other elements, and the term "a" or "an" does not exclude a plurality. Reference signs in the claims are provided merely as a clarifying example and shall not be construed as limiting the scope of the claims in any way.