
CN117395236A - HTTP proxy service method and system - Google Patents

HTTP proxy service method and system

Info

Publication number
CN117395236A
CN117395236A (application CN202311281370.2A)
Authority
CN
China
Prior art keywords
http proxy
proxy service
request
service
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311281370.2A
Other languages
Chinese (zh)
Inventor
聂彦超
邱春武
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sina Technology China Co Ltd
Original Assignee
Sina Technology China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sina Technology China Co Ltd filed Critical Sina Technology China Co Ltd
Priority to CN202311281370.2A priority Critical patent/CN117395236A/en
Publication of CN117395236A publication Critical patent/CN117395236A/en
Pending legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/01 Protocols
    • H04L 67/02 Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G06F 16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G06F 16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F 16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 63/00 Network architectures or network communication protocols for network security
    • H04L 63/04 Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks
    • H04L 63/0428 Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks wherein the data content is protected, e.g. by encrypting or encapsulating the payload
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 63/00 Network architectures or network communication protocols for network security
    • H04L 63/10 Network architectures or network communication protocols for network security for controlling access to devices or network resources
    • H04L 63/101 Access control lists [ACL]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/50 Network services
    • H04L 67/51 Discovery or management thereof, e.g. service location protocol [SLP] or web services
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/50 Network services
    • H04L 67/56 Provisioning of proxy services
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 69/00 Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L 69/04 Protocols for data compression, e.g. ROHC

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer And Data Communications (AREA)

Abstract

The embodiments of the invention provide a method and a system for HTTP proxy service. The method includes: receiving a first request, initiated by a user, to use the HTTP proxy service as a crawler, and allocating an HTTP proxy service to the first request as the target HTTP proxy service; the target HTTP proxy service converts the first request into a second request based on the HTTP protocol and forwards the second request to the target server corresponding to each target URL, and, in the process of fetching the web page data corresponding to each target URL from the target server, takes the web page data that satisfies the proxy rules, together with the corresponding target URLs, as the response data of the target server; and returning the response data to the user. After the first request to use the HTTP proxy service as a crawler is received from the user, reasonably and dynamically scheduling the existing HTTP proxy services can provide an optimized proxy for the crawler, reduce the time the crawler waits when accessing target URLs, and thereby improve the efficiency with which the crawler fetches web page data.

Description

HTTP proxy service method and system
Technical Field
The invention relates to the field of the Internet, and in particular to a method and a system for HTTP proxy service.
Background
The Internet contains a huge number of resources, and crawler systems arose in order to extract and use these resources effectively. A crawler is a basic component of search-engine technology. Starting from the URLs (Uniform Resource Locators) of one or more initial web pages, the crawler obtains the URLs on those pages and, while fetching web page data, keeps extracting new URLs from the current page and placing them in a queue according to a preset crawling strategy until a certain stop condition is met; the fetched web page data are then stored in the search engine's servers so that user searches can be served faster.
A crawler system is used to automatically crawl specific resources from the Internet. It starts crawling from one URL or a batch of uniform resource locators (URLs) and extracts new URLs from the acquired web resources according to predetermined rules, adding them to the crawl queue until some stop condition is met.
Many websites restrict frequent access by crawler systems and enable anti-crawler techniques; for example, when a website detects that an IP address has accessed it more than a certain number of times within a specified period, requests from that IP address are rejected or redirected to a CAPTCHA page. For this reason, crawler systems often employ distributed techniques, such as using multiple IP addresses to simulate the behavior of real users. When IP address resources are limited, some crawler systems simulate requests from multiple IP addresses by using proxies.
In carrying out the present invention, the applicant has found that at least the following problems exist in the prior art:
the number of proxies is very small compared with the massive number of network resources requested by the crawler system. Therefore, how to use fast and stable proxies to improve the efficiency with which the crawler system fetches network resources is a problem that currently needs to be solved.
Disclosure of Invention
The embodiments of the invention provide a method and a system for HTTP proxy service, which can solve the technical problem in the prior art that a crawler system has difficulty coping with the massive number of network resources corresponding to its requests.
To achieve the above object, in one aspect, an embodiment of the present invention provides a method for HTTP proxy service, including:
receiving a first request initiated by a user and using an HTTP proxy service as a crawler, and distributing an HTTP proxy service as a target HTTP proxy service for the first request; wherein each HTTP proxy service independently operates in a containerized deployed virtual environment;
the target HTTP proxy service converts the first request into a second request based on the HTTP protocol, forwards the second request to a target server corresponding to each target URL, and takes the webpage data meeting the proxy rule and the corresponding target URL as response data of the target server in the process that the target server grabs the webpage data corresponding to each target URL; wherein the target URL is a URL for which the corresponding web page data satisfies the first request;
And returning the response data to the user.
In another aspect, an embodiment of the present invention provides a system for HTTP proxy service, including an API gateway, a registry service center, and at least one HTTP proxy service, where:
the API gateway is used for receiving a first request initiated by a user and using the HTTP proxy service as a crawler and returning response data to the user;
the registration service center is configured to allocate an HTTP proxy service as a target HTTP proxy service for the first request; wherein each HTTP proxy service independently operates in a containerized deployed virtual environment;
the target HTTP proxy service is used for converting the first request into a second request based on an HTTP protocol, forwarding the second request to a target server corresponding to each target URL, and taking the webpage data meeting the proxy rule and the corresponding target URL as response data of the target server in the process that the target server grabs the webpage data corresponding to each target URL; wherein the target URL is a URL for which the corresponding web page data satisfies the first request.
The technical scheme has the following beneficial effects: after receiving a first request initiated by a user and using the HTTP proxy service as a crawler, reasonably and dynamically scheduling a plurality of existing HTTP proxy services to determine a target HTTP proxy service, providing an optimized proxy for the crawler, reducing waiting time of the crawler for accessing a target URL, effectively avoiding limitation of a preset access time interval of a website, improving efficiency of the crawler for accessing the target URL, and further improving efficiency of the crawler for capturing webpage data.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method of a first HTTP proxy service according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method of a second HTTP proxy service according to an embodiment of the present invention;
FIG. 3 is a flow chart of a method of a third HTTP proxy service according to an embodiment of the present invention;
fig. 4 is a block diagram of a system of a first HTTP proxy service according to an embodiment of the present invention;
FIG. 5 is a block diagram of a system for a second HTTP proxy service according to an embodiment of the present invention;
FIG. 6 is a block diagram of a system for a third HTTP proxy service according to an embodiment of the present invention;
FIG. 7 is a block diagram of a system for a fourth HTTP proxy service according to an embodiment of the present invention;
fig. 8 is a flow chart of a method of a fourth HTTP proxy service according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Container technology: Docker is a lightweight, container-based virtualization technology that can quickly build, deploy, and manage applications.
Kubernetes: a container orchestration platform that enables automatic deployment, scaling, and management of containers. Container technology effectively partitions the resources of a single operating system into isolated groups so that conflicting resource demands can be better balanced among them.
Service registration and discovery: a service mesh solution that provides functions such as service registration and discovery, distributed configuration management, and so on.
HTTP proxy: a proxy server connected between the browser and the web server. The browser no longer fetches web pages directly from the web server; instead it sends its request to the proxy server, the proxy server sends the request on to the web server, and the data returned by the web server is relayed back to the browser.
CI/CD: a method of delivering applications to customers frequently by introducing automation into the application development stages. The core concepts of CI/CD are continuous integration, continuous delivery, and continuous deployment.
IP address: an address specific to the Internet Protocol; a unified address format provided by the IP protocol. An IP address assigns a logical address to every network and every host on the Internet, thereby masking differences in physical addresses.
Port: in the physical sense, an interface for connecting other network devices, for example on an ADSL modem, hub, or switch, such as an RJ-45 port or an SC port; in the logical sense, generally a port in the TCP/IP protocol, with port numbers ranging from 0 to 65535, such as port 80 for web browsing and port 21 for FTP services.
IP validity period: the effective duration of an HTTP proxy resource.
As shown in fig. 1, in combination with an embodiment of the present invention, there is provided a method for HTTP proxy service, including:
s101: receiving a first request initiated by a user and using an HTTP proxy service as a crawler, and distributing an HTTP proxy service as a target HTTP proxy service for the first request; wherein each HTTP proxy service independently operates in a containerized deployed virtual environment;
s102: the target HTTP proxy service converts the first request into a second request based on the HTTP protocol, forwards the second request to a target server corresponding to each target URL, and takes the webpage data meeting the proxy rule and the corresponding target URL as response data of the target server in the process that the target server grabs the webpage data corresponding to each target URL; wherein the target URL is a URL for which the corresponding web page data satisfies the first request;
S103: and returning the response data to the user.
After receiving a first request initiated by a user and using the HTTP proxy service as a crawler, reasonably and dynamically scheduling a plurality of existing HTTP proxy services to determine a target HTTP proxy service, providing an optimized proxy for the crawler, reducing waiting time of the crawler for accessing a target URL, effectively avoiding limitation of a preset access time interval of a website, improving efficiency of the crawler for accessing the target URL, and further improving efficiency of the crawler for capturing webpage data.
Preferably, as shown in fig. 2, the method for HTTP proxy service according to the embodiment of the present invention further includes:
S104: receiving, through the service registration center, registration information for the HTTP proxy services provided by at least one proxy service provider; determining that HTTP proxy services which pass account verification and whose IPs are on the IP whitelist pass registration verification; taking the HTTP proxy services that pass registration verification as registered HTTP proxy services; and assigning a unique identifier to each proxy service provider. The registration process associates the proxy servers, and the HTTP proxy services they provide, with the system, and assigns each proxy server a unique identifier after registration is passed, for differentiation and management.
S105: storing and maintaining the registration information and running-state information of all registered HTTP proxy services based on the heartbeat information periodically transmitted by each registered HTTP proxy service. Heartbeat information refers to signals periodically sent by a system or application, similar to a biological heartbeat, used to confirm the activity and availability of the system. It includes service link state information (representing network reachability) and the proxy state (representing the operational state of the proxy service, normal or abnormal), which indicate that the service is running normally and is connected to other components, and confirm its active state in the service list.
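As an illustration of the heartbeat mechanism described in S105, the following Go sketch shows a proxy instance periodically reporting its link state and proxy state to a registry endpoint. The endpoint path, field names, and interval are assumptions made for the example, not details taken from the invention.

```go
package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
	"time"
)

// Heartbeat carries the two state fields described above; the JSON field
// names are illustrative assumptions, not a wire format from the patent.
type Heartbeat struct {
	ProxyID    string `json:"proxy_id"`
	LinkState  string `json:"link_state"`  // network reachability, e.g. "reachable"
	ProxyState string `json:"proxy_state"` // "normal" or "abnormal"
	Timestamp  int64  `json:"timestamp"`
}

func sendHeartbeats(registryURL, proxyID string, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		hb := Heartbeat{
			ProxyID:    proxyID,
			LinkState:  "reachable",
			ProxyState: "normal",
			Timestamp:  time.Now().Unix(),
		}
		body, _ := json.Marshal(hb)
		resp, err := http.Post(registryURL+"/heartbeat", "application/json", bytes.NewReader(body))
		if err != nil {
			log.Printf("heartbeat failed: %v", err) // registry may mark this instance inactive
			continue
		}
		resp.Body.Close()
	}
}

func main() {
	// Hypothetical registry address and a 10-second reporting period.
	sendHeartbeats("http://registry.internal:8500", "proxy-0001", 10*time.Second)
}
```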
Preferably, S101 (receiving the first request initiated by the user to use the HTTP proxy service as a crawler and distributing an HTTP proxy service as the target HTTP proxy service for the first request) specifically comprises the following steps:
receiving a first request initiated by a user and using an HTTP proxy service as a crawler through an API gateway, distributing a registered HTTP proxy service as a target HTTP proxy service for the first request based on a load balancing strategy and registration information of each HTTP proxy service, and forwarding the first request to the target HTTP proxy service; dynamic scheduling of HTTP proxy services can be achieved through policies of service registry and load balancing.
S103: the step of returning the response data to the user specifically comprises the following steps:
and receiving response data from the target server returned by the target HTTP proxy service through an API gateway, processing the response data into set format data, and returning the set format data to the user.
Receiving, through an API gateway, a first request initiated by a user to use an HTTP proxy service as a crawler; receiving the response data from the target server returned by the target HTTP proxy service; processing the response data into data in the set format; and returning the set-format data to the user: unified management of data access is thereby realized.
Preferably, as shown in fig. 3, the method for HTTP proxy service according to the embodiment of the present invention further includes:
s106: when the API gateway receives the first request, creating a tracking context of the first request, wherein the tracking context comprises a request ID of the first request and an initial span ID, and the tracking context is transferred along with the processing flow of the first request;
s107: creating a corresponding span and span ID for each processing step in the processing flow of the first request, and collecting span data of each span, wherein the span data comprises: start time, end time, process annotation of span; the span is a unit for measuring and recording time and events of each processing step in the processing flow;
S108: service link tracking information is formed, including the request ID, all span IDs, and span data.
By recording and analyzing the information of each request processed by the HTTP proxy services, potential performance bottlenecks can be rapidly located, which facilitates subsequent performance analysis and troubleshooting.
Preferably, the method for HTTP proxy service according to the embodiment of the present invention further includes:
s109: detecting each HTTP proxy service, periodically collecting performance index data and running state information of each HTTP proxy service, and displaying and analyzing the performance index data and the running state information;
s110: when a failed HTTP proxy service is detected, a new HTTP proxy service is created and the failed HTTP proxy service is replaced with the newly created HTTP proxy service.
The running condition of the whole system can thus be monitored in real time. When an HTTP proxy service is detected to have failed, the failed service can be automatically removed and a new HTTP proxy service created for automatic restart and replacement, which provides good stability.
Preferably, the method for HTTP proxy service according to the embodiment of the present invention further includes:
s120: when the configuration information of any HTTP proxy service needs to be modified, a unified configuration interface is adopted to modify the corresponding configuration item, and a configuration change event is issued; wherein the configuration change event comprises: a changed configuration item and a changed configuration value;
S130: after the HTTP proxy service monitors the configuration change event of the HTTP proxy service, the HTTP proxy service acquires a changed configuration value and updates corresponding configuration information in an atomic operation mode according to the changed configuration value; the atomic operation mode refers to: and continuing to use the configuration value before modification for the first request which is being processed by the HTTP proxy service until the first request is processed, and using the configuration value after modification for a new first request.
Since the update of the configuration values is an atomic operation, a first request that is already being processed is not affected: it is completed with the configuration in effect before the change, while new first requests use the new configuration. Adopting this configuration management enables flexible control and optimization of the entire HTTP proxy service system.
As shown in fig. 4, in connection with an embodiment of the present invention, there is provided a system for HTTP proxy service, including an API gateway 21, a registry service center 22, at least one HTTP proxy service 23, wherein:
the API gateway 21 is configured to receive a first request initiated by a user and using the HTTP proxy service as a crawler, and return response data to the user;
The registry 22 is configured to allocate an HTTP proxy service 23 as a target HTTP proxy service for the first request; wherein each HTTP proxy service independently operates in a containerized deployed virtual environment;
the target HTTP proxy service is used for converting the first request into a second request based on an HTTP protocol, forwarding the second request to a target server corresponding to each target URL, and taking the webpage data meeting the proxy rule and the corresponding target URL as response data of the target server in the process that the target server grabs the webpage data corresponding to each target URL; wherein the target URL is a URL for which the corresponding web page data satisfies the first request.
By reasonably and dynamically scheduling the existing HTTP proxy services, an optimized proxy can be provided for the crawler, the time the crawler waits when accessing target URLs is reduced, the access-interval limits preset by websites can be effectively avoided, the efficiency with which the crawler accesses target URLs is improved, and the efficiency with which the crawler fetches web page data is further improved.
Preferably, the service registry 22 is further configured to:
receiving registration information for the HTTP proxy services provided by at least one proxy service provider; determining that HTTP proxy services which pass account verification and whose IPs are on the IP whitelist pass registration verification; taking the HTTP proxy services that pass registration verification as registered HTTP proxy services; and assigning a unique identifier to each proxy service provider. The registration process associates the proxy servers, and the HTTP proxy services they provide, with the system, and assigns each proxy server a unique identifier after registration is passed, for differentiation and management.
Storing and maintaining the registration information and running-state information of all registered HTTP proxy services based on the heartbeat information periodically transmitted by each registered HTTP proxy service. Heartbeat information refers to signals periodically sent by a system or application, similar to a biological heartbeat, used to confirm the activity and availability of the system. It includes service link state information (representing network reachability) and the proxy state (representing the operational state of the proxy service, normal or abnormal), which indicate that the service is running normally and is connected to other components, and confirm its active state in the service list.
Preferably, the API gateway is specifically configured to receive a first request initiated by a user and using an HTTP proxy service as a crawler, allocate a registered HTTP proxy service as a target HTTP proxy service to the first request based on a load balancing policy and registration information of each HTTP proxy service, and forward the first request to the target HTTP proxy service; dynamic scheduling of HTTP proxy services can be achieved through policies of service registry and load balancing.
And the API gateway is specifically used for receiving response data from the target server returned by the target HTTP proxy service, processing the response data into set format data, and returning the set format data to the user.
Receiving, through an API gateway, a first request initiated by a user to use an HTTP proxy service as a crawler; receiving the response data from the target server returned by the target HTTP proxy service; processing the response data into data in the set format; and returning the set-format data to the user: unified management of data access is thereby realized.
Preferably, as shown in fig. 5, the service link tracking unit 24 is further included, and is configured to:
when the API gateway receives the first request, creating a tracking context of the first request, wherein the tracking context comprises a request ID of the first request and an initial span ID, and the tracking context is transferred along with the processing flow of the first request;
creating a corresponding span and span ID for each processing step in the processing flow of the first request, and collecting span data of each span, wherein the span data comprises: start time, end time, process annotation of span; the span is a unit for measuring and recording time and events of each processing step in the processing flow;
service link tracking information is formed, including the request ID, all span IDs, and span data.
By recording and analyzing the information of each request processed by the HTTP proxy services, potential performance bottlenecks can be rapidly located, which facilitates subsequent performance analysis and troubleshooting.
Preferably, as shown in fig. 6, the system for HTTP proxy service according to the embodiment of the present invention further includes:
a monitoring unit 25, configured to detect each HTTP proxy service, periodically collect performance index data and running state information of each HTTP proxy service, and display and analyze the performance index data and the running state information;
the failure processing unit 26 is configured to create a new HTTP proxy service when a failed HTTP proxy service is detected, and replace the failed HTTP proxy service with the newly created HTTP proxy service.
The running condition of the whole system can thus be monitored in real time. When an HTTP proxy service is detected to have failed, the failed service can be automatically removed and a new HTTP proxy service created for automatic restart and replacement, which provides good stability.
Preferably, the HTTP proxy service system according to the embodiment of the present invention further includes:
a configuration management unit 27, configured to modify corresponding configuration items using a unified configuration interface and issue a configuration change event when configuration information of any HTTP proxy service needs to be modified; wherein the configuration change event comprises: a changed configuration item and a changed configuration value;
The HTTP proxy service 23 is configured to obtain a changed configuration value after monitoring a configuration change event of the HTTP proxy service, and update corresponding configuration information in an atomic operation manner according to the changed configuration value; the atomic operation mode refers to: and continuing to use the configuration value before modification for the first request which is being processed by the HTTP proxy service until the first request is processed, and using the configuration value after modification for a new first request.
Since the update of the configuration values is an atomic operation, a first request that is already being processed is not affected: it is completed with the configuration in effect before the change, while new first requests use the new configuration. Adopting this configuration management enables flexible control and optimization of the entire HTTP proxy service system.
The foregoing technical solutions of the embodiments of the present invention will be described in detail with reference to specific application examples, and reference may be made to the foregoing related description for details of the implementation process that are not described.
The embodiments of the invention relate to a method and a corresponding system for HTTP proxy service in which each HTTP proxy service is deployed in its own independent container environment. By reasonably and dynamically scheduling the existing HTTP proxy services, an optimized proxy can be provided for the crawler, the time the crawler waits when accessing URLs (Uniform Resource Locators) is reduced, the access-interval limits preset by websites can be effectively avoided, the efficiency with which the crawler accesses URLs is improved, and the efficiency with which the crawler fetches web page data is further improved.
HTTP proxy services are deployed using containerization techniques (e.g., Docker). Containerized deployment relies on container technologies (e.g., Docker) and container orchestration tools (e.g., Kubernetes). A Dockerfile defines how to build the container image of a proxy service, including the base image, dependency libraries, and the application. Kubernetes defines the deployment size, resource limits, access policies, and so on of the HTTP proxy services through configuration files (e.g., YAML). Containerized deployment lets each HTTP proxy service run independently in a lightweight virtual environment, makes it easy to scale and migrate, achieves good stability and high availability, and improves the efficiency of deploying and managing HTTP proxy services.
Different proxy service providers register with a service registration center respectively, each proxy service provider can provide a plurality of HTTP proxy services, registration information of each HTTP proxy service is registered with the service registration center, HTTP proxy services which pass account verification and are in an IP white list are judged to pass registration verification, and HTTP proxy services which pass registration verification are regarded as registered HTTP proxy services; the process of registering with the service registry can associate the proxy servers and HTTP proxy services provided by the proxy servers with the system and assign each proxy server a unique identifier after registration is passed for differentiation and management. When a user initiates a first request for using the HTTP proxy service, the first request is dispatched to one of the HTTP proxy services (including a proxy address, a port, a container ID, etc.) as a target HTTP proxy service according to a policy of load balancing. Dynamic scheduling of HTTP proxy services can be achieved through policies of service registry and load balancing. And the user directly accesses the target HTTP proxy service by using the proxy address to acquire the webpage content.
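For illustration only, a crawler client could route its requests through the allocated proxy address in the standard way shown in the Go sketch below; the proxy and target addresses are placeholders, and this is a generic usage example rather than part of the patented system itself.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"time"
)

func main() {
	// Assumed proxy address returned by the scheduler (address:port of the
	// target HTTP proxy service allocated to this crawler request).
	proxyURL, err := url.Parse("http://10.0.0.2:8080")
	if err != nil {
		panic(err)
	}

	// Route all requests from this client through the allocated HTTP proxy.
	client := &http.Client{
		Transport: &http.Transport{Proxy: http.ProxyURL(proxyURL)},
		Timeout:   10 * time.Second,
	}

	// Fetch a target URL (placeholder) through the proxy, as a crawler would.
	resp, err := client.Get("https://example.com/")
	if err != nil {
		fmt.Println("fetch failed:", err)
		return
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println("status:", resp.Status, "bytes:", len(body))
}
```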
The specific sequence flow of the HTTP proxy service method is shown in FIG. 8, and comprises the following steps:
1. The client initiates a request: the user initiates a first request to use the HTTP proxy service as a crawler, for example through a client (a browser, a mobile application, etc.), to the API gateway of the HTTP proxy service. The first request may include information such as query parameters, request headers, and a request body, and specifies the search content or conditions used to request specific data or resources from the service.
2. The API gateway processes the request: after receiving the first request, the API gateway performs related processing on it, such as authentication, rate limiting, and circuit breaking. The API gateway can be implemented with open-source or commercial products such as Kong, Tyk, or Ambassador. It acts as the unified entry point, is responsible for receiving the user's request and forwarding it to the corresponding HTTP proxy service, and ensures the stability of the system. Meanwhile, the API gateway supports plug-ins or middleware to implement functions such as rate limiting, circuit breaking, and authentication of requests, improving the stability and security of system operation.
3. The API gateway forwards the request, through the service registration center, to a suitable HTTP proxy service based on the service list of the service registry (which records the first requests being proxied by each HTTP proxy service) and the load balancing policy. In dynamic scheduling, an appropriate load balancing strategy is needed to keep the load of the HTTP proxy services balanced, so the implementation is flexible and highly available. Common load balancing strategies include round robin, weighted round robin, and minimum number of connections. The implementation of the load balancing policy depends on the scheduler of the HTTP proxy services: the scheduler selects a suitable HTTP proxy service based on the service registration information and the chosen load balancing policy (e.g., round robin, weighted round robin, or minimum number of connections). For example, the round robin policy distributes requests to each HTTP proxy service in turn, while the minimum-connections policy sends the first request to the HTTP proxy service with the fewest current connections.
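The load-balancing choice described in step 3 can be illustrated with a minimal scheduler sketch implementing two of the named policies, round robin and least connections. The structures and registry contents are assumptions for the example, not the invention's scheduler.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// ProxyInstance is a registry entry: an address plus the number of
// requests it is currently handling (as reported via heartbeats).
type ProxyInstance struct {
	Addr        string
	ActiveConns int64
}

type Scheduler struct {
	mu        sync.Mutex
	instances []*ProxyInstance
	rrIndex   uint64
}

// RoundRobin hands requests to each HTTP proxy service in turn.
func (s *Scheduler) RoundRobin() *ProxyInstance {
	i := atomic.AddUint64(&s.rrIndex, 1)
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.instances[int(i)%len(s.instances)]
}

// LeastConnections sends the request to the proxy with the fewest
// active connections, as described for the minimum-connections policy.
func (s *Scheduler) LeastConnections() *ProxyInstance {
	s.mu.Lock()
	defer s.mu.Unlock()
	best := s.instances[0]
	for _, p := range s.instances[1:] {
		if p.ActiveConns < best.ActiveConns {
			best = p
		}
	}
	return best
}

func main() {
	s := &Scheduler{instances: []*ProxyInstance{
		{Addr: "10.0.0.1:8080", ActiveConns: 3},
		{Addr: "10.0.0.2:8080", ActiveConns: 1},
		{Addr: "10.0.0.3:8080", ActiveConns: 7},
	}}
	fmt.Println("round robin ->", s.RoundRobin().Addr)
	fmt.Println("least connections ->", s.LeastConnections().Addr) // 10.0.0.2:8080
}
```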
4. The HTTP proxy service processes the request: it converts the first request into a second request based on the HTTP protocol, forwards the second request to the target server corresponding to each target URL, and, in the process of fetching the web page data corresponding to each target URL from the target server, takes the web page data that satisfies the proxy rules, together with the corresponding target URL, as the response data of the target server. A target URL is a URL whose corresponding web page data satisfies the first request, that is, the URL of the user's original request; the web page data of the target URL contains the data corresponding to the first request (the user's original request), for example web page data that contains the search content or data matching the search conditions.
5. After receiving the first request from the API gateway, the selected target HTTP proxy service parses it to extract request information such as the target URL, the request method, and the request headers. The HTTP proxy service then forwards the request to the target server where the target URL is located, according to the proxy rules and policies.
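Steps 4 and 5 (parsing the first request and forwarding a second request to the target server) can be sketched with Go's standard reverse-proxy support. The header used to carry the target URL is an assumed convention for illustration only, not something specified in the description.

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// proxyHandler reads the target URL from the incoming (first) request,
// builds an HTTP request to the target server (the second request),
// and streams the target server's response back to the caller.
func proxyHandler(w http.ResponseWriter, r *http.Request) {
	// Assumed convention: the crawler passes the page it wants in a header.
	target := r.Header.Get("X-Target-URL")
	u, err := url.Parse(target)
	if err != nil || u.Scheme == "" {
		http.Error(w, "missing or invalid X-Target-URL", http.StatusBadRequest)
		return
	}
	rp := &httputil.ReverseProxy{
		Director: func(req *http.Request) {
			// Rewrite the outgoing request so it is addressed to the target server.
			req.URL = u
			req.Host = u.Host
			req.Header.Del("X-Target-URL")
		},
	}
	rp.ServeHTTP(w, r)
}

func main() {
	http.HandleFunc("/fetch", proxyHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```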
6. The target server responds to the request: after receiving the second request of the target HTTP proxy service, the target server carries out corresponding processing and generates response data; the response data may be sent back to the target HTTP proxy service.
7. The target HTTP proxy service forwards the response data: the proxy service instance receives the response data of the target server. After receiving the target URL and the web page data returned by the target server, it performs some processing on the response data, such as modifying the response headers or compressing the response content, and then forwards the processed response data back to the API gateway. The purposes of the target HTTP proxy service processing the response data include the following:
Response header modification: the target HTTP proxy service modifies the response headers to meet specific needs or to add additional information. For example, a cache-control header, security headers, or cross-origin resource sharing headers may be added.
Response content compression: to reduce the size of the transmitted data, the target HTTP proxy service compresses the response content, for example using the Gzip or Deflate compression algorithm. This improves network transmission efficiency and reduces bandwidth usage and response time.
Data encryption/decryption: the target HTTP proxy service may need to encrypt or decrypt the response data to ensure the security of the data. Encryption may protect confidentiality of data, for example, when processing sensitive information or communicating with external systems. Specifically, in order to ensure security of proxy services, various security policies are adopted in the embodiments of the present invention, such as using TLS/SSL encrypted transmissions, setting access control lists, and the like. The use of TLS/SSL encrypted transmissions ensures the security of data during transmission, and Access Control Lists (ACLs) may restrict access to HTTP proxy services for specific IPs or users. The containerized deployment realizes the resource isolation and the security restriction between proxy services through the Linux namespaces, cgroups and other technologies, and prevents potential security risks.
Content filtering and modification: the target HTTP proxy service may perform content filtering, replacement or modification of the response data as desired. For example, sensitive vocabulary may be filtered, URL links replaced, HTML tags modified, and so forth.
After processing is complete, the target HTTP proxy service forwards the modified response data back to the API gateway for further processing and return to the user by the API gateway. By doing the above processing of the response data, the target HTTP proxy service is able to customize and optimize the transmission and content of the data to provide a better user experience, enhance security, or meet specific business needs.
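As one concrete example of the response-content compression mentioned above, the following generic Go sketch compresses the response body with Gzip when the client advertises support for it; it is an illustrative middleware, not the invention's processing pipeline.

```go
package main

import (
	"compress/gzip"
	"fmt"
	"net/http"
	"strings"
)

// gzipResponseWriter sends the body through a gzip.Writer while passing
// header operations through to the underlying http.ResponseWriter.
type gzipResponseWriter struct {
	http.ResponseWriter
	gz *gzip.Writer
}

func (g *gzipResponseWriter) Write(b []byte) (int, error) { return g.gz.Write(b) }

// gzipMiddleware compresses the response body when the client accepts gzip
// and sets the corresponding response header, as described above.
func gzipMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !strings.Contains(r.Header.Get("Accept-Encoding"), "gzip") {
			next.ServeHTTP(w, r)
			return
		}
		w.Header().Set("Content-Encoding", "gzip")
		gz := gzip.NewWriter(w)
		defer gz.Close() // flush the compressed stream when the handler returns
		next.ServeHTTP(&gzipResponseWriter{ResponseWriter: w, gz: gz}, r)
	})
}

func main() {
	page := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, strings.Repeat("<p>crawled page data</p>", 100))
	})
	http.Handle("/", gzipMiddleware(page))
	http.ListenAndServe(":8081", nil)
}
```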
8. The API gateway returns a response: after receiving the response data from the target HTTP proxy service, the API gateway performs final processing on the response data, such as adding a response header, recording a log and the like, and processes the response data into the data in the set format. The API gateway returns the formatted data to the user.
Receiving, through an API gateway, a first request initiated by a user to use an HTTP proxy service as a crawler; receiving the response data from the target server returned by the target HTTP proxy service; processing the response data into data in the set format; and returning the set-format data to the user: unified management of data access is thereby realized.
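The "data in the set format" returned in step 8 is not specified in the description; the sketch below assumes a simple JSON envelope containing a status code, a request ID, and the payload, purely for illustration of how the gateway could format responses.

```go
package main

import (
	"encoding/json"
	"net/http"
)

// Envelope is an assumed "set format" for gateway responses; the real
// format is not specified in the description.
type Envelope struct {
	Code      int             `json:"code"`
	RequestID string          `json:"request_id"`
	Data      json.RawMessage `json:"data"`
}

// writeFormatted wraps raw response data from the target HTTP proxy
// service into the envelope and adds a response header, as in step 8.
func writeFormatted(w http.ResponseWriter, requestID string, payload []byte) error {
	w.Header().Set("Content-Type", "application/json")
	w.Header().Set("X-Request-ID", requestID)
	return json.NewEncoder(w).Encode(Envelope{
		Code:      http.StatusOK,
		RequestID: requestID,
		Data:      payload,
	})
}

func main() {
	http.HandleFunc("/crawl", func(w http.ResponseWriter, r *http.Request) {
		// In the real flow the payload would be the proxied web page data.
		writeFormatted(w, "req-123", []byte(`{"url":"https://example.com","html":"..."}`))
	})
	http.ListenAndServe(":8083", nil)
}
```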
9. Throughout the request flow, the target HTTP proxy service and the API gateway record relevant link tracking information. By recording and analyzing information such as the time consumed and the call relationships of each request processed by the HTTP proxy services, potential performance bottlenecks can be rapidly located, which facilitates subsequent performance analysis and troubleshooting.
Specifically, service link tracking may be implemented with tools such as Zipkin, Jaeger, or OpenTelemetry. In the HTTP proxy service, trace information including a request ID, a timestamp, and a processing duration needs to be recorded at the start and end of processing of the first request. This information is sent to the link tracking system and can be queried and analyzed through a web interface or API. The request ID is a unique identifier for a particular request; it may be a randomly generated string, a globally unique number, or another form of unique identifier, and each first request is assigned a unique request ID so that the request can be tracked and identified in the system. Through the request ID, the system can track the processing flow of a particular request. The request ID can be used to uniquely identify the request and correlate its associated operations and events throughout the process, from receipt of the request to the final response. The request ID plays a key role in logging and troubleshooting: by adding the request ID to log entries, the log entries associated with a particular request can be conveniently looked up and filtered, and when a problem or failure occurs in the system, the request ID can be used to track and locate its root cause. Using request IDs, the system can also analyze and monitor request performance; for example, indicators such as processing time and resource consumption can be recorded, aggregated, and analyzed by request ID to understand the performance status of the system. By associating a request ID with all operations and events related to the request, a complete request tracking context can be constructed. This context can be passed to different components and services to ensure traceability and consistency of the overall request handling process.
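A request ID of the kind described above could, for example, be generated at the gateway and propagated through a header and the request context, as in the following sketch; the header name and ID format are assumptions for illustration.

```go
package main

import (
	"context"
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"net/http"
)

type ctxKey string

const requestIDKey ctxKey = "request_id"

// newRequestID returns a random 16-byte hex string used as the unique
// identifier that follows the request through the system.
func newRequestID() string {
	b := make([]byte, 16)
	rand.Read(b)
	return hex.EncodeToString(b)
}

// withRequestID assigns (or reuses) a request ID, stores it in the request
// context, and echoes it in a response header so logs can be correlated.
func withRequestID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get("X-Request-ID") // assumed header name
		if id == "" {
			id = newRequestID()
		}
		w.Header().Set("X-Request-ID", id)
		ctx := context.WithValue(r.Context(), requestIDKey, id)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

func main() {
	h := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "request id:", r.Context().Value(requestIDKey))
	})
	http.ListenAndServe(":8084", withRequestID(h))
}
```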
Specific steps of service link tracking:
creating and delivering tracking context: when the API gateway receives a new first request, a new trace context is created that contains a unique trace ID (i.e., request ID) and an initial span ID. The trace context needs to be passed down the whole request processing, from the API gateway to the HTTP proxy service, and to other services.
Creating a corresponding span and span ID for each processing step in the processing flow of the first request, and collecting span data for each span, where the span data includes the span's start time, end time, and processing annotations. Service link tracking information is then formed, including the request ID, all span IDs, and the span data. A span is a unit used to measure and record the time and events of each processing step in the processing flow.
The span data were collected as follows: at each step of request processing (e.g., authentication, routing, request processing, etc.), it is necessary to create a new span and collect data on the start time, end time, process comments, etc. of the span. Wherein, the processing annotation refers to marking or annotating the processing step to provide additional information about the execution of the step. The process annotations may contain important details about the step, operational descriptions, state changes, exception information, etc. They are used to record and describe critical events or operations of the process steps for subsequent analysis and understanding of the request processing. The role of processing annotations includes: (1) recording the operation behavior: the process annotations may record important operations or events that occur in each process step. This facilitates subsequent troubleshooting and auditing to learn about the specific operation and execution of each step. (2) marking key points: the process annotations may mark key points in the request process flow, such as key code blocks, important calculation steps, or decision points. This helps to quickly locate and understand the execution of the key step when tracking and analyzing the service link. (3) recording anomaly information: if an exception or error occurs in a processing step, the processing annotation may be used to record relevant exception information, such as exception type, error message, stack trace, etc. This is very useful for troubleshooting and error analysis. Therefore, by adding process annotations in each process step, rich information can be collected for each span, including start time, end time, and process annotations for that step. These span information can be used to construct service link tracking to comprehensively record and monitor the various links of request processing and provide valuable data for performance optimization, troubleshooting, and system analysis.
Transmitting tracking data: when the request is processed or at some critical step, the collected trace data (service link trace information and other context information) needs to be sent to the trace system.
Querying and analyzing trace data: at the console of the tracking system, the collected tracking data may be queried and analyzed. For example, the overall processing time of the request, or the execution time of the individual steps, may be reviewed. From these data, performance bottlenecks and anomalies in the system can be found.
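The span bookkeeping in the steps above can be sketched with simplified structures; in practice a Zipkin, Jaeger, or OpenTelemetry client would provide them, and the shapes below are illustrative assumptions rather than the invention's tracer.

```go
package main

import (
	"fmt"
	"time"
)

// Span records one processing step: its ID, start/end time, and
// processing annotations, matching the span data listed above.
type Span struct {
	TraceID     string // the request ID shared by all spans of one request
	SpanID      string
	Name        string
	Start, End  time.Time
	Annotations []string
}

// Trace is the tracking context: a request ID plus the spans collected
// along the processing flow (gateway -> proxy service -> target server).
type Trace struct {
	RequestID string
	Spans     []*Span
}

func (t *Trace) StartSpan(name string) *Span {
	s := &Span{
		TraceID: t.RequestID,
		SpanID:  fmt.Sprintf("%s-%d", t.RequestID, len(t.Spans)+1),
		Name:    name,
		Start:   time.Now(),
	}
	t.Spans = append(t.Spans, s)
	return s
}

func (s *Span) Annotate(msg string) { s.Annotations = append(s.Annotations, msg) }
func (s *Span) Finish()             { s.End = time.Now() }

func main() {
	trace := &Trace{RequestID: "req-123"}

	auth := trace.StartSpan("authenticate")
	auth.Annotate("token accepted")
	auth.Finish()

	fwd := trace.StartSpan("forward-to-proxy")
	fwd.Annotate("selected proxy 10.0.0.2:8080 via least-connections")
	fwd.Finish()

	for _, s := range trace.Spans {
		fmt.Printf("%s %s took %v (%v)\n", s.SpanID, s.Name, s.End.Sub(s.Start), s.Annotations)
	}
}
```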
10. The HTTP proxy service instance periodically sends heartbeat information to the service registry. The heartbeat information includes service link state information (representing network reachability) and the proxy state (representing the operational state of the proxy service, normal or abnormal), which indicate that the service is running normally and is connected to other components, and confirm its active state in the service list.
11. The service registration center regularly receives the heartbeat information sent by the HTTP proxy services and is responsible for storing and maintaining the state information of all HTTP proxy services, so as to maintain and update the state of the HTTP proxy services and schedule them dynamically, thereby achieving strong data consistency and high availability.
12. Service monitoring and fault handling: each HTTP proxy service is checked, and its running-state data and performance index data are collected periodically, so that the running condition of the whole system can be monitored in real time. When an HTTP proxy service is detected to have failed, the failed service can be automatically removed and a new HTTP proxy service created as a replacement, which provides good stability. Specifically, the monitoring unit can be implemented with tools such as Prometheus and Grafana: Prometheus periodically collects performance index data and running-state data of the HTTP proxy services, such as response time, error rate, and heartbeat information, and Grafana is used to display and analyze these index data. The fault handling unit can be implemented through the automatic restart and replica-replacement functions of Kubernetes: when an HTTP proxy service fault is detected, the system automatically creates a new HTTP proxy service as a replacement.
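The monitoring and fault-handling behaviour of step 12 can be sketched as a health-check loop that removes failed instances and requests replacements. The health endpoint and interval are assumptions; in a real deployment, Prometheus scraping and Kubernetes liveness probes and restarts would do this work.

```go
package main

import (
	"log"
	"net/http"
	"time"
)

// healthy reports whether a proxy instance answers its health endpoint
// (an assumed /healthz path) within a short timeout.
func healthy(addr string) bool {
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get("http://" + addr + "/healthz")
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

// monitor periodically checks every registered proxy, removes failed ones
// and asks for a replacement, mirroring the fault handling described above.
func monitor(addrs []string, createReplacement func() string, interval time.Duration) {
	for range time.Tick(interval) {
		for i, addr := range addrs {
			if healthy(addr) {
				continue
			}
			log.Printf("proxy %s failed, replacing it", addr)
			addrs[i] = createReplacement() // e.g. a new container started by the orchestrator
		}
	}
}

func main() {
	create := func() string {
		// Placeholder: a real system would call the container platform's API here.
		return "10.0.0.9:8080"
	}
	monitor([]string{"10.0.0.1:8080", "10.0.0.2:8080"}, create, 15*time.Second)
}
```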
13. Dynamic scaling: the number of HTTP proxy services can be adjusted automatically according to actual service demand and load. When the load is high, more HTTP proxy services can be created automatically to improve performance; when the load is low, the number of HTTP proxy services can be reduced automatically to save resources, which provides good stability. One of the key technologies for dynamic scaling is the container orchestration tool (e.g., Kubernetes): by setting automatic scale-out and scale-in rules (based on metrics such as CPU utilization, memory utilization, or request volume), Kubernetes can dynamically adjust the number of HTTP proxy services according to the current load. Another key technology is the Horizontal Pod Autoscaler (HPA), which monitors metrics and triggers scaling operations.
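The scale-out and scale-in decision of step 13 can be illustrated with the proportional rule the Kubernetes Horizontal Pod Autoscaler applies, desired replicas = ceil(current replicas × current metric / target metric); the numbers in the sketch are only an example.

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas applies the proportional scaling rule used by the
// Kubernetes Horizontal Pod Autoscaler:
//   desired = ceil(current * currentMetric / targetMetric)
func desiredReplicas(current int, currentMetric, targetMetric float64) int {
	if targetMetric <= 0 || current <= 0 {
		return current
	}
	return int(math.Ceil(float64(current) * currentMetric / targetMetric))
}

func main() {
	// Example: 4 proxy instances at 85% average CPU with a 50% target
	// are scaled out; at 20% average CPU they are scaled in.
	fmt.Println(desiredReplicas(4, 85, 50)) // 7
	fmt.Println(desiredReplicas(4, 20, 50)) // 2
}
```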
14. Automated deployment and continuous integration/continuous deployment (CI/CD): in order to improve the deployment efficiency and reduce the operation and maintenance cost, an automatic deployment tool (such as Kubernetes) is adopted to deploy and manage the HTTP proxy service. By means of a continuous integration/continuous deployment flow, fast iteration and high quality release of code can be ensured.
Specifically, CI/CD tools such as Jenkins, GitLab CI, or GitHub Actions can implement automatic building, testing, and deployment of code. By configuring the CI/CD pipeline, every code commit triggers the build and test process; if the tests succeed, the code is automatically deployed to a production or pre-release environment.
Specific steps of Continuous Integration (CI):
version control: after writing the code locally, the developer submits the code to a main branch or development branch in a version control system (e.g., git).
Trigger and build: once new code is submitted, the CI tool (e.g., Jenkins, GitLab CI, or GitHub Actions) automatically triggers the build process.
Code construction: the CI tool will obtain the latest code and then compile it.
Run tests: after the code is built, the CI tool automatically runs predefined unit tests, integration tests, and so on, to ensure that the new code does not break existing functionality.
Generating a report: after the test is finished, the CI tool generates a report which comprises the result of the test, coverage rate and other information.
Notifying the result: based on the results of the build and test, the CI tool will notify the developer. If the build or test fails, the developer needs to fix the problem and resubmit the code.
Specific steps of Continuous Deployment (CD):
Verification: the continuous deployment process is triggered only if all tests in the CI process pass.
Configuration management: the CD tool may obtain configuration information for the application such as database connections, service ports, etc. Such configuration information is typically stored in a configuration management tool (e.g., spring Cloud Config or ZooKeeper).
Deployment to a pre-production environment: the CD tool will first deploy the application to the pre-production environment and then run a series of acceptance tests to ensure the behavior of the application in the production environment.
And (3) verification test: if the acceptance test passes, the CD tool will confirm that the application is ready for deployment in the production environment.
Deployment to production environment: the CD tool will deploy the application to the production environment and conduct the necessary health checks.
Monitoring and rollback: once a problem is found in the production environment, the CD tool should support an automatic or semi-automatic rollback mechanism to quickly restore service. Furthermore, the behavior of the application in the production environment should be continuously monitored.
15. Configuration management: dynamic updating of the configuration information of HTTP proxy services (e.g., proxy rules, access control policies) at runtime is supported. Configuration management may be implemented with tools such as Spring Cloud Config, Consul KV, or etcd. These tools provide a unified configuration storage and access interface and support dynamic updating of configuration information at runtime. The HTTP proxy service needs to listen for configuration change events and reload configuration data when such an event is received. The steps are as follows:
configuration change: when the configuration of the service instance needs to be modified, an operator or other systems can modify corresponding configuration items by adopting a unified configuration interface in a configuration management tool; for example, concurrent connection restrictions, timeout times, etc. of the HTTP proxy service may be modified.
Issuing an event: after detecting the configuration change, the configuration management tool issues a configuration change event; the content of an event typically includes changed configuration items and new configuration values.
Monitoring events: the HTTP proxy service needs to monitor configuration change events; this may be accomplished by registering an event handling function or callback function. When a configuration change event is received, an event handling function or callback function may be automatically invoked.
Updating configuration: in the event-handling function or callback function, the HTTP proxy service first obtains the changed configuration items and the changed configuration values, and then updates its own internal state to reflect the changed configuration. This update should be an atomic operation so that requests being handled during the update are not affected. Here, an atomic operation means that the HTTP proxy service continues to use the pre-modification configuration values for a first request that is already being processed until that request finishes, and uses the modified configuration values for new first requests.
Processing a first request: after the configuration update, the HTTP proxy service processes the new first request according to the changed configuration. Since the update of the configuration is an atomic operation, this does not affect the request being processed. The request being processed will be completed using the old configuration, while the new request will use the new configuration.
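The atomic configuration swap described in the "Updating configuration" and "Processing a first request" steps can be sketched with Go's atomic.Value: a request already in flight keeps the configuration snapshot it started with, while new requests see the updated values. The configuration fields are assumptions for the example.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// ProxyConfig holds the kinds of items mentioned above (concurrent
// connection limit, timeout); the fields are illustrative only.
type ProxyConfig struct {
	MaxConcurrent int
	Timeout       time.Duration
}

var current atomic.Value // holds *ProxyConfig

// onConfigChange is the event-handling callback: it stores the new
// configuration in one atomic step, so readers never see a partial update.
func onConfigChange(newCfg *ProxyConfig) {
	current.Store(newCfg)
}

// handleRequest snapshots the configuration once at the start, so a request
// already in flight keeps using the pre-change values until it finishes.
func handleRequest(id int) {
	cfg := current.Load().(*ProxyConfig)
	fmt.Printf("request %d using timeout=%v maxConcurrent=%d\n", id, cfg.Timeout, cfg.MaxConcurrent)
}

func main() {
	current.Store(&ProxyConfig{MaxConcurrent: 100, Timeout: 5 * time.Second})
	handleRequest(1) // served with the old configuration

	// A configuration-change event arrives from the configuration center.
	onConfigChange(&ProxyConfig{MaxConcurrent: 200, Timeout: 2 * time.Second})
	handleRequest(2) // served with the new configuration
}
```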
Through the configuration management, the flexible control and optimization of the whole HTTP proxy service system can be realized.
The beneficial effects obtained by the embodiment of the invention are as follows:
1. Compared with the traditional approach of specifying a proxy for each request, the method and system can intelligently select and directly use one of several HTTP proxy services through the scheduling policy; the selection of HTTP proxy services is flexible, and the number of HTTP proxy services can be scaled out.
Because they are deployed in containers, the HTTP proxy services can be easily deployed, managed, and extended. The system supports automatic detection of resource bottlenecks and can respond quickly when service instances need to be scaled out or in, dynamically expanding or shrinking the required resources and improving resource utilization. It offers high performance, scalability, and stability, helps reduce operation and maintenance costs, improves enterprise competitiveness, and is suitable for modern cloud computing and microservice architectures.
2. By using the service registration center and the load balancing strategy, requests can be distributed among a plurality of HTTP proxy services, so that when one HTTP proxy service fails, requests can still be processed by the other HTTP proxy services, which improves the high availability and fault tolerance of the whole system (a minimal selection-and-failover sketch follows this list).
3. Configuration information of the HTTP proxy service can be dynamically updated at runtime through configuration management without restarting the service. This makes the system more flexible and able to adapt quickly to configuration changes.
4. With the API gateway as the unified entry point for requests, functions such as request authentication and rate limiting can be implemented, which improves the security and stability of the system. The API gateway also provides a plug-in or middleware mechanism, making it easy to extend and customize its functions.
5. Service link tracking can monitor and analyze, in real time, performance bottlenecks and abnormal conditions during request processing, which helps optimize system performance and troubleshoot faults.
6. Automated continuous integration and continuous delivery (CI/CD) ensures that code moves from development to production quickly and stably, which improves the efficiency of development and operations teams.
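As referenced in item 2 above, the following Go sketch shows one way requests could be spread across registered HTTP proxy services with round-robin selection and simple failover. The in-memory Registry, the instance list, and the hard-coded failure are assumptions standing in for a real service registration center and real network errors.

```go
// Illustrative sketch of distributing requests across registered HTTP
// proxy services with round-robin selection and basic failover.
package main

import (
	"errors"
	"fmt"
	"sync"
)

type Instance struct {
	ID      string
	Addr    string
	Healthy bool
}

type Registry struct {
	mu        sync.Mutex
	instances []Instance
	next      int
}

// Pick returns the next healthy instance in round-robin order.
func (r *Registry) Pick() (Instance, error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	for i := 0; i < len(r.instances); i++ {
		inst := r.instances[r.next%len(r.instances)]
		r.next++
		if inst.Healthy {
			return inst, nil
		}
	}
	return Instance{}, errors.New("no healthy HTTP proxy service available")
}

// forward simulates sending the first request to one proxy instance.
func forward(inst Instance, url string) error {
	if inst.ID == "proxy-2" { // pretend this instance has failed
		return errors.New("connection refused")
	}
	fmt.Printf("request for %s handled by %s (%s)\n", url, inst.ID, inst.Addr)
	return nil
}

// dispatch tries instances until one succeeds, giving basic fault tolerance.
func dispatch(r *Registry, url string) error {
	var lastErr error
	for attempt := 0; attempt < 3; attempt++ {
		inst, err := r.Pick()
		if err != nil {
			return err
		}
		if lastErr = forward(inst, url); lastErr == nil {
			return nil
		}
	}
	return lastErr
}

func main() {
	r := &Registry{instances: []Instance{
		{ID: "proxy-1", Addr: "10.0.0.1:8080", Healthy: true},
		{ID: "proxy-2", Addr: "10.0.0.2:8080", Healthy: true},
		{ID: "proxy-3", Addr: "10.0.0.3:8080", Healthy: false},
	}}
	for i := 0; i < 3; i++ {
		if err := dispatch(r, "http://example.com/page"); err != nil {
			fmt.Println("dispatch failed:", err)
		}
	}
}
```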
It should be understood that the specific order or hierarchy of steps in the processes disclosed is an example of exemplary approaches. Based on design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged without departing from the scope of the present disclosure. The accompanying method claims present elements of the various steps in a sample order and are not meant to be limited to the specific order or hierarchy presented.
In the foregoing detailed description, various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments require more features than are expressly recited in each claim. Rather, as the following claims reflect, the inventive subject matter lies in less than all features of a single disclosed embodiment. Thus the following claims are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description includes examples of one or more embodiments. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the aforementioned embodiments, but one of ordinary skill in the art will recognize that many further combinations and permutations of the various embodiments are possible. Accordingly, the embodiments described herein are intended to embrace all such alterations, modifications, and variations that fall within the scope of the appended claims. Furthermore, as used in the specification or claims, the term "including" is intended to be inclusive in a manner similar to the term "comprising," as "comprising" is interpreted when employed as a transitional word in a claim. Furthermore, any use of the term "or" in the specification or the claims is intended to mean a non-exclusive "or."
Those of skill in the art will further appreciate that the various illustrative logical blocks, units, and steps described in connection with the embodiments of the invention may be implemented by electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, elements, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and the design requirements of the overall system. Those skilled in the art may implement the described functionality in varying ways for each particular application, but such implementation is not to be understood as beyond the scope of the embodiments of the present invention.
The various illustrative logical blocks or units described in the embodiments of the invention may be implemented or performed with a general purpose processor, a digital signal processor, an Application Specific Integrated Circuit (ASIC), a field programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described. A general purpose processor may be a microprocessor, but in the alternative, the general purpose processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a digital signal processor and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a digital signal processor core, or any other similar configuration.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may be stored in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. In an example, a storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC, which may reside in a user terminal. In the alternative, the processor and the storage medium may reside as distinct components in a user terminal.
In one or more exemplary designs, the above-described functions of embodiments of the present invention may be implemented in hardware, software, firmware, or any combination of the three. If implemented in software, the functions may be stored on a computer-readable medium or transmitted as one or more instructions or code on the computer-readable medium. Computer-readable media include both computer storage media and communication media that facilitate transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general purpose or special purpose computer. For example, such computer-readable media may include, but are not limited to, RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that may be used to carry or store program code in the form of instructions or data structures and that can be read by a general purpose or special purpose computer, or a general purpose or special purpose processor. Further, any connection is properly termed a computer-readable medium; for example, if the software is transmitted from a website, server, or other remote source via a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wirelessly such as by infrared, radio, or microwave, it is also included in the definition of computer-readable medium. As used herein, disk and disc include compact disc, laser disc, optical disc, DVD, floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically while discs reproduce data optically with lasers. Combinations of the above may also be included within the scope of computer-readable media.
The foregoing description of the embodiments is intended to illustrate the general principles of the invention and is not meant to limit the invention to the particular embodiments disclosed or to otherwise restrict its scope; any modifications, equivalent replacements, improvements, and the like made within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (12)

1. A method of HTTP proxy service, comprising:
receiving a first request initiated by a user and using an HTTP proxy service as a crawler, and allocating an HTTP proxy service to the first request as a target HTTP proxy service; wherein each HTTP proxy service independently operates in a containerized deployed virtual environment;
the target HTTP proxy service converts the first request into a second request based on the HTTP protocol and forwards the second request to the target server corresponding to each target URL, and, in the process in which the target server fetches the web page data corresponding to each target URL, takes the web page data that satisfies the proxy rule and the corresponding target URL as response data of the target server; wherein the target URL is a URL whose corresponding web page data satisfies the first request;
And returning the response data to the user.
2. The method of HTTP proxy service according to claim 1, further comprising:
receiving registration information of HTTP proxy services provided by at least one proxy service provider through a service registration center, determining that an HTTP proxy service which passes account verification and is on an IP whitelist passes registration verification, taking the HTTP proxy services which pass registration verification as registered HTTP proxy services, and allocating a unique identifier to each proxy service provider; wherein,
based on heartbeat information periodically transmitted by each registered HTTP proxy service, registration information and running state information of all registered HTTP proxy services are stored and maintained.
3. The method of claim 2, wherein receiving a first request initiated by a user using an HTTP proxy service as a crawler, assigning an HTTP proxy service to the first request as a target HTTP proxy service, comprises:
receiving, through an API gateway, a first request initiated by a user and using an HTTP proxy service as a crawler, allocating a registered HTTP proxy service to the first request as the target HTTP proxy service based on a load balancing strategy and the registration information of each HTTP proxy service, and forwarding the first request to the target HTTP proxy service;
The step of returning the response data to the user specifically comprises the following steps:
and receiving response data from the target server returned by the target HTTP proxy service through an API gateway, processing the response data into set format data, and returning the set format data to the user.
4. A method of HTTP proxy service according to claim 3, further comprising:
when the API gateway receives the first request, creating a tracking context of the first request, wherein the tracking context comprises a request ID of the first request and an initial span ID, and the tracking context is transferred along with the processing flow of the first request;
creating a corresponding span and span ID for each processing step in the processing flow of the first request, and collecting span data of each span, wherein the span data comprises: start time, end time, process annotation of span; the span is a unit for measuring and recording time and events of each processing step in the processing flow;
service link tracking information is formed, including the request ID, all span IDs, and span data.
5. The method of HTTP proxy service according to claim 1, further comprising:
detecting each HTTP proxy service, periodically collecting performance index data and running state information of each HTTP proxy service, and displaying and analyzing the performance index data and the running state information;
when a failed HTTP proxy service is detected, a new HTTP proxy service is created and the failed HTTP proxy service is replaced with the newly created HTTP proxy service.
6. The method of HTTP proxy service according to claim 1, further comprising:
when the configuration information of any HTTP proxy service needs to be modified, a unified configuration interface is adopted to modify the corresponding configuration item, and a configuration change event is issued; wherein the configuration change event comprises: a changed configuration item and a changed configuration value;
after the HTTP proxy service monitors the configuration change event of the HTTP proxy service, the HTTP proxy service acquires a changed configuration value and updates corresponding configuration information in an atomic operation mode according to the changed configuration value; the atomic operation mode refers to: and continuing to use the configuration value before modification for the first request which is being processed by the HTTP proxy service until the first request is processed, and using the configuration value after modification for a new first request.
7. A system of HTTP proxy services comprising an API gateway, a registry of services, at least one HTTP proxy service, wherein:
the API gateway is used for receiving a first request initiated by a user and using the HTTP proxy service as a crawler and returning response data to the user;
the registration service center is configured to allocate an HTTP proxy service as a target HTTP proxy service for the first request; wherein each HTTP proxy service independently operates in a containerized deployed virtual environment;
the target HTTP proxy service is used for converting the first request into a second request based on the HTTP protocol, forwarding the second request to the target server corresponding to each target URL, and, in the process in which the target server fetches the web page data corresponding to each target URL, taking the web page data that satisfies the proxy rule and the corresponding target URL as response data of the target server; wherein the target URL is a URL whose corresponding web page data satisfies the first request.
8. The HTTP proxy service system of claim 7, wherein the service registry is further configured to:
receiving registration information of HTTP proxy services provided by at least one proxy service provider, determining that an HTTP proxy service which passes account verification and is on an IP whitelist passes registration verification, taking the HTTP proxy services which pass registration verification as registered HTTP proxy services, and allocating a unique identifier to each proxy service provider; wherein,
Based on heartbeat information periodically transmitted by each registered HTTP proxy service, registration information and running state information of all registered HTTP proxy services are stored and maintained.
9. The system for HTTP proxy service of claim 8, wherein,
the API gateway is specifically configured to receive a first request initiated by a user and using an HTTP proxy service as a crawler, allocate a registered HTTP proxy service as a target HTTP proxy service for the first request based on a load balancing policy and registration information of each HTTP proxy service, and forward the first request to the target HTTP proxy service; and
and receiving response data from the target server returned by the target HTTP proxy service, processing the response data into set format data, and returning the set format data to the user.
10. The system of HTTP proxy service according to claim 9, further comprising a service link tracking unit configured to:
when the API gateway receives the first request, creating a tracking context of the first request, wherein the tracking context comprises a request ID of the first request and an initial span ID, and the tracking context is transferred along with the processing flow of the first request;
Creating a corresponding span and span ID for each processing step in the processing flow of the first request, and collecting span data of each span, wherein the span data comprises: start time, end time, process annotation of span; the span is a unit for measuring and recording time and events of each processing step in the processing flow;
service link tracking information is formed, including the request ID, all span IDs, and span data.
11. The HTTP proxy service system according to claim 7, further comprising:
the monitoring unit is used for detecting each HTTP proxy service, periodically collecting performance index data and running state information of each HTTP proxy service, and displaying and analyzing the performance index data and the running state information;
and the fault processing unit is used for creating a new HTTP proxy service when a failed HTTP proxy service is detected, and replacing the failed HTTP proxy service with the newly created HTTP proxy service.
12. The HTTP proxy service system according to claim 7, further comprising:
the configuration management unit is used for adopting a unified configuration interface to modify corresponding configuration items and issuing configuration change events when the configuration information of any HTTP proxy service needs to be modified; wherein the configuration change event comprises: a changed configuration item and a changed configuration value;
The HTTP proxy service is used for acquiring the changed configuration value after monitoring the configuration change event of the HTTP proxy service, and updating the corresponding configuration information in an atomic operation mode according to the changed configuration value; the atomic operation mode refers to: and continuing to use the configuration value before modification for the first request which is being processed by the HTTP proxy service until the first request is processed, and using the configuration value after modification for a new first request.
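Purely for illustration, the Go sketch below mirrors the service link tracking recited in claims 4 and 10: a tracking context carrying a request ID, with one span per processing step recording its span ID, start time, end time, and annotation. The type names, span IDs, and step annotations are assumptions and do not form part of the claims.

```go
// Illustrative sketch of a tracking context with per-step spans.
package main

import (
	"fmt"
	"time"
)

type Span struct {
	SpanID     string
	Start, End time.Time
	Annotation string
}

type TraceContext struct {
	RequestID string
	Spans     []Span
}

// step runs one processing step and records it as a span.
func (t *TraceContext) step(spanID, annotation string, fn func()) {
	s := Span{SpanID: spanID, Start: time.Now(), Annotation: annotation}
	fn()
	s.End = time.Now()
	t.Spans = append(t.Spans, s)
}

func main() {
	// Created when the API gateway receives the first request.
	ctx := &TraceContext{RequestID: "req-123"}
	ctx.step("span-0", "gateway received first request", func() {})
	ctx.step("span-1", "target HTTP proxy service selected", func() { time.Sleep(time.Millisecond) })
	ctx.step("span-2", "second request forwarded to target server", func() { time.Sleep(2 * time.Millisecond) })

	// The assembled service link tracking information: request ID, span IDs, span data.
	for _, s := range ctx.Spans {
		fmt.Printf("request=%s span=%s %v -> %v: %s\n",
			ctx.RequestID, s.SpanID,
			s.Start.Format(time.StampMicro), s.End.Format(time.StampMicro),
			s.Annotation)
	}
}
```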
CN202311281370.2A 2023-09-28 2023-09-28 HTTP proxy service method and system Pending CN117395236A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311281370.2A CN117395236A (en) 2023-09-28 2023-09-28 HTTP proxy service method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311281370.2A CN117395236A (en) 2023-09-28 2023-09-28 HTTP proxy service method and system

Publications (1)

Publication Number Publication Date
CN117395236A true CN117395236A (en) 2024-01-12

Family

ID=89465807

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311281370.2A Pending CN117395236A (en) 2023-09-28 2023-09-28 HTTP proxy service method and system

Country Status (1)

Country Link
CN (1) CN117395236A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118900282A (en) * 2024-10-09 2024-11-05 思创数码科技股份有限公司 A big data support method, system, computer and storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination