
CN107957999A - Method and device for a web crawler to obtain website data - Google Patents


Info

Publication number
CN107957999A
Authority
CN
China
Prior art keywords
agent
queue
website data
extracted
failure
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610899608.1A
Other languages
Chinese (zh)
Inventor
张祎博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd
Priority to CN201610899608.1A
Publication of CN107957999A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and device for a web crawler to obtain website data, and relates to the field of computer technology. Its main purpose is to ensure that the many proxy IPs in use are valid and are reused, and, when existing proxy IPs fail, to dynamically obtain new proxy IPs and screen them to replace the failed ones. The main technical solution of the invention is: extracting a proxy IP from a first queue to obtain website data; removing invalid proxy IPs from the first queue according to the validity of the proxy IP; when the number of proxy IPs in the first queue is less than a minimum threshold, extracting proxy IPs from a second queue to obtain website data; and adding the proxy IPs in the second queue that can successfully obtain the website data to the first queue until the number of proxy IPs in the first queue reaches a maximum threshold, after which proxy IPs are again extracted from the first queue to obtain website data. The invention is mainly used by web crawlers to crawl network data.

Description

Method and device for a web crawler to obtain website data
Technical field
The present invention relates to the field of computer technology, and in particular to a method and device for a web crawler to obtain website data.
Background
A web crawler is a tool that automatically obtains data from websites. For a website, a crawler's data acquisition consumes site resources just as visits by real users do, and for a crawler that captures large volumes of data the resource consumption can far exceed that of normal user access. Designers of many websites therefore rate-limit access suspected to come from web crawlers, verify identity by means such as CAPTCHAs, or even block access from certain IP addresses. These measures are known as a website's anti-crawler strategy, and they all create problems for a crawler's data collection.
Widely used web crawlers have several ways to cope with anti-crawler strategies. For a website that limits crawl speed, the crawler can reduce its access frequency. Once the crawler's IP is blocked by a website, however, the site's data can only be crawled through proxy IPs. Proxy IPs are usually obtained from a proxy IP service provider, which can dynamically supply a large number of proxy IPs in a short time for the crawler to choose from. Valid proxy IPs let the crawler crawl a website without being blocked, but when the proxy IPs supplied by the provider are of low quality, a large share of them cannot be used normally.
In existing implementations, when the proxy IPs obtained from the proxy IP service provider are of low quality, crawling generally relies on repeated retries: after an access to a web page fails, a new proxy IP is obtained and the website is crawled again, until the page is crawled successfully or a certain number of retries is reached. This approach does not solve the low-quality problem; repeated retries reduce crawl efficiency, and valid proxy IPs are not fully utilized. Another approach is to screen proxy IPs. Screening must be done before crawling: a large number of proxy IPs are obtained first, their validity is verified, and the proxy IPs that pass verification are then given to the crawler. This approach effectively improves the validity of the proxy IPs used, but the proxy IPs it yields are limited and fixed. It cannot dynamically track the valid proxy IPs offered by the provider; that is, after the provider offers new proxy IPs, they cannot be supplied to the crawler in real time because their validity has not been determined. As a result, in long-running crawls, the fixed proxy IPs may fail, or even be blocked because they are used too often.
Summary of the invention
In view of this, the present invention provides a method and device for a web crawler to obtain website data. The main purpose is to ensure that the many proxy IPs in use are valid and are reused, and, when existing proxy IPs fail, to dynamically obtain new proxy IPs and screen them to replace the failed ones.
To achieve the above purpose, the present invention mainly provides the following technical solutions:
In one aspect, the present invention provides a method for a web crawler to obtain website data, the method comprising:
extracting a proxy IP from a first queue to obtain website data;
removing invalid proxy IPs from the first queue according to the validity of the proxy IP;
when the number of proxy IPs in the first queue is less than a minimum threshold, extracting a proxy IP from a second queue to obtain website data;
adding the proxy IPs in the second queue that can successfully obtain the website data to the first queue until the number of proxy IPs in the first queue reaches a maximum threshold, and then extracting proxy IPs from the first queue to obtain website data.
Preferably, before extracting a proxy IP from the first queue to obtain website data, the method comprises:
creating the first queue by statically obtaining proxy IPs, wherein each proxy IP in the first queue is assigned a consecutive crawl failure count.
Preferably, removing invalid proxy IPs from the first queue according to the validity of the proxy IP comprises:
when website data is obtained successfully using the proxy IP, adding the proxy IP back to the first queue and resetting its consecutive crawl failure count to zero;
when obtaining website data using the proxy IP fails, recording the consecutive crawl failure count of the proxy IP;
when the consecutive crawl failure count of the proxy IP reaches a preset value, deleting the proxy IP.
Preferably, before extracting a proxy IP from the second queue to obtain website data, the method comprises:
obtaining proxy IPs using a proxy IP screening service, wherein the proxy IP screening service continuously obtains proxy IPs from a proxy IP acquisition service at a fixed frequency and screens out the valid ones;
creating the second queue from the proxy IPs obtained by the proxy IP screening service;
when the number of proxy IPs in the second queue reaches an upper limit, replacing the earliest-added proxy IP in the second queue with the newly added proxy IP.
Preferably, adding the proxy IPs in the second queue that can successfully obtain the website data to the first queue comprises:
when website data is obtained successfully using a proxy IP from the second queue, adding the proxy IP to the first queue;
determining whether the number of proxy IPs in the first queue has reached the maximum threshold;
if so, extracting a proxy IP from the first queue to obtain website data;
if not, extracting a proxy IP from the second queue to obtain website data.
Preferably, extracting a proxy IP from the second queue to obtain website data comprises:
when obtaining website data using a proxy IP from the second queue fails, deleting the proxy IP and extracting a new proxy IP from the second queue to obtain website data.
In another aspect, the present invention also provides a device for a web crawler to obtain website data, the device comprising:
a first extraction unit, configured to extract a proxy IP from a first queue to obtain website data;
a deletion unit, configured to remove invalid proxy IPs from the first queue according to the validity of the proxy IP extracted by the first extraction unit;
a second extraction unit, configured to extract a proxy IP from a second queue to obtain website data when the number of proxy IPs in the first queue is less than a minimum threshold;
an adding unit, configured to add the proxy IPs extracted by the second extraction unit that can successfully obtain the website data to the first queue until the number of proxy IPs in the first queue reaches a maximum threshold, after which proxy IPs are extracted from the first queue to obtain website data.
Preferably, the device comprises:
a first creating unit, configured to create the first queue by statically obtaining proxy IPs before the first extraction unit extracts a proxy IP from the first queue to obtain website data, wherein each proxy IP in the first queue is assigned a consecutive crawl failure count.
Preferably, the deletion unit comprises:
an adding module, configured to add the proxy IP back to the first queue and reset its consecutive crawl failure count to zero when website data is obtained successfully using the proxy IP;
a recording module, configured to record the consecutive crawl failure count of the proxy IP when obtaining website data using the proxy IP fails;
a deletion module, configured to delete the proxy IP when the consecutive crawl failure count recorded by the recording module reaches a preset value.
Preferably, the device comprises:
an acquiring unit, configured to obtain proxy IPs using a proxy IP screening service before the second extraction unit extracts a proxy IP from the second queue to obtain website data, wherein the proxy IP screening service continuously obtains proxy IPs from a proxy IP acquisition service at a fixed frequency and screens out the valid ones;
a second creating unit, configured to create the second queue from the proxy IPs obtained by the acquiring unit;
an updating unit, configured to replace the earliest-added proxy IP in the second queue with the newly added proxy IP when the number of proxy IPs in the second queue created by the second creating unit reaches an upper limit.
Preferably, the adding unit comprises:
an adding module, configured to add the proxy IP to the first queue when website data is obtained successfully using a proxy IP from the second queue;
a judging module, configured to determine whether the number of proxy IPs in the first queue has reached the maximum threshold; if so, a proxy IP is extracted from the first queue to obtain website data; if not, a proxy IP is extracted from the second queue to obtain website data.
Preferably, the second extraction unit is further configured to, when obtaining website data using a proxy IP from the second queue fails, delete the proxy IP and extract a new proxy IP from the second queue to obtain website data.
According to the method and device for a web crawler to obtain website data proposed above, the web crawler can obtain website data by extracting proxy IPs from the first queue. When the number of proxy IPs in the first queue decreases, proxy IPs from the second queue that can successfully obtain website data are added to the first queue, ensuring that the first queue holds high-quality proxy IPs. Faced with existing anti-crawler strategies, and especially when the proxy IPs supplied by a proxy service provider are of low quality, the invention uses the first queue to screen proxy IPs and reuse the valid ones in rotation. At the same time it uses the second queue to dynamically obtain valid proxy IPs, and supplies new, valid proxy IPs to the first queue when the number of proxy IPs there drops to a certain threshold, so that the web crawler can crawl data effectively over a long period.
Brief description of the drawings
By reading the following detailed description of the preferred embodiments, various other advantages and benefits will become clear to those of ordinary skill in the art. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered limiting of the invention. Throughout the drawings, the same reference numerals denote the same components. In the drawings:
Fig. 1 shows a flowchart of a method for a web crawler to obtain website data according to an embodiment of the present invention;
Fig. 2 shows a flowchart of another method for a web crawler to obtain website data according to an embodiment of the present invention;
Fig. 3 shows a block diagram of a device for a web crawler to obtain website data according to an embodiment of the present invention;
Fig. 4 shows a block diagram of another device for a web crawler to obtain website data according to an embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present invention are described in more detail below with reference to the drawings. Although the drawings show exemplary embodiments of the invention, it should be understood that the invention may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the invention can be understood more thoroughly and its scope can be fully conveyed to those skilled in the art.
An embodiment of the present invention provides a method for a web crawler to obtain website data. As shown in Fig. 1, the method applies to the process in which a web crawler accesses websites through proxy IPs to crawl data, and in particular to the screening of proxy IPs. Its specific steps include:
101. Extract a proxy IP from the first queue to obtain website data.
The first queue is a queue of proxy IPs obtained by static acquisition. Static acquisition is defined in contrast to the dynamic acquisition used for the second queue: "static" here mainly means obtaining the currently available proxy IPs non-dynamically, in one go. Typically, a large number of proxy IPs are obtained through the acquisition service of a proxy IP service provider for the web crawler to use when accessing websites to crawl data. The acquired proxy IPs can additionally be screened, so that only usable proxy IPs that can successfully access the website are added to the first queue.
The embodiment stores proxy IPs in a queue because of the queue's first-in, first-out property. In proxy IP usage, the current practice for coping with a website's anti-crawler strategy is, when many proxy IPs are available, to use a different proxy IP for each access as far as possible. This clearly does not make full use of valid proxy IPs. In practice, the guiding rule should be that the interval between two uses of the same proxy IP is as long as possible. With a queue, the proxy IP at the head is used each time and re-added to the tail after use, so before it is used again every other proxy IP will have been used, which guarantees the longest possible interval between uses of any one proxy IP.
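The head-to-tail rotation described above can be sketched in a few lines. This is only an illustrative sketch: the helper name `take_and_requeue` and the proxy addresses are hypothetical and do not appear in the patent, and Python's `collections.deque` is used here as one possible first-in, first-out queue.

```python
from collections import deque


def take_and_requeue(queue):
    """Take the proxy at the head of the queue and put it back at the tail."""
    proxy = queue.popleft()  # head of the queue: the least recently used proxy
    queue.append(proxy)      # re-added at the tail after use
    return proxy


# Hypothetical proxy addresses, used only for illustration.
proxies = deque(["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"])

# Four consecutive accesses: the first proxy is reused only after
# every other proxy in the queue has been used once.
order = [take_and_requeue(proxies) for _ in range(4)]
```

With three proxies in the queue, the fourth access is the first one that reuses a proxy, which is the longest interval a rotation over three entries can give.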
102. Remove invalid proxy IPs from the first queue according to the validity of the proxy IP.
The validity of a proxy IP is reflected in whether the web crawler can successfully access the website through it and obtain website data. After the crawler extracts a proxy IP from the first queue, the result of using it shows whether it is valid: when a proxy IP cannot be used to access the website, it is judged invalid. To rule out chance failures, that is, transient, isolated access failures caused by the website or server side, access failures can additionally be counted, and a proxy IP is judged invalid only after a certain number of consecutive failed accesses.
Once a proxy IP is confirmed invalid, it is removed and is no longer added to the first queue.
It should be noted that each time the web crawler accesses a website it extracts a new proxy IP from the first queue, and after a failed access it likewise extracts a new proxy IP to try again rather than retrying with the proxy IP from the previous access. This avoids repeatedly accessing a problematic, unreachable website with the same proxy IP and thereby misjudging that proxy IP.
103. When the number of proxy IPs in the first queue is less than the minimum threshold, extract a proxy IP from the second queue to obtain website data.
As proxy IPs in the first queue fail, the number of valid proxy IPs gradually decreases, which shortens the interval between uses of each remaining valid proxy IP and so accelerates their failure. To avoid this, new proxy IPs must be added to the first queue to lengthen the interval between uses. The embodiment sets a minimum threshold on the first queue: when the number of proxy IPs in the first queue drops to this threshold, the web crawler extracts proxy IPs from the second queue to access websites and crawl website data. The minimum threshold is an empirical value that can be user-defined; its size determines how frequently proxy IPs in the first queue are reused. The proxy IPs in the second queue are obtained dynamically from the proxy IP service provider; that is, they are the latest proxy IPs supplied by the provider in real time.
104. Add the proxy IPs in the second queue that can successfully obtain website data to the first queue.
The web crawler accesses the website and crawls website data using a proxy IP extracted from the second queue; after a successful access the proxy IP is added to the first queue, replenishing the number of proxy IPs there.
It should be noted that when the number of proxy IPs in the first queue reaches the maximum threshold, the web crawler no longer extracts proxy IPs from the second queue and uses the proxy IPs in the first queue instead. The maximum threshold is an empirical value set on the first queue and can be user-defined. When the first queue has an upper size limit, the maximum threshold can be that limit. It can also be set according to the number of proxy IPs in the second queue, keeping the difference between the maximum and minimum thresholds smaller than that number. This ensures that the second queue can supply enough proxy IPs to the first queue, so that the web crawler can return to extracting proxy IPs from the first queue and complete the cycle between the two queues.
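The relationship between the two thresholds and the size of the second queue can be expressed as a simple configuration check. The sketch below is an assumption-laden illustration: the function name `thresholds_are_consistent` and the sample values are hypothetical, and the only rule it encodes from the text is that the difference between the maximum and minimum thresholds should be smaller than the number of proxy IPs in the second queue.

```python
def thresholds_are_consistent(low, high, second_queue_size):
    """Check that the second queue can refill the first queue from low up to high.

    The text only requires (high - low) < second_queue_size; the extra
    condition 0 <= low < high is an assumption added here for sanity.
    """
    return 0 <= low < high and (high - low) < second_queue_size


# Hypothetical values: refill from 20 up to 50 proxies.
ok_with_large_buffer = thresholds_are_consistent(20, 50, second_queue_size=100)
ok_with_small_buffer = thresholds_are_consistent(20, 50, second_queue_size=30)
```

In the second call the gap between the thresholds equals the buffer size, so the check fails under the strict "smaller than" reading used in the text.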
In this embodiment the web crawler uses the first queue as its main source of proxy IPs. When the first queue runs short, valid proxy IPs are extracted from the second queue and added to the first queue; once the first queue has been replenished, proxy IPs are again extracted from the first queue. Meanwhile the second queue dynamically obtains proxy IPs supplied by the proxy IP service provider to replace those that were extracted, and waits for the crawler's next round of extraction. This cycle repeats, so the first queue is effectively replenished and the web crawler can crawl network data over a long period. Moreover, because the proxy IPs added are those from the second queue that proved valid, the quality of the proxy IPs in the replenished first queue improves, which addresses the low quality of the proxy IPs supplied by the provider.
As the above implementation shows, with the method adopted by this embodiment the web crawler can obtain website data by extracting proxy IPs from the first queue. When the number of proxy IPs in the first queue decreases, valid proxy IPs extracted from the second queue are added to the first queue, ensuring that the first queue holds high-quality proxy IPs. Faced with existing anti-crawler strategies, and especially when the proxy IPs supplied by a proxy service provider are of low quality, the invention uses the first queue to screen proxy IPs and reuse the valid ones in rotation. At the same time it uses the second queue to dynamically obtain valid proxy IPs, and supplies new, valid proxy IPs to the first queue when the number of proxy IPs there drops to a certain threshold, so that the web crawler can crawl data effectively over a long period.
To explain the proposed method in more detail, in particular how the web crawler processes proxy IPs after extracting them from the first queue and the second queue respectively, an embodiment of the present invention further provides a method for a web crawler to obtain website data. As shown in Fig. 2, the method includes the following steps:
201. Create the first queue and the second queue.
The first queue holds the existing proxy IPs obtained statically from the proxy IP service provider. These can additionally be screened to select the valid ones, that is, those through which the web crawler can successfully access the website, and the valid proxy IPs are added to the first queue. In addition, each proxy IP stored in the first queue has a corresponding consecutive crawl failure count. This count is the number of consecutive times the web crawler has failed to access web pages using the proxy IP. It serves as the indicator of proxy IP failure: when the count reaches a preset value, the proxy IP is judged to have failed and is deleted from the first queue. If the first queue is viewed as a standard queue, each node contains two fields: the first is the value of the proxy IP and the second is the consecutive crawl failure count for that proxy IP. The proxy IPs in the first queue must be initialized before first use by setting each consecutive failure count to 0. The first queue is also configured with a maximum threshold and a minimum threshold. The minimum threshold sets the lower bound on the number of proxy IPs in the first queue: when the number is less than or equal to this value, new proxy IPs must be supplemented from the second queue. The maximum threshold determines how many new proxy IPs are supplemented: once it is reached, no more are added, and the web crawler goes back to using the proxy IPs in the first queue.
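The two-field node and the initialization of the first queue might look like the following. This is a minimal sketch, not the patent's implementation: the class name `ProxyEntry`, the helper `build_first_queue`, the threshold constants and the addresses are all hypothetical names and values chosen for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class ProxyEntry:
    address: str                   # first field: the proxy IP value
    consecutive_failures: int = 0  # second field: initialized to 0 before first use


def build_first_queue(validated_addresses):
    """Create the first queue from statically obtained (and screened) proxies."""
    return deque(ProxyEntry(address) for address in validated_addresses)


# Hypothetical, user-defined empirical thresholds for the first queue.
MIN_THRESHOLD = 2
MAX_THRESHOLD = 5

first_queue = build_first_queue(["10.0.0.1:8080", "10.0.0.2:8080"])
```

Keeping the counter inside the queue node means it travels with the proxy IP each time the entry is moved from the head to the tail.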
The second queue holds new proxy IPs obtained dynamically, in real time, from the proxy IP service provider and screened for validity. Specifically, a proxy IP screening service obtains valid proxy IPs from the provider. The screening service is a service program independent of the web crawler: at a fixed, relatively low frequency it continuously obtains proxy IPs from the proxy IP acquisition service, screens them, and adds the valid ones to the second queue. In this embodiment the second queue is a buffer queue of finite length, so the number of proxy IPs it stores is limited. When the number of proxy IPs in the second queue has reached the upper limit and a new proxy IP is added, the earliest-added proxy IP is discarded: when the new proxy IP joins the tail, the proxy IP at the head is deleted. In other words, new proxy IPs replace old ones in the queue. Using a finite-length buffer queue ensures that the cached proxy IPs are always the most recent, which in turn ensures that the proxy IPs supplied to the web crawler for crawling, and used to replenish the first queue, are valid and up to date.
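A finite-length buffer queue with this drop-oldest behavior can be sketched with Python's `collections.deque` and its `maxlen` argument, which discards items from the opposite end when a full deque is appended to. The limit value, the callback name `on_screened_proxy` and the proxy labels are hypothetical; the screening service itself is not modeled here.

```python
from collections import deque

SECOND_QUEUE_LIMIT = 3  # hypothetical upper limit of the second queue

# A bounded deque: appending to a full deque drops the item at the head.
second_queue = deque(maxlen=SECOND_QUEUE_LIMIT)


def on_screened_proxy(address):
    """Hypothetical callback invoked by the screening service for each valid proxy."""
    second_queue.append(address)


# The screening service delivers four valid proxies over time.
for address in ["p1", "p2", "p3", "p4"]:
    on_screened_proxy(address)
```

After the fourth delivery the earliest entry, `p1`, has been replaced, so the buffer always holds the most recently screened proxy IPs.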
202. Extract a proxy IP from the first queue to obtain website data, and remove invalid proxy IPs from the first queue according to the validity of the proxy IP.
After the first queue and the second queue have been created, the web crawler can be started, with the first queue supplying it with proxy IPs. In this embodiment, when the web crawler uses a proxy IP extracted from the first queue to access a website and obtain website data, there are generally three cases:
1) The website data is captured successfully. This shows that the proxy IP is valid, so it is re-added to the tail of the first queue and its consecutive failure count is reset to 0, because failures from other causes during earlier crawls may have left the count at a non-zero value.
2) Capturing the website data fails and the proxy IP's consecutive failure count has not reached the preset failure value. The proxy IP is re-added to the tail of the first queue and its consecutive failure count is incremented by 1.
3) Capturing the website data fails and the proxy IP's consecutive failure count reaches the preset failure value. The proxy IP is judged to have failed and is not re-added to the first queue.
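The three cases above can be sketched as a single result handler. This is an illustrative sketch under stated assumptions: the names `ProxyEntry`, `handle_result` and `MAX_CONSECUTIVE_FAILURES` are hypothetical, and the handler increments the counter first and then compares it with the preset value, which is one reasonable reading of cases 2) and 3) rather than something the text fixes.

```python
from collections import deque
from dataclasses import dataclass

MAX_CONSECUTIVE_FAILURES = 3  # hypothetical preset failure value


@dataclass
class ProxyEntry:
    address: str
    consecutive_failures: int = 0


def handle_result(first_queue, entry, success):
    """Apply the three cases to an entry already taken from the head of the queue."""
    if success:
        # Case 1: valid proxy, reset the counter and re-add at the tail.
        entry.consecutive_failures = 0
        first_queue.append(entry)
        return "requeued"
    entry.consecutive_failures += 1
    if entry.consecutive_failures < MAX_CONSECUTIVE_FAILURES:
        # Case 2: failed, but below the preset value; keep it in rotation.
        first_queue.append(entry)
        return "requeued"
    # Case 3: preset value reached; the proxy is not re-added.
    return "removed"
```

A caller would pop an entry with `first_queue.popleft()`, attempt the request through `entry.address`, and pass the outcome to `handle_result`.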
As the proxy IPs in the first queue are used in rotation, some of them fail with increasing use and are removed from the first queue. The number of proxy IPs falls, the remaining valid proxy IPs are used more frequently, and they in turn fail faster.
203. When the number of proxy IPs in the first queue is less than the minimum threshold, extract a proxy IP from the second queue to obtain website data.
Before extracting a proxy IP from the first queue, the web crawler can first check the number of proxy IPs in the first queue. When the number is less than the minimum threshold, the crawler stops extracting from the first queue and extracts proxy IPs from the second queue to access the website. Besides checking before each extraction as described above, the number of proxy IPs in the first queue can instead be checked after a proxy IP is judged to have failed and is removed from the first queue. Compared with the former, this timing does not require a check on every extraction, only when the number of proxy IPs decreases, which saves some computing resources.
When the web crawler fails to obtain website data using a proxy IP from the second queue, that proxy IP is directly judged to have failed, and another proxy IP is extracted from the second queue to access the website. Because the second queue's proxy IPs are updated dynamically, proxy IPs in the second queue do not need to be reused in rotation in this embodiment: an extracted proxy IP that cannot access the website is simply deleted. Of course, when few proxy IPs are being obtained dynamically, the proxy IPs in the second queue can also be reused in rotation and removed from the second queue only after reaching a certain number of consecutive failures.
204. Add the proxy IPs in the second queue that can successfully obtain the website data to the first queue.
When the web crawler obtains website data successfully using a proxy IP from the second queue, that proxy IP is added to the first queue.
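Steps 203 and 204 together can be sketched as one helper: proxies taken from the second queue are discarded on failure and promoted to the first queue on success. The names `ProxyEntry` and `fetch_with_second_queue` are hypothetical, and `fetch` is an assumed caller-supplied function that returns the page content, or `None` when the access fails.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class ProxyEntry:
    address: str
    consecutive_failures: int = 0


def fetch_with_second_queue(second_queue, first_queue, fetch):
    """Try proxies from the second queue until one succeeds or the queue is empty."""
    while second_queue:
        address = second_queue.popleft()
        page = fetch(address)  # assumed to return None on failure
        if page is not None:
            # Step 204: a proxy that worked is added to the first queue.
            first_queue.append(ProxyEntry(address))
            return page
        # Step 203: a failed proxy from the second queue is simply dropped.
    return None
```

The loop stops at the first success, so proxies further back in the second queue stay there for later requests.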
While the web crawler is obtaining website data with proxy IPs from the second queue, valid proxy IPs are continually added to the first queue. Before each extraction from the second queue, the crawler checks whether the number of proxy IPs in the first queue has reached the maximum threshold. If it has not, the crawler continues to extract proxy IPs from the second queue to obtain website data; once it has, the crawler starts extracting proxy IPs from the head of the first queue again.
The steps above show the first queue and the second queue taking turns to supply the web crawler with proxy IPs. From the crawler's point of view, the process can be described more clearly and vividly by defining two working states for proxy IP extraction: a consumption state and a supplement state. In the consumption state, the web crawler obtains proxy IPs from the first queue, and the length of the first queue only decreases. When the length of the first queue falls below the minimum threshold, the crawler enters the supplement state, in which it obtains proxy IPs from the second queue and adds the valid ones to the first queue, so the length of the first queue only increases. When the length of the first queue reaches the maximum threshold, the crawler re-enters the consumption state. This cycle repeats, continually supplying the web crawler with valid proxy IPs so that it can access websites and obtain website data over a long period.
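The two working states can be sketched as a small state machine. This is a minimal sketch under several assumptions, not the patented implementation: the class `ProxyPool`, its method names and the `fetch` callback (returning `None` on failure) are hypothetical, the low-threshold check is performed only after a proxy is removed, and "less than" is used for the minimum threshold as in step 203.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class ProxyEntry:
    address: str
    consecutive_failures: int = 0


class ProxyPool:
    CONSUME, REPLENISH = "consume", "replenish"

    def __init__(self, first, second, low, high, max_failures=3):
        self.first = deque(ProxyEntry(a) for a in first)
        self.second = deque(second)
        self.low, self.high, self.max_failures = low, high, max_failures
        self.state = self.CONSUME if len(self.first) >= low else self.REPLENISH

    def request(self, fetch):
        """Make one access attempt; return the page, or None if it failed."""
        if self.state == self.CONSUME and self.first:
            return self._consume(fetch)
        self.state = self.REPLENISH
        return self._replenish(fetch)

    def _consume(self, fetch):
        entry = self.first.popleft()
        page = fetch(entry.address)
        if page is not None:
            entry.consecutive_failures = 0
            self.first.append(entry)
            return page
        entry.consecutive_failures += 1
        if entry.consecutive_failures < self.max_failures:
            self.first.append(entry)
        elif len(self.first) < self.low:
            # The queue shrank below the minimum threshold: switch state.
            self.state = self.REPLENISH
        return None

    def _replenish(self, fetch):
        if not self.second:
            return None  # nothing to draw from until the buffer is refilled
        address = self.second.popleft()
        page = fetch(address)
        if page is not None:
            self.first.append(ProxyEntry(address))
            if len(self.first) >= self.high:
                self.state = self.CONSUME
        return page
```

In the consumption state the first queue can only shrink, and in the supplement state it can only grow, which matches the description of the two states above. Refilling `second` from the screening service is left out of the sketch.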
Further, as an implementation of the above method, an embodiment of the present invention provides a device for a web crawler to obtain website data. This device embodiment corresponds to the foregoing method embodiment. For ease of reading, the details of the method embodiment are not repeated one by one here, but it should be understood that the device in this embodiment can implement everything in the method embodiment. The device is used in network data acquisition equipment that runs a web crawler. As shown in Fig. 3, the device includes:
First extraction unit 31, configured to extract proxy IPs from a first queue for obtaining website data;
Deleting unit 32, configured to remove invalid proxy IPs from the first queue according to the validity of the proxy IPs extracted by the first extraction unit 31;
Second extraction unit 33, configured to extract proxy IPs from a second queue for obtaining website data when the number of proxy IPs in the first queue is less than a lowest threshold;
Adding unit 34, configured to add to the first queue the proxy IPs with which the second extraction unit 33 can effectively extract the website data, until the number of proxy IPs in the first queue reaches a highest threshold, after which proxy IPs are again extracted from the first queue for obtaining website data.
Further, as shown in Fig. 4, the device includes:
First creating unit 35, configured to create the first queue by statically acquiring proxy IPs before the first extraction unit 31 extracts proxy IPs from the first queue for obtaining website data, where each proxy IP in the first queue is provided with a consecutive crawl failure count.
Further, as shown in Fig. 4, the deleting unit 32 includes:
Adding module 321, configured to, when website data is successfully obtained using the proxy IP, add the proxy IP to the first queue and reset its consecutive crawl failure count to zero;
Recording module 322, configured to, when obtaining website data using the proxy IP fails, record the consecutive crawl failure count of the proxy IP;
Deleting module 323, configured to delete the proxy IP when the consecutive crawl failure count recorded by the recording module 322 reaches a preset value.
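The behaviour of the three modules above can be illustrated with a minimal sketch. The function name `record_result`, the dictionary of counters and the preset value of 3 are assumptions for illustration. The sketch also assumes that the proxy IP has already been taken from the head of the first queue, and that a failed proxy IP which has not yet reached the preset value is returned to the tail for another attempt, which the text does not state explicitly.

```python
from collections import deque

MAX_CONSECUTIVE_FAILURES = 3  # assumed preset value


def record_result(first_queue, failures, proxy, success):
    """Update the first queue after one crawl attempt made through `proxy`.

    `failures` maps each proxy IP to its consecutive crawl failure count.
    """
    if success:
        failures[proxy] = 0        # reset the consecutive failure count
        first_queue.append(proxy)  # put the proxy back for reuse
        return
    failures[proxy] = failures.get(proxy, 0) + 1
    if failures[proxy] >= MAX_CONSECUTIVE_FAILURES:
        del failures[proxy]        # invalid proxy: drop it for good
    else:
        first_queue.append(proxy)  # assumed: keep it for another attempt
```

Counting only consecutive failures means that a proxy IP that fails occasionally but recovers is kept, while one that fails several times in a row is removed.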
Further, as shown in Fig. 4, the device includes:
Acquiring unit 36, configured to acquire proxy IPs by means of a proxy IP screening service before the second extraction unit 33 extracts proxy IPs from the second queue for obtaining website data, where the proxy IP screening service continuously obtains proxy IPs from a proxy IP service at a fixed frequency and screens out the valid ones;
Second creating unit 37, configured to create the second queue from the proxy IPs acquired by the acquiring unit 36;
Updating unit 38, configured to, when the number of proxy IPs in the second queue created by the second creating unit 37 reaches an upper limit, replace the earliest-added proxy IP in the second queue with the newly added proxy IP.
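The replacement rule of the updating unit matches the behaviour of a bounded first-in, first-out buffer. As a hedged sketch, Python's `collections.deque` with `maxlen` discards the oldest entry when a new one is appended to a full queue. The function names and the duplicate check below are assumptions for illustration, not details from the patent text.

```python
from collections import deque


def make_second_queue(upper_limit):
    # A deque with maxlen drops its oldest item when a new one is
    # appended to a full queue.
    return deque(maxlen=upper_limit)


def add_screened_proxies(second_queue, proxies):
    """Append proxy IPs delivered by the (hypothetical) screening service."""
    for proxy in proxies:
        if proxy not in second_queue:  # assumed: skip duplicates
            second_queue.append(proxy)
```

With an upper limit of 3, adding four proxy IPs in order leaves only the three most recent ones in the queue, so the second queue always holds the freshest screening results.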
Further, as shown in Fig. 4, the adding unit 34 includes:
Adding module 341, configured to add the proxy IP to the first queue when website data is successfully obtained using a proxy IP from the second queue;
Judging module 342, configured to judge whether the number of proxy IPs in the first queue has reached the highest threshold; if so, proxy IPs are extracted from the first queue to obtain website data; if not, proxy IPs are extracted from the second queue to obtain website data.
Further, as shown in Fig. 4, the second extraction unit 33 is also configured to, when obtaining website data using a proxy IP from the second queue fails, delete that proxy IP and extract a new proxy IP from the second queue to obtain website data.
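The interaction between the second extraction unit and the adding unit can be sketched as a single replenishment loop. This is a simplified illustration under stated assumptions: `fetch` is a hypothetical callable that returns True when a crawl through the given proxy IP succeeds, and each proxy IP is removed from the second queue once it has been tried, which the text only states for the failure case.

```python
from collections import deque


def replenish(first_queue, second_queue, high, fetch):
    """Move working proxy IPs from the second queue into the first queue.

    Returns True once the first queue has reached the highest threshold
    `high`, and False if the second queue ran out before that.
    """
    while len(first_queue) < high and second_queue:
        proxy = second_queue.popleft()
        if fetch(proxy):
            first_queue.append(proxy)  # valid proxy: add to first queue
        # A failed proxy is not re-queued; popleft() already removed it.
    return len(first_queue) >= high
```

For example, with a first queue holding one proxy IP, a highest threshold of 3 and a second queue of four candidates of which the second fails, the loop stops after trying three candidates and leaves the fourth in the second queue.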
In summary, with the method and device for a web crawler to obtain website data adopted in the embodiments of the present invention, the web crawler obtains website data by extracting proxy IPs from the first queue. When the number of proxy IPs in the first queue decreases, valid proxy IPs extracted from the second queue are added to the first queue, so that the first queue keeps holding high-quality proxy IPs. Faced with existing anti-crawler strategies, and especially when the proxy IPs supplied by a proxy service provider are of low quality, the present invention uses the first queue to screen proxy IPs and reuse the valid ones. Meanwhile, it uses the second queue to obtain valid proxy IPs dynamically, and supplies new, valid proxy IPs to the first queue when the number of proxy IPs in the first queue drops to a certain threshold, thereby ensuring that the web crawler can crawl data effectively over a long period.
The device for a web crawler to obtain website data includes a processor and a memory. The first extraction unit, deleting unit, second extraction unit, adding unit and so on are stored in the memory as program units, and the processor executes these program units stored in the memory to implement the corresponding functions.
The processor contains a kernel, which retrieves the corresponding program unit from the memory. One or more kernels may be provided. By adjusting kernel parameters, a large number of proxy IPs can be used effectively and repeatedly, and when existing proxy IPs fail, new proxy IPs are dynamically acquired and screened to replace the failed ones.
The memory may include computer-readable media in the form of volatile memory, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). The memory includes at least one memory chip.
The present application also provides a computer program product which, when executed on a data processing device, is adapted to execute program code initialized with the following method steps: extracting proxy IPs from a first queue for obtaining website data; removing invalid proxy IPs from the first queue according to the validity of the proxy IPs; when the number of proxy IPs in the first queue is less than a lowest threshold, extracting proxy IPs from a second queue for obtaining website data; and adding the proxy IPs in the second queue that can effectively extract the website data to the first queue until the number of proxy IPs in the first queue reaches a highest threshold, after which proxy IPs are again extracted from the first queue for obtaining website data.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system or a computer program product. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM and optical storage) that contain computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of the method, device (system) and computer program product according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor or other programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to work in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, and the instruction means implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing. The instructions executed on the computer or other programmable device thus provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), an input/output interface, a network interface and a memory.
The memory may include computer-readable media in the form of volatile memory, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprise", "include" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, commodity or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, commodity or device that includes the element.
The above are merely embodiments of the present application and are not intended to limit the present application. For those skilled in the art, the present application may have various modifications and variations. Any modification, equivalent replacement or improvement made within the spirit and principle of the present application shall fall within the scope of the claims of the present application.

Claims (10)

1. A method for a web crawler to obtain website data, characterized in that the method comprises:
extracting proxy IPs from a first queue for obtaining website data;
removing invalid proxy IPs from the first queue according to the validity of the proxy IPs;
when the number of proxy IPs in the first queue is less than a lowest threshold, extracting proxy IPs from a second queue for obtaining website data;
adding the proxy IPs in the second queue that can effectively extract the website data to the first queue until the number of proxy IPs in the first queue reaches a highest threshold, and then extracting proxy IPs from the first queue for obtaining website data.
2. The method according to claim 1, characterized in that, before extracting proxy IPs from the first queue for obtaining website data, the method comprises:
creating the first queue by statically acquiring proxy IPs, wherein each proxy IP in the first queue is provided with a consecutive crawl failure count.
3. The method according to claim 2, characterized in that removing invalid proxy IPs from the first queue according to the validity of the proxy IPs comprises:
when website data is successfully obtained using the proxy IP, adding the proxy IP to the first queue and resetting the consecutive crawl failure count to zero;
when obtaining website data using the proxy IP fails, recording the consecutive crawl failure count of the proxy IP;
when the consecutive crawl failure count of the proxy IP reaches a preset value, deleting the proxy IP.
4. The method according to claim 1, characterized in that, before extracting proxy IPs from the second queue for obtaining website data, the method comprises:
acquiring proxy IPs by means of a proxy IP screening service, wherein the proxy IP screening service continuously obtains proxy IPs from a proxy IP service at a fixed frequency and screens out the valid proxy IPs;
creating the second queue from the proxy IPs acquired by the proxy IP screening service;
when the number of proxy IPs in the second queue reaches an upper limit, replacing the earliest-added proxy IP in the second queue with the newly added proxy IP.
5. The method according to claim 1, characterized in that adding the proxy IPs in the second queue that can effectively extract the website data to the first queue comprises:
when website data is successfully obtained using a proxy IP from the second queue, adding the proxy IP to the first queue;
judging whether the number of proxy IPs in the first queue has reached the highest threshold;
if so, extracting proxy IPs from the first queue to obtain website data;
if not, extracting proxy IPs from the second queue to obtain website data.
6. The method according to claim 1, characterized in that extracting proxy IPs from the second queue for obtaining website data comprises:
when obtaining website data using a proxy IP from the second queue fails, deleting the proxy IP and extracting a new proxy IP from the second queue to obtain website data.
7. A device for a web crawler to obtain website data, characterized in that the device comprises:
a first extraction unit, configured to extract proxy IPs from a first queue for obtaining website data;
a deleting unit, configured to remove invalid proxy IPs from the first queue according to the validity of the proxy IPs extracted by the first extraction unit;
a second extraction unit, configured to extract proxy IPs from a second queue for obtaining website data when the number of proxy IPs in the first queue is less than a lowest threshold;
an adding unit, configured to add to the first queue the proxy IPs with which the second extraction unit can effectively extract the website data, until the number of proxy IPs in the first queue reaches a highest threshold, after which proxy IPs are again extracted from the first queue for obtaining website data.
8. The device according to claim 7, characterized in that the device comprises:
a first creating unit, configured to create the first queue by statically acquiring proxy IPs before the first extraction unit extracts proxy IPs from the first queue for obtaining website data, wherein each proxy IP in the first queue is provided with a consecutive crawl failure count.
9. The device according to claim 8, characterized in that the deleting unit comprises:
an adding module, configured to, when website data is successfully obtained using the proxy IP, add the proxy IP to the first queue and reset the consecutive crawl failure count to zero;
a recording module, configured to, when obtaining website data using the proxy IP fails, record the consecutive crawl failure count of the proxy IP;
a deleting module, configured to delete the proxy IP when the consecutive crawl failure count recorded by the recording module reaches a preset value.
10. The device according to claim 7, characterized in that the device comprises:
an acquiring unit, configured to acquire proxy IPs by means of a proxy IP screening service before the second extraction unit extracts proxy IPs from the second queue for obtaining website data, wherein the proxy IP screening service continuously obtains proxy IPs from a proxy IP service at a fixed frequency and screens out the valid proxy IPs;
a second creating unit, configured to create the second queue from the proxy IPs acquired by the acquiring unit;
an updating unit, configured to, when the number of proxy IPs in the second queue created by the second creating unit reaches an upper limit, replace the earliest-added proxy IP in the second queue with the newly added proxy IP.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610899608.1A CN107957999A (en) 2016-10-14 2016-10-14 A kind of web crawlers obtains the method and device of website data

Publications (1)

Publication Number Publication Date
CN107957999A true CN107957999A (en) 2018-04-24

Family

ID=61953679

Country Status (1)

Country Link
CN (1) CN107957999A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Applicant after: BEIJING GRIDSUM TECHNOLOGY Co.,Ltd.

Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing

Applicant before: BEIJING GRIDSUM TECHNOLOGY Co.,Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20180424
RJ01 Rejection of invention patent application after publication