CN107957999A - Method and device for a web crawler to obtain website data - Google Patents
Method and device for a web crawler to obtain website data
- Publication number
- CN107957999A, CN201610899608.1A
- Authority
- CN
- China
- Prior art keywords
- agent
- queue
- website data
- extracted
- failure
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method and device for a web crawler to obtain website data, and relates to the field of computer technology. Its main purpose is to ensure that the large number of proxy IPs in use are valid and are reused, and, when existing proxy IPs fail, to dynamically obtain new proxy IPs and screen them to replace the failed ones. The main technical solution is: extract a proxy IP from a first queue and use it to obtain website data; remove invalid proxy IPs from the first queue according to the validity of the proxy IPs; when the number of proxy IPs in the first queue falls below a minimum threshold, extract proxy IPs from a second queue to obtain website data; add the proxy IPs from the second queue that successfully obtain the website data to the first queue until the number of proxy IPs in the first queue reaches a maximum threshold, and then resume extracting proxy IPs from the first queue to obtain website data. The invention is mainly used by web crawlers to crawl network data.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a method and device for a web crawler to obtain website data.
Background
A web crawler is a tool for automatically obtaining data from websites. From the website's point of view, data fetched by a web crawler consumes site resources just as access by real users does, and a crawler that grabs large amounts of data can consume far more resources than normal user access. Designers of many websites therefore rate-limit access suspected to come from web crawlers, verify identity by means such as CAPTCHAs, or even block access from certain IP addresses. These measures can be called the website's anti-crawler strategy, and all of them cause problems for a web crawler's data crawling.
Web crawlers in wide use today have several ways to cope with a website's anti-crawler strategy. In general, for a website that limits crawl speed, the crawler can reduce its access frequency. However, once the crawler's IP is blocked by a website, the website's data can only be crawled through proxy IPs. Proxy IPs are generally obtained from a proxy IP service provider, which can dynamically supply a large number of proxy IPs in a short time for the web crawler to choose from. A valid proxy IP allows the website to be crawled without being blocked; however, when the quality of the proxy IPs supplied by the provider is low, a large number of them cannot be used normally.
In existing implementations, when the proxy IPs obtained from a proxy IP service provider are of low quality, crawling is usually done with repeated retries: after one failed access to a web page, a new proxy IP is obtained and the website is crawled again, until the page is crawled successfully or a certain number of retries is reached. This method does not solve the problem of low proxy IP quality; the repeated retries reduce crawling efficiency, and valid proxy IPs are not fully utilized. Another method is to screen the proxy IPs. Screening has to take place before crawling: usually a large number of proxy IPs are obtained first, their validity is verified, and the verified proxy IPs are then supplied to the web crawler. This method effectively improves the validity of the proxy IPs in use, but the set of proxy IPs it yields is limited and fixed, and it cannot dynamically track the valid proxy IPs that the service provider supplies. That is, after the provider supplies new proxy IPs, they cannot be handed to the web crawler in real time because their validity has not been determined. As a result, during a long crawl, the fixed proxy IPs fail over time or are even blocked because they are used too often.
Summary of the invention
In view of this, the present invention provides a method and device for a web crawler to obtain website data. Its main purpose is to ensure that the large number of proxy IPs in use are valid and are reused, and, when existing proxy IPs fail, to dynamically obtain new proxy IPs and screen them to replace the failed ones.
To achieve the above purpose, the present invention mainly provides the following technical solution:
In one aspect, the present invention provides a method for a web crawler to obtain website data, the method including:
Extracting a proxy IP from a first queue to obtain website data;
Removing invalid proxy IPs from the first queue according to the validity of the proxy IPs;
When the number of proxy IPs in the first queue is below a minimum threshold, extracting proxy IPs from a second queue to obtain website data;
Adding the proxy IPs from the second queue that successfully obtain the website data to the first queue until the number of proxy IPs in the first queue reaches a maximum threshold, and then extracting proxy IPs from the first queue again to obtain website data.
Preferably, before extracting a proxy IP from the first queue to obtain website data, the method includes:
Creating the first queue by statically obtaining proxy IPs, each proxy IP in the first queue being provided with a consecutive crawl failure count.
Preferably, removing invalid proxy IPs from the first queue according to the validity of the proxy IPs includes:
When website data is obtained successfully with the proxy IP, adding the proxy IP back to the first queue and resetting its consecutive crawl failure count;
When obtaining website data with the proxy IP fails, recording the consecutive crawl failure count of the proxy IP;
When the consecutive crawl failure count of the proxy IP reaches a preset value, deleting the proxy IP.
Preferably, before extracting proxy IPs from the second queue to obtain website data, the method includes:
Obtaining proxy IPs through a proxy IP screening service, the screening service continuously obtaining proxy IPs from a proxy IP acquisition service at a fixed frequency and screening out the valid ones;
Creating the second queue from the proxy IPs obtained through the proxy IP screening service;
When the number of proxy IPs in the second queue reaches an upper limit, replacing the proxy IP that was added to the second queue earliest with the newly added proxy IP.
Preferably, adding the proxy IPs from the second queue that successfully obtain the website data to the first queue includes:
When website data is obtained successfully with a proxy IP from the second queue, adding the proxy IP to the first queue;
Judging whether the number of proxy IPs in the first queue has reached the maximum threshold;
If it has, extracting proxy IPs from the first queue to obtain website data;
If it has not, extracting proxy IPs from the second queue to obtain website data.
Preferably, extracting proxy IPs from the second queue to obtain website data includes:
When obtaining website data with a proxy IP from the second queue fails, deleting the proxy IP and extracting a new proxy IP from the second queue to obtain website data.
In another aspect, the present invention also provides a device for a web crawler to obtain website data, the device including:
A first extraction unit, configured to extract a proxy IP from a first queue to obtain website data;
A deletion unit, configured to remove invalid proxy IPs from the first queue according to the validity of the proxy IPs extracted by the first extraction unit;
A second extraction unit, configured to extract proxy IPs from a second queue to obtain website data when the number of proxy IPs in the first queue is below a minimum threshold;
An adding unit, configured to add the proxy IPs extracted by the second extraction unit that successfully obtain the website data to the first queue until the number of proxy IPs in the first queue reaches a maximum threshold, after which proxy IPs are extracted from the first queue again to obtain website data.
Preferably, the device includes:
A first creation unit, configured to create the first queue by statically obtaining proxy IPs before the first extraction unit extracts a proxy IP from the first queue to obtain website data, each proxy IP in the first queue being provided with a consecutive crawl failure count.
Preferably, the deletion unit includes:
An adding module, configured to add the proxy IP back to the first queue and reset its consecutive crawl failure count when website data is obtained successfully with the proxy IP;
A recording module, configured to record the consecutive crawl failure count of the proxy IP when obtaining website data with the proxy IP fails;
A deletion module, configured to delete the proxy IP when the consecutive crawl failure count recorded by the recording module reaches a preset value.
Preferably, the device includes:
An acquisition unit, configured to obtain proxy IPs through a proxy IP screening service before the second extraction unit extracts proxy IPs from the second queue to obtain website data, the screening service continuously obtaining proxy IPs from a proxy IP acquisition service at a fixed frequency and screening out the valid ones;
A second creation unit, configured to create the second queue from the proxy IPs obtained by the acquisition unit;
An update unit, configured to replace the proxy IP that was added earliest to the second queue created by the second creation unit with the newly added proxy IP when the number of proxy IPs in the second queue reaches an upper limit.
Preferably, the adding unit includes:
An adding module, configured to add a proxy IP from the second queue to the first queue when website data is obtained successfully with it;
A judging module, configured to judge whether the number of proxy IPs in the first queue has reached the maximum threshold; if it has, proxy IPs are extracted from the first queue to obtain website data; if it has not, proxy IPs are extracted from the second queue to obtain website data.
Preferably, the second extraction unit is further configured to delete a proxy IP from the second queue when obtaining website data with it fails, and to extract a new proxy IP from the second queue to obtain website data.
With the method and device proposed above, the web crawler obtains website data by extracting proxy IPs from the first queue. When the number of proxy IPs in the first queue decreases, proxy IPs from the second queue that successfully obtain website data are added to the first queue, which keeps the first queue stocked with high-quality proxy IPs. Against existing anti-crawler strategies, and especially when the proxy IPs supplied by the proxy service provider are of low quality, the invention uses the first queue to screen proxy IPs and to reuse the valid ones in rotation. At the same time, it uses the second queue to obtain valid proxy IPs dynamically, and supplies new, valid proxy IPs to the first queue when the number of proxy IPs there drops to a certain threshold, so that the web crawler can crawl data effectively over a long period.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered limiting of the invention. Throughout the drawings, the same reference numerals denote the same components. In the drawings:
Fig. 1 shows a flowchart of a method for a web crawler to obtain website data according to an embodiment of the invention;
Fig. 2 shows a flowchart of another method for a web crawler to obtain website data according to an embodiment of the invention;
Fig. 3 shows a block diagram of a device for a web crawler to obtain website data according to an embodiment of the invention;
Fig. 4 shows a block diagram of another device for a web crawler to obtain website data according to an embodiment of the invention.
Embodiments
Exemplary embodiments of the invention are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the invention, it should be understood that the invention may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the invention can be understood more thoroughly and its scope can be fully conveyed to those skilled in the art.
An embodiment of the invention provides a method for a web crawler to obtain website data. As shown in Fig. 1, the method is applied while a web crawler accesses websites through proxy IPs to crawl data, and in particular to the screening of proxy IPs. Its specific steps include:
101. Extract a proxy IP from the first queue to obtain website data.
The first queue is a queue made up of proxy IPs, and the proxy IPs in it are obtained statically. Static acquisition is defined in contrast to the dynamic acquisition used for the second queue: it means obtaining the currently available proxy IPs once, in a non-dynamic way. Typically, a large number of proxy IPs are obtained through the acquisition service of a proxy IP service provider for the web crawler to use when accessing websites to crawl data. The proxy IPs obtained can additionally be screened, and those that can successfully access the website are added to the first queue.
This embodiment stores proxy IPs in a queue because of the queue's first-in, first-out property. In the usage scenario of proxy IPs, the current practice for coping with a website's anti-crawler strategy, when many proxy IPs are available, is to use a different proxy IP for each access as far as possible; this clearly does not make full use of the valid proxy IPs. In practice, the rule to follow is that the interval between two uses of the same proxy IP should be as long as possible. With a queue, the proxy IP at the head is used each time and is put back at the tail after use, so every other proxy IP is used before it is used again, which makes the interval between uses of any one proxy IP as long as possible.
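The head-to-tail rotation described above can be sketched in a few lines. This is a minimal illustration, assuming Python's `collections.deque` as the queue; the function name `rotate_proxies` and the sample addresses are hypothetical and do not come from the patent.

```python
from collections import deque


def rotate_proxies(proxies, requests):
    """Return the proxy chosen for each request when proxies are taken from
    the head of a FIFO queue and put back at the tail after use."""
    queue = deque(proxies)
    used = []
    for _ in range(requests):
        proxy = queue.popleft()  # always use the proxy at the head
        used.append(proxy)
        queue.append(proxy)      # put it back at the tail after use
    return used


# Hypothetical addresses, for illustration only.
order = rotate_proxies(["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"], 6)
```

With three proxies, each one is reused only after the other two have been used, which is the longest interval a three-element pool allows.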
102. Remove invalid proxy IPs from the first queue according to the validity of the proxy IPs.
The validity of a proxy IP is reflected in whether the web crawler can successfully access the website through it and obtain website data. After the web crawler extracts a proxy IP from the first queue, the result of using it shows whether it is valid: when the proxy IP cannot be used to access the website, it is judged to be an invalid proxy IP. To allow for chance, that is, transient or sporadic access failures caused by the website or the server side, the number of access failures can additionally be accumulated, and the proxy IP is judged invalid only after it has failed a certain number of consecutive accesses.
Once a proxy IP is confirmed to be invalid, it is removed and is no longer added back to the first queue.
Note that each time the web crawler accesses a website it extracts a new proxy IP from the first queue, and after an access fails it likewise extracts a new proxy IP for the next access instead of retrying with the proxy IP used last time. This avoids repeatedly accessing a faulty, unreachable website with the same proxy IP, which would lead to a wrong judgment about that proxy IP.
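The consecutive-failure rule can be illustrated with a small helper that scans a sequence of access outcomes for one proxy IP. This is only a sketch: the function name `index_where_invalid` and the default preset value of 3 are assumptions, since the patent does not fix the preset value.

```python
def index_where_invalid(outcomes, max_consecutive_failures=3):
    """Return the index of the access at which the proxy is judged invalid,
    or None if it never accumulates enough consecutive failures.

    outcomes: iterable of booleans, True for a successful access.
    max_consecutive_failures: assumed preset value (not given in the source).
    """
    failures = 0
    for i, ok in enumerate(outcomes):
        # A success resets the streak; a failure extends it.
        failures = 0 if ok else failures + 1
        if failures >= max_consecutive_failures:
            return i
    return None


# Sporadic failures separated by successes do not invalidate the proxy.
sporadic = index_where_invalid([False, True, False, False, True])
# Three failures in a row do.
dead = index_where_invalid([True, False, False, False])
```

Counting only consecutive failures is what lets a proxy IP survive an occasional failure that was really caused by the website.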
103. When the number of proxy IPs in the first queue is below the minimum threshold, extract proxy IPs from the second queue to obtain website data.
As proxy IPs in the first queue fail, the number of valid proxy IPs gradually decreases, which shortens the interval between uses of each remaining proxy IP and so speeds up their failure. To avoid this, new proxy IPs need to be added to the first queue to lengthen the interval. This embodiment sets a minimum threshold for the first queue: when the number of proxy IPs in the first queue falls below this threshold, the web crawler extracts proxy IPs from the second queue to access the website and crawl its data. The minimum threshold is an empirical value that can be set as needed; its size determines how frequently the proxy IPs in the first queue are reused. The proxy IPs in the second queue are obtained dynamically from the proxy IP service provider; that is, they are the newest proxy IPs the provider supplies in real time.
104. Add the proxy IPs from the second queue that successfully obtain website data to the first queue.
The web crawler accesses the website and crawls its data with a proxy IP extracted from the second queue. After a successful access, the proxy IP is added to the first queue, which replenishes the number of proxy IPs there.
Note that once the number of proxy IPs in the first queue reaches the maximum threshold, the web crawler stops extracting proxy IPs from the second queue and uses the proxy IPs in the first queue instead. The maximum threshold is an empirical value set for the first queue and can be chosen as needed. When the first queue has an upper size limit, the maximum threshold can be that limit. It can also be set according to the number of proxy IPs in the second queue, such that the difference between the maximum and minimum thresholds is smaller than the number of proxy IPs in the second queue. This ensures that the second queue can supply enough proxy IPs to the first queue for the web crawler to return to extracting from the first queue, completing the alternation between the two queues.
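The relationship between the two thresholds and the second queue can be expressed as a simple configuration check. This is a hedged sketch: the class `PoolThresholds`, its field names, and the use of the second queue's capacity as a stand-in for "the number of proxy IPs in the second queue" are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolThresholds:
    min_threshold: int          # below this, switch to the second queue
    max_threshold: int          # at this, switch back to the first queue
    second_queue_capacity: int  # assumed proxy for the second queue's size

    def validate(self):
        """Raise ValueError if the thresholds cannot support the cycle."""
        if not 0 <= self.min_threshold < self.max_threshold:
            raise ValueError("need 0 <= min_threshold < max_threshold")
        refill_gap = self.max_threshold - self.min_threshold
        if refill_gap >= self.second_queue_capacity:
            raise ValueError("refill gap must be smaller than the second queue")
        return self


# Example values are arbitrary; the patent treats both thresholds as empirical.
ok = PoolThresholds(min_threshold=20, max_threshold=100,
                    second_queue_capacity=200).validate()
```

Because some proxy IPs from the second queue will fail, a real deployment would probably want the second queue to be comfortably larger than the refill gap rather than just above it.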
In this embodiment, the web crawler uses the first queue as its main source of proxy IPs. When the first queue runs short, valid proxy IPs are extracted from the second queue and added to the first queue; once the first queue has been replenished, proxy IPs are extracted from it again. Meanwhile, the second queue dynamically obtains the proxy IPs supplied by the service provider to replace those that were extracted, and waits for the web crawler's next extraction. This cycle repeats, so the proxy IPs in the first queue are replenished effectively and the web crawler can crawl network data over a long period. Moreover, because the proxy IPs added are those from the second queue that proved valid, the quality of the proxy IPs in the replenished first queue improves, which addresses the low quality of the proxy IPs supplied by the service provider.
As the above implementation shows, with the method adopted in this embodiment the web crawler obtains website data by extracting proxy IPs from the first queue. When the number of proxy IPs in the first queue decreases, valid proxy IPs extracted from the second queue are added to the first queue, which keeps the first queue stocked with high-quality proxy IPs. Against existing anti-crawler strategies, and especially when the proxy IPs supplied by the proxy service provider are of low quality, the invention uses the first queue to screen proxy IPs and to reuse the valid ones in rotation. At the same time, it uses the second queue to obtain valid proxy IPs dynamically, and supplies new, valid proxy IPs to the first queue when the number of proxy IPs there drops to a certain threshold, so that the web crawler can crawl data effectively over a long period.
To explain the proposed method in more detail, and in particular how the web crawler handles proxy IPs after extracting them from the first queue and the second queue respectively, an embodiment of the invention also provides another method for a web crawler to obtain website data. As shown in Fig. 2, the method includes the following steps:
201. Create the first queue and the second queue.
The first queue holds existing proxy IPs obtained statically from the proxy IP service provider. These can additionally be screened to select the valid ones, that is, those through which the web crawler can successfully access the website, and the valid proxy IPs are added to the first queue. Each proxy IP stored in the first queue also has a corresponding consecutive crawl failure count, which is the number of consecutive times the web crawler has failed to access a website using that proxy IP. It serves as the indicator of proxy IP failure: when the count reaches a preset value, the proxy IP is judged invalid and is deleted from the first queue. If the first queue is viewed as a standard queue, each node contains two fields: the first is the value of the proxy IP, and the second is the number of consecutive crawl failures with that proxy IP. The proxy IPs in the first queue must be initialized before first use by setting each consecutive failure count to 0. The settings of the first queue also include a maximum threshold and a minimum threshold. The minimum threshold sets the lower limit on the number of proxy IPs in the first queue: when the number is less than or equal to this value, new proxy IPs need to be added from the second queue. The maximum threshold determines how many new proxy IPs are added: when it is reached, no more are added, and the web crawler is again supplied from the first queue.
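The two-field node and its initialization could look like the following. This is a minimal sketch; the names `ProxyEntry`, `consecutive_failures`, and `build_first_queue` are illustrative choices, not terms from the patent.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class ProxyEntry:
    address: str                   # first field: the proxy IP value
    consecutive_failures: int = 0  # second field: initialized to 0


def build_first_queue(addresses):
    """Wrap each statically obtained address in a node with a zero counter."""
    return deque(ProxyEntry(address) for address in addresses)


# Hypothetical addresses, for illustration only.
first_queue = build_first_queue(["10.0.0.1:8080", "10.0.0.2:8080"])
```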
The second queue holds new proxy IPs obtained dynamically, in real time, from the proxy IP service provider and screened for validity. Specifically, valid proxy IPs can be obtained from the provider through a proxy IP screening service. The screening service is a service program independent of the web crawler. It continuously obtains proxy IPs from the proxy IP acquisition service at a fixed and relatively low frequency, screens them, and adds the valid ones to the second queue. In this embodiment, the second queue is a buffer queue of finite length; that is, the number of proxy IPs it stores is limited. When the number of proxy IPs in the second queue has reached the upper limit and a new proxy IP is added, the proxy IP that was added earliest is discarded: when the new proxy IP is appended to the tail of the second queue, the proxy IP at its head is deleted. In other words, new proxy IPs replace old ones. Using a finite-length buffer queue keeps the cached proxy IPs up to date, which in turn ensures that the proxy IPs supplied to the web crawler for crawling, and those added to the first queue, are valid and recent.
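A finite-length buffer with this replace-the-oldest behavior is available directly in Python's standard library: a `collections.deque` created with `maxlen` discards the item at the opposite end when a new item is appended to a full deque. The capacity of 3 and the labels below are arbitrary values chosen for illustration.

```python
from collections import deque

# Assumed upper limit of 3 for the second queue; the patent does not fix one.
second_queue = deque(maxlen=3)

# Hypothetical proxy labels arriving from the screening service in order.
for address in ["p1", "p2", "p3", "p4"]:
    second_queue.append(address)  # appending to a full deque drops the head
```

After the fourth append, `p1` (the earliest entry) has been dropped and the queue holds the three most recent proxies.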
202. Extract a proxy IP from the first queue to obtain website data, and remove invalid proxy IPs from the first queue according to the validity of the proxy IPs.
Once the first queue and the second queue have been created, the web crawler can be started, with the first queue supplying it with proxy IPs. In this embodiment, when the web crawler accesses a website and obtains website data with a proxy IP extracted from the first queue, there are generally three cases:
1) The website content is captured successfully. The proxy IP is valid, so it is put back at the tail of the first queue and its consecutive failure count is reset to 0, because failures with other causes during earlier crawling may have left the count at a nonzero value.
2) Capturing the website content fails, and the consecutive failure count of the proxy IP has not reached the preset failure value. The proxy IP is put back at the tail of the first queue and its consecutive failure count is incremented by 1.
3) Capturing the web page content fails, and the consecutive failure count of the proxy IP reaches the preset failure value. The proxy IP is judged to have failed and is not added back to the first queue.
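The three cases can be sketched as one function applied to a proxy IP that has just been taken from the head of the first queue. This is an illustrative sketch only: the function name `settle`, the dict layout, the returned labels, and the preset value of 3 are assumptions, and the code increments the counter before comparing it with the preset value, which is one reading of cases 2 and 3.

```python
from collections import deque


def settle(first_queue, entry, success, max_failures=3):
    """Apply the three cases to an entry already removed from the queue head.

    entry: dict with keys "address" and "failures" (hypothetical layout).
    Returns a label naming the case that applied.
    """
    if success:
        entry["failures"] = 0      # case 1: reset and re-append at the tail
        first_queue.append(entry)
        return "reset"
    entry["failures"] += 1
    if entry["failures"] < max_failures:
        first_queue.append(entry)  # case 2: count the failure, keep the proxy
        return "counted"
    return "removed"               # case 3: do not re-append


queue = deque([{"address": "10.0.0.1:8080", "failures": 0}])
results = []
for success in [False, True, False, False, False]:
    entry = queue.popleft()
    results.append(settle(queue, entry, success))
    if not queue:                  # the only proxy was removed
        break
```

In the run above, the single failure before the success is forgiven, and the proxy is removed only after three failures in a row.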
As the proxy IPs in the first queue are reused over and over, some of them fail as the number of uses grows and are removed from the first queue. The number of proxy IPs shrinks, the remaining valid proxy IPs are used more often, and they fail faster.
203. When the number of proxy IPs in the first queue is below the minimum threshold, extract proxy IPs from the second queue to obtain website data.
Before extracting a proxy IP from the first queue, the web crawler can first check the number of proxy IPs in it. When the number is below the minimum threshold, the crawler no longer extracts from the first queue and instead extracts proxy IPs from the second queue to access the website. Besides checking before each extraction, the number of proxy IPs in the first queue can also be checked after a proxy IP is judged invalid and removed from the first queue. Compared with the former, this timing does not require a check on every extraction, only when the number of proxy IPs decreases, which saves some computing resources.
When the web crawler fails to obtain website data with a proxy IP from the second queue, that proxy IP is directly judged invalid, and another proxy IP is extracted from the second queue to access the website. In this embodiment, because the second queue updates its proxy IPs dynamically, the proxy IPs in the second queue do not need to be reused in rotation: a proxy IP that cannot access the website is deleted directly. Of course, when few proxy IPs can be obtained dynamically, the proxy IPs in the second queue can also be reused in rotation and removed from the second queue only after a certain number of consecutive failures.
204. Add the proxy IPs from the second queue that successfully obtain the website data to the first queue.
When the web crawler obtains website data successfully with a proxy IP from the second queue, that proxy IP is added to the first queue.
While the web crawler obtains website data with proxy IPs from the second queue, valid proxy IPs are continually added to the first queue. Before each extraction from the second queue, the crawler checks whether the number of proxy IPs in the first queue has reached the maximum threshold. If it has not, the crawler continues to extract proxy IPs from the second queue to obtain website data; once it has, the crawler resumes extracting proxy IPs from the head of the first queue to obtain website data.
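The replenishment loop of steps 203 and 204 can be sketched as follows. This is a simplified illustration under stated assumptions: `replenish` and the `fetch` callback are hypothetical names, `fetch(proxy)` stands in for a real crawl attempt and returns True on success, and the loop also stops if the second queue runs empty, a case the patent does not discuss.

```python
from collections import deque


def replenish(first_queue, second_queue, fetch, max_threshold):
    """Crawl with proxies from the second queue until the first queue
    reaches max_threshold (or the second queue is exhausted)."""
    while len(first_queue) < max_threshold and second_queue:
        proxy = second_queue.popleft()
        if fetch(proxy):
            first_queue.append(proxy)  # valid: promote to the first queue
        # invalid: the proxy is simply dropped, with no retry
    return first_queue


first = deque(["a"])
second = deque(["bad-1", "b", "c", "bad-2", "d", "e"])
# Stand-in for a real crawl: proxies labelled "bad" always fail.
replenish(first, second, lambda p: not p.startswith("bad"), max_threshold=4)
```

In this run the two failing proxies are discarded, three valid ones are promoted, and the last proxy stays in the second queue because the maximum threshold has been reached.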
The steps above show that the first queue and the second queue take turns supplying proxy IPs to the web crawler. From the web crawler's side, this flow can be described more clearly by defining two working states that it occupies when extracting proxy IPs: a consumption state and a replenishment state. In the consumption state, the web crawler obtains proxy IPs from the first queue, and the length of the first queue only decreases. When the length of the first queue falls below the minimum threshold, the web crawler enters the replenishment state, in which it obtains proxy IPs from the second queue and adds the valid ones to the first queue; in this state the length of the first queue only increases. When the length of the first queue reaches the maximum threshold, the web crawler returns to the consumption state. Repeating this cycle keeps supplying the web crawler with valid proxy IPs so that it can access websites and obtain website data over a long period.
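The two states form a small state machine with hysteresis, which can be sketched as a pure transition function. The state labels and the names `next_state` and `trace_states` are illustrative, and the sample lengths and thresholds are arbitrary.

```python
CONSUME, REPLENISH = "consume", "replenish"


def next_state(state, first_queue_len, min_threshold, max_threshold):
    """Return the crawler's state given the current first-queue length."""
    if state == CONSUME and first_queue_len < min_threshold:
        return REPLENISH  # first queue ran low: switch to the second queue
    if state == REPLENISH and first_queue_len >= max_threshold:
        return CONSUME    # first queue refilled: switch back
    return state


def trace_states(lengths, min_threshold, max_threshold, state=CONSUME):
    """Apply next_state to a sequence of observed first-queue lengths."""
    states = []
    for length in lengths:
        state = next_state(state, length, min_threshold, max_threshold)
        states.append(state)
    return states


states = trace_states([5, 3, 2, 1, 3, 5, 6, 4], min_threshold=2, max_threshold=6)
```

Using two separate thresholds keeps the crawler from flipping between queues on every request: once replenishment starts, it continues until the first queue is full again.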
Further, as an implementation of the above method, an embodiment of the present invention provides
a device for a web crawler to obtain website data. This device embodiment corresponds to the
foregoing method embodiment. For ease of reading, the details of the method embodiment are not
repeated one by one here, but it should be understood that the device in this embodiment can
implement everything in the foregoing method embodiment. The device is used in network data
acquisition equipment that runs a web crawler. As shown in Fig. 3, the device includes:
First extraction unit 31, configured to extract an Agent IP from the first queue for obtaining
website data;
Deletion unit 32, configured to remove invalid Agent IPs from the first queue according to the
validity of the Agent IP extracted by the first extraction unit 31;
Second extraction unit 33, configured to extract an Agent IP from the second queue for obtaining
website data when the number of Agent IPs in the first queue is less than the lowest threshold;
Adding unit 34, configured to add the Agent IPs extracted by the second extraction unit 33 that can
effectively extract the website data to the first queue until the number of Agent IPs in the first
queue reaches the highest threshold, after which Agent IPs are extracted from the first queue for
obtaining website data.
Further, as shown in Fig. 4, the device includes:
First creating unit 35, configured to create the first queue from statically acquired Agent IPs
before the first extraction unit 31 extracts an Agent IP from the first queue for obtaining website
data, where each Agent IP in the first queue is provided with a consecutive crawl failure count.
Further, as shown in Fig. 4, the deletion unit 32 includes:
Adding module 321, configured to add the Agent IP back to the first queue and reset its consecutive
crawl failure count to zero when website data is obtained successfully with the Agent IP;
Recording module 322, configured to record the consecutive crawl failure count of the Agent IP when
obtaining website data with the Agent IP fails;
Deleting module 323, configured to delete the Agent IP when the consecutive crawl failure count
recorded by the recording module 322 reaches a preset value.
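The bookkeeping done by these three modules can be sketched in a few lines of Python. This is a minimal sketch, not the patented implementation: the function name `record_result`, the `failures` dictionary and the default `max_failures=3` are hypothetical, the Agent IP is assumed to have already been taken from the head of the queue, and returning a failed Agent IP to the tail for a later retry is an assumption, since the text above only says that the failure count is recorded.

```python
from collections import deque


def record_result(first_queue: deque, failures: dict, proxy: str,
                  ok: bool, max_failures: int = 3) -> None:
    """Update the first queue after one crawl attempt made through `proxy`."""
    if ok:
        failures[proxy] = 0          # success clears the consecutive count
        first_queue.append(proxy)    # back to the tail for reuse
        return
    failures[proxy] = failures.get(proxy, 0) + 1
    if failures[proxy] >= max_failures:
        del failures[proxy]          # preset value reached: drop the Agent IP
    else:
        first_queue.append(proxy)    # ASSUMPTION: retry later from the tail
```

Counting only consecutive failures means that one transient error does not remove an Agent IP that otherwise works.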
Further, as shown in Fig. 4, the device includes:
Acquiring unit 36, configured to acquire Agent IPs through an Agent IP screening service before the
second extraction unit 33 extracts an Agent IP from the second queue for obtaining website data,
where the Agent IP screening service continuously obtains Agent IPs from an Agent IP service at a
fixed frequency and screens out the effective ones;
Second creating unit 37, configured to create the second queue from the Agent IPs obtained by the
acquiring unit 36;
Updating unit 38, configured to replace the earliest-added Agent IP in the second queue with the
newly added Agent IP when the number of Agent IPs in the second queue created by the second
creating unit 37 reaches an upper limit value.
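The replacement rule of the updating unit (a new Agent IP displaces the earliest-added one once the upper limit is reached) matches the behaviour of a bounded FIFO buffer. As an illustrative sketch only, Python's `collections.deque` with `maxlen` gives this behaviour directly; the class name `SecondQueue` and its methods are hypothetical and not taken from the patent.

```python
from collections import deque


class SecondQueue:
    """Bounded FIFO buffer of freshly screened Agent IPs."""

    def __init__(self, upper_limit: int):
        # A deque with maxlen discards the oldest item when a new one
        # is appended to a full buffer.
        self._items = deque(maxlen=upper_limit)

    def add(self, proxy: str) -> None:
        """Add a newly screened Agent IP, evicting the oldest if full."""
        self._items.append(proxy)

    def take(self):
        """Pop the oldest Agent IP, or return None when the queue is empty."""
        return self._items.popleft() if self._items else None

    def __len__(self) -> int:
        return len(self._items)
```

Evicting the oldest entry keeps the second queue biased toward recently screened Agent IPs, which are the ones most likely to still work.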
Further, as shown in Fig. 4, the adding unit 34 includes:
Adding module 341, configured to add the Agent IP to the first queue when website data is obtained
successfully with an Agent IP from the second queue;
Judgment module 342, configured to judge whether the number of Agent IPs in the first queue has
reached the highest threshold; if it has, an Agent IP is extracted from the first queue to obtain
website data; if it has not, an Agent IP is extracted from the second queue to obtain website data.
Further, as shown in Fig. 4, the second extraction unit 33 is also configured to delete an Agent IP
from the second queue when obtaining website data with it fails, and to extract a new Agent IP from
the second queue to obtain website data.
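Taken together, the adding unit and the failure handling of the second extraction unit amount to a refill loop. The Python sketch below is hedged: `replenish` and the `fetch` callback (which stands in for "crawl one page through this Agent IP and report success") are hypothetical names, and both queues are modelled as plain `deque` objects.

```python
from collections import deque
from typing import Callable


def replenish(first: deque, second: deque,
              fetch: Callable[[str], bool], high: int) -> int:
    """Refill `first` from `second`; return how many Agent IPs were tried."""
    tried = 0
    # Stop once the highest threshold is reached or the second queue is empty.
    while len(first) < high and second:
        proxy = second.popleft()
        tried += 1
        if fetch(proxy):
            first.append(proxy)   # worked: promote to the first queue
        # A failed Agent IP is simply dropped and the next one is tried.
    return tried
```

For example, with `first = deque(["x"])`, `second = deque(["a", "bad", "b", "c"])`, a `fetch` that fails only for `"bad"`, and `high=3`, the loop promotes `"a"` and `"b"`, discards `"bad"`, and leaves `"c"` in the second queue.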
In summary, with the method and device for a web crawler to obtain website data adopted by the
embodiments of the present invention, the web crawler obtains website data by extracting Agent IPs
from the first queue. When the number of Agent IPs in the first queue decreases, effective Agent IPs
extracted from the second queue are added to the first queue, so that the first queue always holds
high-quality Agent IPs. Against existing anti-crawler strategies, and especially when the Agent IPs
supplied by a proxy service provider are of low quality, the present invention uses the first queue
to screen Agent IPs and reuse the effective ones. At the same time, it uses the second queue to
acquire effective Agent IPs dynamically, and supplies new, effective Agent IPs to the first queue
when the number of Agent IPs in the first queue falls to a certain threshold, so that the web
crawler can crawl data effectively over a long period.
The device for a web crawler to obtain website data includes a processor and a memory. The first
extraction unit, deletion unit, second extraction unit, adding unit and so on are stored in the
memory as program units, and the processor executes these program units stored in the memory to
implement the corresponding functions.
The processor contains a kernel, which retrieves the corresponding program unit from the memory.
One or more kernels may be provided. By adjusting kernel parameters, the large number of Agent IPs
in use is kept effective and reused, and when existing Agent IPs fail, new Agent IPs are acquired
dynamically and screened to replace the failed ones.
The memory may take forms such as non-permanent memory in computer-readable media, random access
memory (RAM) and/or non-volatile memory, for example read-only memory (ROM) or flash memory
(flash RAM). The memory includes at least one memory chip.
The present application also provides a computer program product which, when executed on a data
processing device, is adapted to execute program code initialized with the following method steps:
extracting an Agent IP from a first queue for obtaining website data; removing invalid Agent IPs
from the first queue according to the validity of the Agent IP; when the number of Agent IPs in the
first queue is less than a lowest threshold, extracting an Agent IP from a second queue for
obtaining website data; and adding the Agent IPs from the second queue that can effectively extract
the website data to the first queue until the number of Agent IPs in the first queue reaches a
highest threshold, and then extracting Agent IPs from the first queue for obtaining website data.
Those skilled in the art should understand that the embodiments of the present application may be
provided as a method, a system or a computer program product. Therefore, the present application may
take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment
combining software and hardware. Moreover, the present application may take the form of a computer
program product implemented on one or more computer-usable storage media (including but not limited
to magnetic disk storage, CD-ROM, optical storage and the like) that contain computer-usable
program code.
The present application is described with reference to flowcharts and/or block diagrams of the
method, device (system) and computer program product according to the embodiments of the present
application. It should be understood that each flow and/or block in the flowcharts and/or block
diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be
implemented by computer program instructions. These computer program instructions may be provided
to the processor of a general-purpose computer, a special-purpose computer, an embedded processor
or another programmable data processing device to produce a machine, so that the instructions
executed by the processor of the computer or other programmable data processing device produce
means for implementing the functions specified in one or more flows of the flowcharts and/or one or
more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct
a computer or another programmable data processing device to work in a particular manner, so that
the instructions stored in the computer-readable memory produce an article of manufacture that
includes instruction means, the instruction means implementing the functions specified in one or
more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data
processing device, so that a series of operation steps is performed on the computer or other
programmable device to produce computer-implemented processing. The instructions executed on the
computer or other programmable device thus provide steps for implementing the functions specified
in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output
interfaces, network interfaces and memory.
The memory may take forms such as non-permanent memory in computer-readable media, random access
memory (RAM) and/or non-volatile memory, for example read-only memory (ROM) or flash memory
(flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media,
which can store information by any method or technology. The information may be computer-readable
instructions, data structures, program modules or other data. Examples of computer storage media
include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM),
dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory
(ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory
technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical
storage, magnetic cassettes, magnetic tape and magnetic disk storage or other magnetic storage
devices, or any other non-transmission medium that can be used to store information accessible by a
computing device. As defined herein, computer-readable media do not include transitory
computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprise", "include" or any other variant thereof are
intended to cover non-exclusive inclusion, so that a process, method, commodity or device that
includes a series of elements includes not only those elements but also other elements not
explicitly listed, or elements inherent to such a process, method, commodity or device. Without
further limitation, an element defined by the phrase "including a ..." does not exclude the
existence of other identical elements in the process, method, commodity or device that includes
the element.
The above are only embodiments of the present application and are not intended to limit the present
application. For those skilled in the art, the present application may have various modifications
and variations. Any modification, equivalent substitution, improvement and the like made within the
spirit and principle of the present application shall fall within the scope of the claims of the
present application.
Claims (10)
1. A method for a web crawler to obtain website data, characterized in that the method comprises:
extracting an Agent IP from a first queue for obtaining website data;
removing invalid Agent IPs from the first queue according to the validity of the Agent IP;
when the number of Agent IPs in the first queue is less than a lowest threshold, extracting an
Agent IP from a second queue for obtaining website data;
adding the Agent IPs from the second queue that can effectively extract the website data to the
first queue until the number of Agent IPs in the first queue reaches a highest threshold, and then
extracting Agent IPs from the first queue for obtaining website data.
2. The method according to claim 1, characterized in that, before extracting the Agent IP from the
first queue for obtaining website data, the method comprises:
creating the first queue from statically acquired Agent IPs, each Agent IP in the first queue being
provided with a consecutive crawl failure count.
3. The method according to claim 2, characterized in that removing invalid Agent IPs from the first
queue according to the validity of the Agent IP comprises:
when website data is obtained successfully with the Agent IP, adding the Agent IP to the first
queue and resetting the consecutive crawl failure count to zero;
when obtaining website data with the Agent IP fails, recording the consecutive crawl failure count
of the Agent IP;
when the consecutive crawl failure count of the Agent IP reaches a preset value, deleting the
Agent IP.
4. The method according to claim 1, characterized in that, before extracting the Agent IP from the
second queue for obtaining website data, the method comprises:
acquiring Agent IPs through an Agent IP screening service, the Agent IP screening service
continuously obtaining Agent IPs from an Agent IP service at a fixed frequency and screening out
the effective Agent IPs;
creating the second queue from the Agent IPs acquired through the Agent IP screening service;
when the number of Agent IPs in the second queue reaches an upper limit value, replacing the
earliest-added Agent IP in the second queue with the newly added Agent IP.
5. The method according to claim 1, characterized in that adding the Agent IPs from the second queue
that can effectively extract the website data to the first queue comprises:
when website data is obtained successfully with an Agent IP from the second queue, adding the
Agent IP to the first queue;
judging whether the number of Agent IPs in the first queue has reached the highest threshold;
if it has, extracting an Agent IP from the first queue to obtain website data;
if it has not, extracting an Agent IP from the second queue to obtain website data.
6. The method according to claim 1, characterized in that extracting the Agent IP from the second
queue for obtaining website data comprises:
when obtaining website data with an Agent IP from the second queue fails, deleting the Agent IP and
extracting a new Agent IP from the second queue to obtain website data.
7. A device for a web crawler to obtain website data, characterized in that the device comprises:
a first extraction unit, configured to extract an Agent IP from a first queue for obtaining website
data;
a deletion unit, configured to remove invalid Agent IPs from the first queue according to the
validity of the Agent IP extracted by the first extraction unit;
a second extraction unit, configured to extract an Agent IP from a second queue for obtaining
website data when the number of Agent IPs in the first queue is less than a lowest threshold;
an adding unit, configured to add the Agent IPs extracted by the second extraction unit that can
effectively extract the website data to the first queue until the number of Agent IPs in the first
queue reaches a highest threshold, after which Agent IPs are extracted from the first queue for
obtaining website data.
8. The device according to claim 7, characterized in that the device comprises:
a first creating unit, configured to create the first queue from statically acquired Agent IPs
before the first extraction unit extracts the Agent IP from the first queue for obtaining website
data, each Agent IP in the first queue being provided with a consecutive crawl failure count.
9. The device according to claim 8, characterized in that the deletion unit comprises:
an adding module, configured to add the Agent IP to the first queue and reset the consecutive crawl
failure count to zero when website data is obtained successfully with the Agent IP;
a recording module, configured to record the consecutive crawl failure count of the Agent IP when
obtaining website data with the Agent IP fails;
a deleting module, configured to delete the Agent IP when the consecutive crawl failure count
recorded by the recording module reaches a preset value.
10. The device according to claim 7, characterized in that the device comprises:
an acquiring unit, configured to acquire Agent IPs through an Agent IP screening service before the
second extraction unit extracts the Agent IP from the second queue for obtaining website data, the
Agent IP screening service continuously obtaining Agent IPs from an Agent IP service at a fixed
frequency and screening out the effective Agent IPs;
a second creating unit, configured to create the second queue from the Agent IPs obtained by the
acquiring unit;
an updating unit, configured to replace the earliest-added Agent IP in the second queue with the
newly added Agent IP when the number of Agent IPs in the second queue created by the second
creating unit reaches an upper limit value.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201610899608.1A CN107957999A (en) | 2016-10-14 | 2016-10-14 | A kind of web crawlers obtains the method and device of website data |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN107957999A (en) | 2018-04-24 |
Family
ID=61953679
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201610899608.1A Pending CN107957999A (en) | 2016-10-14 | 2016-10-14 | A kind of web crawlers obtains the method and device of website data |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN107957999A (en) |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN103902386A (en) * | 2014-04-11 | 2014-07-02 | 复旦大学 | Multi-thread network crawler processing method based on connection proxy optimal management |
| US20160050176A1 (en) * | 2006-10-13 | 2016-02-18 | Yahoo! Inc | Systems and methods for establishing or maintaining a personalized trusted social network |
| CN105740384A (en) * | 2016-01-27 | 2016-07-06 | 浪潮软件集团有限公司 | A crawler agent automatic switching method and device |
Cited By (19)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN109274782A (en) * | 2018-08-24 | 2019-01-25 | 北京创鑫旅程网络技术有限公司 | A kind of method and device acquiring website data |
| CN109413153A (en) * | 2018-09-26 | 2019-03-01 | 深圳壹账通智能科技有限公司 | Data crawling method, device, computer equipment and storage medium |
| CN109413153B (en) * | 2018-09-26 | 2022-09-02 | 深圳壹账通智能科技有限公司 | Data crawling method and device, computer equipment and storage medium |
| CN111125478B (en) * | 2018-10-30 | 2023-05-12 | 北京国双科技有限公司 | Data crawling method and device |
| CN111125478A (en) * | 2018-10-30 | 2020-05-08 | 北京国双科技有限公司 | Data crawling method and device |
| CN109743411A (en) * | 2018-12-10 | 2019-05-10 | 厦门市美亚柏科信息股份有限公司 | A kind of method, apparatus and storage medium of the dynamic dispatching IP agent pool under distributed environment |
| CN109873882B (en) * | 2019-02-19 | 2022-07-29 | 上海七印信息科技有限公司 | IP proxy pool management system and management method thereof |
| CN109873882A (en) * | 2019-02-19 | 2019-06-11 | 上海七印信息科技有限公司 | A kind of IP agent pool management system and its management method |
| CN110034979A (en) * | 2019-04-23 | 2019-07-19 | 恒安嘉新(北京)科技股份公司 | A kind of proxy resources monitoring method, device, electronic equipment and storage medium |
| CN110708395A (en) * | 2019-10-24 | 2020-01-17 | 深圳前海环融联易信息科技服务有限公司 | Data acquisition method and device, computer equipment and storage medium |
| CN111741141A (en) * | 2020-06-15 | 2020-10-02 | 重庆帮企科技集团有限公司 | Method and system for realizing efficient IP proxy pool and data acquisition method |
| CN113923260A (en) * | 2021-09-28 | 2022-01-11 | 盐城金堤科技有限公司 | Method, device, terminal and storage medium for processing proxy environment |
| CN113905092A (en) * | 2021-09-28 | 2022-01-07 | 盐城金堤科技有限公司 | Method, device, terminal and storage medium for determining reusable agent queue |
| CN113923260B (en) * | 2021-09-28 | 2024-01-09 | 盐城天眼察微科技有限公司 | Method, device, terminal and storage medium for processing agent environment |
| CN113905092B (en) * | 2021-09-28 | 2024-03-22 | 盐城天眼察微科技有限公司 | Method, device, terminal and storage medium for determining reusable agent queue |
| CN114143290A (en) * | 2021-11-19 | 2022-03-04 | 国家计算机网络与信息安全管理中心广东分中心 | System and method for constructing IP proxy pool for multi-website parallel crawling |
| CN114143290B (en) * | 2021-11-19 | 2024-01-30 | 国家计算机网络与信息安全管理中心广东分中心 | System and method for constructing IP proxy pool of multi-website parallel crawling |
| CN115801735A (en) * | 2022-12-15 | 2023-03-14 | 江苏物润船联网络股份有限公司 | Method for calling third-party interface by self-healing function |
| CN115801735B (en) * | 2022-12-15 | 2025-06-20 | 江苏物润船联网络股份有限公司 | A method for calling a third-party interface using a self-healing function |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | CB02 | Change of applicant information | Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing. Applicant after: BEIJING GRIDSUM TECHNOLOGY Co.,Ltd. Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing. Applicant before: BEIJING GRIDSUM TECHNOLOGY Co.,Ltd. |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 20180424 |