💪 Ruby / JRuby / TruffleRuby gem & CLI for dealing with proxy lists from various sources

nbulaj/proxy_fetcher

Ruby / JRuby lib for managing proxies

This gem can help your Ruby / JRuby application to make HTTP(S) requests using proxy by fetching and validating actual proxy lists from multiple providers.

It gives you a special Manager class that can load proxy lists, validate them and return random or specific proxies. It also has a Client class that encapsulates all the logic for sending HTTP requests using proxies, automatically fetched and validated by the gem. Take a look at the documentation below to find all the gem features.

This gem can also be used with any other programming language (Go, Python, etc.) as a standalone solution for downloading and validating proxy lists from different providers. Check out the usage examples below.

Documentation valid for master branch

Please check the documentation for the version of proxy_fetcher you are using at: https://github.com/nbulaj/proxy_fetcher/releases

Dependencies

The ProxyFetcher gem itself requires Ruby >= 2.0.0 (or JRuby > 9.0; earlier versions may also work, see the GitHub Actions matrix) and the HTTP.rb gem.

It also needs an adapter to parse HTML. If you do not specify one, the default adapter, Nokogiri, is used. That works out of the box for any Ruby on Rails project, because Rails already depends on Nokogiri.

If you want to use a different adapter (for example, because your application uses Oga), you need to add its dependencies to your project manually and configure ProxyFetcher to use that adapter. You can also implement your own adapter if your use case requires it. Take a look at the Configuration section for more details.

Installation

If using bundler, first add 'proxy_fetcher' to your Gemfile:

gem 'proxy_fetcher', '~> 0.14'

or if you want to use the latest version (from master branch), then:

gem 'proxy_fetcher', git: 'https://github.com/nbulaj/proxy_fetcher.git'

And run:

bundle install

Otherwise simply install the gem:

gem install proxy_fetcher -v '0.14'

Example of usage

In Ruby application

By default ProxyFetcher uses all the available proxy providers. To get the current proxy list without validation, initialize an instance of the ProxyFetcher::Manager class. On initialization it automatically loads and parses the proxies from all available sources:

manager = ProxyFetcher::Manager.new # will immediately load proxy list from the servers
manager.proxies

 #=> [#<ProxyFetcher::Proxy:0x00000002879680 @addr="97.77.104.22", @port=3128, @country="USA",
 #     @response_time=5217, @type="HTTP", @anonymity="High">, ... ]

You can initialize proxy manager without immediate load of the proxy list from the remote server by passing refresh: false on initialization:

manager = ProxyFetcher::Manager.new(refresh: false) # just initialize class instance
manager.proxies

 #=> []

You can also use ProxyFetcher to load proxy lists from local files, if you have any:

manager = ProxyFetcher::Manager.new(file: "/home/dev/proxies.txt", refresh: false)

# or

manager = ProxyFetcher::Manager.from_file(file: "/home/dev/proxies.txt", refresh: false)

# or

manager = ProxyFetcher::Manager.new(
  files: Dir.glob("/home/dev/proxies/**/*.txt"),
  refresh: false
)
manager.proxies

 #=> [#<ProxyFetcher::Proxy:0x00000002879680 @addr="97.77.104.22", @port=3128, @country="USA",
 #     @response_time=5217, @type="HTTP", @anonymity="High">, ... ]

The ProxyFetcher::Manager class is helpful when you need to manipulate and manage proxies. To get a proxy from the list, call the get method or its alias pop; it returns the first proxy and moves it to the end of the list. The get! method and its alias pop! return the first connectable proxy and move it to the end of the list. Both are marked as dangerous (bang) methods because dead proxies are removed from the list.
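
The rotation described above can be pictured with a plain Ruby array. This is only an illustrative sketch: FakeProxy, rotate_get and rotate_get! are hypothetical stand-ins written for this example, not ProxyFetcher's internals.

```ruby
# Hypothetical stand-in for ProxyFetcher::Proxy; `alive` fakes connectable?.
FakeProxy = Struct.new(:addr, :port, :alive) do
  def connectable?
    alive
  end
end

# Like get / pop: return the first proxy and move it to the end of the list.
def rotate_get(list)
  proxy = list.shift
  list << proxy if proxy
  proxy
end

# Like get! / pop!: drop dead proxies from the front until a connectable one
# is found, then move that one to the end. Returns nil if none is left.
def rotate_get!(list)
  until list.empty?
    proxy = list.shift
    if proxy.connectable?
      list << proxy
      return proxy
    end
  end
  nil
end

list = [
  FakeProxy.new("10.0.0.1", 3128, false),
  FakeProxy.new("10.0.0.2", 8080, true),
  FakeProxy.new("10.0.0.3", 80, true)
]

rotating = list.dup
first = rotate_get(rotating)  # returns 10.0.0.1; order is now .2, .3, .1

pruning = list.dup
live = rotate_get!(pruning)   # drops dead 10.0.0.1, returns 10.0.0.2; order is now .3, .2
```

The plain variant never shrinks the list, while the bang variant does, which is why it is worth refreshing the list from time to time when you rely on get! or pop!.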

If you just need a random proxy, call manager.random_proxy or its alias manager.random.

To remove dead entries that do not respond to requests from the current proxy list, use the cleanup! or validate! method:

manager.cleanup! # or manager.validate!

This action enumerates the proxy list and removes all the entries that time out or return errors.

To improve performance, proxy list validation is performed using Ruby threads. By default the gem creates a pool of 10 threads, but you can increase this number by changing the pool_size configuration option: ProxyFetcher.config.pool_size = 50. Read more in the Proxy validation speed section.

If you need raw proxy URLs (like host:port), use the raw_proxies method, which returns an array of strings:

manager = ProxyFetcher::Manager.new
manager.raw_proxies

 # => ["97.77.104.22:3128", "94.23.205.32:3128", "209.79.65.140:8080",
 #     "91.217.42.2:8080", "97.77.104.22:80", "165.234.102.177:8080", ...]

You don't need to initialize a new manager every time you want to load a fresh proxy list from the providers. Just refresh the list by calling the #refresh_list! (or #fetch!) method on your ProxyFetcher::Manager instance:

manager.refresh_list! # or manager.fetch!

 #=> [#<ProxyFetcher::Proxy:0x00000002879680 @addr="97.77.104.22", @port=3128, @country="USA",
 #     @response_time=5217, @type="HTTP", @anonymity="High">, ... ]

If you need to filter the proxy list, for example by country or response time, and the selected provider supports filtering with GET params, you can pass your filters as a simple Ruby hash to the Manager instance:

ProxyFetcher.config.providers = :xroxy

manager = ProxyFetcher::Manager.new(filters: { country: 'PL', maxtime: '500' })
manager.proxies

 # => [...]

[IMPORTANT]: All the providers have their own filtering params! So you can't just use something like country to filter all the proxies by country. If you are using multiple providers, then you can split your filters by proxy provider names:

ProxyFetcher.config.providers = [:proxy_docker, :xroxy]

manager = ProxyFetcher::Manager.new(filters: {
  proxy_docker: {
    country: 'PL',
    maxtime: '500'
  },
  xroxy: {
    type: 'All_http'
  }
})

manager.proxies

 # => [...]

You can apply different filters each time you call the #refresh_list! (or #fetch!) method:

manager.refresh_list!(country: 'PL', maxtime: '500')

 # => [...]

NOTE: not all providers support filtering. Take a look at the provider classes to see whether a given provider supports custom filters.

Standalone

All you need to use this gem is Ruby >= 2.0 (2.4 is recommended). You can install it in different ways. If you are using Ubuntu Xenial (16.04 LTS), then you already have Ruby 2.3 installed. Otherwise you can install it with RVM or rbenv.

After installing Ruby, install the gem by running gem install proxy_fetcher in your terminal. Now you can run it:

proxy_fetcher >> proxies.txt # Will download proxies from the default provider, validate them and write to file

If you need a list of proxies from a specific provider, pass its name with the -p option:

proxy_fetcher -p xroxy >> proxies.txt # Will download proxies from the xroxy provider, validate them and write to file

If you need a list of proxies in JSON format just pass a --json option to the command:

proxy_fetcher --json

# Will print:
# {"proxies":["120.26.206.178:80","119.61.13.242:1080","117.40.213.26:80","92.62.72.242:1080","77.53.105.155:3124",
# "58.20.41.172:35923","204.116.192.151:35923","190.5.96.58:1080","170.250.109.97:35923","121.41.82.99:1080"]}
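
The JSON output is easy to consume from a script. The sketch below uses only Ruby's standard json library and a shortened sample payload in the shape shown above. In a real script the string would come from the command's standard output; the host/port hash layout is just a choice made for this example.

```ruby
require "json"

# Sample payload in the shape printed by `proxy_fetcher --json`
# (shortened; a real script would read it from the command's stdout).
output = '{"proxies":["120.26.206.178:80","119.61.13.242:1080","117.40.213.26:80"]}'

# Split every "host:port" entry into a host string and an integer port.
proxies = JSON.parse(output).fetch("proxies").map do |entry|
  host, port = entry.split(":", 2)
  { host: host, port: Integer(port) }
end

proxies.each { |proxy| puts "#{proxy[:host]} -> #{proxy[:port]}" }
```

The same two steps (parse the JSON, split each entry on the colon) carry over to Go, Python or any other language with a JSON parser.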

To get all the possible options run:

proxy_fetcher --help

Client

The ProxyFetcher gem provides a ready-to-use HTTP client that makes sending requests through proxies easy. It does all the work with the proxy lists for you (load, validate, refresh, find proxy by type, follow redirects, etc.). All you need to do is make HTTP(S) requests:

require 'proxy_fetcher'

ProxyFetcher::Client.get 'https://example.com/resource'

ProxyFetcher::Client.post 'https://example.com/resource', { param: 'value' }

ProxyFetcher::Client.post 'https://example.com/resource', 'Any data'

ProxyFetcher::Client.post 'https://example.com/resource', { param: 'value' }.to_json, headers: { 'Content-Type': 'application/json' }

ProxyFetcher::Client.put 'https://example.com/resource', { param: 'value' }

ProxyFetcher::Client.patch 'https://example.com/resource', { param: 'value' }

ProxyFetcher::Client.delete 'https://example.com/resource'

By default, ProxyFetcher::Client makes up to 1000 attempts to send an HTTP request if the proxy is out of order or the remote server returns an error. You can increase or decrease this number for your case, or set it to nil to retry indefinitely (or until your Ruby process dies 💀):

require 'proxy_fetcher'

ProxyFetcher::Client.get 'https://example.com/resource', options: { max_retries: 10_000 }
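
To make the retry behaviour concrete, here is a minimal sketch of a bounded retry loop that switches to the next proxy after every failure. The with_retries helper and the fake request are hypothetical and written for this example; they are not the client's actual implementation, and max_retries is treated here as the total number of attempts.

```ruby
# Hypothetical helper: call the block with proxies taken in rotation until it
# succeeds. max_retries: nil means "keep trying forever".
def with_retries(proxies, max_retries: 1000)
  attempts = 0
  last_error = nil

  proxies.cycle do |proxy|
    attempts += 1
    begin
      return yield(proxy)
    rescue StandardError => e
      last_error = e
      break if max_retries && attempts >= max_retries
    end
  end

  # Reached when the attempts are exhausted or the list was empty.
  raise(last_error || ArgumentError.new("proxy list is empty"))
end

calls = []
# Fake request: the first two proxies "fail", the third one answers.
result = with_retries(%w[10.0.0.1:80 10.0.0.2:80 10.0.0.3:80], max_retries: 5) do |proxy|
  calls << proxy
  raise IOError, "proxy #{proxy} is down" if calls.size < 3
  "response via #{proxy}"
end
```

With a finite limit the last error is re-raised once the attempts are used up, so the caller can still see why the request failed.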

You can also use your own proxy object when using ProxyFetcher client:

require 'proxy_fetcher'

manager = ProxyFetcher::Manager.new # will immediately load proxy list from the server

# random returns a random proxy object from the list
ProxyFetcher::Client.get 'https://example.com/resource', options: { proxy: manager.random }

By the way, if you need JavaScript support or some other features, you need to implement your own client using, for example, selenium-webdriver.

Configuration

ProxyFetcher is a very flexible gem. You can configure the most important parts of the library and plug in your own solutions.

Default configuration looks as follows:

ProxyFetcher.configure do |config|
  config.logger = Logger.new($stdout)
  config.user_agent = ProxyFetcher::Configuration::DEFAULT_USER_AGENT
  config.pool_size = 10
  config.client_timeout = 3
  config.provider_proxies_load_timeout = 30
  config.proxy_validation_timeout = 3
  config.http_client = ProxyFetcher::HTTPClient
  config.proxy_validator = ProxyFetcher::ProxyValidator
  config.providers = ProxyFetcher::Configuration.registered_providers
  config.adapter = ProxyFetcher::Configuration::DEFAULT_ADAPTER # :nokogiri by default
end

You can change any of the options above.

For example, you can set your custom User-Agent string:

ProxyFetcher.configure do |config|
  config.user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'
end

ProxyFetcher uses the HTTP.rb gem for HTTP(S) requests. It is fast enough and has a great chainable API. If you want to add, for example, a custom provider whose site is a Single Page Application (SPA) rendered with JavaScript, you will need something like selenium-webdriver to load the content of the website properly. For those and other cases you can write your own class for fetching HTML content by URL and set it in the ProxyFetcher config:

class MyHTTPClient
  # [IMPORTANT]: below methods are required!
  def self.fetch(url)
    # ... some magic to return proper HTML ...
  end
end

ProxyFetcher.config.http_client = MyHTTPClient

manager = ProxyFetcher::Manager.new
manager.proxies

 #=> [#<ProxyFetcher::Proxy:0x00000002879680 @addr="97.77.104.22", @port=3128, @country="USA",
 #     @response_time=5217, @type="HTTP", @anonymity="High">, ... ]

You can take a look at the lib/proxy_fetcher/utils/http_client.rb for an example.

Moreover, you can write your own proxy validator to check if proxy is valid or not:

class MyProxyValidator
  # [IMPORTANT]: below methods are required!
  def self.connectable?(proxy_addr, proxy_port)
    # ... some magic to check if proxy is valid ...
  end
end

ProxyFetcher.config.proxy_validator = MyProxyValidator

manager = ProxyFetcher::Manager.new
manager.proxies

 #=> [#<ProxyFetcher::Proxy:0x00000002879680 @addr="97.77.104.22", @port=3128, @country="USA",
 #     @response_time=5217, @type="HTTP", @anonymity="High">, ... ]

manager.validate!

 #=> [ ... ]
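
As one possible body for connectable?, the sketch below treats a proxy as alive when its TCP port accepts a connection within a short timeout. This is a deliberately weak check (it does not prove the server really forwards requests), and the TcpProxyValidator name and the TIMEOUT value are assumptions made for this example. It relies only on Ruby's standard socket library.

```ruby
require "socket"

# Hypothetical validator: a proxy counts as "connectable" when its TCP port
# accepts a connection within TIMEOUT seconds. This does not prove that the
# server actually works as a proxy.
class TcpProxyValidator
  TIMEOUT = 1 # seconds (assumed value)

  def self.connectable?(proxy_addr, proxy_port)
    # With a block, Socket.tcp closes the socket and returns the block value.
    Socket.tcp(proxy_addr, proxy_port, connect_timeout: TIMEOUT) { true }
  rescue StandardError
    # Refused connections, timeouts and DNS failures all mean "not usable".
    false
  end
end
```

It would be plugged in the same way as above, by assigning the class to ProxyFetcher.config.proxy_validator.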

By default, the ProxyFetcher gem uses Nokogiri for parsing HTML. If you want to use Oga instead, add gem 'oga' to your Gemfile and configure ProxyFetcher as follows:

ProxyFetcher.config.adapter = :oga

You can also write your own HTML parser implementation and use it. Take a look at the abstract class and its implementations, then configure it as:

ProxyFetcher.config.adapter = MyHTMLParserClass

Proxy validation speed

There are some tricks to increase proxy list validation performance.

In a few words, ProxyFetcher gem uses threads to validate proxies for availability. Every proxy is checked in a separate thread. By default, ProxyFetcher uses a pool with a maximum of 10 threads. You can increase this number by setting max number of threads in the config:

ProxyFetcher.config.pool_size = 50

You can experiment with the thread pool size to find the optimal maximum number of threads for your PC and OS. A larger pool usually shortens validation, because more proxies are checked at the same time.

Moreover, the overall proxy validation speed depends on the ProxyFetcher.config.proxy_validation_timeout option, which is 3 by default. It means that the gem waits up to 3 seconds for the server to answer when checking whether a particular proxy is connectable. You can decrease this option to 1, for example, and it will considerably increase the proxy validation speed. Keep in mind that some proxies are connectable but slow, so a lower timeout also removes proxies that work but respond slowly.
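
The interplay of pool size and timeout can be sketched with a small worker pool built on Ruby's Queue. The validate_in_pool helper is hypothetical and only illustrates the idea; it is not the gem's implementation, and the check itself is passed in as a block. As a rough worst case, validating N dead proxies takes about ceil(N / pool_size) times the timeout.

```ruby
# Hypothetical helper: check every proxy with at most `pool_size` worker
# threads and return the ones for which the block is truthy.
def validate_in_pool(proxies, pool_size: 10, &check)
  queue = Queue.new
  proxies.each { |proxy| queue << proxy }
  alive = Queue.new # thread-safe collector for the survivors

  workers = Array.new([pool_size, proxies.size].min) do
    Thread.new do
      loop do
        proxy = begin
          queue.pop(true) # non-blocking pop raises ThreadError when empty
        rescue ThreadError
          break # nothing left to check, stop this worker
        end
        alive << proxy if check.call(proxy)
      end
    end
  end
  workers.each(&:join)

  # Result order depends on thread scheduling, not on the input order.
  Array.new(alive.size) { alive.pop }
end

# Fake check: pretend every proxy on port 8080 is dead.
proxies = %w[10.0.0.1:3128 10.0.0.2:8080 10.0.0.3:80 10.0.0.4:8080]
alive = validate_in_pool(proxies, pool_size: 2) { |proxy| !proxy.end_with?(":8080") }
```

Because the workers mostly wait on the network, a pool larger than the number of CPU cores is normal here; the practical limits are open sockets and memory.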

Proxy object

Every proxy is a ProxyFetcher::Proxy object that has the following readers (instance variables):

  • addr (IP address)
  • port
  • type (proxy type, can be HTTP, HTTPS, SOCKS4 and/or SOCKS5)
  • country (USA or Brazil for example)
  • response_time (5217 for example)
  • anonymity (Low, Elite proxy or High +KA for example)

You can also call the following instance methods on every Proxy object:

  • connectable? (whether the proxy server is available)
  • http? (whether the proxy server supports the HTTP protocol)
  • https? (whether the proxy server supports the HTTPS protocol)
  • socks4?
  • socks5?
  • uri (returns URI::Generic object)
  • url (returns a formatted URL like "IP:PORT" or "http://IP:PORT" if scheme: true provided)

Providers

Currently ProxyFetcher can deal with the following proxy providers (services):

  • Free Proxy List
  • Free SSL Proxies
  • Free Socks Proxies
  • Free US Proxies
  • HTTP Tunnel Genius
  • Proxy List
  • XRoxy
  • Proxypedia
  • Proxyscrape
  • MTPro.xyz

If you want to use one of them, just set it in the config:

ProxyFetcher.config.provider = :free_proxy_list

manager = ProxyFetcher::Manager.new
manager.proxies
 #=> ...

You can use multiple providers at the same time:

ProxyFetcher.config.providers = :free_proxy_list, :xroxy, :proxy_docker

manager = ProxyFetcher::Manager.new
manager.proxies
 #=> ...

If you want to use all the possible proxy providers then you can configure ProxyFetcher as follows:

ProxyFetcher.config.providers = ProxyFetcher::Configuration.registered_providers

manager = ProxyFetcher::Manager.new
manager.proxies

 #=> [#<ProxyFetcher::Proxy:0x00000002879680 @addr="97.77.104.22", @port=3128, @country="USA", 
 #     @response_time=5217, @type="HTTP", @anonymity="High">, ... ]

Moreover, you can write your own provider! All you need to do is create a class that inherits from ProxyFetcher::Providers::Base and register your provider like this:

ProxyFetcher::Configuration.register_provider(:your_provider, YourProviderClass)

The provider class must implement the self.load_proxy_list and #to_proxy(html_element) methods, which load and parse the provider's HTML page with the proxy list. Take a look at the existing providers in the lib/proxy_fetcher/providers directory.

Contributing

You are very welcome to help improve ProxyFetcher if you have suggestions for features that other people can use.

To contribute:

  1. Fork the project.
  2. Create your feature branch (git checkout -b my-new-feature).
  3. Implement your feature or bug fix.
  4. Add documentation for your feature or bug fix.
  5. Run rake doc:yard. If your changes are not 100% documented, go back to step 4.
  6. Add tests for your feature or bug fix.
  7. Run rake spec to make sure all tests pass.
  8. Commit your changes (git commit -am 'Add new feature').
  9. Push to the branch (git push origin my-new-feature).
  10. Create a new pull request.

Thanks.

License

proxy_fetcher gem is released under the MIT License.

Copyright (c) 2017 Nikita Bulai (bulajnikita@gmail.com).