Working with an HTTP proxy

When you crawl data from some websites, such as Google Shop, the site detects the source IP and restricts some services to specific IP addresses. However, the Scrapy framework can handle this situation by making requests through a proxy.

Scrapy provides HttpProxyMiddleware to support HTTP proxies. If you want your web crawler to go through a proxy, the first thing you need to do is modify your settings file like this:

    DOWNLOADER_MIDDLEWARES = {
        'scrapyproduct.middlewares.ProxyMiddleware': 1,
        'scrapyproduct.middlewares.DelayAfterConnectionRefusedMiddleware': 510,
    }

The settings above make Scrapy load ProxyMiddleware, so the next step is to implement ProxyMiddleware:

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # Set this to the URL of your HTTP proxy; Scrapy routes the request through it.
        request.meta['proxy'] = ""

Just add a proxy value to the meta of the request object, and Scrapy will handle the rest for us. Here the HTTP proxy runs on my PC and listens on port 8118, so the proxy value points to that local port. In some cases, the proxy server requires a username and password to verify your identity, so you have to write the proxy like this: http://USERNAME:PASSWORD@PROXYIP:PROXYPORT
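
To make the two cases above concrete, here is a minimal sketch of such a middleware. The address 127.0.0.1:8118 is my assumption for a local proxy on port 8118, and build_proxy_url is a hypothetical helper name I made up for illustration, not part of Scrapy. Percent-encoding the credentials is a common convention so that characters such as '@' do not break the URL; adjust it if your proxy expects something else.

```python
from urllib.parse import quote


def build_proxy_url(host, port, username=None, password=None):
    """Return an HTTP proxy URL, embedding credentials when both are given."""
    if username is not None and password is not None:
        # Percent-encode credentials so characters like '@' or ':' stay unambiguous.
        creds = f"{quote(username, safe='')}:{quote(password, safe='')}@"
    else:
        creds = ""
    return f"http://{creds}{host}:{port}"


class ProxyMiddleware:
    """Downloader middleware that sends every request through one proxy."""

    # Assumed address: a local HTTP proxy listening on port 8118.
    proxy_url = build_proxy_url("127.0.0.1", 8118)

    def process_request(self, request, spider):
        # Scrapy reads the proxy to use from request.meta['proxy'].
        request.meta["proxy"] = self.proxy_url
        # Returning None lets Scrapy continue processing the request normally.
        return None
```

For a proxy that needs authentication, something like build_proxy_url("PROXYIP", 3128, "USERNAME", "PASSWORD") would produce a URL in the form shown above (the port 3128 is only a placeholder).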

Alternatively, you can set the proxy through your environment variables.
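
As a rough illustration of the environment-variable route: Scrapy's built-in HttpProxyMiddleware follows the same http_proxy / https_proxy variables that Python's urllib.request does. The sketch below sets them from Python and checks what urllib detects. The address is an assumption (a local proxy on port 8118), and in practice you would more likely export the variables in your shell before starting the crawler, since they need to be in place before the crawl begins.

```python
import os
import urllib.request

# Assumed address: a local HTTP proxy on port 8118.
PROXY_URL = "http://127.0.0.1:8118"

# Set the variables for this process (and any child processes it starts).
os.environ["http_proxy"] = PROXY_URL
os.environ["https_proxy"] = PROXY_URL

# getproxies() returns a scheme -> proxy URL mapping built from the environment.
detected = urllib.request.getproxies()
print(detected.get("http"), detected.get("https"))
```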


Convert a SOCKS proxy to an HTTP proxy

In some cases, what you have is a SOCKS proxy, which cannot be used directly by Scrapy, so you have to convert the SOCKS proxy to an HTTP proxy. Fortunately, there are many existing tools that can do this.

I use Privoxy, a very good tool I found on the Tor wiki page. First, install and configure it:

sudo apt-get install privoxy
# edit the config file
sudo vi /etc/privoxy/config

# add one of the following lines, depending on your SOCKS version
forward-socks5 / .
forward-socks4a / .

# restart it
sudo /etc/init.d/privoxy restart

Now Privoxy opens an HTTP proxy on port 8118 (the listen port can also be modified in the config file) and forwards HTTP requests to the SOCKS proxy you configured. If you want to know more, read the Privoxy documentation.
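
Before pointing the crawler at it, it can help to confirm that something is actually listening on that port. The following is a small sketch using only the standard library. is_port_open is a hypothetical helper name of mine, and 127.0.0.1:8118 assumes Privoxy runs locally with its default listen port. It only checks that the TCP port accepts connections, not that requests are really forwarded to the SOCKS proxy.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        # create_connection raises OSError on refusal, timeout, or unreachable host.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Assumed address: Privoxy on the local machine, default port 8118.
    print(is_port_open("127.0.0.1", 8118))
```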

Now Scrapy works well with the proxy. The website cannot detect the source IP because of the proxy and, more importantly, the crawler can extract data from websites that are open only to specific IPs.

