Intro
This post covers some useful tips for developing Scrapy spiders that can help you write your spiders more quickly.
Extract information
Some newbie developers think the expressions supported by Scrapy are difficult to learn, but the truth is that, compared to methods like find_by_id and find_by_name in some frameworks, XPath, CSS, and regular expressions are very extensible, easy to maintain, and work well even in complex situations.
I recommend using the xpath and css methods in your spider, since they are easier to write and maintain than the re method.
When you try to use XPath to extract info from web pages, XPath Helper is a cool tool you should not miss (an extension for Chrome; you can find similar add-ons for Firefox). Here is how you use it: press ctrl+shift+x to open XPath Helper and its toolbar appears. Type your XPath query string in the toolbar; the result shows on the right side, and the matched content in the web page gets a yellow background, which makes it easy to check whether the XPath expression is right.
If you do not want to install an extension for this, Google Chrome has built-in support for querying XPath and CSS expressions. Take a look at $() and $x() in the console.
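Once an expression works in XPath Helper or the Chrome console, you can reuse it directly in Scrapy. Here is a minimal sketch; the HTML snippet and the expressions are made up purely for illustration:

from scrapy.selector import Selector

# A stand-in HTML snippet representing a downloaded page (purely illustrative).
html = '<html><body><h1 class="title">Example product</h1></body></html>'
sel = Selector(text=html)

# The same expression you tried in XPath Helper or with $x() in the console:
print(sel.xpath('//h1[@class="title"]/text()').extract())  # ['Example product']
# And the CSS equivalent:
print(sel.css('h1.title::text').extract())                 # ['Example product']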
By the way, sometimes the HTML you get in your web browser is not the same as the HTML your spider gets. That is normally caused by two factors. The first is the user agent: Scrapy uses its own user agent by default, which the web server can detect and then send back meaningless HTML. The other factor is the IP address: if your browser uses a proxy, you need to make sure the spider uses the same proxy.
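For example, you can override the default user agent in settings.py so that the spider looks like your browser. The user-agent string below is only a placeholder; copy the real one from your browser's developer tools:

# settings.py -- make the spider send the same user agent as your browser
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'

If your browser goes through a proxy, you can point individual requests at the same proxy by setting meta={'proxy': 'http://127.0.0.1:8888'} (a placeholder address) on the Request; Scrapy's built-in HttpProxyMiddleware will pick it up.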
Scrapy Parse
In most cases you will put the code that handles pagination in one method and the code that generates the item object in another method.
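Here is a minimal sketch of that split with a recent Scrapy version; the spider name, URLs, and expressions are placeholders, not a real site, but parse only follows links while parse_product_page only builds items:

import scrapy

class ExampleShopSpider(scrapy.Spider):
    name = 'exampleshop'
    start_urls = ['http://example.com/products?page=1']

    def parse(self, response):
        # Pagination: follow each product link and the "next page" link.
        for href in response.xpath('//a[@class="product"]/@href').extract():
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.parse_product_page)
        next_page = response.xpath('//a[@rel="next"]/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

    def parse_product_page(self, response):
        # Item generation: build the item object from the product page.
        yield {
            'name': response.xpath('//h1/text()').extract_first(),
            'price': response.css('.price::text').extract_first(),
            'url': response.url,
        }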
Is it a good idea to launch the spider right after you finish writing it? No! You should at least test each method to make sure it works as expected.
scrapy parse can help you test your methods and make sure they work fine. Here is an example:
scrapy parse --spider=googleshop_uk --loglevel=DEBUG -c parse_product_page "http://www.google.co.uk/shopping/product/14720124755692393976?hl=en&q=+test"
Here --spider is the name of your spider and -c is the name of the callback method. After downloading the HTML, Scrapy calls that method and prints out the result.
Make sure to use this to test your methods; it will save you a lot of time later, trust me!
Scrapy shell
When the spider raises an error while processing a specific page, you need to figure out why it did not work as expected. You can use the parse method discussed above, or you can use scrapy shell in this situation.
Typing scrapy shell url_of_page opens the Scrapy shell. I will cover the most important commands here; you can check the docs later to find out about other usages.
sel: you can use sel.xpath and sel.css to test your expressions against this web page, which helps you quickly find the error.
view: you can call view(response) to open the page in your browser.
fetch: fetch another web page and get the new response.
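A short example session (the URL and expressions are placeholders; in newer Scrapy versions the same selectors are also available directly on response):

$ scrapy shell "http://example.com/product/123"
>>> sel.xpath('//h1/text()').extract()        # test an XPath expression against this page
>>> sel.css('.price::text').extract()         # or a CSS expression
>>> view(response)                            # open the downloaded page in your browser
>>> fetch('http://example.com/product/456')   # download another page; response is replaced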
How to analyze the log
The log is the only way to figure out what happened while the crawl was running, so here are some suggestions about logging.
The spider may raise an exception while working, due to a different HTML structure or something else, so you need to log the entire HTML source code to analyze later. Here is an example.
log.msg('error occurred at ' + response.url, level=log.ERROR)
log.msg(str(e), level=log.ERROR)
log.msg(response.body, level=log.ERROR)
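In context, these calls usually sit in an except block inside the callback. Here is a minimal sketch, assuming a hypothetical build_item helper that holds the extraction code, and using the legacy scrapy.log API shown above (newer Scrapy versions use self.logger instead):

import scrapy
from scrapy import log  # legacy logging API

class ExampleShopSpider(scrapy.Spider):
    name = 'exampleshop'

    def parse_product_page(self, response):
        try:
            # build_item is a hypothetical helper containing the extraction code.
            yield self.build_item(response)
        except Exception as e:
            # Log the URL, the exception, and the full HTML source for later analysis.
            log.msg('error occurred at ' + response.url, level=log.ERROR)
            log.msg(str(e), level=log.ERROR)
            log.msg(response.body, level=log.ERROR)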
After the spider has finished, you can open the log file with vi, search for the error, and copy-paste the HTML source code into a simple HTML file. For example, say the spider raised an exception while working on a product page and you now have the source HTML after checking the log. How do you use it? You can rebuild the environment and use scrapy parse to see why it raised the exception.
python -m SimpleHTTPServer 10000
starts a simple HTTP server on port 10000, so you can use scrapy parse to recrawl the HTML page and see what makes it not work as expected.
scrapy parse --spider=googleshop_uk --loglevel=DEBUG -c parse_product_page "http://127.0.0.1:10000/product.html"
Or you can use scrapy shell:
scrapy shell http://127.0.0.1:10000/product.html
This helps you find the bug quickly and see what happened while crawling the pages.
Conclusion
I hope the tips above really help you. After that, you can dive into the Scrapy docs to find other useful tools.