How to debug your Scrapy spider

Last updated by michaelyin

Intro

This post will cover some useful tips for developing Scrapy spiders which can help you write your spiders more quickly.

Extract information

Some newbie developers think the expressions supported by Scrapy are difficult to learn, but the truth is, compared to methods like find_by_id and find_by_name in some frameworks, xpath, css, and re expressions are very extensible and easy to maintain, and they work well even in complex situations.

I recommend using the xpath and css methods in your spider, since they are easier to write and maintain than re methods.
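For example, both methods are available on the response object in a spider callback. Here is a minimal sketch (the element names and classes below are hypothetical):

def parse_product_page(self, response):
    # xpath: select the text of a hypothetical product title element
    title = response.xpath('//h1[@class="product-title"]/text()').extract_first()
    # css: the same element, written as a css expression
    title = response.css('h1.product-title::text').extract_first()
    yield {'title': title}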

When you try to use xpath to extract info from web pages, I think XPath Helper is a cool tool you should not miss (an extension for Chrome; you can find similar plugins for Firefox).

Here is how you use it. Press ctrl+shift+x to open XPath Helper and you will see its toolbar. Type your xpath query string in the toolbar, and the result of the xpath will show on the right side, while the selected content in the web page gets a yellow background, which makes it easy to check whether the xpath expression is right.

If you do not want to install an extension to get this done, Google Chrome has built-in support for querying xpath and css expressions. Take a look at $() and $x() in the DevTools console.

BTW, sometimes the html you get in your web browser might not be the same as the html your spider gets. That is normally caused by two factors. The first is the user agent: Scrapy uses its own user agent by default, which can be detected by the web server, and the server may send back meaningless html. The other factor is the IP address: if your browser goes through a proxy, you need to make sure the spider uses the same proxy.
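For example, a browser-like user agent can be set in settings.py (a minimal sketch; the string below is a placeholder, copy the exact value from your own browser):

# settings.py -- replace Scrapy's default user agent with a browser-like one.
# The string below is a placeholder; copy the real value from your browser.
USER_AGENT = ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/41.0.2272.89 Safari/537.36')

For the proxy, you can set meta={'proxy': 'http://host:port'} on each Request so the spider goes through the same proxy as your browser.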

Scrapy Parse

In most cases you will put the code which handles pagination in one method and the code which generates the item object in another method.
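For example, the skeleton of such a spider might look like this (a simplified sketch; the spider name matches the example below, and the selectors are hypothetical):

import scrapy

class GoogleshopUkSpider(scrapy.Spider):
    name = 'googleshop_uk'
    start_urls = ['http://www.google.co.uk/shopping?hl=en']  # hypothetical

    def parse(self, response):
        # Method 1: handle pagination, send each product page to the callback
        for href in response.xpath('//a[@class="product"]/@href').extract():
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.parse_product_page)
        next_page = response.xpath('//a[@rel="next"]/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

    def parse_product_page(self, response):
        # Method 2: generate the item object from a single product page
        yield {
            'title': response.xpath('//h1/text()').extract_first(),
            'url': response.url,
        }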

Is it a good idea to launch the spider directly after you finish writing it? No! You should at least test each method to make sure it works as expected.

scrapy parse can help you test your methods to make sure they work fine. Here is an example:

scrapy parse --spider=googleshop_uk --loglevel=DEBUG -c parse_product_page "http://www.google.co.uk/shopping/product/14720124755692393976?hl=en&q=+test"

--spider is the name of your spider and -c is the name of the callback method. After downloading the html, Scrapy will call that method and print out the result (the scraped items and any follow-up requests).

Make sure to use this to test your methods; it will save you a lot of time later, trust me!

Scrapy shell

When the spider raises an error while processing a specific page, you need to figure out why it did not work as expected. You can use the scrapy parse method discussed above, or you can use scrapy shell in this situation.

Typing scrapy shell url_of_page opens the Scrapy shell. I will cover the most important commands here (a sample session follows the list); you can check the docs later to find out other usages.

  • sel: you can use sel.xpath and sel.css to test your expressions against the current page, which quickly reveals errors
  • view: call view(response) to open the downloaded page in your browser
  • fetch: fetch another web page and get a new response
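Here is a sketch of a typical session (the URLs and expressions are hypothetical):

$ scrapy shell http://www.example.com/product.html
>>> # test expressions against the downloaded page
>>> sel.xpath('//h1[@class="product-title"]/text()').extract()
>>> sel.css('h1.product-title::text').extract()
>>> # open the downloaded page in your browser to see what the spider sees
>>> view(response)
>>> # download another page; this replaces response with the new one
>>> fetch('http://www.example.com/page/2/')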

How to analyze logs

The log is the only way to figure out what happened while the crawl was running, so I will give you some suggestions about logging.

The spider may raise an exception while working, due to a different html structure or something else, so you need to log the entire html source code to analyze later. Here is an example.

from scrapy import log

# Log the url, the exception, and the full html source for later analysis
log.msg('error occurred at ' + response.url, level=log.ERROR)
log.msg(str(e), level=log.ERROR)
log.msg(response.body, level=log.ERROR)
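In context, these calls usually live in a try/except around the extraction code, for example (a sketch; the callback name matches the example above, and the selector is hypothetical):

def parse_product_page(self, response):
    try:
        # ... extraction code that may fail on unexpected html ...
        yield {'title': response.xpath('//h1/text()').extract_first()}
    except Exception as e:
        log.msg('error occurred at ' + response.url, level=log.ERROR)
        log.msg(str(e), level=log.ERROR)
        log.msg(response.body, level=log.ERROR)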

After the spider finishes, you can open the log file in vi and search for the error, then copy-paste the html source code into a simple html file. For example, say the spider raised an exception while working on a product page, and you got the source html after checking the log. Now you might ask, how do you use it? You can rebuild the environment and use scrapy parse to see why it raised the exception.

python -m SimpleHTTPServer 10000 starts a simple http server on port 10000 (on Python 3 the equivalent is python -m http.server 10000), and you can use scrapy parse to recrawl the html page and see what makes it not work as expected.

scrapy parse --spider=googleshop_uk --loglevel=DEBUG -c parse_product_page "http://127.0.0.1:10000/product.html"

Or you can use scrapy shell:

scrapy shell http://127.0.0.1:10000/product.html

This will help you find the bug quickly and see what happened while crawling the pages.

Conclusion

I hope the tips above can really help you; after that, you can dive into the Scrapy docs to find other useful tools.
