This is the #7 post of my Scrapy Tutorial Series, in this Scrapy tutorial, I will talk about how to use XPath in scrapy to extract info and how to use tools help you quickly write XPath expressions.
Basic points of Xpath
First, we can did some tests on the homepage of Quotes to Scrape to understand the basic points of Xpath.
$ scrapy shell In : fetch("http://quotes.toscrape.com/")
In the code above, first we enter Scrapy shell by using
scrapy shell commands, after that, we can use some built-in commands in scrapy shell to help us. For example, we can use
fetch to help us to send http request and get the response for us. You can get the detail of the HTTP response by accessing property of the response object. There are many useful methods in response object, in the code below, we use the
xpath method to extract info for us.
#If we want to get html node response.xpath("/html").extract() #If we want to get body node, which is the child of html node response.xpath("/html/body").extract() #If you want to get all div descendant of this html response.xpath("/html//div").extract() #we can also drill down without having to start with /html, this expression would extract all div nodes response.xpath("//div").extract()
From the code above, you should know how to use
// to select the node. If you want to filter all div elements which have
response.xpath("//div[@class='quote']").extract() # you can use this syntax to filter nodes response.xpath("//div[@class='quote']/span[@class='text']").extract() # use text() to extract all text inside nodes response.xpath("//div[@class='quote']/span[@class='text']/text()").extract()
You should copy the code to your terminal, check the output, to make sure you really understand how it works.
Many people like to learn XPath by reading the Cheatsheet or online doc, however, I do not think this is a good way because many patterns would not be used in many cases. It is better to search the answer when you have encountered specific problem of XPath, if you still have problem after these steps, you can leave me message here.
Here are some good resources for you to check when you have problem.
How to get XPath in Chrome
To make you quickly get the XPath in Chrome, it is recommended to install Chrome Extension called XPath Helper, I would show you how to use this great extension.
Ctrl+Shift+xto activate it in web page, you will console in page.
Shift, then move your mouse, then the console will show the XPath expression and the right side will show the result.
- In most cases, the XPath expression generated in the console is very long, so you can edit if you like. You can edit the XPath query directly in the console. The results box will immediately reflect your changes, which is the most powerful feature of this Plugin.
How to get XPath in Firefox
FirePath is a FIrebug Extension which can generate XPath for you, it is very easy.
- Install FireBug Firefox extension, which is a prerequisite to install FirePath.
- Install FirePath Firefox extension. Remember to restart firefox after installation.
- Right-click on the element you want to extract and select "Inspect in FirePath".
- You can see the XPath generated in the box
In this scrapy tutorial, we learned how to how to use XPath in scrapy to extract info, if you have any questions about your project, just left a message here and I will respond ASAP. What is more, you really should install the plugin mentioned above to increase your productivity, it can help you a lot if you have not much experience with Xpath.
For people who like to read ebook instead of blog posts, I have published a book on leanpub，where you can get pdf, epub, mobi version of this Scrapy book Ultimate Guide To Scrapy.