Some readers of this Scrapy tutorial series might have questions: what is the extract method, how do I use it, and what if I want to iterate over a node list and extract the sub-nodes? In this post, I will talk about Scrapy Selector and how to use it with iteration.
Scrapy has its own mechanism for extracting data, called selectors, which select certain parts of an HTML document by using XPath or CSS expressions. XPath is designed to select info from XML documents, and since HTML has a similar tree structure, XPath can also be used to select info from HTML. CSS is designed to apply styles to HTML nodes, so it can also be used to select nodes from HTML. There is no solid answer to which expression is better, so just choose the one you like. In this Scrapy tutorial for Python 3, I will use XPath with Scrapy Selector, but you can also use CSS with Scrapy Selector.
To help you understand Scrapy Selector better, we will use real HTML source code to test:
<html>
 <head>
  <title>Scrapy Tutorial Series By MichaelYin</title>
 </head>
 <body>
  <div class='links'>
   <a href='one.html'>Link 1<img src='image1.jpg'/></a>
   <a href='two.html'>Link 2<img src='image2.jpg'/></a>
   <a href='three.html'>Link 3<img src='image3.jpg'/></a>
  </div>
 </body>
</html>
We can construct a Selector instance by passing text or a TextResponse object. First, let's enter the Scrapy shell by running scrapy shell, then paste the code from this blog post into the terminal. What you should know here is that >>> indicates an interactive session: code typed into the Python shell is marked with it, and output is shown without the arrows.
>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse
Now construct a selector from text:
>>> body = """
<html>
 <head>
  <title>Scrapy Tutorial Series By MichaelYin</title>
 </head>
 <body>
  <div class='links'>
   <a href='one.html'>Link 1<img src='image1.jpg'/></a>
   <a href='two.html'>Link 2<img src='image2.jpg'/></a>
   <a href='three.html'>Link 3<img src='image3.jpg'/></a>
  </div>
 </body>
</html>
"""
>>> sel = Selector(text=body)
>>> sel.xpath("//title/text()").extract()
['Scrapy Tutorial Series By MichaelYin']
If you want to construct a selector instance from a response:
>>> response = HtmlResponse(url="http://mysite.com", body=body, encoding='utf-8')
>>> Selector(response=response).xpath("//title/text()").extract()
['Scrapy Tutorial Series By MichaelYin']
The response object also exposes a selector on its selector attribute to make the selector convenient to use, so in most cases you can use it directly.
>>> response.selector.xpath('//title/text()').extract()
>>> response.xpath('//title/text()').extract()
Both calls above return the same output.
How to use Scrapy selectors
When you use the xpath or css method to select nodes from HTML, the output returned by the method is a SelectorList instance:
>>> response.selector.xpath('//a/@href')
[<Selector xpath='//a/@href' data='one.html'>,
 <Selector xpath='//a/@href' data='two.html'>,
 <Selector xpath='//a/@href' data='three.html'>]
As you can see, SelectorList is a list of new selectors. If you want to get the textual data instead of a SelectorList, just call extract:
>>> response.selector.xpath('//a/@href').extract()
['one.html', 'two.html', 'three.html']
You can use extract_first to extract only the first matched element, which saves you from indexing into the list that extract returns:
>>> response.xpath("//a[@href='one.html']/img/@src").extract_first()
'image1.jpg'
The xpath method returns a list of selectors, so we can also call xpath on each of these selectors to extract data. For example, we can first get the list of hyperlinks, then extract the URL and the image source from each one.
>>> links = response.xpath("//a")
>>> links.extract()
['<a href="one.html">Link 1<img src="image1.jpg"></a>',
 '<a href="two.html">Link 2<img src="image2.jpg"></a>',
 '<a href="three.html">Link 3<img src="image3.jpg"></a>']
>>> for index, link in enumerate(links):
...     args = (index, link.xpath('@href').extract(), link.xpath('img/@src').extract())
...     print('Link number %d points to url %s and image %s' % args)
Link number 0 points to url ['one.html'] and image ['image1.jpg']
Link number 1 points to url ['two.html'] and image ['image2.jpg']
Link number 2 points to url ['three.html'] and image ['image3.jpg']
In this Scrapy tutorial for Python 3, I talked about how to construct a Scrapy selector, how to use it to extract data, and how to use nested selectors. All the code in this tutorial is Python 3, which is the future of Python. If you have any questions about Scrapy, just leave me a message here and I will respond ASAP.
For people who like to read ebooks instead of blog posts, I have published a book on Leanpub, Ultimate Guide To Scrapy, where you can get the PDF, EPUB, and MOBI versions.