Scrapy Exercises: Get prepared for web scraping challenges

The Scrapy Exercises project helps Scrapy beginners quickly learn web scraping skills by solving real-world problems step by step.

Why I created this project

Hi, this is Michael. I am a web scraping expert with over 4 years of experience in this area. I have helped many people learn web scraping with Python, and I am still doing so today.

When I was a newbie developer, I liked to read good blog posts and articles about web scraping. However, unlike tutorials in other areas such as web programming, the code in a scraping blog post stops working once the target website changes its structure, which sometimes leaves readers scratching their heads and wishing for somebody who can help them.

At that time, I thought there should be a better way to help people quickly get prepared for the web scraping challenge, because data is more and more important to us now, and web scraping is a very basic skill if you want to do any data analysis.

That is why I created these Scrapy exercises. My goal is to break down a complex mission, such as crawling a bunch of websites, into small tasks that people can solve step by step. What is more, if they have trouble solving an exercise, they can ask for help with specific details instead of just "I have trouble crawling the website".

What is included in these web scraping exercises

Right now there are 10 exercises, each of which simulates real website behavior and targets a specific skill you need to learn. As you solve the exercises one by one, you get much closer to being an experienced web scraping developer.

Each exercise also comes with tips and a description, which help you know what you need to learn for the extraction and where you can find the learning resources.

Who might need this project

Anyone who wants to learn web scraping, test their web scraping skills, or just have fun with it, much like on CheckIO, might find this project useful.

How it works

I created a sample e-commerce website that includes many product detail pages and list pages. For example, two product detail pages that show the same product details may actually deliver the data in different ways. That is the key point of scraping: there are a number of ways to present data, and you have to find out which one the target website uses.

You can find the description of each web scraping exercise below. Each exercise has an entry URL, and you need to build a web spider to get data from it.

All the tips and tools you need are also discussed in the description of each exercise.

Exercise list

Scrapy Exercise #1: Basic Info Scraping

Exercise link

Right now there are mainly four typical methods for extracting data:

  1. CSS extraction
  2. XPath extraction
  3. Regex extraction
  4. Custom methods shipped with a package, such as find_all in BeautifulSoup

You can choose whichever one you like. In this exercise, try to extract the product details such as title, description, and price.
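As a quick illustration, here is a minimal Scrapy spider sketch combining CSS extraction (method 1) and XPath extraction (method 2). The entry URL and all selectors are hypothetical placeholders; adapt them to the markup of the exercise page.

```python
import scrapy


class BasicSpider(scrapy.Spider):
    name = "exercise1"
    # Hypothetical entry URL; use the exercise link above instead.
    start_urls = ["https://example.com/exercise/detail_basic/"]

    def parse(self, response):
        yield {
            # CSS extraction (method 1); class names are assumptions.
            "title": response.css("h3.card-title::text").get(),
            "description": response.css("p.card-description::text").get(),
            # XPath extraction (method 2) of the same style of field.
            "price": response.xpath('//h4[@class="card-price"]/text()').get(),
        }
```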

Scrapy Exercise #2: Analyze JSON

Exercise link

JavaScript is one of the most popular languages today, and JSON has become a very popular format for storing and exchanging data.

Try to extract the product details such as title, description, and price.

Tips:

  1. In some cases, an XPath or CSS expression that works in your browser will not work in your code, because some DOM elements may have been modified by frontend JavaScript after the page loaded.

  2. Sometimes there are Unicode characters in the raw JSON string that can cause the program to raise UnicodeDecodeError. Remember to make sure the JSON string is a Unicode string before calling json.loads. If there is a syntax error when loading, a tool such as JSONLint can help you figure out where the error is. A sketch of the whole workflow follows.
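Here is a minimal sketch of that workflow, assuming the page embeds the product data as a JavaScript object inside a script tag; the variable name, regex, and JSON keys are all hypothetical.

```python
import json

import scrapy


class JsonSpider(scrapy.Spider):
    name = "exercise2"
    # Hypothetical entry URL; use the exercise link above instead.
    start_urls = ["https://example.com/exercise/detail_json/"]

    def parse(self, response):
        # Pull the JavaScript object out of the script tag with a regex;
        # "obj" is a hypothetical variable name.
        raw = response.xpath(
            "//script[contains(., 'obj')]/text()"
        ).re_first(r"(?s)obj\s*=\s*(\{.*?\});")
        data = json.loads(raw)  # in Python 3, str is already Unicode
        yield {
            "title": data.get("title"),
            "description": data.get("description"),
            "price": data.get("price"),
        }
```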

Scrapy Exercise #3: Recursively Scraping Pages

Exercise link

Try to extract all the product details such as title and description. You should also handle pagination here, so that in the end you get 100+ records.
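A common pattern for this in Scrapy is response.follow: scrape the items on the current page, then follow the "next page" link into the same callback. The selectors below are hypothetical sketches, not the exercise's real markup.

```python
import scrapy


class PaginationSpider(scrapy.Spider):
    name = "exercise3"
    # Hypothetical entry URL; use the exercise link above instead.
    start_urls = ["https://example.com/exercise/list_basic/"]

    def parse(self, response):
        # Scrape every product card on the current list page.
        for card in response.css("div.card"):
            yield {
                "title": card.css("h4.card-title a::text").get(),
                "description": card.css("p.card-text::text").get(),
            }
        # Follow the "next page" link recursively; the crawl stops
        # by itself when there is no such link anymore.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```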

Scrapy Exercise #4: Mimicking Ajax requests

Exercise link

Ajax is very popular nowadays, so you should learn how to mimic Ajax requests in your web crawler.

In this exercise, try to extract the product details such as title, description, and price.

You should learn how to inspect network requests in the browser and filter them. After you figure out the URL of the Ajax request, implement it in your spider.
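For example, once you have found the Ajax URL in the network panel, you can request it directly and skip the HTML page entirely. A minimal sketch, with a hypothetical endpoint and JSON keys:

```python
import json

import scrapy


class AjaxSpider(scrapy.Spider):
    name = "exercise4"

    def start_requests(self):
        # Request the Ajax endpoint found in the network panel directly,
        # instead of the HTML page that triggers it (URL hypothetical).
        yield scrapy.Request("https://example.com/exercise/ajaxdetail/")

    def parse(self, response):
        data = json.loads(response.text)  # the endpoint returns JSON
        yield {
            "title": data.get("title"),
            "description": data.get("description"),
            "price": data.get("price"),
        }
```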

Scrapy Exercise #5: Inspect HTTP request

Exercise link

Some websites check HTTP headers such as Referer to block abnormal requests.

Try to extract the product details such as title, description, and price.
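A minimal sketch of sending custom headers with a Scrapy request; which headers the site actually checks, and their values, is what you need to discover, so the ones below are just assumptions:

```python
import scrapy


class HeaderSpider(scrapy.Spider):
    name = "exercise5"

    def start_requests(self):
        # URL and header values are hypothetical; find the real ones in
        # the request your browser sends.
        yield scrapy.Request(
            "https://example.com/exercise/detail_header/",
            headers={
                "Referer": "https://example.com/exercise/",
                "X-Requested-With": "XMLHttpRequest",
            },
        )

    def parse(self, response):
        yield {
            "title": response.css("h3.card-title::text").get(),
            "price": response.css("h4.card-price::text").get(),
        }
```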

Scrapy Exercise #6: Scraping Infinite Scrolling Pages (Ajax)

Exercise link

The key to scraping infinite scrolling pages is to use the network panel in your browser to figure out the URL of the next page.

Sometimes you also need to take care of the HTTP headers to make your code work.

In this exercise, try to crawl all product info.
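A sketch of the usual pattern: keep requesting the (hypothetical) Ajax endpoint with an increasing page parameter until it returns no more items. The JSON layout is an assumption as well.

```python
import json

import scrapy


class ScrollSpider(scrapy.Spider):
    name = "exercise6"
    # Hypothetical endpoint; the real one comes from the network panel.
    api_url = "https://example.com/exercise/list_infinite_scroll/?page={}"

    def start_requests(self):
        yield scrapy.Request(self.api_url.format(1), meta={"page": 1})

    def parse(self, response):
        data = json.loads(response.text)
        for product in data.get("products", []):  # hypothetical JSON layout
            yield {"title": product.get("title"), "price": product.get("price")}
        # Request the next page until the endpoint returns no products.
        if data.get("products"):
            page = response.meta["page"] + 1
            yield scrapy.Request(self.api_url.format(page), meta={"page": page})
```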

Scrapy Exercise #7: Mimicking Ajax Requests (Headers and Cookies)

Exercise link

Try to extract the product details such as title, description, and price.

Tips:

  1. After some tests, you might find that it is hard to make the spider get the data through a plain Ajax request, so you need to dive into the details of the Ajax request.

  2. You need to make sure the URL, HTTP headers, and cookie values are all reasonable, just like what your browser sends; see the sketch below.
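A minimal sketch of reproducing such a request in Scrapy; the header, cookie name, and cookie value below are hypothetical stand-ins for whatever your browser actually sends:

```python
import scrapy


class CookieSpider(scrapy.Spider):
    name = "exercise7"

    def start_requests(self):
        # Reproduce everything the browser sends: URL, headers, and cookies.
        # The header and cookie below are hypothetical; copy the real ones
        # from the request shown in your browser's network panel.
        yield scrapy.Request(
            "https://example.com/exercise/list_cookie/",
            headers={"X-Requested-With": "XMLHttpRequest"},
            cookies={"session_token": "value-from-your-browser"},
        )

    def parse(self, response):
        for card in response.css("div.card"):
            yield {"title": card.css("h4.card-title a::text").get()}
```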

Scrapy Exercise #8: Login form

Exercise link

In this exercise, you need to use the username scrapingclub and the password scrapingclub to log in. After you successfully log in, you will be redirected to a welcome page.
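Scrapy's FormRequest.from_response is the usual tool here, because it carries over the form's hidden fields (such as a CSRF token) automatically. The form field names below are assumptions; check the login page's HTML for the real ones.

```python
import scrapy


class LoginSpider(scrapy.Spider):
    name = "exercise8"
    # Hypothetical entry URL; use the exercise link above instead.
    start_urls = ["https://example.com/exercise/login/"]

    def parse(self, response):
        # from_response() keeps the form's hidden fields (e.g. the CSRF
        # token) and merges in the credentials; field names are assumptions.
        yield scrapy.FormRequest.from_response(
            response,
            formdata={"username": "scrapingclub", "password": "scrapingclub"},
            callback=self.after_login,
        )

    def after_login(self, response):
        # On success you should have been redirected to the welcome page.
        self.logger.info("Landed on %s", response.url)
```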

Scrapy Exercise #9: Solve Captcha

Exercise link

In this exercise, you again need to log in with the username scrapingclub and the password scrapingclub, but this time you also have to solve a captcha. After you successfully log in, you will be redirected to a welcome page.
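One common approach for a simple image captcha is OCR. Below is a rough sketch using requests, Pillow, and pytesseract; every URL and form field name is hypothetical, and a real captcha may well require a solving service instead.

```python
import io

import pytesseract
import requests
from PIL import Image

session = requests.Session()

# Fetch the captcha image and OCR it (both URLs are hypothetical).
img_bytes = session.get("https://example.com/exercise/captcha.png").content
captcha_text = pytesseract.image_to_string(Image.open(io.BytesIO(img_bytes))).strip()

# Submit the login form with the OCR result; field names are hypothetical.
response = session.post(
    "https://example.com/exercise/login_captcha/",
    data={
        "username": "scrapingclub",
        "password": "scrapingclub",
        "captcha": captcha_text,
    },
)
print(response.url)  # should end up at the welcome page on success
```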

Scrapy Exercise #10: Decode Minified JavaScript

Exercise link

Many websites now minify their JavaScript files when deploying, using tools such as UglifyJS or other compressors, so you should learn how to analyze minified code in the browser and, in some cases, debug it to figure out the workflow. This process is like disassembly in reverse engineering.

You will see that the Ajax URL uses a sign parameter, but you have no idea where it comes from, and the js file detail_sign.js seems to be minified.

Tips:

  1. Before digging into the minified js, find a way to pretty print it in your browser; for example, the Sources panel in Chrome DevTools has a {} (pretty print) button.
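Once you have worked out from the pretty-printed js how sign is computed, reimplement it in Python. Purely for illustration, the sketch below pretends the algorithm turned out to be an MD5 hash of the product id; the real logic is whatever you find in detail_sign.js.

```python
import hashlib
import json

import scrapy


class SignSpider(scrapy.Spider):
    name = "exercise10"
    product_id = "9999"  # hypothetical

    def start_requests(self):
        # Recompute the sign in Python exactly as the js does; the
        # MD5-of-id scheme here is an assumption for illustration only.
        sign = hashlib.md5(self.product_id.encode()).hexdigest()
        url = (
            "https://example.com/exercise/ajaxdetail/"
            f"?id={self.product_id}&sign={sign}"
        )
        yield scrapy.Request(url)

    def parse(self, response):
        data = json.loads(response.text)
        yield {"title": data.get("title"), "price": data.get("price")}
```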