Enhance your scrapy shell

Last updated on by michaelyin

When developing web spiders with the Scrapy framework, the Scrapy shell is very helpful because it lets you debug code quickly. However, I found that I had to type many import statements in every shell session, which is time consuming.

So I dug into the Scrapy source code to find a way to make the Scrapy shell more comfortable to use. Here is my solution.

Fewer imports

The re, json, urlparse, and copy modules are very useful when writing spiders, so it would be nice to have them available as soon as the Scrapy shell starts. Let's get started.

Edit scrapy/utils/console.py in the Scrapy source code:

import re, json, urlparse, copy

The import goes at the top of console.py so that the modules are available inside that file. Then add them to the namespace dict inside start_python_console:

namespace["re"] = re
namespace["json"] = json
namespace["urlparse"] = urlparse
namespace["copy"] = copy

Since namespace is used as the local namespace of the IPython shell, adding the modules to this dict lets you use re, json, urlparse, and copy directly after running scrapy shell.
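
To make the idea concrete, here is a minimal sketch of how names placed in a namespace dict become usable without an import. It is not Scrapy's code: build_namespace is a hypothetical helper, exec stands in for the interactive shell, and the try/except import is an assumption added so the snippet also runs on Python 3, where urlparse lives in urllib.parse.

```python
import copy
import json
import re

try:
    import urlparse  # Python 2
except ImportError:
    from urllib import parse as urlparse  # Python 3 (assumption: not in the original post)


def build_namespace(namespace=None):
    """Hypothetical helper: add the helper modules to a shell namespace dict."""
    if namespace is None:
        namespace = {}
    namespace.update({"re": re, "json": json, "urlparse": urlparse, "copy": copy})
    return namespace


# "response" is just a placeholder for the objects Scrapy normally puts in the shell
ns = build_namespace({"response": None})

# Code executed with this dict as its local namespace can use the modules directly
exec("host = urlparse.urlparse('http://example.com/a?b=1').netloc", {}, ns)
print(ns["host"])  # example.com
```

The shell does the same thing in spirit: every key in the dict shows up as a ready-made name at the prompt.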

Compare URLs

Sometimes I need to quickly compare two URLs and find the differences between them, so I added a compare function to my Scrapy shell.

import difflib

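def compare(str1, str2):
    """Print a line-by-line diff of two strings."""
    a = str1.splitlines()
    b = str2.splitlines()
    d = difflib.Differ()
    diff = d.compare(a, b)
    # Parentheses keep this line valid on both Python 2 and Python 3
    print('\n'.join(diff))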

It uses difflib from the Python standard library and helps me get things done quickly.
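
Because a URL is a single line, a line-based diff of two URLs reports the whole URL as changed. One possible variation, sketched below under my own assumptions, splits each URL into its components and query parameters first, so the diff points at the exact part that differs. The names url_lines and compare_urls are hypothetical and not part of the original console.py.

```python
import difflib

try:
    from urlparse import urlparse, parse_qsl  # Python 2
except ImportError:
    from urllib.parse import urlparse, parse_qsl  # Python 3


def url_lines(url):
    """Split a URL into one line per component and per query parameter."""
    parts = urlparse(url)
    lines = [
        "scheme: " + parts.scheme,
        "host: " + parts.netloc,
        "path: " + parts.path,
    ]
    # Sort the parameters so that ordering alone does not show up as a difference
    for key, value in sorted(parse_qsl(parts.query, keep_blank_values=True)):
        lines.append("param %s=%s" % (key, value))
    return lines


def compare_urls(url1, url2):
    """Return only the diff lines that were removed ('- ') or added ('+ ')."""
    diff = difflib.ndiff(url_lines(url1), url_lines(url2))
    return [line for line in diff if line[:2] in ("- ", "+ ")]


changes = compare_urls(
    "http://example.com/list?page=1&sort=asc",
    "http://example.com/list?page=2&sort=asc",
)
print("\n".join(changes))
```

For the two example URLs only the page parameter is reported, once with a leading "- " and once with a leading "+ ".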

Here is the final console.py. You can copy it directly and use it as you wish.

import re, json, urlparse, copy

import difflib

def compare(str1, str2):
    a = str1.splitlines()
    b = str2.splitlines()
    d = difflib.Differ()
    diff = d.compare(a, b)
    print('\n'.join(diff))

def start_python_console(namespace=None, noipython=False, banner=''):
    """Start Python console binded to the given namespace. If IPython is
    available, an IPython console will be started instead, unless `noipython`
    is True. Also, tab completion will be used on Unix systems.
    """
    if namespace is None:
        namespace = {}
    namespace["re"] = re
    namespace["json"] = json
    namespace["urlparse"] = urlparse
    namespace["copy"] = copy
    namespace["compare"] = compare

    try:
        try: # use IPython if available
            if noipython:
                raise ImportError()

            try:
                try:
                    from IPython.terminal import embed
                except ImportError:
                    from IPython.frontend.terminal import embed
                sh = embed.InteractiveShellEmbed(banner1=banner)
            except ImportError:
                from IPython.Shell import IPShellEmbed
                sh = IPShellEmbed(banner=banner)

            sh(global_ns={}, local_ns=namespace)
        except ImportError:
            import code
            try: # readline module is only available on unix systems
                import readline
            except ImportError:
                pass
            else:
                import rlcompleter
                readline.parse_and_bind("tab:complete")
            code.interact(banner=banner, local=namespace)
    except SystemExit: # raised when using exit() in python code.interact
        pass

Update on 2015-05-27

I think it is better to use the Scrapy shell with an IPython profile, so I modified scrapy/utils/console.py to make this work.

def start_python_console(namespace=None, noipython=False, banner=''):
    """Start Python console binded to the given namespace. If IPython is
    available, an IPython console will be started instead, unless `noipython`
    is True. Also, tab completion will be used on Unix systems.
    """
    if namespace is None:
        namespace = {}

    try:
        try: # use IPython if available
            if noipython:
                raise ImportError()
            from IPython import start_ipython
            # user_ns exposes the Scrapy shell objects inside IPython;
            # a missing IPython raises ImportError and triggers the fallback below
            start_ipython([], user_ns=namespace)

        except ImportError:
            import code
            try: # readline module is only available on unix systems
                import readline
            except ImportError:
                pass
            else:
                import rlcompleter
                readline.parse_and_bind("tab:complete")
            code.interact(banner=banner, local=namespace)
    except SystemExit: # raised when using exit() in python code.interact
        pass

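I created a file called $HOME/.ipython/profile_default/startup/01-scrapy.py with the following content. IPython runs the files in this startup directory every time it starts, so the imports and the compare function are available in the Scrapy shell without patching them into console.py.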

import re, json, urlparse, copy

import difflib

def compare(str1, str2):
    a = str1.splitlines()
    b = str2.splitlines()
    d = difflib.Differ()
    diff = d.compare(a, b)
    print('\n'.join(diff))
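
If you would rather create the startup file from a script than by hand, a small sketch could look like this. write_startup_file and STARTUP_CODE are hypothetical names of my own, and the directory layout is simply the one used in the path above; adjust it if your IPython directory lives elsewhere.

```python
import os

# Content of the startup file; shortened here to the import line from this post
STARTUP_CODE = "import re, json, urlparse, copy\n"


def write_startup_file(ipython_dir, name="01-scrapy.py", code=STARTUP_CODE):
    """Write `code` into <ipython_dir>/profile_default/startup/<name> and return the path."""
    startup_dir = os.path.join(ipython_dir, "profile_default", "startup")
    if not os.path.isdir(startup_dir):
        os.makedirs(startup_dir)
    path = os.path.join(startup_dir, name)
    with open(path, "w") as handle:
        handle.write(code)
    return path
```

Calling write_startup_file(os.path.expanduser("~/.ipython")) would target the same location as the file described above. Note that it overwrites an existing file with the same name.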
