Make Scrapy work with socket proxy

Post on by michaelyin

Working with http proxy

When you crawl data from a website such as Google Shop, the site may detect the source IP and restrict some services to specific IP addresses. Scrapy can handle this situation by sending requests through a proxy.

Scrapy provides HttpProxyMiddleware to support HTTP proxies. If you want your web crawler to go through a proxy, the first thing you need to do is ...


Install Nodejs on Ubuntu

Post on by michaelyin

I just installed Node.js on Ubuntu through the package manager and ran into some problems.

When using npm, you might get an error like this:

npm ERR! Error: EACCES, symlink '../lib/node_modules/spm/bin/spm'
npm ERR!  { [Error: EACCES, symlink '../lib/node_modules/spm/bin/spm']
npm ERR!   errno: 3,
npm ERR!   code: 'EACCES',
npm ERR!   path: '../lib/node_modules/spm/bin/spm' }
npm ERR! 
npm ERR! Please try ...

Make python module editable

Post on by michaelyin

Intro

There are a great many packages in the Python world, and their rich functionality helps us get things done quickly. Sometimes, though, you may want to read a package's internals to learn from them, or debug its code to fix a bug. By default, an installed package's code lives in site-packages, which is not a convenient setup for a Python hacker. In this ...


Line intersection problem

Post on by michaelyin

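While working through problems on checkio recently, I came across one about determining whether lines intersect (the problem is linked here). I had done this kind of exercise to death back in school, so I started coding right away: first compute the slope from two points on the coordinate plane, then derive the line equation, and finally use the two line equations to work out whether an intersection point exists, adding handling for cases such as infinite slope. The code I ended up with is below.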

def checkio(data):
    xw1, yw1 = data[0]
    xw2, yw2 = data[1]
    xa, ya = data[2]
    xb, yb = data[3]

    if(xw1 - xw2 == 0  and xb - xa == 0):
        # vertical vertical
        return False
    elif(xw1 - xw2 == 0  and xb - xa != 0):
        # ver hor
        slope, d = cal_line(xa, ya, xb, yb)
        y = xw1 * slope + d
        dir = direction(data[2], data[3 ...
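
The slope-and-intercept approach described in the introduction can be sketched on its own, independent of the checkio task. This is a minimal illustration, not the author's solution: the helper names `line_through` and `intersection` are hypothetical, and each line is assumed to be given by two distinct points.

```python
def line_through(p, q):
    """Return (slope, intercept) of the line through p and q.

    Returns None for a vertical line, whose slope is infinite.
    """
    (x1, y1), (x2, y2) = p, q
    if x1 == x2:
        return None
    slope = (y2 - y1) / (x2 - x1)
    return slope, y1 - slope * x1


def intersection(p1, p2, p3, p4):
    """Intersection of line p1-p2 with line p3-p4.

    Returns an (x, y) tuple, or None when the lines are parallel
    (coincident lines are also reported as None in this sketch).
    """
    a = line_through(p1, p2)
    b = line_through(p3, p4)
    if a is None and b is None:
        return None  # both vertical
    if a is None:
        # First line is vertical: plug its x into the second line.
        slope, d = b
        x = p1[0]
        return (x, slope * x + d)
    if b is None:
        # Second line is vertical: plug its x into the first line.
        slope, d = a
        x = p3[0]
        return (x, slope * x + d)
    (m1, d1), (m2, d2) = a, b
    if m1 == m2:
        return None  # equal slopes, no single crossing point
    x = (d2 - d1) / (m1 - m2)
    return (x, m1 * x + d1)
```

This treats the inputs as infinite lines. A task about segments or rays would need an extra check that the crossing point lies within the relevant range.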

CodeKata-Bloom Filter

Post on by michaelyin

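The Bloom filter is an algorithm commonly used in large-scale data processing, mainly to answer membership queries on a set. Its underlying principle is very similar to a bitmap.

Basic principle:

As shown in the figure below, start with a byte array that is m bits long, with every bit set to 0.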

image

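Now suppose we want to write a value a into the byte array. Apply k different hash functions to a, making sure each resulting hash value falls within the range {1, m}. With k hash functions we get k hash values in total, namely $h_{1}(a), h_{2}(a), \dots, h_{k}(a)$. Set the bits at all of these positions in the byte array to 1; once that is done, a has been written into the array. For multiple values, repeat the process. If a bit is already 1 during a write, leave it as it is.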

image

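To check whether the array contains a value b, apply the same k hash functions to b to get k hash values (note: writes and queries use the same k hash functions), then look up the corresponding k bits in the byte array. If any one of them is 0, the element is definitely not in the array. If all of them are 1, the element may be in the array. (Note that a True result can be wrong; the error rate depends on the number of hash functions, the length of the byte array, and the number of inputs, which is discussed in detail later.)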

image
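
The write and query steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production implementation: the class name `BloomFilter` and the method `might_contain` are hypothetical, the k hash functions are simulated by prefixing a counter to the input of SHA-256, and bit positions run from 0 to m-1 rather than 1 to m.

```python
import hashlib


class BloomFilter:
    def __init__(self, m, k):
        self.m = m  # number of bits in the filter
        self.k = k  # number of hash functions
        self.bits = bytearray((m + 7) // 8)  # all bits start at 0

    def _positions(self, item):
        """Yield k bit positions in [0, m) for the item.

        Assumption: hash function i is SHA-256 over a 4-byte counter i
        followed by the item's string form.
        """
        data = str(item).encode("utf-8")
        for i in range(self.k):
            digest = hashlib.sha256(i.to_bytes(4, "big") + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        # Set every hashed position to 1; bits that are already 1 stay 1.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # Any 0 bit means "definitely absent"; all 1s means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )
```

After `bf = BloomFilter(1024, 3)` and `bf.add("a")`, the call `bf.might_contain("a")` always returns True. For a value that was never added, the result is usually False but can be True when all of its positions happen to have been set by other values.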

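Mathematical derivation:

Suppose there are n inputs, inserted Bloom-filter style into a byte array of m bits, with each insertion using k hash values to set bits to 1.

After all n values have been inserted, the probability that one particular bit is still 0 is: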

$$(1-\frac{1}{m})^{kn}$$

Conversely, the probability that a particular bit has been set to 1 is:

$$1-(1-\frac{1}{m})^{kn}$$
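
To get a feel for these two probabilities, they can be evaluated numerically. The sketch below is only an illustration: the function names are hypothetical, and the last function uses the standard approximation $(1-\frac{1}{m})^{kn} \approx e^{-kn/m}$, which holds for large m and is added here for comparison rather than taken from the derivation above.

```python
import math


def prob_bit_zero(m, n, k):
    """Probability that a given bit is still 0 after n inserts with k hashes."""
    return (1 - 1 / m) ** (k * n)


def prob_bit_one(m, n, k):
    """Probability that a given bit has been set to 1."""
    return 1 - prob_bit_zero(m, n, k)


def prob_bit_zero_approx(m, n, k):
    """Large-m approximation of prob_bit_zero: exp(-k * n / m)."""
    return math.exp(-k * n / m)


if __name__ == "__main__":
    # Example values chosen for illustration only.
    m, n, k = 10000, 1000, 7
    print(prob_bit_zero(m, n, k), prob_bit_zero_approx(m, n, k))
    print(prob_bit_one(m, n, k))
```

The exact and approximate values should agree closely whenever m is large.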

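If the bits at all k hash positions are 1 for an element that was never inserted, a false positive occurs. The probability of this happening is: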

$$(1-(1-\frac{1 ...