A friend recently asked me to scrape post data from a forum. I searched for open-source crawlers, read a number of comparisons, and Scrapy came out looking like the best choice. But I had always worked in Java and PHP and didn't know Python, so I spent a day skimming the basics of Python and then got started. I never expected that setting up a working environment alone would cost me another full day. Here is a record of the pitfalls I hit while installing and configuring Scrapy.

  Environment: CentOS 6.0 virtual machine

  The first step was a Python runtime. I ran the python command and found one already installed, and rejoiced (prematurely, as it turned out). Google suggested a direct pip install Scrapy, which failed: pip was missing. I installed pip and ran pip install Scrapy again, and now python-devel was missing. The whole morning went back and forth like this. Eventually I downloaded the Scrapy source to install it, and it suddenly demanded Python 2.7. A check with python --version showed 2.6 staring back at me; my heart sank.

  The official documentation (http://doc.scrapy.org/en/master/intro/install.html) confirms it: Python 2.7 is required. No way around it, so the first job is upgrading Python.

1. Upgrade Python

  • Download and install Python 2.7
wget https://www.python.org/ftp/python/2.7.10/Python-2.7.10.tgz
tar -zxvf Python-2.7.10.tgz
cd Python-2.7.10
./configure
make all
make install
make clean
make distclean
  • Check the Python version
python --version

  It still reports 2.6, because /usr/bin/python still points at the old interpreter.

  • Repoint the python command
mv /usr/bin/python /usr/bin/python2.6_bak
ln -s /usr/local/bin/python2.7 /usr/bin/python
  • Check the version again
# python --version
Python 2.7.10
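
  As an extra check that the plain python command really runs the new interpreter, here is a small sketch of mine (not part of the original steps):

# check_python.py -- confirm which interpreter `python` launches
import sys

print(sys.executable)                      # expect /usr/bin/python (now a symlink to 2.7)
print("%d.%d.%d" % sys.version_info[:3])   # expect 2.7.10
assert sys.version_info[:2] == (2, 7), "still running the old interpreter"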

  With that, the Python upgrade is done. Back to installing Scrapy with pip install scrapy, which errors again:

-bash: pip: command not found
  • Install pip
wget https://bootstrap.pypa.io/get-pip.py
python get-pip.py
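
  pip should now be on the PATH. To confirm it from inside Python as well, a quick sketch of mine:

# confirm pip is importable and report its version
import pip
print(pip.__version__)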

  Then pip install scrapy once more, and another error:

Collecting Twisted>=10.0.0 (from scrapy)
Could not find a version that satisfies the requirement Twisted>=10.0.0 (from scrapy) (from versions: )
No matching distribution found for Twisted>=10.0.0 (from scrapy)

  Twisted is missing, so install Twisted.

2. Install Twisted

  • Download Twisted (https://pypi.python.org/packages/source/T/Twisted/Twisted-15.2.1.tar.bz2#md5=4be066a899c714e18af1ecfcb01cfef7)
  • Install it
wget https://pypi.python.org/packages/source/T/Twisted/Twisted-15.2.1.tar.bz2
tar -xjvf Twisted-15.2.1.tar.bz2
cd Twisted-15.2.1
python setup.py install
  • Verify the installation
python
Python 2.7.10 (default, ...)
[GCC ...] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import twisted
>>>
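
  For a slightly stronger check than a bare import, the installed version can be printed too (a small sketch of mine; twisted.version is a Version object in Twisted 15.x):

# print the live Twisted version
import twisted
print(twisted.version)   # e.g. [twisted, version 15.2.1]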

  Either way, Twisted is confirmed installed. Continuing with pip install scrapy, it fails yet again:

3. Install libxslt, libxml2, and xslt-config

Collecting libxlst
Could not find a version that satisfies the requirement libxlst (from versions: )
No matching distribution found for libxlst
Collecting libxml2
Could not find a version that satisfies the requirement libxml2 (from versions: )
No matching distribution found for libxml2

  pip fails here because libxslt and libxml2 are C libraries, not PyPI packages, so they have to be built from source:
wget http://xmlsoft.org/sources/libxslt-1.1.28.tar.gz
tar -zxvf libxslt-1.1.28.tar.gz
cd libxslt-1.1.28/
./configure
make
make install
wget ftp://xmlsoft.org/libxml2/libxml2-git-snapshot.tar.gz
tar -zxvf libxml2-git-snapshot.tar.gz
cd libxml2-2.9.*/
./configure
make
make install
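
  Before retrying Scrapy, a sanity check of mine: confirm the shared libraries load at all. Note that ctypes finds whichever copy the dynamic loader sees first, a detail that comes back to bite later:

# check_xml_libs.py -- try to dlopen the libraries directly
import ctypes

for lib in ("libxml2.so.2", "libxslt.so.1"):
    try:
        ctypes.CDLL(lib)
        print("%s: OK" % lib)
    except OSError as e:
        print("%s: FAILED (%s)" % (lib, e))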

  With those installed, pip install scrapy again; the lucky star had still not arrived.

4. Install cryptography

Failed building wheel for cryptography

  Download cryptography (https://pypi.python.org/packages/source/c/cryptography/cryptography-0.4.tar.gz)

  Install it:

wget https://pypi.python.org/packages/source/c/cryptography/cryptography-0.4.tar.gz
tar -zxvf cryptography-0.4.tar.gz
cd cryptography-0.4
python setup.py build
python setup.py install

  The build fails with:

No package 'libffi' found

  So download and install libffi:

wget ftp://sourceware.org/pub/libffi/libffi-3.2.1.tar.gz
tar -zxvf libffi-3.2.1.tar.gz
cd libffi-3.2.1
./configure
make
make install

  After installing it, the error persists:

Package libffi was not found in the pkg-config search path.
Perhaps you should add the directory containing `libffi.pc'
to the PKG_CONFIG_PATH environment variable
No package 'libffi' found

  So set PKG_CONFIG_PATH:

export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig:$PKG_CONFIG_PATH
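
  To confirm pkg-config can now resolve libffi, a quick sketch of mine (run it in the same shell where PKG_CONFIG_PATH was exported):

# ask pkg-config for libffi's flags; raises CalledProcessError if still not found
import subprocess
print(subprocess.check_output(["pkg-config", "--cflags", "--libs", "libffi"]))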

  Then install Scrapy again:

pip install scrapy

  Where has Lady Luck gone?

ImportError: libffi.so.6: cannot open shared object file: No such file or directory

  So:

whereis libffi
libffi: /usr/local/lib/libffi.a /usr/local/lib/libffi.la /usr/local/lib/libffi.so

  So the library is installed correctly. Some searching revealed the real cause: LD_LIBRARY_PATH was not set. So:

export LD_LIBRARY_PATH=/usr/local/lib
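
  A quick way to confirm the loader can now find libffi, before rebuilding anything (a sketch of mine; run it in the same shell where LD_LIBRARY_PATH was exported):

# dlopen libffi; raises OSError if the loader still cannot find it
import ctypes
ctypes.CDLL("libffi.so.6")
print("libffi.so.6 loads fine")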

  Then build cryptography-0.4 again:

python setup.py build
python setup.py install

  This time it installs cleanly, with no error messages.
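
  A minimal import check for the freshly built package (my own sketch):

# confirm the cryptography build is importable and report its version
import cryptography
print(cryptography.__version__)   # expect 0.4 here (pip later upgrades it to 0.9)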

5. Install Scrapy again

pip install scrapy

  Watching the output:

Building wheels for collected packages: cryptography
Running setup.py bdist_wheel for cryptography

  It paused here for a long while, and I wondered whether Lady Luck had finally arrived. After some waiting:

Requirement already satisfied (use --upgrade to upgrade): zope.interface>=3.6.0 in /usr/local/lib/python2.7/site-packages/zope.interface-4.1.x-py2.7-linux-i686.egg (from Twisted>=10.0.0->scrapy)
Collecting cryptography>=0.7 (from pyOpenSSL->scrapy)
Using cached cryptography-0.9.tar.gz
Requirement already satisfied (use --upgrade to upgrade): setuptools in /usr/local/lib/python2.7/site-packages (from zope.interface>=3.6.0->Twisted>=10.0.0->scrapy)
Requirement already satisfied (use --upgrade to upgrade): idna in /usr/local/lib/python2.7/site-packages (from cryptography>=0.7->pyOpenSSL->scrapy)
Requirement already satisfied (use --upgrade to upgrade): pyasn1 in /usr/local/lib/python2.7/site-packages (from cryptography>=0.7->pyOpenSSL->scrapy)
Requirement already satisfied (use --upgrade to upgrade): enum34 in /usr/local/lib/python2.7/site-packages (from cryptography>=0.7->pyOpenSSL->scrapy)
Requirement already satisfied (use --upgrade to upgrade): ipaddress in /usr/local/lib/python2.7/site-packages (from cryptography>=0.7->pyOpenSSL->scrapy)
Requirement already satisfied (use --upgrade to upgrade): cffi>=0.8 in /usr/local/lib/python2.7/site-packages (from cryptography>=0.7->pyOpenSSL->scrapy)
Requirement already satisfied (use --upgrade to upgrade): ordereddict in /usr/local/lib/python2.7/site-packages (from enum34->cryptography>=0.7->pyOpenSSL->scrapy)
Requirement already satisfied (use --upgrade to upgrade): pycparser in /usr/local/lib/python2.7/site-packages (from cffi>=0.8->cryptography>=0.7->pyOpenSSL->scrapy)
Building wheels for collected packages: cryptography
Running setup.py bdist_wheel for cryptography
Stored in directory: /root/.cache/pip/wheels/d7///7258f08eae0b9c930c04209959c9a0794b9729c2b64258117e
Successfully built cryptography
Installing collected packages: cryptography
Found existing installation: cryptography 0.4
Uninstalling cryptography-0.4:
Successfully uninstalled cryptography-0.4
Successfully installed cryptography-0.9

  Seeing that output, I nearly wept. Cue the award-acceptance speech: it had finally installed successfully.

6. Test Scrapy

  Create a test script:

cat > myspider.py <<EOF
from scrapy import Spider, Item, Field

class Post(Item):
    title = Field()

class BlogSpider(Spider):
    name, start_urls = 'blogspider', ['http://www.cnblogs.com/rwxwsblog/']

    def parse(self, response):
        return [Post(title=e.extract()) for e in response.css("h2 a::text")]
EOF

  Check that the script runs:

scrapy runspider myspider.py
[scrapy] INFO: Scrapy 1.0.0rc2 started (bot: scrapybot)
[scrapy] INFO: Optional features available: ssl, http11
[scrapy] INFO: Overridden settings: {}
[py.warnings] WARNING: UserWarning: You do not have a working installation of the service_identity module: 'No module named service_identity'. Please install it from <https://pypi.python.org/pypi/service_identity> and make sure all of its dependencies are satisfied. Without the service_identity module and a recent enough pyOpenSSL to support it, Twisted can perform only rudimentary TLS client hostname verification. Many valid certificate/hostname mappings may be rejected.
[scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
[scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
[scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
[scrapy] INFO: Enabled item pipelines:
[scrapy] INFO: Spider opened
[scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
[scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
[scrapy] DEBUG: Crawled (200) <GET http://www.cnblogs.com/rwxwsblog/> (referer: None)
[scrapy] INFO: Closing spider (finished)
[scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': ...,
 'downloader/request_count': ...,
 'downloader/request_method_count/GET': ...,
 'downloader/response_bytes': ...,
 'downloader/response_count': ...,
 'downloader/response_status_count/200': ...,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(...),
 'log_count/DEBUG': ...,
 'log_count/INFO': ...,
 'log_count/WARNING': ...,
 'response_received_count': ...,
 'scheduler/dequeued': ...,
 'scheduler/dequeued/memory': ...,
 'scheduler/enqueued': ...,
 'scheduler/enqueued/memory': ...,
 'start_time': datetime.datetime(...)}
[scrapy] INFO: Spider closed (finished)

  It runs normally (quietly celebrating at this point ^_^).
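
  Incidentally, the same spider can also be driven from a plain Python script instead of the scrapy command. A sketch based on the Scrapy 1.0 CrawlerProcess API, assuming myspider.py sits in the current directory:

# run_spider.py -- run BlogSpider programmatically
from scrapy.crawler import CrawlerProcess
from myspider import BlogSpider

process = CrawlerProcess()   # uses default project settings
process.crawl(BlogSpider)
process.start()              # blocks until the crawl finishes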

7. Create your own Scrapy project (note: by now I had switched to a new shell session)

scrapy startproject tutorial

  It outputs the following:

Traceback (most recent call last):
File "/usr/local/bin/scrapy", line ..., in <module>
load_entry_point('Scrapy==1.0.0rc2', 'console_scripts', 'scrapy')()
File "/usr/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line ..., in load_entry_point
return get_distribution(dist).load_entry_point(group, name)
File "/usr/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line ..., in load_entry_point
return ep.load()
File "/usr/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line ..., in load
return self.resolve()
File "/usr/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line ..., in resolve
module = __import__(self.module_name, fromlist=['__name__'], level=0)
File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/__init__.py", line ..., in <module>
from scrapy.spiders import Spider
File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/spiders/__init__.py", line ..., in <module>
from scrapy.http import Request
File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/http/__init__.py", line ..., in <module>
from scrapy.http.request.form import FormRequest
File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/http/request/form.py", line ..., in <module>
import lxml.html
File "/usr/local/lib/python2.7/site-packages/lxml/html/__init__.py", line ..., in <module>
from lxml import etree
ImportError: /usr/lib/libxml2.so.2: version `LIBXML2_2.9.0' not found (required by /usr/local/lib/python2.7/site-packages/lxml/etree.so)

  Endless frustration all over again. Why did it break now? Sorcery? Looking at it calmly: ImportError: /usr/lib/libxml2.so.2: version `LIBXML2_2.9.0' not found (required by /usr/local/lib/python2.7/site-packages/lxml/etree.so). That looked familiar. Isn't it just like the earlier ImportError: libffi.so.6: cannot open shared object file: No such file or directory? The spider had worked earlier because LD_LIBRARY_PATH was exported in that session, and this was a fresh shell. So:
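
  A quick way to confirm the diagnosis is to ask the dynamic linker which libxml2 the lxml extension actually resolves to. A small helper of mine (importing the bare lxml package works even when lxml.etree does not; the etree.so filename assumes the Python 2 build):

# show which shared libraries lxml's etree.so binds to
import os
import subprocess
import lxml

etree_so = os.path.join(lxml.__path__[0], "etree.so")
print(subprocess.check_output(["ldd", etree_so]))   # look for the libxml2 line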

8. Add the environment variable

export LD_LIBRARY_PATH=/usr/local/lib

  Run again:

scrapy startproject tutorial

  It outputs:

[root@bogon scrapy]# scrapy startproject tutorial
[scrapy] INFO: Scrapy 1.0.0rc2 started (bot: scrapybot)
[scrapy] INFO: Optional features available: ssl, http11
[scrapy] INFO: Overridden settings: {}
New Scrapy project 'tutorial' created in:
    /root/scrapy/tutorial

You can start your first spider with:
    cd tutorial
    scrapy genspider example example.com

  Finally, success. Evidently Scrapy needs the LD_LIBRARY_PATH environment variable at runtime, so it is worth adding it permanently:

vi /etc/profile

  Add the line: export LD_LIBRARY_PATH=/usr/local/lib (the earlier PKG_CONFIG_PATH setting is also worth adding here: export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig:$PKG_CONFIG_PATH)

  Note: when building from source, keep an eye on where the libraries actually land. Taking libffi as an example:

libtool: install: /usr/bin/install -c .libs/libffi.so.6.0.4 /usr/local/lib/../lib64/libffi.so.6.0.4
libtool: install: (cd /usr/local/lib/../lib64 && { ln -s -f libffi.so.6.0.4 libffi.so.6 || { rm -f libffi.so.6 && ln -s libffi.so.6.0.4 libffi.so.6; }; })
libtool: install: (cd /usr/local/lib/../lib64 && { ln -s -f libffi.so.6.0.4 libffi.so || { rm -f libffi.so && ln -s libffi.so.6.0.4 libffi.so; }; })
libtool: install: /usr/bin/install -c .libs/libffi.lai /usr/local/lib/../lib64/libffi.la
libtool: install: /usr/bin/install -c .libs/libffi.a /usr/local/lib/../lib64/libffi.a
libtool: install: chmod 644 /usr/local/lib/../lib64/libffi.a
libtool: install: ranlib /usr/local/lib/../lib64/libffi.a
libtool: finish: PATH="/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/www/wdlinux/mysql/bin:/root/bin:/sbin" ldconfig -n /usr/local/lib/../lib64
----------------------------------------------------------------------
Libraries have been installed in:
   /usr/local/lib/../lib64

If you ever happen to want to link against installed libraries
in a given directory, LIBDIR, you must either use libtool, and
specify the full pathname of the library, or use the `-LLIBDIR'
flag during linking and do at least one of the following:
   - add LIBDIR to the `LD_LIBRARY_PATH' environment variable
     during execution
   - add LIBDIR to the `LD_RUN_PATH' environment variable
     during linking
   - use the `-Wl,-rpath -Wl,LIBDIR' linker flag
   - have your system administrator add LIBDIR to `/etc/ld.so.conf'

See any operating system documentation about shared libraries for
more information, such as the ld(1) and ld.so(8) manual pages.
----------------------------------------------------------------------
/bin/mkdir -p '/usr/local/share/info'
/usr/bin/install -c -m 644 ../doc/libffi.info '/usr/local/share/info'
install-info --info-dir='/usr/local/share/info' '/usr/local/share/info/libffi.info'
/bin/mkdir -p '/usr/local/lib/pkgconfig'
/usr/bin/install -c -m 644 libffi.pc '/usr/local/lib/pkgconfig'
make[3]: Leaving directory `/root/python/libffi-3.2.1/x86_64-unknown-linux-gnu'
make[2]: Leaving directory `/root/python/libffi-3.2.1/x86_64-unknown-linux-gnu'
make[1]: Leaving directory `/root/python/libffi-3.2.1/x86_64-unknown-linux-gnu'

  This shows libffi was installed under /usr/local/lib/../lib64, i.e. /usr/local/lib64, so the LD_LIBRARY_PATH line should really be: export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib64:$LD_LIBRARY_PATH. This detail is easy to miss.
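
  A small sketch of mine to verify that both directories really made it into LD_LIBRARY_PATH in a new session:

# check_ld_path.py -- confirm the loader path covers both lib dirs
import os

entries = os.environ.get("LD_LIBRARY_PATH", "").split(":")
for d in ("/usr/local/lib", "/usr/local/lib64"):
    print("%s: %s" % (d, "OK" if d in entries else "MISSING"))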

  After saving, reload the profile and check for errors:

source /etc/profile

  Open a new session and run:

scrapy runspider myspider.py

  It runs normally, so the LD_LIBRARY_PATH setting is taking effect. With that, Scrapy is finally, properly installed.

  To check the Scrapy version, run scrapy version; mine reports "Scrapy 1.0.0rc2".
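
  The same check works from inside Python (a quick sketch; Scrapy 1.0 exposes __version__):

import scrapy
print(scrapy.__version__)   # e.g. 1.0.0rc2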

9. Thoughts beyond the code (thanks for reading this far; even I am a little dizzy)

    • Is there a better way to install all of this? Is my approach flawed? If so, please let me know. (Many dependencies could not be installed with either pip or easy_install; it felt like a problem with the package source configured for pip.)
    • Always read the official documentation. Google and Baidu results tend to be fragmentary and incomplete; the docs spare you many detours and much unnecessary work.
    • Think about a problem first and work out what it actually is before searching for it.
    • Write up what you solve. It helps you, and it helps others.

10. References

    http://scrapy.org/

    http://doc.scrapy.org/en/master/

    http://blog.csdn.net/slvher/article/details/42346887

    http://blog.csdn.net/niying/article/details/27103081

    http://www.cnblogs.com/xiaoruoen/archive/2013/02/27/2933854.html
