A friend recently asked me to scrape post data from a forum. I searched for open-source crawlers, read a number of comparisons, and Scrapy came out looking like the best choice. But I had always worked in Java and PHP and didn't know Python, so I spent a day skimming the basics of Python and then got started. I never expected that setting up a working environment alone would cost me another full day. Here is a record of the pitfalls I hit while installing and configuring Scrapy.

  Environment: CentOS 6.0 virtual machine

  The first step was a Python runtime. I ran the python command and found one already installed, and rejoiced (prematurely, as it turned out). Google suggested a direct pip install Scrapy, which failed: pip was missing. I installed pip and ran pip install Scrapy again, and now python-devel was missing. The whole morning went back and forth like this. Eventually I downloaded the Scrapy source to install it, and it suddenly demanded Python 2.7. A check with python --version showed 2.6 staring back at me; my heart sank.

  The official documentation (http://doc.scrapy.org/en/master/intro/install.html) confirms it: Python 2.7 is required. No way around it, so the first job is upgrading Python.

1. Upgrade Python

  • Download and install Python 2.7
wget https://www.python.org/ftp/python/2.7.10/Python-2.7.10.tgz
tar -zxvf Python-2.7.10.tgz
cd Python-2.7.10
./configure
make all
make install
make clean
make distclean
  • Check the Python version
python --version

  It still reports 2.6, because /usr/bin/python still points at the old interpreter.

  • Repoint the python command
mv /usr/bin/python /usr/bin/python2.6_bak
ln -s /usr/local/bin/python2.7 /usr/bin/python
  • Check the version again
# python --version
Python 2.7.10
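
  As an extra check that the plain python command really runs the new interpreter, here is a small sketch of mine (not part of the original steps):

# check_python.py -- confirm which interpreter `python` launches
import sys

print(sys.executable)                      # expect /usr/bin/python (now a symlink to 2.7)
print("%d.%d.%d" % sys.version_info[:3])   # expect 2.7.10
assert sys.version_info[:2] == (2, 7), "still running the old interpreter"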

  With that, the Python upgrade is done. Back to installing Scrapy with pip install scrapy, which errors again:

-bash: pip: command not found
  • Install pip
wget https://bootstrap.pypa.io/get-pip.py
python get-pip.py
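
  pip should now be on the PATH. To confirm it from inside Python as well, a quick sketch of mine:

# confirm pip is importable and report its version
import pip
print(pip.__version__)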

  Then pip install scrapy once more, and another error:

Collecting Twisted>=10.0.0 (from scrapy)
Could not find a version that satisfies the requirement Twisted>=10.0.0 (from scrapy) (from versions: )
No matching distribution found for Twisted>=10.0.0 (from scrapy)

  Twisted is missing, so install Twisted.

2. Install Twisted

  • Download Twisted (https://pypi.python.org/packages/source/T/Twisted/Twisted-15.2.1.tar.bz2#md5=4be066a899c714e18af1ecfcb01cfef7)
  • Install it
wget https://pypi.python.org/packages/source/T/Twisted/Twisted-15.2.1.tar.bz2
tar -xjvf Twisted-15.2.1.tar.bz2
cd Twisted-15.2.1
python setup.py install
  • Verify the installation
python
Python 2.7.10 (default, ...)
[GCC ...] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import twisted
>>>
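
  For a slightly stronger check than a bare import, the installed version can be printed too (a small sketch of mine; twisted.version is a Version object in Twisted 15.x):

# print the live Twisted version
import twisted
print(twisted.version)   # e.g. [twisted, version 15.2.1]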

  Either way, Twisted is confirmed installed. Continuing with pip install scrapy, it fails yet again:

3. Install libxslt, libxml2, and xslt-config

Collecting libxlst
Could not find a version that satisfies the requirement libxlst (from versions: )
No matching distribution found for libxlst
Collecting libxml2
Could not find a version that satisfies the requirement libxml2 (from versions: )
No matching distribution found for libxml2

  pip fails here because libxslt and libxml2 are C libraries, not PyPI packages, so they have to be built from source:
wget http://xmlsoft.org/sources/libxslt-1.1.28.tar.gz
tar -zxvf libxslt-1.1.28.tar.gz
cd libxslt-1.1.28/
./configure
make
make install
wget ftp://xmlsoft.org/libxml2/libxml2-git-snapshot.tar.gz
tar -zxvf libxml2-git-snapshot.tar.gz
cd libxml2-2.9.*/
./configure
make
make install
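
  Before retrying Scrapy, a sanity check of mine: confirm the shared libraries load at all. Note that ctypes finds whichever copy the dynamic loader sees first, a detail that comes back to bite later:

# check_xml_libs.py -- try to dlopen the libraries directly
import ctypes

for lib in ("libxml2.so.2", "libxslt.so.1"):
    try:
        ctypes.CDLL(lib)
        print("%s: OK" % lib)
    except OSError as e:
        print("%s: FAILED (%s)" % (lib, e))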

  With those installed, pip install scrapy again; the lucky star had still not arrived.

4. Install cryptography

Failed building wheel for cryptography

  Download cryptography (https://pypi.python.org/packages/source/c/cryptography/cryptography-0.4.tar.gz)

  Install it:

wget https://pypi.python.org/packages/source/c/cryptography/cryptography-0.4.tar.gz
tar -zxvf cryptography-0.4.tar.gz
cd cryptography-0.4
python setup.py build
python setup.py install

  The build fails with:

No package 'libffi' found

  So download and install libffi:

wget ftp://sourceware.org/pub/libffi/libffi-3.2.1.tar.gz
tar -zxvf libffi-3.2.1.tar.gz
cd libffi-3.2.1
./configure
make
make install

  After installing it, the error persists:

Package libffi was not found in the pkg-config search path.
Perhaps you should add the directory containing `libffi.pc'
to the PKG_CONFIG_PATH environment variable
No package 'libffi' found

  So set PKG_CONFIG_PATH:

export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig:$PKG_CONFIG_PATH
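
  To confirm pkg-config can now resolve libffi, a quick sketch of mine (run it in the same shell where PKG_CONFIG_PATH was exported):

# ask pkg-config for libffi's flags; raises CalledProcessError if still not found
import subprocess
print(subprocess.check_output(["pkg-config", "--cflags", "--libs", "libffi"]))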

  Then install Scrapy again:

pip install scrapy

  Where has Lady Luck gone?

ImportError: libffi.so.6: cannot open shared object file: No such file or directory

  So:

whereis libffi
libffi: /usr/local/lib/libffi.a /usr/local/lib/libffi.la /usr/local/lib/libffi.so

  So the library is installed correctly. Some searching revealed the real cause: LD_LIBRARY_PATH was not set. So:

export LD_LIBRARY_PATH=/usr/local/lib
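
  A quick way to confirm the loader can now find libffi, before rebuilding anything (a sketch of mine; run it in the same shell where LD_LIBRARY_PATH was exported):

# dlopen libffi; raises OSError if the loader still cannot find it
import ctypes
ctypes.CDLL("libffi.so.6")
print("libffi.so.6 loads fine")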

  Then build cryptography-0.4 again:

python setup.py build
python setup.py install

  This time it installs cleanly, with no error messages.
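
  A minimal import check for the freshly built package (my own sketch):

# confirm the cryptography build is importable and report its version
import cryptography
print(cryptography.__version__)   # expect 0.4 here (pip later upgrades it to 0.9)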

5. Install Scrapy again

pip install scrapy

  Watching the output:

Building wheels for collected packages: cryptography
Running setup.py bdist_wheel for cryptography

  It paused here for a long while, and I wondered whether Lady Luck had finally arrived. After some waiting:

Requirement already satisfied (use --upgrade to upgrade): zope.interface>=3.6.0 in /usr/local/lib/python2.7/site-packages/zope.interface-4.1.x-py2.7-linux-i686.egg (from Twisted>=10.0.0->scrapy)
Collecting cryptography>=0.7 (from pyOpenSSL->scrapy)
Using cached cryptography-0.9.tar.gz
Requirement already satisfied (use --upgrade to upgrade): setuptools in /usr/local/lib/python2.7/site-packages (from zope.interface>=3.6.0->Twisted>=10.0.0->scrapy)
Requirement already satisfied (use --upgrade to upgrade): idna in /usr/local/lib/python2.7/site-packages (from cryptography>=0.7->pyOpenSSL->scrapy)
Requirement already satisfied (use --upgrade to upgrade): pyasn1 in /usr/local/lib/python2.7/site-packages (from cryptography>=0.7->pyOpenSSL->scrapy)
Requirement already satisfied (use --upgrade to upgrade): enum34 in /usr/local/lib/python2.7/site-packages (from cryptography>=0.7->pyOpenSSL->scrapy)
Requirement already satisfied (use --upgrade to upgrade): ipaddress in /usr/local/lib/python2.7/site-packages (from cryptography>=0.7->pyOpenSSL->scrapy)
Requirement already satisfied (use --upgrade to upgrade): cffi>=0.8 in /usr/local/lib/python2.7/site-packages (from cryptography>=0.7->pyOpenSSL->scrapy)
Requirement already satisfied (use --upgrade to upgrade): ordereddict in /usr/local/lib/python2.7/site-packages (from enum34->cryptography>=0.7->pyOpenSSL->scrapy)
Requirement already satisfied (use --upgrade to upgrade): pycparser in /usr/local/lib/python2.7/site-packages (from cffi>=0.8->cryptography>=0.7->pyOpenSSL->scrapy)
Building wheels for collected packages: cryptography
Running setup.py bdist_wheel for cryptography
Stored in directory: /root/.cache/pip/wheels/d7///7258f08eae0b9c930c04209959c9a0794b9729c2b64258117e
Successfully built cryptography
Installing collected packages: cryptography
Found existing installation: cryptography 0.4
Uninstalling cryptography-0.4:
Successfully uninstalled cryptography-0.4
Successfully installed cryptography-0.9

  Seeing that output, I nearly wept. Cue the award-acceptance speech: it had finally installed successfully.

6. Test Scrapy

  Create a test script:

cat > myspider.py <<EOF
from scrapy import Spider, Item, Field

class Post(Item):
    title = Field()

class BlogSpider(Spider):
    name, start_urls = 'blogspider', ['http://www.cnblogs.com/rwxwsblog/']

    def parse(self, response):
        return [Post(title=e.extract()) for e in response.css("h2 a::text")]
EOF

  Check that the script runs:

scrapy runspider myspider.py
[scrapy] INFO: Scrapy 1.0.0rc2 started (bot: scrapybot)
[scrapy] INFO: Optional features available: ssl, http11
[scrapy] INFO: Overridden settings: {}
[py.warnings] WARNING: UserWarning: You do not have a working installation of the service_identity module: 'No module named service_identity'. Please install it from <https://pypi.python.org/pypi/service_identity> and make sure all of its dependencies are satisfied. Without the service_identity module and a recent enough pyOpenSSL to support it, Twisted can perform only rudimentary TLS client hostname verification. Many valid certificate/hostname mappings may be rejected.
[scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
[scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
[scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
[scrapy] INFO: Enabled item pipelines:
[scrapy] INFO: Spider opened
[scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
[scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
[scrapy] DEBUG: Crawled (200) <GET http://www.cnblogs.com/rwxwsblog/> (referer: None)
[scrapy] INFO: Closing spider (finished)
[scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': ...,
 'downloader/request_count': ...,
 'downloader/request_method_count/GET': ...,
 'downloader/response_bytes': ...,
 'downloader/response_count': ...,
 'downloader/response_status_count/200': ...,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(...),
 'log_count/DEBUG': ...,
 'log_count/INFO': ...,
 'log_count/WARNING': ...,
 'response_received_count': ...,
 'scheduler/dequeued': ...,
 'scheduler/dequeued/memory': ...,
 'scheduler/enqueued': ...,
 'scheduler/enqueued/memory': ...,
 'start_time': datetime.datetime(...)}
[scrapy] INFO: Spider closed (finished)

  It runs normally (quietly celebrating at this point ^_^).
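
  Incidentally, the same spider can also be driven from a plain Python script instead of the scrapy command. A sketch based on the Scrapy 1.0 CrawlerProcess API, assuming myspider.py sits in the current directory:

# run_spider.py -- run BlogSpider programmatically
from scrapy.crawler import CrawlerProcess
from myspider import BlogSpider

process = CrawlerProcess()   # uses default project settings
process.crawl(BlogSpider)
process.start()              # blocks until the crawl finishes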

7. Create your own Scrapy project (note: by now I had switched to a new shell session)

scrapy startproject tutorial

  It outputs the following:

Traceback (most recent call last):
File "/usr/local/bin/scrapy", line ..., in <module>
load_entry_point('Scrapy==1.0.0rc2', 'console_scripts', 'scrapy')()
File "/usr/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line ..., in load_entry_point
return get_distribution(dist).load_entry_point(group, name)
File "/usr/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line ..., in load_entry_point
return ep.load()
File "/usr/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line ..., in load
return self.resolve()
File "/usr/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line ..., in resolve
module = __import__(self.module_name, fromlist=['__name__'], level=0)
File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/__init__.py", line ..., in <module>
from scrapy.spiders import Spider
File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/spiders/__init__.py", line ..., in <module>
from scrapy.http import Request
File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/http/__init__.py", line ..., in <module>
from scrapy.http.request.form import FormRequest
File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/http/request/form.py", line ..., in <module>
import lxml.html
File "/usr/local/lib/python2.7/site-packages/lxml/html/__init__.py", line ..., in <module>
from lxml import etree
ImportError: /usr/lib/libxml2.so.2: version `LIBXML2_2.9.0' not found (required by /usr/local/lib/python2.7/site-packages/lxml/etree.so)

  Endless frustration all over again. Why did it break now? Sorcery? Looking at it calmly: ImportError: /usr/lib/libxml2.so.2: version `LIBXML2_2.9.0' not found (required by /usr/local/lib/python2.7/site-packages/lxml/etree.so). That looked familiar. Isn't it just like the earlier ImportError: libffi.so.6: cannot open shared object file: No such file or directory? The spider had worked earlier because LD_LIBRARY_PATH was exported in that session, and this was a fresh shell. So:
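
  A quick way to confirm the diagnosis is to ask the dynamic linker which libxml2 the lxml extension actually resolves to. A small helper of mine (importing the bare lxml package works even when lxml.etree does not; the etree.so filename assumes the Python 2 build):

# show which shared libraries lxml's etree.so binds to
import os
import subprocess
import lxml

etree_so = os.path.join(lxml.__path__[0], "etree.so")
print(subprocess.check_output(["ldd", etree_so]))   # look for the libxml2 line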

8. Add the environment variable

export LD_LIBRARY_PATH=/usr/local/lib

  Run again:

scrapy startproject tutorial

  It outputs:

[root@bogon scrapy]# scrapy startproject tutorial
[scrapy] INFO: Scrapy 1.0.0rc2 started (bot: scrapybot)
[scrapy] INFO: Optional features available: ssl, http11
[scrapy] INFO: Overridden settings: {}
New Scrapy project 'tutorial' created in:
    /root/scrapy/tutorial

You can start your first spider with:
    cd tutorial
    scrapy genspider example example.com

  Finally, success. Evidently Scrapy needs the LD_LIBRARY_PATH environment variable at runtime, so it is worth adding it permanently:

vi /etc/profile

  Add the line: export LD_LIBRARY_PATH=/usr/local/lib (the earlier PKG_CONFIG_PATH setting is also worth adding here: export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig:$PKG_CONFIG_PATH)

  Note: when building from source, keep an eye on where the libraries actually land. Taking libffi as an example:

libtool: install: /usr/bin/install -c .libs/libffi.so.6.0.4 /usr/local/lib/../lib64/libffi.so.6.0.4
libtool: install: (cd /usr/local/lib/../lib64 && { ln -s -f libffi.so.6.0.4 libffi.so.6 || { rm -f libffi.so.6 && ln -s libffi.so.6.0.4 libffi.so.6; }; })
libtool: install: (cd /usr/local/lib/../lib64 && { ln -s -f libffi.so.6.0.4 libffi.so || { rm -f libffi.so && ln -s libffi.so.6.0.4 libffi.so; }; })
libtool: install: /usr/bin/install -c .libs/libffi.lai /usr/local/lib/../lib64/libffi.la
libtool: install: /usr/bin/install -c .libs/libffi.a /usr/local/lib/../lib64/libffi.a
libtool: install: chmod 644 /usr/local/lib/../lib64/libffi.a
libtool: install: ranlib /usr/local/lib/../lib64/libffi.a
libtool: finish: PATH="/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/www/wdlinux/mysql/bin:/root/bin:/sbin" ldconfig -n /usr/local/lib/../lib64
----------------------------------------------------------------------
Libraries have been installed in:
   /usr/local/lib/../lib64

If you ever happen to want to link against installed libraries
in a given directory, LIBDIR, you must either use libtool, and
specify the full pathname of the library, or use the `-LLIBDIR'
flag during linking and do at least one of the following:
   - add LIBDIR to the `LD_LIBRARY_PATH' environment variable
     during execution
   - add LIBDIR to the `LD_RUN_PATH' environment variable
     during linking
   - use the `-Wl,-rpath -Wl,LIBDIR' linker flag
   - have your system administrator add LIBDIR to `/etc/ld.so.conf'

See any operating system documentation about shared libraries for
more information, such as the ld(1) and ld.so(8) manual pages.
----------------------------------------------------------------------
/bin/mkdir -p '/usr/local/share/info'
/usr/bin/install -c -m 644 ../doc/libffi.info '/usr/local/share/info'
install-info --info-dir='/usr/local/share/info' '/usr/local/share/info/libffi.info'
/bin/mkdir -p '/usr/local/lib/pkgconfig'
/usr/bin/install -c -m 644 libffi.pc '/usr/local/lib/pkgconfig'
make[3]: Leaving directory `/root/python/libffi-3.2.1/x86_64-unknown-linux-gnu'
make[2]: Leaving directory `/root/python/libffi-3.2.1/x86_64-unknown-linux-gnu'
make[1]: Leaving directory `/root/python/libffi-3.2.1/x86_64-unknown-linux-gnu'

  This shows libffi was installed under /usr/local/lib/../lib64, i.e. /usr/local/lib64, so the LD_LIBRARY_PATH line should really be: export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib64:$LD_LIBRARY_PATH. This detail is easy to miss.
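
  A small sketch of mine to verify that both directories really made it into LD_LIBRARY_PATH in a new session:

# check_ld_path.py -- confirm the loader path covers both lib dirs
import os

entries = os.environ.get("LD_LIBRARY_PATH", "").split(":")
for d in ("/usr/local/lib", "/usr/local/lib64"):
    print("%s: %s" % (d, "OK" if d in entries else "MISSING"))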

  After saving, reload the profile and check for errors:

source /etc/profile

  Open a new session and run:

scrapy runspider myspider.py

  It runs normally, so the LD_LIBRARY_PATH setting is taking effect. With that, Scrapy is finally, properly installed.

  To check the Scrapy version, run scrapy version; mine reports "Scrapy 1.0.0rc2".
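
  The same check works from inside Python (a quick sketch; Scrapy 1.0 exposes __version__):

import scrapy
print(scrapy.__version__)   # e.g. 1.0.0rc2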

9. Thoughts beyond the code (thanks for reading this far; even I am a little dizzy)

    • Is there a better way to install all of this? Is my approach flawed? If so, please let me know. (Many dependencies could not be installed with either pip or easy_install; it felt like a problem with the package source configured for pip.)
    • Always read the official documentation. Google and Baidu results tend to be fragmentary and incomplete; the docs spare you many detours and much unnecessary work.
    • Think about a problem first and work out what it actually is before searching for it.
    • Write up what you solve. It helps you, and it helps others.

10. References

    http://scrapy.org/

    http://doc.scrapy.org/en/master/

    http://blog.csdn.net/slvher/article/details/42346887

    http://blog.csdn.net/niying/article/details/27103081

    http://www.cnblogs.com/xiaoruoen/archive/2013/02/27/2933854.html
