Scraping dynamic data such as like counts from Baidu Baike with Python

Use Selenium to simulate a browser opening the page, then scrape the data once the page has loaded.
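
Counters such as likes and views are typically filled in by page scripts, so they may still be empty right after the page itself has loaded. The following is a minimal sketch of polling until a value appears. The names `wait_for` and `FakeCounter` are hypothetical and not part of the original script, and the fake counter only stands in for reading the text of an element such as the one with class `vote-count`. With Selenium itself, `WebDriverWait` from `selenium.webdriver.support.ui` serves the same purpose.

```python
import time


def wait_for(fetch, timeout=10.0, interval=0.5):
    """Call fetch() until it returns non-empty text or the timeout expires.

    fetch is any zero-argument callable. Exceptions it raises are treated
    as "not ready yet". Returns the stripped text, or None on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            text = fetch()
        except Exception:
            text = None
        if text and text.strip():
            return text.strip()
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)


class FakeCounter:
    """Hypothetical stand-in for an element whose text is empty at first."""

    def __init__(self, ready_after, value):
        self.calls = 0
        self.ready_after = ready_after
        self.value = value

    def read(self):
        self.calls += 1
        return self.value if self.calls >= self.ready_after else ""


counter = FakeCounter(ready_after=3, value="1024")
print(wait_for(counter.read, timeout=2.0, interval=0.01))
```

In a real crawl, `fetch` would be a small function that looks up the element in the browser and returns its text. A short timeout keeps one slow page from stalling the whole queue.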

#!/usr/bin/env python3
# coding=utf-8
import re
from urllib.request import urlopen

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By


class BaikeSpider(object):
    def __init__(self):
        # Seed entries; newly discovered entry links are appended to this queue.
        self.queue = ["http://baike.baidu.com/view/8095.htm",
                      "http://baike.baidu.com/view/2227.htm"]
        self.base = "http://baike.baidu.com"
        self.crawled = set()       # URLs already visited
        self.crawled_word = set()  # entry titles already written to the file
        # client = MongoClient("localhost", 27017)
        # self.db = client["baike_db"]["html"]

    def crawl(self):
        browser = webdriver.Chrome()
        cnt = 0
        fw = open('./baike_keywords.txt', 'w', encoding='utf-8')
        try:
            while self.queue:
                url = self.queue.pop(0)
                if url in self.crawled:
                    continue
                self.crawled.add(url)
                try:
                    # Load the page in the browser so its scripts can fill in the counters.
                    browser.get(url)
                    res = {}
                    # Collect outgoing links from the static HTML and queue entry pages.
                    soup = BeautifulSoup(urlopen(url).read(), 'lxml')
                    hrefs = set(a['href'] for a in soup.find_all("a", href=True))
                    for href in hrefs:
                        if re.search(r"javascript", href) or len(href) < 8:
                            continue
                        if re.search(r"baike\.baidu\.com/view/\d+|baike\.baidu\.com/subview/\d+/\d+\.htm", href):
                            if href not in self.crawled:
                                self.queue.append(href)
                        elif re.match(r"/?view/\d+", href):
                            # Relative link: prepend the site root.
                            href = self.base + "/" + href.lstrip("/")
                            if href not in self.crawled:
                                self.queue.append(href)
                    cnt += 1
                    print(cnt)
                    if cnt % 10 == 0:
                        print('queue', len(self.queue))
                        fw.flush()  # persist progress every 10 pages
                    # Record the page URL itself, not the last link seen on it.
                    res['url'] = url
                    res['title'] = browser.title.split(u"_")[0]
                    if res['title'] in self.crawled_word:
                        print('title', res['title'], 'has crawled')
                        continue
                    vote = browser.find_element(By.CLASS_NAME, "vote-count")
                    view = browser.find_element(By.ID, "j-lemmaStatistics-pv")
                    res['voted'] = vote.text
                    res['viewed'] = view.text
                    # One tab-separated line per entry: title, views, votes, URL.
                    line = '\t'.join([res['title'], res['viewed'], res['voted'], res['url']])
                    fw.write(line + '\n')
                    self.crawled_word.add(res["title"])
                except Exception as e:
                    print(e)
                    continue
        finally:
            fw.close()
            browser.quit()


if __name__ == '__main__':
    test = BaikeSpider()
    test.crawl()
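
The link handling inside `crawl()` can also be isolated into a small function that is easy to check without a browser. This is only a sketch: `normalize_entry_link`, `BASE`, and `ENTRY_PATH` are hypothetical names that do not appear in the script, and it assumes relative links may show up with or without a leading slash. It resolves each `href` against the site root with `urljoin` and keeps only `/view/<id>` and `/subview/<id>/<id>.htm` pages.

```python
import re
from urllib.parse import urljoin

BASE = "http://baike.baidu.com"
# Entry pages look like /view/<id> or /subview/<id>/<id>.htm
ENTRY_PATH = re.compile(
    r"^https?://baike\.baidu\.com/(view/\d+|subview/\d+/\d+\.htm)"
)


def normalize_entry_link(href, base=BASE):
    """Return an absolute entry URL for href, or None if it is not an entry link."""
    if not href or href.lower().startswith("javascript"):
        return None
    # urljoin handles absolute URLs, "/view/1.htm" and "view/1.htm" alike.
    absolute = urljoin(base + "/", href)
    return absolute if ENTRY_PATH.match(absolute) else None


for href in ["/view/8095.htm", "view/2227.htm",
             "javascript:void(0)", "http://www.baidu.com/"]:
    print(href, "->", normalize_entry_link(href))
```

Resolving every link to one absolute form before comparing it with the set of crawled URLs also keeps the same page from being queued twice under different spellings.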

Also, loading pages with Chrome was faster than with Firefox, with fewer errors and abnormal exits.
