A First Attempt at a Python Crawler

Using Python 3.5 to scrape the articles listed on the NetEase News ranking page, mainly with the built-in urllib.request module and lxml.

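import re
import threading
from urllib import request

import threadpool  # third-party, installed with pip
from lxml import etree

# Fallback page encoding, used when the server does not report a charset.
htmlcode = 'gbk'
# Protects the shared result list in the multithreaded version.
threadlock = threading.Lock()

testurl = "http://news.163.com/rank/"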

with request.urlopen(testurl) as f:
    print('Status:', f.status, f.reason)
    # Read the page encoding only once and assume every later page uses it.
    decode = f.headers.get_content_charset() or htmlcode
    data = f.read().decode(decode.lower())

# Each match is (category title, URL of that category's ranking page).
infos = re.findall(r'<div class="titleBar" id=".*?"><h2>(.*?)</h2><div class="more"><a href="(.*?)">.*?</a></div></div>', data, re.S)
for i in range(len(infos)):
    print('%s-%s' % (i, infos[i][0]))

print('Choose a news category')
k = input()
if k.isdigit() and int(k) < len(infos):
    newpage = request.urlopen(infos[int(k)][1]).read().decode(decode.lower())
    dom = etree.HTML(newpage)
    items = dom.xpath('//tr/td/a/text()')
    urls = dom.xpath('//tr/td/a/@href')
    assert len(items) == len(urls)
    print(len(items))
    for i in range(len(urls)):
        print(items[i])
        new = request.urlopen(urls[i]).read().decode(decode.lower())
        newdom = etree.HTML(new)
        # The article body is the text of the <p> elements inside div#endText.
        newitems = newdom.xpath("//div[@id='endText' and @class='post_text']/p/text()")
        for n in newitems:
            print(n)
        print('======================= enter y to continue')
        if input() != 'y':
            break
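The script above takes the page encoding from the Content-Type response header. A header without a charset parameter (for example a bare text/html) leaves nothing to split on, so a more defensive way to read it might look like the sketch below. The name charset_from_content_type and the gbk default are choices made up for this illustration; the only library call used is the standard library's email.message.Message, the class that urllib response headers are built on.

```python
from email.message import Message


def charset_from_content_type(content_type, default='gbk'):
    """Return the charset named in a Content-Type value, or `default`.

    Hypothetical helper for illustration; not part of the original script.
    """
    msg = Message()
    msg['Content-Type'] = content_type
    # get_content_charset() returns the lower-cased charset, or None if absent.
    return msg.get_content_charset() or default


if __name__ == '__main__':
    print(charset_from_content_type('text/html; charset=GBK'))
    print(charset_from_content_type('text/html'))
```

On a live response the same lookup is available directly as f.headers.get_content_charset().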

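Multithreaded version, using threadpool, installed directly with pip. In my test, reading the same 50 pages was somewhat faster with multiple threads than with the version above, but the result depends on the network speed at that moment, so the time is hard to measure precisely.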

def test2():
    global htmlcode  # GetNewpage reads the module-level value, so update it here
    with request.urlopen(testurl) as f:
        htmlcode = f.headers.get_content_charset() or htmlcode
        data = f.read().decode(htmlcode.lower())
    infos = re.findall(r'<div class="titleBar" id=".*?"><h2>(.*?)</h2><div class="more"><a href="(.*?)">.*?</a></div></div>', data, re.S)
    newpage = request.urlopen(infos[0][1]).read().decode(htmlcode.lower())
    dom = etree.HTML(newpage)
    items = dom.xpath('//tr/td/a/text()')
    urls = dom.xpath('//tr/td/a/@href')
    assert len(items) == len(urls)
    urlss = urls[:50]
    print(len(items))
    news = []
    # One entry per URL: ([positional args], keyword args); all workers share `news`.
    args = [([i, news], None) for i in urlss]
    pool = threadpool.ThreadPool(8)
    ress = threadpool.makeRequests(GetNewpage, args)
    for req in ress:
        pool.putRequest(req)
    print("start=====%s" % len(urlss))
    pool.wait()
    print("end==========")
    print(len(news))
    if news:
        print(news[0])
    # Print the article at the entered index; any other input exits.
    while True:
        k = input()
        if not k.isdigit() or int(k) >= len(news):
            break
        print(news[int(k)])

def GetNewpage(url, news):
    try:
        new = request.urlopen(url).read().decode(htmlcode.lower())
        newdom = etree.HTML(new)
        newitems = newdom.xpath("//div[@id='endText' and @class='post_text']/p/text()")
        newcontent = "".join(newitems)
        # Only one thread at a time may append to the shared list.
        with threadlock:
            news.append(newcontent)
    except Exception:
        print('%s------------------error' % url)
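Because network speed varies between runs, the speed-up is easier to see with a simulated download. The sketch below is an illustration only: it swaps in the standard library's concurrent.futures.ThreadPoolExecutor instead of threadpool, and fake_fetch with its fixed 0.05-second delay is a made-up stand-in for request.urlopen.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url):
    """Stand-in for a page download (hypothetical): sleep, then return text."""
    time.sleep(0.05)
    return 'content of %s' % url


def fetch_serial(urls):
    return [fake_fetch(u) for u in urls]


def fetch_threaded(urls, workers=8):
    # map() returns results in input order, so no lock or shared list is needed.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fake_fetch, urls))


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


if __name__ == '__main__':
    urls = ['http://example.invalid/%d' % i for i in range(16)]
    _, t_serial = timed(fetch_serial, urls)
    _, t_threaded = timed(fetch_threaded, urls)
    print('serial %.2fs, threaded %.2fs' % (t_serial, t_threaded))
```

With a fixed delay the comparison is repeatable, which the live test against 50 real pages is not.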

Usage examples for threadpool multithreading and for profile. The thread worker function can apparently take multiple arguments.

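threadpool usage example: the worker function takes two arguments. The format of the argument list is a bit odd; if the worker takes only one argument, you can simply pass a plain list.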

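def pooltest():
    a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 111]
    b = []
    # Two-argument worker: each entry is ([positional args], keyword args).
    args = [([i, b], None) for i in a]
    pool = threadpool.ThreadPool(5)
    # The callback receives the request object and the worker's return value.
    ress = threadpool.makeRequests(lambda x, y: y.append(x), args, lambda x, y: print(x.requestID))
    for req in ress:
        pool.putRequest(req)
    pool.wait()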
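The argument format can look odd at first. As far as I understand threadpool.makeRequests, a tuple entry is read as (positional args, keyword args), while any non-tuple entry is passed as the single argument. The sketch below imitates that rule in plain Python so it runs without threadpool installed; call_like_threadpool is a hypothetical helper written for this illustration and is not the library's code.

```python
def call_like_threadpool(func, item):
    """Call `func` the way a makeRequests entry is assumed to be unpacked.

    Hypothetical helper: a tuple means (positional args, keyword args or None);
    anything else is treated as the one and only argument.
    """
    if isinstance(item, tuple):
        args, kwargs = item
        return func(*args, **(kwargs or {}))
    return func(item)


if __name__ == '__main__':
    collected = []
    # Two-argument worker: every entry carries both arguments.
    for entry in [([i, collected], None) for i in (1, 2, 3)]:
        call_like_threadpool(lambda x, y: y.append(x), entry)
    print(collected)

    # One-argument worker: a plain list of values is enough.
    print([call_like_threadpool(lambda x: x * 2, v) for v in [1, 2, 3]])
```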

profile usage example, which shows how much time each function call takes:

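import profile
import pstats

# Run test2() under the profiler and save the raw statistics to the file 'prores'.
profile.run('test2()', 'prores')
p = pstats.Stats('prores')
# print_stats(n) limits the listing to the first n lines; 0 shows only the total time.
p.strip_dirs().sort_stats("cumulative").print_stats(0)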
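The same kind of report can also be produced without an intermediate stats file. This sketch assumes the standard library's cProfile (the C-accelerated counterpart of profile) and writes the report to a string; slow_sum and profile_to_text are made-up names, with slow_sum standing in for test2().

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Made-up workload standing in for test2()."""
    return sum(i * i for i in range(n))


def profile_to_text(func, *args, lines=5):
    """Profile one call of `func` and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time and keep only the first `lines` entries.
    stats.strip_dirs().sort_stats('cumulative').print_stats(lines)
    return result, stream.getvalue()


if __name__ == '__main__':
    value, report = profile_to_text(slow_sum, 100000)
    print(value)
    print(report)
```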

Using a cookie to access a site that requires a login:

First, copy the site's cookie out of your browser and save it to a local file or somewhere similar, then read it back whenever you need it.
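A minimal sketch of the "save it, then read it back" step, assuming the cookie was pasted into a text file as the single Cookie header string. The file name cookie_douban.txt, the sample cookie value, and the helper names are placeholders invented for this example.

```python
import os
import tempfile
from urllib import request


def load_cookie(path):
    """Read a saved Cookie header string and drop surrounding whitespace."""
    with open(path, encoding='utf-8') as fh:
        return fh.read().strip()


def build_request(url, cookie):
    """Build (but do not send) a request that carries the saved cookie."""
    return request.Request(url, headers={'Cookie': cookie})


if __name__ == '__main__':
    # Write a fake cookie to a temporary file so the example runs offline.
    path = os.path.join(tempfile.mkdtemp(), 'cookie_douban.txt')
    with open(path, 'w', encoding='utf-8') as fh:
        fh.write('sessionid=abc123; theme=dark\n')
    req = build_request('https://www.douban.com', load_cookie(path))
    print(req.get_header('Cookie'))
```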

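I tested Baidu and Douban locally. Baidu worked and returned the expected information; Douban did not. Douban has been flaky lately and is often unreachable, probably because summer has arrived and Douban is too poor to afford air conditioning.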

The site returns gzip data. This is compressed, so it cannot be decoded directly: decompress it first, then decode it.

import gzip

def cookietest():
    # cookiedouban: the Cookie header string copied from the browser, loaded beforehand.
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0',
               'Accept': '*/*',
               'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
               # Ask for gzip only, since that is the only encoding decompressed below.
               'Accept-Encoding': 'gzip',
               'Referer': 'https://www.douban.com/',
               'Cookie': cookiedouban,
               'Connection': 'keep-alive'}
    req = request.Request('https://www.douban.com', headers=headers)
    with request.urlopen(req) as f:
        print('Status:', f.status, f.reason)
        for k, v in f.getheaders():
            print('%s: %s' % (k, v))
        bs = f.read()
        # Decompress first, then decode; skip this if the server sent plain data.
        if f.headers.get('Content-Encoding') == 'gzip':
            bs = gzip.decompress(bs)
        print(bs.decode('utf-8'))
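The "decompress first, then decode" rule can be tried offline. The sketch below builds a gzip-compressed body by hand in place of a real response; decode_body is a hypothetical helper, and it handles only the gzip case plus uncompressed data, so a server that answers with br (Brotli) would need a third-party library that is not shown here.

```python
import gzip


def decode_body(raw, content_encoding=None, charset='utf-8'):
    """Return the text of a response body.

    Hypothetical helper: `raw` is the bytes from f.read(), and
    `content_encoding` is the value of the Content-Encoding header.
    """
    if content_encoding and content_encoding.lower() == 'gzip':
        raw = gzip.decompress(raw)
    return raw.decode(charset)


if __name__ == '__main__':
    page = '<html>你好</html>'
    compressed = gzip.compress(page.encode('utf-8'))
    print(decode_body(compressed, 'gzip'))
    print(decode_body(page.encode('utf-8')))
```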
