Python Crawler (1): Scraping Baidu Tieba

A simple GET request:

# python2
import urllib2

# Simplest case: fetch a page and print its body
response = urllib2.urlopen('http://www.baidu.com')
html = response.read()
print html

# Same request, this time with a custom User-Agent header
req = urllib2.Request('http://www.baidu.com')
req.add_header('User-Agent', 'Chrome')
response = urllib2.urlopen(req)
print 'response headers:'
print response.info()
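For readers on Python 3, where `urllib2` was split into `urllib.request` and `urllib.error`, the same two steps might look like the sketch below. The helper names `build_request` and `fetch` are hypothetical and not part of the original post. Only `build_request` is called here, so nothing is sent over the network unless `fetch` is invoked.

```python
# Python 3 sketch of the urllib2 example above (helper names are hypothetical)
from urllib.request import Request, urlopen


def build_request(url, user_agent='Chrome'):
    """Return a Request carrying a custom User-Agent header (nothing is sent yet)."""
    req = Request(url)
    req.add_header('User-Agent', user_agent)
    return req


def fetch(url, user_agent='Chrome', timeout=10):
    """Send the GET request and return (response headers, body as bytes)."""
    with urlopen(build_request(url, user_agent), timeout=timeout) as response:
        return response.info(), response.read()


req = build_request('http://www.baidu.com')
print(req.get_method())              # GET, because no request body was attached
print(req.get_header('User-agent'))  # urllib stores header names capitalized this way
```

Calling `fetch('http://www.baidu.com')` would then perform the actual download; in Python 3 the body comes back as `bytes` and has to be decoded before it is treated as text.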

Scraping a single Baidu Tieba thread:

# python2
# -*- coding: utf-8 -*-
import urllib2


def crawl_tieba(base_url, begin_page, end_page):
    """Save pages begin_page..end_page of a thread as 00001.html, 00002.html, ..."""
    for i in range(begin_page, end_page + 1):
        # Message: "Downloading page <i>..."
        print '正在下载第' + str(i) + '个网页...'
        url = base_url + '?pn=' + str(i)
        m = urllib2.urlopen(url).read()
        # Zero-pad the page number so the files sort in page order
        file_name = str(i).zfill(5) + '.html'
        # Write the raw bytes exactly as received
        with open(file_name, 'wb') as f:
            f.write(m)
    print 'done'


crawl_tieba('http://tieba.baidu.com/p/4999189637', 1, 10)

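WARNING:

Without the encoding declaration (the `# -*- coding: utf-8 -*-` comment on the second line of the script), Python 2 fails with "SyntaxError: Non-ASCII character '\xe6'". The script contains non-ASCII string literals, and Python 2 assumes ASCII source unless told otherwise. The declaration only takes effect on the first or second line of the file.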
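The "first or second line" rule can be checked with the standard library's `tokenize.detect_encoding`, which applies the same lookup. The sketch below runs on Python 3, whose default source encoding is UTF-8 rather than ASCII, so it declares `latin-1` purely to make the difference visible. The helper name `declared_encoding` and both sample sources are made up for illustration.

```python
# Python 3 sketch: where the coding declaration is allowed to sit
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the source encoding Python would pick for these bytes."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


# Declaration on line 2: it is honored
on_line_two = b"# python2\n# -*- coding: latin-1 -*-\nprint('x')\n"

# Declaration pushed down to line 3: it is ignored and the default applies
on_line_three = b"# python2\n# a comment\n# -*- coding: latin-1 -*-\nprint('x')\n"

print(declared_encoding(on_line_two))    # iso-8859-1 (normalized name for latin-1)
print(declared_encoding(on_line_three))  # utf-8 (the Python 3 default)
```

In Python 2 the fallback in the second case would be ASCII, which is what produces the SyntaxError once a non-ASCII character appears in the file.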

Scraping posts from Qiushibaike:

# python2
# -*- coding: utf-8 -*-
import urllib2
import re

# Captures the text of the first <span> inside each <div class="content">
PATTERN = re.compile(r'<div.*?class="content">.*?<span>(.*?)</span>', re.S)


def crawl_qiubai(base_url, begin_page, end_page):
    headers = {'User-Agent': 'chrome'}
    for i in range(begin_page, end_page + 1):
        url = base_url + str(i)
        # Message: "Downloading page <i>..."
        print '正在下载网页' + str(i) + '...'
        req = urllib2.Request(url, headers=headers)
        html = urllib2.urlopen(req).read()
        features = PATTERN.findall(html)
        file_name = 'qiubai_' + str(i).zfill(5) + '.txt'
        with open(file_name, 'w') as f:
            for index, feature in enumerate(features, 1):
                # Header: "Post <index>:"
                f.write('第' + str(index) + '条:\n')
                # Turn HTML line breaks into real newlines
                f.write(feature.replace('<br/>', '\n') + '\n\n')
        # Message: "Page <i> downloaded"
        print '网页' + str(i) + '下载完成'
    print 'done'


crawl_qiubai('http://www.qiushibaike.com/hot/page/', 1, 10)
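To see what the regular expression actually captures without hitting the network, it can be run against a small hand-written snippet. The sketch below is Python 3. `SAMPLE_HTML` is invented to match the shape the pattern expects and is not real Qiushibaike markup, and `extract_posts` is a hypothetical helper that also strips surrounding whitespace, which the original script does not do.

```python
# Python 3 sketch: exercising the extraction pattern on made-up HTML
import re

CONTENT_PATTERN = re.compile(
    r'<div.*?class="content">.*?<span>(.*?)</span>', re.S)

# Hypothetical markup, shaped like what the pattern is looking for
SAMPLE_HTML = '''
<div class="article">
  <div class="content">
    <span>first line<br/>second line</span>
  </div>
</div>
<div class="article">
  <div class="content">
    <span>another post</span>
  </div>
</div>
'''


def extract_posts(html):
    """Return the text of each post, with <br/> turned into newlines."""
    return [text.replace('<br/>', '\n').strip()
            for text in CONTENT_PATTERN.findall(html)]


posts = extract_posts(SAMPLE_HTML)
for index, post in enumerate(posts, 1):
    print('Post {}:\n{}\n'.format(index, post))
```

The `re.S` flag matters here: it lets `.` match newlines, so the lazy `.*?` can span the line breaks between the `<div>` and the `<span>`. A regex like this is tied to the page layout and stops matching as soon as the markup changes, so an HTML parser is usually the sturdier choice for anything long-lived.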

