[Crawler] Scraping Baidu Tieba Posts with BeautifulSoup

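I came across an example online that scrapes Baidu Tieba, so I wrote my own version of it using BeautifulSoup. Here is the code (Python 2, since it relies on urllib2):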

# coding: gbk

import os
import re
import urllib2

from bs4 import BeautifulSoup

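class TiebatoTxt:

    def __init__(self, url, seeLZ):
        # URL of the thread
        self.url = url
        # Whether to fetch only the original poster's (楼主) posts: 1 = yes, 0 = no
        self.seeLZ = '?see_lz=' + str(seeLZ)
        # Number of the next floor (reply) to be written
        self.floor = 1
        self.File = None
        # File name used when the title cannot be fetched ("Baidu Tieba")
        self.defaultTitle = "百度贴吧"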

    # Get the BeautifulSoup object for one page of the thread
    def get_body(self, pageNum):
        url = self.url + self.seeLZ + '&pn=' + str(pageNum)
        req = urllib2.Request(url)
        try:
            html = urllib2.urlopen(req)
        except (urllib2.HTTPError, urllib2.URLError) as e:
            print "Failed to open the thread URL"
            return None
        try:
            bsObj = BeautifulSoup(html, "html.parser")
        except AttributeError as e:
            print "Failed to build the BeautifulSoup object"
            return None
        return bsObj

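    # Get the thread title
    def find_title(self, page):
        # Guard against a failed download or a page without <head>/<title>
        head = page.find("head") if page is not None else None
        title = head.find("title") if head is not None else None
        if title is None:
            return None
        # Treat an empty title the same as a missing one
        return title.get_text() or None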

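    # Get the total number of pages in the thread
    def get_pagenum(self, page):
        if page is None:
            return None
        # findAll returns an empty list, never None, when nothing matches
        pageinfoList = page.findAll("li", {"class": "l_reply_num"})
        for info in pageinfoList:
            span = info.findAll("span")
            # The second <span> holds the page count
            if len(span) > 1:
                return span[1].get_text().encode("gbk")
        print "pageinfoList is empty"
        return None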

    # Get the content of every floor (reply) on the page
    def get_content(self, page):
        div = page.findAll("div", {"id": re.compile("post_content_.*?")})
        contents = []
        for item in div:
            floorLine = "\n\n" + str(self.floor) + u"------------------------------------------------------\n\n"
            contents.append(floorLine)
            # "ignore" drops characters that cannot be encoded as GBK
            con = item.getText("\n", strip=True).encode("gbk", "ignore")
            self.floor = self.floor + 1
            # Some words are wrapped in links and end up on a line of their own;
            # join them back into the surrounding text
            for i in item.findAll("a"):
                word = i.getText(strip=True).encode("gbk", "ignore")
                con = con.replace(("\n%s\n" % word), word)
            contents.append(con)
        return contents

    # Open the output file, named after the thread title
    def setFileTitle(self, title):
        if title is None:
            # The title could not be fetched, so fall back to the default name
            title = self.defaultTitle
        else:
            # "/" is not allowed in a file name
            title = title.replace('/', '')
        self.File = open(os.path.join(os.getcwd(), title + ".txt"), "w+")

    # Write the content of every floor to the file
    def writetotxt(self, contents):
        for item in contents:
            self.File.write(item)

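    def start(self):
        indexPage = self.get_body(1)
        # Stop before creating the output file if the first page is unavailable
        if indexPage is None:
            print "The URL is no longer valid, please try again"
            return
        pageNum = self.get_pagenum(indexPage)
        if pageNum is None:
            print "The URL is no longer valid, please try again"
            return
        title = self.find_title(indexPage)
        self.setFileTitle(title)
        try:
            print "This thread has " + str(pageNum) + " page(s)"
            for i in range(1, int(pageNum) + 1):
                print "Writing page " + str(i)
                page = self.get_body(i)
                # Skip a page that failed to download
                if page is None:
                    continue
                contents = self.get_content(page)
                self.writetotxt(contents)
        # A write error occurred
        except IOError as e:
            print "Write error: " + str(e)
        finally:
            # Close the file so buffered data is flushed to disk
            self.File.close()
            print "Writing finished"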

if __name__ == '__main__':
    print "Enter the thread ID"
    baseURL = 'http://tieba.baidu.com/p/' + str(raw_input('http://tieba.baidu.com/p/'))
    seeLZ = raw_input("Only fetch the original poster's posts? Enter 1 for yes, 0 for no: ")
    t = TiebatoTxt(baseURL, seeLZ)
    t.start()
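
Two details of the script are easy to get wrong: the page URL format (`?see_lz=` followed by `&pn=`) and the file name built from the thread title, where only `/` is stripped. Below is a minimal sketch of two standalone helpers for these steps. The names `build_page_url` and `safe_filename`, the fallback name `tieba`, and the thread ID `123` are my own placeholders, not part of the script above, and the list of forbidden characters assumes Windows file-name rules. The code only uses constructs available in both Python 2.7 and Python 3.

```python
import re

# Characters that Windows does not allow in file names (an assumption of this sketch).
_FORBIDDEN = re.compile(r'[\\/:*?"<>|]')


def build_page_url(base_url, see_lz, page_num):
    """Build the URL of one page of a thread, in the format the script requests."""
    return "{0}?see_lz={1}&pn={2}".format(base_url, int(see_lz), int(page_num))


def safe_filename(title, default="tieba"):
    """Strip forbidden characters from a title; use `default` if nothing is left."""
    cleaned = _FORBIDDEN.sub("", title or "").strip()
    return (cleaned or default) + ".txt"


# Hypothetical thread ID 123, original poster only, page 2.
url = build_page_url("http://tieba.baidu.com/p/123", "1", 2)
name = safe_filename("a/b:c?")
```

Converting `see_lz` and `page_num` with `int()` means that input that is not a number raises a `ValueError` right away instead of producing a malformed URL.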
