Crawl(1)

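Crawling a novel from Baidu Tieba.

Crawl the first 10 pages of the original poster's posts in the thread linked below and save them to a text file.

Python 2.7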

# -*- coding: UTF-8 -*-

import urllib2
import re

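class BDTB:
    # see_lz=1 restricts the thread to the original poster's posts; pn is the page number.
    baseUrl = 'http://tieba.baidu.com/p/4896490947?see_lz=1&pn='

    def getPage(self, pageNum):
        # Fetch one page of the thread; return '' if the request fails
        # so that callers can still run their regexes safely.
        try:
            url = self.baseUrl + str(pageNum)
            request = urllib2.Request(url)
            response = urllib2.urlopen(request).read()
            return response
        except Exception, e:
            print e
            return ''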

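    def Title(self, pageNum):
        # Treat a failed fetch as an empty page.
        html = self.getPage(pageNum) or ''
        # '【原创】' ("[Original]") is the prefix of the thread title.
        reg = re.compile(r'title="【原创】(.*?)"')
        items = re.findall(reg, html)
        if items:
            # Write only the first match: reopening the file in 'w' mode
            # for every match would just overwrite it each time.
            # 'w' truncates text.txt, so call this before any Text() call.
            f = open('text.txt', 'w')
            f.write('标题' + '\t' + items[0])  # '标题' means "Title"
            f.close()
        return items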

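    def Text(self, pageNum):
        # Treat a failed fetch as an empty page.
        html = self.getPage(pageNum) or ''
        # Each post body follows this class attribute and its leading padding.
        reg = re.compile(r'd_post_content j_d_post_content ">            (.*?)</div><br>', re.S)
        req = re.findall(reg, html)
        if pageNum == 1:
            # Skip the first two posts on page 1.
            req = req[2:]
        # Compile the clean-up patterns once instead of once per post.
        removeLink = re.compile('<a.*?>|</a>')
        removeImg = re.compile('<img.*?>')
        # Escape the dot so that only a literal ".html" ends the match.
        removeUrl = re.compile(r'http.*?\.html')
        f = open('text.txt', 'a')
        for i in req:
            i = re.sub(removeLink, "", i)
            i = re.sub(removeImg, "", i)
            i = re.sub(removeUrl, "", i)
            i = i.replace('<br>', '')
            f.write('\n\n' + i)
        f.close()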

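bdtb = BDTB()
print 'Crawl is starting....'
try:
    # Write the title once. Title() opens text.txt in 'w' mode, so calling it
    # for every page would wipe the text collected from earlier pages.
    bdtb.Title(1)
    # range(1, 11) covers pages 1 through 10.
    for i in range(1, 11):
        print 'Crawling Page %s...' % (i)
        bdtb.Text(i)
except Exception, e:
    print e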
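
The clean-up step inside Text() can be tried offline, without hitting Tieba. Below is a minimal sketch under a few assumptions: the function name clean_post and the sample HTML fragment are made up for illustration, the code is written for Python 3 rather than the 2.7 used above, and unlike the script it turns <br> into a newline instead of dropping it, which may read better for novel text.

```python
# -*- coding: utf-8 -*-
import re

# Patterns modeled on the ones in Text(); compiled once at import time.
TAG_RE = re.compile(r'<a.*?>|</a>|<img.*?>', re.S)   # anchor and image tags
LINK_RE = re.compile(r'http\S*?\.html')              # bare links ending in .html
BR_RE = re.compile(r'<br\s*/?>')                     # <br>, <br/>, <br />


def clean_post(fragment):
    """Strip anchors, images, bare .html links and <br> tags from one post body.

    Hypothetical helper for illustration; not part of the original script.
    """
    text = TAG_RE.sub('', fragment)
    text = LINK_RE.sub('', text)
    text = BR_RE.sub('\n', text)   # deviation from the script, which uses ''
    return text.strip()


# Made-up fragment that imitates the shape of a post body.
sample = ('            Chapter 1<br>It was a dark night.'
          '<img src="http://example.com/a.jpg"> '
          '<a href="http://example.com/x.html">http://example.com/x.html</a>')
cleaned = clean_post(sample)
print(cleaned)
```

Keeping the cleaning in a pure function like this makes it easy to check the regexes against a saved page before running the full crawl.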
