Scraping CSDN Blogs with a Python Crawler

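Last night I wanted to download and save every post by a certain CSDN expert, so I wrote a crawler that fetches the articles automatically and saves them to a txt file. It could just as easily save them as HTML pages.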

That way there is no need for Ctrl+C and Ctrl+V, which is very convenient. Scraping other sites works in much the same way.

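To parse the fetched pages I used a third-party module, BeautifulSoup, which is very handy for parsing HTML files. You can of course parse the pages yourself with regular expressions, but that is more troublesome.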
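To make the comparison concrete, here is a minimal sketch that pulls the title and the article text out of a small, made-up HTML string in two ways: with a regular expression and with an HTML parser. It is written for Python 3 and uses the standard library's `html.parser.HTMLParser` as a stand-in for BeautifulSoup so that it runs without extra packages. The sample markup and the `ArticleTextParser` class are assumptions for illustration, not CSDN's real page structure.

```python
import re
from html.parser import HTMLParser

# Made-up page used only for this illustration.
SAMPLE_HTML = (
    '<html><head><title>Demo post</title></head><body>'
    '<div id="article_content" class="article_content">'
    '<p>First paragraph.</p><p>Second <b>bold</b> one.</p>'
    '</div><div id="footer">ignored</div></body></html>'
)


def title_by_regex(html):
    """Regex approach: fine for one simple tag, fragile for nested markup."""
    match = re.search(r'<title>(.*?)</title>', html, re.DOTALL)
    return match.group(1) if match else 'NO TITLE'


class ArticleTextParser(HTMLParser):
    """Collects the text found inside the div whose id equals target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # greater than 0 while inside the target div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != 'div':
            return
        if self.depth:
            self.depth += 1     # a nested div inside the target div
        elif dict(attrs).get('id') == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and not data.isspace():
            self.chunks.append(data)


def article_text(html, target_id='article_content'):
    """Parser approach: tracks tag nesting instead of pattern matching."""
    parser = ArticleTextParser(target_id)
    parser.feed(html)
    parser.close()
    return ''.join(parser.chunks)


if __name__ == '__main__':
    print(title_by_regex(SAMPLE_HTML))
    print(article_text(SAMPLE_HTML))
```

The parser version keeps track of nesting, so text inside inline tags such as `<b>` is kept and the footer div is skipped, which is awkward to get right with a single regular expression.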

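Because the CSDN site's robots.txt disallows all crawlers, the crawler has to pose as a browser. It also must not fetch pages too often: it should sleep for a while between requests, because crawling too frequently gets your IP banned, although proxy IPs can be used to get around that.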
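The points above (what robots.txt says, sending a browser-like User-Agent, and pausing between requests) can be sketched as follows. This is a Python 3 sketch that makes no network calls. The robots.txt text is a made-up example of a site that disallows everything, and `allowed_by_robots`, `build_request` and `polite_delay` are hypothetical helper names, not part of the script in this post.

```python
import random
import urllib.request
import urllib.robotparser

# Made-up robots.txt that forbids every crawler, as the post describes.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

# Two of the browser User-Agent strings used in the post's script.
USER_AGENTS = [
    'Opera/9.25 (Windows NT 5.1; U; en)',
    'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0',
]


def allowed_by_robots(robots_text, user_agent, url):
    """Return True if robots_text permits user_agent to fetch url."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)


def build_request(url):
    """Build a request that carries a randomly chosen browser User-Agent."""
    return urllib.request.Request(url, headers={
        'User-Agent': random.choice(USER_AGENTS),
        'Accept': '*/*',
    })


def polite_delay(base=10.0, jitter=5.0):
    """Seconds to wait before the next fetch: base plus random jitter."""
    return base + random.uniform(0, jitter)


if __name__ == '__main__':
    url = 'http://blog.csdn.net/mangoer_ys/article/details/38427979'
    print(allowed_by_robots(ROBOTS_TXT, 'MyCrawler', url))
    request = build_request(url)
    print(request.get_header('User-agent'))
    # A real crawl loop would call time.sleep(polite_delay()) between fetches.
    print(round(polite_delay(), 1))
```

Adding random jitter to the delay is a design choice of this sketch; the script in this post simply sleeps for a fixed 10 seconds.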

# -*- coding: utf-8 -*-
'''
Created on 2014-09-18 21:10:39
@author: Mangoer
@email: 2395528746@qq.com
'''
import urllib2
import re
from bs4 import BeautifulSoup
import random
import time

class CSDN_Blog_Spider:
     '''Fetches one CSDN article and appends its text to out.txt.'''
     def __init__(self,url):
          print '\n'
          print 'Crawler started...'
          print 'Page URL: ' + url

          # Browser User-Agent strings; one is picked at random for each request
          user_agents = [
                    'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11',
                    'Opera/9.25 (Windows NT 5.1; U; en)',
                    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
                    'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)',
                    'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12',
                    'Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9',
                    "Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.7 (KHTML, like Gecko) Ubuntu/11.04 Chromium/16.0.912.77 Chrome/16.0.912.77 Safari/535.7",
                    "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0 ",
                   ]

          # Optional: route requests through a randomly chosen proxy IP
          # ips_list = ['60.220.204.2:63000','123.150.92.91:80','121.248.150.107:8080','61.185.21.175:8080','222.216.109.114:3128','118.144.54.190:8118',
          #           '1.50.235.82:80','203.80.144.4:80']
          # ip = random.choice(ips_list)
          # print 'Proxy IP in use: ' + ip
          # proxy_support = urllib2.ProxyHandler({'http':'http://'+ip})
          # opener = urllib2.build_opener(proxy_support)
          # urllib2.install_opener(opener)

          agent = random.choice(user_agents)
          req = urllib2.Request(url)
          req.add_header('User-Agent',agent)
          req.add_header('Host','blog.csdn.net')
          req.add_header('Accept','*/*')
          req.add_header('Referer','http://blog.csdn.net/mangoer_ys?viewmode=list')
          # urllib2 issues a GET by default, so no extra 'GET' header is needed
          html = urllib2.urlopen(req)
          # Decode the response as GBK (dropping undecodable bytes), then re-encode as UTF-8
          page = html.read().decode('gbk','ignore').encode('utf-8')
          self.page = page
          self.title = self.getTitle()
          self.content = self.getContent()
          self.saveFile()

     def printInfo(self):
          print('Article title:   '+self.title + '\n')
          print('Content has been saved to out.txt!')

     def getTitle(self):
          # Take whatever sits between <title> and </title>
          rex = re.compile('<title>(.*?)</title>',re.DOTALL)
          match = rex.search(self.page)
          if match:
               return match.group(1)
          return 'NO TITLE'

     def getContent(self):
          bs = BeautifulSoup(self.page)
          html_content_list = bs.findAll('div',{'id':'article_content','class':'article_content'})
          # Return an empty string instead of crashing when the article div is missing
          if not html_content_list:
               return ''
          html_content = str(html_content_list[0])
          # Capture the text between one tag's '>' and the next tag's '<'
          rex_p = re.compile(r'(?:.*?)>(.*?)<(?:.*?)',re.DOTALL)
          p_list = rex_p.findall(html_content)
          content = ''
          for p in p_list:
               if p.isspace() or p == '':
                    continue
               content = content + p
          return content

     def saveFile(self):
          # Append to out.txt; the with-block closes the file after writing
          with open('out.txt','a') as outfile:
               outfile.write(self.content)

     def getNextArticle(self):
          bs2 = BeautifulSoup(self.page)
          html_nextArticle_list = bs2.findAll('li',{'class':'prev_article'})
          # No 'prev_article' item means there is nothing left to follow
          if not html_nextArticle_list:
               return None
          html_nextArticle = str(html_nextArticle_list[0])
          rex_link = re.compile(r'<a href=\"(.*?)\"',re.DOTALL)
          link = rex_link.search(html_nextArticle)
          if link:
               next_url = 'http://blog.csdn.net' + link.group(1)
               return next_url
          return None

class Scheduler:
     def __init__(self,url):
          self.start_url = url

     def start(self):
          spider = CSDN_Blog_Spider(self.start_url)
          spider.printInfo()
          while True:
               # Look up the next link once per iteration instead of re-parsing the page
               next_url = spider.getNextArticle()
               if next_url is None:
                    print 'All articles have been downloaded!'
                    break
               # Pause before each request so the site is not hit too often
               time.sleep(10)
               spider = CSDN_Blog_Spider(next_url)
               spider.printInfo()

#url = raw_input('Enter the URL of a CSDN blog post: ')
url = "http://blog.csdn.net/mangoer_ys/article/details/38427979"
Scheduler(url).start()

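There is one problem in the program that I have not been able to solve: I cannot use the article title to name the output file, so all the articles end up in a single out.txt. It is said to be an encoding problem. I hope someone more experienced can work out a fix.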
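One possible way around the naming problem is to turn each title into a filesystem-safe name and to write the text with an explicit encoding. The sketch below is written for Python 3, where strings are Unicode by default. It has not been tested against the author's Python 2 setup, and `safe_filename` and `save_article` are hypothetical helpers, not part of the original script.

```python
import os
import re
import tempfile

# Characters that Windows does not allow in filenames, plus control characters.
_INVALID_CHARS = re.compile(r'[\\/:*?"<>|\x00-\x1f]')


def safe_filename(title, max_length=80):
    """Turn an article title into a name that is safe to use as a filename."""
    name = re.sub(r'\s+', ' ', title).strip()   # collapse newlines and tabs
    name = _INVALID_CHARS.sub('_', name)        # replace forbidden characters
    name = name.strip(' .')                     # no leading/trailing dots or spaces
    return (name[:max_length] or 'untitled') + '.txt'


def save_article(directory, title, content):
    """Write content to <directory>/<safe title>.txt as UTF-8; return the path."""
    path = os.path.join(directory, safe_filename(title))
    with open(path, 'w', encoding='utf-8') as outfile:
        outfile.write(content)
    return path


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp:
        saved = save_article(tmp, 'C/C++: pointers? - CSDN博客\r\n', 'body text')
        print(os.path.basename(saved))
```

Titles scraped from a `<title>` tag often contain slashes, colons or trailing newlines, any of which can make `open()` fail or create the file in an unexpected place, so cleaning the title first is worth trying even if the root cause turns out to be something else.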
