Cnblog Crawl

a). Load the R packages used

## Load the packages needed in this case

library(proto)

library(gsubfn)

## Warning in doTryCatch(return(expr), name, parentenv, handler): unable to load shared object '/Library/Frameworks/R.framework/Resources/modules//R_X11.so':

##   dlopen(/Library/Frameworks/R.framework/Resources/modules//R_X11.so, 6): Library not loaded: /opt/X11/lib/libSM.6.dylib

##   Referenced from: /Library/Frameworks/R.framework/Resources/modules//R_X11.so

##   Reason: image not found

## Could not load tcltk.  Will use slower R code instead.

library(bitops)

library(rvest)

library(stringr)

library(DBI)

library(RSQLite)

library(sqldf)

library(RCurl)

library(ggplot2)

library(sp)

library(raster)

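## Our computers usually run in a Chinese locale, but I want weekday names such as Monday and Tuesday,

## so set the LC_TIME locale to "C", which makes R return English day and month names.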

Sys.setlocale("LC_TIME", "C")

## [1] "C"
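A quick way to check what the locale setting changes is to format a known date. The sketch below uses 2015-04-10, a date that appears in the sample output later in this post; the variable names are only for illustration.

```r
# Switch the time locale so weekday and month names come out in English.
Sys.setlocale("LC_TIME", "C")

d <- as.Date("2015-04-10")
day_name <- weekdays(d)        # full weekday name
month_abbr <- format(d, "%b")  # abbreviated month name
print(c(day_name, month_abbr))
```

Under a Chinese LC_TIME setting the same two calls would return Chinese names, which is why the locale is set before any date handling.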

b). Define a function that will be used to scrape the information.

## Create a function; the parameter 'i' is the page number.

getdata <- function(i){

    url <- paste0("http://www.cnblogs.com/p",i) ## build the URL of page i

    page <- html_session(url) ## request the page once and reuse it for every field

    ## footer text of each post, split into lines

    combined_info <- page%>%html_nodes("div.post_item div.post_item_foot")%>%html_text()%>%strsplit(split="\r\n")

    post_date <- sapply(combined_info, function(v) return(v[3]))%>%str_sub(9,24)%>%as.POSIXlt() ## get the post date

    post_year <- post_date$year+1900 ## POSIXlt counts years from 1900

    post_month <- post_date$mon+1 ## POSIXlt months run from 0 to 11

    post_day <- post_date$mday

    post_hour <- post_date$hour

    post_weekday <- weekdays(post_date)

    title <- page%>%html_nodes("div.post_item h3")%>%html_text()%>%as.character()%>%trim()

    link <- page%>%html_nodes("div.post_item a.titlelnk")%>%html_attr("href")%>%as.character()

    author <- page%>%html_nodes("div.post_item a.lightblue")%>%html_text()%>%as.character()%>%trim()

    author_hp <- page%>%html_nodes("div.post_item a.lightblue")%>%html_attr("href")%>%as.character()

    recommendation <- page%>%html_nodes("div.post_item span.diggnum")%>%html_text()%>%trim()%>%as.numeric()

    article_view <- page%>%html_nodes("div.post_item span.article_view")%>%html_text()%>%str_sub(4,20)

    article_view <- gsub(")","",article_view,fixed=TRUE)%>%trim()%>%as.numeric() ## drop the closing bracket

    article_comment <- page%>%html_nodes("div.post_item span.article_comment")%>%html_text()%>%str_sub(14,100)

    article_comment <- gsub(")","",article_comment,fixed=TRUE)%>%trim()%>%as.numeric() ## drop the closing bracket

    data.frame(title,recommendation,article_view,article_comment,post_date,post_weekday,post_year,post_month,post_day,post_hour,link,author,author_hp)

}
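The function above cuts the date and the counters out of the footer text by fixed character positions, which breaks as soon as the page adds or removes a space. Below is a minimal sketch of a position-independent alternative in base R. The helper names `extract_count` and `parse_post_date` are hypothetical, and the sample strings are my assumption about what the footer text looks like (the numbers are taken from row 4 of the sample output further down).

```r
# Hypothetical helper: keep only the digits of each string and convert to a number.
extract_count <- function(x) {
  digits <- gsub("[^0-9]", "", x)
  as.numeric(ifelse(nzchar(digits), digits, NA))
}

# Hypothetical helper: find a "YYYY-MM-DD HH:MM" stamp anywhere in each string.
parse_post_date <- function(x) {
  pattern <- "[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}"
  m <- regexpr(pattern, x)
  stamp <- rep(NA_character_, length(x))
  hit <- which(m > 0)                 # positions that contain a match
  stamp[hit] <- regmatches(x, m)      # regmatches returns matched strings only
  as.POSIXct(stamp, format = "%Y-%m-%d %H:%M", tz = "UTC")
}

# Assumed footer fragments, not copied from the live page.
views    <- extract_count(c("阅读(152)", "  阅读(72) "))
comments <- extract_count(c("评论(2)", "no number here"))
dates    <- parse_post_date(c("发布于 2015-04-10 19:25", "no date here"))

print(views)
print(comments)
print(format(dates, "%Y-%m-%d %H:%M"))
```

Strings without a number or a date come back as NA instead of raising an error, so a single odd post does not stop the whole page from being parsed.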

c). Scrape the post-publication data from cnblogs. Here I only scrape pages 1-5, which gives 100 records.

cnblog<- data.frame()

for(m in 1:5){

    cnblog <- rbind(cnblog,getdata(m))

}
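The loop above stops at the first page that fails to download. Here is a sketch of a more forgiving version, assuming you are happy to skip failed pages. `crawl_pages` and `fake_page` are hypothetical names; `fake_page` only stands in for `getdata` so the example runs without a network connection.

```r
# Hypothetical wrapper: fetch each page, pause between requests, skip failures.
crawl_pages <- function(pages, fetch_page, pause = 0) {
  results <- lapply(pages, function(p) {
    Sys.sleep(pause)  # be polite to the server
    tryCatch(fetch_page(p), error = function(e) {
      message("page ", p, " failed: ", conditionMessage(e))
      NULL
    })
  })
  results <- Filter(Negate(is.null), results)  # drop the failed pages
  do.call(rbind, results)                      # bind all pages in one step
}

# Stand-in for getdata(): 20 rows per page, with page 3 failing on purpose.
fake_page <- function(p) {
  if (p == 3) stop("timeout")
  data.frame(page = rep(p, 20), row = 1:20)
}

demo <- crawl_pages(1:5, fake_page)
print(dim(demo))
```

With the real function the call would look like `crawl_pages(1:5, getdata, pause = 1)`. Collecting the pages in a list and binding once also avoids growing the data frame inside the loop.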

d). Take a look at the scraped data.

dim(cnblog)

## [1] 100  13

head(cnblog)

##                                                     title recommendation

## 1 Dynamic CRM 2015学习笔记（3）oData 查询方法及GUID值比较              0

## 2                                        Unity 之圆环算法              0

## 3                                        浅谈研发项目经理              1

## 4                                                C# Redis              0

## 5              JavaScript系列----AJAX机制详解以及跨域通信              0

## 6                                           MP4视频编码器              1

##   article_view article_comment           post_date post_weekday post_year

## 1            0               0 2015-04-10 20:46:00       Friday      2015

## 2           58               0 2015-04-10 19:57:00       Friday      2015

## 3          143               0 2015-04-10 19:38:00       Friday      2015

## 4          152               2 2015-04-10 19:25:00       Friday      2015

## 5           72               0 2015-04-10 19:14:00       Friday      2015

## 6           72               1 2015-04-10 19:14:00       Friday      2015

##   post_month post_day post_hour

## 1          4       10        20

## 2          4       10        19

## 3          4       10        19

## 4          4       10        19

## 5          4       10        19

## 6          4       10        19

##                                                    link     author

## 1       http://www.cnblogs.com/fengwenit/p/4415631.html     疯吻IT

## 2 http://www.cnblogs.com/wuzhang/p/wuzhang20150410.html    wuzhang

## 3        http://www.cnblogs.com/fancyamx/p/4415521.html     maxlin

## 4       http://www.cnblogs.com/caokai520/p/4409712.html   每日一bo

## 5     http://www.cnblogs.com/renlong0602/p/4414872.html 天天向上中

## 6         http://www.cnblogs.com/dhenskr/p/4414984.html    dhenskr

##                             author_hp

## 1   http://www.cnblogs.com/fengwenit/

## 2     http://www.cnblogs.com/wuzhang/

## 3    http://www.cnblogs.com/fancyamx/

## 4   http://www.cnblogs.com/caokai520/

## 5 http://www.cnblogs.com/renlong0602/

## 6     http://www.cnblogs.com/dhenskr/

tail(cnblog)

##                                  title recommendation article_view

## 95          前端资源预加载并展示进度条              3          560

## 96  Android中的Handler的机制与用法详解              1          213

## 97              JS学习笔记3_函数表达式              0          219

## 98                    iOS-MVVM设计模式              0          228

## 99             HTML5简单入门系列（七）              0          385

## 100  【Win 10应用开发】认识一下UAP项目              5          523

##     article_comment           post_date post_weekday post_year post_month

## 95                4 2015-04-08 18:03:00    Wednesday      2015          4

## 96                0 2015-04-08 18:02:00    Wednesday      2015          4

## 97                0 2015-04-08 17:56:00    Wednesday      2015          4

## 98                0 2015-04-08 17:47:00    Wednesday      2015          4

## 99                0 2015-04-08 17:36:00    Wednesday      2015          4

## 100               6 2015-04-08 17:31:00    Wednesday      2015          4

##     post_day post_hour

## 95         8        18

## 96         8        18

## 97         8        17

## 98         8        17

## 99         8        17

## 100        8        17

##                                                              link   author

## 95  http://www.cnblogs.com/lvdabao/p/resource-preload-plugin.html 每日一bo

## 96            http://www.cnblogs.com/JczmDeveloper/p/4403129.html   吕大豹

## 97                     http://www.cnblogs.com/ayqy/p/4403086.html Jamy Cai

## 98                    http://www.cnblogs.com/xqios/p/4403071.html     梦烬

## 99                   http://www.cnblogs.com/cotton/p/4403042.html   ciderX

## 100                 http://www.cnblogs.com/tcjiaan/p/4403018.html 棉花年度

##                                 author_hp

## 95      http://www.cnblogs.com/caokai520/

## 96        http://www.cnblogs.com/lvdabao/

## 97  http://www.cnblogs.com/JczmDeveloper/

## 98           http://www.cnblogs.com/ayqy/

## 99          http://www.cnblogs.com/xqios/

## 100        http://www.cnblogs.com/cotton/

e). Here I only look at the posts from the four weeks Mar.02-Mar.29. The data gets some simple processing below.

## Only analyze the four weeks of March (Mar.02-Mar.29).

cnblog_Mar <- sqldf("select * from cnblog where post_month=3 and post_day>=2 and post_day<=29")

## Order the weekdays from Monday to Sunday so the charts follow the calendar.

cnblog_Mar$post_weekday <- factor(cnblog_Mar$post_weekday,ordered=TRUE,levels=c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"))

cnblog_Mar$post_hour <- as.factor(cnblog_Mar$post_hour)
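The same date-range filter can also be written in base R by comparing real dates rather than day-of-month numbers, which keeps other months out automatically. This is only a sketch: the timestamps are made up, and `posts` and `posts_mar` are illustrative names standing in for `cnblog` and `cnblog_Mar`.

```r
Sys.setlocale("LC_TIME", "C")  # English weekday names

# Made-up timestamps standing in for the post_date column.
posts <- data.frame(post_date = as.POSIXct(
  c("2015-03-01 10:00", "2015-03-02 09:30", "2015-03-29 23:10",
    "2015-03-30 08:00", "2015-04-10 19:38"), tz = "UTC"))

# Keep rows whose calendar date falls inside Mar.02-Mar.29.
day <- as.Date(posts$post_date, tz = "UTC")
keep <- day >= as.Date("2015-03-02") & day <= as.Date("2015-03-29")
posts_mar <- posts[keep, , drop = FALSE]

# Ordered weekday factor (Monday first) and hour as a factor.
week_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday")
posts_mar$post_weekday <- factor(weekdays(posts_mar$post_date),
                                 levels = week_levels, ordered = TRUE)
posts_mar$post_hour <- factor(as.POSIXlt(posts_mar$post_date)$hour)
print(posts_mar)
```

Because the weekday factor is ordered, both `table()` and ggplot2 list the days from Monday to Sunday instead of alphabetically.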

f). Simple data analysis with charts

Mar.02-Mar.29: number of posts by day of the week

ggplot(data=cnblog_Mar,aes(post_weekday))+geom_bar()

Number of posts per day

ggplot(data=cnblog_Mar,aes(post_date))+geom_bar()

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
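The message above shows that the timestamps were cut into 30 equal-width bins rather than into calendar days. To count per day, one option is to reduce each timestamp to its date first. A small sketch in base R, using five timestamps from the sample output in section d):

```r
# Timestamps taken from rows 100, 95, 5, 6 and 1 of the sample output.
stamps <- as.POSIXct(c("2015-04-08 17:31", "2015-04-08 18:03",
                       "2015-04-10 19:14", "2015-04-10 19:14",
                       "2015-04-10 20:46"), tz = "UTC")

# Drop the time of day, then count posts per calendar date.
per_day <- table(as.Date(stamps, tz = "UTC"))
print(per_day)
```

Plotting a date column built this way should give one bar per day instead of 30 arbitrary bins.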

Number of posts by hour of the day

ggplot(data=cnblog_Mar,aes(post_hour))+geom_bar()
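One thing to watch with `as.factor(post_hour)`: hours in which nobody posted get no level at all, so they vanish from the axis. A sketch of how to keep all 24 hours, using the hours of rows 1-6 and 95-100 from the sample output:

```r
# Hours from the sample output rows 1-6 and 95-100.
hours <- c(20, 19, 19, 19, 19, 19, 18, 18, 17, 17, 17, 17)

# Declare every hour of the day as a level, even if it has no posts.
hour_factor <- factor(hours, levels = 0:23)
per_hour <- table(hour_factor)
print(per_hour)
```

With ggplot2, adding `scale_x_discrete(drop = FALSE)` to the bar chart should keep the empty hours on the axis as well.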

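g). Summary

A crawler written in R like this one has two parts: 1. define a function that handles a single page; 2. call that function in a loop and happily scrape away. The key to the first step is finding the right html_nodes.
Google Chrome combined with a CSS selector tool makes finding the html_nodes very convenient.
Sites that present their data in a table are even easier to scrape, for example the NBA 2014-2015 regular season statistics ranking (scoring leaders). Without a table you have to scrape field by field, as with cnblogs.
For commercial sites such as Taobao, JD and Ctrip, the data is their treasure.
Next I plan to scrape some recruitment sites and analyze the salaries in the industries I am interested in, which companies are hiring, where the jobs are located, and what skills the positions require: R, Python, SAS, databases and so on.
I wrote a post on cnblogs about my first two crawlers. If you are interested, have a look at "Learning web crawling in R with the rvest package".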
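To make the difference between table pages and field-by-field pages concrete, here is a small sketch that parses two inline HTML fragments with rvest. It assumes a reasonably recent rvest that provides `read_html`. The HTML, the names and the numbers are all invented for illustration; only the CSS classes (`post_item`, `titlelnk`, `diggnum`) come from the function in section b).

```r
library(rvest)

# Invented table fragment: one call to html_table() returns a data frame.
table_html <- '<table>
  <tr><th>player</th><th>points</th></tr>
  <tr><td>Player A</td><td>27.5</td></tr>
  <tr><td>Player B</td><td>25.1</td></tr>
</table>'
scores <- read_html(table_html) %>% html_node("table") %>% html_table()

# Invented list fragment: every field needs its own CSS selector.
list_html <- '<div class="post_item">
  <h3><a class="titlelnk" href="http://example.com/1">First post</a></h3>
  <span class="diggnum">3</span>
</div>
<div class="post_item">
  <h3><a class="titlelnk" href="http://example.com/2">Second post</a></h3>
  <span class="diggnum">0</span>
</div>'
page <- read_html(list_html)
posts <- data.frame(
  title = page %>% html_nodes("div.post_item a.titlelnk") %>% html_text(),
  link = page %>% html_nodes("div.post_item a.titlelnk") %>% html_attr("href"),
  recommendation = page %>% html_nodes("div.post_item span.diggnum") %>%
    html_text() %>% as.numeric(),
  stringsAsFactors = FALSE
)

print(scores)
print(posts)
```

The field-by-field version only lines up if every selector returns exactly one node per post, so it is worth comparing the vector lengths before building the data frame.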
