使用R语言和XML包抓取网页数据-Scraping data from web pages in R with XML package

In the last years a lot of data has been released publicly in different formats, but sometimes the data we're interested in are still inside the HTML of a web page: let's see how to get those data.

One of the existing packages for doing this job is the XML package. This package allows us to read and create XML and HTML documents; among the many features, there's a function called readHTMLTable() that analyze the parsed HTML and returns the tables present in the page. The details of the package are available in the official documentation of the package.

Let's start.
Suppose we're interested in the italian demographic info present in this page http://sdw.ecb.europa.eu/browse.do?node=2120803 from the EU website. We start loading and parsing the page:

page <- "http://sdw.ecb.europa.eu/browse.do?node=2120803"

parsed <- htmlParse(page)

Now that we have parsed HTML, we can use the readHTMLTable() function to return a list with all the tables present in the page; we'll call the function with these parameters:

parsed: the parsed HTML
skip.rows: the rows we want to skip (at the beginning of this table there are a couple of rows that don't contain data but just formatting elements)
colClasses: the datatype of the different columns of the table (in our case all the columns have integer values); the rep() function is used to replicate 31 times the "integer" value

table <- readHTMLTable(parsed, skip.rows=c(1,3,4,5), colClasses = c(rep("integer", 31)))

As we can see from the page source code, this web page contains six HTML tables; the one that contains the data we're interested in is the fifth, so we extract that one from the list of tables, as a data frame:

values <- as.data.frame(table[5])

Just for convenience, we rename the columns with the period and italian data:

# renames the columns for the period and Italy

colnames(values)[1] <- 'Period'

colnames(values)[19] <- 'Italy'

The italian data lasts from 1990 to 2014, so we have to subset only those rows and, of course, only the two columns of period and italian data:

# subsets the data: we are interested only in the first and the 19th column (period and italian info)

ids <- values[c(1,19)]

# Italy has only 25 years of info, so we cut away the other rows

ids <- as.data.frame(ids[1:25,])

Now we can plot these data calling the plot function with these parameters:

ids: the data to plot
xlab: the label of the X axis
ylab: the label of the Y axis
main: the title of the plot
pch: the symbol to draw for evey point (19 is a solid circle: look here for an overview)
cex: the size of the symbol

plot(ids, xlab="Year", ylab="Population in thousands", main="Population 1990-2014", pch=19, cex=0.5)

and here is the result:

Here's the full code, also available on my github:

library(XML)

# sets the URL

url <- "http://sdw.ecb.europa.eu/browse.do?node=2120803"

# let the XML library parse the HTMl of the page

parsed <- htmlParse(url)

# reads the HTML table present inside the page, paying attention

# to the data types contained in the HTML table

table <- readHTMLTable(parsed, skip.rows=c(1,3,4,5), colClasses = c(rep("integer", 31) ))

# this web page contains seven HTML pages, but the one that contains the data

# is the fifth

values <- as.data.frame(table[5])

# renames the columns for the period and Italy

colnames(values)[1] <- 'Period'

colnames(values)[19] <- 'Italy'

# now subsets the data: we are interested only in the first and

# the 19th column (period and Italy info)

ids <- values[c(1,19)]

# Italy has only 25 year of info, so we cut away the others

ids <- as.data.frame(ids[1:25,])

# plots the data

plot(ids, xlab="Year", ylab="Population in thousands", main="Population 1990-2014", pch=19, cex=0.5)

from: http://andreaiacono.blogspot.com/2014/01/scraping-data-from-web-pages-in-r-with.html

使用R语言和XML包抓取网页数据-Scraping data from web pages in R with XML package的更多相关文章

java抓取网页数据，登录之后抓取数据。
最近做了一个从网络上抓取数据的一个小程序.主要关于信贷方面,收集的一些黑名单网站,从该网站上抓取到自己系统中. 也找了一些资料,觉得没有一个很好的,全面的例子.因此在这里做个笔记提醒自己. 首先需要一 ...
Asp.net 使用正则和网络编程抓取网页数据(有用)
Asp.net 使用正则和网络编程抓取网页数据(有用) Asp.net 使用正则和网络编程抓取网页数据(有用) /// <summary> /// 抓取网页对应内容 /// </su ...
使用JAVA抓取网页数据
一.使用 HttpClient 抓取网页数据 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 ...
使用HtmlAgilityPack批量抓取网页数据
原文:使用HtmlAgilityPack批量抓取网页数据相关软件点击下载登录的处理.因为有些网页数据需要登陆后才能提取.这里要使用ieHTTPHeaders来提取登录时的提交信息.抓取网页 Htm ...
web scraper 抓取网页数据的几个常见问题
如果你想抓取数据,又懒得写代码了,可以试试 web scraper 抓取数据. 相关文章: 最简单的数据抓取教程,人人都用得上 web scraper 进阶教程,人人都用得上如果你在使用 web s ...
c#抓取网页数据
写了一个简单的抓取网页数据的小例子,代码如下: //根据Url地址得到网页的html源码 private string GetWebContent(string Url) { string strRe ...
【iOS】正則表達式抓取网页数据制作小词典
版权声明:本文为博主原创文章,未经博主同意不得转载. https://blog.csdn.net/xn4545945/article/details/37684127 应用程序不一定要自己去提供数据. ...
01 UIPath抓取网页数据并导出Excel（非Table表单）
上次转载了一篇<UIPath抓取网页数据并导出Excel>的文章,因为那个导出的是table标签中的数据,所以相对比较简单.现实的网页中,有许多不是通过table标签展示的,那又该如何处理 ...
Node.js的学习--使用cheerio抓取网页数据
打算要写一个公开课网站,缺少数据,就决定去网易公开课去抓取一些数据. 前一阵子看过一段时间的Node.js,而且Node.js也比较适合做这个事情,就打算用Node.js去抓取数据. 关键是抓取到网页 ...

随机推荐

MVC公开课 – 1.基础 (2013-3-15广州传智MVC公开课)
1.MVC设计模式 Model 是指要处理的业务代码和数据操作代码 View 视图主要是指的跟用户打交道并能够展示数据 Controller 看成是 Model和View的桥梁优点: 1.1 ...
python开发学习-day03(set集合、collection系列、深浅拷贝、函数)
s12-20160116-day03 *:first-child { margin-top: 0 !important; } body>*:last-child { margin-bottom: ...
Webpack, VSCode 和 Babel 组件模块导入别名
很多时候我们使用别人的库,都是通过 npm install,再简单的引入,就可以使用了. 1 2 import React from 'react' import { connect } fr ...
微信开发(一)SAE环境搭建
登录新浪sae平台,点击sae 点击创建新应用->继续创建环境选择: 填好后点击创建应用点击创建版本点击链接可以访问,点击编辑代码可以在线编辑,代码上传可以是svn,git,可以在线上传 ...
json调试
private static void mockapi(OkHttpClient.Builder httpClientBuilder) { if (Config.isMockApi) { MockAp ...
Mybatis源码分析之插件的原理
MyBatis 允许你在已映射语句执行过程中的某一点进行拦截调用. 默认情况下,可以使用插件来拦截的方法调用包括: Executor (update, query, flushStatements, ...
MYSQL注入天书之前言
写在前面的一些内容请允许我叨叨一顿: 最初看到sqli-labs也是好几年之前了,那时候玩了前面的几个关卡,就没有继续下去了.最近因某个需求想起了sqli-labs,所以翻出来玩了下.从每一关卡的娱 ...
ssvm和console 模板机连接不上管理节点
说明: cloudstack 版本http://www.shapeblue.com/packages/ 并不是官方的 systemvm64template-4.6.0-vmware.ova 官 ...
ESXI 5.5卡在LSI_MR3.V00
方法一故障现象此问题无论使用VMware官方镜像还是HP的自定义镜像都会出现一下情况并卡着不动.(此文档普遍存在各种服务器上,包括其它厂商服务器) 故障原因: 故障原因VMware官方和HP官方并 ...
我的OI生涯第六章
开学了,但是我们并没有像一个正常的高二学生一样坐在教室里接受调研考试的洗礼. 暑假作业这种东西早已被甩在一旁,可以想象回去补文化课时该有多么狼狈. 大王给我们制定了周密的计划,每周两次测试,加上蔡老师 ...

使用R语言和XML包抓取网页数据-Scraping data from web pages in R with XML package

使用R语言和XML包抓取网页数据-Scraping data from web pages in R with XML package的更多相关文章

随机推荐

热门专题