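Automated Data Collection with R, Chapter 2: HTML notes 1 (mainly about handler functions and all the examples from the help documentation)
1 In Chrome (Chrome works somewhat better here than the 360 Extreme Browser, even though both use the same engine), select a line of text, right-click and choose Inspect, and the corresponding line of HTML source is highlighted.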
> url <- "http://www.r-datacollection.com/materials/html/fortunes.html"
> fortunes <- readLines(con = url)
> fortunes
 [1] "<!DOCTYPE HTML PUBLIC \"-//IETF//DTD HTML//EN\">"
 [2] "<html> <head>"
 [3] "<title>Collected R wisdoms</title>"
 [4] "</head>"
 [5] ""
 [6] "<body>"
 [7] "<div id=\"R Inventor\" lang=\"english\" date=\"June/2003\">"
 [8] " <h1>Robert Gentleman</h1>"
 [9] " <p><i>'What we have is nice, but we need something very different'</i></p>"
[10] " <p><b>Source: </b>Statistical Computing 2003, Reisensburg"
[11] "</div>"
[12] ""
[13] "<div lang=english date=\"October/2011\">"
[14] " <h1>Rolf Turner</h1>"
[15] " <p><i>'R is wonderful, but it cannot work magic'</i> <br><emph>answering a request for automatic generation of 'data from a known mean and 95% CI'</emph></p>"
[16] " <p><b>Source: </b><a href=\"https://stat.ethz.ch/mailman/listinfo/r-help\">R-help</a></p>"
[17] "</div>"
[18] ""
[19] "<address><a href=\"www.r-datacollectionbook.com\"><i>The book homepage</i><a/></address>"
[20] ""
[21] "</body> </html>"
> library(XML)
> parsed_fortunes <- htmlParse(file = url)
> print(parsed_fortunes)
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<html>
<head><title>Collected R wisdoms</title></head>
<body>
<div id="R Inventor" lang="english" date="June/2003">
  <h1>Robert Gentleman</h1>
  <p><i>'What we have is nice, but we need something very different'</i></p>
  <p><b>Source: </b>Statistical Computing 2003, Reisensburg</p>
</div>
<div lang="english" date="October/2011">
  <h1>Rolf Turner</h1>
  <p><i>'R is wonderful, but it cannot work magic'</i><br><emph>answering a request for automatic generation of 'data from a known mean and 95% CI'</emph></p>
  <p><b>Source: </b><a href="https://stat.ethz.ch/mailman/listinfo/r-help">R-help</a></p>
</div>
<address>
<a href="www.r-datacollectionbook.com"><i>The book homepage</i></a><a></a>
</address>
</body>
</html>
# Code snippet 3
h1 <- list("body" = function(x) {NULL})
parsed_fortunes <- htmlTreeParse(url, handlers = h1, asTree = TRUE)
parsed_fortunes$children
# $html
# <html>
#  <head>
#   <title>Collected R wisdoms</title>
#  </head>
# </html>
h2 <- list(
  startElement = function(node, ...) {
    name <- xmlName(node)
    if (name %in% c("div", "title")) {NULL} else {node}
  },
  comment = function(node) {NULL}
)
parsed_fortunes <- htmlTreeParse(file = url, handlers = h2, asTree = TRUE)
parsed_fortunes$children
# $html
# <html>
#  <head/>
#  <body>
#   <address>
#    <a href="www.r-datacollectionbook.com"><i>The book homepage</i></a>
#    <a/>
#   </address>
#  </body>
# </html>
h1 <- list("body" = function(x) {
  print('here is a body tag')
  NULL
})
parsed_fortunes <- htmlTreeParse(url, handlers = h1, asTree = TRUE)
# [1] "here is a body tag"
i <- 0
h2 <- list(
  startElement = function(node, ...) {
    i <<- i + 1
    print(paste("here is the ", i, "st tag,its name is", xmlName(node)))
    NULL
  }
  # comment = function(node) {
  #   print(paste("here is a comment,its name is", xmlName(node)))
  #   NULL
  # }
)
parsed_fortunes <- htmlTreeParse(file = url, handlers = h2, asTree = TRUE)
# Note the order: each handler call happens once a node is complete,
# so children are reported before their parents.
# [1] "here is the  1 st tag,its name is title"
# [1] "here is the  2 st tag,its name is head"
# [1] "here is the  3 st tag,its name is h1"
# [1] "here is the  4 st tag,its name is i"
# [1] "here is the  5 st tag,its name is p"
# [1] "here is the  6 st tag,its name is b"
# [1] "here is the  7 st tag,its name is p"
# [1] "here is the  8 st tag,its name is div"
# [1] "here is the  9 st tag,its name is h1"
# [1] "here is the  10 st tag,its name is i"
# [1] "here is the  11 st tag,its name is br"
# [1] "here is the  12 st tag,its name is emph"
# [1] "here is the  13 st tag,its name is p"
# [1] "here is the  14 st tag,its name is b"
# [1] "here is the  15 st tag,its name is a"
# [1] "here is the  16 st tag,its name is p"
# [1] "here is the  17 st tag,its name is div"
# [1] "here is the  18 st tag,its name is i"
# [1] "here is the  19 st tag,its name is a"
# [1] "here is the  20 st tag,its name is a"
# [1] "here is the  21 st tag,its name is address"
# [1] "here is the  22 st tag,its name is body"
# [1] "here is the  23 st tag,its name is html"
getItalics = function() {
  i_container = character()
  list(
    i = function(node, ...) {
      i_container <<- c(i_container, xmlValue(node))
    },
    returnI = function() i_container
  )
}
h3 <- getItalics()
invisible(htmlTreeParse(url, handlers = h3))
h3$returnI()
# [1] "'What we have is nice, but we need something very different'"
# [2] "'R is wonderful, but it cannot work magic'"
# [3] "The book homepage"
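> a = character()
> b <- c(a, '2')
> b
[1] "2"
> c <- c(b, '3')
> c
[1] "2" "3"
> a
character(0)
> b
[1] "2"
This idiom is rather neat too: c() on an empty character vector returns a new, longer vector and leaves the original unchanged, which is why the handler above has to assign the result back with <<-.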
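The accumulator pattern behind getItalics() can be tried without any XML at all. The following is only a minimal base-R sketch; make_collector, add, and get are hypothetical names invented for illustration and are not part of the XML package.

```r
# Hypothetical stand-in for getItalics(): a closure that accumulates values.
make_collector <- function() {
  container <- character()   # lives in the environment of make_collector()
  list(
    add = function(value) {
      # c() builds a new vector; `<<-` rebinds `container` in the enclosing
      # environment instead of creating a local variable inside add().
      container <<- c(container, value)
      invisible(NULL)
    },
    get = function() container
  )
}

collector <- make_collector()
collector$add("first")
collector$add("second")
result <- collector$get()
print(result)

# Each call to make_collector() gets its own, independent container.
other <- make_collector()
print(length(other$get()))
```

Using a plain `<-` inside add() would only create a local copy that is discarded when the function returns, so the container would stay empty.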

name        The name of the element.
attributes  For regular elements, a named list of XML attributes converted from the <tag x="1" y="abc">
children    List of sub-nodes.
value       Used only for text entries.
Some nodes specializations of XMLNode, such as XMLComment, XMLProcessingInstruction, XMLEntityRef are used.
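The fields listed above can be inspected through the package's accessor functions. This is a small sketch under the assumption that the XML package is installed; the one-element document is made up for illustration and mirrors the <tag x="1" y="abc"> example in the help text.

```r
library(XML)

# Parse a one-element document from a string (asText = TRUE) into R-level nodes.
doc  <- xmlTreeParse('<tag x="1" y="abc">hello</tag>', asText = TRUE)
node <- xmlRoot(doc)

node_name  <- xmlName(node)       # the element name
node_attrs <- xmlAttrs(node)      # the attributes, with their names
node_kids  <- xmlChildren(node)   # the sub-nodes (here a single text node)
node_text  <- xmlValue(node)      # the text content

print(node_name)
print(node_attrs)
print(length(node_kids))
print(node_text)
```

Handlers receive nodes of exactly this kind, which is why calls such as xmlName(node) and xmlValue(node) appear inside the handler functions in these notes.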
<?xml version="1.0"?>
<!DOCTYPE foo [
  <!ENTITY % bar "for R and S">
  <!ENTITY % foo "for Omegahat">
  <!ENTITY testEnt "test entity bar">
  <!ENTITY logo SYSTEM "images/logo.gif" NDATA gif>
  <!ENTITY % extEnt SYSTEM "http://www.omegahat.net">
  <!-- include the contents of the README file in the same directory as this one. -->
  <!ELEMENT x (#PCDATA) >
  <!ELEMENT y (x)* >
]>
<!-- A comment -->
<foo x="1">
  <element attrib1="my value"/>
  &testEnt;
  <?R sum(rnorm(100))?>
  <a>
    <!-- A comment -->
    <b>%extEnt;</b>
  </a>
  <![CDATA[
  This is escaped data
  containing < and &.
  ]]>
  Note that this caused a segmentation fault if replaceEntities was
  not TRUE.
  That is,
  <code>xmlTreeParse("test.xml", replaceEntities = TRUE)</code>
  works, but
  <code>xmlTreeParse("test.xml")</code>
  does not if this is called before the one above.
  This is now fixed and was caused by
  treating an xmlNodePtr in the C code
  that had type XML_ELEMENT_DECL
  and so was in fact an xmlElementPtr.
  Aaah, C and casting!
</foo>
fileName <- system.file("exampleData", "test.xml", package = "XML")
# parse the document and return it in its standard format.
xmlTreeParse(fileName)
# parse the document, discarding comments.
xmlTreeParse(fileName, handlers = list("comment" = function(x, ...) {NULL}), asTree = TRUE)
Nothing much to say about this one.
# print the entities
invisible(xmlTreeParse(fileName,
  handlers = list(entity = function(x) {
    cat("In entity", x$name, x$value, "\n")
    x
  }),
  asTree = TRUE))
# Parse some XML text.
# Read the text from the file
xmlText <- paste(readLines(fileName), "\n", collapse = "")
print(xmlText)
xmlTreeParse(xmlText, asText = TRUE)
# with version 1.4.2 we can pass the contents of an XML
# stream without pasting them.
xmlTreeParse(readLines(fileName), asText = TRUE)
# Read a MathML document and convert each node
# so that the primary class is
# <name of tag>MathML
# so that we can use method dispatching when processing
# it rather than conditional statements on the tag name.
# See plotMathML() in examples/.
fileName <- system.file("exampleData", "mathml.xml", package = "XML")
m <- xmlTreeParse(fileName,
  handlers = list(startElement = function(node) {
    cname <- paste(xmlName(node), "MathML", sep = "", collapse = "")
    class(node) <- c(cname, class(node))
    node
  }))
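This one is interesting. The handler modifies each node's class vector (not its XML attributes): it prepends "<tag name>MathML" as the primary class and keeps the existing classes after it. When we later call a generic function on the node, S3 dispatch automatically picks the method written for that "<tag name>MathML" class, so we no longer need if branches on the tag name to choose the right function.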
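The dispatch idea can be sketched in base R without parsing any MathML. Everything below is hypothetical and invented for illustration: the generic describe, its methods, and the tag_node helper that stands in for the startElement handler.

```r
# Hypothetical generic with one method per "<tag>MathML" class.
describe <- function(node, ...) UseMethod("describe")
describe.default    <- function(node, ...) "some other node"
describe.mrowMathML <- function(node, ...) "a row of math"
describe.miMathML   <- function(node, ...) "an identifier"

# Stand-in for the startElement handler: prepend "<tag>MathML" to the classes.
tag_node <- function(tag) {
  node <- structure(list(name = tag), class = "XMLNode")
  class(node) <- c(paste(tag, "MathML", sep = ""), class(node))
  node
}

# No if/else on the tag name: the class vector selects the method.
labels <- vapply(c("mrow", "mi", "mo"),
                 function(tag) describe(tag_node(tag)),
                 character(1))
print(labels)
```

A tag without a dedicated method (here "mo") falls through to describe.default, so new tags can be supported later simply by adding methods.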
# In this example, we extract _just_ the names of the
# variables in the mtcars.xml file.
# The names are the contents of the <variable>
# tags. We discard all other tags by returning NULL
# from the startElement handler.
#
# We cumulate the names of variables in a character
# vector named `vars'.
# We define this within a closure and define the
# variable function within that closure so that it
# will be invoked when the parser encounters a <variable>
# tag.
# This is called with 2 arguments: the XMLNode object (containing
# its children) and the list of attributes.
# We get the variable name via call to xmlValue().
# Note that we define the closure function in the call and then
# create an instance of it by calling it directly as
#   (function() {...})()
# Note that we can get the names by parsing
# in the usual manner and the entire document and then executing
#   xmlSApply(xmlRoot(doc)[[1]], function(x) xmlValue(x[[1]]))
# which is simpler but is more costly in terms of memory.
fileName <- system.file("exampleData", "mtcars.xml", package = "XML")
doc <- xmlTreeParse(fileName,
  handlers = (function() {
    vars <- character(0)
    list(
      variable = function(x, attrs) {
        vars <<- c(vars, xmlValue(x[[1]]))
        print(vars)
      },
      startElement = function(x, attr) {NULL},
      names = function() {vars}
    )
  })())
# [1] "mpg"
# [1] "mpg" "cyl"
# [1] "mpg"  "cyl"  "disp"
# [1] "mpg"  "cyl"  "disp" "hp"
# [1] "mpg"  "cyl"  "disp" "hp"   "drat"
# [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"
# [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec"
# [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"
# [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"
# [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
# [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear" "carb"

# Here we just print the variable names to the console
# with a special handler.
doc <- xmlTreeParse(fileName,
  handlers = list(variable = function(x, attrs) {
    print(xmlValue(x[[1]]))
    TRUE
  }),
  asTree = TRUE)
doc <- xmlTreeParse(fileName,
  handlers = list(variable = function(x, attrs) {
    print(xmlValue(x[[1]]))
  }))
# [1] "mpg"
# [1] "cyl"
# [1] "disp"
# [1] "hp"
# [1] "drat"
# [1] "wt"
# [1] "qsec"
# [1] "vs"
# [1] "am"
# [1] "gear"
# [1] "carb"
try(xmlTreeParse(system.file("exampleData", "TestInvalid.xml", package = "XML"),
                 validate = TRUE))
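# Parse an XML document directly from a URL.
# Requires Internet access.
xmlTreeParse("http://www.omegahat.net/Scripts/Data/mtcars.xml", asText = TRUE)
# Error: XML content does not seem to be XML: 'http://www.omegahat.net/Scripts/Data/mtcars.xml'
# (With asText = TRUE the first argument is treated as the XML content itself,
# so the URL string is parsed as text rather than fetched.)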
counter = function() {
  counts = integer(0)
  list(
    startElement = function(node) {
      name = xmlName(node)
      if (name %in% names(counts))
        counts[name] <<- counts[name] + 1   # seen before: increment
      else
        counts[name] <<- 1                  # first occurrence
    },
    counts = function() counts
  )
}
h = counter()
invisible(xmlParse(system.file("exampleData", "mtcars.xml", package = "XML"),
                   handlers = h))
h$counts()
#  variable variables    record   dataset
#        22         2        64         2
getLinks = function() {
  links = character()
  list(
    a = function(node, ...) {
      links <<- c(links, xmlGetAttr(node, "href"))
      node
    },
    links = function() links
  )
}
h1 = getLinks()
invisible(htmlTreeParse(system.file("examples", "index.html", package = "XML"),
                        handlers = h1))
h1$links()
#  [1] "XML_0.97-0.tar.gz"
#  [2] "XML_0.97-0.zip"
#  [3] "XML_0.97-0.tar.gz"
#  [4] "XML_0.97-0.zip"
#  [5] "Overview.html"
#  [6] "manual.pdf"
#  [7] "Tour.pdf"
#  [8] "description.pdf"
#  [9] "WritingXML.html"
# [10] "FAQ.html"
# [11] "Changes"
# [12] "http://cm.bell-labs.com/stat/duncan"
# [13] "mailto:duncan@wald.ucdavis.edu"
h2 = getLinks()
htmlTreeParse(system.file("examples", "index.html", package = "XML"),
              handlers = h2, useInternalNodes = TRUE)
all(h1$links() == h2$links())
# [1] TRUE
# Using flat trees
tt = xmlHashTree()
f = system.file("exampleData", "mtcars.xml", package = "XML")
xmlTreeParse(f, handlers = list(.startElement = tt[[".addNode"]]))
#### This printed the handler function itself; adding asTree = TRUE did not seem to change anything either
tt   # a command I added myself
# <variable/>
xmlRoot(tt)
# <variable/>
> class(tt)
[1] "XMLHashTree"         "XMLAbstractDocument"
function (nodes = list(), parents = character(), children = list(),
    env = new.env(TRUE, parent = emptyenv()))
{
    .count = 0
    env$.children = .children = new.env(TRUE)
    env$.parents = .parents = new.env(TRUE)
    f = function(suggestion = "") {
        if (suggestion == "" || exists(suggestion, env, inherits = FALSE))
            as.character(.count + 1)
        else suggestion
    }
    assign(".nodeIdGenerator", f, env)
    addNode = function(node, parent = character(), ..., attrs = NULL,
        namespace = NULL, namespaceDefinitions = character(),
        .children = list(...), cdata = FALSE,
        suppressNamespaceWarning = getOption("suppressXMLNamespaceWarning", FALSE)) {
        if (is.character(node))
            node = xmlNode(node, attrs = attrs, namespace = namespace,
                namespaceDefinitions = namespaceDefinitions)
        .kids = .children
        .children = .this$.children
        node = asXMLTreeNode(node, .this, className = "XMLHashTreeNode")
        id = node$id
        assign(id, node, env)
        .count <<- .count + 1
        if (!inherits(parent, "XMLNode") && (!is.environment(parent) &&
            length(parent) == 0) || parent == "")
            return(node)
        if (inherits(parent, "XMLHashTreeNode"))
            parent = parent$id
        if (length(parent)) {
            assign(id, parent, envir = .parents)
            if (exists(parent, .children, inherits = FALSE))
                tmp = c(get(parent, .children), id)
            else tmp = id
            assign(parent, tmp, .children)
        }
        return(node)
    }
    env$.addNode <- addNode
    .tidy = function() {
        idx <- idx - 1
        length(nodeSet) <- idx
        length(nodeNames) <- idx
        names(nodeSet) <- nodeNames
        .nodes <<- nodeSet
        idx
    }
    .this = structure(env, class = oldClass("XMLHashTree"))
    .this
}
f = system.file("exampleData", "mtcars.xml", package = "XML")
doc = xmlTreeParse(f, useInternalNodes = TRUE)
sapply(getNodeSet(doc, "//variable"), xmlValue)
#  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"
# [10] "gear" "carb"
# character set encoding for HTML
f = system.file("exampleData", "9003.html", package = "XML")
# we specify the encoding
d = htmlTreeParse(f, encoding = "UTF-8")
# get a different result if we do not specify any encoding
d.no = htmlTreeParse(f)
# document with its encoding in the HEAD of the document.
d.self = htmlTreeParse(system.file("exampleData", "9003-en.html", package = "XML"))
# XXX want to do a test here to see the similarities between d and
# d.self and differences between d.no
On encoding and decoding
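Why the declared encoding changes the result can be seen on a handful of bytes, without the XML package. This is a minimal base-R sketch; the byte values are an invented illustration and are not taken from 9003.html.

```r
# The single byte 0xE9 is "é" in Latin-1, but it is not valid UTF-8 on its own.
latin1_bytes <- as.raw(c(0x63, 0x61, 0x66, 0xe9))   # "café" encoded as Latin-1
latin1_text  <- rawToChar(latin1_bytes)

# Declare the source encoding explicitly and convert to UTF-8.
utf8_text  <- iconv(latin1_text, from = "latin1", to = "UTF-8")
utf8_bytes <- charToRaw(utf8_text)

# In UTF-8 the same character takes two bytes (c3 a9).
print(utf8_bytes)
```

A parser that guesses the wrong encoding reads such bytes as different characters, which is presumably why d and d.no above can differ.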
<x xmlns:xinclude="http://www.w3.org/2001/XInclude">
  <!-- Simple test of including a set of nodes from an XML document -->
  <xinclude:include href="something.xml#xpointer(//p)"/>
</x>
<x xmlns:xinclude="http://www.w3.org/2001/XInclude">
  <!-- Simple test of including a set of nodes from an XML document -->
  <xinclude:include href="doesnt_exist.xml#xpointer(//p)">
    <xinclude:fallback>Some <i>fallback text</i></xinclude:fallback>
  </xinclude:include>
</x>
# include
f = system.file("exampleData", "nodes1.xml", package = "XML")
xmlRoot(xmlTreeParse(f, xinclude = FALSE))
# <x xmlns:xinclude="http://www.w3.org/2001/XInclude">
#  <!--Simple test of including a set of nodes from an XML document-->
#  <xinclude:include href="something.xml#xpointer(//p)"/>
# </x>
xmlRoot(xmlTreeParse(f, xinclude = TRUE))
# <x xmlns:xinclude="http://www.w3.org/2001/XInclude">
#  <!--Simple test of including a set of nodes from an XML document-->
#  <p ID="author">something</p>
#  <p>really</p>
#  <p>simple</p>
# </x>
f = system.file("exampleData", "nodes2.xml", package = "XML")
xmlRoot(xmlTreeParse(f, xinclude = TRUE))
# failed to load external entity "D:/RSets/R-3.3.2/library/XML/exampleData/doesnt_exist.xml"
# <x xmlns:xinclude="http://www.w3.org/2001/XInclude">
#  <!--Simple test of including a set of nodes from an XML document-->
#  Some
#  <i>fallback text</i>
# </x>
<doc>
  <p ID="author">something</p>
  <p>really</p>
  <foo>bar</foo>
  <p>simple</p>
</doc>
try(xmlTreeParse("<doc><a> & < <?pi ></doc>"))
# xmlParseEntityRef: no name
# StartTag: invalid element name
# ParsePI: PI pi never end ...
# Premature end of data in tag a line 1
# Premature end of data in tag doc line 1
# Error : 1: xmlParseEntityRef: no name
# 2: StartTag: invalid element name
# 3: ParsePI: PI pi never end ...
# 4: Premature end of data in tag a line 1
# 5: Premature end of data in tag doc line 1
tryCatch(xmlTreeParse("<doc><a> & < <?pi > </doc>"),
  "XMLParserErrorList" = function(e) {
    cat("Errors in XML document\n", e$message, "\n")
  })
# xmlParseEntityRef: no name
# StartTag: invalid element name
# ParsePI: PI pi never end ...
# Premature end of data in tag a line 1
# Premature end of data in tag doc line 1
# Error : in XML document
# 1: xmlParseEntityRef: no name
# 2: StartTag: invalid element name
# 3: ParsePI: PI pi never end ...
# 4: Premature end of data in tag a line 1
# 5: Premature end of data in tag doc line 1
try(xmlTreeParse("<doc><a> & < <?pi > </doc>", error = NULL))
# Error: xmlParseEntityRef: no name
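In a scraping script it is often convenient to turn such parse failures into a NULL result instead of stopping. The sketch below assumes the XML package is installed; safe_parse is a hypothetical helper name invented here, and it relies only on the fact (visible in the try() output above) that the parser signals an ordinary R error.

```r
library(XML)

# Hypothetical helper: return the parsed document, or NULL plus a message on failure.
safe_parse <- function(text) {
  tryCatch(
    xmlParse(text, asText = TRUE),
    error = function(e) {
      message("Could not parse XML: ", conditionMessage(e))
      NULL
    }
  )
}

good <- safe_parse("<doc><a>ok</a></doc>")        # well-formed input
bad  <- safe_parse("<doc><a> & < <?pi > </doc>")  # the malformed input from above

print(is.null(good))
print(is.null(bad))
```

Catching the generic error class is coarser than the "XMLParserErrorList" handler shown earlier, but it also covers other failures such as a missing file.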
f = system.file("exampleData", "book.xml", package = "XML")
doc.trim = xmlInternalTreeParse(f, trim = TRUE)
doc = xmlInternalTreeParse(f, trim = FALSE)
xmlSApply(xmlRoot(doc.trim), class)
#      chapter                  chapter
# [1,] "XMLInternalElementNode" "XMLInternalElementNode"
# [2,] "XMLInternalNode"        "XMLInternalNode"
# [3,] "XMLAbstractNode"        "XMLAbstractNode"
xmlSApply(xmlRoot(doc), class)
#      text                  chapter
# [1,] "XMLInternalTextNode" "XMLInternalElementNode"
# [2,] "XMLInternalNode"     "XMLInternalNode"
# [3,] "XMLAbstractNode"     "XMLAbstractNode"
#      text                  chapter
# [1,] "XMLInternalTextNode" "XMLInternalElementNode"
# [2,] "XMLInternalNode"     "XMLInternalNode"
# [3,] "XMLAbstractNode"     "XMLAbstractNode"
#      text
# [1,] "XMLInternalTextNode"
# [2,] "XMLInternalNode"
# [3,] "XMLAbstractNode"
Oddly enough, although xmlInternalTreeParse appears on this same help page, the page gives no explanation of it at all.
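One way to get a feel for an undocumented function is to compare its result with that of a documented one. The sketch below assumes the XML package is installed and only checks that, for a small made-up document, xmlInternalTreeParse returns the same kind of object as xmlParse. That the two behave alike in general is an assumption here, not something the help page states.

```r
library(XML)

# A tiny invented document, parsed from a string.
xml_text <- "<book><chapter><title>XML</title></chapter></book>"

via_internal <- xmlInternalTreeParse(xml_text, asText = TRUE)
via_parse    <- xmlParse(xml_text, asText = TRUE)

# Both results are internal (C-level) documents that can be queried with XPath.
print(class(via_internal))
same_class <- identical(class(via_internal), class(via_parse))
titles <- xpathSApply(via_internal, "//title", xmlValue)
print(same_class)
print(titles)
```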
f = system.file("exampleData", "book.xml", package = "XML")
titles = list()
xmlTreeParse(f, handlers = list(title = function(x)
  titles[[length(titles) + 1]] <<- x))
# $title   (this is the printed output)
# function (x)
# titles[[length(titles) + 1]] <<- x
sapply(titles, xmlValue)
# [1] "XML"
# [2] "The elements of an XML document"
# [3] "Parsing XML"
# [4] "DOM"
# [5] "SAX"
# [6] "XSL"
# [7] "templates"
# [8] "XPath expressions"
# [9] "named templates"
rm(titles)