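Automated Data Collection with R, Chapter 2 (HTML), Notes 2: mainly about the htmlTreeParse function

This note covers the following small topics:

1 The source code of htmlTreeParse and some of its arguments

2 How to write handlers

3 The missing function

4 Braces around the else branch of an if-else statement

5 The checkHandlerNames function

6 The GeneralHandlerNames object

7 The match function

8 The inherits function

9 The on.exit function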

===============================================================================

Let's try reading the function's source code anyway:

The htmlTreeParse function

function (file, ignoreBlanks = TRUE, handlers = NULL, replaceEntities = FALSE,
asText = FALSE, trim = TRUE, validate = FALSE, getDTD = TRUE,
isURL = FALSE, asTree = FALSE, addAttributeNamespaces = FALSE,
useInternalNodes = FALSE, isSchema = FALSE, fullNamespaceInfo = FALSE,
encoding = character(), useDotNames = length(grep("^\\.",
names(handlers)))>0, xinclude = TRUE, addFinalizer = TRUE,
error = htmlErrorHandler, isHTML = TRUE, options = integer(),
parentFirst = FALSE)
{
# if asText = TRUE, file is treated as the XML/HTML text itself rather than a file name
isMissingAsText = missing(asText)
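# if file has more than one element, paste them into one string; an error is raised only when asText was explicitly passed as FALSE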
if(length(file)>1){
file = paste(file, collapse ="\n")
if(!missing(asText)&&!asText)
stop(structure(list(message ="multiple URLs passed to xmlTreeParse. If this is the content of the file, specify asText = TRUE"),
class= c("MultipleURLError","XMLParserError",
"simpleError","error","condition")))
asText = TRUE
}
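# if isURL was not supplied and asText is FALSE, isURL becomes the number of matches of the URL pattern
# for example, isURL = 1 for a single http/ftp/file URL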
if(missing(isURL)&&!asText)
isURL <- length(grep("^(http|ftp|file)://", file, useBytes = TRUE,
perl = TRUE))
# isHTML defaults to TRUE
if(isHTML){
validate = FALSE
getDTD = FALSE
isSchema = FALSE
docClass ="HTMLInternalDocument"
}
else docClass = character()
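# checkHandlerNames returns a logical value; it warns about handler names that clash with the general handler names and about entries that are not functions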
checkHandlerNames(handlers,"DOM")
if(missing(fullNamespaceInfo)&& inherits(handlers,"RequiresNamespaceInfo"))
fullNamespaceInfo = TRUE
oldValidate = xmlValidity()
xmlValidity(validate)
on.exit(xmlValidity(oldValidate))
if(!asText && isURL == FALSE){
if(file.exists(file)== FALSE)
if(!missing(asText)&& asText == FALSE){
e = simpleError(paste("File", file,"does not exist"))
class(e)= c("FileNotFound",class(e))
stop(e)
}
else asText <- TRUE
}
if(asText && length(file)>1)
file = paste(file, collapse ="\n")
old = setEntitySubstitution(replaceEntities)
on.exit(setEntitySubstitution(old), add = TRUE)
if(asText && length(grep(sprintf("^%s?\\s*<",BOMRegExp),
file, perl = TRUE, useBytes = TRUE))==0){
if(!isHTML ||(isMissingAsText &&!inherits(file,"AsIs"))){
e = simpleError(paste("XML content does not seem to be XML:",
sQuote(file)))
class(e)= c("XMLInputError",class(e))
(if(isHTML)
warning
else stop)(e)
}
}
if(!is.logical(xinclude)){
xinclude =as.logical(xinclude)
}
if(!asText &&!isURL)
file = path.expand(as.character(file))
if(useInternalNodes && trim){
prevBlanks =.Call("RS_XML_setKeepBlanksDefault",0L,
PACKAGE ="XML")
on.exit(.Call("RS_XML_setKeepBlanksDefault", prevBlanks,
PACKAGE ="XML"), add = TRUE)
}
.oldErrorHandler = setXMLErrorHandler(error)
on.exit(.Call("RS_XML_setStructuredErrorHandler",.oldErrorHandler,
PACKAGE ="XML"), add = TRUE)
if(length(options))
options = sum(options)
ans <-.Call("RS_XML_ParseTree",as.character(file), handlers,
as.logical(ignoreBlanks),as.logical(replaceEntities),
as.logical(asText),as.logical(trim),as.logical(validate),
as.logical(getDTD),as.logical(isURL),as.logical(addAttributeNamespaces),
as.logical(useInternalNodes),as.logical(isHTML),as.logical(isSchema),
as.logical(fullNamespaceInfo),as.character(encoding),
as.logical(useDotNames), xinclude, error, addFinalizer,
as.integer(options),as.logical(parentFirst), PACKAGE ="XML")
if(!missing(handlers)&& length(handlers)&&!as.logical(asTree))
return(handlers)
if(!isSchema && length(class(ans)))
class(ans)= c(docClass, oldClass(class(ans)))
if(inherits(ans,"XMLInternalDocument"))
addDocFinalizer(ans, addFinalizer)
else if(!getDTD &&!isSchema){
class(ans)= oldClass("XMLDocumentContent")
}
ans
}
<environment: namespace:XML>

Looking through the whole source of htmlTreeParse, the handlers argument is used in only a few places.

Here is how the xmlTreeParse {XML} help page describes that argument:

The handlers argument:

Optional collection of functions used to map the different XML nodes to R objects. Typically, this is a named list of functions, and a closure can be used to provide local data. This provides a way of filtering the tree as it is being created in R, adding or removing nodes, and generally processing them as they are constructed in the C code.

In a recent addition to the package (version 0.99-8), if this is specified as a single function object, we call that function for each node (of any type) in the underlying DOM tree. It is invoked with the new node and its parent node. This applies to regular nodes and also comments, processing instructions, CDATA nodes, etc. So this function must be sufficiently general to handle them all.

The asTree argument

this only applies when on passes a value for the handlers argument and is used then to determine whether the DOM tree should be returned or the handlers object.

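Set this argument to TRUE when handler functions are passed and you want the processed DOM tree back; otherwise the function returns the handlers object itself.

The useDotNames argument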

a logical value indicating whether to use the newer format for identifying general element function handlers with the '.' prefix, e.g. .text, .comment, .startElement. If this is FALSE, then the older format text, comment, startElement, ... are used. This causes problems when there are indeed nodes named text or comment or startElement as a node-specific handler are confused with the corresponding general handler of the same name [my note: if a handler is named startElement and some node, presumably a custom tag, is also called startElement, the two cannot be told apart]. Using TRUE means that your list of handlers should have names that use the '.' prefix for these general element handlers. This is the preferred way to write new code.

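TRUE means the new format (general handler names start with a '.') is used to tell the two kinds apart.

FALSE means the old format is used.

This is easy to understand alongside the default value in the signature:

useDotNames = length(grep("^\\.", names(handlers))) > 0

It takes the names of the handler functions and checks whether any of them starts with a dot. If the number of matches is not greater than 0, there are none, so the value is FALSE.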
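A minimal sketch of that default in base R. The two handler lists and the helper name uses_dot_names are made up for illustration; only the grep expression comes from the signature above.

```r
# Hypothetical handler lists: old-style names vs. dot-prefixed names
handlers_old <- list(
  startElement = function(node, ...) node,
  comment      = function(node) NULL
)
handlers_new <- list(
  .startElement = function(node, ...) node,
  .comment      = function(node) NULL
)

# Same expression as the default value of useDotNames
uses_dot_names <- function(handlers) {
  length(grep("^\\.", names(handlers))) > 0
}

uses_dot_names(handlers_old)  # FALSE
uses_dot_names(handlers_new)  # TRUE
uses_dot_names(NULL)          # FALSE: names(NULL) is NULL, so grep finds nothing
```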

The isURL argument

indicates whether the file argument refers to a URL (accessible via ftp or http) or a regular file on the system. If asText is TRUE, this should not be specified. The function attempts to determine whether the data source is a URL by using grep to look for http or ftp at the start of the string. The libxml parser handles the connection to servers, not the R facilities (e.g. scan).
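The detection described here can be tried on its own. This is a small sketch that reuses the grep call from the listing above; the helper name count_url_matches and the example.org address are assumptions for illustration.

```r
# The same test the function applies when isURL is not supplied
count_url_matches <- function(file) {
  length(grep("^(http|ftp|file)://", file, useBytes = TRUE, perl = TRUE))
}

count_url_matches("http://www.r-datacollection.com/materials/html/fortunes.html")  # 1
count_url_matches("fortunes.html")               # 0, treated as a local file name
count_url_matches("https://example.org/a.html")  # 0, "https" is not in this pattern
```

With the pattern exactly as it appears in this listing, an https:// address is not recognized as a URL; other versions of the package may use a different pattern.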

How to write handlers, from the Details section of the documentation:

The handlers argument is used similarly to those specified in xmlEventParse. When an XML tag (element) is processed, we look for a function in this collection with the same name as the tag's name. If this is not found, we look for one named startElement. If this is not found, we use the default built in converter. The same works for comments, entity references, cdata, processing instructions, etc. The default entries should be named comment, startElement, externalEntity, processingInstruction, text, cdata and namespace. All but the last should take the XMLnode as their first argument. In the future, other information may be passed via ..., for example, the depth in the tree, etc. Specifically, the second argument will be the parent node into which they are being added, but this is not currently implemented, so should have a default value (NULL).

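When a tag is processed, the collection is searched 1) for a function with the same name as the tag, 2) then for startElement, and only after that is the default converter used.

The same applies when a comment or a similar node is processed: for a comment, a function named comment is looked up first.

So the names of the functions in handlers (the names of the list components) have to follow these rules.

The node must be the first argument of these functions.

This passage is very good: it makes clear how handler functions should be written.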
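The lookup order can be mimicked in plain R. This is only an illustration of the rule quoted from the documentation, not the package's C implementation; pick_handler and the sample handlers are hypothetical.

```r
# Hypothetical stand-in for the lookup the documentation describes:
# 1) a handler named after the tag, 2) startElement, 3) a built-in default.
pick_handler <- function(tag, handlers, default = function(node, ...) node) {
  if (is.function(handlers[[tag]])) return(handlers[[tag]])
  if (is.function(handlers[["startElement"]])) return(handlers[["startElement"]])
  default
}

handlers <- list(
  div          = function(node, ...) "handled by div",
  startElement = function(node, ...) "handled by startElement"
)

pick_handler("div", handlers)(NULL)  # "handled by div"
pick_handler("p",   handlers)(NULL)  # "handled by startElement"
pick_handler("p",   list())(NULL)    # NULL: the default returns the node unchanged
```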

This blog post also covers the meaning of these arguments and is worth a look when you have time:

http://blog.163.com/zhoulili1987619@126/blog/static/35308201201531511273389/

=======================================================================

#test
# load the package
library(XML)
# the URL to parse
url <-"http://www.r-datacollection.com/materials/html/fortunes.html"
# handler functions
h2 <- list(
startElement = function(node,...){
name <- xmlName(node)
if(name %in% c("div","title")){NULL}else{node}
},
comment = function(node){NULL}
)
# the real work starts here: the function's arguments
file=url
ignoreBlanks = TRUE
handlers = h2
replaceEntities = FALSE
asText = FALSE
trim = TRUE
validate = FALSE
getDTD = TRUE
isURL = FALSE
asTree = TRUE
addAttributeNamespaces = FALSE
useInternalNodes = FALSE
isSchema = FALSE
fullNamespaceInfo = FALSE
encoding = character()
useDotNames = length(grep("^\\.", names(handlers)))>0
xinclude = TRUE
addFinalizer = TRUE
error = XML:::htmlErrorHandler
isHTML = TRUE
options = integer()
parentFirst = FALSE
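# function body
# asText was not passed in the call,
# so missing(asText) is replaced by TRUE
isMissingAsText = TRUE
# if file has more than one element they are pasted together; the "multiple URLs" error
# is raised only when asText was explicitly passed as FALSE
# length(file) is 1 here, so this block is skipped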
if(length(file)>1){
file = paste(file, collapse ="\n")
if(!missing(asText)&&!asText)
stop(structure(list(message ="multiple URLs passed to xmlTreeParse. If this is the content of the file, specify asText = TRUE"),
class= c("MultipleURLError","XMLParserError",
"simpleError","error","condition")))
asText = TRUE
}
# originally: if (missing(isURL) && !asText)
# changed to: if (TRUE && !asText)
# runs only when isURL was not passed and asText is FALSE
if(TRUE &&!asText)
isURL <- length(grep("^(http|ftp|file)://", file, useBytes = TRUE,
perl = TRUE))
# a string that starts with one of http://, ftp:// or file:// counts as a URL
# here isURL = 1
# is it HTML? yes
if(isHTML){
validate = FALSE
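getDTD = FALSE # changed from the default TRUE to FALSE
isSchema = FALSE
docClass ="HTMLInternalDocument"
}else{docClass = character()} # otherwise an empty character vector
# class(docClass)
# returns TRUE when no handler name clashes with a general handler name;
# problems are reported as warnings, the function is not stopped
XML:::checkHandlerNames(handlers,"DOM")
# fullNamespaceInfo: whether to provide the namespace URI and prefix on each node,
# that is, whether each node carries its namespace information
# this step runs only if fullNamespaceInfo was not passed and handlers inherits from RequiresNamespaceInfo
# missing(fullNamespaceInfo) is replaced by TRUE
# handlers does not have the class RequiresNamespaceInfo, so the assignment is skipped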
if( TRUE && inherits(handlers,"RequiresNamespaceInfo"))
fullNamespaceInfo = TRUE
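# both of the next two calls return integer(0)
# neat: first save the existing setting
oldValidate = XML:::xmlValidity()
# apply the current setting
XML:::xmlValidity(validate)
# restore the previous setting on exit
on.exit(XML:::xmlValidity(oldValidate))
# runs only if asText is FALSE and isURL is FALSE
# the condition is FALSE here, so we can ignore this block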
if(!asText && isURL == FALSE){
if(file.exists(file)== FALSE)
# if the local file does not exist
if(!missing(asText)&& asText == FALSE){
# raise an error: the file does not exist
e = simpleError(paste("File", file,"does not exist"))
class(e)= c("FileNotFound",class(e))
stop(e)
}
else asText <- TRUE
}
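# asText is FALSE at this point, so this does not concern us
if(asText && length(file)>1)
file = paste(file, collapse ="\n")
# replaceEntities defaults to FALSE
old = XML:::setEntitySubstitution(replaceEntities)
# old is FALSE
on.exit(XML:::setEntitySubstitution(old), add = TRUE)
# BOMRegExp is presumably a built-in constant;
# judging by the name, a regular expression for the byte order mark (BOM)
# asText is FALSE, so we skip this block as well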
if(asText && length(grep(
sprintf("^%s?\\s*<", XML:::BOMRegExp),
file, perl = TRUE, useBytes = TRUE
)
)==0)
{
if(!isHTML ||(isMissingAsText &&!inherits(file,"AsIs"))){
e = simpleError(paste("XML content does not seem to be XML:",
sQuote(file)))
class(e)= c("XMLInputError",class(e))
(if(isHTML)
warning
else stop)(e)
}
}
# xinclude defaults to TRUE
# all three if conditions below are FALSE, so we skip them
if(!is.logical(xinclude)){
xinclude =as.logical(xinclude)
}
if(!asText &&!isURL)
file = path.expand(as.character(file))
if(useInternalNodes && trim){
prevBlanks =.Call("RS_XML_setKeepBlanksDefault",0L,
PACKAGE ="XML")
on.exit(.Call("RS_XML_setKeepBlanksDefault", prevBlanks,
PACKAGE ="XML"), add = TRUE)
}
.oldErrorHandler = XML:::setXMLErrorHandler(error)
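# so what is this dot-prefixed name about? (a leading dot only hides the object from ls(); it is an ordinary variable)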
# class(.oldErrorHandler)
# [1] "list"
on.exit(.Call("RS_XML_setStructuredErrorHandler",.oldErrorHandler,
PACKAGE ="XML"), add = TRUE)
# length(options) is 0, so this is skipped
if(length(options))
options = sum(options)
# call the C routine named RS_XML_ParseTree
getAnywhere("RS_XML_ParseTree")
ans <-.Call("RS_XML_ParseTree",as.character(file), handlers,
as.logical(ignoreBlanks),as.logical(replaceEntities),
as.logical(asText),as.logical(trim),as.logical(validate),
as.logical(getDTD),as.logical(isURL),as.logical(addAttributeNamespaces),
as.logical(useInternalNodes),as.logical(isHTML),as.logical(isSchema),
as.logical(fullNamespaceInfo),as.character(encoding),
as.logical(useDotNames), xinclude, error, addFinalizer,
as.integer(options),as.logical(parentFirst), PACKAGE ="XML")
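print(ans)
print("------- separator ------------------")
# missing(handlers) is left unchanged here:
# handlers exists as a variable in this session, so !missing(handlers) is TRUE,
# the same result as when handlers is really passed to the function
if(!missing(handlers)&& length(handlers)&&!as.logical(asTree))
return("hehe")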
if(!isSchema && length(class(ans)))
class(ans)= c(docClass, oldClass(class(ans)))
if(inherits(ans,"XMLInternalDocument")){
XML:::addDocFinalizer(ans, addFinalizer)
}else if(!getDTD &&!isSchema){
print("class before and after")
print(class(ans))
class(ans)= oldClass("XMLDocumentContent")
print(class(ans))
}
print("logical values of the conditions")
print("1")
print(!missing(handlers)&& length(handlers)&&!as.logical(asTree))
print("2")
print(!isSchema && length(class(ans)))
print("3")
print(inherits(ans,"XMLInternalDocument"))
print("4")
print(!getDTD &&!isSchema)
print(class(ans))
ans

The code above is what I used to step through the function and understand it.

Note 1: the missing function

missing(asText) does not test whether asText lacks a value. It tests whether an actual argument was supplied for the formal argument asText when the function was called. So even though asText has a default value, missing(asText) still returns TRUE when the caller did not pass it.

# Example 1
testMissing<-function(a=TRUE,b=FALSE){
if(missing(b))
return("b is missing")
else paste("b is here:", b)
}
testMissing(F)
# [1] "b is missing"
# Example 2
if(missing(b))
return("b is missing")
# Error in missing(b) : 'missing' can only be used for arguments
# Example 3
b=NULL
if(missing(b))
return("b is missing")
# no output
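To see the difference between "not passed" and "has no value" side by side, here is a small sketch. describe_arg is a hypothetical function that only borrows the argument name asText from htmlTreeParse.

```r
# Hypothetical function mirroring the asText argument of htmlTreeParse
describe_arg <- function(asText = FALSE) {
  list(was_missing = missing(asText), value = asText)
}

describe_arg()                # was_missing TRUE,  value FALSE (default used)
describe_arg(asText = FALSE)  # was_missing FALSE, value FALSE (passed explicitly)
```

Both calls see the same value, which is why the package stores missing(asText) in isMissingAsText before it starts changing asText.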

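Note 2: braces around the else branch of an if-else statement

Example 1

if(isHTML){
validate = FALSE
getDTD = FALSE # changed from the default TRUE to FALSE
isSchema = FALSE
docClass ="HTMLInternalDocument"
}else{docClass = character()} # otherwise an empty character vector

or

}else docClass = character()

The closing brace in front of else matters. In interactive mode the R parser relies on the right brace immediately before else, on the same line, to recognize an if-else construct rather than a complete if statement.

See The Art of R Programming (Chinese edition), p. 20.

Example 2

function(){
if(isHTML){
validate = FALSE
getDTD = FALSE # changed from the default TRUE to FALSE
isSchema = FALSE
docClass ="HTMLInternalDocument"
}
else docClass = character() # otherwise an empty character vector
}

Inside a function body, writing it this way is fine.
At top level, outside any braces, it fails with:

Error: unexpected 'else' in " else"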
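The rule can be checked without typing at the console by handing the code to parse() as text. This is a minimal sketch; the helper name parses and the three snippets are made up for illustration.

```r
# Try to parse a piece of code; TRUE if it parses, FALSE on a syntax error
parses <- function(code) {
  tryCatch({ parse(text = code); TRUE }, error = function(e) FALSE)
}

top_level   <- "if (TRUE) {\n  1\n}\nelse 2"
same_line   <- "if (TRUE) {\n  1\n} else 2"
inside_body <- "f <- function() {\n  if (TRUE) {\n    1\n  }\n  else 2\n}"

parses(top_level)    # FALSE: the if statement is already complete before else
parses(same_line)    # TRUE
parses(inside_body)  # TRUE: inside { } the parser keeps reading
```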

Note 3.1: the checkHandlerNames(handlers, "DOM") function

The function is not exported, so its source has to be fetched with getAnywhere(checkHandlerNames).

The last line of the output reads <environment: namespace:XML>,

so it can be called as XML:::checkHandlerNames().

function (handlers, id ="SAX")
{
if(is.null(handlers)) # NULL handlers: return TRUE
return(TRUE)
ids = names(handlers) # names of the functions in handlers
i = match(ids,GeneralHandlerNames) # positions of the matches, NA where there is none
prob = any(!is.na(i)) # TRUE if at least one name matched
if(prob){
warning("future versions of the XML package will require names of general handler functions to be prefixed by a . to distinguish them from handlers for nodes with those names. This _may_ affect the ",
paste(names(handlers)[!is.na(i)], collapse =", "))
}
# if any entry of handlers is not a function, issue a warning
if(any(w <-!sapply(handlers,is.function)))
warning("some handlers are not functions: ", paste(names(handlers[w]),
collapse =", "))
# TRUE when no name matched; the calling code continues either way
!prob
}
<environment: namespace:XML>

Note 3.2: the GeneralHandlerNames object

True to its name, it holds the names of the general handler functions:

> XML:::GeneralHandlerNames
$SAX
[1] "text"                  "startElement"
[3] "endElement"            "comment"
[5] "startDocument"         "endDocument"
[7] "processingInstruction" "entityDeclaration"
[9] "externalEntity"
$DOM
[1] "text"                  "startElement"
[3] "comment"               "entity"
[5] "cdata"                 "processingInstruction"

There is one set for SAX and one for DOM. Does that bring back DOM and SAX parsing from learning Java?

Note 3.3: the match function

> h2 <- list(
+ startElement = function(node,...){
+ name <- xmlName(node)
+ if(name %in% c("div","title")){NULL}else{node}
+ },
+ comment = function(node){NULL}
+)
> handlers<-h2
> ids = names(handlers)
> ids
[1] "startElement" "comment"
> i = match(ids, XML:::GeneralHandlerNames)
> i
[1] NA NA
>?match
starting httpd help server ... done
>!is.na(i) # TRUE for each element of i that is not NA, FALSE for each NA
[1] FALSE FALSE
> prob = any(!is.na(i)) # TRUE as soon as at least one element of i is not NA
> prob
[1] FALSE

match returns a vector of the positions of (first) matches of its first argument in its second.

What match actually returns is, for each element of its first argument, the position of its first match in the second argument (NA if there is none).

> testMatch<-c("a","c")
> testSet <- c('a','b','c')
> match(testMatch,testSet)
[1] 1 3

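According to the warning message, future versions of XML will require the names of general handler functions to carry a dot (.) prefix, to distinguish them from handlers for nodes that happen to have those names:

future versions of the XML package will require names of general handler functions to be prefixed by a . to distinguish them from handlers for nodes with those names. This _may_ affect the

So I think this match and the if test are a provision for the future. The argument id = "SAX" is never used in the body, and GeneralHandlerNames is a list of two character vectors rather than a single vector, so match(ids, GeneralHandlerNames) gives NA even for names such as startElement (as in the output above) and the warning is not triggered in this version.

This was already mentioned in the description of useDotNames.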
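The NA NA result can be reproduced without the package. In this sketch general_names is a stand-in with the same shape as XML:::GeneralHandlerNames (values abbreviated), not the real object.

```r
# A stand-in with the same shape as XML:::GeneralHandlerNames (values abbreviated)
general_names <- list(
  SAX = c("text", "startElement", "endElement", "comment"),
  DOM = c("text", "startElement", "comment", "entity")
)
ids <- c("startElement", "comment")

match(ids, general_names)           # NA NA: compared against whole list elements
match(ids, general_names[["DOM"]])  # 2 3: positions within the DOM vector
any(!is.na(match(ids, general_names[["DOM"]])))  # TRUE
```

Selecting one component with the id argument, as in the second call, is what would make the check see the clash; whether the package author intended that is a guess on my part.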

Note 4: the inherits function

inherits indicates whether its first argument inherits from any of the classes specified in the what argument. If which is TRUE then an integer vector of the same length as what is returned. Each element indicates the position in the class(x) matched by the element of what; zero indicates no match. If which is FALSE then TRUE is returned by inherits if any of the names in what match with any class.

what, value

a character vector naming classes. value can also be NULL.

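In other words, inherits reports whether the first argument has any of the classes named in what.

If which is TRUE, the result is an integer vector as long as what; each element is the position in class(x) of the corresponding entry of what, and 0 means no match.

If which is FALSE, the result is TRUE as soon as any name in what matches one of the classes.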

x <-10
class(x)# "numeric"
oldClass(x)# NULL
# judging from the documentation, oldClass looks to me like a leftover from the S language
inherits(x,"a")#FALSE
class(x)<- c("a","b")
# x
# [1] 10
# attr(,"class")
# [1] "a" "b"
# that is, the variable x with value 10 now has two classes, "a" and "b"
inherits(x,"a")#TRUE
inherits(x,"a", TRUE)# 1
inherits(x, c("a","b","c"), TRUE)# 1 2 0
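The same mechanism is what makes the package's custom errors useful. The sketch below builds an error the way the htmlTreeParse listing does for a missing file; make_file_error and the file name are assumptions for illustration.

```r
# Build an error the way the listing does for a missing file
make_file_error <- function(file) {
  e <- simpleError(paste("File", file, "does not exist"))
  class(e) <- c("FileNotFound", class(e))
  e
}

e <- make_file_error("no_such_file.html")
class(e)                     # "FileNotFound" "simpleError" "error" "condition"
inherits(e, "FileNotFound")  # TRUE
inherits(e, c("XMLInputError", "error"), which = TRUE)  # 0 3

# tryCatch picks a handler by class, so the extra class can be caught on its own
result <- tryCatch(stop(e), FileNotFound = function(err) "caught FileNotFound")
result                       # "caught FileNotFound"
```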

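Note 5.1: a few lines that are hard to understand

# both of the next two calls return integer(0)
# neat: first save the existing setting
oldValidate = XML:::xmlValidity()
# apply the current setting
XML:::xmlValidity(validate)
# restore the previous setting on exit
on.exit(XML:::xmlValidity(oldValidate))

First, a look at the xmlValidity function: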

> XML:::xmlValidity()
integer(0)
> getAnywhere(xmlValidity)
A single object matching ‘xmlValidity’ was found
It was found in the following places
namespace:XML
with value
function (val = integer(0))
{
.Call("RS_XML_getDefaultValiditySetting",as.integer(val),
PACKAGE ="XML")
}
<environment: namespace:XML>

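Unfortunately there is no more information than that.

Judging by its name, RS_XML_getDefaultValiditySetting reads the default validity setting, which here presumably means whether documents are validated. Since the new value is passed in as val, it apparently sets it as well.

Next, let's look at: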

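Note 5.2.1: the on.exit function

on.exit records the expression given as its argument as needing to be executed when the current function exits (either naturally or as the result of an error). This is useful for resetting graphical parameters or performing other cleanup actions.

In short: the expression is stored and run when the enclosing function exits, whether normally or through an error, which makes it suited to resetting graphical parameters and other cleanup.

Example 1

> opar <- par(bg='lightblue')
> on.exit(par(opar))
> plot(c(1,2,3),c(4,5,6)) # blue background
> plot(c(1,2,3,4,5),runif(5)) # blue background
# now close the plot window
> plot(c(1,2,3,4,5),rnorm(5))
# white background (closing the window discards that device's par settings)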

Example 2

plot_with_big_margins <- function(...)
{
old_pars <- par(mar = c(10,9,9,7))
on.exit(par(old_pars))
plot(...)
}
plot_with_big_margins(cars$speed, cars$dist)
# leave the plot window open and run:
plot(c(1,2,3),c(4,5,6))
# Compare with a version that does not use on.exit:
plot_with_big_margins <- function(...)
{
par(mar = c(10,9,9,7))
plot(...)
}
plot_with_big_margins(cars$speed, cars$dist)
# leave the plot window open and run:
plot(c(1,2,3),c(4,5,6))

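With on.exit, the margin setting stops applying as soon as the enclosing function exits (note that par() settings apply to every plot drawn in the current window for as long as that window is not cleared or closed).

"When the function exits" refers to the function that contains the on.exit() call: the expression runs when that function ends.

But the code looks a little odd, doesn't it?

old_pars <- par(mar = c(10,9,9,7))
on.exit(par(old_pars))

Aren't we trying to restore the old values? Then why is old_pars, which comes from a call that sets the new margins, passed to par()?

In fact: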

> old_pars <- par(mar = c(10,9,9,7))
> old_pars
$mar
[1] 5.1 4.1 4.1 2.1
> op <- options(stringsAsFactors = FALSE)
> op
$stringsAsFactors
[1] TRUE

What the call returns is the parameters as they were before the change, not the parameters after it.
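The same save-and-restore pattern can be tried without a graphics device, because options() also returns the previous values when it sets new ones. A minimal sketch; print_rounded is a hypothetical helper.

```r
# options() behaves like par(): setting a value returns the previous one
print_rounded <- function(x) {
  old <- options(digits = 3)  # old holds the value from before the change
  on.exit(options(old))       # put it back when the function exits
  format(x)
}

before <- getOption("digits")
print_rounded(pi)                       # "3.14"
identical(getOption("digits"), before)  # TRUE: the setting was restored
```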

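We can verify this conclusion. Restart R, then run:

par("mar") # 5.1 4.1 4.1 2.1
plot_with_big_margins <- function(...)
{
old_pars <- par(mar = c(10,9,9,7)) # the old parameters are saved, the new ones take effect
print(par("mar")) # 10 9 9 7
on.exit(par(old_pars)) # the parameters are restored when the function exits
plot(...)
}
plot_with_big_margins(cars$speed, cars$dist)

par("mar") # 5.1 4.1 4.1 2.1

See? That is how it resets the original parameters.

Example 3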



my_plot <- function()
{
  with(cars, plot(speed, dist))
}
save_base_plot <- function(plot_fn, file)
{
  png(file)
  on.exit(dev.off())
  plot_fn()
}
save_base_plot(my_plot,"testcars.png")



This Stack Overflow answer explains it in great detail:
http://stackoverflow.com/questions/28300713/how-and-when-should-i-use-on-exit
 
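Note 5.2.2: the on.exit function

So, in the function body,

  # replaceEntities defaults to FALSE
  old = XML:::setEntitySubstitution(replaceEntities)
  # old is FALSE
  on.exit(XML:::setEntitySubstitution(old), add = TRUE)

these lines do the same kind of thing: save the old setting, apply the new one, and restore the old one on exit. add = TRUE appends this expression to the ones already registered instead of replacing them.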
 
 


> XML:::setEntitySubstitution
function (val) 
.Call("RS_XML_SubstituteEntitiesDefault",as.logical(val), PACKAGE ="XML")
<environment: namespace:XML>
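Since htmlTreeParse registers several cleanups with add = TRUE, it helps to see how they stack up. The sketch below uses a made-up function demo and a global vector events to record what runs; nothing here comes from the package.

```r
events <- character()

demo <- function(fail = FALSE) {
  on.exit(events <<- c(events, "first cleanup"))
  on.exit(events <<- c(events, "second cleanup"), add = TRUE)  # appended, not replaced
  if (fail) stop("something went wrong")
  "done"
}

res <- demo()                          # normal exit
try(demo(fail = TRUE), silent = TRUE)  # exit through an error
events
# "first cleanup" "second cleanup" "first cleanup" "second cleanup"
```

Both cleanups run in the order they were registered, and they run again when the function leaves through stop().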

 
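Note 5.2.3: the on.exit function



.oldErrorHandler = XML:::setXMLErrorHandler(error)
  # so what is this dot-prefixed name about? (a leading dot only hides the object from ls(); it is an ordinary variable)
  # class(.oldErrorHandler)
  # [1] "list"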
  on.exit(.Call("RS_XML_setStructuredErrorHandler",.oldErrorHandler, 
                PACKAGE ="XML"), add = TRUE)
 
 
> XML:::htmlErrorHandler
function (msg, code, domain, line, col, level, filename,class="XMLError") 
{
    e = makeXMLError(msg, code, domain, line, col, level, filename, 
        class)
    dom = names(e$domain)
    class(e)= c(names(e$code), sprintf("%s_Error", gsub("_FROM_", 
        "_", dom)),class(e))
    if(e$code == xmlParserErrors["XML_IO_LOAD_ERROR"]) 
        stop(e)
}
<environment: namespace:XML>
 
 
> XML:::setXMLErrorHandler
function (fun) 
{
    prev =.Call("RS_XML_getStructuredErrorHandler", PACKAGE ="XML")
    sym = getNativeSymbolInfo("R_xmlStructuredErrorHandler", 
        "XML")$address
    .Call("RS_XML_setStructuredErrorHandler", list(fun, sym), 
        PACKAGE ="XML")
    prev
}
<environment: namespace:XML>

This is really just configuring how errors are handled.

 

Note 6: the condition that checks the text
BOMRegExp is simply a constant:
 


> getAnywhere(BOMRegExp)
A single object matching ‘BOMRegExp’ was found
It was found in the following places
  namespace:XML
with value
[1]"(\\xEF\\xBB\\xBF|\\xFE\\xFF|\\xFF\\xFE)"



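 sprintf("^%s?\\s*<", BOMRegExp) uses R's wrapper around the C formatting function sprintf; %s is replaced by the content of BOMRegExp. The resulting pattern matches an optional BOM at the start, then optional whitespace, then a "<".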
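A small sketch of that check outside the package. bom_regexp copies the value printed above; looks_like_markup is a hypothetical helper that repeats the grep call from the function body.

```r
# Value of XML:::BOMRegExp as printed above
bom_regexp <- "(\\xEF\\xBB\\xBF|\\xFE\\xFF|\\xFF\\xFE)"
pattern <- sprintf("^%s?\\s*<", bom_regexp)

looks_like_markup <- function(x) {
  length(grep(pattern, x, perl = TRUE, useBytes = TRUE)) > 0
}

looks_like_markup("<html><body>hi</body></html>")  # TRUE
looks_like_markup("   <p>leading blanks</p>")      # TRUE
looks_like_markup("fortunes.html")                 # FALSE
```

When the text fails this test, the function signals the "XML content does not seem to be XML" condition seen in the listing.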



 


ans <- .Call("RS_XML_ParseTree", as.character(file), handlers, 
               as.logical(ignoreBlanks), as.logical(replaceEntities), 
               as.logical(asText), as.logical(trim), as.logical(validate), 
               as.logical(getDTD), as.logical(isURL), as.logical(addAttributeNamespaces), 
               as.logical(useInternalNodes), as.logical(isHTML), as.logical(isSchema), 
               as.logical(fullNamespaceInfo), as.character(encoding), 
               as.logical(useDotNames), xinclude, error, addFinalizer, 
               as.integer(options), as.logical(parentFirst), PACKAGE = "XML")

 
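Here is the Stack Overflow link on how to view the source code behind .Call:
http://stackoverflow.com/questions/19226816/how-can-i-view-the-source-code-for-a-function
The community wiki answer is the main one to read.
In RStudio you can also select a function and press F2 to view its source.
That question covers how to view the source of every kind of function in R.
 R News article, p. 43

I followed the answer and downloaded the package source to look at it:

untar(download.packages(pkgs ="XML",
                        destdir =".",
                        type ="source")[,2])

Then I searched with Everything and found the src directory under My Documents.

Searching inside the files turned up only one hit,

 and nothing beyond that. ENTRY is presumably the entry point? My C is limited to what I picked up as an undergraduate (if any reader knows, please leave a comment). 21 is the declared number of arguments.

So I never got to see the final C code, which is a pity: the core task of reading the htmlTreeParse source was not really completed.
Next time I will not read in this much detail and will just grasp the main structure. The C code stayed out of reach, but I still picked up a few odds and ends along the way.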

