Parsing HTML with htmldom

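The language is Python 3.5 and the development environment is Windows.

While using the HTMLParser library, I found that it could not correctly parse multiple levels of nested div elements when those divs in turn contained other elements such as a elements.

This appears to be a long-standing, unresolved bug: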

https://sourceforge.net/p/nekohtml/bugs/98/
http://jericho.htmlparser.net//docs/javadoc/net/htmlparser/jericho/StartTag.html

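So I looked for a new library, hoping for one that could build a DOM object model from HTML source the way JavaScript does. I did not find a perfect one, only this library: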

http://thehtmldom.sourceforge.net/

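This is how I downloaded the page source:

1. Open the URL in Firefox;

2. Press Ctrl+Shift+C to open the "DOM and Style Inspector";

3. Select the top-level html element, right-click, and choose Copy Outer HTML;

4. Paste it into a text editor and save it with UTF-8 encoding.

Here is my code, posted as-is. My memory is poor, so this makes it easy for me to copy later: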

#!/usr/bin/python
# -*- coding: utf-8 -*-

from htmldom import htmldom

# Local copy of the page, saved from Firefox as described above.
url = 'file:///C:/Users/Microsoft/Desktop/analyse/ict_in_companies.html'
dom = htmldom.HtmlDom(url).createDom()

def returndict(estartup):
    """Extract the fields of one startup row into a dict of strings."""
    if not isinstance(estartup, htmldom.HtmlNodeList):
        return None

    _returndict = {'market': '', 'name': '', 'link': '', 'pitch': '', 'raised': '', 'signal': '', 'joined': '', 'employee': '', 'stage': '', 'location': ''}

    # Name, link and pitch live in the nested "company" column.
    try:
        etext = estartup.children('div[class~=company]').first().children().first().children('div[class=text]').first()
        ename = etext.children('div[class=name]').first()
        elink = ename.children('a[class=startup-link]').first()
        epitch = etext.children('div[class=pitch]').first()
    except Exception as err:
        print('**** ERROR There is something wrong! ****')
        print(err)
    else:
        _returndict['name'] = ename.text().rstrip()
        _returndict['pitch'] = epitch.text().rstrip()
        _returndict['link'] = elink.attr('href').rstrip()

    # Every other column keeps its content in a div with class "value".
    try:
        ejoined = estartup.children('div[class~=joined]').first().children('div[class=value]').first()
        elocation = estartup.children('div[class~=location]').first().children('div[class=value]').first()
        emarket = estartup.children('div[class~=market]').first().children('div[class=value]').first()
        eemployee = estartup.children('div[class~=company_size]').first().children('div[class=value]').first()
        estage = estartup.children('div[class~=stage]').first().children('div[class=value]').first()
        eraised = estartup.children('div[class~=raised]').first().children('div[class=value]').first()
        esignal = estartup.children('div[class~=signal]').first().children('div[class=value]').first().children('img').first()
    except Exception as err:
        print('**** ERROR There is something wrong! ****')
        print(err)
    else:
        _returndict['joined'] = ejoined.text().rstrip()
        _returndict['location'] = elocation.text().rstrip()
        _returndict['market'] = emarket.text().rstrip()
        # The value is prefixed with an apostrophe before it is stored.
        _returndict['employee'] = "'%s" % eemployee.text().rstrip()
        _returndict['stage'] = estage.text().rstrip()
        _returndict['raised'] = eraised.text().rstrip()
        # The signal value is the single character at index 37 of the image URL.
        _returndict['signal'] = esignal.attr('src')[37:38].rstrip()

    return _returndict

def returngbk(original):
    """Encode a str as GBK bytes, dropping characters GBK cannot represent."""
    return original.encode('gbk', 'ignore')
    #return original.encode('utf-8')

ecompanies = dom.find('div[class~=frw44]')
lcompanies = []

for ecompany in ecompanies:
    estartup = ecompany.children(selector='div[class~=startup]', all_children=False).first()
    dcompany = returndict(estartup)
    if not dcompany:
        continue
    print('index: %d' % len(lcompanies))
    lcompanies.append(dcompany)

# The rows are already GBK bytes, so the file is opened in binary mode.
output = open('C:/Users/Microsoft/Desktop/a.csv', 'wb')
output.write(b'market\tname\tlink\tpitch\traised\tsignal\tjoined\temployee\tstage\tlocation\n')

for i in range(len(lcompanies)):
    print('Index: %d' % i)
    tcompany = (returngbk(lcompanies[i]['market']),
                returngbk(lcompanies[i]['name']),
                returngbk(lcompanies[i]['link']),
                returngbk(lcompanies[i]['pitch']),
                returngbk(lcompanies[i]['raised']),
                returngbk(lcompanies[i]['signal']),
                returngbk(lcompanies[i]['joined']),
                returngbk(lcompanies[i]['employee']),
                returngbk(lcompanies[i]['stage']),
                returngbk(lcompanies[i]['location']))
    # %-formatting on bytes requires Python 3.5 or later.
    data = b'%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' % tcompany
    output.write(data)

output.close()

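The htmldom 2.0 library has one quirk: whatever you use to obtain objects (the find method, the children method, a for...in loop, and so on), the result is always of type HtmlNodeList, which makes the library somewhat ambiguous to use.

For example, the children method should return cs, the collection of child elements of the current element p, and iterating over cs with for...in yields c, an individual child element. But p (a single element), cs (a collection), and c (a single element) are all HtmlNodeList objects, so they all expose the same methods and attributes.

Even if cs contains only one c, c.attr(...) succeeds while cs.attr(...) does not. To refer to c, use cs.first(), that is, c is cs.first(). Do not confuse cs with c.

Also, although the manual says the children method accepts an argument that controls whether descendants are returned recursively, in practice it returns only the immediate children whether or not the argument is passed. So just treat the argument as if it did not exist.

Finally, in a call such as dom.find('div[class=classname]'), classname must not contain spaces or other special characters; if it does, the only option is a call such as dom.find('div[class~=assna]').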

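This time, writing a file with the open function on Windows turned up a problem: some characters in the web page could not be converted to GBK-encoded bytes and written to the file, so the script stopped with an error.

So why was the text being converted to GBK at all; did my code ask for that? It did not. But when open is used in text mode without an encoding argument, Python falls back to the system's default encoding, which is GBK on Chinese-language Windows, and that is where the conversion comes from.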
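
To check which encoding is in play on a given machine, a minimal sketch like the following can help. It only asks Python for the locale's preferred encoding, which is what open falls back to in text mode when no encoding argument is passed (unless Python's UTF-8 mode is enabled). The variable name is illustrative, and the printed value depends on the system; on Chinese-language Windows it is typically cp936, Python's name for GBK.

```python
import locale

# Encoding that open(path, 'w') falls back to when encoding= is omitted.
default_encoding = locale.getpreferredencoding(False)
print('default text encoding:', default_encoding)
```

Passing encoding= explicitly to open avoids depending on this system default.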

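The HTML source is UTF-8, and some of its Unicode characters have no mapping in the GBK character set, hence the error. The fix is to convert explicitly in the code with s.encode('gbk', 'ignore'), where s is the original string object; the 'ignore' argument is required here and means that characters that cannot be converted are dropped. type(s.encode('gbk', 'ignore')) shows that the result is of type bytes, that is, its __repr__() has the form b'...'. Since the data is already bytes, it is written to the file in binary mode, which is why open is called with 'wb'.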
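
The behaviour described above can be reproduced without any HTML at all. The following is a small sketch using a made-up sample string; the emoji stands in for any character that GBK cannot represent. It compares the default strict mode with the 'ignore' and 'replace' error handlers of str.encode.

```python
sample = 'ok \U0001F600 中文'  # made-up text; the emoji has no GBK mapping

try:
    sample.encode('gbk')  # strict error handling is the default
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

lossy = sample.encode('gbk', 'ignore')    # unmappable characters are dropped
marked = sample.encode('gbk', 'replace')  # unmappable characters become b'?'

print(strict_failed, type(lossy).__name__)
print(lossy)
print(marked)
```

Using 'replace' instead of 'ignore' leaves a visible marker where a character was lost, which can make the damage easier to spot in the output file.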

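Of course, s.encode('utf-8') combined with open(path, 'wb') does not raise an error either. On Windows, however, most software still assumes GBK when displaying text, so UTF-8 text is displayed incorrectly until the encoding is manually set to UTF-8.

Here is a way to open a UTF-8 encoded .csv file in Excel without garbled characters: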

https://jingyan.baidu.com/article/48a4205705c098a925250455.html

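I found this approach works better, especially because special characters such as the euro sign are preserved, whereas they are lost when converting to GBK.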
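
A related option, which is not from the original script and not necessarily the method in the linked article, is to write the file as UTF-8 with a byte order mark, since Excel usually recognises the BOM when opening a .csv file. The sketch below assumes the standard csv module and Python's 'utf-8-sig' codec; the row contents and the temporary file path are made up for illustration.

```python
import csv
import os
import tempfile

rows = [
    ['name', 'raised'],
    ['Example Startup', '€1,000,000'],  # made-up row with a euro sign
]

path = os.path.join(tempfile.mkdtemp(), 'a.csv')  # illustrative location

# 'utf-8-sig' prepends the BOM; newline='' lets the csv module manage line endings.
with open(path, 'w', encoding='utf-8-sig', newline='') as output:
    csv.writer(output).writerows(rows)

# Read the raw bytes back to confirm the BOM and the euro sign are present.
with open(path, 'rb') as check:
    raw = check.read()

print(raw[:3])
print(raw.decode('utf-8-sig'))
```

Because the file stays in UTF-8, nothing has to be dropped with 'ignore', and the csv module takes care of quoting fields that contain commas.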
