Crawler 4: HTML outputer html_outputer.py

# coding: utf8
# Python 2 code (uses the print statement).

__author__ = 'wang'


class HtmlOutputer(object):

    def __init__(self):
        # Collected page records; each is a dict with 'url', 'title' and 'summary'.
        self.datas = []

    def collect_data(self, data):
        # Skip pages for which the parser returned nothing.
        if data is None:
            return
        print data
        self.datas.append(data)

    def output_html(self):
        # The with block closes the file even if a write fails.
        with open('output.html', 'w') as fout:
            fout.write('<html>')
            fout.write('<body>')
            fout.write('<table>')
            for data in self.datas:
                fout.write('<tr>')
                fout.write('<td>%s</td>' % data['url'])
                # title and summary are unicode, so encode them before writing.
                fout.write('<td>%s</td>' % data['title'].encode('utf-8'))
                fout.write('<td>%s</td>' % data['summary'].encode('utf-8'))
                fout.write('</tr>')
            fout.write('</table>')
            fout.write('</body>')
            fout.write('</html>')

    def test(self):
        pass
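
The listing above writes page text into the table without escaping it, so a title containing `<` or `&` could break the generated markup. Below is a minimal Python 3 sketch of the same output step with escaping added. It is not the author's code: the helper names `render_rows` and `write_html` are hypothetical, the `<meta charset>` line is an addition, and it assumes each record is a dict of text strings with the keys `url`, `title` and `summary`, as in the listing.

```python
# Python 3 sketch; render_rows and write_html are hypothetical helper names.
import html
import io


def render_rows(datas):
    """Build the HTML page for a list of {'url', 'title', 'summary'} dicts."""
    parts = ['<html>', '<head><meta charset="utf-8"></head>', '<body>', '<table>']
    for data in datas:
        parts.append('<tr>')
        for key in ('url', 'title', 'summary'):
            # html.escape turns <, >, & and quotes into entities.
            parts.append('<td>%s</td>' % html.escape(data[key]))
        parts.append('</tr>')
    parts.extend(['</table>', '</body>', '</html>'])
    return ''.join(parts)


def write_html(datas, path='output.html'):
    """Write the rendered page to path, encoded as UTF-8."""
    with io.open(path, 'w', encoding='utf-8') as fout:
        fout.write(render_rows(datas))
```

Keeping the rendering separate from the file write makes the markup easy to check in a test without touching the disk; `write_html(outputer.datas)` would then play the role of `output_html`.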

