Crawler 3: HTML parser, html_parser.py

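# coding: utf8
# Python 2 calls this module urlparse; Python 3 moved it to urllib.parse.
try:
    import urlparse
except ImportError:
    from urllib import parse as urlparse
import re

from bs4 import BeautifulSoup

__author__ = 'wang'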

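class HtmlParser(object):
    def parse(self, page_url, html_cont):
        """Return (new_urls, new_data) for one downloaded page."""
        if page_url is None or html_cont is None:
            # Return an empty pair so the caller can always unpack two values.
            return set(), None
        # from_encoding only takes effect when html_cont is a byte string.
        soup = BeautifulSoup(html_cont, 'html.parser', from_encoding='utf-8')
        new_urls = self._get_new_urls(page_url, soup)
        new_data = self._get_new_data(page_url, soup)
        return new_urls, new_data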

    def _get_new_urls(self, page_url, soup):
        """Collect absolute URLs of entry links such as /view/123.htm."""
        new_urls = set()
        links = soup.find_all('a', href=re.compile(r"/view/\d+\.htm"))
        for link in links:
            new_url = link['href']
            # Resolve relative hrefs against the URL of the current page.
            new_full_url = urlparse.urljoin(page_url, new_url)
            new_urls.add(new_full_url)
        return new_urls

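    def _get_new_data(self, page_url, soup):
        """Extract the title and summary of the entry page."""
        res_data = {'url': page_url}
        # find() returns None when a node is missing, so guard each lookup
        # and fall back to an empty string instead of raising AttributeError.
        title_dd = soup.find('dd', class_='lemmaWgt-lemmaTitle-title')
        title_node = title_dd.find('h1') if title_dd is not None else None
        res_data['title'] = title_node.get_text() if title_node is not None else ''
        summary_node = soup.find('div', class_='lemma-summary')
        res_data['summary'] = summary_node.get_text() if summary_node is not None else ''
        return res_data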
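
The link handling in _get_new_urls can be tried without BeautifulSoup. The following is only a minimal sketch using the standard library: the helper name resolve_entry_links, the constant ENTRY_LINK, and the example.com URLs are assumptions made for illustration and are not part of the original crawler. It applies the same regular expression with search(), which is how BeautifulSoup tests a compiled pattern against an attribute value, and then makes each match absolute with urljoin.

```python
import re

try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2

# Same pattern that the parser passes to find_all().
ENTRY_LINK = re.compile(r"/view/\d+\.htm")


def resolve_entry_links(page_url, hrefs):
    """Keep hrefs that look like entry pages and make them absolute."""
    return set(urljoin(page_url, href) for href in hrefs if ENTRY_LINK.search(href))


# Hypothetical page URL and hrefs, chosen only for illustration.
page_url = "http://example.com/view/1.htm"
hrefs = [
    "/view/10812319.htm",                   # relative entry link: kept and resolved
    "/help/about.htm",                      # not an entry link: dropped
    "http://example.com/view/2561555.htm",  # already absolute: kept unchanged
]

for url in sorted(resolve_entry_links(page_url, hrefs)):
    print(url)
# prints:
# http://example.com/view/10812319.htm
# http://example.com/view/2561555.htm
```

Collecting the results in a set, as the parser does, means a link that appears several times on one page is reported only once.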
