Web::Scraper: Page Extraction and Analysis

Web::Scraper is a toolkit for extracting the content of elements from HTML documents. It understands HTML, CSS selectors, and XPath expressions.

Syntax

  use URI;
  use Web::Scraper;

  # First, create your scraper block
  my $tweets = scraper {
      # Parse all LIs with the class "status", store them into a resulting
      # array 'tweets'.  We embed another scraper for each tweet.
      process "li.status", "tweets[]" => scraper {
          # And, in that array, pull in the elements with the class
          # "entry-content", "entry-date" and the link
          process ".entry-content", body => 'TEXT';
          process ".entry-date", when => 'TEXT';
          process 'a[rel="bookmark"]', link => '@href';
      };
  };

  my $res = $tweets->scrape( URI->new("http://twitter.com/miyagawa") );

  # The result has the populated tweets array
  for my $tweet (@{$res->{tweets}}) {
      print "$tweet->{body} $tweet->{when} (link: $tweet->{link})\n";
  }

process

  scraper {
      process "tag.class", key => 'TEXT';
      process '//tag[contains(@foo, "bar")]', key2 => '@attr';
      process '//comment()', 'comments[]' => 'TEXT';
  };

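If the argument you pass is a URI or an HTTP response, Web::Scraper automatically inspects the Content-Type header and the META tags to determine the document's encoding. Otherwise, you must decode the HTML content to Unicode yourself before passing it to the scrape method.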

  $res = $scraper->scrape(URI->new($uri));
  $res = $scraper->scrape($html_content);
  $res = $scraper->scrape(\$html_content);
  $res = $scraper->scrape($http_response);
  $res = $scraper->scrape($html_element);
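The decoding step for a plain HTML string can be sketched as follows. This is a minimal illustration, not the module's own recipe: the byte string, the `p.title` selector, and the `title` key are made up for the example, and the input is assumed to be UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Web::Scraper;

# Hypothetical input: raw UTF-8 bytes, e.g. as read from a file or a socket.
my $bytes = "<p class=\"title\">caf\xC3\xA9</p>";

# Decode the bytes into a Perl Unicode string before scraping,
# because scrape() cannot guess the encoding of a plain string.
my $html = decode('UTF-8', $bytes);

my $scraper = scraper {
    process "p.title", title => 'TEXT';
};

my $res = $scraper->scrape($html);

# Encode again on output so the terminal receives valid UTF-8.
binmode STDOUT, ':encoding(UTF-8)';
print "$res->{title}\n";
```

If the encoding of the input is not known in advance, it has to be determined by some other means before calling decode; that part is outside the scope of this sketch.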

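When you pass HTML content directly to the scrape method, there is one more thing to consider: what happens to relative paths in the document? To have them resolved, pass the base URL along as a second argument.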

  $res = $scraper->scrape($html_content, "http://example.com/foo");
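As a rough illustration of the base URL at work, the sketch below scrapes an inline HTML string containing a relative link. The HTML snippet and the `/about` path are invented for the example; the base URL is the one used above.

```perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical document with a relative link.
my $html = '<div class="body"><a href="/about">About</a></div>';

my $scraper = scraper {
    process ".body > a", link => '@href', text => 'TEXT';
};

# The second argument is the base URL used to resolve relative links.
my $res = $scraper->scrape($html, "http://example.com/foo");

# With a base URL, the link comes back as an absolute URI object.
print "$res->{text}: $res->{link}\n";
```

Without the second argument, the `@href` value in this sketch would stay as the relative string found in the document.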

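process takes a selector as its first argument, followed by one or more key => value pairs that say what to extract. When the selector begins with "//" or "id(", it is treated as an XPath expression; otherwise it is treated as a CSS selector.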

  # <span class="date">2008/12/21</span>
  # date => "2008/12/21"
  process ".date", date => 'TEXT';

  # <div class="body"><a href="http://example.com/">foo</a></div>
  # link => URI->new("http://example.com/")
  process ".body > a", link => '@href';

  # <div class="body"><!-- HTML Comment here --><a href="http://example.com/">foo</a></div>
  # comment => " HTML Comment here "
  #
  # NOTE: Comment nodes are accessible only when
  # HTML::TreeBuilder::XPath (version >= 0.14) and/or
  # HTML::TreeBuilder::LibXML (version >= 0.13) is installed
  process "//div[contains(@class, 'body')]/comment()", comment => 'TEXT';

  # <div class="body"><a href="http://example.com/">foo</a></div>
  # link => URI->new("http://example.com/"), text => "foo"
  process ".body > a", link => '@href', text => 'TEXT';

  # <ul><li>foo</li><li>bar</li></ul>
  # list => [ "foo", "bar" ]
  process "li", "list[]" => "TEXT";

  # <ul><li id="1">foo</li><li id="2">bar</li></ul>
  # list => [ { id => "1", text => "foo" }, { id => "2", text => "bar" } ]
  process "li", "list[]" => { id => '@id', text => "TEXT" };
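To try these forms end to end without fetching anything from the network, the rules can be combined into one small script that scrapes an inline HTML string. This is only a sketch: the HTML, the `items` id, and the `names` key are made up for the example.

```perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical document combining the snippets shown above.
my $html = <<'HTML';
<html><body>
<span class="date">2008/12/21</span>
<ul id="items"><li id="1">foo</li><li id="2">bar</li></ul>
</body></html>
HTML

my $scraper = scraper {
    # CSS selector: the first argument does not start with "//" or "id("
    process ".date", date => 'TEXT';

    # XPath expression: the first argument starts with "//"
    process '//ul[@id="items"]/li', 'names[]' => 'TEXT';

    # A hash reference collects several values per matched element
    process "li", "list[]" => { id => '@id', text => 'TEXT' };
};

my $res = $scraper->scrape($html);

print "date: $res->{date}\n";
print "names: @{ $res->{names} }\n";
print "$_->{id} => $_->{text}\n" for @{ $res->{list} };
```

A key ending in `[]` collects every match into an array reference, while a plain key keeps only the first match.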
