Web Crawler 2: Crawling Web Content with crawler4j

https://github.com/yasserg/crawler4j

Two JARs are required:

　　crawler4j-4.1-jar-with-dependencies.jar

　　slf4j-simple-1.7.22.jar (without it you get the warning: SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".)

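Download for both JARs:

http://download.csdn.net/detail/talkwah/9747407

(There is little material on crawler4j-4.1-jar-with-dependencies.jar, and the GitHub download kept failing after a long wait, so I bundled the files here.)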

References:

http://blog.csdn.net/zjm131421/article/details/13093869

http://favccxx.blog.51cto.com/2890523/1691079/

import java.util.Set;

import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;

import edu.uci.ics.crawler4j.crawler.CrawlController;

import edu.uci.ics.crawler4j.crawler.Page;

import edu.uci.ics.crawler4j.crawler.WebCrawler;

import edu.uci.ics.crawler4j.fetcher.PageFetcher;

import edu.uci.ics.crawler4j.parser.HtmlParseData;

import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;

import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

import edu.uci.ics.crawler4j.url.WebURL;

public class AhCrawler extends WebCrawler {

    // Three essentials:

    // _ Which URLs to visit?

    // _ How to visit them?

    // _ What to do with a page once it has been visited?

    private static final String C_URL = "http://www.ximalaya.com";

    @Override

    public boolean shouldVisit(Page referringPage, WebURL url) {

        String href = url.getURL().toLowerCase();

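        // Exclude resources ending in .mp3, .jpg or .png.
        // href is lowercased above, so the extensions here must be lowercase too.

        Pattern p = Pattern.compile(".*(\\.(mp3|jpg|png))$");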

        return !p.matcher(href).matches() && href.startsWith(C_URL);

    }

    @Override

    public void visit(Page page) {

        String url = page.getWebURL().getURL();

        String parentUrl = page.getWebURL().getParentUrl();

        String anchor = page.getWebURL().getAnchor();

        System.out.println("↓↓↓↓↓↓↓↓↓");

        System.out.println("URL        :" + url);

        System.out.println("Parent page:" + parentUrl);

        System.out.println("Anchor text:" + anchor);

        logger.info("URL: {}", url);

        logger.debug("Parent page: {}", parentUrl);

        logger.debug("Anchor text: {}", anchor);

        if (page.getParseData() instanceof HtmlParseData) {

            HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();

            String text = htmlParseData.getText();

            String html = htmlParseData.getHtml();

            Set<WebURL> links = htmlParseData.getOutgoingUrls();

            System.out.println("--------------------------");

            // System.out.println("Text length: " + text.length());

            // System.out.println("Html length: " + html.length());

            System.out.println("Number of outgoing links: " + links.size());

        }

        System.out.println("↑↑↑↑↑↑↑");

    }

    public static void main(String[] args) throws Exception {

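        // In the original example these two values are passed in as arguments.

        // Storage path for crawl data; it works like a temp folder and does not need to exist beforehand.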

        String crawlStorageFolder = "/data/crawl/root";

        int numberOfCrawlers = 7;

        CrawlConfig crawlConf = new CrawlConfig();

        crawlConf.setCrawlStorageFolder(crawlStorageFolder);

        PageFetcher pageFetcher = new PageFetcher(crawlConf);

        RobotstxtConfig robotConf = new RobotstxtConfig();

        RobotstxtServer robotServ = new RobotstxtServer(robotConf, pageFetcher);

        // Crawl controller

        CrawlController c = new CrawlController(crawlConf, pageFetcher,

                robotServ);

        // Add the seed URL

        c.addSeed(C_URL);

        // Start the crawl (this call blocks until crawling finishes)

        c.start(AhCrawler.class, numberOfCrawlers);

    }

}
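The filter in shouldVisit lowercases the URL before matching, so the extension pattern only works if it is lowercase as well. Below is a minimal standalone sketch of that check, using only the JDK so it can be run without crawler4j. The class name UrlFilterSketch, the constant names, and the three test URLs after the first are made up for illustration; Locale.ROOT is my own addition to keep lowercasing independent of the system locale.

```java
import java.util.Locale;
import java.util.regex.Pattern;

public class UrlFilterSketch {

    // Hypothetical constant mirroring C_URL in the crawler above.
    static final String SEED_PREFIX = "http://www.ximalaya.com";

    // Compiled once instead of on every call; extensions are lowercase
    // because the URL is lowercased before matching.
    static final Pattern MEDIA = Pattern.compile(".*\\.(mp3|jpg|png)$");

    // True when the URL is on the seed site and is not a media file.
    static boolean shouldVisit(String rawUrl) {
        String href = rawUrl.toLowerCase(Locale.ROOT);
        return !MEDIA.matcher(href).matches() && href.startsWith(SEED_PREFIX);
    }

    public static void main(String[] args) {
        // A page URL taken from the sample output: visited.
        System.out.println(shouldVisit("http://www.ximalaya.com/5333001/sound/25320285")); // true
        // Illustrative media URLs with uppercase extensions: skipped.
        System.out.println(shouldVisit("http://www.ximalaya.com/cover/A.JPG"));            // false
        System.out.println(shouldVisit("http://www.ximalaya.com/track/demo.MP3"));         // false
        // Illustrative off-site URL: skipped.
        System.out.println(shouldVisit("http://example.com/index.html"));                  // false
    }
}
```

Keeping the pattern in a static final field also avoids recompiling it for every link the crawler discovers.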

Where CrawlController c comes from:

Sample output:

↓↓↓↓↓↓↓↓↓

URL        :http://www.ximalaya.com/5333001/sound/25320285

Parent page:http://www.ximalaya.com/dq/music-ACG/

Anchor text:俊豪演奏 - 琵琶版《刀劍如夢》

[Crawler 3] INFO edu.uci.ics.crawler4j.crawler.WebCrawler - URL: http://www.ximalaya.com/5333001/sound/25320285

--------------------------

Number of outgoing links: 131

↑↑↑↑↑↑↑

[Crawler 7] INFO edu.uci.ics.crawler4j.crawler.WebCrawler - URL: http://www.ximalaya.com/30119950/sound/12181402

↓↓↓↓↓↓↓↓↓

URL        :http://www.ximalaya.com/30119950/sound/12181402

Parent page:http://www.ximalaya.com/dq/book-果麦文化/

Anchor text:第二十六集 人生的意思不在于留下什么，而在于经历

--------------------------

Number of outgoing links: 134

↑↑↑↑↑↑↑

↓↓↓↓↓↓↓↓↓

URL        :http://www.ximalaya.com/zhubo/56833971/

Parent page:http://www.ximalaya.com/4932085/sound/21902925

Anchor text:null

[Crawler 1] INFO edu.uci.ics.crawler4j.crawler.WebCrawler - URL: http://www.ximalaya.com/zhubo/56833971/

--------------------------

Number of outgoing links: 68

↑↑↑↑↑↑↑

[Crawler 4] INFO edu.uci.ics.crawler4j.crawler.WebCrawler - URL: http://www.ximalaya.com/5413571/sound/2349697

↓↓↓↓↓↓↓↓↓

URL        :http://www.ximalaya.com/5413571/sound/2349697

Parent page:http://www.ximalaya.com/dq/renwen-新知/

Anchor text:41-方明-西江月·夜行黄沙道中 南宋 辛弃疾

--------------------------

Number of outgoing links: 134

↑↑↑↑↑↑↑

[Crawler 6] INFO edu.uci.ics.crawler4j.crawler.WebCrawler - URL: http://www.ximalaya.com/5011186/sound/30650945

↓↓↓↓↓↓↓↓↓

URL        :http://www.ximalaya.com/5011186/sound/30650945

Parent page:http://www.ximalaya.com/dq/finance-大咖/

Anchor text:03

--------------------------

Number of outgoing links: 111

↑↑↑↑↑↑↑

↓↓↓↓↓↓↓↓↓

URL        :http://www.ximalaya.com/1000144/album/3559805

Parent page:http://www.ximalaya.com/dq/music-文艺/

Anchor text:null

[Crawler 2] INFO edu.uci.ics.crawler4j.crawler.WebCrawler - URL: http://www.ximalaya.com/1000144/album/3559805

--------------------------

Number of outgoing links: 85

↑↑↑↑↑↑↑

↓↓↓↓↓↓↓↓↓

URL        :http://www.ximalaya.com/4932085/sound/21902925/liker

Parent page:http://www.ximalaya.com/4932085/sound/21902925

Anchor text:更多

[Crawler 1] INFO edu.uci.ics.crawler4j.crawler.WebCrawler - URL: http://www.ximalaya.com/4932085/sound/21902925/liker

--------------------------

Number of outgoing links: 96

↑↑↑↑↑↑↑

↓↓↓↓↓↓↓↓↓

URL        :http://www.ximalaya.com/30895669/sound/19945445

Parent page:http://www.ximalaya.com/dq/music-ACG/

Anchor text:宫崎骏-久石让

[Crawler 3] INFO edu.uci.ics.crawler4j.crawler.WebCrawler - URL: http://www.ximalaya.com/30895669/sound/19945445

--------------------------

Number of outgoing links: 131

↑↑↑↑↑↑↑

↓↓↓↓↓↓↓↓↓

URL        :http://www.ximalaya.com/9112346/album/2903291

Parent page:http://www.ximalaya.com/dq/book-果麦文化/

Anchor text:null

[Crawler 7] INFO edu.uci.ics.crawler4j.crawler.WebCrawler - URL: http://www.ximalaya.com/9112346/album/2903291

--------------------------

Number of outgoing links: 90

↑↑↑↑↑↑↑
