Web Crawler 2: Crawling Web Content with crawler4j

https://github.com/yasserg/crawler4j

Two JARs are required:

    crawler4j-4.1-jar-with-dependencies.jar

    slf4j-simple-1.7.22.jar (without it, SLF4J prints the warning: SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".)

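Download for both JARs:

http://download.csdn.net/detail/talkwah/9747407

(There is little material on crawler4j-4.1-jar-with-dependencies.jar, and my downloads from GitHub kept failing after a long wait, so I bundled the files together.)

References: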

http://blog.csdn.net/zjm131421/article/details/13093869

http://favccxx.blog.51cto.com/2890523/1691079/

import java.util.Set;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

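public class AhCrawler extends WebCrawler {

    // Three essentials:
    //   - Which URLs to visit?
    //   - How to visit them?
    //   - What to do with a page once it has been visited?

    private static final String C_URL = "http://www.ximalaya.com";

    // Skip resources ending in mp3, jpg, or png. The alternatives are lowercase
    // because shouldVisit lowercases the URL before matching; an uppercase
    // "MP3" would never match.
    private static final Pattern SKIP_EXTENSIONS = Pattern.compile(".*\\.(mp3|jpg|png)$");

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        return !SKIP_EXTENSIONS.matcher(href).matches() && href.startsWith(C_URL);
    }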

    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();
        String parentUrl = page.getWebURL().getParentUrl();
        // Anchor text of the link that led to this page; it can be null.
        String anchor = page.getWebURL().getAnchor();

        System.out.println("↓↓↓↓↓↓↓↓↓");
        System.out.println("URL        :" + url);
        System.out.println("Parent page:" + parentUrl);
        System.out.println("Anchor text:" + anchor);

        // logger is inherited from WebCrawler.
        logger.info("URL: {}", url);
        logger.debug("Parent page: {}", parentUrl);
        logger.debug("Anchor text: {}", anchor);

        // Only HTML pages carry text, markup, and outgoing links.
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
            String text = htmlParseData.getText();
            String html = htmlParseData.getHtml();
            Set<WebURL> links = htmlParseData.getOutgoingUrls();

            System.out.println("--------------------------");
            // System.out.println("Text length: " + text.length());
            // System.out.println("Html length: " + html.length());
            System.out.println("Number of outgoing links: " + links.size());
        }

        System.out.println("↑↑↑↑↑↑↑");
    }

    public static void main(String[] args) throws Exception {
        // In the original crawler4j example these two values are passed in as
        // command-line arguments.
        // The storage folder works like a temp directory and does not need to
        // be created beforehand.
        String crawlStorageFolder = "/data/crawl/root";
        int numberOfCrawlers = 7;

        CrawlConfig crawlConf = new CrawlConfig();
        crawlConf.setCrawlStorageFolder(crawlStorageFolder);

        PageFetcher pageFetcher = new PageFetcher(crawlConf);
        RobotstxtConfig robotConf = new RobotstxtConfig();
        RobotstxtServer robotServ = new RobotstxtServer(robotConf, pageFetcher);

        // Controller
        CrawlController c = new CrawlController(crawlConf, pageFetcher, robotServ);

        // Add the seed URL
        c.addSeed(C_URL);

        // Start the crawler
        c.start(AhCrawler.class, numberOfCrawlers);
    }

}
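
The URL filter used by shouldVisit can be tried on its own, without the crawler4j JAR on the classpath. The following is a minimal sketch under that assumption: the class name UrlFilterDemo and the method accept are hypothetical and not part of crawler4j. It lowercases the URL first, so the extension list has to be lowercase as well; an uppercase alternative such as MP3 can never match a lowercased string.

```java
import java.util.Locale;
import java.util.regex.Pattern;

public class UrlFilterDemo {

    // Same site prefix as the seed URL in the crawler.
    private static final String PREFIX = "http://www.ximalaya.com";

    // Lowercase alternatives only, because the URL is lowercased before matching.
    private static final Pattern SKIP = Pattern.compile(".*\\.(mp3|jpg|png)$");

    // Hypothetical stand-alone version of the shouldVisit check.
    public static boolean accept(String url) {
        String href = url.toLowerCase(Locale.ROOT);
        return !SKIP.matcher(href).matches() && href.startsWith(PREFIX);
    }

    public static void main(String[] args) {
        // An ordinary page on the target site is accepted.
        System.out.println(accept("http://www.ximalaya.com/dq/music-ACG/"));
        // Media files are rejected regardless of the case of the extension.
        System.out.println(accept("http://www.ximalaya.com/cover/A.JPG"));
        System.out.println(accept("http://www.ximalaya.com/audio/track.MP3"));
        // Pages on other hosts are rejected.
        System.out.println(accept("http://example.com/index.html"));
    }
}
```

Compiling the pattern once as a constant also avoids recompiling it for every URL the crawler considers.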

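Where CrawlController c comes from: it is constructed from crawlConf, pageFetcher, and robotServ, where pageFetcher is built from crawlConf, and robotServ is built from robotConf and pageFetcher.

Sample output: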

↓↓↓↓↓↓↓↓↓

URL        :http://www.ximalaya.com/5333001/sound/25320285

Parent page:http://www.ximalaya.com/dq/music-ACG/

Anchor text:俊豪演奏 - 琵琶版《刀劍如夢》

[Crawler 3] INFO edu.uci.ics.crawler4j.crawler.WebCrawler - URL: http://www.ximalaya.com/5333001/sound/25320285

--------------------------

Number of outgoing links: 131

↑↑↑↑↑↑↑

[Crawler 7] INFO edu.uci.ics.crawler4j.crawler.WebCrawler - URL: http://www.ximalaya.com/30119950/sound/12181402

↓↓↓↓↓↓↓↓↓

URL        :http://www.ximalaya.com/30119950/sound/12181402

Parent page:http://www.ximalaya.com/dq/book-果麦文化/

Anchor text:第二十六集 人生的意思不在于留下什么，而在于经历

--------------------------

Number of outgoing links: 134

↑↑↑↑↑↑↑

↓↓↓↓↓↓↓↓↓

URL        :http://www.ximalaya.com/zhubo/56833971/

Parent page:http://www.ximalaya.com/4932085/sound/21902925

Anchor text:null

[Crawler 1] INFO edu.uci.ics.crawler4j.crawler.WebCrawler - URL: http://www.ximalaya.com/zhubo/56833971/

--------------------------

Number of outgoing links: 68

↑↑↑↑↑↑↑

[Crawler 4] INFO edu.uci.ics.crawler4j.crawler.WebCrawler - URL: http://www.ximalaya.com/5413571/sound/2349697

↓↓↓↓↓↓↓↓↓

URL        :http://www.ximalaya.com/5413571/sound/2349697

Parent page:http://www.ximalaya.com/dq/renwen-新知/

Anchor text:41-方明-西江月·夜行黄沙道中 南宋 辛弃疾

--------------------------

Number of outgoing links: 134

↑↑↑↑↑↑↑

[Crawler 6] INFO edu.uci.ics.crawler4j.crawler.WebCrawler - URL: http://www.ximalaya.com/5011186/sound/30650945

↓↓↓↓↓↓↓↓↓

URL        :http://www.ximalaya.com/5011186/sound/30650945

Parent page:http://www.ximalaya.com/dq/finance-大咖/

Anchor text:03

--------------------------

Number of outgoing links: 111

↑↑↑↑↑↑↑

↓↓↓↓↓↓↓↓↓

URL        :http://www.ximalaya.com/1000144/album/3559805

Parent page:http://www.ximalaya.com/dq/music-文艺/

Anchor text:null

[Crawler 2] INFO edu.uci.ics.crawler4j.crawler.WebCrawler - URL: http://www.ximalaya.com/1000144/album/3559805

--------------------------

Number of outgoing links: 85

↑↑↑↑↑↑↑

↓↓↓↓↓↓↓↓↓

URL        :http://www.ximalaya.com/4932085/sound/21902925/liker

Parent page:http://www.ximalaya.com/4932085/sound/21902925

Anchor text:更多

[Crawler 1] INFO edu.uci.ics.crawler4j.crawler.WebCrawler - URL: http://www.ximalaya.com/4932085/sound/21902925/liker

--------------------------

Number of outgoing links: 96

↑↑↑↑↑↑↑

↓↓↓↓↓↓↓↓↓

URL        :http://www.ximalaya.com/30895669/sound/19945445

Parent page:http://www.ximalaya.com/dq/music-ACG/

Anchor text:宫崎骏-久石让

[Crawler 3] INFO edu.uci.ics.crawler4j.crawler.WebCrawler - URL: http://www.ximalaya.com/30895669/sound/19945445

--------------------------

Number of outgoing links: 131

↑↑↑↑↑↑↑

↓↓↓↓↓↓↓↓↓

URL        :http://www.ximalaya.com/9112346/album/2903291

Parent page:http://www.ximalaya.com/dq/book-果麦文化/

Anchor text:null

[Crawler 7] INFO edu.uci.ics.crawler4j.crawler.WebCrawler - URL: http://www.ximalaya.com/9112346/album/2903291

--------------------------

Number of outgoing links: 90

↑↑↑↑↑↑↑
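
With several crawler threads each printing one line at a time, the blocks for different pages can interleave, and a missing anchor shows up as the literal text null. One possible way to tidy this, sketched below, is to build the whole block as a single string and print it with one call. The class VisitReport and its format method are hypothetical helpers, not part of crawler4j; ASCII markers stand in for the arrow banners, and the sketch assumes the page was HTML so that a link count is available.

```java
public class VisitReport {

    // Hypothetical helper: assembles the per-page block as one string so it
    // can be written with a single println call.
    public static String format(String url, String parentUrl, String anchor, int linkCount) {
        StringBuilder sb = new StringBuilder();
        sb.append("vvvvvvvvv\n");
        sb.append("URL        :").append(url).append('\n');
        sb.append("Parent page:").append(parentUrl).append('\n');
        // Show a placeholder instead of the literal "null".
        sb.append("Anchor text:").append(anchor == null ? "(none)" : anchor).append('\n');
        sb.append("--------------------------\n");
        sb.append("Number of outgoing links: ").append(linkCount).append('\n');
        sb.append("^^^^^^^");
        return sb.toString();
    }

    public static void main(String[] args) {
        // Values taken from one of the sample blocks, where the anchor was null.
        System.out.println(format(
                "http://www.ximalaya.com/zhubo/56833971/",
                "http://www.ximalaya.com/4932085/sound/21902925",
                null,
                68));
    }
}
```

Inside visit, the equivalent call would pass url, parentUrl, anchor, and links.size(). Lines written by the logger are produced separately and may still land between blocks.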
