Java Crawler Practice (Part 4: WebMagic, PhantomJS, and a Taobao Crawler)

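WebMagic is convenient, but it does not fit every case: for example, a crawler targeted at one specific page, or a site that relies heavily on AJAX requests and buries its page navigation in JavaScript.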

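In these cases you can combine WebMagic with PhantomJS to simulate real page requests: rather than only fetching the raw response, the whole page is fully rendered. This makes the crawler considerably slower, but it is still a quick and convenient solution.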

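PhantomJS is a headless browser based on WebKit that is scripted through a JavaScript API. It runs without a visible browser window, is fast, and natively supports web standards such as DOM handling, CSS selectors, JSON, Canvas, and SVG. PhantomJS can be used for page automation, network monitoring, screen capture, and headless testing.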

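Taobao is one of those sites that are hard to crawl with ordinary methods. Sending a plain GET request to Taobao returns almost no useful content or links.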

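Fortunately, although WebMagic fetches pages with HttpClient by default, it exposes the page-fetching step as the Downloader interface. A custom Downloader can therefore use PhantomJS to fetch the page.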

Using PhantomJS

1. Download and install PhantomJS

2. Write the JS script

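// PhantomJS script (saved as s.js): load a page and print the rendered HTML.
var system = require('system');           // gives access to the command-line arguments
var page = require('webpage').create();

// system.args[0] is the script name; system.args[1] is the first argument,
// i.e. the address of the page to load.
var url = system.args[1];

// Choosing a suitable output encoding matters, otherwise the captured page comes out garbled.
// The official example also lists "euc-jp", "sjis" and "utf8"; adjust this to your needs.
// Only one encoding is needed here.
phantom.outputEncoding = 'System';

page.open(url, function (status) {
    if (status !== 'success') {
        console.log('Unable to load the page!');
    } else {
        // Print the page content after WebKit has loaded it.
        // (The original also printed the encoding name in front of the HTML, which pollutes the output.)
        console.log(page.content);
    }
    phantom.exit();
});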
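The encoding remark in the script has a counterpart on the Java side: whatever bytes PhantomJS writes to standard output have to be decoded with the same charset. The following is a minimal sketch using only the JDK, assuming the script were switched to `phantom.outputEncoding = 'utf8'`. The class name `OutputDecoding` and the method `readAll` are illustrative and do not come from the original post; the byte array stands in for real PhantomJS output.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class OutputDecoding {

    // Reads the whole stream as text with an explicit charset
    // instead of relying on the platform default.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Assumption: stand-in for bytes a PhantomJS process would emit as UTF-8.
        // The escapes spell a short Chinese page title.
        String title = "\u6dd8\u5b9d\u7f51";
        byte[] utf8Bytes = ("<title>" + title + "</title>").getBytes(StandardCharsets.UTF_8);

        String correct = readAll(new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8);
        String garbled = readAll(new ByteArrayInputStream(utf8Bytes), StandardCharsets.ISO_8859_1);

        System.out.println(correct.contains(title)); // true
        System.out.println(garbled.contains(title)); // false: wrong charset garbles the text
    }
}
```

If the script keeps `'System'` as its output encoding, the matching choice on the Java side depends on the machine, which is one reason an explicit encoding on both ends may be easier to reason about.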

3. Test

package util;

import java.io.BufferedReader;
import java.io.FileWriter;
import java.io.InputStreamReader;
import java.io.PrintWriter;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Request;
import us.codecraft.webmagic.selector.PlainText;

public class GetAjaxHtml {

    // Runs PhantomJS with the s.js script and returns the rendered HTML of the given URL.
    public static String getAjaxContent(String url) throws Exception {
        // Pass the URL as its own argument so it is never split on whitespace.
        ProcessBuilder pb = new ProcessBuilder(
                "D:/phantomjs-2.1.1-windows/bin/phantomjs.exe", "D:/s.js", url);
        // Forward PhantomJS error output to the console so an unread error stream cannot block the process.
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process p = pb.start();
        StringBuilder sb = new StringBuilder();
        // Decodes with the platform default charset, as the original code did.
        try (BufferedReader br = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        p.waitFor();
        return sb.toString();
    }

    // Wraps the rendered HTML in a WebMagic Page so it can be returned from a custom Downloader.
    public static Page download(Request request) {
        Page page = new Page();
        try {
            String url = request.getUrl();
            page.setRawText(getAjaxContent(url));
            page.setUrl(new PlainText(url));
            page.setRequest(request);
        } catch (Exception e) {
            // On failure the empty page is returned, as in the original code.
            System.out.println("download failed: " + request.getUrl());
            e.printStackTrace();
        }
        return page;
    }

    public static void main(String[] args) throws Exception {
        long start = System.currentTimeMillis();
        String result = getAjaxContent("http://www.taobao.com");
        System.out.println(result);
        // Save the rendered page to a new file
        String path = "D:\\testFile\\taobao.html";
        try (PrintWriter printWriter = new PrintWriter(new FileWriter(path))) {
            printWriter.write(result);
        }
        long end = System.currentTimeMillis();
        System.out.println("=============== Elapsed: " + (end - start) + " ms ===============");
    }
}
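Because each download starts an external process, two guards may be worth considering: rejecting anything that is not an http/https URL before it reaches the command line, and bounding how long a PhantomJS run may take. The sketch below uses only the JDK. The paths are the ones used in this post, while `PhantomCommand`, `buildCommand` and `waitOrKill` are hypothetical names introduced for illustration.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

class PhantomCommand {

    // Builds the argument list for ProcessBuilder. Each list element is passed
    // to the process as exactly one argument.
    static List<String> buildCommand(String phantomPath, String scriptPath, String url) {
        URI uri = URI.create(url); // throws IllegalArgumentException on malformed input
        String scheme = uri.getScheme();
        boolean allowed = scheme != null
                && (scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"));
        if (!allowed) {
            throw new IllegalArgumentException("Only http/https URLs are accepted: " + url);
        }
        return Arrays.asList(phantomPath, scriptPath, url);
    }

    // Waits up to timeoutSeconds for the process and kills it on timeout.
    // Returns true if the process finished on its own.
    static boolean waitOrKill(Process process, long timeoutSeconds) throws InterruptedException {
        boolean finished = process.waitFor(timeoutSeconds, TimeUnit.SECONDS);
        if (!finished) {
            process.destroyForcibly();
        }
        return finished;
    }

    public static void main(String[] args) {
        List<String> cmd = buildCommand(
                "D:/phantomjs-2.1.1-windows/bin/phantomjs.exe", "D:/s.js", "https://www.taobao.com");
        System.out.println(cmd.size()); // 3
    }
}
```

For the timeout to take effect, the process output has to be drained on another thread or redirected to a file (for example with `ProcessBuilder.redirectOutput`), since a loop that reads standard output until end-of-stream only returns once the process has already exited.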

Taobao crawler combining WebMagic and PhantomJS

package taobao;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Request;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.downloader.Downloader;
import us.codecraft.webmagic.processor.PageProcessor;
import util.GetAjaxHtml;
import util.UuidUtil;
import csdnblog.dao.TaobaoDao;
import csdnblog.model.Taobao;

public class TaobaoPageProcessor implements PageProcessor {

    private TaobaoDao taobaoDao = new TaobaoDao();

    // Site configuration for the crawl: encoding, crawl interval, retry count, and so on
    private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    @Override
    public Site getSite() {
        return site;
    }

    @Override
    public void process(Page page) {
        // Queue links to item detail pages
        page.addTargetRequests(page.getHtml().links()
                .regex(".*item\\.taobao\\.com/item\\.htm\\?id=.*")
                .all());
        // Queue links to further list pages
        page.addTargetRequests(page.getHtml().links()
                .regex("https://s\\.taobao\\.com/list.*")
                .all());

        // If this is a detail page, extract the item fields
        if (page.getUrl().regex("https://item\\.taobao\\.com/item\\.htm\\?id=.*").match()) {
            Taobao taobao = new Taobao();
            taobao.setId(UuidUtil.getId());
            taobao.setUrl(page.getUrl().toString());
            taobao.setMaintitle(page.getHtml().xpath("//h3[@class='tb-main-title']/text()").get());
            taobao.setSubtitle(page.getHtml().xpath("//p[@class='tb-subtitle']/text()").get());
            taobao.setPrice(page.getHtml().xpath("//strong[@id='J_StrPrice']/em[@class='tb-rmb-num']/text()").get());
            taobao.setTaobaoprice(page.getHtml().xpath("//em[@id='J_PromoPriceNum']/text()").get());
            taobao.setRatecounter(page.getHtml().xpath("//strong[@id='J_RateCounter']/text()").get());
            taobao.setSellcounter(page.getHtml().xpath("//strong[@id='J_SellCounter']/text()").get());

            // Save the object to the database
            taobaoDao.addTaobao(taobao);
            // Print the object to the console
            System.out.println(taobao.toString());
        }
    }

    public static void main(String[] args) {
        // Replace the default Downloader with one that renders pages through PhantomJS
        Spider.create(new TaobaoPageProcessor()).setDownloader(new Downloader() {

            @Override
            public void setThread(int threadNum) {
                // Nothing to configure: each download starts its own PhantomJS process
            }

            @Override
            public Page download(Request request, Task task) {
                return GetAjaxHtml.download(request);
            }

        }).addUrl("https://s.taobao.com/list?q=%E5%A4%B9%E5%85%8B&cat=50344007&style=grid&seller_type=taobao").thread(5).run();
    }
}
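The two URL patterns in `process` decide which pages are only followed and which are parsed as item detail pages. A small JDK-only sketch of that classification is shown here, assuming whole-string matching with `java.util.regex`; WebMagic's own selector may behave differently in details such as partial matches. `TaobaoUrlRules` and `classify` are hypothetical names, and the item id in the example is made up.

```java
import java.util.regex.Pattern;

class TaobaoUrlRules {

    // Same expressions as in the processor above.
    static final Pattern DETAIL =
            Pattern.compile("https://item\\.taobao\\.com/item\\.htm\\?id=.*");
    static final Pattern LIST =
            Pattern.compile("https://s\\.taobao\\.com/list.*");

    // Returns "detail", "list" or "other" depending on which pattern matches the whole URL.
    static String classify(String url) {
        if (DETAIL.matcher(url).matches()) {
            return "detail";
        }
        if (LIST.matcher(url).matches()) {
            return "list";
        }
        return "other";
    }

    public static void main(String[] args) {
        // The item id below is invented for the example.
        System.out.println(classify("https://item.taobao.com/item.htm?id=123456")); // detail
        System.out.println(classify(
                "https://s.taobao.com/list?q=%E5%A4%B9%E5%85%8B&cat=50344007&style=grid&seller_type=taobao")); // list
        System.out.println(classify("https://www.taobao.com")); // other
    }
}
```

Note that the detail-page check in the processor requires the `https://` prefix, so under this reading a detail URL served over plain `http://` would be followed but never parsed.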

Model

package csdnblog.model;

public class Taobao {

    private String id;
    private String maintitle;
    private String subtitle;
    // Page URL
    private String url;
    // Price
    private String price;
    // Taobao price
    private String taobaoprice;
    // Total number of reviews
    private String ratecounter;
    // Number of completed transactions
    private String sellcounter;

    public Taobao() {
        super();
    }

    public Taobao(String id, String maintitle, String subtitle, String url,
            String price, String taobaoprice, String ratecounter,
            String sellcounter) {
        super();
        this.id = id;
        this.maintitle = maintitle;
        this.subtitle = subtitle;
        this.url = url;
        this.price = price;
        this.taobaoprice = taobaoprice;
        this.ratecounter = ratecounter;
        this.sellcounter = sellcounter;
    }

    public String getId() {
        return id;
    }

    public void setId(String id) {
        this.id = id;
    }

    public String getMaintitle() {
        return maintitle;
    }

    public void setMaintitle(String maintitle) {
        this.maintitle = maintitle;
    }

    public String getSubtitle() {
        return subtitle;
    }

    public void setSubtitle(String subtitle) {
        this.subtitle = subtitle;
    }

    public String getUrl() {
        return url;
    }

    public void setUrl(String url) {
        this.url = url;
    }

    public String getPrice() {
        return price;
    }

    public void setPrice(String price) {
        this.price = price;
    }

    public String getTaobaoprice() {
        return taobaoprice;
    }

    public void setTaobaoprice(String taobaoprice) {
        this.taobaoprice = taobaoprice;
    }

    public String getRatecounter() {
        return ratecounter;
    }

    public void setRatecounter(String ratecounter) {
        this.ratecounter = ratecounter;
    }

    public String getSellcounter() {
        return sellcounter;
    }

    public void setSellcounter(String sellcounter) {
        this.sellcounter = sellcounter;
    }

    @Override
    public String toString() {
        return "Taobao [id=" + id + ", maintitle=" + maintitle + ", subtitle="
                + subtitle + ", url=" + url + ", price=" + price
                + ", taobaoprice=" + taobaoprice + ", ratecounter="
                + ratecounter + ", sellcounter=" + sellcounter + "]";
    }
}
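The processor also calls `UuidUtil.getId()`, whose source is not shown in this post. A minimal stand-in is sketched below, under the assumption that the helper only needs to return a unique string id; the hyphen-free 32-character format is a guess, not something the post confirms.

```java
import java.util.UUID;

// Hypothetical stand-in for util.UuidUtil; the real implementation is not shown in the post.
class UuidUtil {

    // Returns a random UUID as 32 lowercase hex characters (hyphens removed).
    static String getId() {
        return UUID.randomUUID().toString().replace("-", "");
    }

    public static void main(String[] args) {
        String id = getId();
        System.out.println(id.length()); // 32
    }
}
```

To use it from the processor as written, the class would need to be public and declared in the `util` package.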
