Learning Java: Web Crawlers

0x00 Preface

This post explains Java crawlers by comparing Python's crawling mechanism with Java's. For ordinary requirements, either Java or Python is up to the job.

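If you need to simulate logins or get past anti-scraping measures, Python is more convenient. If you need to process complex pages, turn page content into structured data, or parse page content in fine detail, Java is a good choice. In short: for simple data collection a Python crawler is enough; when the data has to be collected and stored in a specific structure, Java is the better fit.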

0x01 Basic GET and POST crawling

0x1 Basic GET and POST code

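import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

public class JAVA_TEST1 {
    public static void main(String[] args) throws IOException {
        HttpGet httpGet = new HttpGet("https://www.itcast.cn");
        // try-with-resources closes the response and the client, even on an exception
        try (CloseableHttpClient httpClient = HttpClients.createDefault(); // create a default client
             CloseableHttpResponse response = httpClient.execute(httpGet)) { // send the request once
            if (response.getStatusLine().getStatusCode() == 200) {
                // read the response body as a UTF-8 string
                String s = EntityUtils.toString(response.getEntity(), "utf-8");
                System.out.println(s);
            }
        }
    }
}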

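import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.utils.URIBuilder;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class JAVA_test02 {
    public static void main(String[] args) throws Exception {
        // create the HttpClient object
        CloseableHttpClient httpClient = HttpClients.createDefault();
        // build the URI with a query parameter, then create the HttpGet object
        URIBuilder uriBuilder = new URIBuilder("https://www.baidu.com");
        uriBuilder.setParameter("question", "hello");
        System.out.println(uriBuilder.build().toString());
        HttpGet httpGet = new HttpGet(uriBuilder.build());
        // send the request (once; the original sent it twice)
        CloseableHttpResponse response = httpClient.execute(httpGet);
        try {
            // process the crawled data
            if (response.getStatusLine().getStatusCode() == 200) {
                String s = EntityUtils.toString(response.getEntity(), "utf-8");
                System.out.println(s.length());
            }
        } finally {
            // close the response, then the client
            response.close();
            httpClient.close();
        }
    }
}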

0x2 Common methods

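1. CloseableHttpClient httpClient = HttpClients.createDefault(); creates an HttpClient with the default configuration.

2. HttpGet httpGet = new HttpGet(); creates an HttpGet request. The constructor accepts the URL directly as a String.

a. It also accepts a URI object, and a URI object can carry extra parameters:

URIBuilder uriBuilder = new URIBuilder("https://www.baidu.com");

uriBuilder.setParameter("param", "value");

HttpPost httpPost = new HttpPost(uriBuilder.build());

To add a parameter, create the URIBuilder and call setParameter: the first argument is the parameter name and the second is the parameter value. The result is a query string of the form param=value, the same shape as the id=1 you see in SQL injection examples.

b. CloseableHttpResponse response = httpClient.execute(httpPost);

This creates a CloseableHttpResponse that receives the return value.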

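3. Methods

EntityUtils.toString(response.getEntity()) // read the response body as a String

httpClient.execute(httpPost); // have the HttpClient object execute the request

response.getStatusLine().getStatusCode() == 200 // check whether the response status is 200 (OK)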

4. POST requests with parameters

List<NameValuePair> pairList = new ArrayList<NameValuePair>();

pairList.add(new BasicNameValuePair("question", "wwww"));

UrlEncodedFormEntity formEntity = new UrlEncodedFormEntity(pairList, "utf-8");

httpPost.setEntity(formEntity);

Because a POST request sends form data, we build a List whose element type is NameValuePair to hold the form fields, then convert that list into a UrlEncodedFormEntity and set it as the entity of the request.
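To make the difference between a URI parameter and a form entity concrete, here is a minimal sketch that runs without any third-party dependency. It is not HttpClient code: it swaps in the JDK's own HttpURLConnection as the client and the JDK's built-in HttpServer as a throwaway local server, and the class name FormPostSketch, the /echo path, and the helper names are all made up for illustration. The server simply echoes the query string and the request body it received.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.StringJoiner;

class FormPostSketch {

    // Encodes name/value pairs as name1=value1&name2=value2,
    // URL-encoding every name and value (hypothetical helper).
    static String encodePairs(Map<String, String> pairs) {
        StringJoiner joiner = new StringJoiner("&");
        for (Map.Entry<String, String> e : pairs.entrySet()) {
            joiner.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return joiner.toString();
    }

    // Starts a local echo server, sends one POST to it, and returns
    // what the server reports it saw in the query string and the body.
    static String roundTrip(Map<String, String> query, Map<String, String> form) {
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/echo", exchange -> {
                String seenQuery = exchange.getRequestURI().getRawQuery();
                String seenBody = new String(
                        exchange.getRequestBody().readAllBytes(), StandardCharsets.UTF_8);
                byte[] reply = ("query=[" + seenQuery + "] body=[" + seenBody + "]")
                        .getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(200, reply.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(reply);
                }
            });
            server.start();
            try {
                // Query parameters travel in the URL ...
                URI uri = URI.create("http://127.0.0.1:" + server.getAddress().getPort()
                        + "/echo?" + encodePairs(query));
                HttpURLConnection conn = (HttpURLConnection) uri.toURL().openConnection();
                conn.setRequestMethod("POST");
                conn.setDoOutput(true);
                conn.setRequestProperty("Content-Type",
                        "application/x-www-form-urlencoded; charset=UTF-8");
                // ... while form fields travel in the request body.
                try (OutputStream out = conn.getOutputStream()) {
                    out.write(encodePairs(form).getBytes(StandardCharsets.UTF_8));
                }
                try (InputStream in = conn.getInputStream()) {
                    return new String(in.readAllBytes(), StandardCharsets.UTF_8);
                }
            } finally {
                server.stop(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // prints: query=[param=value] body=[question=wwww]
        System.out.println(roundTrip(Map.of("param", "value"), Map.of("question", "wwww")));
    }
}
```

Both parts use the same name=value encoding; the only difference is where they are carried. That is why a URIBuilder parameter shows up in the printed URL while a UrlEncodedFormEntity does not.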

0x03 Connection pool

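Creating an HttpClient object for every request, and opening and closing it each time, is tedious, so there is a connection pool that manages connections automatically.

PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager(); creates the connection pool

public void setMaxTotal(int max) sets the maximum total number of connections

public void setDefaultMaxPerRoute(int max) sets the maximum number of concurrent connections per host (route)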
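The way the two limits interact can be sketched with plain semaphores. This is only an illustration of the idea, not how PoolingHttpClientConnectionManager is implemented; the class PoolLimitSketch and its tryLease and release methods are hypothetical names. It also fails immediately when a limit is hit, whereas a real pool would make the caller wait for a free connection.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

class PoolLimitSketch {
    private final Semaphore total;                 // plays the role of setMaxTotal
    private final int maxPerRoute;                 // plays the role of setDefaultMaxPerRoute
    private final Map<String, Semaphore> perRoute = new ConcurrentHashMap<>();

    PoolLimitSketch(int maxTotal, int maxPerRoute) {
        this.total = new Semaphore(maxTotal);
        this.maxPerRoute = maxPerRoute;
    }

    // Tries to take one connection slot for the host without blocking.
    // Succeeds only if both the per-host limit and the total limit allow it.
    boolean tryLease(String host) {
        Semaphore route = perRoute.computeIfAbsent(host, h -> new Semaphore(maxPerRoute));
        if (!route.tryAcquire()) {
            return false;                          // per-host limit reached
        }
        if (!total.tryAcquire()) {
            route.release();                       // undo the per-host slot
            return false;                          // total limit reached
        }
        return true;
    }

    // Gives a slot back. Callers must only release slots they leased.
    void release(String host) {
        Semaphore route = perRoute.get(host);
        if (route != null) {
            route.release();
            total.release();
        }
    }

    public static void main(String[] args) {
        PoolLimitSketch pool = new PoolLimitSketch(3, 2);
        System.out.println(pool.tryLease("a.example")); // true
        System.out.println(pool.tryLease("a.example")); // true
        System.out.println(pool.tryLease("a.example")); // false: 2 per host
        System.out.println(pool.tryLease("b.example")); // true
        System.out.println(pool.tryLease("b.example")); // false: 3 in total
    }
}
```

With a total of 3 and 2 per host, a third request to the same host is refused even though the pool is not full, and a fourth request overall is refused even though its host is under its own limit.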

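HttpClient configuration

setConnectTimeout(1000) // maximum time to establish the connection, in milliseconds

setConnectionRequestTimeout(500) // maximum time to wait for a connection from the pool, in milliseconds

setSocketTimeout(500).build(); // maximum time to wait for data during transfer, in milliseconds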
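What a data-transfer timeout does can be seen with a server that accepts a connection but never answers. The sketch below is an assumption-laden stand-in rather than HttpClient code: it uses the JDK's HttpURLConnection, whose setReadTimeout plays roughly the role of setSocketTimeout and whose setConnectTimeout matches setConnectTimeout. There is no counterpart here for setConnectionRequestTimeout, because this client has no pool to wait on. The class name ReadTimeoutSketch and the method timesOut are made up for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.SocketTimeoutException;
import java.net.URI;

class ReadTimeoutSketch {

    // Connects to a local socket that never sends a response and reports
    // whether the read gave up after readTimeoutMillis.
    static boolean timesOut(int readTimeoutMillis) {
        // The listening socket completes the TCP handshake but we never
        // accept or reply, so the client is left waiting for data.
        try (ServerSocket silent = new ServerSocket(0, 1, InetAddress.getByName("127.0.0.1"))) {
            URI uri = URI.create("http://127.0.0.1:" + silent.getLocalPort() + "/");
            HttpURLConnection conn = (HttpURLConnection) uri.toURL().openConnection();
            conn.setConnectTimeout(1000);            // limit for establishing the connection
            conn.setReadTimeout(readTimeoutMillis);  // limit for waiting on response data
            try {
                conn.getResponseCode();              // sends the request and waits for a reply
                return false;                        // a reply arrived (not expected here)
            } catch (SocketTimeoutException e) {
                return true;                         // no data within the read timeout
            } finally {
                conn.disconnect();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("timed out: " + timesOut(200));
    }
}
```

Without a read timeout the call would wait indefinitely on such a server, which is why a crawler should set these limits explicitly.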


package is.text;

import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

public class gethttp1params {
    public static void main(String[] args) throws IOException {
        HttpGet httpGet = new HttpGet("http://www.baidu.com");
        RequestConfig config = RequestConfig.custom()
                .setConnectTimeout(1000)          // maximum time to establish the connection (ms)
                .setConnectionRequestTimeout(500) // maximum time to wait for a pooled connection (ms)
                .setSocketTimeout(500)            // maximum time to wait for data during transfer (ms)
                .build();
        httpGet.setConfig(config);
        // try-with-resources closes the response and the client when done
        try (CloseableHttpClient client = HttpClients.createDefault();
             CloseableHttpResponse response = client.execute(httpGet)) {
            String s = EntityUtils.toString(response.getEntity());
            System.out.println(s);
        }
    }
}

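import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

public class JAVA_test05 {
    public static void main(String[] args) {
        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
        cm.setMaxTotal(10);          // at most 10 connections in total
        cm.setDefaultMaxPerRoute(2); // at most 2 concurrent connections per host
        doget(cm);
        doget(cm);
        dopost(cm);
    }

    private static void dopost(PoolingHttpClientConnectionManager cm) {
        // The client is not closed here: closing it would also shut down the shared pool.
        CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(cm).build();
        HttpPost httpPost = new HttpPost("https://www.baidu.com");
        // Closing the response hands the connection back to the pool.
        try (CloseableHttpResponse response = httpClient.execute(httpPost)) {
            if (response.getStatusLine().getStatusCode() == 200) {
                // Nothing to print; drain the body so the connection can be reused.
                EntityUtils.consume(response.getEntity());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private static void doget(PoolingHttpClientConnectionManager cm) {
        CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(cm).build();
        HttpGet httpGet = new HttpGet("http://www.baidu.com");
        // Send the request once (the original sent it twice and leaked a connection).
        try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
            if (response.getStatusLine().getStatusCode() == 200) {
                String string = EntityUtils.toString(response.getEntity());
                System.out.println(string);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}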

0x04 Afterword and summary

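Next I plan to hand-write a crawler and try to scrape some articles. Java crawlers rely mainly on Jsoup, which covers what regular expressions do in a Python crawler and also fetches and parses HTML. Because that involves some CSS and HTML background, I will cover it in a separate post.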