简介 : HttpClient是Apache Jakarta Common下的子项目,用于提供高效的,功能丰富的支持HTTP协议的客户编程工具包,其主要功能如下:

  1. 实现了所有HTTP的方法 : GET,POST,PUT,HEAD ..
  2. 支持自动重定向
  3. 支持HTTPS协议
  4. 支持代理服务器

关于Http请求的方法说明,参考大佬整理的博客:

https://www.cnblogs.com/williamjie/p/9099940.html

一、环境准备

1 JDK1.8

2 IntelliJ IDEA

3 IDEA自带的Maven

创建Maven工程itcast-crawler-first并给pom.xml加入依赖

关于xml语言的介绍

<dependencies>
<!-- HttpClient -->
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
<version>4.5.3</version>
</dependency> <!-- 日志 -->
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>1.7.25</version>
</dependency>
</dependencies>

关于日志的配置文件

log4j.rootLogger=DEBUG,A1
log4j.logger.cn.itcast = DEBUG log4j.appender.A1=org.apache.log4j.ConsoleAppender
log4j.appender.A1.layout=org.apache.log4j.PatternLayout
log4j.appender.A1.layout.ConversionPattern=%-d{yyyy-MM-dd HH:mm:ss,SSS} [%t] [%c]-[%p] %m%n

log4j可以将日志以文件的形式输出,也可以输出打印在控制台上,同时可以设置输出的日志内容显示格式、日志文件的生成方式(追加、覆盖、设置日志文件大小等等)。我这里就是直接将日志打印到控制台上。org.apache.log4j.ConsoleAppender

二、编写代码

编写最简单的爬虫,抓取传智播客首页:http://www.itcast.cn/

public class CrawlerFirst {

    public static void main(String[] args) throws Exception {
//1. 打开浏览器,创建HttpClient对象
CloseableHttpClient httpClient = HttpClients.createDefault(); //2. 输入网址,发起get请求创建HttpGet对象
HttpGet httpGet = new HttpGet("http://www.itcast.cn"); //3.按回车,发起请求,返回响应,使用HttpClient对象发起请求
CloseableHttpResponse response = httpClient.execute(httpGet); //4. 解析响应,获取数据
//判断状态码是否是200
if (response.getStatusLine().getStatusCode() == 200) {
HttpEntity httpEntity = response.getEntity();
String content = EntityUtils.toString(httpEntity, "utf8"); System.out.println(content);
}
}
}

三、Get请求

package cn.itcast.crawler.test;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils; import java.io.IOException; public class HttpGetTest { public static void main(String[] args) {
//创建HttpClient对象
CloseableHttpClient httpClient = HttpClients.createDefault(); //创建HttpGet对象,设置url访问地址
HttpGet httpGet = new HttpGet("http://www.itcast.cn"); CloseableHttpResponse response = null;
try {
//使用HttpClient发起请求,获取response
response = httpClient.execute(httpGet); //解析响应
if (response.getStatusLine().getStatusCode() == 200) {
String content = EntityUtils.toString(response.getEntity(), "utf8");
System.out.println(content.length());
} } catch (IOException e) {
e.printStackTrace();
}finally {
//关闭response
try {
response.close();
} catch (IOException e) {
e.printStackTrace();
}
try {
httpClient.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
}

四、带参数的GET请求

public class HttpGetParamTest {

    public static void main(String[] args) throws Exception {
//创建HttpClient对象
CloseableHttpClient httpClient = HttpClients.createDefault(); //设置请求地址是:http://yun.itheima.com/search?keys=Java
//创建URIBuilder
URIBuilder uriBuilder = new URIBuilder("http://yun.itheima.com/search");
//设置参数
uriBuilder.setParameter("keys","Java"); //创建HttpGet对象,设置url访问地址
HttpGet httpGet = new HttpGet(uriBuilder.build()); System.out.println("发起请求的信息:"+httpGet); CloseableHttpResponse response = null;
try {
//使用HttpClient发起请求,获取response
response = httpClient.execute(httpGet); //解析响应
if (response.getStatusLine().getStatusCode() == 200) {
String content = EntityUtils.toString(response.getEntity(), "utf8");
System.out.println(content.length());
} } catch (IOException e) {
e.printStackTrace();
}finally {
//关闭response
try {
response.close();
} catch (IOException e) {
e.printStackTrace();
}
try {
httpClient.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
}

五、POST请求

public class HttpPostTest {

    public static void main(String[] args)  {
//创建HttpClient对象
CloseableHttpClient httpClient = HttpClients.createDefault(); //创建HttpPost对象,设置url访问地址
HttpPost httpPost = new HttpPost("http://www.itcast.cn"); CloseableHttpResponse response = null;
try {
//使用HttpClient发起请求,获取response
response = httpClient.execute(httpPost); //解析响应
if (response.getStatusLine().getStatusCode() == 200) {
String content = EntityUtils.toString(response.getEntity(), "utf8");
System.out.println(content.length());
} } catch (IOException e) {
e.printStackTrace();
}finally {
//关闭response
try {
response.close();
} catch (IOException e) {
e.printStackTrace();
}
try {
httpClient.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
}

六、带参数的POST请求

public class HttpPostParamTest {

    public static void main(String[] args) throws Exception {
//创建HttpClient对象
CloseableHttpClient httpClient = HttpClients.createDefault(); //创建HttpPost对象,设置url访问地址
HttpPost httpPost = new HttpPost("http://yun.itheima.com/search"); //声明List集合,封装表单中的参数
List<NameValuePair> params = new ArrayList<NameValuePair>();
//设置请求地址是:http://yun.itheima.com/search?keys=Java
params.add(new BasicNameValuePair("keys","Java")); //创建表单的Entity对象,第一个参数就是封装好的表单数据,第二个参数就是编码
UrlEncodedFormEntity formEntity = new UrlEncodedFormEntity(params,"utf8"); //设置表单的Entity对象到Post请求中
httpPost.setEntity(formEntity); CloseableHttpResponse response = null;
try {
//使用HttpClient发起请求,获取response
response = httpClient.execute(httpPost); //解析响应
if (response.getStatusLine().getStatusCode() == 200) {
String content = EntityUtils.toString(response.getEntity(), "utf8");
System.out.println(content.length());
} } catch (IOException e) {
e.printStackTrace();
}finally {
//关闭response
try {
response.close();
} catch (IOException e) {
e.printStackTrace();
}
try {
httpClient.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
}

七、HTTP连接池

如果每次请求都要创建HttpClient,会有频繁创建和销毁的问题,可以使用连接池来解决这个问题。

测试以下代码,并断点查看每次获取的HttpClient都是不一样的。

public class HttpClientPoolTest {

    public static void main(String[] args) {
//创建连接池管理器
PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager(); //设置最大连接数
cm.setMaxTotal(100); //设置每个主机的最大连接数
cm.setDefaultMaxPerRoute(10); //使用连接池管理器发起请求
doGet(cm);
doGet(cm);
} private static void doGet(PoolingHttpClientConnectionManager cm) {
//不是每次创建新的HttpClient,而是从连接池中获取HttpClient对象
CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(cm).build(); HttpGet httpGet = new HttpGet("http://www.itcast.cn"); CloseableHttpResponse response = null;
try {
response = httpClient.execute(httpGet); if (response.getStatusLine().getStatusCode() == 200) {
String content = EntityUtils.toString(response.getEntity(), "utf8"); System.out.println(content.length());
} } catch (IOException e) {
e.printStackTrace();
}finally {
if (response != null) {
try {
response.close();
} catch (IOException e) {
e.printStackTrace();
}
//不能关闭HttpClient,由连接池管理HttpClient
//httpClient.close();
}
} }
}

八、RequestConfig配置连接信息

在构建网络爬虫时,经常需要配置很多信息,例如RequestTimeout(连接池获取到连接的超时时间)、ConnectTimeout(建立连接的超时)、SocketTimeout(获取数据的超时时间)、代理、是否允许重定向等信息。
在HttpClient,实现这些配置需要使用到RequestConfig类的一个内部类Builder。

如下为Builder的源码如下,代码太长了,我直接折叠了。

    public static class Builder {

        private boolean expectContinueEnabled;
private HttpHost proxy;
private InetAddress localAddress;
private boolean staleConnectionCheckEnabled;
private String cookieSpec;
private boolean redirectsEnabled;
private boolean relativeRedirectsAllowed;
private boolean circularRedirectsAllowed;
private int maxRedirects;
private boolean authenticationEnabled;
private Collection<String> targetPreferredAuthSchemes;
private Collection<String> proxyPreferredAuthSchemes;
private int connectionRequestTimeout;
private int connectTimeout;
private int socketTimeout;
private boolean contentCompressionEnabled; Builder() {
super();
this.staleConnectionCheckEnabled = false;
this.redirectsEnabled = true;
this.maxRedirects = 50;
this.relativeRedirectsAllowed = true;
this.authenticationEnabled = true;
this.connectionRequestTimeout = -1;
this.connectTimeout = -1;
this.socketTimeout = -1;
this.contentCompressionEnabled = true;
} public Builder setExpectContinueEnabled(final boolean expectContinueEnabled) {
this.expectContinueEnabled = expectContinueEnabled;
return this;
} public Builder setProxy(final HttpHost proxy) {
this.proxy = proxy;
return this;
} public Builder setLocalAddress(final InetAddress localAddress) {
this.localAddress = localAddress;
return this;
} /**
* @deprecated (4.4) Use {@link
* org.apache.http.impl.conn.PoolingHttpClientConnectionManager#setValidateAfterInactivity(int)}
*/
@Deprecated
public Builder setStaleConnectionCheckEnabled(final boolean staleConnectionCheckEnabled) {
this.staleConnectionCheckEnabled = staleConnectionCheckEnabled;
return this;
} public Builder setCookieSpec(final String cookieSpec) {
this.cookieSpec = cookieSpec;
return this;
} public Builder setRedirectsEnabled(final boolean redirectsEnabled) {
this.redirectsEnabled = redirectsEnabled;
return this;
} public Builder setRelativeRedirectsAllowed(final boolean relativeRedirectsAllowed) {
this.relativeRedirectsAllowed = relativeRedirectsAllowed;
return this;
} public Builder setCircularRedirectsAllowed(final boolean circularRedirectsAllowed) {
this.circularRedirectsAllowed = circularRedirectsAllowed;
return this;
} public Builder setMaxRedirects(final int maxRedirects) {
this.maxRedirects = maxRedirects;
return this;
} public Builder setAuthenticationEnabled(final boolean authenticationEnabled) {
this.authenticationEnabled = authenticationEnabled;
return this;
} public Builder setTargetPreferredAuthSchemes(final Collection<String> targetPreferredAuthSchemes) {
this.targetPreferredAuthSchemes = targetPreferredAuthSchemes;
return this;
} public Builder setProxyPreferredAuthSchemes(final Collection<String> proxyPreferredAuthSchemes) {
this.proxyPreferredAuthSchemes = proxyPreferredAuthSchemes;
return this;
} public Builder setConnectionRequestTimeout(final int connectionRequestTimeout) {
this.connectionRequestTimeout = connectionRequestTimeout;
return this;
} public Builder setConnectTimeout(final int connectTimeout) {
this.connectTimeout = connectTimeout;
return this;
} public Builder setSocketTimeout(final int socketTimeout) {
this.socketTimeout = socketTimeout;
return this;
} /**
* @deprecated (4.5) Set {@link #setContentCompressionEnabled(boolean)} to {@code false} and
* add the {@code Accept-Encoding} request header.
*/
@Deprecated
public Builder setDecompressionEnabled(final boolean decompressionEnabled) {
this.contentCompressionEnabled = decompressionEnabled;
return this;
} public Builder setContentCompressionEnabled(final boolean contentCompressionEnabled) {
this.contentCompressionEnabled = contentCompressionEnabled;
return this;
} public RequestConfig build() {
return new RequestConfig(
expectContinueEnabled,
proxy,
localAddress,
staleConnectionCheckEnabled,
cookieSpec,
redirectsEnabled,
relativeRedirectsAllowed,
circularRedirectsAllowed,
maxRedirects,
authenticationEnabled,
targetPreferredAuthSchemes,
proxyPreferredAuthSchemes,
connectionRequestTimeout,
connectTimeout,
socketTimeout,
contentCompressionEnabled);
} }

超时相关配置

HttpClient中可设置三个超时:RequestTimeout(连接池获取到连接的超时时间)、ConnectTimeout(建立连接的超时)、SocketTimeout(获取数据的超时时间)。使用RequestConfig进行配置的示例程序如下:

        //全部设置为10秒
RequestConfig requestConfig = RequestConfig.custom()
.setSocketTimeout(10000)
.setConnectTimeout(10000)
.setConnectionRequestTimeout(10000)
.build();
//配置httpClient
HttpClient httpClient = HttpClients.custom()
.setDefaultRequestConfig(requestConfig)
.build();

代理配置

RequestConfig defaultRequestConfig = RequestConfig.custom()
.setProxy(new HttpHost("171.97.67.160", 3128, null))
.build(); //添加代理
HttpClient httpClient = HttpClients.custom().
setDefaultRequestConfig(defaultRequestConfig).build(); //配置httpClient

Java网络爬虫 HttpClient的更多相关文章

  1. 学 Java 网络爬虫,需要哪些基础知识?

    说起网络爬虫,大家想起的估计都是 Python ,诚然爬虫已经是 Python 的代名词之一,相比 Java 来说就要逊色不少.有不少人都不知道 Java 可以做网络爬虫,其实 Java 也能做网络爬 ...

  2. Java 网络爬虫,就是这么的简单

    这是 Java 网络爬虫系列文章的第一篇,如果你还不知道 Java 网络爬虫系列文章,请参看 学 Java 网络爬虫,需要哪些基础知识.第一篇是关于 Java 网络爬虫入门内容,在该篇中我们以采集虎扑 ...

  3. Java网络爬虫笔记

    Java网络爬虫笔记 HttpClient来代替浏览器发起请求. select找到的是元素,也就是elements,你想要获取具体某一个属性的值,还是要用attr("")方法.标签 ...

  4. Java 网络爬虫获取网页源代码原理及实现

    Java 网络爬虫获取网页源代码原理及实现 1.网络爬虫是一个自动提取网页的程序,它为搜索引擎从万维网上下载网页,是搜索引擎的重要组成.传统爬虫从一个或若干初始网页的URL开始,获得初始网页上的URL ...

  5. java网络爬虫基础学习(三)

    尝试直接请求URL获取资源 豆瓣电影 https://movie.douban.com/explore#!type=movie&tag=%E7%83%AD%E9%97%A8&sort= ...

  6. java网络爬虫基础学习(一)

    刚开始接触java爬虫,在这里是搜索网上做一些理论知识的总结 主要参考文章:gitchat 的java 网络爬虫基础入门,好像要付费,也不贵,感觉内容对新手很友好. 一.爬虫介绍 网络爬虫是一个自动提 ...

  7. 开源的49款Java 网络爬虫软件

    参考地址 搜索引擎 Nutch Nutch 是一个开源Java 实现的搜索引擎.它提供了我们运行自己的搜索引擎所需的全部工具.包括全文搜索和Web爬虫. Nutch的创始人是Doug Cutting, ...

  8. 【转】44款Java 网络爬虫开源软件

    原帖地址 http://www.oschina.net/project/lang/19?tag=64&sort=time 极简网络爬虫组件 WebFetch WebFetch 是无依赖极简网页 ...

  9. [原创]一款基于Reactor线程模型的java网络爬虫框架

    AJSprider 概述 AJSprider是笔者基于Reactor线程模式+Jsoup+HttpClient封装的一款轻量级java多线程网络爬虫框架,简单上手,小白也能玩爬虫, 使用本框架,只需要 ...

随机推荐

  1. javascript基础修炼(13)——记一道有趣的JS脑洞练习题【华为云技术分享】

    版权声明:本文为博主原创文章,遵循CC 4.0 BY-SA版权协议,转载请附上原文出处链接和本声明. 本文链接:https://blog.csdn.net/devcloud/article/detai ...

  2. 挑战10个最难的Java面试题(附答案)【上】【华为云技术分享】

    版权声明:本文为博主原创文章,遵循 CC 4.0 BY-SA 版权协议,转载请附上原文出处链接和本声明.本文链接:https://blog.csdn.net/devcloud/article/deta ...

  3. 机器学习笔记(六) ---- 支持向量机(SVM)

    支持向量机(SVM)可以说是一个完全由数学理论和公式进行应用的一种机器学习算法,在小批量数据分类上准确度高.性能好,在二分类问题上有广泛的应用. 同样是二分类算法,支持向量机和逻辑回归有很多相似性,都 ...

  4. 19.JAVA-从文件中解析json、并写入Json文件(详解)

    1.json介绍 json与xml相比, 对数据的描述性比XML较差,但是数据体积小,传递速度更快. json数据的书写格式是"名称:值对",比如: "Name" ...

  5. 使用 nginx 实现虚拟主机

    当多个系统需要部署的时候,有系统访问很小,为了节省成本,就需要将多个系统部署到同一台服务器上,怎么在同一台服务器上,完成不同系统的部署和访问,就需要使用虚拟主机实现. 使用端口实现虚拟主机 配置 ng ...

  6. [TimLinux] CSS 纯CSS实现动画展开/收起功能

    内容转自CSS世界,理解之后进行了简化,简化后代码: <!DOCTYPE html> <html> <head> <meta charset=utf-8 /& ...

  7. 2019牛客全国多校第八场A题 All-one Matrices(单调栈)

    题意:让你找最大不可扩展全1子矩阵的数量: 题解:考虑枚举每一行为全1子矩阵的的底,然后从左到右枚举:up[i][j]:表示(i,j)这个位置向上可扩展多少,同时还有记录每个位置(i,j)向左最多可扩 ...

  8. ACM-ICPC 2018 焦作赛区网络预赛 I题 Save the Room

    Bob is a sorcerer. He lives in a cuboid room which has a length of AA, a width of BB and a height of ...

  9. 经典常用SQL语句大全

    创建表 --删除表 --DROP TABLE [dbo].[Test] --创建表 CREATE TABLE [dbo].[Test] ( ,) PRIMARY KEY, ----自增主键 ) NUL ...

  10. Android中实现异步轮询上传文件

    前言 前段时间要求项目中需要实现一个刷卡考勤的功能,因为涉及到上传图片文件,为加快考勤的速度,封装了一个异步轮询上传文件的帮助类 效果  先上效果图 设计思路 数据库使用的框架是GreenDao,一个 ...