Using HttpClient

The process of fetching a web page with HttpClient:

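  1. Create an instance of the CloseableHttpClient class.

  2. Use that instance to execute an HTTP request and obtain an HttpResponse instance.

  3. Finally, read the returned byte stream from the HttpResponse instance; the stream is wrapped in an HttpEntity. Convert the bytes to a string using the specified character set, which completes the download.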
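Step 3 depends on choosing the right character set. The following is a minimal sketch of that conversion using only JDK classes, not HttpClient itself; the class name CharsetDemo and the helper decode are made up for illustration, and the byte array stands in for a downloaded response body.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows why the charset used for decoding matters.
class CharsetDemo {

    // Turn the raw bytes of a response body into a string using the named charset.
    static String decode(byte[] body, String charsetName) {
        return new String(body, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // "\u7f51\u9875" is the two-character Chinese word for "web page";
        // encoded as UTF-8 it occupies six bytes.
        byte[] body = "\u7f51\u9875".getBytes(StandardCharsets.UTF_8);

        // Decoding with the charset the bytes were written in restores the text.
        System.out.println(decode(body, "UTF-8"));

        // Decoding with a different charset yields six unrelated characters.
        System.out.println(decode(body, "ISO-8859-1"));
    }
}
```

If the charset passed to the decoder does not match the one the page was written in, the download still succeeds but the text comes out garbled, which is why the examples in this post pass the encoding explicitly.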

The CloseableHttpClient class holds some global state. The following code creates an instance of it.

CloseableHttpClient httpclient = HttpClientBuilder.create().build();

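Creating a client is like opening a browser. HttpClient supports all of the HTTP methods defined in HTTP/1.1, and each method type has its own class. The one crawlers use most is org.apache.http.client.methods.HttpGet, which represents the HTTP GET method; sticking to plain GET requests helps avoid accidentally crawling data that is visible only after logging in.

// Create a GET request, like typing an address into the browser's address bar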

HttpGet httpget = new HttpGet("http://www.lietu.com/");

Use the CloseableHttpClient to execute the GET request.

// Like pressing Enter in the address bar: fetch the page content

HttpResponse response = httpclient.execute(httpget);

Inspect the returned content, which is like viewing the page source in a browser.

HttpEntity entity = response.getEntity();

if (entity != null) {

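    // Read the content stream and return it as a string; the page encoding is given here as UTF-8

    System.out.println(EntityUtils.toString(entity, "utf-8")); // the page's meta tag declares the encoding

    EntityUtils.consume(entity); // close the content stream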
}

Finally, release the connection that was established with the web server.

httpclient.close();

Wrap the page download with HttpClient in a method.

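public static String downloadPage(String path) throws IOException {
    // Create a client, like opening a browser; try-with-resources closes it on exit
    try (CloseableHttpClient httpclient = HttpClientBuilder.create().build()) {
        // Create a GET request, like typing an address into the browser's address bar
        HttpGet httpget = new HttpGet(path);
        // Like pressing Enter in the address bar: fetch the page content
        HttpResponse response = httpclient.execute(httpget);
        // Inspect the returned content, like viewing the page source in a browser
        HttpEntity entity = response.getEntity();
        if (entity != null) {
            // Read the content stream and return it as a string; the charset passed
            // here (GBK) must match the one the page declares in its meta tag
            String html = EntityUtils.toString(entity, "GBK");
            EntityUtils.consume(entity); // close the content stream
            return html;
        }
        return null;
    }
}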

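EntityUtils.consume is called to close the content stream. A better choice is EntityUtils.consumeQuietly(entity), which makes sure the entity is fully consumed and does not throw if an I/O error occurs along the way.

In this program, the crawler sends a GET request like the following to fetch the page.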

GET / HTTP/1.1
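The line above is only the request line. As a rough sketch of what the rest of a minimal request looks like on the wire, the hypothetical helper below assembles the text by hand; HTTP/1.1 requires a Host header, and a real HttpClient request also carries further headers (User-Agent and so on) that are left out here.

```java
// Hypothetical demo class: builds the text of a minimal HTTP/1.1 GET request.
class GetRequestDemo {

    // Each header line ends with CRLF, and an empty line terminates the header block.
    static String buildGetRequest(String host, String path) {
        return "GET " + path + " HTTP/1.1\r\n"
             + "Host: " + host + "\r\n"
             + "Connection: close\r\n"
             + "\r\n";
    }

    public static void main(String[] args) {
        // The host matches the URL used in the earlier example.
        System.out.print(buildGetRequest("www.lietu.com", "/"));
    }
}
```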

The simplest way to get a string from the response:

BasicResponseHandler handler = new BasicResponseHandler();

String content = httpclient.execute(httpget, handler);

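If you use BasicResponseHandler, you have to handle the exceptions it raises yourself: it throws an HttpResponseException when the status code is 300 or above. For example, when the server answers Service Unavailable, you need to write the retry code yourself.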

BasicResponseHandler handler = new BasicResponseHandler();
String content = null;
do {
    try {
        content = httpclient.execute(httpget, handler);
    } catch (org.apache.http.client.HttpResponseException ex) {
        ex.printStackTrace();
        System.out.println("retry...");
        Thread.sleep(3000); // wait 3 seconds before trying again
    }
} while (content == null);
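A loop of this kind retries forever if the server never recovers. One possible variation, sketched below with JDK classes only, caps the number of attempts; the class RetryDemo, the method retry, and its parameters are hypothetical names, and the Callable stands in for a call such as httpclient.execute(httpget, handler).

```java
import java.util.concurrent.Callable;

// Hypothetical demo class: retry a task a bounded number of times.
class RetryDemo {

    // Runs the task until it succeeds or maxAttempts is reached,
    // sleeping delayMillis between failed attempts.
    static <T> T retry(Callable<T> task, int maxAttempts, long delayMillis) {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception ex) {
                last = ex; // remember the failure so it can be reported if we give up
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(delayMillis);
                    } catch (InterruptedException ie) {
                        // Restore the interrupt flag and stop retrying.
                        Thread.currentThread().interrupt();
                        break;
                    }
                }
            }
        }
        throw new IllegalStateException("request failed, giving up", last);
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated request: fails twice, then succeeds on the third call.
        String content = retry(() -> {
            if (++calls[0] < 3) {
                throw new java.io.IOException("Service Unavailable");
            }
            return "page content";
        }, 5, 100L);
        System.out.println(content + " after " + calls[0] + " attempts");
    }
}
```

With the real client, the lambda body would be the execute call, and the attempt limit and delay would be tuned to how long the crawler can afford to wait for a single URL.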

When we do not want to spend too long waiting for a download from a given URL to finish, we need to set timeouts.

// Configuration parameters (in milliseconds)
int socketTimeout = 9000;     // timeout for reading data
int connectionTimeout = 9000; // timeout for establishing the connection

// Request configuration
RequestConfig requestConfig = RequestConfig.custom()
        .setConnectTimeout(connectionTimeout)
        .setConnectionRequestTimeout(connectionTimeout)
        .setSocketTimeout(socketTimeout)
        .build();

CloseableHttpClient httpClient = HttpClientBuilder.create().setDefaultRequestConfig(requestConfig).build();

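Form POST

A submitted parameter consists of two parts, a name and a value. NameValuePair is an interface and BasicNameValuePair is an implementation of it; use BasicNameValuePair to wrap a name/value pair. For example, if the parameter named cityId has the value 1, the code looks like this.

new BasicNameValuePair("cityId", "1");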

The following code simulates submitting a form and prints the result.

        // Build the client the same way as in the earlier examples
        CloseableHttpClient httpclient = HttpClientBuilder.create().build();

        // Use HttpPost to send a POST request
        HttpPost httppost = new HttpPost("http://hotels.ctrip.com/Domestic/ShowHotelList.aspx");

        // POST data
        List<NameValuePair> nameValuePairs = new ArrayList<NameValuePair>(3); // 3 parameters
        nameValuePairs.add(new BasicNameValuePair("checkin", "2011-4-15"));  // check-in date
        nameValuePairs.add(new BasicNameValuePair("checkout", "2011-4-25")); // check-out date
        nameValuePairs.add(new BasicNameValuePair("cityId", "1"));           // city code
        httppost.setEntity(new UrlEncodedFormEntity(nameValuePairs));

        // Execute the HTTP POST request
        HttpResponse response = httpclient.execute(httppost);

        // Get the content stream
        HttpEntity entity = response.getEntity();
        InputStream is = entity.getContent();
        BufferedInputStream bis = new BufferedInputStream(is);
        ByteArrayBuffer baf = new ByteArrayBuffer(20);

        // Read the content stream byte by byte into the byte array buffer
        int current = 0;
        while ((current = bis.read()) != -1) {
            baf.append((byte) current);
        }

        String text = new String(baf.toByteArray(), "gb2312"); // specify the encoding
        System.out.println(text);

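The example above shows how to access a web resource with the POST method. Unlike GET, POST sends its data in the request body, which can even be binary, so it can carry a practically "unlimited" number of parameters. GET writes the parameters into the URL, and because URLs are subject to length limits in practice, the total length of the parameters that can be passed is limited.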
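To make the difference concrete, here is a small sketch using only JDK classes (not the library's own implementation) of the application/x-www-form-urlencoded text that a form submission carries. The class FormEncodingDemo and its methods are hypothetical names; the parameters are the ones from the example above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical demo class: encodes name/value pairs the way an HTML form does.
class FormEncodingDemo {

    // Percent-encode one name or value using UTF-8.
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    // Join the pairs as name=value&name=value; a POST sends this text in the body.
    static String toFormBody(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(encode(e.getKey())).append('=').append(encode(e.getValue()));
        }
        return sb.toString();
    }

    // A GET carries the same text in the URL, after a question mark.
    static String toGetUrl(String baseUrl, Map<String, String> params) {
        return baseUrl + "?" + toFormBody(params);
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("checkin", "2011-4-15");
        params.put("checkout", "2011-4-25");
        params.put("cityId", "1");
        // prints checkin=2011-4-15&checkout=2011-4-25&cityId=1
        System.out.println(toFormBody(params));
        System.out.println(toGetUrl("http://hotels.ctrip.com/Domestic/ShowHotelList.aspx", params));
    }
}
```

The encoded text is identical in both cases; only its location differs, which is why the length restriction applies to GET and not to POST.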