Extracting Web Page Content with HttpClient and jsoup (Video Links from 某客学院)

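I recently got a three-month trial membership on 极客学院 (Jikexueyuan), so I had a look around and found the courses quite good. I happen to be learning Android at the moment, so I went there to find some videos to watch. It turned out that only paid members can download videos from the site. Just to try my luck, I looked at the page source and, as it happens, found the video URL sitting right there. I remembered that jsoup makes it easy to extract page elements, so with some free time on my hands I wrote a demo.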

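jsoup is an HTML parser for Java that can parse a URL or a piece of HTML text directly. It provides a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.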

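The main features of jsoup are:

1. Parse HTML from a URL, a file, or a string;

2. Find and extract data using DOM traversal or CSS selectors;

3. Manipulate HTML elements, attributes, and text.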

Chinese-language documentation for jsoup: http://www.open-open.com/jsoup/

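Extracting specific content from a page with jsoup requires analysing the page first. I opened the source of a course page on 极客学院 and quickly found the part that contains the video link, shown below. The src attribute of the <source> tag is the video URL, and with that URL we can download the video using 迅雷 (Xunlei).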

  <source src="http://cv3.jikexueyuan.com/201508081934/f8f3f9f8088f1ba0a6c75594448d96ab/course/1501-1600/1557/video/4278_b_h264_sd_960_540.mp4" type="video/mp4"></source>

We fetch the full HTML source of the page and then pick out the <source> element, which gives us the download link directly.
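
As a minimal sketch of this step, assuming jsoup is on the classpath, the snippet below parses an inline HTML fragment and reads the src attribute of the first <source> element. The class name SourceLinkSketch, the method firstSourceSrc, and the example.com URL are made up for illustration; on the real site the HTML would come from the lesson page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SourceLinkSketch {
    /** Returns the src of the first <source> element that has one, or null if there is none. */
    public static String firstSourceSrc(String html) {
        Document doc = Jsoup.parse(html);
        // "source[src]" is a CSS selector: <source> elements carrying a src attribute.
        Element source = doc.select("source[src]").first();
        return source == null ? null : source.attr("src");
    }

    public static void main(String[] args) {
        // Hypothetical fragment shaped like the one found on the lesson page.
        String html = "<video><source src=\"http://example.com/lesson.mp4\" type=\"video/mp4\"></source></video>";
        System.out.println(firstSourceSrc(html));
    }
}
```

Returning null when no <source> element is present lets the caller tell "no video on this page" (for example, because the request was not logged in) apart from an empty link.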

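Next, by analysing the page further, we can get the information for every video in a course. The relevant page source looks like this: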

 <dl class="lessonvideo-list">
    <dd class="playing">
     <h2> <span class="sm-icon "></span> <a href="http://www.jikexueyuan.com/course/1748_1.html?ss=1" jktag="&amp;posGP=103001&amp;posArea=0002&amp;posOper=8005&amp;posColumn=1.1">1.编写自己的自定义 View（上）</a> <span class="lesson-time">00:10:24</span> </h2>
     <blockquote>
      本课时主要讲解最简单的自定义 View，然后加入绘制元素（文字、图形等），并且可以像使用系统控件一样在布局中使用。
     </blockquote>
    </dd>
    <dd>
     <h2> <span class="sm-icon "></span> <a href="http://www.jikexueyuan.com/course/1748_2.html?ss=1" jktag="&amp;posGP=103001&amp;posArea=0002&amp;posOper=8005&amp;posColumn=2.2">2.编写自己的自定义 View（下）</a> <span class="lesson-time">00:12:05</span> </h2>
     <blockquote>
      本课时主要讲解最简单的自定义 View，然后加入绘制元素（文字、图形等），并且可以像使用系统控件一样在布局中使用。
     </blockquote>
    </dd>
    <dd>
     <h2> <span class="sm-icon "></span> <a href="http://www.jikexueyuan.com/course/1748_3.html?ss=1" jktag="&amp;posGP=103001&amp;posArea=0002&amp;posOper=8005&amp;posColumn=3.3">3.加入逻辑线程</a> <span class="lesson-time">00:20:34</span> </h2>
     <blockquote>
      本课时需要让绘制的元素动起来，但是又不阻塞主线程，所以引入逻辑线程。在子线程更新 UI 是不被允许的，但是 View 提供了方法。让我们来看看吧。
     </blockquote>
    </dd>
    <dd>
     <h2> <span class="sm-icon "></span> <a href="http://www.jikexueyuan.com/course/1748_4.html?ss=1" jktag="&amp;posGP=103001&amp;posArea=0002&amp;posOper=8005&amp;posColumn=4.4">4.提取和封装自定义 View</a> <span class="lesson-time">00:15:41</span> </h2>
     <blockquote>
      本课时主要讲解在上个课程的基础上，进行提取代码来构造自定义 View 的基类，主要目的是：创建新的自定义 View 时，只需继承此类并只关心绘制和逻辑，其他工作由父类完成。这样既减少重复编码，也简化了逻辑。
     </blockquote>
    </dd>
    <dd>
     <h2> <span class="sm-icon "></span> <a href="http://www.jikexueyuan.com/course/1748_5.html?ss=1" jktag="&amp;posGP=103001&amp;posArea=0002&amp;posOper=8005&amp;posColumn=5.5">5.在 xml 中定义样式来影响显示效果</a> <span class="lesson-time">00:14:05</span> </h2>
     <blockquote>
      本课时主要讲解的是在 xml 中定义样式及其属性，怎么来影响自定义 View 中的显示效果的过程和步骤。
     </blockquote>
    </dd>
   </dl>

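With Elements results1 = doc.getElementsByClass("lessonvideo-list"); we get the lesson list. We then iterate over that list with jsoup to collect the page URL of each lesson, and fetch each lesson page in turn to extract its video link.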
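
The traversal described here can be sketched as follows, again assuming jsoup is on the classpath. Instead of getElementsByClass followed by getElementsByTag, this sketch uses a single CSS selector, which is an alternative way of expressing the same lookup. The class name LessonListSketch, the method lessonLinks, and the example.com URLs are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LessonListSketch {
    /** Maps each lesson title to its page URL, in document order. */
    public static Map<String, String> lessonLinks(String html) {
        Document doc = Jsoup.parse(html);
        Map<String, String> lessons = new LinkedHashMap<>();
        // Matches every link with an href inside <dl class="lessonvideo-list">.
        for (Element link : doc.select("dl.lessonvideo-list a[href]")) {
            lessons.put(link.text(), link.attr("href"));
        }
        return lessons;
    }

    public static void main(String[] args) {
        // Hypothetical fragment with the same shape as the lesson list above.
        String html = "<dl class=\"lessonvideo-list\">"
                + "<dd><h2><a href=\"http://example.com/course/1_1.html\">Lesson 1</a></h2></dd>"
                + "<dd><h2><a href=\"http://example.com/course/1_2.html\">Lesson 2</a></h2></dd>"
                + "</dl><a href=\"http://example.com/other.html\">Unrelated link</a>";
        lessonLinks(html).forEach((title, href) -> System.out.println(title + " -> " + href));
    }
}
```

Scoping the selector to the lesson list keeps unrelated links elsewhere on the page (navigation, ads) out of the result.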

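That is the main idea. One more point: a Document fetched with jsoup's get method carries no cookies by default, and some videos only expose their playback URL to logged-in VIP members. We therefore use HttpClient to simulate a logged-in user.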
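
Simulating a logged-in user comes down to sending back, in a Cookie request header, the cookies the server set at login. The sketch below uses only the JDK and shows one way to reduce Set-Cookie values to the name=value pairs that belong in that header. The class name CookieHeaderSketch, the method toCookieHeader, and the sample cookie values are invented for illustration; this is not the site's actual login flow.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CookieHeaderSketch {
    /** Keeps the leading name=value pair of each Set-Cookie value and joins the pairs with "; ". */
    public static String toCookieHeader(List<String> setCookieValues) {
        List<String> pairs = new ArrayList<>();
        for (String value : setCookieValues) {
            // Everything after the first ';' is an attribute (Path, Domain, ...) and is not sent back.
            int end = value.indexOf(';');
            pairs.add((end < 0 ? value : value.substring(0, end)).trim());
        }
        return String.join("; ", pairs);
    }

    public static void main(String[] args) {
        // Made-up Set-Cookie values, as a server might return them after login.
        List<String> fromServer = Arrays.asList(
                "uid=12345; Path=/; HttpOnly",
                "level_id=3; Domain=.example.com",
                "token=abc");
        // prints: uid=12345; level_id=3; token=abc
        System.out.println(toCookieHeader(fromServer));
    }
}
```

The resulting string is what would be passed to a call such as setHeader("Cookie", ...) on the request.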

The full source of the project follows.

1. The Course class, which stores the title and page URL of each lesson in a course.

 package com.debughao.bean;

 /** Holds the title and page URL of one lesson in a course. */
 public class Course {
     /** URL the link points to. */
     private String linkHref;
     /** Title text of the link. */
     private String linkText;

     public String getLinkHref() {
         return linkHref;
     }

     public void setLinkHref(String linkHref) {
         this.linkHref = linkHref;
     }

     public String getLinkText() {
         return linkText;
     }

     public void setLinkText(String linkText) {
         this.linkText = linkText;
     }

     @Override
     public String toString() {
         return "Course [linkHref=" + linkHref + ", linkText=" + linkText + "]";
     }
 }

2. The HttpUtils class, which simulates a logged-in user.

 package com.debughao.down;

 import java.io.IOException;
 import java.io.InputStream;
 import java.nio.charset.StandardCharsets;

 import org.apache.http.Header;
 import org.apache.http.HttpEntity;
 import org.apache.http.HttpHeaders;
 import org.apache.http.HttpStatus;
 import org.apache.http.client.methods.CloseableHttpResponse;
 import org.apache.http.client.methods.HttpGet;
 import org.apache.http.client.methods.HttpPost;
 import org.apache.http.entity.StringEntity;
 import org.apache.http.impl.client.CloseableHttpClient;
 import org.apache.http.impl.client.HttpClients;
 import org.apache.http.util.EntityUtils;

 /** Sends requests with a stored Cookie header to simulate a logged-in user. */
 public class HttpUtils {
     private static final String USER_AGENT =
             "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36";

     String cookieStr = "";
     CloseableHttpResponse response = null;

     public HttpUtils() {
     }

     public HttpUtils(String cookieStr) {
         this.cookieStr = cookieStr;
     }

     public String getCookieStr() {
         return cookieStr;
     }

     public CloseableHttpResponse getResponse() {
         return response;
     }

     /** Fetches url with the stored cookies; returns the body, or null on error. */
     public String Get(String url) {
         CloseableHttpClient httpclient = HttpClients.createDefault();
         HttpGet httpget = new HttpGet(url);
         httpget.setHeader("Cookie", cookieStr);
         httpget.setHeader(HttpHeaders.USER_AGENT, USER_AGENT);
         try {
             response = httpclient.execute(httpget);
             return EntityUtils.toString(response.getEntity(), "UTF-8");
         } catch (Exception e) {
             System.err.println(String.format("HTTP GET error %s", e.getMessage()));
         } finally {
             closeQuietly(httpclient);
         }
         return null;
     }

     /**
      * Posts the query string of url as a form body to the part before '?'.
      * Returns the response body, or null on error.
      */
     public String Post(String url) {
         // Split "address?query" once; a URL without '?' is posted with an empty body.
         String[] parts = url.split("\\?", 2);
         String body = parts.length > 1 ? parts[1] : "";
         CloseableHttpClient httpclient = HttpClients.createDefault();
         HttpPost httppost = new HttpPost(parts[0]);
         StringEntity reqEntity = new StringEntity(body, StandardCharsets.UTF_8);
         reqEntity.setContentType("application/x-www-form-urlencoded;charset=UTF-8");
         httppost.setEntity(reqEntity);
         httppost.setHeader("Cookie", cookieStr);
         httppost.setHeader(HttpHeaders.USER_AGENT, USER_AGENT);
         try {
             response = httpclient.execute(httppost);
             // Append every cookie the server sets to the stored cookie string.
             for (Header h : response.getHeaders("Set-Cookie")) {
                 String pair = subCookie(h.getValue());
                 cookieStr = cookieStr.isEmpty() ? pair : cookieStr + "; " + pair;
             }
             return EntityUtils.toString(response.getEntity(), "UTF-8");
         } catch (Exception e) {
             System.err.println(String.format("HTTP POST error %s", e.getMessage()));
         } finally {
             closeQuietly(httpclient);
         }
         return null;
     }

     /**
      * Requests url and stores the first cookie the server sets.
      * Returns that cookie, or "4" as an error code if none was received.
      */
     public String GetLoginCookie(String url) {
         CloseableHttpClient httpclient = HttpClients.createDefault();
         HttpGet httpget = new HttpGet(url);
         httpget.setHeader("Cookie", cookieStr);
         httpget.setHeader(HttpHeaders.USER_AGENT, USER_AGENT);
         try {
             response = httpclient.execute(httpget);
             Header first = response.getFirstHeader("Set-Cookie");
             if (first != null) {
                 cookieStr = subCookie(first.getValue());
                 return cookieStr;
             }
         } catch (Exception e) {
             System.err.println(String.format("HTTP GET error %s", e.getMessage()));
         } finally {
             closeQuietly(httpclient);
         }
         return "4"; // error code
     }

     /** Returns the leading name=value pair of a Set-Cookie value, dropping attributes such as Path. */
     public String subCookie(String value) {
         int end = value.indexOf(';');
         return end < 0 ? value : value.substring(0, end);
     }

     /**
      * Opens url and returns the response body as a stream, or null on failure.
      * The caller must close the stream; the client is left open so the stream stays readable.
      */
     public InputStream GetImage(String url) {
         CloseableHttpClient httpclient = HttpClients.createDefault();
         HttpGet httpGet = new HttpGet(url);
         if (cookieStr != null) {
             httpGet.setHeader("Cookie", cookieStr);
         }
         try {
             CloseableHttpResponse imageResponse = httpclient.execute(httpGet);
             if (HttpStatus.SC_OK == imageResponse.getStatusLine().getStatusCode()) {
                 HttpEntity entity = imageResponse.getEntity();
                 if (entity != null) {
                     // entity.isStreaming() can be used to check whether the body is a data stream.
                     // To save the data, copy this stream to a FileOutputStream.
                     return entity.getContent();
                 }
             }
         } catch (IOException e) {
             e.printStackTrace();
         }
         return null;
     }

     /** Closes the client and ignores any failure while closing. */
     private static void closeQuietly(CloseableHttpClient client) {
         try {
             client.close();
         } catch (IOException e) {
             // Nothing useful can be done if closing fails.
         }
     }
 }

3. A simple Test class.

 package com.debughao.down;

 import java.util.ArrayList;
 import java.util.List;
 import java.util.Scanner;

 import org.jsoup.Jsoup;
 import org.jsoup.nodes.Document;
 import org.jsoup.nodes.Element;
 import org.jsoup.select.Elements;

 import com.debughao.bean.Course;

 public class Test {

     public static void main(String[] args) {

         HttpUtils http = new HttpUtils("stat_uuid=1436867409341663197461; uname=qq_rwe4zg5t; uid=3812752; code=LZ8XF1; "

                 + "authcode=b809MIxLGp8syQcnuAAdIT9PuCEH2%2FuiyvRuuLALSxb6z6iGoM3xcihNJKzHK%2BAZWzVIGFAW0QrBYiSLmHN1qnhi0YQLmBeWeqkJHXh5xsoylWuRCFmRDJZyUtAGr3U; "

                 + "level_id=3; is_expire=0; domain=debughao; stat_fromWebUrl=; stat_ssid=1439813138264;"

                 + " connect.sid=s%3A5xux57xcLyCBheevR40DUa0beJD_ok-S.0aTnwfjSvm7A49zydLGbtXy7vdCGfH7lB7MwmZURppQ; "

                 + "QINGCLOUDELB=37e16e60f0cd051b754b0acf9bdfd4b5d562b81daa2a899c46d3a1e304c7eb2b|VcWiq|VcWiq; "

                 + "_ga=GA1.2.889563867.1436867384; _gat=1; Hm_lvt_f3c68d41bda15331608595c98e9c3915=1438945833,1438947627,1438995076,1438995133;"

                 + " Hm_lpvt_f3c68d41bda15331608595c98e9c3915=1439015591; MECHAT_LVTime=1439015591174; MECHAT_CKID=cookieVal=006600143686858016573509; "

                 + "undefined=; stat_isNew=0");

         // Read the course page URL from standard input.
         Scanner sc = new Scanner(System.in);
         String url = sc.nextLine();
         sc.close();

         String res = http.Get(url);
         if (res == null) {
             System.err.println("Failed to fetch " + url);
             return;
         }
         List<Course> videos = getVideoList(getDocByRes(res));
         // Print every lesson title first.
         for (Course video : videos) {
             System.out.println(video.getLinkText());
         }
         // Then fetch each lesson page and print its video URL.
         for (Course video : videos) {
             String lessonHtml = http.Get(video.getLinkHref());
             if (lessonHtml != null) {
                 getVideoLink(getDocByRes(lessonHtml));
             }
         }
     }

     private static Document getDocByRes(String res) {
         return Jsoup.parse(res);
     }

     /** Collects the title and URL of every lesson link inside the lessonvideo-list element. */
     public static List<Course> getVideoList(Element doc) {
         List<Course> courses = new ArrayList<Course>();
         // Print the page title, which is the course name.
         System.out.println(doc.getElementsByTag("title").text());
         Elements lists = doc.getElementsByClass("lessonvideo-list");
         for (Element list : lists) {
             for (Element link : list.getElementsByTag("a")) {
                 Course course = new Course();
                 course.setLinkHref(link.attr("href"));
                 course.setLinkText(link.text());
                 courses.add(course);
             }
         }
         return courses;
     }

     /** Prints the src attribute of the first <source> element on a lesson page. */
     public static void getVideoLink(Document doc) {
         System.out.println(doc.select("source").attr("src"));
     }
 }

4. The output of a run is shown below. The first line is the course URL typed in as input.

 http://www.jikexueyuan.com/course/1748.html
 自定义 View 基础和原理-极客学院
 1.编写自己的自定义 View（上）
 2.编写自己的自定义 View（下）
 3.加入逻辑线程
 4.提取和封装自定义 View
 5.在 xml 中定义样式来影响显示效果
 http://cv3.jikexueyuan.com/201508082007/99549fa37069a39a2e128278ee60768c/course/1501-1600/1557/video/4278_b_h264_sd_960_540.mp4
 http://cv3.jikexueyuan.com/201508082007/a068be74f7f31900e128f109523b0925/course/1501-1600/1557/video/4279_b_h264_sd_960_540.mp4
 http://cv3.jikexueyuan.com/201508082008/bf216e06770e9a9b0adda34ea4d01dfc/course/1501-1600/1557/video/4280_b_h264_sd_960_540.mp4
 http://cv3.jikexueyuan.com/201508082008/75b51573a75458848136e61e848d1ae7/course/1501-1600/1557/video/4281_b_h264_sd_960_540.mp4
 http://cv3.jikexueyuan.com/201508082008/ca20fad3e1bc622aa64bbfa7d2b768dd/course/1501-1600/1557/video/5159_b_h264_sd_960_540.mp4

Open 迅雷 (Xunlei), create a new download task with one of these links, and the video can be downloaded.