Scraping News with jsoup

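1. Introduction to jsoup

jsoup is a Java HTML parser that can parse a URL or a string of HTML directly. It provides a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods. jsoup is released under the MIT license, so it can be used in commercial projects.

jsoup's main features are:

1. Parse HTML from a URL, a file, or a string;

2. Find and extract data using DOM traversal or CSS selectors;

3. Manipulate HTML elements, attributes, and text.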

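2. Using jsoup

1. Download the jsoup jar: http://jsoup.org/download

2. jsoup cookbook (English): http://jsoup.org/cookbook/

3. jsoup cookbook (Chinese translation): http://www.open-open.com/jsoup/

Here is a simple example:

1. Fetch the links and their titles from the Sina Finance home page and print them.

2. Fetch the body text of one of the pages found in step 1 and print it.

Code: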

package jSoupTesting;

import java.io.IOException;

import org.jsoup.Jsoup;

import org.jsoup.nodes.Document;

import org.jsoup.nodes.Element;

import org.jsoup.select.Elements;

public class GetSinaUrlAndTitle {

    public static void main(String[] args) {

        // TODO Auto-generated method stub

        getUrlAndTitle();

        getTextMes();

    }

    public static void getUrlAndTitle()

    {

        String url="http://finance.sina.com.cn/";

        try {

            Document doc=Jsoup.connect(url).timeout(10000).get(); // fetch and parse the page (10 s timeout)

            //System.out.println(doc);

            Elements ListDiv = doc.getElementsByAttributeValue("class","fin_tabs0_c0");

            //System.out.println(ListDiv);

            for (Element div :ListDiv) {

                 Elements links = div.getElementsByTag("a");

                // System.out.println(links);

                 for (Element link : links) {

                     String linkHref = link.attr("href").trim();

                     String linkText = link.text().trim();

                     System.out.println(linkHref+"\t"+linkText);

                 }

             }

        } catch (IOException e) {

            // TODO Auto-generated catch block

            e.printStackTrace();

        }

    }

    public static void getTextMes()

    {

        String url="http://finance.sina.com.cn/hy/20140823/100220099682.shtml";

        String textMes="";

        try {

            Document doc=Jsoup.connect(url).timeout(10000).get();

            Elements ListDiv = doc.getElementsByAttributeValue("class","blkContainerSblkCon BSHARE_POP");

            //System.out.println(ListDiv);

            for(Element div:ListDiv)

            {

                Elements textInfos=div.getElementsByTag("p");

                //System.out.println(textInfos);

                for(Element textInfo:textInfos)

                {

                    String text=textInfo.text().trim();

                    textMes=textMes+text+"\n";

                }

            }

            System.out.println(textMes);

        } catch (IOException e) {

            // TODO Auto-generated catch block

            e.printStackTrace();

        }

    }

}

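3. News scraping requirements

News selection process (using Sina Finance as the example): http://finance.sina.com.cn/

1. Choosing topics

(1) Macro news: major domestic and international macroeconomic policy moves, documents issued by regulators such as the China Banking Regulatory Commission, and major domestic financial news such as the development of free-trade zones or the founding of the BRICS bank.

(2) Company news: management changes, mergers and acquisitions, strategic shifts, new product launches, and similar news from client companies or other large financial institutions.

2. Choosing pages

1. Macro news: go to http://finance.sina.com.cn/ and open the "要闻" (Top News) block on the home page.

2. Company news: go to http://finance.sina.com.cn/, choose "银行" (Banking), then "要闻" (Top News).

3. Scraping requirements

1. Scrape every URL, title, and keyword list in the Top News block.

2. Scrape the body text of each URL from item 1.

3. News already seen on the previous day must not appear again on the following day.

4. Save the scraped news to txt files.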
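Requirements 3 and 4 amount to keeping a per-day log of accepted URLs and filtering today's candidates against yesterday's log. The following is a minimal sketch of that idea using only the JDK; the class `DailyUrlLog` and its methods are hypothetical names introduced here for illustration, and the file naming (`<date>url.txt`) merely mirrors the convention used in the implementation in section 4. It is not the code the post itself uses.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DailyUrlLog {

    /** Path of the URL log for a given day, e.g. dir/2014-08-23url.txt. */
    static Path logFor(Path dir, LocalDate day) {
        return dir.resolve(day + "url.txt"); // LocalDate.toString() yields yyyy-MM-dd
    }

    /** Loads the URLs recorded on the given day; a missing file means nothing was seen. */
    static Set<String> loadSeen(Path dir, LocalDate day) throws IOException {
        Path log = logFor(dir, day);
        if (!Files.exists(log)) {
            return new HashSet<String>();
        }
        return new HashSet<String>(Files.readAllLines(log, StandardCharsets.UTF_8));
    }

    /** Keeps only URLs absent from yesterday's log, preserving order and dropping repeats. */
    static List<String> keepNew(List<String> candidates, Set<String> seenYesterday) {
        Set<String> emitted = new HashSet<String>();
        List<String> fresh = new ArrayList<String>();
        for (String url : candidates) {
            if (!seenYesterday.contains(url) && emitted.add(url)) {
                fresh.add(url);
            }
        }
        return fresh;
    }

    /** Writes the accepted URLs for a day, one per line, in UTF-8. */
    static void save(Path dir, LocalDate day, List<String> urls) throws IOException {
        Files.createDirectories(dir);
        Files.write(logFor(dir, day), urls, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("yaowen"); // placeholder output folder
        LocalDate today = LocalDate.now();
        // Pretend yesterday's run accepted one (made-up) URL.
        save(dir, today.minusDays(1), Arrays.asList("http://example.com/a"));
        List<String> fresh = keepNew(
                Arrays.asList("http://example.com/a", "http://example.com/b"),
                loadSeen(dir, today.minusDays(1)));
        save(dir, today, fresh);
        System.out.println(fresh);
    }
}
```

A `Set` lookup keeps the duplicate check constant-time per URL, and writing and reading the log with the same explicit charset avoids depending on the platform default encoding.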

4. Implementation

package sinaSpider;

import java.io.BufferedReader;

import java.io.File;

import java.io.FileInputStream;

import java.io.FileWriter;

import java.io.IOException;

import java.io.InputStreamReader;

import java.text.SimpleDateFormat;

import java.util.Calendar;

import java.util.Date;

import java.util.HashMap;

import java.util.Map;

import org.jsoup.Jsoup;

import org.jsoup.nodes.Document;

import org.jsoup.nodes.Element;

import org.jsoup.select.Elements;

public class GetSinaInfo {

    public static void main(String[] args) throws IOException {

        // TODO Auto-generated method stub

        getSinaInforamtion();

    }

    public static  void getSinaInforamtion()

    {

        Map<String,String> pathMap=createNewFiles();

        try {

            getSinaYaoWen(pathMap);

            getSinaChangJing(pathMap);

            getSinaBank(pathMap);

        } catch (IOException e) {

            e.printStackTrace();

        }

    }

    public static void getSinaYaoWen(Map<String,String> pathMap) throws IOException

    {

        String YaoWenTextPath=pathMap.get("yaowen")+"//yaowen"+GetDate()+"outputText.txt";

        String YaoWenTitlePath=pathMap.get("yaowen")+"//yaowen"+GetDate()+"outputTitle.txt";

        String YaoWenUrlPath=pathMap.get("yaowen")+"//"+GetDate()+"url.txt";

        FileWriter urlWriter = new FileWriter(YaoWenUrlPath);

        FileWriter textWriter = new FileWriter(YaoWenTextPath);

        FileWriter titleWriter = new FileWriter(YaoWenTitlePath);

        String oldUrlPath=pathMap.get("yaowen")+"//"+GetYesterday()+"url.txt";

        String[] oldUrls=GetYesterdayInfo(oldUrlPath);

        Document doc = Jsoup.connect("http://finance.sina.com.cn/").timeout(5000).get();

        Elements ListDiv = doc.getElementsByAttributeValue("class","fin_tabs0_c0");

        //System.out.println(ListDiv);

         for (Element element :ListDiv) {

             Elements links = element.getElementsByTag("a");

             for (Element link : links) {

                 String linkHref = link.attr("href").trim();

                 String linkText = link.text().trim();

                 if(judgeDup(oldUrls,linkHref))

                 {

                     getWebText(linkHref,linkText,textWriter,titleWriter,urlWriter);

                 }     

             }

         }

         textWriter.close();

         titleWriter.close();

         urlWriter.close();

    }

    public static void getSinaChangJing(Map<String,String> pathMap) throws IOException

    {

         String ChanJingTextPath=pathMap.get("chanjing")+"//chanjing"+GetDate()+"outputText.txt";

         String ChanJingTitlePath=pathMap.get("chanjing")+"//chanjing"+GetDate()+"outputTitle.txt";

         String ChanJingUrlPath=pathMap.get("chanjing")+"//"+GetDate()+"url.txt";

         FileWriter urlWriter = new FileWriter(ChanJingUrlPath);

         FileWriter textWriter = new FileWriter(ChanJingTextPath);

         FileWriter titleWriter = new FileWriter(ChanJingTitlePath);

         String oldUrlPath=pathMap.get("chanjing")+"//"+GetYesterday()+"url.txt";

         String[] oldUrls=GetYesterdayInfo(oldUrlPath);

         Document doc = Jsoup.connect("http://finance.sina.com.cn/chanjing/").timeout(5000).get();

         Elements ListDiv = doc.getElementsByAttributeValue("class","blk_03");

        //System.out.println(ListDiv);

         for (Element element :ListDiv) {

             Elements links = element.getElementsByTag("a");

             for (Element link : links) {

                 String linkHref = link.attr("href").trim();

                 String linkText = link.text().trim();

                 if(judgeDup(oldUrls,linkHref))

                 {

                     getWebText(linkHref,linkText,textWriter,titleWriter,urlWriter);

                 }

             }

         }

         textWriter.close();

         titleWriter.close();

         urlWriter.close();

    }

    public static void getSinaBank(Map<String,String> pathMap) throws IOException

    {

         String bankTextPath=pathMap.get("bank")+"//bank"+GetDate()+"outputText.txt";

         String bankTitlePath=pathMap.get("bank")+"//bank"+GetDate()+"outputTitle.txt";

         String bankUrlPath=pathMap.get("bank")+"//"+GetDate()+"url.txt";

         FileWriter urlWriter = new FileWriter(bankUrlPath);

         FileWriter textWriter = new FileWriter(bankTextPath);

         FileWriter titleWriter = new FileWriter(bankTitlePath);

         String oldUrlPath=pathMap.get("bank")+"//"+GetYesterday()+"url.txt";

         String[] oldUrls=GetYesterdayInfo(oldUrlPath);

         Document doc = Jsoup.connect("http://finance.sina.com.cn/money/bank/").timeout(5000).get();

         Elements ListDiv = doc.getElementsByAttributeValue("class","blk05");

        //System.out.println(ListDiv);

         for (Element element :ListDiv) {

            Elements links = element.getElementsByTag("a");

            for (Element link : links) {

                String linkHref = link.attr("href").trim();

                String linkText = link.text().trim();

                if(judgeDup(oldUrls,linkHref))

                {

                    getWebText(linkHref,linkText,textWriter,titleWriter,urlWriter);

                }

            }

        }

         textWriter.close();

         titleWriter.close();

         urlWriter.close();

    }

    public static void getWebText(String url,String subTitle,

                                  FileWriter textWriter,FileWriter titleWriter,

                                  FileWriter urlWriter) throws IOException

    {

        Document doc;

        doc = Jsoup.connect(url).timeout(10000).get();

        Elements ListDiv = doc.getElementsByAttributeValue("class","blkContainerSblkCon BSHARE_POP");

        if(!ListDiv.isEmpty())

        {

            String webTitleKeywords=getTitleAndWebsite(url,subTitle)+getKeyWords(doc);

            System.out.println(webTitleKeywords);

            writeSTK(webTitleKeywords, titleWriter);

            textWriter.write(webTitleKeywords+"\n");

            urlWriter.write(url+"\n");

            for (Element element :ListDiv) {

                 Elements links = element.getElementsByTag("p");

                 for (Element link : links) {

                     String linkText = link.text().trim();

                     textWriter.write(linkText+"\n");

                   //  System.out.println(linkText);

                 }

             }

        }

    }

    public static String getTitleAndWebsite(String url,String subTitle)

    {

        String titleAndWebsite;

        titleAndWebsite=url+"\t"+subTitle;

        return titleAndWebsite;

    }

    public static void writeSTK(String webTitleKeywords,FileWriter writeWebTitle)

    {

        try {

            writeWebTitle.write(webTitleKeywords+"\n");

        } catch (IOException e) {

            // TODO Auto-generated catch block

            e.printStackTrace();

        }

    }

    public static String getKeyWords(Document doc)

    {

        Elements listKey=doc.getElementsByAttributeValue("class","art_keywords");

        String keywords ="\t keywords:";

        for(Element element:listKey)

        {

            Elements links = element.getElementsByTag("a");

             for (Element link : links) {

                 String linkText = link.text().trim();

                 keywords = keywords+linkText+",";

             }

        }

        return keywords;

    }

    public static String GetDate()

    {

         Date dt=new Date();

         SimpleDateFormat simpleDate=new SimpleDateFormat("yyyy-MM-dd");

        // System.out.println(simpleDate.format(dt));

         return simpleDate.format(dt);

    }

    public static String GetYesterday()

    {

        Calendar calendar = Calendar.getInstance();

        calendar.add(Calendar.DATE, -1);

        String  yestedayDate = new SimpleDateFormat("yyyy-MM-dd").format(calendar.getTime());

       // System.out.println(yestedayDate);

        return yestedayDate;

    }

    public static String[] GetYesterdayInfo(String oldFilePath) throws IOException

    {

        String encoding="Utf-8";

        File file=new File(oldFilePath);

        if(file.exists())

        {

            return getOldUrls(file,encoding);

        }

        else

        {

            file.createNewFile();

            return getOldUrls(file,encoding);

        }

    }

    public static String[] getOldUrls(File file,String encoding) throws IOException

    {

        // Collect one URL per line; a list avoids joining and re-splitting on ",",

        // which would break any URL that itself contains a comma.

        java.util.List<String> oldUrls=new java.util.ArrayList<String>();

        // try-with-resources closes the reader (and the underlying stream) even on error

        try (BufferedReader input=new BufferedReader(

                new InputStreamReader(new FileInputStream(file),encoding))) {

            String url=input.readLine();

            while(url!=null){

                oldUrls.add(url.trim());

                url=input.readLine();

            }

        }

        return oldUrls.toArray(new String[0]);

    }

    public static boolean judgeDup(String[] oldUrls ,String newUrl)

    {

        for(int i=0;i<oldUrls.length;i++)

        {

            if(newUrl.equals(oldUrls[i]))

            {

                return false;

            }

        }

        return true;

    }

    public static Map<String,String> createNewFiles()

    {

        String path=getWorkPath()+"//output";

        String [] fileNames = {"yaowen","chanjing","bank"};

        Map<String,String> pathMap=new HashMap<String,String>();

        String pathArray[] = new String[fileNames.length];

        for(int i=0;i<fileNames.length;i++)

        {

            String filePath=path+"//"+fileNames[i];

            File file=new File(filePath);

            if(!file.exists())

            {

                file.mkdirs();

            }

            pathArray[i]=file.getPath().replace("\\", "//");

            pathMap.put(fileNames[i], pathArray[i]);

        }

        return pathMap;

    }

    public static String getWorkPath()

    {

        String workspacePath = null;

        try {

            File directory = new File("");// an empty path resolves to the current working directory

            workspacePath = directory.getCanonicalPath() ;

            //System.out.println(workspacePath);

            workspacePath = workspacePath.replace("\\", "//");

            //System.out.println(workspacePath);

        } catch (IOException e) {

            // TODO Auto-generated catch block

            e.printStackTrace();

        }

        return workspacePath;

    }

}
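The three getSina* methods above differ only in the output folder key, the page URL, and the CSS class of the link container. One way to remove that duplication is to describe each section as data and loop over it. The sketch below shows only the data side, under stated assumptions: `SinaSections`, `Section`, `SECTIONS`, and `urlLogPath` are hypothetical names that do not appear in the original code, and the URLs and class names are copied from the listing above, so they reflect Sina's page layout at the time of writing and may be stale.

```java
import java.util.Arrays;
import java.util.List;

public class SinaSections {

    /** One news section: output folder key, page URL, CSS class of the link container. */
    static final class Section {
        final String key;
        final String url;
        final String cssClass;

        Section(String key, String url, String cssClass) {
            this.key = key;
            this.url = url;
            this.cssClass = cssClass;
        }
    }

    // Values taken from getSinaYaoWen, getSinaChangJing, and getSinaBank in the listing above.
    static final List<Section> SECTIONS = Arrays.asList(
            new Section("yaowen", "http://finance.sina.com.cn/", "fin_tabs0_c0"),
            new Section("chanjing", "http://finance.sina.com.cn/chanjing/", "blk_03"),
            new Section("bank", "http://finance.sina.com.cn/money/bank/", "blk05"));

    /** Builds the per-day URL log path for a section, e.g. output/bank/2014-08-23url.txt. */
    static String urlLogPath(String outputDir, Section s, String date) {
        return outputDir + "/" + s.key + "/" + date + "url.txt";
    }

    public static void main(String[] args) {
        // A single scraping method could take a Section instead of three near-identical copies.
        for (Section s : SECTIONS) {
            System.out.println(s.key + "\t" + s.url + "\t" + urlLogPath("output", s, "2014-08-23"));
        }
    }
}
```

With such a table, adding a fourth section would mean adding one line of data rather than copying a whole method.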
