ASP.NET: Scraping Web Page Content

Web page scraping code

using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.Net;
using System.IO;
using System.Text.RegularExpressions;
using System.Text;

namespace WSYL.Web.Common
{
    public static class GetSteamShipInfo
    {

        public static string GetWebSite(string steamshipname, int itype)
        {
            if (string.IsNullOrWhiteSpace(steamshipname))
                return null;

            // Step 1: download the HTML.
            // "网址" ("URL") is a placeholder; replace it with the page to crawl.
            string urlToCrawl = @"网址";

            // Build an HTTP GET request for the page.
            HttpWebRequest req = (HttpWebRequest)WebRequest.Create(urlToCrawl);
            req.Method = "GET";

            // TODO (2015-08-12 18:14:45): handle request and parsing timeouts so that a hung page cannot stall the caller.

            // Decode with the page's charset (UTF-8 here); the wrong charset yields garbled text.
            Encoding htmlEncoding = Encoding.GetEncoding("utf-8");

            // Read the whole response body, then release the connection and the reader.
            string respHtml;
            using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
            using (StreamReader sr = new StreamReader(resp.GetResponseStream(), htmlEncoding))
            {
                respHtml = sr.ReadToEnd();
            }

            // Step 2: capture the text between the opening and closing markup of the wanted cells.
            Match TitleMatch2 = Regex.Match(respHtml, "<td align=\"left\" bgcolor=\"#EEEEEE\">([^<]*)</td>", RegexOptions.IgnoreCase | RegexOptions.Multiline);

            // No matching cell, or an empty one: return an empty string.
            if (TitleMatch2.Groups[1].Value.Length == 0)
                return "";

            switch (itype)
            {
                case 0: // text of the first matching cell
                    return TitleMatch2.Groups[1].Value;
                case 1: // second matching cell with its tags stripped
                    return StripHtml(TitleMatch2.NextMatch().Value);
                case 2: // both cells joined with "/"
                    return TitleMatch2.Groups[1].Value + "/" + StripHtml(TitleMatch2.NextMatch().Value);
                default: // any other value returns the whole page HTML
                    return respHtml;
            }
        }

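        /// <summary>
        /// Removes HTML tags, HTML-encodes any stray angle brackets, and collapses
        /// each run of whitespace into a single space, because long runs of spaces
        /// get in the way of later string handling. Some inputs are not cleaned
        /// completely in one pass, so applying the method twice is suggested.
        /// </summary>
        /// <param name="strHtml">The markup to strip.</param>
        /// <returns>The text with tags removed and whitespace normalized.</returns>
        private static string StripHtml(string strHtml)
        {
            // Remove anything that looks like a tag, including tags that span lines.
            Regex objRegExp = new Regex("<(.|\n)+?>");
            string strOutput = objRegExp.Replace(strHtml, "");

            // Encode any angle brackets left over from malformed markup.
            strOutput = strOutput.Replace("<", "&lt;");
            strOutput = strOutput.Replace(">", "&gt;");

            // Collapse every run of whitespace into one space.
            Regex r = new Regex(@"\s+");
            strOutput = r.Replace(strOutput, " ");

            return strOutput.Trim();
        }
    }
}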
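
The dated note inside GetWebSite asks for timeout handling so that a hung page cannot block the caller. Below is a minimal sketch of one way to do that with HttpWebRequest. The class name ScrapeRequestSketch, the methods CreateRequest and TryDownload, and the 5-second default are assumptions made for illustration; none of them appear in the listing above. HttpWebRequest is used to match the listing, although newer .NET versions mark it obsolete in favor of HttpClient.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

public static class ScrapeRequestSketch
{
    // Hypothetical helper: builds a GET request that gives up instead of hanging.
    public static HttpWebRequest CreateRequest(string url, int timeoutMs = 5000)
    {
        HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
        req.Method = "GET";
        // Timeout bounds the wait inside GetResponse().
        req.Timeout = timeoutMs;
        // ReadWriteTimeout bounds each read on the response stream.
        req.ReadWriteTimeout = timeoutMs;
        return req;
    }

    // Returns the body decoded as UTF-8, or null when the request fails or times out.
    public static string TryDownload(string url, int timeoutMs = 5000)
    {
        try
        {
            HttpWebRequest req = CreateRequest(url, timeoutMs);
            using (WebResponse resp = req.GetResponse())
            using (StreamReader sr = new StreamReader(resp.GetResponseStream(), Encoding.UTF8))
            {
                return sr.ReadToEnd();
            }
        }
        catch (WebException)
        {
            return null; // connection failure, HTTP error status, or timeout
        }
        catch (IOException)
        {
            return null; // the body stream stalled or broke while being read
        }
    }
}
```

With a helper like this, GetWebSite could treat a null result the same way it treats an empty match and return an empty string.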

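The three itype modes are easier to follow on a fixed HTML string than on a live page. The sketch below applies the same cell pattern to an in-memory sample. The class CellExtractSketch and the method ExtractCells are hypothetical names. Unlike the listing, it collects the cells with Regex.Matches, reads capture group 1 for both cells, trims both values, and returns an empty string for an unknown itype.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CellExtractSketch
{
    // Same pattern as in GetWebSite: the text of a left-aligned cell with a #EEEEEE background.
    private const string CellPattern = "<td align=\"left\" bgcolor=\"#EEEEEE\">([^<]*)</td>";

    public static string ExtractCells(string html, int itype)
    {
        MatchCollection cells = Regex.Matches(html, CellPattern, RegexOptions.IgnoreCase);

        // No matching cell, or an empty first cell: nothing to report.
        if (cells.Count == 0 || cells[0].Groups[1].Value.Length == 0)
            return "";

        string first = cells[0].Groups[1].Value.Trim();
        string second = cells.Count > 1 ? cells[1].Groups[1].Value.Trim() : "";

        switch (itype)
        {
            case 0: return first;                // first cell only
            case 1: return second;               // second cell only
            case 2: return first + "/" + second; // both, joined with "/"
            default: return "";                  // assumption: unknown itype yields ""
        }
    }
}
```

For a made-up row whose two matching cells contain "COSCO STAR" and " V.012 ", itype 0 gives "COSCO STAR", itype 1 gives "V.012", and itype 2 gives "COSCO STAR/V.012".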