Core C# Methods for Fetching Web Page Content (Getting Started, Part 1)

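Below is a core helper class I put together for requesting pages from C#. It covers the following:

1. Fetching a page's HTML with an HttpWebRequest GET request
2. Fetching a page's HTML with an HttpWebRequest POST request
3. Simulating a login to obtain cookies
4. Simulating a login to obtain the cookie string
5. Setting up a proxy
6. Using WebBrowser to fetch pages generated by JavaScript
7. Setting cookies on WebBrowser to simulate a login
8. Usage demo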

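Fetching a page's HTML with an HttpWebRequest GET request

Note: my crawls used to be very slow, and the cause turned out to be the proxy. If you have no proxy, set it to null so the request does not spend time looking for one on every call. Other parameters can be set as needed, for example to mimic a browser.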

        /// <summary>
        /// Fetches a page's HTML with a GET request.
        /// </summary>
        /// <param name="url">The URL to fetch.</param>
        /// <param name="proxy">The proxy to use; pass null when there is none, otherwise each request is slowed down by proxy lookup.</param>
        /// <param name="cookie">The cookies the site requires.</param>
        /// <param name="timeout">Timeout in milliseconds.</param>
        /// <returns>The HTML returned by the request.</returns>
        // NOTE: the default timeout value is missing from the source text; 100000 ms is an
        // assumption that matches the built-in default of HttpWebRequest.Timeout.
        public static string Crawl(string url, WebProxy proxy, CookieContainer cookie, int timeout = 100000)
        {
            string result = string.Empty;
            HttpWebRequest request = null;
            WebResponse response = null;
            StreamReader streamReader = null;
            try
            {
                request = (HttpWebRequest)WebRequest.Create(url);
                request.Proxy = proxy;
                request.Timeout = timeout;
                request.AllowAutoRedirect = true;
                request.CookieContainer = cookie;
                response = (HttpWebResponse)request.GetResponse();
                // Assumes the page is UTF-8 encoded.
                streamReader = new StreamReader(response.GetResponseStream(), Encoding.UTF8);
                result = streamReader.ReadToEnd();
            }
            finally
            {
                // No catch block: exceptions propagate with their original stack trace
                // (the original "throw ex;" would have reset it).
                if (streamReader != null)
                {
                    streamReader.Dispose();
                }
                if (response != null)
                {
                    response.Close();
                }
                if (request != null)
                {
                    request.Abort();
                }
            }
            return result;
        }
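The note above mentions that other parameters, such as mimicking a browser, can be set as needed. A minimal sketch of what that might look like follows. The helper name CreateBrowserLikeRequest and the header values are illustrative assumptions, not part of the original class; the helper only builds the request and does not send it.

```csharp
using System;
using System.Net;

public static class RequestSetup
{
    // Hypothetical helper: configures a GET request with browser-like headers.
    // It does not send the request; call GetResponse() on the result to do that.
    public static HttpWebRequest CreateBrowserLikeRequest(
        string url, WebProxy proxy, CookieContainer cookie, int timeoutMs)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "GET";
        request.Proxy = proxy;              // null skips proxy lookup
        request.Timeout = timeoutMs;
        request.AllowAutoRedirect = true;
        request.CookieContainer = cookie;

        // Illustrative header values; adjust them to the browser you want to mimic.
        request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)";
        request.Accept = "text/html,application/xhtml+xml";
        request.Headers["Accept-Language"] = "en-US,en;q=0.8";

        // Let the framework decompress gzip/deflate responses transparently.
        request.AutomaticDecompression =
            DecompressionMethods.GZip | DecompressionMethods.Deflate;
        return request;
    }
}
```

Accept and User-Agent are restricted headers on HttpWebRequest, so they are set through their dedicated properties rather than through the Headers collection.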

Fetching a page's HTML with an HttpWebRequest POST request

        /// <summary>
        /// Fetches a page with a POST request.
        /// </summary>
        /// <param name="url">The URL to fetch.</param>
        /// <param name="postdata">The POST data string, for example id=1&amp;name=test.</param>
        /// <param name="proxy">The proxy to use, or null.</param>
        /// <param name="cookie">The cookies the site requires.</param>
        /// <param name="timeout">Timeout in milliseconds.</param>
        /// <returns>The HTML returned by the request.</returns>
        // NOTE: the default timeout value is missing from the source text; 100000 ms is an
        // assumption that matches the built-in default of HttpWebRequest.Timeout.
        public static string Crawl(string url, string postdata, WebProxy proxy, CookieContainer cookie, int timeout = 100000)
        {
            string result = string.Empty;
            HttpWebRequest request = null;
            WebResponse response = null;
            StreamReader streamReader = null;
            try
            {
                request = (HttpWebRequest)WebRequest.Create(url);
                request.Proxy = proxy;
                request.Timeout = timeout;
                request.AllowAutoRedirect = true;
                request.CookieContainer = cookie;
                // UTF-8 instead of ASCII so non-ASCII form values are not turned into '?'.
                byte[] bs = Encoding.UTF8.GetBytes(postdata);
                request.Method = "POST";
                request.ContentType = "application/x-www-form-urlencoded";
                request.ContentLength = bs.Length;
                // The using block closes the request stream.
                using (Stream reqStream = request.GetRequestStream())
                {
                    reqStream.Write(bs, 0, bs.Length);
                }
                response = (HttpWebResponse)request.GetResponse();
                // Assumes the page is UTF-8 encoded.
                streamReader = new StreamReader(response.GetResponseStream(), Encoding.UTF8);
                result = streamReader.ReadToEnd();
            }
            finally
            {
                // No catch block: exceptions propagate with their original stack trace.
                if (streamReader != null)
                {
                    streamReader.Dispose();
                }
                if (response != null)
                {
                    response.Close();
                }
                if (request != null)
                {
                    request.Abort();
                }
            }
            return result;
        }
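The postdata argument is a raw string such as id=1&name=test, so any value containing '&', '=' or spaces has to be escaped before it is passed in. The following is a minimal sketch of one way to build that string; FormData.BuildPostData is a hypothetical helper, not part of the original class.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class FormData
{
    // Hypothetical helper: joins name/value pairs into a form-encoded string,
    // percent-escaping each name and value so '&' and '=' inside them are safe.
    public static string BuildPostData(IEnumerable<KeyValuePair<string, string>> fields)
    {
        var sb = new StringBuilder();
        foreach (KeyValuePair<string, string> field in fields)
        {
            if (sb.Length > 0)
            {
                sb.Append('&');
            }
            sb.Append(Uri.EscapeDataString(field.Key));
            sb.Append('=');
            sb.Append(Uri.EscapeDataString(field.Value ?? string.Empty));
        }
        return sb.ToString();
    }
}
```

Uri.EscapeDataString encodes a space as %20 rather than '+'. Servers generally accept both forms in a form body, but that is worth checking against the target site.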

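Simulating a login to obtain cookies

First find the login page and work out its POST parameters and URL. Once you have the cookies, you can pass them straight to the methods above.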

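        /// <summary>
        /// Requests a page and returns the cookies it sets.
        /// </summary>
        /// <param name="url">The URL to request.</param>
        /// <param name="proxy">The proxy to use, or null.</param>
        /// <param name="timeout">Timeout in milliseconds.</param>
        /// <returns>The cookie container filled by the response.</returns>
        // NOTE: the default timeout value is missing from the source text; 100000 ms is an
        // assumption that matches the built-in default of HttpWebRequest.Timeout.
        public static CookieContainer GetCookie(string url, WebProxy proxy, int timeout = 100000)
        {
            HttpWebRequest request = null;
            HttpWebResponse response = null;
            try
            {
                CookieContainer cc = new CookieContainer();
                request = (HttpWebRequest)WebRequest.Create(url);
                request.Proxy = proxy;
                request.Timeout = timeout;
                request.AllowAutoRedirect = true;
                request.CookieContainer = cc;
                // The container is filled from the Set-Cookie headers when the response arrives,
                // so no explicit copy into response.Cookies is needed.
                response = (HttpWebResponse)request.GetResponse();
                return cc;
            }
            finally
            {
                // No catch block: exceptions propagate with their original stack trace.
                if (response != null)
                {
                    response.Close();
                }
                if (request != null)
                {
                    request.Abort();
                }
            }
        }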
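Before handing the returned container to Crawl, it can help to confirm that the login request actually produced the cookies you expect. A small sketch that lists what a CookieContainer holds for a given URI follows; CookieInspector is a hypothetical name used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Net;

public static class CookieInspector
{
    // Hypothetical helper: returns one descriptive line per cookie that the
    // container would send to the given URI.
    public static List<string> Describe(CookieContainer container, Uri uri)
    {
        var lines = new List<string>();
        foreach (Cookie c in container.GetCookies(uri))
        {
            lines.Add(c.Name + "=" + c.Value + " (domain " + c.Domain + ", path " + c.Path + ")");
        }
        return lines;
    }
}
```

An empty result usually means the login URL or parameters were wrong, or that the cookies were scoped to a different host than the one being queried.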

Simulating a login to obtain the cookie string

        /// <summary>
        /// Returns the cookies as a header string that WebBrowser can use.
        /// </summary>
        /// <param name="url">The URL to request.</param>
        /// <param name="proxy">The proxy to use, or null.</param>
        /// <param name="timeout">Timeout in milliseconds.</param>
        /// <returns>The cookie header string, in the form name=value; name2=value2.</returns>
        // NOTE: the default timeout value is missing from the source text; 100000 ms is an
        // assumption that matches the built-in default of HttpWebRequest.Timeout.
        public static string GetCookieString(string url, WebProxy proxy, int timeout = 100000)
        {
            HttpWebRequest request = null;
            HttpWebResponse response = null;
            try
            {
                CookieContainer cc = new CookieContainer();
                request = (HttpWebRequest)WebRequest.Create(url);
                request.Proxy = proxy;
                request.Timeout = timeout;
                request.AllowAutoRedirect = true;
                request.CookieContainer = cc;
                // The container is filled from the Set-Cookie headers when the response arrives.
                response = (HttpWebResponse)request.GetResponse();
                string cookieHeader = cc.GetCookieHeader(request.RequestUri);
                return cookieHeader;
            }
            finally
            {
                // No catch block: exceptions propagate with their original stack trace.
                if (response != null)
                {
                    response.Close();
                }
                if (request != null)
                {
                    request.Abort();
                }
            }
        }

Setting up a proxy

        /// <summary>
        /// Creates a proxy.
        /// </summary>
        /// <param name="port">The proxy address. Despite the parameter name, the value is passed to new Uri(...), so it must be an absolute URI such as http://host:port.</param>
        /// <param name="user">User name.</param>
        /// <param name="password">Password.</param>
        /// <returns>The configured proxy.</returns>
        public static WebProxy CreatePorxy(string port, string user, string password)
        {
            WebProxy proxy = new WebProxy();
            proxy.Address = new Uri(port);
            proxy.Credentials = new NetworkCredential(user, password);
            return proxy;
        }
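Because the address goes through new Uri(...), a bare host name such as "xx.com" is rejected as an invalid URI. A sketch of a variant that takes host and port separately and builds the absolute URI itself is shown below; ProxyFactory.Create is a hypothetical helper, and the http scheme is an assumption.

```csharp
using System;
using System.Net;

public static class ProxyFactory
{
    // Hypothetical helper: builds an http:// proxy address from host and port,
    // and attaches credentials only when a user name is supplied.
    public static WebProxy Create(string host, int port, string user, string password)
    {
        Uri address = new UriBuilder("http", host, port).Uri;
        WebProxy proxy = new WebProxy(address);
        if (!string.IsNullOrEmpty(user))
        {
            proxy.Credentials = new NetworkCredential(user, password);
        }
        return proxy;
    }
}
```

For example, ProxyFactory.Create("127.0.0.1", 8888, null, null) produces a proxy pointing at http://127.0.0.1:8888/ with no credentials.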

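Using WebBrowser to fetch pages generated by JavaScript

Note: there is no way to know when the page's scripts have finished running, so this code waits 5 seconds and then assumes they are done. There is room to make this more efficient.

WebBrowser also has to run on a single-threaded apartment thread, so mark the entry point with [STAThread].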

        /// <summary>
        /// Fetches a page whose content is generated by JavaScript.
        /// </summary>
        /// <param name="url">The URL to load.</param>
        /// <returns>The inner HTML of the loaded document's active element.</returns>
        public static string CrawlDynamic(string url)
        {
            WebBrowser browser = new WebBrowser();
            browser.ScriptErrorsSuppressed = true;
            browser.Navigate(url);
            // First wait for the document itself to finish loading.
            while (browser.ReadyState != WebBrowserReadyState.Complete)
            {
                Application.DoEvents();
            }
            System.Timers.Timer timer = new System.Timers.Timer();
            var isComplete = false;
            timer.Elapsed += new System.Timers.ElapsedEventHandler((sender, e) =>
            {
                // The wait is over; treat the page as fully rendered.
                isComplete = true;
                timer.Stop();
            });
            // 5 seconds, as described in the note above (the digits are missing from the source text).
            timer.Interval = 5 * 1000;
            timer.Start();
            // Keep pumping messages for 5 s so the page's scripts can finish.
            while (!isComplete)
                Application.DoEvents();
            var htmldocument = browser.Document;
            return htmldocument.ActiveElement.InnerHtml;
        }
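When the caller is not already on an [STAThread] entry point (a console worker or a thread-pool thread, for instance), one option is to run the call on a dedicated STA thread. The sketch below shows that pattern under stated assumptions: StaRunner is a hypothetical helper, and setting the STA apartment state is only supported on Windows.

```csharp
using System;
using System.Threading;

public static class StaRunner
{
    // Hypothetical helper: runs the given function on a new STA thread,
    // waits for it to finish, and returns its result.
    public static T Run<T>(Func<T> work)
    {
        T result = default(T);
        Exception error = null;
        var thread = new Thread(() =>
        {
            try
            {
                result = work();
            }
            catch (Exception ex)
            {
                // Capture the failure so it can be rethrown on the calling thread.
                error = ex;
            }
        });
        thread.SetApartmentState(ApartmentState.STA); // Windows only
        thread.Start();
        thread.Join();
        if (error != null)
        {
            throw new InvalidOperationException("Work on the STA thread failed.", error);
        }
        return result;
    }
}
```

A call would then look like StaRunner.Run(() => CrawlDynamic(url)), where CrawlDynamic is the method above. Whether WebBrowser behaves well on such a thread should be verified in your own environment.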

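Setting cookies on WebBrowser to simulate a login

At first this never worked, and I assumed the method was unusable. It turned out the domain was set incorrectly: in my case the site was www.aa.xxx.com, and setting the cookie for http://xx.com worked. You may need to choose the domain to suit your own situation.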

        [DllImport("wininet.dll", CharSet = CharSet.Auto, SetLastError = true)]
        public static extern bool InternetSetCookie(string lpszUrlName, string lbszCookieName, string lpszCookieData);

        /// <summary>
        /// Sets cookies for WebBrowser.
        /// </summary>
        /// <param name="cookieStr">The cookie string, which can be obtained from the method above.</param>
        /// <param name="domain">The domain (URL) to set the cookies for.</param>
        public static void SetCookie(string cookieStr, string domain)
        {
            foreach (string c in cookieStr.Split(';'))
            {
                // Split on the first '=' only, so values that contain '=' are kept whole.
                string[] item = c.Split(new[] { '=' }, 2);
                if (item.Length == 2)
                {
                    // Trim the space that follows each ';' in the cookie header.
                    string name = item[0].Trim();
                    string value = item[1].Trim();
                    InternetSetCookie(domain, name, value);
                }
            }
        }
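The cookie string returned by GetCookieHeader has the form name=value; name2=value2, and values may themselves contain '='. The splitting step can be checked on its own, without calling into wininet.dll. A minimal sketch follows; CookieHeaderParser is a hypothetical name used for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class CookieHeaderParser
{
    // Hypothetical helper: splits a cookie header into name/value pairs.
    // Each part is split at its first '=' only, and surrounding spaces are trimmed.
    public static List<KeyValuePair<string, string>> Parse(string cookieHeader)
    {
        var pairs = new List<KeyValuePair<string, string>>();
        string[] parts = cookieHeader.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string part in parts)
        {
            int eq = part.IndexOf('=');
            if (eq < 0)
            {
                continue; // no '=' at all: not a name/value pair
            }
            string name = part.Substring(0, eq).Trim();
            string value = part.Substring(eq + 1).Trim();
            if (name.Length > 0)
            {
                pairs.Add(new KeyValuePair<string, string>(name, value));
            }
        }
        return pairs;
    }
}
```

Each resulting pair can then be passed to InternetSetCookie together with the chosen domain URL.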

Usage demo

            // Proxy; pass null if you have none. The address is parsed with new Uri(...),
            // so it is written here as an absolute URI (the original passed "xx.com").
            WebProxy proxy = WebCrawl.WebRequestHelper.CreatePorxy("http://xx.com", "user", "password");
            // Get the cookies from the login page
            CookieContainer cookie = WebCrawl.WebRequestHelper.GetCookie("http://xxxx.login.com", proxy);
            // Fetch the page
            string html = WebCrawl.WebRequestHelper.Crawl("http://xxx.index.com", proxy, cookie);
            // Get the cookie string from the login page
            string cookiestr = WebCrawl.WebRequestHelper.GetCookieString("http://xxxx.login.com", proxy);
            // Set the cookies for WebBrowser
            WebCrawl.WebRequestHelper.SetCookie(cookiestr, "https://xx.com");
            // Fetch a page that requires login and is generated by JavaScript; ordinary pages work too
            string htmlWithJs = WebCrawl.WebRequestHelper.CrawlDynamic("http://xxx.index.com");
