C# & Fizzler: crawling web pages within a website domain
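This post uses Fizzler [an HtmlAgilityPack extension] together with C# to extract data from web pages. Fizzler extends HtmlAgilityPack with support for jQuery-style (CSS) selectors.
Extracting data usually means assembling URLs that follow a pattern, then requesting each one and parsing the response:
1. Suppose every xxx.sample.com/contactus.html page under a website contains an email field (the data to extract).
a) The site has subdomains, e.g. a.sample.com, aadr.sample.com, 135dj.sample.com, and their names are fairly random;
Solution: search Bing for site:b2b.sample.com. The subdomains can be extracted from the result pages and assembled into xxx.sample.com/contactus.html; a request is then sent to each of these URLs and the response is parsed;
NOTE: the search URL for site:b2b.sample.com is assembled as follows,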
http://www.bing.com/search?q=site%3A{b2b.sample.com}&go=Submit&qs=n&form=QBRE&pq=site%3A{b2b.sample.com}&sc=1-19&sp=-1&sk=&cvid=6165a189f5354b1982fb8cd6933abb6f&first={pageIndex}&FORM=PERE
2. Pages such as www.sample.com/1456.html can be assembled directly as 1456.html, 1457.html, 1458.html, etc.; they are not enumerated here;
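The two URL patterns above can be sketched as small string helpers. This is only an illustration: the class and method names are hypothetical, and the Bing URL keeps just the q and first parameters, assuming the search still works without the tracking parameters shown in the NOTE.

```csharp
using System;
using System.Collections.Generic;

public static class CrawlUrls
{
    // Hypothetical helper: Bing "site:" query URL for the result page starting at item `first`.
    // Assumption: the extra parameters in the original URL (cvid, sc, FORM, ...) can be omitted.
    public static string BuildBingSiteSearchUrl(string domain, int first)
    {
        var query = Uri.EscapeDataString("site:" + domain); // ':' becomes %3A
        return "http://www.bing.com/search?q=" + query + "&first=" + first;
    }

    // Case 1: one contact page per subdomain.
    public static string BuildContactUrl(string host)
    {
        return "http://" + host + "/contactus.html";
    }

    // Case 2: numbered pages such as 1456.html, 1457.html, ...
    public static IEnumerable<string> BuildSequentialUrls(string host, int firstId, int count)
    {
        for (var id = firstId; id < firstId + count; id++)
        {
            yield return "http://" + host + "/" + id + ".html";
        }
    }
}
```

For example, CrawlUrls.BuildSequentialUrls("www.sample.com", 1456, 3) yields the URLs for 1456.html, 1457.html and 1458.html.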
Using Fizzler:
1. Install Fizzler from NuGet;
2. For usage, see the project page on code.google.com;
3. Use Bing to extract all subdomains under a website:
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Text;
using HtmlAgilityPack;
using Fizzler.Systems.HtmlAgilityPack;

// The default parameter values were lost in the source, so all three are required here.
private static List<string> GetSubdomains(string websiteDomain, int startPageIndex, int pageCount, int pageSize)
{
    var list = new List<string>();
    // Use Bing to search for the subdomains of a given website.
    var bingSearchUrlFormat = "http://www.bing.com/search?q=site%3a{0}&go=Submit&qs=n&pq=site%3a{0}&sc=1-100&sp=-1&sk=&cvid=a9b36439006f4b05b09f9202c5b784bd&first={1}&FORM=PQRE";
    using (var client = new WebClient())
    {
        client.Encoding = Encoding.UTF8;
        var doc = new HtmlDocument();
        // The numeric literals of this expression were lost in the source. Rounding down to a
        // page boundary and adding 1 (Bing's "first" is 1-based) is an assumed reconstruction.
        var first = (startPageIndex / pageSize) * pageSize + 1;
        var stopIndex = first + pageCount * pageSize;
        for (var startItemSequenceNumber = first; startItemSequenceNumber < stopIndex; startItemSequenceNumber += pageSize)
        {
            var response = client.DownloadString(string.Format(bingSearchUrlFormat, websiteDomain, startItemSequenceNumber));
            HtmlDocumentExtensions.LoadHtml2(doc, response);
            var docNode = doc.DocumentNode;
            // Each result's <cite> element holds the URL that Bing displays.
            var subDomains = docNode.QuerySelectorAll(".sb_meta cite");
            foreach (var subDomain in subDomains)
            {
                list.Add(subDomain.InnerText);
            }
        }
    }
    return list;
}
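The list returned above holds the raw text of each cite element and may contain the same subdomain many times. A minimal sketch of cleaning it up follows; it assumes the cite text can carry a scheme or a path after the host (an assumption about Bing's markup, not something the original states), and SubdomainCleaner is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;

public static class SubdomainCleaner
{
    // Reduces cite text such as "http://A.sample.com/contactus.html" to "a.sample.com".
    // Returns an empty string when no host can be found.
    public static string ToHost(string citeText)
    {
        if (string.IsNullOrWhiteSpace(citeText)) return "";
        var text = citeText.Trim();
        var schemeEnd = text.IndexOf("://", StringComparison.Ordinal);
        if (schemeEnd >= 0) text = text.Substring(schemeEnd + 3);
        // Cut at the first character that cannot belong to the host part.
        var cut = text.IndexOfAny(new[] { '/', ' ', '?', '#' });
        if (cut >= 0) text = text.Substring(0, cut);
        return text.ToLowerInvariant();
    }

    // Keeps the first occurrence of each host, preserving order.
    public static List<string> DistinctHosts(IEnumerable<string> citeTexts)
    {
        var seen = new HashSet<string>();
        var hosts = new List<string>();
        foreach (var cite in citeTexts)
        {
            var host = ToHost(cite);
            if (host.Length > 0 && seen.Add(host)) hosts.Add(host);
        }
        return hosts;
    }
}
```

Passing the result of GetSubdomains through SubdomainCleaner.DistinctHosts before building the contactus.html URLs avoids requesting the same page more than once.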
4. Fetch the nodes of a web page:
private static List<HtmlNode> GetWebPageNodes(string url, string elementSelector, string attributeNameContained, string attributeNameContainedValueLike)
{
    string response;
    using (var client = new WebClient())
    {
        client.Encoding = Encoding.UTF8;
        response = client.DownloadString(url);
    }
    var doc = new HtmlDocument();
    HtmlDocumentExtensions.LoadHtml2(doc, response);
    var docNode = doc.DocumentNode;
    // Keep the elements matching the selector whose named attribute contains the given text.
    // GetAttributeValue falls back to string.Empty when the attribute is missing, so a node
    // without the attribute is filtered out instead of causing a NullReferenceException.
    var nodes = (from node in docNode.QuerySelectorAll(elementSelector)
                 where node.HasAttributes && node.GetAttributeValue(attributeNameContained, string.Empty).Contains(attributeNameContainedValueLike)
                 select node).ToList();
    return nodes;
}
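5. Getting the email address from a page:
// stopPageIndex comes from the original call; the remaining arguments were lost in the source,
// so pageCount and pageSize are placeholders for the caller's own values.
var subdomains = GetSubdomains("b2b.sample.com", stopPageIndex, pageCount, pageSize);
var urlFormat = "http://{0}/contactus.html";
foreach (var item in subdomains)
{
    // First <a> inside a table whose href contains "mailto", or null if the page has none.
    var emailNode = GetWebPageNodes(string.Format(urlFormat, item), "body table a", "href", "mailto").FirstOrDefault();
}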
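The call above yields an anchor node, not the address itself. One way to pull the address out of its href value is sketched below; MailtoParser is a hypothetical helper, and it assumes the href has the usual mailto:address?query form.

```csharp
using System;

public static class MailtoParser
{
    // Returns the address part of a "mailto:" href, or an empty string if there is none.
    public static string ExtractEmail(string href)
    {
        if (string.IsNullOrEmpty(href)) return "";
        const string prefix = "mailto:";
        var start = href.IndexOf(prefix, StringComparison.OrdinalIgnoreCase);
        if (start < 0) return "";
        var address = href.Substring(start + prefix.Length);
        // Drop an optional "?subject=..." style query.
        var queryStart = address.IndexOf('?');
        if (queryStart >= 0) address = address.Substring(0, queryStart);
        // Decode percent-escapes such as %40 for '@'.
        return Uri.UnescapeDataString(address).Trim();
    }
}
```

With HtmlAgilityPack, the href value can be read with node.GetAttributeValue("href", string.Empty) and passed to MailtoParser.ExtractEmail, after checking that the node is not null.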
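One remaining problem: Bing limits subdomain searches. After 100 to 150 requests, the response is no longer the results page I want but an HTML page asking for a CAPTCHA to guard against attacks. I have not solved this yet; any pointers would be much appreciated!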
<!DOCTYPE html> <html> <head> <meta charset="utf-8"> <title> ...