Nutch网页抓取速度优化

Here are the things that could potentially slow down fetching

1) DNS setup

2) The number of crawlers you have, too many, too few.

3) Bandwidth limitations

4) Number of threads per host (politeness)

5) Uneven distribution of urls to fetch and politeness.

6) High crawl-delays from robots.txt (usually along with an uneven distribution of urls).

7) Many slow websites (again usually with an uneven distribution).

8) Downloading lots of content (PDFS, very large html pages, again possibly an uneven distribution).

9) Others

Now how do we fix them

1) Have a DNS setup on each local crawling machine, if multiple crawling machines and a single centralized DNS it can act like a DOS attack on the DNS server slowing the entire system. We always did a two layer setup hitting first to the local DNS cache then to a large DNS cache like OpenDNS or Verizon.

2) This would be number of map tasks * fetcher.threads.fetch. So 10 map tasks * 20 threads = 200 fetchers at once. Too many and you overload your system, too few and other factors and the machine sites idle. You will need to play around with this setting for your setup.

3) Bandwidth limitations. Use ntop, ganglia, and other monitoring tools to determine how much bandwidth you are using. Account for in and out bandwidth. A simple test, from a server inside the fetching network but not itself fetching, if it is very slow connecting to or downloading content when fetching is occurring, it is a good bet you are maxing out bandwidth. If you set http timeout as we describe later and are maxing your bandwidth, you will start seeing many http timeout errors.

4) Politeness along with uneven distribution of urls is probably the biggest limiting factor. If one thread is processing a single site and there are a lot of urls from that site to fetch all other threads will sit idle while that one thread finishes. Some solutions, use fetcher.server.delay to shorten the time between page fetches and use fetcher.threads.per.host to increase the number of threads fetching for a single site (this would still be in the same map task though and hence the same JVM ChildTask process). If increasing this > 0 you could also set fetcher.server.min.delay to some value > 0 for politeness to min and max bound the process.

5) Fetching a lot of pages from a single site or a lot of pages from a few sites will slow down fetching dramatically. For full web crawls you want an even distribution so all fetching threads can be active. Setting generate.max.per.host to a value > 0 will limit the number of pages from a single host/domain to fetch.

6) Crawl-delay can be used and is obeyed by nutch in robots.txt. Most sites don't use this setting but a few (some malicious do). I have seen crawl-delays as high as 2 days in seconds. The fetcher.max.crawl.delay variable will ignore pages with crawl delays > x. I usually set this to 10 seconds, default is 30. Even at 10 seconds if you have a lot of pages from a site from which you can only crawl 1 page every 10 seconds it is going to be slow. On the flip side, setting this to a low value will ignore and not fetch those pages.

7) Sometimes, manytimes websites are just slow. Setting a low value for http.timeout helps. The default is 10 seconds. If you don't care and want as many pages as fast as possible, set it lower. Some websites, digg for instance, will bandwidth limit you on their side only allowing x connections per given time frame. So even if you only have say 50 pages from a single site (which I still think is to many). It may be waiting 10 seconds on each page. The ftp.timeout can also be set if fetching ftp content.

8) Lots of content means slower fetching. If downloading PDFs and other non-html documents this is especially true. To avoid non-html content you can use the url filters. I prefer the prefix and suffix filters. The http.content.limit and ftp.content.limit can be used to limit the amount of content downloaded for a single document.

9) Other things that could be causing slow fetching:

Max the number of open sockets/files on a machine. You will start seeing IO errors or can't open socket errors.
Poor routing. Bad routers or home routers might not be able to handle the number of connections going through at once. An incorrect routing setup could also be causing problems but those are usually much more complex to diagnose. Use network trace and mapping tools if you think this is happening. Upstream routing can also be a problem from your network provider.
Bad network cards. I have seen network cards flip once they reach a certain bandwidth point. This was more prevalent on, at the time, newer gigabit cards. Not usually my first thought but always a possibility. Use tcpdump and network monitoring tools on the single interface.

Nutch网页抓取速度优化的更多相关文章

Heritrix源码分析(三) 修改配置文件order.xml加快你的抓取速度(转）
本博客属原创文章,欢迎转载!转载请务必注明出处:http://guoyunsky.iteye.com/blog/629891 本博客已迁移到本人独立博客: http://www.yun5u ...
基于Casperjs的网页抓取技术【抓取豆瓣信息网络爬虫实战示例】
CasperJS is a navigation scripting & testing utility for the PhantomJS (WebKit) and SlimerJS (Ge ...
Python网络爬虫笔记（一）：网页抓取方式和LXML示例
(一) 三种网页抓取方法 1. 正则表达式: 模块使用C语言编写,速度快,但是很脆弱,可能网页更新后就不能用了. 2. Beautiful Soup 模块使用Python编写,速度慢. ...
PID控制器的应用：控制网络爬虫抓取速度
一.初识PID控制器冬天乡下人喜欢烤火取暖,常见的情形就是四人围着麻将桌,桌底放一盆碳火.有人觉得火不够大,那加点木炭吧,还不够,再加点.片刻之后,又觉得火太大,脚都快被烤熟了,那就取出一些木碳…… ...
实现织梦dedecms百度主动推送(实时)网页抓取
做百度推广的时候,如何让百度快速收录呢,下面提供了三种方式,今天我们主要讲的是第一种. 如何选择链接提交方式 1.主动推送:最为快速的提交方式,推荐您将站点当天新产出链接立即通过此方式推送给百度,以保 ...
分享一个c#t的网页抓取类
using System; using System.Collections.Generic; using System.Web; using System.Text; using System.Ne ...
java网页抓取
网页抓取就是,我们想要从别人的网站上得到我们想要的,也算是窃取了,有的网站就对这个网页抓取就做了限制,比如百度直接进入正题 //要抓取的网页地址 String urlStr = "http ...
网页抓取：PHP实现网页爬虫方式小结
来源:http://www.ido321.com/1158.html 抓取某一个网页中的内容,需要对DOM树进行解析,找到指定节点后,再抓取我们需要的内容,过程有点繁琐.LZ总结了几种常用的.易于实现 ...
Java实现网页抓取的一个Demo
这个小案例的话我是存放在我的github 上. 下面给出链接自己可以去看下,也可以直接下载源码.有具体的说明 <Java网页抓取>

随机推荐

未能加载文件或程序集 XXX 或它的一个依赖项。参数错误
引发原因 :电脑突然蓝屏重启解决方法:删除 C:\Windows\Microsoft.NET\Framework\v4.0.30319\Temporary ASP.NET Files 下的所有文件 ...
Python学习之路15☞socket编程
一客户端/服务器架构即C/S架构,包括 1.硬件C/S架构(打印机) 2.软件C/S架构(web服务) C/S架构与socket的关系: 我们学习socket就是为了完成C/S架构的开发二 os ...
Java练习 SDUT-2728_最佳拟合直线
最佳拟合直线 Time Limit: 1000 ms Memory Limit: 65536 KiB Problem Description 在很多情况下,天文观测得到的数据是一组包含很大数量的序列点 ...
UVA_490:Rotating Sentences
"R Ie n te h iD ne kc ,a r tt he es r eo fn oc re e s Ia i ad m, . ...
@codeforces - 1086F@ Forest Fires
目录 @description@ @solution@ @accepted code@ @details@ @description@ 一个无穷大的方格图,每个方格内都种了棵树. 一开始点燃了 n 棵 ...
补充：css制作三角
梯形图案看下面这段样式: .test{width:10px; height:10px; border:10px solid; border-color:#ff3300 #0000ff #339966 ...
11-2 css盒模型和浮动以及矢量图用法
一盒模型 1属性 width:内容的宽度 height: 内容的高度 padding:内边距,边框到内容的距离 border: 边框,就是指的盒子的宽度 margin:外边距,盒子边框到附近最近盒子 ...
Python两个字典键同值相加的几种方法
版权声明:本文为博主原创文章,遵循 CC 4.0 by-sa 版权协议,转载请附上原文出处链接和本声明.本文链接:https://blog.csdn.net/Jerry_1126/article/de ...
oracle函数 TRANSLATE(c1,c2,c3)
[功能]将字符表达式值中,指定字符替换为新字符 [说明]多字节符(汉字.全角符等),按1个字符计算 [参数] c1 希望被替换的字符或变量 c2 查询原始的字符集 c3 替换新的字符集,将 ...
oracle函数 LENGTHC(c1).LENGTH2(c1).LENGTH4(c1)
[功能]返回字符串的长度; [说明]多字节符(汉字.全角符等),按1个字符计算 [参数]C1 字符串 [返回]数值型 [示例] SQL> select length('高乾竞'),length( ...

Nutch网页抓取速度优化

Nutch网页抓取速度优化的更多相关文章

随机推荐

热门专题