Optimizing Nutch Web Page Fetch Speed

Here are the things that could potentially slow down fetching:

1) DNS setup

2) The number of crawlers you have (too many or too few).

3) Bandwidth limitations

4) Number of threads per host (politeness)

5) Uneven distribution of urls to fetch and politeness.

6) High crawl-delays from robots.txt (usually along with an uneven distribution of urls).

7) Many slow websites (again usually with an uneven distribution).

8) Downloading lots of content (PDFs, very large HTML pages, again possibly with an uneven distribution).

9) Others

Now, how do we fix them?

1) Have a DNS setup on each local crawling machine. With multiple crawling machines and a single centralized DNS server, the lookups can act like a DoS attack on that server and slow the entire system. We always used a two-layer setup, hitting a local DNS cache first and then a large DNS cache such as OpenDNS or Verizon.
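A quick way to see whether lookups are slow from a crawling machine is to time them through the system resolver. This is only a rough sketch: the function name `time_lookup` is mine, and `localhost` is a stand-in so the snippet runs offline. Swap in hostnames from your own fetch list.

```python
import socket
import time


def time_lookup(hostname):
    """Return the seconds spent resolving hostname via the system resolver."""
    start = time.perf_counter()
    socket.getaddrinfo(hostname, None)
    return time.perf_counter() - start


# "localhost" is a placeholder; use real hosts from your fetch list.
first = time_lookup("localhost")
repeat = time_lookup("localhost")
print(f"first: {first * 1000:.2f} ms, repeat: {repeat * 1000:.2f} ms")
```

If repeat lookups of the same host are not clearly faster than the first one, the local cache layer is probably not being used.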

2) This is the number of map tasks * fetcher.threads.fetch. So 10 map tasks * 20 threads = 200 fetchers at once. Too many and you overload your system; too few and, combined with the other factors here, the machine sits idle. You will need to experiment with this setting for your setup.
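The arithmetic is trivial, but it helps to have it written down when trying different combinations. A minimal sketch (the function name is made up for illustration):

```python
def concurrent_fetchers(map_tasks, threads_per_task):
    """Total fetcher threads running at once across all map tasks."""
    if map_tasks < 0 or threads_per_task < 0:
        raise ValueError("counts must be non-negative")
    return map_tasks * threads_per_task


# The example from the text: 10 map tasks, fetcher.threads.fetch = 20.
print(concurrent_fetchers(10, 20))  # prints 200
```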

3) Use ntop, ganglia, and other monitoring tools to determine how much bandwidth you are using. Account for both inbound and outbound bandwidth. A simple test: from a server inside the fetching network that is not itself fetching, try connecting to or downloading content while fetching is occurring. If that is very slow, it is a good bet you are maxing out bandwidth. If you set http.timeout as described in item 7 and are maxing out your bandwidth, you will start seeing many HTTP timeout errors.

4) Politeness along with an uneven distribution of URLs is probably the biggest limiting factor. If one thread is processing a single site and there are a lot of URLs from that site to fetch, all other threads will sit idle while that one thread finishes. Some solutions: use fetcher.server.delay to shorten the time between page fetches, and use fetcher.threads.per.host to increase the number of threads fetching from a single site (this is still within the same map task and hence the same JVM ChildTask process). If you increase this above 1, you can also set fetcher.server.min.delay to some value > 0 for politeness, so the process is bounded at both the min and the max.
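To get a feel for how much one busy host can dominate, here is a rough back-of-the-envelope estimate. It is a simplified model, not how Nutch schedules internally: I assume each thread on a host waits `delay` seconds between its requests and I ignore download time entirely. The function and parameter names are mine.

```python
import math
from collections import Counter
from urllib.parse import urlsplit


def min_fetch_seconds(urls, delay=5.0, threads_per_host=1):
    """Lower bound on fetch time imposed by politeness alone.

    Simplified model (an assumption): the busiest host decides the total,
    and each of its threads waits `delay` seconds between requests.
    """
    per_host = Counter(urlsplit(u).hostname for u in urls)
    if not per_host:
        return 0.0
    busiest = max(per_host.values())
    rounds = math.ceil(busiest / threads_per_host)
    return (rounds - 1) * delay  # N requests have N - 1 gaps between them


urls = [f"http://a.example/p{i}" for i in range(100)]
urls += ["http://b.example/", "http://c.example/"]
print(min_fetch_seconds(urls, delay=5.0, threads_per_host=1))  # prints 495.0
print(min_fetch_seconds(urls, delay=5.0, threads_per_host=2))  # prints 245.0
```

The two small hosts finish almost immediately; everything else is time spent waiting on a.example.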

5) Fetching a lot of pages from a single site or a lot of pages from a few sites will slow down fetching dramatically. For full web crawls you want an even distribution so all fetching threads can be active. Setting generate.max.per.host to a value > 0 will limit the number of pages from a single host/domain to fetch.
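The effect of a per-host cap can be sketched in a few lines. This is only an illustration of the idea: it keeps URLs in input order, which is an assumption of mine and not a description of how the Nutch generator picks which pages to keep. As in the text, only a value > 0 limits anything.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def cap_per_host(urls, max_per_host):
    """Keep at most max_per_host URLs per hostname, preserving input order.

    A value <= 0 means no limit.
    """
    if max_per_host <= 0:
        return list(urls)
    seen = defaultdict(int)
    kept = []
    for url in urls:
        host = urlsplit(url).hostname
        if seen[host] < max_per_host:
            seen[host] += 1
            kept.append(url)
    return kept


sample = [
    "http://a.example/1",
    "http://a.example/2",
    "http://a.example/3",
    "http://b.example/1",
]
print(cap_per_host(sample, 2))
```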

6) Crawl-delay in robots.txt is obeyed by Nutch. Most sites don't use this setting, but a few (some malicious) do. I have seen crawl-delays as high as 2 days, expressed in seconds. With fetcher.max.crawl.delay set to x, the fetcher skips pages whose crawl delay is > x. I usually set this to 10 seconds; the default is 30. Even at 10 seconds, if you have a lot of pages from a site that you can only crawl at 1 page every 10 seconds, it is going to be slow. On the flip side, setting this to a low value means those pages are skipped and never fetched.
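The skip decision can be sketched with Python's standard robots.txt parser. This is not Nutch's code, just the same idea under my own names (`should_fetch`, the agent string `mybot`), and the robots.txt bodies are made up.

```python
from urllib.robotparser import RobotFileParser


def should_fetch(robots_txt, agent, max_crawl_delay):
    """Return False when the site's Crawl-delay exceeds max_crawl_delay."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(agent)  # None when no Crawl-delay applies
    return delay is None or delay <= max_crawl_delay


# 172800 seconds is 2 days, like the extreme case mentioned above.
hostile = "User-agent: *\nCrawl-delay: 172800\nDisallow: /tmp/\n"
polite = "User-agent: *\nCrawl-delay: 5\nDisallow: /tmp/\n"
print(should_fetch(hostile, "mybot", 10))  # prints False
print(should_fetch(polite, "mybot", 10))   # prints True
```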

7) Sometimes, many times in fact, websites are just slow. Setting a low value for http.timeout helps. The default is 10 seconds. If you don't care and want as many pages as fast as possible, set it lower. Some websites, Digg for instance, will bandwidth-limit you on their side, only allowing x connections per given time frame. So even if you only have, say, 50 pages from a single site (which I still think is too many), the fetcher may be waiting 10 seconds on each page. The ftp.timeout can also be set if fetching FTP content.

8) Lots of content means slower fetching. If downloading PDFs and other non-html documents this is especially true. To avoid non-html content you can use the url filters. I prefer the prefix and suffix filters. The http.content.limit and ftp.content.limit can be used to limit the amount of content downloaded for a single document.
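The suffix-filter idea boils down to rejecting URLs whose path ends in an extension you don't want. A minimal sketch in Python, not the Nutch filter plugin itself; the suffix list and function name are hypothetical and should be adjusted to your crawl.

```python
from urllib.parse import urlsplit

# Hypothetical list of suffixes to skip; extend it for your own crawl.
SKIP_SUFFIXES = (".pdf", ".zip", ".gz", ".jpg", ".png", ".mp4")


def looks_like_html(url):
    """True unless the URL path ends in one of SKIP_SUFFIXES (case-insensitive)."""
    path = urlsplit(url).path.lower()
    return not path.endswith(SKIP_SUFFIXES)


print(looks_like_html("http://example.com/index.html"))      # prints True
print(looks_like_html("http://example.com/paper.PDF?dl=1"))  # prints False
```

Checking only the path means query strings do not hide the extension. It also means a PDF served from a URL without a suffix still gets through, which is where the content limits help.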

9) Other things that could be causing slow fetching:

    Hitting the maximum number of open sockets/files on a machine. You will start seeing IO errors or "can't open socket" errors.
    Poor routing. Bad routers or home routers might not be able to handle the number of connections going through at once. An incorrect routing setup could also be causing problems, but those are usually much more complex to diagnose. Use network trace and mapping tools if you think this is happening. Upstream routing at your network provider can also be a problem.
    Bad network cards. I have seen network cards flip once they reach a certain bandwidth point. This was more prevalent on what were, at the time, newer gigabit cards. Not usually my first thought but always a possibility. Use tcpdump and network monitoring tools on the single interface.
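For the open sockets/files limit, it is easy to check how much room the fetcher process has. A small sketch for Unix-like systems using Python's resource module; the reserve of 256 descriptors for logs, jars and the like is a guess of mine, not a measured figure.

```python
import resource


def fits_in_fd_limit(planned_sockets, soft_limit, reserve=256):
    """True if planned sockets plus a reserve fit under the soft limit."""
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return planned_sockets + reserve <= soft_limit


soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft limit:", soft, "hard limit:", hard)
print("200 fetcher threads fit:", fits_in_fd_limit(200, soft))
```

If the answer is no, raise the limit for the user running the fetcher before adding more threads.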
