Nutch网页抓取速度优化

Here are the things that could potentially slow down fetching

1) DNS setup

2) The number of crawlers you have, too many, too few.

3) Bandwidth limitations

4) Number of threads per host (politeness)

5) Uneven distribution of urls to fetch and politeness.

6) High crawl-delays from robots.txt (usually along with an uneven distribution of urls).

7) Many slow websites (again usually with an uneven distribution).

8) Downloading lots of content (PDFS, very large html pages, again possibly an uneven distribution).

9) Others

Now how do we fix them

1) Have a DNS setup on each local crawling machine, if multiple crawling machines and a single centralized DNS it can act like a DOS attack on the DNS server slowing the entire system. We always did a two layer setup hitting first to the local DNS cache then to a large DNS cache like OpenDNS or Verizon.

2) This would be number of map tasks * fetcher.threads.fetch. So 10 map tasks * 20 threads = 200 fetchers at once. Too many and you overload your system, too few and other factors and the machine sites idle. You will need to play around with this setting for your setup.

3) Bandwidth limitations. Use ntop, ganglia, and other monitoring tools to determine how much bandwidth you are using. Account for in and out bandwidth. A simple test, from a server inside the fetching network but not itself fetching, if it is very slow connecting to or downloading content when fetching is occurring, it is a good bet you are maxing out bandwidth. If you set http timeout as we describe later and are maxing your bandwidth, you will start seeing many http timeout errors.

4) Politeness along with uneven distribution of urls is probably the biggest limiting factor. If one thread is processing a single site and there are a lot of urls from that site to fetch all other threads will sit idle while that one thread finishes. Some solutions, use fetcher.server.delay to shorten the time between page fetches and use fetcher.threads.per.host to increase the number of threads fetching for a single site (this would still be in the same map task though and hence the same JVM ChildTask process). If increasing this > 0 you could also set fetcher.server.min.delay to some value > 0 for politeness to min and max bound the process.

5) Fetching a lot of pages from a single site or a lot of pages from a few sites will slow down fetching dramatically. For full web crawls you want an even distribution so all fetching threads can be active. Setting generate.max.per.host to a value > 0 will limit the number of pages from a single host/domain to fetch.

6) Crawl-delay can be used and is obeyed by nutch in robots.txt. Most sites don't use this setting but a few (some malicious do). I have seen crawl-delays as high as 2 days in seconds. The fetcher.max.crawl.delay variable will ignore pages with crawl delays > x. I usually set this to 10 seconds, default is 30. Even at 10 seconds if you have a lot of pages from a site from which you can only crawl 1 page every 10 seconds it is going to be slow. On the flip side, setting this to a low value will ignore and not fetch those pages.

7) Sometimes, manytimes websites are just slow. Setting a low value for http.timeout helps. The default is 10 seconds. If you don't care and want as many pages as fast as possible, set it lower. Some websites, digg for instance, will bandwidth limit you on their side only allowing x connections per given time frame. So even if you only have say 50 pages from a single site (which I still think is to many). It may be waiting 10 seconds on each page. The ftp.timeout can also be set if fetching ftp content.

8) Lots of content means slower fetching. If downloading PDFs and other non-html documents this is especially true. To avoid non-html content you can use the url filters. I prefer the prefix and suffix filters. The http.content.limit and ftp.content.limit can be used to limit the amount of content downloaded for a single document.

9) Other things that could be causing slow fetching:

Max the number of open sockets/files on a machine. You will start seeing IO errors or can't open socket errors.
    Poor routing. Bad routers or home routers might not be able to handle the number of connections going through at once. An incorrect routing setup could also be causing problems but those are usually much more complex to diagnose. Use network trace and mapping tools if you think this is happening. Upstream routing can also be a problem from your network provider.
    Bad network cards. I have seen network cards flip once they reach a certain bandwidth point. This was more prevalent on, at the time, newer gigabit cards. Not usually my first thought but always a possibility. Use tcpdump and network monitoring tools on the single interface.

Nutch网页抓取速度优化的更多相关文章

  1. Heritrix源码分析(三) 修改配置文件order.xml加快你的抓取速度(转)

    本博客属原创文章,欢迎转载!转载请务必注明出处:http://guoyunsky.iteye.com/blog/629891       本博客已迁移到本人独立博客: http://www.yun5u ...

  2. 基于Casperjs的网页抓取技术【抓取豆瓣信息网络爬虫实战示例】

    CasperJS is a navigation scripting & testing utility for the PhantomJS (WebKit) and SlimerJS (Ge ...

  3. Python网络爬虫笔记(一):网页抓取方式和LXML示例

    (一)   三种网页抓取方法 1.    正则表达式: 模块使用C语言编写,速度快,但是很脆弱,可能网页更新后就不能用了. 2.    Beautiful Soup 模块使用Python编写,速度慢. ...

  4. PID控制器的应用:控制网络爬虫抓取速度

    一.初识PID控制器 冬天乡下人喜欢烤火取暖,常见的情形就是四人围着麻将桌,桌底放一盆碳火.有人觉得火不够大,那加点木炭吧,还不够,再加点.片刻之后,又觉得火太大,脚都快被烤熟了,那就取出一些木碳…… ...

  5. 实现织梦dedecms百度主动推送(实时)网页抓取

    做百度推广的时候,如何让百度快速收录呢,下面提供了三种方式,今天我们主要讲的是第一种. 如何选择链接提交方式 1.主动推送:最为快速的提交方式,推荐您将站点当天新产出链接立即通过此方式推送给百度,以保 ...

  6. 分享一个c#t的网页抓取类

    using System; using System.Collections.Generic; using System.Web; using System.Text; using System.Ne ...

  7. java网页抓取

    网页抓取就是,我们想要从别人的网站上得到我们想要的,也算是窃取了,有的网站就对这个网页抓取就做了限制,比如百度 直接进入正题 //要抓取的网页地址 String urlStr = "http ...

  8. 网页抓取:PHP实现网页爬虫方式小结

    来源:http://www.ido321.com/1158.html 抓取某一个网页中的内容,需要对DOM树进行解析,找到指定节点后,再抓取我们需要的内容,过程有点繁琐.LZ总结了几种常用的.易于实现 ...

  9. Java实现网页抓取的一个Demo

    这个小案例的话我是存放在我的github 上. 下面给出链接自己可以去看下,也可以直接下载源码.有具体的说明 <Java网页抓取>

随机推荐

  1. TZOJ 4359: Partition the beans (二分)

    描述 Given an N x N square grid (2 <= N <= 15) and each grid has some beans in it. You want to w ...

  2. 2019-3-27-win10-uwp-动画移动滑动条的滑块

    title author date CreateTime categories win10 uwp 动画移动滑动条的滑块 lindexi 2019-03-27 10:51:32 +0800 2019- ...

  3. javascript中数字的一些常规操作

    1,禁止输入 - (减号.负号) // html <input type="number" class="no-negative"> // js $ ...

  4. 括号序列问题 uva 1626 poj 1141【区间dp】

    首先考虑下面的问题:Code[VS] 3657 我们用以下规则定义一个合法的括号序列: (1)空序列是合法的 (2)假如S是一个合法的序列,则 (S) 和[S]都是合法的 (3)假如A 和 B 都是合 ...

  5. Spring Security入门篇——标签sec:authorize的使用

    Security框架可以精确控制页面的一个按钮.链接,它在页面上权限的控制实际上是通过它提供的标签来做到的 Security共有三类标签authorize authentication accessc ...

  6. P2P需集齐四大证照

    今后做P2P需集齐四大证照 比牌照制还严 2016-09-05 11:53:24 分类:热点观察 作者:汪祖刚 8月24日,P2P网贷监管细则在千呼万唤中始出来,整个行业内外的关注热度可谓史无前例.有 ...

  7. Python 基础02 基本数据类型

    简单的数据类型以及赋值 变量不需要声明 Python的变量不需要声明,你可以直接输入: >>> a = 10 那么你的内存里就有了一个变量a,它的值是10,它的类型是 integer ...

  8. 详解Python中内置的NotImplemented类型的用法

    它是什么? ? 1 2 >>> type(NotImplemented) <type 'NotImplementedType'> NotImplemented 是Pyth ...

  9. 我钟爱的HTML5和CSS3在线工具【转】

    我真的喜欢上了HTML5, CSS3, JavaScript编程,但是有一些代码还是需要一些辅助工具来做才行,例如,CSS3的Gradient渐变如果手写代码的话真的不爽,还有像animation动画 ...

  10. @codeforces - 418D@ Big Problems for Organizers

    目录 @description@ @solution@ @accepted code@ @details@ @description@ n 个点连成一棵树,经过每条边需要花费 1 个单位时间. 现给出 ...