The rest of Spider: CountableThreadPool

In the previous post on Spider you surely noticed the threadpool variable. It is the thread pool used inside Spider. Here is the code:

public class CountableThreadPool {

    private int threadNum;

    private AtomicInteger threadAlive = new AtomicInteger();

    private ReentrantLock reentrantLock = new ReentrantLock();

    private Condition condition = reentrantLock.newCondition();

    public CountableThreadPool(int threadNum) {
        this.threadNum = threadNum;
        this.executorService = Executors.newFixedThreadPool(threadNum);
    }

    public CountableThreadPool(int threadNum, ExecutorService executorService) {
        this.threadNum = threadNum;
        this.executorService = executorService;
    }

    public void setExecutorService(ExecutorService executorService) {
        this.executorService = executorService;
    }

    public int getThreadAlive() {
        return threadAlive.get();
    }

    public int getThreadNum() {
        return threadNum;
    }

    private ExecutorService executorService;

    public void execute(final Runnable runnable) {

        if (threadAlive.get() >= threadNum) {
            try {
                reentrantLock.lock();
                while (threadAlive.get() >= threadNum) {
                    try {
                        condition.await();
                    } catch (InterruptedException e) {
                    }
                }
            } finally {
                reentrantLock.unlock();
            }
        }
        threadAlive.incrementAndGet();
        executorService.execute(new Runnable() {
            @Override
            public void run() {
                try {
                    runnable.run();
                } finally {
                    try {
                        reentrantLock.lock();
                        threadAlive.decrementAndGet();
                        condition.signal();
                    } finally {
                        reentrantLock.unlock();
                    }
                }
            }
        });
    }

    public boolean isShutdown() {
        return executorService.isShutdown();
    }

    public void shutdown() {
        executorService.shutdown();
    }

}
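CountableThreadPool lets you either pass in your own ExecutorService or have one created by default. If you are not yet comfortable with Java thread pools, brush up on them first. The three most important fields are: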

private AtomicInteger threadAlive = new AtomicInteger();

private ReentrantLock reentrantLock = new ReentrantLock();

private Condition condition = reentrantLock.newCondition();
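threadAlive counts the tasks that are currently executing. reentrantLock is a reentrant mutual-exclusion lock (not a spin lock: threads that cannot acquire it are parked rather than busy-waiting), and it guards operations on the condition variable. condition is the condition variable used to wake up blocked threads.

The key method: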

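public void execute(final Runnable runnable) {

    // Fast path: only take the lock when the pool looks full.
    if (threadAlive.get() >= threadNum) {
        try {
            reentrantLock.lock();
            // Re-check in a loop, because await() may wake up spuriously.
            while (threadAlive.get() >= threadNum) {
                try {
                    condition.await();
                } catch (InterruptedException e) {
                    // The interrupt is swallowed; the loop simply re-checks the count.
                }
            }
        } finally {
            reentrantLock.unlock();
        }
    }
    // The check above and this increment are separate steps, so the limit
    // is only strict when a single thread submits tasks.
    threadAlive.incrementAndGet();
    executorService.execute(new Runnable() {
        @Override
        public void run() {
            try {
                runnable.run();
            } finally {
                try {
                    reentrantLock.lock();
                    threadAlive.decrementAndGet();
                    // Wake one submitter blocked in await().
                    condition.signal();
                } finally {
                    reentrantLock.unlock();
                }
            }
        }
    });
}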
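The threadAlive counter limits the number of active tasks: once it reaches the configured thread count, the caller of execute blocks until a running task finishes. Why block? The pool is a fixed-size one, and the maximum number of threads that the default newFixedThreadPool creates is exactly the argument passed in. Note that such a pool does not throw when more tasks are submitted than it has threads: its work queue is unbounded, so the extra tasks silently pile up in the queue. A RejectedExecutionException on overflow only occurs with a ThreadPoolExecutor that has a bounded queue and the default rejection policy, which may be the case for a custom ExecutorService passed to the second constructor. Blocking the submitter avoids both outcomes and keeps threadAlive in step with the tasks that are really running. These posts give more background on thread pools: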
http://uule.iteye.com/blog/1123185
http://blog.csdn.net/sd0902/article/details/8395677
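
To make the queueing behaviour concrete, here is a minimal sketch using only the JDK. The class name FixedPoolQueueDemo and the method startedWhileBlocked are made up for this illustration and are not part of WebMagic. It submits more blocking tasks than the pool has threads and reports how many actually started.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class, not part of WebMagic.
final class FixedPoolQueueDemo {

    /**
     * Submits `tasks` blocking jobs to a fixed pool of `threads` threads and
     * returns how many of them have started while all of them are still blocked.
     */
    static int startedWhileBlocked(int threads, int tasks) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch release = new CountDownLatch(1);
        CountDownLatch workersBusy = new CountDownLatch(Math.min(threads, tasks));
        AtomicInteger started = new AtomicInteger();
        try {
            for (int i = 0; i < tasks; i++) {
                // No exception here, even when i >= threads: the task is queued.
                pool.execute(() -> {
                    started.incrementAndGet();
                    workersBusy.countDown();
                    try {
                        release.await(); // keep the worker thread occupied
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            // Wait until every worker thread has picked up a task.
            workersBusy.await(5, TimeUnit.SECONDS);
            return started.get();
        } finally {
            release.countDown(); // let the running and queued tasks finish
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // 5 tasks on 2 threads: 2 run, 3 wait in the queue, nothing is rejected.
        System.out.println(startedWhileBlocked(2, 5)); // prints 2
    }
}
```

CountableThreadPool moves that waiting from the pool's internal queue to the submitting thread, which is why execute blocks instead of returning immediately.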

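The rest of Spider: SpiderMonitor

A quick note first

SpiderMonitor is responsible for monitoring the running state of a Spider. I recommend reading the official documentation carefully:
http://webmagic.io/docs/zh/posts/ch4-basic-page-processor/monitor.html
http://my.oschina.net/xpbug/blog/221547
If this part is of no use to you, feel free to skip it. I never used it myself.

Let's begin

In the Spider code we saw this: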

public void run() {
    try {
        processRequest(requestFinal);
        onSuccess(requestFinal);
    } catch (Exception e) {
        onError(requestFinal);
        logger.error("process request " + requestFinal + " error", e);
    } finally {
        pageCount.incrementAndGet();
        signalNewUrl();
    }
}
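Anyone who has read enough code will recognize from the method names onSuccess(requestFinal) and onError(requestFinal) that these calls go through a callback interface. So where is the callback implemented? In an inner class of SpiderMonitor: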

public class SpiderMonitor {

    private static SpiderMonitor INSTANCE = new SpiderMonitor();

    private AtomicBoolean started = new AtomicBoolean(false);

    private Logger logger = LoggerFactory.getLogger(getClass());

    private MBeanServer mbeanServer;

    private String jmxServerName;

    private List<SpiderStatusMXBean> spiderStatuses = new ArrayList<SpiderStatusMXBean>();

    protected SpiderMonitor() {
        jmxServerName = "WebMagic";
        mbeanServer = ManagementFactory.getPlatformMBeanServer();
    }

    /**
     * Register spider for monitor.
     *
     * @param spiders spiders
     * @return this
     */
    public synchronized SpiderMonitor register(Spider... spiders) throws JMException {
        for (Spider spider : spiders) {
            MonitorSpiderListener monitorSpiderListener = new MonitorSpiderListener();
            if (spider.getSpiderListeners() == null) {
                List<SpiderListener> spiderListeners = new ArrayList<SpiderListener>();
                spiderListeners.add(monitorSpiderListener);
                spider.setSpiderListeners(spiderListeners);
            } else {
                spider.getSpiderListeners().add(monitorSpiderListener);
            }
            SpiderStatusMXBean spiderStatusMBean = getSpiderStatusMBean(spider, monitorSpiderListener);
            registerMBean(spiderStatusMBean);
            spiderStatuses.add(spiderStatusMBean);
        }
        return this;
    }

    protected SpiderStatusMXBean getSpiderStatusMBean(Spider spider, MonitorSpiderListener monitorSpiderListener) {
        return new SpiderStatus(spider, monitorSpiderListener);
    }

    public static SpiderMonitor instance() {
        return INSTANCE;
    }

    public class MonitorSpiderListener implements SpiderListener {

        private final AtomicInteger successCount = new AtomicInteger(0);

        private final AtomicInteger errorCount = new AtomicInteger(0);

        private List<String> errorUrls = Collections.synchronizedList(new ArrayList<String>());

        @Override
        public void onSuccess(Request request) {
            successCount.incrementAndGet();
        }

        @Override
        public void onError(Request request) {
            errorUrls.add(request.getUrl());
            errorCount.incrementAndGet();
        }

        public AtomicInteger getSuccessCount() {
            return successCount;
        }

        public AtomicInteger getErrorCount() {
            return errorCount;
        }

        public List<String> getErrorUrls() {
            return errorUrls;
        }
    }

    protected void registerMBean(SpiderStatusMXBean spiderStatus) throws MalformedObjectNameException, InstanceAlreadyExistsException, MBeanRegistrationException, NotCompliantMBeanException {
        ObjectName objName = new ObjectName(jmxServerName + ":name=" + spiderStatus.getName());
        mbeanServer.registerMBean(spiderStatus, objName);
    }

}
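
The registerMBean step at the end of the class relies on standard JMX. As a rough sketch of that mechanism using only the JDK, the example below registers a small status object with the platform MBeanServer and reads one attribute back. The names JmxRegisterDemo, CrawlStatusMXBean, CrawlStatus and the "Demo" domain are hypothetical stand-ins for this illustration, not WebMagic classes.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicInteger;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical demo class, not part of WebMagic.
final class JmxRegisterDemo {

    /** Management interface; the MXBean suffix tells JMX to treat it as an MXBean. */
    public interface CrawlStatusMXBean {
        String getName();
        int getSuccessCount();
    }

    /** Stand-in for a status object such as SpiderStatus. */
    static final class CrawlStatus implements CrawlStatusMXBean {
        private final String name;
        private final AtomicInteger successCount = new AtomicInteger();

        CrawlStatus(String name) {
            this.name = name;
        }

        void recordSuccess() {
            successCount.incrementAndGet();
        }

        @Override
        public String getName() {
            return name;
        }

        @Override
        public int getSuccessCount() {
            return successCount.get();
        }
    }

    /** Registers a status bean, reads its SuccessCount attribute through JMX, then unregisters it. */
    static int registerAndRead(String name, int successes) throws JMException {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        CrawlStatus status = new CrawlStatus(name);
        for (int i = 0; i < successes; i++) {
            status.recordSuccess();
        }
        // Same naming scheme as in the quoted code: "<domain>:name=<bean name>".
        ObjectName objName = new ObjectName("Demo:name=" + name);
        server.registerMBean(status, objName);
        try {
            // The getter getSuccessCount() is exposed as the attribute "SuccessCount".
            return (Integer) server.getAttribute(objName, "SuccessCount");
        } finally {
            server.unregisterMBean(objName);
        }
    }

    public static void main(String[] args) throws JMException {
        System.out.println(registerAndRead("demo-spider", 3));
    }
}
```

While a bean is registered, a JMX client such as JConsole can browse it under its domain, which is how the monitoring data becomes visible from outside the process.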
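As we can see, onSuccess and onError record a few counters, mainly for monitoring. If you want to run your own logic when a request succeeds or fails, you can implement this interface yourself: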

public interface SpiderListener {

    public void onSuccess(Request request);

    public void onError(Request request);
}
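
As an illustration of the callback idea, here is a self-contained sketch of a custom listener that collects failed URLs. Request and SpiderListener below are simplified stand-ins with the same shape as the types quoted above, and RetryListListener, process and demo are hypothetical names invented for this example. The dispatch loop only imitates what Spider does after handling a request.

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical demo class, not part of WebMagic.
final class ListenerCallbackDemo {

    /** Stand-in for WebMagic's Request; only the URL is modeled. */
    static final class Request {
        private final String url;

        Request(String url) {
            this.url = url;
        }

        String getUrl() {
            return url;
        }
    }

    /** Same shape as the SpiderListener interface quoted above. */
    interface SpiderListener {
        void onSuccess(Request request);

        void onError(Request request);
    }

    /** Custom listener: remembers failed URLs, for example to retry them later. */
    static final class RetryListListener implements SpiderListener {
        private final List<String> failedUrls = new CopyOnWriteArrayList<>();

        @Override
        public void onSuccess(Request request) {
            // Nothing to record on success.
        }

        @Override
        public void onError(Request request) {
            failedUrls.add(request.getUrl());
        }

        List<String> getFailedUrls() {
            return failedUrls;
        }
    }

    /** Imitates the caller side: notify every listener of the outcome. */
    static void process(Request request, boolean failed, List<SpiderListener> listeners) {
        for (SpiderListener listener : listeners) {
            if (failed) {
                listener.onError(request);
            } else {
                listener.onSuccess(request);
            }
        }
    }

    static List<String> demo() {
        RetryListListener listener = new RetryListListener();
        List<SpiderListener> listeners = Collections.<SpiderListener>singletonList(listener);
        process(new Request("http://example.com/ok"), false, listeners);
        process(new Request("http://example.com/broken"), true, listeners);
        return listener.getFailedUrls();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [http://example.com/broken]
    }
}
```

With the real library, such a listener would be attached the way SpiderMonitor.register does it in the code above, through spider.getSpiderListeners() or spider.setSpiderListeners(...).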
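If you are not familiar with interface callbacks in Java, I recommend the first chapter of Head First Design Patterns, which covers the Strategy pattern.

Closing notes

This post was a fairly simple one. It mainly filled in the smaller pieces of Spider. Later posts will cover Spider's four major components. If you like the series, your support is much appreciated~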
