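The rest of Spider: CountableThreadPool

In the previous post on Spider you will have noticed the threadPool variable. It is the thread pool that Spider uses, and its code looks like this: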

public class CountableThreadPool {

    private int threadNum;

    private AtomicInteger threadAlive = new AtomicInteger();

    private ReentrantLock reentrantLock = new ReentrantLock();

    private Condition condition = reentrantLock.newCondition();

    public CountableThreadPool(int threadNum) {
        this.threadNum = threadNum;
        this.executorService = Executors.newFixedThreadPool(threadNum);
    }

    public CountableThreadPool(int threadNum, ExecutorService executorService) {
        this.threadNum = threadNum;
        this.executorService = executorService;
    }

    public void setExecutorService(ExecutorService executorService) {
        this.executorService = executorService;
    }

    public int getThreadAlive() {
        return threadAlive.get();
    }

    public int getThreadNum() {
        return threadNum;
    }

    private ExecutorService executorService;

    public void execute(final Runnable runnable) {

        if (threadAlive.get() >= threadNum) {
            try {
                reentrantLock.lock();
                while (threadAlive.get() >= threadNum) {
                    try {
                        condition.await();
                    } catch (InterruptedException e) {
                    }
                }
            } finally {
                reentrantLock.unlock();
            }
        }
        threadAlive.incrementAndGet();
        executorService.execute(new Runnable() {
            @Override
            public void run() {
                try {
                    runnable.run();
                } finally {
                    try {
                        reentrantLock.lock();
                        threadAlive.decrementAndGet();
                        condition.signal();
                    } finally {
                        reentrantLock.unlock();
                    }
                }
            }
        });
    }

    public boolean isShutdown() {
        return executorService.isShutdown();
    }

    public void shutdown() {
        executorService.shutdown();
    }

}
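CountableThreadPool lets you either supply your own ExecutorService or have one created by default. If you are not yet comfortable with Java thread pools, brush up on them first. The three most important fields are: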

private AtomicInteger threadAlive = new AtomicInteger();

private ReentrantLock reentrantLock = new ReentrantLock();

private Condition condition = reentrantLock.newCondition();
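threadAlive is the number of tasks currently running. reentrantLock is a reentrant mutual-exclusion lock (not a spin lock: threads waiting for it are parked rather than busy-looping) that synchronizes operations on the condition variable. condition is the condition variable used to wake up blocked threads.

The key method: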

public void execute(final Runnable runnable) {

    if (threadAlive.get() >= threadNum) {
        try {
            reentrantLock.lock();
            while (threadAlive.get() >= threadNum) {
                try {
                    condition.await();
                } catch (InterruptedException e) {
                }
            }
        } finally {
            reentrantLock.unlock();
        }
    }
    threadAlive.incrementAndGet();
    executorService.execute(new Runnable() {
        @Override
        public void run() {
            try {
                runnable.run();
            } finally {
                try {
                    reentrantLock.lock();
                    threadAlive.decrementAndGet();
                    condition.signal();
                } finally {
                    reentrantLock.unlock();
                }
            }
        }
    });
}
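
To try the blocking behaviour of execute in isolation, here is a minimal sketch. It is not WebMagic's code: BoundedSubmitDemo, runDemo and the other names are made up for illustration. It also differs from the original in two ways that are my own assumptions about what makes a clearer demo: the counter is a plain int that is checked and incremented while holding the lock, and the wait uses awaitUninterruptibly so that no InterruptedException has to be handled.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class BoundedSubmitDemo {

    private final int limit;
    private int running = 0; // guarded by lock
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition slotFreed = lock.newCondition();
    private final ExecutorService executor;

    BoundedSubmitDemo(int limit) {
        this.limit = limit;
        this.executor = Executors.newFixedThreadPool(limit);
    }

    /** Blocks the caller while 'limit' tasks are in flight, then submits the task. */
    void execute(Runnable task) {
        lock.lock();
        try {
            while (running >= limit) {
                slotFreed.awaitUninterruptibly(); // releases the lock while waiting
            }
            running++;
        } finally {
            lock.unlock();
        }
        executor.execute(() -> {
            try {
                task.run();
            } finally {
                lock.lock();
                try {
                    running--;
                    slotFreed.signal(); // wake one blocked submitter
                } finally {
                    lock.unlock();
                }
            }
        });
    }

    void shutdownAndWait() {
        executor.shutdown();
        try {
            executor.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Runs 'tasks' short tasks and returns the highest concurrency observed. */
    static int runDemo(int limit, int tasks) {
        BoundedSubmitDemo pool = new BoundedSubmitDemo(limit);
        AtomicInteger current = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                int now = current.incrementAndGet();
                peak.accumulateAndGet(now, Math::max);
                try {
                    Thread.sleep(20);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                current.decrementAndGet();
            });
        }
        pool.shutdownAndWait();
        return peak.get();
    }

    public static void main(String[] args) {
        System.out.println("peak concurrency: " + runDemo(2, 8));
    }
}
```

With a limit of 2, the loop in runDemo stalls on the third call to execute until one of the first two tasks finishes, so the observed peak never exceeds 2.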
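The threadAlive counter limits how many tasks are in flight: once it reaches threadNum, the caller of execute blocks until a running task finishes and signals the condition. Why block at all? The pool created by default, Executors.newFixedThreadPool(threadNum), never runs more than threadNum threads, but it does not throw when more tasks than that are submitted; the extra tasks simply wait in an unbounded queue. Blocking the submitter instead presumably keeps that queue from filling up with pending requests and lets threadAlive reflect the work actually in progress. A RejectedExecutionException only arises when the ExecutorService you pass in has a bounded queue with the default rejection policy, or has already been shut down. These posts give more background on thread pools: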
http://uule.iteye.com/blog/1123185
http://blog.csdn.net/sd0902/article/details/8395677
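
To see how a thread pool reacts when more tasks are submitted than it has threads, the following sketch compares two standard JDK pools. RejectionDemo and its method names are hypothetical; only the java.util.concurrent classes are real. Every task blocks on a latch until all submissions are done, so the counts do not depend on timing.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class RejectionDemo {

    /** Submits 'tasks' blocking tasks and returns how many submissions were rejected. */
    static int countRejections(ExecutorService executor, int tasks) {
        CountDownLatch release = new CountDownLatch(1);
        int rejected = 0;
        for (int i = 0; i < tasks; i++) {
            try {
                executor.execute(() -> {
                    try {
                        release.await(); // hold the worker until all submissions are done
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            } catch (RejectedExecutionException e) {
                rejected++;
            }
        }
        release.countDown();
        executor.shutdown();
        return rejected;
    }

    /** newFixedThreadPool: 2 threads plus an unbounded queue, so nothing is rejected. */
    static int fixedPoolRejections() {
        return countRejections(Executors.newFixedThreadPool(2), 10);
    }

    /** 2 threads plus a queue of capacity 1 and the default AbortPolicy. */
    static int boundedQueueRejections() {
        ExecutorService bounded = new ThreadPoolExecutor(
                2, 2, 0L, TimeUnit.MILLISECONDS, new ArrayBlockingQueue<>(1));
        return countRejections(bounded, 10);
    }

    public static void main(String[] args) {
        System.out.println("fixed pool rejected: " + fixedPoolRejections());       // 0
        System.out.println("bounded queue rejected: " + boundedQueueRejections()); // 7
    }
}
```

The fixed pool accepts all ten tasks and queues eight of them. The bounded pool accepts three (two running, one queued) and rejects the remaining seven.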

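The rest of Spider: SpiderMonitor

A quick note first

SpiderMonitor is responsible for monitoring the running state of a Spider. I recommend reading the official documentation carefully: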
http://webmagic.io/docs/zh/posts/ch4-basic-page-processor/monitor.html
http://my.oschina.net/xpbug/blog/221547
If this part is of no use to you, feel free to skip it; I have not needed it myself.

Let's begin

In the Spider code we saw this:

public void run() {
    try {
        processRequest(requestFinal);
        onSuccess(requestFinal);
    } catch (Exception e) {
        onError(requestFinal);
        logger.error("process request " + requestFinal + " error", e);
    } finally {
        pageCount.incrementAndGet();
        signalNewUrl();
    }
}
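Anyone who has read enough code will recognize from the names onSuccess(requestFinal) and onError(requestFinal) that these are callbacks defined by an interface. So where is the callback implemented? In an inner class of SpiderMonitor: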

public class SpiderMonitor {

    private static SpiderMonitor INSTANCE = new SpiderMonitor();

    private AtomicBoolean started = new AtomicBoolean(false);

    private Logger logger = LoggerFactory.getLogger(getClass());

    private MBeanServer mbeanServer;

    private String jmxServerName;

    private List<SpiderStatusMXBean> spiderStatuses = new ArrayList<SpiderStatusMXBean>();

    protected SpiderMonitor() {
        jmxServerName = "WebMagic";
        mbeanServer = ManagementFactory.getPlatformMBeanServer();
    }

    /**
     * Register spider for monitor.
     *
     * @param spiders spiders
     * @return this
     */
    public synchronized SpiderMonitor register(Spider... spiders) throws JMException {
        for (Spider spider : spiders) {
            MonitorSpiderListener monitorSpiderListener = new MonitorSpiderListener();
            if (spider.getSpiderListeners() == null) {
                List<SpiderListener> spiderListeners = new ArrayList<SpiderListener>();
                spiderListeners.add(monitorSpiderListener);
                spider.setSpiderListeners(spiderListeners);
            } else {
                spider.getSpiderListeners().add(monitorSpiderListener);
            }
            SpiderStatusMXBean spiderStatusMBean = getSpiderStatusMBean(spider, monitorSpiderListener);
            registerMBean(spiderStatusMBean);
            spiderStatuses.add(spiderStatusMBean);
        }
        return this;
    }

    protected SpiderStatusMXBean getSpiderStatusMBean(Spider spider, MonitorSpiderListener monitorSpiderListener) {
        return new SpiderStatus(spider, monitorSpiderListener);
    }

    public static SpiderMonitor instance() {
        return INSTANCE;
    }

    public class MonitorSpiderListener implements SpiderListener {

        private final AtomicInteger successCount = new AtomicInteger(0);

        private final AtomicInteger errorCount = new AtomicInteger(0);

        private List<String> errorUrls = Collections.synchronizedList(new ArrayList<String>());

        @Override
        public void onSuccess(Request request) {
            successCount.incrementAndGet();
        }

        @Override
        public void onError(Request request) {
            errorUrls.add(request.getUrl());
            errorCount.incrementAndGet();
        }

        public AtomicInteger getSuccessCount() {
            return successCount;
        }

        public AtomicInteger getErrorCount() {
            return errorCount;
        }

        public List<String> getErrorUrls() {
            return errorUrls;
        }
    }

    protected void registerMBean(SpiderStatusMXBean spiderStatus) throws MalformedObjectNameException, InstanceAlreadyExistsException, MBeanRegistrationException, NotCompliantMBeanException {
        ObjectName objName = new ObjectName(jmxServerName + ":name=" + spiderStatus.getName());
        mbeanServer.registerMBean(spiderStatus, objName);
    }

}
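As you can see, onSuccess and onError simply record a few statistics, mainly for monitoring. If you want your own logic to run when a request succeeds or fails, you can implement this interface yourself: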

public interface SpiderListener {

    public void onSuccess(Request request);

    public void onError(Request request);
}
If interface callbacks in Java are unfamiliar to you, I recommend the first chapter of Head First Design Patterns, which covers the Strategy pattern.
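
As an illustration of the callback idea, here is a self-contained sketch of a custom listener that remembers failed URLs so they could be retried later. Request and SpiderListener below are simplified stand-ins that only mirror the shape of the WebMagic types shown in this post, and FailedUrlCollector, process and runDemo are hypothetical names. In a real project you would implement WebMagic's own SpiderListener and add it to the spider's listener list, as SpiderMonitor.register does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class RetryListenerDemo {

    /** Stand-in for WebMagic's Request, reduced to the one accessor used here. */
    static class Request {
        private final String url;
        Request(String url) { this.url = url; }
        String getUrl() { return url; }
    }

    /** Same two callbacks as the SpiderListener interface in this post. */
    interface SpiderListener {
        void onSuccess(Request request);
        void onError(Request request);
    }

    /** Hypothetical listener: collects the URLs of failed requests. */
    static class FailedUrlCollector implements SpiderListener {
        private final List<String> failedUrls = new CopyOnWriteArrayList<>();

        @Override
        public void onSuccess(Request request) {
            // nothing to record for successful requests
        }

        @Override
        public void onError(Request request) {
            failedUrls.add(request.getUrl());
        }

        List<String> getFailedUrls() {
            return failedUrls;
        }
    }

    /** Mimics the try/catch in Spider's worker: run the work, then notify listeners. */
    static void process(Request request, Runnable work, List<SpiderListener> listeners) {
        try {
            work.run();
            for (SpiderListener listener : listeners) {
                listener.onSuccess(request);
            }
        } catch (Exception e) {
            for (SpiderListener listener : listeners) {
                listener.onError(request);
            }
        }
    }

    /** Processes one succeeding and one failing request; returns the failed URLs. */
    static List<String> runDemo() {
        FailedUrlCollector collector = new FailedUrlCollector();
        List<SpiderListener> listeners = new ArrayList<>();
        listeners.add(collector);
        process(new Request("http://example.com/ok"), () -> { }, listeners);
        process(new Request("http://example.com/bad"),
                () -> { throw new IllegalStateException("download failed"); }, listeners);
        return collector.getFailedUrls();
    }

    public static void main(String[] args) {
        System.out.println(runDemo()); // [http://example.com/bad]
    }
}
```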

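Closing remarks

This post was a simple one: it fills in the small remaining pieces of Spider. Upcoming posts will cover WebMagic's four major components. If you enjoyed it, your support is much appreciated.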