先看看文档对于Scheduler的作用介绍
https://code4craft.gitbooks.io/webmagic-in-action/content/zh/posts/ch1-overview/architecture.html
之前我们也介绍过了,Scheduler主要负责爬虫的下一步爬取的规划,包括一些去重等功能。在主流程中也看到了Scheduler,现在来具体结合源码分析

源码

Scheduler是一个接口

public interface Scheduler {

/**
* add a url to fetch
*
* @param request
* @param task
*/
public void push(Request request, Task task);

/**
* get an url to crawl
*
* @param task the task of spider
* @return the url to crawl
*/
public Request poll(Task task);

}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
其主要的实现是DuplicateRemovedScheduler,使用模板模式定义了push的步骤。

public abstract class DuplicateRemovedScheduler implements Scheduler {

protected Logger logger = LoggerFactory.getLogger(getClass());

private DuplicateRemover duplicatedRemover = new HashSetDuplicateRemover();

public DuplicateRemover getDuplicateRemover() {
return duplicatedRemover;
}

public DuplicateRemovedScheduler setDuplicateRemover(DuplicateRemover duplicatedRemover) {
this.duplicatedRemover = duplicatedRemover;
return this;
}

@Override
public void push(Request request, Task task) {
logger.trace("get a candidate url {}", request.getUrl());
if (!duplicatedRemover.isDuplicate(request, task) || shouldReserved(request)) {
logger.debug("push to queue {}", request.getUrl());
pushWhenNoDuplicate(request, task);
}
}

protected boolean shouldReserved(Request request) {
return request.getExtra(Request.CYCLE_TRIED_TIMES) != null;
}

protected void pushWhenNoDuplicate(Request request, Task task) {

}
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
我们来看看负责去重的接口DuplicateRemover,其实现类有HashSetDuplicateRemover使用HashSet来去重,RedisScheduler接触Redis来去重和BloomFilterDuplicateRemover使用BloomFilter去重。默认使用HashSetDuplicateRemover

public class HashSetDuplicateRemover implements DuplicateRemover {

private Set<String> urls = Sets.newSetFromMap(new ConcurrentHashMap<String, Boolean>());

@Override
public boolean isDuplicate(Request request, Task task) {
return !urls.add(getUrl(request));
}

protected String getUrl(Request request) {
return request.getUrl();
}

@Override
public void resetDuplicateCheck(Task task) {
urls.clear();
}

@Override
public int getTotalRequestsCount(Task task) {
return urls.size();
}
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
DuplicateRemovedScheduler抽象类有四个具体实现类QueueScheduler,PriorityScheduler,FileCacheQueueScheduler和RedisScheduler。默认使用QueueScheduler

@ThreadSafe
public class QueueScheduler extends DuplicateRemovedScheduler implements MonitorableScheduler {

private BlockingQueue<Request> queue = new LinkedBlockingQueue<Request>();

@Override
public void pushWhenNoDuplicate(Request request, Task task) {
queue.add(request);
}

@Override
public synchronized Request poll(Task task) {
return queue.poll();
}

@Override
public int getLeftRequestsCount(Task task) {
return queue.size();
}

@Override
public int getTotalRequestsCount(Task task) {
return getDuplicateRemover().getTotalRequestsCount(task);
}
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
其内部是使用了一个LinkedBlockingQueue这个无界队列来存储Request,我们应该看到了@ThreadSafe注解,那我抛一个问题吧。Scheduler是否存在线程同步问题呢,如果存在那是如何解决的呢?
再来看下一个

@ThreadSafe
public class PriorityScheduler extends DuplicateRemovedScheduler implements MonitorableScheduler {

public static final int INITIAL_CAPACITY = 5;

private BlockingQueue<Request> noPriorityQueue = new LinkedBlockingQueue<Request>();

private PriorityBlockingQueue<Request> priorityQueuePlus = new PriorityBlockingQueue<Request>(INITIAL_CAPACITY, new Comparator<Request>() {
@Override
public int compare(Request o1, Request o2) {
return -NumberUtils.compareLong(o1.getPriority(), o2.getPriority());
}
});

private PriorityBlockingQueue<Request> priorityQueueMinus = new PriorityBlockingQueue<Request>(INITIAL_CAPACITY, new Comparator<Request>() {
@Override
public int compare(Request o1, Request o2) {
return -NumberUtils.compareLong(o1.getPriority(), o2.getPriority());
}
});

@Override
public void pushWhenNoDuplicate(Request request, Task task) {
if (request.getPriority() == 0) {
noPriorityQueue.add(request);
} else if (request.getPriority() > 0) {
priorityQueuePlus.put(request);
} else {
priorityQueueMinus.put(request);
}
}

@Override
public synchronized Request poll(Task task) {
Request poll = priorityQueuePlus.poll();
if (poll != null) {
return poll;
}
poll = noPriorityQueue.poll();
if (poll != null) {
return poll;
}
return priorityQueueMinus.poll();
}

@Override
public int getLeftRequestsCount(Task task) {
return noPriorityQueue.size();
}

@Override
public int getTotalRequestsCount(Task task) {
return getDuplicateRemover().getTotalRequestsCount(task);
}
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
我们看到了两个PriorityBlockingQueue和一个LinkedBlockingQueue。在poll的时候存在一个顺序。
继续

public class FileCacheQueueScheduler extends DuplicateRemovedScheduler implements MonitorableScheduler {

private String filePath = System.getProperty("java.io.tmpdir");

private String fileUrlAllName = ".urls.txt";

private Task task;

private String fileCursor = ".cursor.txt";

private PrintWriter fileUrlWriter;

private PrintWriter fileCursorWriter;

private AtomicInteger cursor = new AtomicInteger();

private AtomicBoolean inited = new AtomicBoolean(false);

private BlockingQueue<Request> queue;

private Set<String> urls;

public FileCacheQueueScheduler(String filePath) {
if (!filePath.endsWith("/") && !filePath.endsWith("\\")) {
filePath += "/";
}
this.filePath = filePath;
}

private void flush() {
fileUrlWriter.flush();
fileCursorWriter.flush();
}

private void init(Task task) {
this.task = task;
File file = new File(filePath);
if (!file.exists()) {
file.mkdirs();
}
readFile();
initWriter();
initFlushThread();
inited.set(true);
logger.info("init cache scheduler success");
}

private void initFlushThread() {
Executors.newScheduledThreadPool(1).scheduleAtFixedRate(new Runnable() {
@Override
public void run() {
flush();
}
}, 10, 10, TimeUnit.SECONDS);
}

private void initWriter() {
try {
fileUrlWriter = new PrintWriter(new FileWriter(getFileName(fileUrlAllName), true));
fileCursorWriter = new PrintWriter(new FileWriter(getFileName(fileCursor), false));
} catch (IOException e) {
throw new RuntimeException("init cache scheduler error", e);
}
}

private void readFile() {
try {
queue = new LinkedBlockingQueue<Request>();
urls = new LinkedHashSet<String>();
readCursorFile();
readUrlFile();
} catch (FileNotFoundException e) {
//init
logger.info("init cache file " + getFileName(fileUrlAllName));
} catch (IOException e) {
logger.error("init file error", e);
}
}

private void readUrlFile() throws IOException {
String line;
BufferedReader fileUrlReader = null;
try {
fileUrlReader = new BufferedReader(new FileReader(getFileName(fileUrlAllName)));
int lineReaded = 0;
while ((line = fileUrlReader.readLine()) != null) {
urls.add(line.trim());
lineReaded++;
if (lineReaded > cursor.get()) {
queue.add(new Request(line));
}
}
} finally {
if (fileUrlReader != null) {
IOUtils.closeQuietly(fileUrlReader);
}
}
}

private void readCursorFile() throws IOException {
BufferedReader fileCursorReader = null;
try {
fileCursorReader = new BufferedReader(new FileReader(getFileName(fileCursor)));
String line;
//read the last number
while ((line = fileCursorReader.readLine()) != null) {
cursor = new AtomicInteger(NumberUtils.toInt(line));
}
} finally {
if (fileCursorReader != null) {
IOUtils.closeQuietly(fileCursorReader);
}
}
}

private String getFileName(String filename) {
return filePath + task.getUUID() + filename;
}

@Override
protected void pushWhenNoDuplicate(Request request, Task task) {
if (!inited.get()) {
init(task);
}
queue.add(request);
fileUrlWriter.println(request.getUrl());
}

@Override
public synchronized Request poll(Task task) {
if (!inited.get()) {
init(task);
}
fileCursorWriter.println(cursor.incrementAndGet());
return queue.poll();
}

@Override
public int getLeftRequestsCount(Task task) {
return queue.size();
}

@Override
public int getTotalRequestsCount(Task task) {
return getDuplicateRemover().getTotalRequestsCount(task);
}
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
会将url和已经执行的url指针存在两个文件中,创建了scheduleExecutor定期的flush,所有内存中的url还是存在BlockingQueue中。
RedisScheduler不是很懂。。目前还没有接触过:)

使用

具体使用过程还是需要自己根据自己的爬虫特点然后选择特定的Scheduler及DuplicateRemover,只有懂得其原理才能选择最合适的组件。
WebMagic组件都可以自行设置这点真的太棒了~

Scheduler的更多相关文章

  1. AndroidStudio3.0无法打开Android Device Monitor的解决办法(An error has occurred on Android Device Monitor)

    ---恢复内容开始--- 打开monitor时出现 An error has occurred. See the log file... ------------------------------- ...

  2. 从scheduler is shutted down看程序员的英文水平

    我有个windows服务程序,今天重点在测试系统逻辑.部署后,在看系统日志时,不经意看到一行:scheduler is shutted down. 2016-12-29 09:40:24.175 {& ...

  3. Spring 4 + Quartz 2.2.1 Scheduler Integration Example

    In this post we will see how to schedule Jobs using Quartz Scheduler with Spring. Spring provides co ...

  4. VMware中CPU分配不合理以及License限制引起的SQL Scheduler不能用于查询处理

    有一台SQL Server(SQL Server 2014 标准版)服务器中的scheduler_count与cpu_count不一致,如下截图所示: SELECT  cpu_count ,      ...

  5. Windows Task Scheduler Fails With Error Code 2147943785

    Problem: Windows Task Scheduler Fails With Error Code 2147943785 Solution: This is usually due to a ...

  6. Fair Scheduler 队列设置经验总结

    Fair Scheduler 队列设置经验总结 由于公司的hadoop集群的计算资源不是很充足,需要开启yarn资源队列的资源抢占.在使用过程中,才明白资源抢占的一些特点.在这里总结一下. 只有一个队 ...

  7. Fair Scheduler中的Delay Schedule分析

    延迟调度的主要目的是提高数据本地性(data locality),减少数据在网络中的传输.对于那些输入数据不在本地的MapTask,调度器将会延迟调度他们,而把slot分配给那些具备本地性的MapTa ...

  8. 【Cocos2d-x 3.x】 调度器Scheduler类源码分析

    非个人的全部理解,部分摘自cocos官网教程,感谢cocos官网. 在<CCScheduler.h>头文件中,定义了关于调度器的五个类:Timer,TimerTargetSelector, ...

  9. Linux IO Scheduler(Linux IO 调度器)

    每个块设备或者块设备的分区,都对应有自身的请求队列(request_queue),而每个请求队列都可以选择一个I/O调度器来协调所递交的request.I/O调度器的基本目的是将请求按照它们对应在块设 ...

  10. Pair Project: Elevator Scheduler [电梯调度算法的实现和测试]

    作业提交时间:10月9日上课前. Design and implement an Elevator Scheduler to aim for both correctness and performa ...

随机推荐

  1. iOS 中的 Deferred Deep Linking(延迟深度链接)

    http://www.cocoachina.com/ios/20160105/14871.html Deep Linking 其实 deep linking 并不是一个新名词,在 web 开发领域,区 ...

  2. js实现动态计数效果

    下面附有数字图片和数字边框图 <!DOCTYPE html> <html> <head> <meta charset="utf-8"> ...

  3. CentOS下重启uwsgi

    使用Django开发项目,每次修改内容无法刷新,重启nginx也无效,每次都重启主机, 网上搜索很多资料,发现可以重启uwsgi来解决问题,但是发现uwsgi重启每个人都不一样,很多人写了脚本重启 我 ...

  4. 51nod1196 字符串的数量

    用N个不同的字符(编号1 - N),组成一个字符串,有如下要求:(1) 对于编号为i的字符,如果2 * i > n,则该字符可以作为结尾字符.如果不作为结尾字符而是中间的字符,则该字符后面可以接 ...

  5. 移动项目到centos中运行报错:failed to open stream: Permission denied

    这是文件夹没有读写权限的错误: (注意:TP5.0权限给runtime文件夹就行了,官方文档在安装tp5的方法中有介绍到权限问题) 在需要赋予权限的文件夹的前一级输入: chmod -R 文件夹名字

  6. Sublime text2 常用插件

    很早就安装了这款软件,因为听说很NB.但是一直都是习惯用vim, 所以一直没有去学习它, 现在用用感觉确实很不错,所以找了些插件, 参考网址: http://www.hphq.net/Marketin ...

  7. linux awk命令详解,使用system来内嵌系统命令, awk合并两列

    linux awk命令详解 简介 awk是一个强大的文本分析工具,相对于grep的查找,sed的编辑,awk在其对数据分析并生成报告时,显得尤为强大.简单来说awk就是把文件逐行的读入,以空格为默认分 ...

  8. 【JZOJ4877】【NOIP2016提高A组集训第10场11.8】力场护盾

    题目描述 ZMiG成功粉碎了707的基因突变计划,为了人类的安全,他决定向707的科学实验室发起进攻!707并没有想到有人敢攻击她的实验室,一时间不知所措,决定牺牲电力来换取自己实验室的平安. 在实验 ...

  9. CSDN编程挑战——《-3+1》

    版权声明:本文为博主原创文章,未经博主同意不得转载. https://blog.csdn.net/user_longling/article/details/24674033 -3+1 题目详情: 有 ...

  10. Person Re-identification 系列论文笔记(四):Re-ID done right: towards good practices for person re-identification

    Re-ID done right: towards good practices for person re-identification Almazan J, Gajic B, Murray N, ...