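First, let's look at how the documentation describes the role of the Scheduler:
https://code4craft.gitbooks.io/webmagic-in-action/content/zh/posts/ch1-overview/architecture.html
As introduced earlier, the Scheduler is mainly responsible for planning what the spider crawls next, which includes features such as deduplication. We already saw the Scheduler in the main flow; now let's analyze it in detail against the source code.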

Source code

Scheduler is an interface:

public interface Scheduler {

    /**
     * add a url to fetch
     *
     * @param request
     * @param task
     */
    public void push(Request request, Task task);

    /**
     * get an url to crawl
     *
     * @param task the task of spider
     * @return the url to crawl
     */
    public Request poll(Task task);

}
Its main implementation is DuplicateRemovedScheduler, an abstract class that uses the template method pattern to define the steps of push:

public abstract class DuplicateRemovedScheduler implements Scheduler {

    protected Logger logger = LoggerFactory.getLogger(getClass());

    private DuplicateRemover duplicatedRemover = new HashSetDuplicateRemover();

    public DuplicateRemover getDuplicateRemover() {
        return duplicatedRemover;
    }

    public DuplicateRemovedScheduler setDuplicateRemover(DuplicateRemover duplicatedRemover) {
        this.duplicatedRemover = duplicatedRemover;
        return this;
    }

    @Override
    public void push(Request request, Task task) {
        logger.trace("get a candidate url {}", request.getUrl());
        if (!duplicatedRemover.isDuplicate(request, task) || shouldReserved(request)) {
            logger.debug("push to queue {}", request.getUrl());
            pushWhenNoDuplicate(request, task);
        }
    }

    protected boolean shouldReserved(Request request) {
        return request.getExtra(Request.CYCLE_TRIED_TIMES) != null;
    }

    protected void pushWhenNoDuplicate(Request request, Task task) {

    }
}
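To make the template method idea concrete, here is a minimal sketch that does not depend on WebMagic. The names TemplateDemo, DedupTemplate and FifoScheduler are hypothetical stand-ins, URLs are plain strings, and the retry exception handled by shouldReserved is left out on purpose.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Hypothetical, simplified stand-ins for illustration; this is not WebMagic code.
class TemplateDemo {

    // Plays the role of DuplicateRemovedScheduler.
    static abstract class DedupTemplate {
        private final Set<String> seen = new HashSet<>();

        // Template method: the duplicate check is fixed here...
        public final void push(String url) {
            if (seen.add(url)) {          // add() returns false for a URL seen before
                pushWhenNoDuplicate(url);
            }
        }

        // ...and only the storage step is left to subclasses.
        protected abstract void pushWhenNoDuplicate(String url);
    }

    // Plays the role of QueueScheduler: plain FIFO storage.
    static class FifoScheduler extends DedupTemplate {
        private final Queue<String> queue = new ArrayDeque<>();

        @Override
        protected void pushWhenNoDuplicate(String url) {
            queue.add(url);
        }

        public String poll() {
            return queue.poll();
        }

        public int size() {
            return queue.size();
        }
    }

    public static void main(String[] args) {
        FifoScheduler scheduler = new FifoScheduler();
        scheduler.push("http://example.com/a");
        scheduler.push("http://example.com/b");
        scheduler.push("http://example.com/a"); // duplicate, silently dropped
        System.out.println(scheduler.size());   // prints 2
    }
}
```

A subclass never has to think about deduplication; it only decides where an accepted request is stored, which is the same split the real class makes between push and pushWhenNoDuplicate.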
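Now let's look at DuplicateRemover, the interface responsible for deduplication. Its implementations are HashSetDuplicateRemover, which deduplicates with a HashSet; RedisScheduler, which relies on Redis; and BloomFilterDuplicateRemover, which uses a Bloom filter. HashSetDuplicateRemover is the default.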

public class HashSetDuplicateRemover implements DuplicateRemover {

    private Set<String> urls = Sets.newSetFromMap(new ConcurrentHashMap<String, Boolean>());

    @Override
    public boolean isDuplicate(Request request, Task task) {
        return !urls.add(getUrl(request));
    }

    protected String getUrl(Request request) {
        return request.getUrl();
    }

    @Override
    public void resetDuplicateCheck(Task task) {
        urls.clear();
    }

    @Override
    public int getTotalRequestsCount(Task task) {
        return urls.size();
    }
}
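The set-based remover above keeps every URL in memory, which is why a Bloom filter is offered as an alternative for very large crawls. The following is a hand-rolled sketch of that idea, not the library's BloomFilterDuplicateRemover; the class name BloomSketch, the bit-array size and the way the second hash is derived are all assumptions made for illustration.

```java
import java.util.BitSet;

// Hand-rolled sketch for illustration only; BloomSketch is a hypothetical name
// and does not reproduce WebMagic's BloomFilterDuplicateRemover.
class BloomSketch {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    BloomSketch(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Records the URL and reports whether it was (probably) seen before.
    // False positives are possible; false negatives are not.
    synchronized boolean isDuplicate(String url) {
        int h1 = url.hashCode();
        // Second hash derived from the first and forced odd (an arbitrary choice for this sketch).
        int h2 = (Integer.rotateLeft(h1, 16) * 0x9E3779B9) | 1;
        boolean allBitsSet = true;
        for (int i = 0; i < hashCount; i++) {
            int index = Math.floorMod(h1 + i * h2, size);
            if (!bits.get(index)) {
                allBitsSet = false;
                bits.set(index);
            }
        }
        return allBitsSet;
    }

    public static void main(String[] args) {
        BloomSketch bloom = new BloomSketch(1 << 20, 3);
        System.out.println(bloom.isDuplicate("http://example.com/a")); // false: the filter was empty
        System.out.println(bloom.isDuplicate("http://example.com/a")); // true: the same bits are already set
    }
}
```

The trade-off is memory against accuracy: the filter uses a fixed number of bits no matter how many URLs it sees, but it may occasionally report a new URL as a duplicate, so that page would be skipped.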
The abstract class DuplicateRemovedScheduler has four concrete implementations: QueueScheduler, PriorityScheduler, FileCacheQueueScheduler and RedisScheduler. QueueScheduler is the default.

@ThreadSafe
public class QueueScheduler extends DuplicateRemovedScheduler implements MonitorableScheduler {

    private BlockingQueue<Request> queue = new LinkedBlockingQueue<Request>();

    @Override
    public void pushWhenNoDuplicate(Request request, Task task) {
        queue.add(request);
    }

    @Override
    public synchronized Request poll(Task task) {
        return queue.poll();
    }

    @Override
    public int getLeftRequestsCount(Task task) {
        return queue.size();
    }

    @Override
    public int getTotalRequestsCount(Task task) {
        return getDuplicateRemover().getTotalRequestsCount(task);
    }
}
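Internally it stores Requests in a LinkedBlockingQueue, which is effectively unbounded when created without a capacity. You have probably noticed the @ThreadSafe annotation, so here is a question: can the Scheduler run into thread synchronization problems, and if so, how are they handled?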
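One way to explore that question is a small experiment with the same two building blocks: a concurrent set for deduplication and a LinkedBlockingQueue for storage. The sketch below is an assumption-laden stand-in, not WebMagic code; ConcurrentPushDemo and run are hypothetical names, and ConcurrentHashMap.newKeySet() is used in place of Guava's Sets.newSetFromMap.

```java
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical experiment; the names here are not part of WebMagic.
class ConcurrentPushDemo {

    // Every worker pushes the same `urls` URLs; returns how many ended up in the queue.
    static int run(int threads, int urls) {
        Set<String> seen = ConcurrentHashMap.newKeySet();
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch done = new CountDownLatch(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                for (int i = 0; i < urls; i++) {
                    String url = "http://example.com/page/" + i;
                    // add() is atomic, so exactly one thread wins for each URL.
                    if (seen.add(url)) {
                        queue.add(url);
                    }
                }
                done.countDown();
            });
        }
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        } finally {
            pool.shutdown();
        }
        return queue.size();
    }

    public static void main(String[] args) {
        System.out.println(run(8, 1000)); // prints 1000
    }
}
```

Because the check and the insert into the set happen in one atomic add call, no URL is enqueued twice even though eight threads race on it. With a plain HashSet the same experiment could enqueue duplicates or corrupt the set.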
Let's look at the next one, PriorityScheduler:

@ThreadSafe
public class PriorityScheduler extends DuplicateRemovedScheduler implements MonitorableScheduler {

    public static final int INITIAL_CAPACITY = 5;

    private BlockingQueue<Request> noPriorityQueue = new LinkedBlockingQueue<Request>();

    private PriorityBlockingQueue<Request> priorityQueuePlus = new PriorityBlockingQueue<Request>(INITIAL_CAPACITY, new Comparator<Request>() {
        @Override
        public int compare(Request o1, Request o2) {
            return -NumberUtils.compareLong(o1.getPriority(), o2.getPriority());
        }
    });

    private PriorityBlockingQueue<Request> priorityQueueMinus = new PriorityBlockingQueue<Request>(INITIAL_CAPACITY, new Comparator<Request>() {
        @Override
        public int compare(Request o1, Request o2) {
            return -NumberUtils.compareLong(o1.getPriority(), o2.getPriority());
        }
    });

    @Override
    public void pushWhenNoDuplicate(Request request, Task task) {
        if (request.getPriority() == 0) {
            noPriorityQueue.add(request);
        } else if (request.getPriority() > 0) {
            priorityQueuePlus.put(request);
        } else {
            priorityQueueMinus.put(request);
        }
    }

    @Override
    public synchronized Request poll(Task task) {
        Request poll = priorityQueuePlus.poll();
        if (poll != null) {
            return poll;
        }
        poll = noPriorityQueue.poll();
        if (poll != null) {
            return poll;
        }
        return priorityQueueMinus.poll();
    }

    @Override
    public int getLeftRequestsCount(Task task) {
        return noPriorityQueue.size();
    }

    @Override
    public int getTotalRequestsCount(Task task) {
        return getDuplicateRemover().getTotalRequestsCount(task);
    }
}
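Here we see two PriorityBlockingQueues and one LinkedBlockingQueue. poll checks them in a fixed order: positive priorities first, then requests with priority 0, and finally negative priorities.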
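The resulting order is easiest to see with bare numbers. This sketch mirrors the three-queue layout using Long priorities instead of Request objects; PriorityOrderDemo and drainOrder are hypothetical names, and Comparator.reverseOrder() stands in for the negated compareLong in the source.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.PriorityBlockingQueue;

// Hypothetical demo; priorities are bare Longs rather than WebMagic Requests.
class PriorityOrderDemo {

    // Pushes all priorities, then polls until empty and returns the poll order.
    static List<Long> drainOrder(long... priorities) {
        Comparator<Long> highFirst = Comparator.reverseOrder();
        PriorityBlockingQueue<Long> plus = new PriorityBlockingQueue<>(5, highFirst);
        PriorityBlockingQueue<Long> minus = new PriorityBlockingQueue<>(5, highFirst);
        BlockingQueue<Long> zero = new LinkedBlockingQueue<>();

        for (long p : priorities) {
            if (p == 0) {
                zero.add(p);
            } else if (p > 0) {
                plus.put(p);
            } else {
                minus.put(p);
            }
        }

        List<Long> order = new ArrayList<>();
        while (true) {
            // Same lookup order as poll(): positive, then zero, then negative.
            Long next = plus.poll();
            if (next == null) {
                next = zero.poll();
            }
            if (next == null) {
                next = minus.poll();
            }
            if (next == null) {
                break;
            }
            order.add(next);
        }
        return order;
    }

    public static void main(String[] args) {
        System.out.println(drainOrder(-1, 0, 5, -3, 2, 0)); // prints [5, 2, 0, 0, -1, -3]
    }
}
```

Within the positive and negative groups the larger value comes out first, while requests with priority 0 keep their insertion order.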
Next is FileCacheQueueScheduler:

public class FileCacheQueueScheduler extends DuplicateRemovedScheduler implements MonitorableScheduler {

    private String filePath = System.getProperty("java.io.tmpdir");

    private String fileUrlAllName = ".urls.txt";

    private Task task;

    private String fileCursor = ".cursor.txt";

    private PrintWriter fileUrlWriter;

    private PrintWriter fileCursorWriter;

    private AtomicInteger cursor = new AtomicInteger();

    private AtomicBoolean inited = new AtomicBoolean(false);

    private BlockingQueue<Request> queue;

    private Set<String> urls;

    public FileCacheQueueScheduler(String filePath) {
        if (!filePath.endsWith("/") && !filePath.endsWith("\\")) {
            filePath += "/";
        }
        this.filePath = filePath;
    }

    private void flush() {
        fileUrlWriter.flush();
        fileCursorWriter.flush();
    }

    private void init(Task task) {
        this.task = task;
        File file = new File(filePath);
        if (!file.exists()) {
            file.mkdirs();
        }
        readFile();
        initWriter();
        initFlushThread();
        inited.set(true);
        logger.info("init cache scheduler success");
    }

    private void initFlushThread() {
        Executors.newScheduledThreadPool(1).scheduleAtFixedRate(new Runnable() {
            @Override
            public void run() {
                flush();
            }
        }, 10, 10, TimeUnit.SECONDS);
    }

    private void initWriter() {
        try {
            fileUrlWriter = new PrintWriter(new FileWriter(getFileName(fileUrlAllName), true));
            fileCursorWriter = new PrintWriter(new FileWriter(getFileName(fileCursor), false));
        } catch (IOException e) {
            throw new RuntimeException("init cache scheduler error", e);
        }
    }

    private void readFile() {
        try {
            queue = new LinkedBlockingQueue<Request>();
            urls = new LinkedHashSet<String>();
            readCursorFile();
            readUrlFile();
        } catch (FileNotFoundException e) {
            //init
            logger.info("init cache file " + getFileName(fileUrlAllName));
        } catch (IOException e) {
            logger.error("init file error", e);
        }
    }

    private void readUrlFile() throws IOException {
        String line;
        BufferedReader fileUrlReader = null;
        try {
            fileUrlReader = new BufferedReader(new FileReader(getFileName(fileUrlAllName)));
            int lineReaded = 0;
            while ((line = fileUrlReader.readLine()) != null) {
                urls.add(line.trim());
                lineReaded++;
                if (lineReaded > cursor.get()) {
                    queue.add(new Request(line));
                }
            }
        } finally {
            if (fileUrlReader != null) {
                IOUtils.closeQuietly(fileUrlReader);
            }
        }
    }

    private void readCursorFile() throws IOException {
        BufferedReader fileCursorReader = null;
        try {
            fileCursorReader = new BufferedReader(new FileReader(getFileName(fileCursor)));
            String line;
            //read the last number
            while ((line = fileCursorReader.readLine()) != null) {
                cursor = new AtomicInteger(NumberUtils.toInt(line));
            }
        } finally {
            if (fileCursorReader != null) {
                IOUtils.closeQuietly(fileCursorReader);
            }
        }
    }

    private String getFileName(String filename) {
        return filePath + task.getUUID() + filename;
    }

    @Override
    protected void pushWhenNoDuplicate(Request request, Task task) {
        if (!inited.get()) {
            init(task);
        }
        queue.add(request);
        fileUrlWriter.println(request.getUrl());
    }

    @Override
    public synchronized Request poll(Task task) {
        if (!inited.get()) {
            init(task);
        }
        fileCursorWriter.println(cursor.incrementAndGet());
        return queue.poll();
    }

    @Override
    public int getLeftRequestsCount(Task task) {
        return queue.size();
    }

    @Override
    public int getTotalRequestsCount(Task task) {
        return getDuplicateRemover().getTotalRequestsCount(task);
    }
}
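It stores the URLs and a cursor, which counts how many URLs have already been polled, in two separate files, and it creates a scheduled executor that flushes both writers every 10 seconds. In memory, the pending URLs are still held in a BlockingQueue.
I don't understand RedisScheduler well yet, since I haven't worked with Redis so far :)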
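The resume logic in readUrlFile and readCursorFile boils down to "skip as many lines as the cursor says were already polled". Here is a simplified sketch of that idea under stated assumptions: CursorResumeDemo and restore are hypothetical names, the files are throwaway temp files, and the last line of the cursor file is taken as the current cursor, as in the source.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of the resume step; not the WebMagic implementation.
class CursorResumeDemo {

    // Rebuilds the pending queue: the first `cursor` lines were already polled.
    static Deque<String> restore(List<String> urlLines, int cursor) {
        Deque<String> pending = new ArrayDeque<>();
        int linesRead = 0;
        for (String line : urlLines) {
            linesRead++;
            if (linesRead > cursor) {
                pending.add(line.trim());
            }
        }
        return pending;
    }

    public static void main(String[] args) throws IOException {
        Path urlFile = Files.createTempFile("demo", ".urls.txt");
        Path cursorFile = Files.createTempFile("demo", ".cursor.txt");
        try {
            Files.write(urlFile, Arrays.asList(
                    "http://example.com/1",
                    "http://example.com/2",
                    "http://example.com/3"));
            // One line per poll; only the last number matters on restart.
            Files.write(cursorFile, Arrays.asList("1", "2"));

            List<String> cursorLines = Files.readAllLines(cursorFile);
            int cursor = Integer.parseInt(cursorLines.get(cursorLines.size() - 1).trim());

            System.out.println(restore(Files.readAllLines(urlFile), cursor)); // prints [http://example.com/3]
        } finally {
            Files.deleteIfExists(urlFile);
            Files.deleteIfExists(cursorFile);
        }
    }
}
```

Since the writers are only flushed every 10 seconds in the source, a crash can lose the most recent URLs or cursor updates, so a restart may repeat or miss a few requests.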

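Usage

In practice you still have to pick a Scheduler and a DuplicateRemover that fit the characteristics of your own crawler, and only by understanding how they work can you choose the most suitable components.
It's really great that every WebMagic component can be swapped out and configured yourself~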
