An Approach to Building a Distributed Crawler with Akka
Last week another team at our company was discussing how to build a distributed crawler, and after giving it some thought I proposed a design: use Akka as a distributed RPC framework and write our own master and worker programs. A client submits a begin task (or other crawl requests) to the master, and the master directs the workers to crawl pages. All workers join the same Kafka consumer group and pull data (URLs) from Kafka, fetch and process each page, parse its content, and extract the nested URLs with regular expressions. Each extracted URL is sent to a check actor that decides whether it has already been crawled, which prevents the crawl graph from turning into a cyclic directed graph and keeps it a tree (this should be a single dedicated actor, so that concurrent requests cannot cause thread-synchronization problems). Finally, the URLs that haven't been crawled yet are pushed back to Kafka, and the workers keep pulling until the URLs in Kafka are exhausted.
Here is a simple diagram of the flow:
The code below doesn't implement the actual crawling, nor the check actor; it only simulates the flow and serves as a simple distributed example for reference.
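Since the check actor is left out, here is a minimal sketch of what it might look like, assuming an in-memory HashSet as the seen-set; the CheckUrl and CheckResult message names are hypothetical, and a production version would more likely consult Redis or a Bloom filter:

package com.lijie.scala.service
import scala.collection.mutable.HashSet
import akka.actor.Actor

// Hypothetical request/response messages for the check actor
case class CheckUrl(url: String)
case class CheckResult(url: String, alreadyCrawled: Boolean)

// A single actor owns the seen-set, so concurrent checks are serialized
// by its mailbox and need no extra locking.
class CheckActor extends Actor {
  // In-memory for this sketch only
  val seen = new HashSet[String]
  def receive: Actor.Receive = {
    case CheckUrl(url) => {
      val crawled = seen.contains(url)
      // Mark unseen URLs as seen so the same URL is only handed out once
      if (!crawled) seen += url
      // Reply to the asking worker
      sender ! CheckResult(url, crawled)
    }
  }
}

A worker could ask this actor for each extracted URL and push back to Kafka only the ones reported as not yet crawled.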
The code layout is as follows:
The POM is the same as in this post: http://blog.csdn.net/qq_20641565/article/details/65488828
Master code:
package com.lijie.scala.service
import scala.collection.mutable.HashMap
import scala.concurrent.duration.DurationInt
import com.lijie.scala.bean.WorkBean
import com.lijie.scala.utils.ActorUtils
import akka.actor.Actor
import akka.actor.Props
import akka.actor.actorRef2Scala
import com.lijie.scala.bean.WorkBeanInfo
import com.lijie.scala.caseclass.Submit
import com.lijie.scala.caseclass.SubmitAble
import com.lijie.scala.caseclass.Hearbeat
import com.lijie.scala.caseclass.RegisterSucess
import com.lijie.scala.caseclass.CheckConn
import com.lijie.scala.caseclass.Register
import com.lijie.scala.caseclass.SubmitCrawler
import com.lijie.scala.caseclass.BeginCrawler
class Master(val masterHost: String, val masterPort: Int, val masterActorSystem: String, val masterName: String) extends Actor {
// Holds connections to registered workers
var workerConn = new HashMap[String, WorkBean]
// Holds connections to clients
var clientConn = new HashMap[String, WorkBean]
// Heartbeat timeout in milliseconds
val OVERTIME = 20000
override def preStart(): Unit = {
// Brings the ExecutionContext needed by the scheduler into scope
import context.dispatcher
// On startup, periodically check whether any worker has died; if so, remove it from workerConn
context.system.scheduler.schedule(0 millis, OVERTIME millis, self, CheckConn)
}
def receive: Actor.Receive = {
// Worker registration
case Register(workerId, workerHost, workerPort, workerActorSystem, workerName) => {
// Log the worker coming online
println(workerId + "," + workerHost + "," + workerPort + "," + workerActorSystem + "," + workerName)
// Obtain a proxy (ActorSelection) for the worker
val workerRef = context.actorSelection(s"akka.tcp://$workerActorSystem@$workerHost:$workerPort/user/$workerName")
// Save the connection, stamped with the current time so a fresh worker is not immediately treated as timed out
workerConn += (workerId -> new WorkBean(workerRef, System.currentTimeMillis()))
// Acknowledge successful registration to the worker
sender ! RegisterSucess
}
// Receive heartbeats
case Hearbeat(workerId) => {
if (workerConn.contains(workerId)) {
// Refresh the worker's last-heartbeat timestamp; WorkBean.time is mutable, so updating it in place is enough
val workBean = workerConn(workerId)
workBean.time = System.currentTimeMillis()
}
}
// Periodic liveness check
case CheckConn => {
// Collect the workers whose last heartbeat is older than the timeout
val over = workerConn.filter(entry => System.currentTimeMillis() - entry._2.time > OVERTIME)
for (key <- over.keySet) {
// Drop the timed-out workers from the connection map
workerConn -= key
}
// For testing: print how many workers are still alive
val alive = workerConn.size
println(s"$alive workers still alive")
}
case Submit(clientId, clientHost, clientPort, clientActorSystem, clientName) => {
// Log the client coming online
println(clientId + "," + clientHost + "," + clientPort + "," + clientActorSystem + "," + clientName)
// Obtain a proxy (ActorSelection) for the client
val clientRef = context.actorSelection(s"akka.tcp://$clientActorSystem@$clientHost:$clientPort/user/$clientName")
// Save the connection
clientConn += (clientId -> new WorkBean(clientRef, System.currentTimeMillis()))
// Tell the client it may submit tasks
sender ! SubmitAble
}
// On receiving a crawl job, fan it out to the workers
case SubmitCrawler(kafka, redis, other) => {
for (workerBean <- workerConn.values) {
// Send the crawl job to every live worker
workerBean.worker ! BeginCrawler(kafka, redis, other)
}
}
}
}
object Master {
def main(args: Array[String]): Unit = {
// Hardcoded for the demo instead of reading from args
val argss = Array[String]("127.0.0.1", "8080", "masterSystem", "actorMaster")
val host = argss(0)
val port = argss(1).toInt
val actorSystem = argss(2)
val actorName = argss(3)
// Create the master's ActorSystem
val masterSystem = ActorUtils.getActorSystem(host, port, actorSystem)
val master = masterSystem.actorOf(Props(new Master(host, port, actorSystem, actorName)), actorName)
masterSystem.awaitTermination()
}
}
Worker code:
package com.lijie.scala.service
import akka.actor.Actor
import akka.actor.ActorSelection
import java.util.UUID
import scala.concurrent.duration._
import com.lijie.scala.caseclass.SendHearbeat
import com.lijie.scala.utils.ActorUtils
import akka.actor.Props
import com.lijie.scala.caseclass.BeginCrawler
import com.lijie.scala.caseclass.Hearbeat
import com.lijie.scala.caseclass.RegisterSucess
import com.lijie.scala.caseclass.Register
class Worker(val workerHost: String, val workerPort: Int, val workerActorSystem: String, val workerName: String, val masterHost: String, val masterPort: Int, val masterActorSystem: String, val masterName: String) extends Actor {
// Proxy for the master
var master: ActorSelection = _
// Unique id for this worker
val workerId = UUID.randomUUID().toString()
override def preStart(): Unit = {
// Obtain a proxy for the master
master = context.actorSelection(s"akka.tcp://$masterActorSystem@$masterHost:$masterPort/user/$masterName")
// Register with the master
master ! Register(workerId, workerHost, workerPort, workerActorSystem, workerName)
}
def receive: Actor.Receive = {
// On registration success, start sending heartbeats on a timer
case RegisterSucess => {
println("Registration succeeded, starting heartbeats")
// Brings the ExecutionContext needed by the scheduler into scope
import context.dispatcher
// Schedule a heartbeat message to self every 10 seconds
context.system.scheduler.schedule(0 millis, 10000 millis, self, SendHearbeat)
}
// Send a heartbeat
case SendHearbeat => {
println("Sending heartbeat to master")
// Send the heartbeat to the master
master ! Hearbeat(workerId)
}
// Start crawling (simulated with printouts)
case BeginCrawler(kafka, redis, other) => {
println("Starting the crawl job...")
println("Kafka, Redis and other info: " + kafka + "," + redis + "," + other)
println("Initializing the Kafka and Redis connections...")
println("Pulling a URL from the queue...")
println("Crawling the page...")
println("Retrying a few times on failure...")
println("............")
println("Parsing the page content and extracting the URLs inside...")
// Ask the check actor
println("Asking the check actor...")
println("Checking whether each URL has been crawled before...")
println("Pushing the just-crawled URL to Redis...")
println("Pushing the unseen URLs from this page back to the queue...")
println("Keep pulling URLs from the queue until it is drained...")
}
}
}
object Worker {
def main(args: Array[String]): Unit = {
// Hardcoded for the demo instead of reading from args
val argss = Array[String]("127.0.0.1", "8088", "workSystem", "actorWorker", "127.0.0.1", "8080", "masterSystem", "actorMaster")
// Worker settings
val host = argss(0)
val port = argss(1).toInt
val actorSystem = argss(2)
val actorName = argss(3)
// Master settings
val hostM = argss(4)
val portM = argss(5).toInt
val actorSystemM = argss(6)
val actorNameM = argss(7)
// Create the worker's ActorSystem
val workerSystem = ActorUtils.getActorSystem(host, port, actorSystem)
val worker = workerSystem.actorOf(Props(new Worker(host, port, actorSystem, actorName, hostM, portM, actorSystemM, actorNameM)), actorName)
workerSystem.awaitTermination()
}
}
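The BeginCrawler handler above only prints the steps it would perform. As a rough sketch of what a real handler might do, assuming the standard Kafka consumer client, with a placeholder topic name and group id and the fetching/parsing left as comments:

package com.lijie.scala.service
import java.util.{ Arrays, Properties }
import org.apache.kafka.clients.consumer.KafkaConsumer

object CrawlLoopSketch {
  // kafkaInfo would carry the bootstrap servers, e.g. "127.0.0.1:9092"
  def crawlLoop(kafkaInfo: String): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", kafkaInfo)
    // All workers share one group id, so Kafka balances the URL partitions across them
    props.put("group.id", "crawler-workers")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    val consumer = new KafkaConsumer[String, String](props)
    // "crawl-urls" is a placeholder topic name
    consumer.subscribe(Arrays.asList("crawl-urls"))
    while (true) {
      val records = consumer.poll(1000)
      val it = records.iterator()
      while (it.hasNext()) {
        val url = it.next().value()
        // fetch the page, retrying a few times on failure (omitted)
        // parse the page and extract embedded URLs with a regex (omitted)
        // ask the check actor which URLs are new, push those back to Kafka,
        // and record the crawled URL in Redis (omitted)
      }
    }
  }
}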
ActorUtils code:
package com.lijie.scala.utils
import com.typesafe.config.ConfigFactory
import akka.actor.ActorSystem
import akka.actor.Props
import akka.actor.Actor
object ActorUtils {
// Utility that creates a remoting-enabled ActorSystem
def getActorSystem(host: String, port: Int, actorSystem: String) = {
val conf = s"""
|akka.actor.provider = "akka.remote.RemoteActorRefProvider"
|akka.remote.netty.tcp.hostname = "$host"
|akka.remote.netty.tcp.port = "$port"
""".stripMargin
val config = ConfigFactory.parseString(conf)
// Create the ActorSystem with the remoting config
val actorSys = ActorSystem(actorSystem, config)
// Return the ActorSystem
actorSys
}
}
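For reference, the remoting settings built up in the string above could equally well live in an application.conf on the classpath, which ConfigFactory.load() picks up automatically when the ActorSystem is created without an explicit config; the host and port shown here are the demo master's values:

akka {
  actor.provider = "akka.remote.RemoteActorRefProvider"
  remote.netty.tcp {
    hostname = "127.0.0.1"
    port = 8080
  }
}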
WorkBean code:
package com.lijie.scala.bean
import akka.actor.ActorSelection
// Wraps a worker's actor reference and its last-heartbeat timestamp
class WorkBean(var worker: ActorSelection, var time: Long)
// Not actually used in this demo
class WorkBeanInfo(val workerId: String, val workerHost: String, val workerPort: Int, val workerActorSystem: String, val workerName: String, var time: Long)
Case class code:
package com.lijie.scala.caseclass
// Kick off submission (client to client)
case object BeginSubmit
// client2client-------------------------------
// Client announces itself to the master (client to master)
case class Submit(val clientId: String, val clientHost: String, val clientPort: Int, val clientActorSystem: String, val clientName: String) extends Serializable
// Submit a crawl job (client to master)
case class SubmitCrawler(val kafkaInfo: String, val redisInfo: String, val otherInfo: String)
// client2master-------------------------------
// Master tells the client it may submit (master to client)
case object SubmitAble
// master2client-------------------------------
// Check which workers have died (master to master)
case object CheckConn
// Registration succeeded (master to worker)
case object RegisterSucess extends Serializable
// master2worker-------------------------------
// Worker registration (worker to master)
case class Register(val workerId: String, val workerHost: String, val workerPort: Int, val workerActorSystem: String, val workerName: String) extends Serializable
// Heartbeat (worker to master)
case class Hearbeat(workId: String) extends Serializable
// worker2master-------------------------------
// Trigger a heartbeat send (worker to itself)
case object SendHearbeat
// Start crawling (sent by the master to workers)
case class BeginCrawler(val kafkaInfo: String, val redisInfo: String, val otherInfo: String)
// worker2worker-------------------------------
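The client's code isn't shown in the post, but it can be pieced together from the messages above: the client announces itself with Submit, waits for SubmitAble, then sends SubmitCrawler. Here is a minimal sketch; the client port 8090 and the payload strings are made up for illustration:

package com.lijie.scala.service
import java.util.UUID
import akka.actor.Actor
import akka.actor.ActorSelection
import akka.actor.Props
import com.lijie.scala.utils.ActorUtils
import com.lijie.scala.caseclass.Submit
import com.lijie.scala.caseclass.SubmitAble
import com.lijie.scala.caseclass.SubmitCrawler

class Client(val clientHost: String, val clientPort: Int, val clientActorSystem: String, val clientName: String) extends Actor {
  // Proxy for the master
  var master: ActorSelection = _
  // Unique id for this client
  val clientId = UUID.randomUUID().toString()

  override def preStart(): Unit = {
    // Connect to the demo master and announce this client
    master = context.actorSelection("akka.tcp://masterSystem@127.0.0.1:8080/user/actorMaster")
    master ! Submit(clientId, clientHost, clientPort, clientActorSystem, clientName)
  }

  def receive: Actor.Receive = {
    // The master says we may submit, so send the crawl job
    case SubmitAble => {
      println("Master accepted the client, submitting the crawl job")
      master ! SubmitCrawler("kafkaInfo", "redisInfo", "otherInfo")
    }
  }
}

object Client {
  def main(args: Array[String]): Unit = {
    val clientSystem = ActorUtils.getActorSystem("127.0.0.1", 8090, "clientSystem")
    clientSystem.actorOf(Props(new Client("127.0.0.1", 8090, "clientSystem", "actorClient")), "actorClient")
    clientSystem.awaitTermination()
  }
}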
Finally, start the master first, then the workers (I ran two here), and then the client. The results are as follows:
Master / Worker01 / Worker02 / Client: console output screenshots (not reproduced here).