mk-supervisor

(defserverfn mk-supervisor [conf shared-context ^ISupervisor isupervisor]
  (log-message "Starting Supervisor with conf " conf)
  ;; initialize supervisor-id and store it in local state (see the ISupervisor implementation)
  (.prepare isupervisor conf (supervisor-isupervisor-dir conf))
  ;; empty the supervisor's local tmp directory
  (FileUtils/cleanDirectory (File. (supervisor-tmp-dir conf)))
  (let [supervisor (supervisor-data conf shared-context isupervisor)
        ;; create two event-managers that run functions in the background
        [event-manager processes-event-manager :as managers] [(event/event-manager false) (event/event-manager false)]
        ;; partially apply sync-processes to supervisor
        sync-processes (partial sync-processes supervisor)
        ;; mk-synchronize-supervisor does the main work of mk-supervisor; see below
        synchronize-supervisor (mk-synchronize-supervisor supervisor sync-processes event-manager processes-event-manager)
        ;; function that writes the supervisor heartbeat
        heartbeat-fn (fn [] (.supervisor-heartbeat!
                              (:storm-cluster-state supervisor)
                              (:supervisor-id supervisor)
                              (SupervisorInfo. (current-time-secs)
                                               (:my-hostname supervisor)
                                               (:assignment-id supervisor)
                                               (keys @(:curr-assignment supervisor))
                                               ;; used ports
                                               (.getMetadata isupervisor)
                                               (conf SUPERVISOR-SCHEDULER-META)
                                               ((:uptime supervisor)))))]
    ;; send one supervisor heartbeat right away by calling heartbeat-fn,
    ;; then use schedule-recurring to call heartbeat-fn periodically
    (heartbeat-fn)
    ;; should synchronize supervisor so it doesn't launch anything after being down (optimization)
    (schedule-recurring (:timer supervisor)
                        0
                        (conf SUPERVISOR-HEARTBEAT-FREQUENCY-SECS)
                        heartbeat-fn))
 

mk-synchronize-supervisor

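The supervisor is simple. It is responsible for two things:
When an assignment changes, sync the topology code from nimbus to the local machine.
When an assignment changes, check the state of the workers and make sure every assigned worker is valid.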

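This leads to two requirements:
1. The work must be triggered whenever an assignment changes.
    For how a ZooKeeper watcher is used to get this repeated triggering, see
Storm-源码分析- Storm中Zookeeper的使用

2. The work is time-consuming, so it runs in the background.
    Two event-managers are created: one runs the function returned by mk-synchronize-supervisor in the background, the other runs sync-processes.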
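
To make the background-execution idea concrete, here is a minimal sketch. It assumes an event manager is essentially a queue of zero-argument functions drained by a single background thread. make-event-manager and its :add / :shutdown keys are hypothetical names for illustration, not Storm's event/event-manager.

```clojure
(import '[java.util.concurrent LinkedBlockingQueue])

(defn make-event-manager
  "Hypothetical stand-in for an event manager: one daemon thread runs queued
  zero-argument functions one at a time, in submission order."
  []
  (let [queue   (LinkedBlockingQueue.)
        running (atom true)
        worker  (Thread.
                  ^Runnable
                  (fn []
                    (try
                      (while @running
                        ;; block until a function is available, then call it
                        ((.take queue)))
                      (catch InterruptedException _ nil))))]
    (.setDaemon worker true)
    (.start worker)
    {:add      (fn [f] (.put queue f))
     :shutdown (fn [] (reset! running false) (.interrupt worker))}))

;; Usage: the caller returns immediately; the work happens on the worker thread.
(def results (atom []))
(def done (promise))
(def em (make-event-manager))

((:add em) #(swap! results conj :sync-supervisor))
((:add em) #(swap! results conj :sync-processes))
((:add em) #(deliver done true))

(deref done 2000 false) ; wait (up to 2 s) for the queue to drain
((:shutdown em))
```

Because one thread drains the queue, queued functions never run concurrently with each other, which is one plausible reason for giving the two kinds of work separate managers. In this sketch an exception thrown by a queued function would end the worker thread.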

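What stands out in mk-synchronize-supervisor is that the function body is wrapped in a named anonymous function called this.
It looks odd at first, but the name is what lets sync-callback add this same function to the event-manager.
A ZooKeeper watch fires only once, so every run has to register sync-callback with ZooKeeper again to make sure the next change triggers it too.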
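
The self-re-registering pattern can be sketched without ZooKeeper. In the sketch below, watch-slot, read-with-watch, fire-watch! and mk-sync are hypothetical names: an atom that holds at most one callback stands in for a one-shot watch, and a plain vector stands in for the event-manager queue.

```clojure
;; Hypothetical stand-in for a one-shot watch: a slot holding at most one
;; callback, which is cleared as soon as it fires.
(def watch-slot (atom nil))

(defn read-with-watch
  "Return the current value of `store` and register `callback` as the
  one-shot watcher."
  [store callback]
  (reset! watch-slot callback)
  @store)

(defn fire-watch!
  "Simulate a change notification: clear the slot, then call the callback."
  []
  (when-let [cb @watch-slot]
    (reset! watch-slot nil)
    (cb)))

(defn mk-sync
  "Build a sync function that re-registers itself on every run."
  [store queue seen]
  (fn this []
    ;; `this` names the function itself, so the callback can enqueue it again
    (let [callback (fn [& _] (swap! queue conj this))
          snapshot (read-with-watch store callback)]
      (swap! seen conj snapshot))))

;; Usage
(def store (atom {:v 1}))
(def queue (atom []))   ; stands in for the event manager's queue
(def seen (atom []))
(def sync-fn (mk-sync store queue seen))

(sync-fn)               ; first run reads the data and registers the watch
(reset! store {:v 2})
(fire-watch!)           ; the change notification enqueues sync-fn again
(doseq [f @queue] (f))  ; "background" run; it registers the watch once more
```

Without the name, the callback would have no way to refer to the enclosing function, and the chain would stop after the first notification.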

(defn mk-synchronize-supervisor [supervisor sync-processes event-manager processes-event-manager]
  (fn this []
    (let [conf (:conf supervisor)
          storm-cluster-state (:storm-cluster-state supervisor)
          ^ISupervisor isupervisor (:isupervisor supervisor)
          ^LocalState local-state (:local-state supervisor) ;; local cache database
          ;; callback that adds this function to event-manager again (runs it in the background)
          sync-callback (fn [& ignored] (.add event-manager this))
          ;; read the assignments and register the callback, which fires when the assignments in zk change
          assignments-snapshot (assignments-snapshot storm-cluster-state sync-callback)
          storm-code-map (read-storm-code-locations assignments-snapshot) ;; where to download each topology's code from
          downloaded-storm-ids (set (read-downloaded-storm-ids conf)) ;; topologies already downloaded
          all-assignment (read-assignments ;; executors assigned to this supervisor's ports
                           assignments-snapshot
                           (:assignment-id supervisor)) ;; supervisor-id
          new-assignment (->> all-assignment ;; new = all, because confirmAssigned has no real implementation and always returns true
                              (filter-key #(.confirmAssigned isupervisor %)))
          assigned-storm-ids (assigned-storm-ids-from-port-assignments new-assignment) ;; set of topology ids assigned to this supervisor
          existing-assignment (.get local-state LS-LOCAL-ASSIGNMENTS)] ;; local assignments currently saved in local-state
      ;; download the code of newly assigned topologies
      (doseq [[storm-id master-code-dir] storm-code-map]
        (when (and (not (downloaded-storm-ids storm-id))
                   (assigned-storm-ids storm-id))
          (download-storm-code conf storm-id master-code-dir)))
      (.put local-state ;; save new-assignment to local-state
            LS-LOCAL-ASSIGNMENTS
            new-assignment)
      (reset! (:curr-assignment supervisor) new-assignment) ;; cache new-assignment in the supervisor object
      ;; delete topology code that is no longer needed
      ;; remove any downloaded code that's no longer assigned or active
      (doseq [storm-id downloaded-storm-ids]
        (when-not (assigned-storm-ids storm-id)
          (log-message "Removing code for storm id " storm-id)
          (rmr (supervisor-stormdist-root conf storm-id))))
      ;; run sync-processes in the background
      (.add processes-event-manager sync-processes))))
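
The two doseq loops above amount to set arithmetic over topology ids. A minimal sketch, assuming plain sets of ids; plan-code-sync and its argument names are hypothetical, and `available` stands for the keys of storm-code-map:

```clojure
(require '[clojure.set :as set])

(defn plan-code-sync
  "Given the topology ids whose code location is known (`available`), the ids
  assigned to this supervisor (`assigned`) and the ids already on disk
  (`downloaded`), return what to download and what to remove."
  [available assigned downloaded]
  {:download (set/difference (set/intersection available assigned) downloaded)
   :remove   (set/difference downloaded assigned)})

;; Usage
(def plan
  (plan-code-sync #{"t1" "t2" "t3"}   ; code locations known
                  #{"t1" "t2"}        ; assigned here
                  #{"t2" "t4"}))      ; already downloaded
;; t1 is assigned but not yet on disk; t4 is on disk but no longer assigned
```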

sync-processes

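sync-processes manages the workers: it deals with abnormal or dead workers and creates new ones.
It first reads the worker heartbeats from local storage to judge each worker's state, and shuts down every worker whose state is not valid.
Then, for each assigned slot whose worker is not valid, it creates a new worker.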

(defn sync-processes [supervisor]
  (let [conf (:conf supervisor)
        ^LocalState local-state (:local-state supervisor)
        assigned-executors (defaulted (.get local-state LS-LOCAL-ASSIGNMENTS) {})
        now (current-time-secs)
        allocated (read-allocated-workers supervisor assigned-executors now) ;; 1. read the current state of the workers
        keepers (filter-val ;; workers whose state is :valid
                  (fn [[state _]] (= state :valid))
                  allocated)
        keep-ports (set (for [[id [_ hb]] keepers] (:port hb))) ;; ports of the keepers
        ;; select-keys-pred (pred map) filters the keys of the map with pred.
        ;; Find the ports in assigned-executors that are not in keep-ports,
        ;; i.e. workers that are newly assigned, or assigned but not valid (dead or not started).
        ;; These executors need to be reassigned to new workers.
        reassign-executors (select-keys-pred (complement keep-ports) assigned-executors)
        new-worker-ids (into
                         {}
                         (for [port (keys reassign-executors)] ;; generate a new worker-id for each port in reassign-executors
                           [port (uuid)]))]
    ;; 1. to kill are those in allocated that are dead or disallowed
    ;; 2. kill the ones that should be dead
    ;; - read pids, kill -9 and individually remove file
    ;; - rmr heartbeat dir, rmdir pid dir, rmdir id dir (catch exception and log)
    ;; 3. of the rest, figure out what assignments aren't yet satisfied
    ;; 4. generate new worker ids, write new "approved workers" to LS
    ;; 5. create local dir for worker id
    ;; 5. launch new workers (give worker-id, port, and supervisor-id)
    ;; 6. wait for workers launch
    (doseq [[id [state heartbeat]] allocated]
      (when (not= :valid state) ;; shut down every worker whose state is not valid
        (shutdown-worker supervisor id)))
    ;; create a directory for each new worker, then add the new workers to LS-APPROVED-WORKERS in local-state
    (doseq [id (vals new-worker-ids)]
      (local-mkdirs (worker-pids-root conf id)))
    (.put local-state LS-APPROVED-WORKERS ;; updated approved workers = valid workers + new workers
          (merge
            (select-keys (.get local-state LS-APPROVED-WORKERS) ;; currently approved workers that are valid
                         (keys keepers))
            (zipmap (vals new-worker-ids) (keys new-worker-ids)))) ;; new workers
    (wait-for-workers-launch ;; 2. wait-for-workers-launch
      conf
      (dofor [[port assignment] reassign-executors]
        (let [id (new-worker-ids port)]
          (launch-worker supervisor
                         (:storm-id assignment)
                         port
                         id)
          id)))))
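
The selection logic in sync-processes can be tried out on plain maps. This is a sketch under assumptions: plan-workers is a hypothetical helper, `allocated` maps worker id to [state heartbeat] as read-allocated-workers returns it, and `assigned` maps port to assignment.

```clojure
(defn plan-workers
  "allocated: worker-id -> [state heartbeat]; assigned: port -> assignment.
  Returns the worker ids to keep, the worker ids to shut down, and the
  assigned ports that need a fresh worker."
  [allocated assigned]
  (let [keepers    (into {} (filter (fn [[_ [state _]]] (= state :valid)) allocated))
        keep-ports (set (map (fn [[_ [_ hb]]] (:port hb)) keepers))]
    {:keep     (set (keys keepers))
     :shutdown (set (for [[id [state _]] allocated
                          :when (not= state :valid)]
                      id))
     ;; a port needs a new worker when no valid worker is serving it
     :relaunch (set (remove keep-ports (keys assigned)))}))

;; Usage
(def worker-plan
  (plan-workers
    {"w1" [:valid      {:port 6700}]
     "w2" [:timed-out  {:port 6701}]
     "w3" [:disallowed {:port 6703}]}
    {6700 {:storm-id "t1"}
     6701 {:storm-id "t1"}
     6702 {:storm-id "t2"}}))
```

Port 6701 is relaunched because its worker timed out, and port 6702 because it is newly assigned and has no worker yet. Worker w3 is shut down without a replacement, since port 6703 is no longer assigned.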

1. read-allocated-workers

(defn read-allocated-workers
  "Returns map from worker id to worker heartbeat. if the heartbeat is nil, then the worker is dead (timed out or never wrote heartbeat)"
  [supervisor assigned-executors now]
  (let [conf (:conf supervisor)
        ^LocalState local-state (:local-state supervisor)
        ;; read each worker's heartbeat from local-state; every worker process keeps updating its local heartbeat
        id->heartbeat (read-worker-heartbeats conf)
        approved-ids (set (keys (.get local-state LS-APPROVED-WORKERS)))] ;; approved workers, read from local-state
    (into
      {}
      (dofor [[id hb] id->heartbeat] ;; derive each worker's current state from its heartbeat
        (let [state (cond
                      (or (not (contains? approved-ids id))
                          (not (matches-an-assignment? hb assigned-executors)))
                      :disallowed ;; not allowed
                      (not hb)
                      :not-started ;; no heartbeat: not started
                      (> (- now (:time-secs hb))
                         (conf SUPERVISOR-WORKER-TIMEOUT-SECS))
                      :timed-out ;; timed out: dead
                      true
                      :valid)]
          (log-debug "Worker " id " is " state ": " (pr-str hb) " at supervisor time-secs " now)
          ;; return each worker's current state and heartbeat
          [id [state hb]])))))
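
The order of the cond clauses decides which state wins when several conditions hold. A minimal sketch of the same decision on plain data; classify-worker and the keys of its argument map are hypothetical, with `approved?` and `matches?` standing for the two checks in the first clause:

```clojure
(defn classify-worker
  "Return the worker state, checking the conditions in the same order as the
  cond in read-allocated-workers."
  [{:keys [approved? matches? hb now timeout-secs]}]
  (cond
    (or (not approved?) (not matches?))      :disallowed
    (not hb)                                 :not-started
    (> (- now (:time-secs hb)) timeout-secs) :timed-out
    :else                                    :valid))

;; Usage
(def base {:approved? true :matches? true :now 100 :timeout-secs 30})

(def states
  [(classify-worker (assoc base :hb {:time-secs 90}))                  ; fresh heartbeat
   (classify-worker (assoc base :hb {:time-secs 60}))                  ; 40 s old, limit is 30 s
   (classify-worker (assoc base :hb nil))                              ; never wrote a heartbeat
   (classify-worker (assoc base :approved? false :hb {:time-secs 90}))]) ; not approved
```

A worker that is not approved is reported as :disallowed even when its heartbeat is fresh, because that clause comes first.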

2. wait-for-workers-launch

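launch-worker is called for each new worker id generated for reassign-executors.

Finally wait-for-workers-launch is called to wait until the workers have launched successfully.

The logic is simple: check the heartbeat; while there is none, keep sleeping until the timeout is reached, then log "failed to start".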

(defn- wait-for-worker-launch [conf id start-time]
  (let [state (worker-state conf id)]
    (loop []
      (let [hb (.get state LS-WORKER-HEARTBEAT)]
        (when (and
                (not hb)
                (<
                  (- (current-time-secs) start-time)
                  (conf SUPERVISOR-WORKER-START-TIMEOUT-SECS)))
          (log-message id " still hasn't started")
          (Time/sleep 500)
          (recur))))
    (when-not (.get state LS-WORKER-HEARTBEAT)
      (log-message "Worker " id " failed to start"))))

(defn- wait-for-workers-launch [conf ids]
  (let [start-time (current-time-secs)]
    (doseq [id ids]
      (wait-for-worker-launch conf id start-time))))
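
The same poll-sleep-timeout loop can be written as a small reusable function. This is a sketch, not Storm code: wait-until and its parameters are hypothetical, and it uses wall-clock milliseconds rather than Storm's Time utility.

```clojure
(defn wait-until
  "Poll `ready?` every `interval-ms` until it returns truthy or `timeout-ms`
  has elapsed. Returns true if the condition was met, false on timeout."
  [ready? timeout-ms interval-ms]
  (let [deadline (+ (System/currentTimeMillis) timeout-ms)]
    (loop []
      (cond
        (ready?) true
        (>= (System/currentTimeMillis) deadline) false
        :else (do (Thread/sleep (long interval-ms))
                  (recur))))))

;; Usage: the condition becomes true on the third check.
(def calls (atom 0))
(def started? (wait-until #(>= (swap! calls inc) 3) 1000 1))

;; A condition that never holds gives up once the deadline has passed.
(def never-started? (wait-until (constantly false) 20 5))
```

As in wait-for-workers-launch, a caller that waits on several workers can compute one deadline up front and share it, so the total wait is bounded by a single timeout rather than one timeout per worker.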
