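Akka Source Code Analysis: Cluster Metrics

Any application eventually needs monitoring once it reaches the maintenance phase, and Akka is no exception: it provides a metrics extension for cluster mode.

If you have read the earlier articles in this series, you could probably write such a monitoring tool yourself. Put simply: create an actor that collects the node's performance data and publishes it through the eventStream or PUB/SUB; any actor or router that needs the data subscribes and reacts accordingly. Akka most likely does something similar, since in Akka everything is an actor.

The Metrics extension that Akka implements collects system performance metrics and publishes them to the other nodes in the cluster. Personally I think this is not enough, because it only covers system-level metrics and nothing about the ActorSystem itself, such as the current number of actors, the number of unprocessed, processed, and failed messages, the average processing time per actor, or the average processing time per message type. Such ActorSystem metrics could give developers another angle from which to monitor actors.

Enough preamble. Let's look at Akka's monitoring system.

Metrics Collector. A collector gathers metrics, and each collector provides a different set of them. The cluster metrics extension currently ships two default implementations: akka.cluster.metrics.SigarMetricsCollector and akka.cluster.metrics.JmxMetricsCollector. The first collects finer-grained, more accurate metrics but uses more resources; the second is the opposite. JMX was mentioned in the Cluster source analysis but not explained in depth. Collectors are loaded in a fixed order: a user-defined collector has the highest priority, then SigarMetricsCollector, then JmxMetricsCollector. Having priorities means that only one collector is active at a time.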
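The loading order can be pictured as a fallback chain. The sketch below is only an illustration in plain Scala, not Akka's loader: Collector, select, and the three factories are hypothetical names, and the failures are simulated with ordinary exceptions.

```scala
import scala.util.{Failure, Try}

object CollectorSelectionSketch {
  // Hypothetical stand-in for a metrics collector; only the name matters here.
  final case class Collector(name: String)

  // Try each factory in priority order and keep the first one that loads.
  // orElse takes its argument by name, so later factories run only if needed.
  def select(candidates: Seq[() => Collector]): Try[Collector] =
    candidates.foldLeft[Try[Collector]](Failure(new IllegalStateException("no collector configured"))) {
      (chosen, next) => chosen.orElse(Try(next()))
    }

  // Simulated chain: no custom collector, Sigar unavailable, JMX always works.
  val custom: () => Collector = () => throw new ClassNotFoundException("no custom provider configured")
  val sigar: () => Collector = () => throw new IllegalStateException("sigar native library not available")
  val jmx: () => Collector = () => Collector("jmx")

  val selected: Try[Collector] = select(Seq(custom, sigar, jmx))
}
```

Here selected ends up holding the JMX collector, because the two higher-priority candidates fail to load.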

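Metrics Events. In Akka the two most important things are actors and messages, so what a collector produces is a metrics event, which is published at a fixed interval.

At this point you will probably think of the message mediator discussed earlier, that is, the PUB/SUB framework. With a publish/subscribe mechanism, a metrics system like this is straightforward to implement.

First, let's see how to subscribe to metrics events.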

import akka.actor.{ Actor, ActorLogging }
import akka.cluster.Cluster
import akka.cluster.ClusterEvent.CurrentClusterState
import akka.cluster.metrics.{ ClusterMetricsChanged, ClusterMetricsExtension, NodeMetrics }
import akka.cluster.metrics.StandardMetrics.{ Cpu, HeapMemory }

class MetricsListener extends Actor with ActorLogging {
  val selfAddress = Cluster(context.system).selfAddress
  val extension = ClusterMetricsExtension(context.system)

  // Subscribe unto ClusterMetricsEvent events.
  override def preStart(): Unit = extension.subscribe(self)

  // Unsubscribe from ClusterMetricsEvent events.
  override def postStop(): Unit = extension.unsubscribe(self)

  def receive = {
    // Only log the metrics that were sampled on this node.
    case ClusterMetricsChanged(clusterMetrics) ⇒
      clusterMetrics.filter(_.address == selfAddress) foreach { nodeMetrics ⇒
        logHeap(nodeMetrics)
        logCpu(nodeMetrics)
      }
    case state: CurrentClusterState ⇒ // Ignore.
  }

  def logHeap(nodeMetrics: NodeMetrics): Unit = nodeMetrics match {
    case HeapMemory(address, timestamp, used, committed, max) ⇒
      log.info("Used heap: {} MB", used.doubleValue / 1024 / 1024)
    case _ ⇒ // No heap info.
  }

  def logCpu(nodeMetrics: NodeMetrics): Unit = nodeMetrics match {
    case Cpu(address, timestamp, Some(systemLoadAverage), cpuCombined, cpuStolen, processors) ⇒
      log.info("Load: {} ({} processors)", systemLoadAverage, processors)
    case _ ⇒ // No cpu info.
  }
}

Simple, isn't it? It really comes down to one line: ClusterMetricsExtension(context.system).subscribe(self). So let's start from ClusterMetricsExtension.

/**
 * Cluster metrics extension.
 *
 * Cluster metrics is primarily for load-balancing of nodes. It controls metrics sampling
 * at a regular frequency, prepares highly variable data for further analysis by other entities,
 * and publishes the latest cluster metrics data around the node ring and local eventStream
 * to assist in determining the need to redirect traffic to the least-loaded nodes.
 *
 * Metrics sampling is delegated to the [[MetricsCollector]].
 *
 * Smoothing of the data for each monitored process is delegated to the
 * [[EWMA]] for exponential weighted moving average.
 */
class ClusterMetricsExtension(system: ExtendedActorSystem) extends Extension {

  /**
   * Metrics extension configuration.
   */
  val settings = ClusterMetricsSettings(system.settings.config)
  import settings._

  /**
   * INTERNAL API
   *
   * Supervision strategy.
   */
  private[metrics] val strategy = system.dynamicAccess.createInstanceFor[SupervisorStrategy](
    SupervisorStrategyProvider, immutable.Seq(classOf[Config] → SupervisorStrategyConfiguration))
    .getOrElse {
      val log: LoggingAdapter = Logging(system, getClass.getName)
      log.error(s"Configured strategy provider ${SupervisorStrategyProvider} failed to load, using default ${classOf[ClusterMetricsStrategy].getName}.")
      new ClusterMetricsStrategy(SupervisorStrategyConfiguration)
    }

  /**
   * Supervisor actor.
   * Accepts subtypes of [[CollectionControlMessage]]s to manage metrics collection at runtime.
   */
  val supervisor = system.systemActorOf(
    Props(classOf[ClusterMetricsSupervisor]).withDispatcher(MetricsDispatcher).withDeploy(Deploy.local),
    SupervisorName)

  /**
   * Subscribe user metrics listener actor unto [[ClusterMetricsEvent]]
   * events published by extension on the system event bus.
   */
  def subscribe(metricsListener: ActorRef): Unit = {
    system.eventStream.subscribe(metricsListener, classOf[ClusterMetricsEvent])
  }

  /**
   * Unsubscribe user metrics listener actor from [[ClusterMetricsEvent]]
   * events published by extension on the system event bus.
   */
  def unsubscribe(metricsListenter: ActorRef): Unit = {
    system.eventStream.unsubscribe(metricsListenter, classOf[ClusterMetricsEvent])
  }
}

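Notice that the official comments on important classes are quite detailed. The cluster metrics extension is mainly meant for load balancing: it samples system metrics periodically and publishes them. The data is smoothed with an EWMA (exponentially weighted moving average), which is not analyzed further here.

The ClusterMetricsExtension source is very simple: it defines a supervision strategy, starts a system actor, and exposes the subscribe/unsubscribe API. Subscribers receive the events through the eventStream. At this point I hardly want to continue, because the implementation is about what we predicted.

For stability, a top-level supervisor actor is responsible for starting the metrics collector actor. This pattern is worth appreciating; it has appeared in many places before, but we never looked at it closely.

We will not analyze the ClusterMetricsSupervisor source; it simply starts ClusterMetricsCollector and supervises it.

Nor will we go deep into ClusterMetricsCollector. It collects metrics through a MetricsCollector and spreads them with the gossip protocol; in essence it calls the sample method.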

/**
 * Metrics sampler.
 *
 * Implementations of cluster system metrics collectors extend this trait.
 */
trait MetricsCollector extends Closeable {

  /**
   * Samples and collects new data points.
   * This method is invoked periodically and should return
   * current metrics for this node.
   */
  def sample(): NodeMetrics
}

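The MetricsCollector trait is very simple: it only defines a sample method that returns NodeMetrics. When designing a metrics system, the design of the metrics themselves is often the most important part, since there are many ways to collect them, so we look at the NodeMetrics source first.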

/**
 * The snapshot of current sampled health metrics for any monitored process.
 * Collected and gossipped at regular intervals for dynamic cluster management strategies.
 *
 * Equality of NodeMetrics is based on its address.
 *
 * @param address [[akka.actor.Address]] of the node the metrics are gathered at
 * @param timestamp the time of sampling, in milliseconds since midnight, January 1, 1970 UTC
 * @param metrics the set of sampled [[akka.cluster.metrics.Metric]]
 */
@SerialVersionUID(1L)
final case class NodeMetrics(address: Address, timestamp: Long, metrics: Set[Metric] = Set.empty[Metric])

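NodeMetrics is a snapshot of the currently sampled health metrics. Each node address corresponds to one NodeMetrics, because equality is based on the address. NodeMetrics has three fields: the node address, the sampling time, and the set of metrics (Metric).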

/**
 * Metrics key/value.
 *
 * Equality of Metric is based on its name.
 *
 * @param name the metric name
 * @param value the metric value, which must be a valid numerical value,
 *   a valid value is neither negative nor NaN/Infinite.
 * @param average the data stream of the metric value, for trending over time. Metrics that are already
 *   averages (e.g. system load average) or finite (e.g. as number of processors), are not trended.
 */
@SerialVersionUID(1L)
final case class Metric private[metrics] (name: String, value: Number, average: Option[EWMA])
  extends MetricNumericConverter

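And how is a metric defined? According to the official comment it is a simple key/value pair: the key is the metric name and the value is the metric value. The value must be numeric, and a metric can optionally carry an EWMA average of its values.

Next, let's look at the concrete implementation in JmxMetricsCollector.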

/**
 * Loads JVM and system metrics through JMX monitoring beans.
 *
 * @param address The [[akka.actor.Address]] of the node being sampled
 * @param decayFactor how quickly the exponential weighting of past data is decayed
 */
class JmxMetricsCollector(address: Address, decayFactor: Double) extends MetricsCollector

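JmxMetricsCollector loads JVM and system metrics through JMX monitoring beans; what JMX is will not be explained here. The class takes two parameters. The first needs no explanation; the second matters more. decayFactor controls how quickly the exponential weighting of past data decays, that is, how fast older samples lose their influence on the EWMA. It is a smoothing factor rather than an expiry limit.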
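To make the role of the decay factor concrete, here is a minimal sketch of an exponentially weighted moving average using the textbook update rule. Ewma is a hypothetical class written for this article, not Akka's EWMA, and alpha plays the part of decayFactor.

```scala
object EwmaSketch {
  // Hypothetical, simplified moving average. alpha must lie in (0, 1];
  // a larger alpha lets old samples fade faster.
  final case class Ewma(value: Double, alpha: Double) {
    require(alpha > 0.0 && alpha <= 1.0, "alpha must be in (0, 1]")

    // Fold one new sample into the running average.
    def :+(sample: Double): Ewma = copy(value = alpha * sample + (1 - alpha) * value)
  }

  // Same starting average and same new sample, different decay factors.
  val fast: Ewma = Ewma(10.0, 0.5) :+ 20.0 // half of the weight goes to the new sample
  val slow: Ewma = Ewma(10.0, 0.1) :+ 20.0 // a tenth of the weight goes to the new sample
}
```

With alpha = 0.5 the average jumps halfway towards the new sample, while with alpha = 0.1 it barely moves, which is the trade-off between responsiveness and smoothness that the decay factor controls.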

  /**
   * Samples and collects new data points.
   * Creates a new instance each time.
   */
  def sample(): NodeMetrics = NodeMetrics(address, newTimestamp, metrics)

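This is the implementation of sample. It is very simple: it returns a NodeMetrics. The interesting part is the third argument, which is the result of calling the metrics method.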

  /**
   * Generate metrics set.
   * Creates a new instance each time.
   */
  def metrics(): Set[Metric] = {
    val heap = heapMemoryUsage
    Set(systemLoadAverage, heapUsed(heap), heapCommitted(heap), heapMax(heap), processors).flatten
  }

The metrics method in turn calls five methods: systemLoadAverage, heapUsed(heap), heapCommitted(heap), heapMax(heap), and processors. We only analyze the first one.

  /**
   * (JMX) Returns the OS-specific average load on the CPUs in the system, for the past 1 minute.
   * On some systems the JMX OS system load average may not be available, in which case a -1 is
   * returned from JMX, and None is returned from this method.
   * Creates a new instance each time.
   */
  def systemLoadAverage: Option[Metric] = Metric.create(
    name = SystemLoadAverage,
    value = osMBean.getSystemLoadAverage,
    decayFactor = None)

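systemLoadAverage returns the system's average CPU load over the past minute. On some systems the JMX load average is not available; JMX then returns -1, and because a valid metric value may not be negative, this method returns None. Note the second argument, value: it is read directly from osMBean.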
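The way an unavailable value disappears from the result can be shown with a small sketch. Sample and create below are hypothetical stand-ins for Metric and Metric.create, and the metric names are only illustrative; the point is how Option combined with Set(...).flatten drops invalid readings.

```scala
object MetricCreationSketch {
  // Hypothetical, simplified stand-in for a named metric sample.
  final case class Sample(name: String, value: Double)

  // Accept only non-negative, finite values; anything else yields None.
  def create(name: String, value: Double): Option[Sample] =
    if (value >= 0.0 && !value.isNaN && !value.isInfinity) Some(Sample(name, value)) else None

  // -1.0 mimics a platform where the load average is unavailable.
  // flatten removes the None and unwraps the Some.
  val collected: Set[Sample] =
    Set(create("system-load-average", -1.0), create("processors", 8.0)).flatten
}
```

Only the processors sample survives in collected, so subscribers never see a bogus negative load average.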

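private val memoryMBean: MemoryMXBean = ManagementFactory.getMemoryMXBean
private val osMBean: OperatingSystemMXBean = ManagementFactory.getOperatingSystemMXBean

osMBean and memoryMBean are two very important fields of this class. They are two of the interfaces that java.lang.management provides for monitoring and managing the JVM. The package contains several similar interfaces, including the nine listed below.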

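ClassLoadingMXBean	Management interface for the class loading system of the Java virtual machine.
CompilationMXBean	Management interface for the compilation system of the Java virtual machine.
GarbageCollectorMXBean	Management interface for the garbage collection of the Java virtual machine.
MemoryManagerMXBean	Management interface for a memory manager.
MemoryMXBean	Management interface for the memory system of the Java virtual machine.
MemoryPoolMXBean	Management interface for a memory pool.
OperatingSystemMXBean	Management interface for the operating system on which the Java virtual machine is running.
RuntimeMXBean	Management interface for the runtime system of the Java virtual machine.
ThreadMXBean	Management interface for the thread system of the Java virtual machine.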
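For readers who have not used these beans, the sketch below reads the same kind of values directly from java.lang.management. Snapshot and sample are hypothetical names for this article; the ManagementFactory calls are standard JDK API.

```scala
import java.lang.management.ManagementFactory

object JmxSampleSketch {
  private val memoryMBean = ManagementFactory.getMemoryMXBean
  private val osMBean = ManagementFactory.getOperatingSystemMXBean

  // Hypothetical snapshot type used only in this sketch.
  final case class Snapshot(loadAverage: Option[Double], heapUsed: Long, heapCommitted: Long, processors: Int)

  // Read the current values from the platform MXBeans.
  def sample(): Snapshot = {
    val heap = memoryMBean.getHeapMemoryUsage
    val load = osMBean.getSystemLoadAverage
    Snapshot(
      loadAverage = if (load >= 0) Some(load) else None, // a negative value means "not available"
      heapUsed = heap.getUsed,
      heapCommitted = heap.getCommitted,
      processors = osMBean.getAvailableProcessors)
  }
}
```

Calling JmxSampleSketch.sample() repeatedly yields a fresh snapshot each time, much as the collector's sample method creates a new instance per invocation.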

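That is about all there is to the source, and it matches what we expected: an actor obtains JVM information through java.lang.management, the data is gossiped to the other nodes and published on the local eventStream, and any actor that needs metrics events subscribes to them. Akka goes one step further, though. Since it states that the cluster metrics system exists primarily for load balancing, it also provides two adaptive load-balancing routers, AdaptiveLoadBalancingPool and AdaptiveLoadBalancingGroup. Both route messages to actors across the cluster based on the collected metrics in order to balance the load. We will not study their source today; interested readers can explore it on their own.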
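The idea behind metrics-based routing can be sketched without Akka: turn each node's remaining capacity into an integer weight and send proportionally more messages to nodes with more headroom. The code below is a simplified illustration under that assumption, not the routers' actual implementation; weights and the node names are hypothetical.

```scala
object CapacityWeightSketch {
  // capacity: fraction of free resources per node, from 0.0 (saturated) to 1.0 (idle).
  // The least-loaded-relative weight is obtained by dividing by the smallest capacity,
  // with a floor of 0.01 to avoid dividing by zero.
  def weights(capacity: Map[String, Double]): Map[String, Int] =
    if (capacity.isEmpty) Map.empty
    else {
      val divisor = math.max(0.01, capacity.values.min)
      capacity.map { case (node, c) => node -> math.round(c / divisor).toInt }
    }

  val example: Map[String, Int] =
    weights(Map("node-a" -> 0.25, "node-b" -> 0.5, "node-c" -> 0.75))
}
```

In this example node-c has three times the free capacity of node-a, so it gets three times the weight and would receive roughly three times as many messages.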

Cluster Metrics Extension
