Ganglia is an open-source project started at UC Berkeley, designed from the outset for monitoring distributed clusters. Its monitoring covers both the resource level (CPU, memory, disk, I/O, network load, and so on) and the service level: because users can add custom metrics very easily, it can track things like service performance, load, and error rates, for example a web service's QPS or its HTTP status error rate. In addition, if integrated with Nagios, it can trigger alerts when a metric crosses a given threshold.

Compared with Zabbix, Ganglia's advantage is that its collection agent (gmond) imposes very little overhead on the client, so it does not affect the performance of the monitored services.

Ganglia consists of a few main modules:

  • gmond: deployed on every monitored machine; periodically collects metric data and broadcasts or unicasts it.

  • gmetad: deployed on the server side; periodically pulls the data collected by gmond from the hosts listed in its configured data_source entries.
  • ganglia-web: presents the monitoring data on web pages.

This article does not cover installing Ganglia; for that, see: http://www.it165.net/admin/html/201302/770.html

This article focuses on how to develop custom metrics, so you can monitor the indicators you care about.

There are a few broad approaches:

1. Use gmetric directly

Machines with gmond installed also get /usr/bin/gmetric, a tool that broadcasts a metric's name, value, and related attributes. For example:

/usr/bin/gmetric -c /etc/ganglia/gmond.conf --name=test --type=int32 --units=sec --value=2    
For the full list of gmetric options, see: http://manpages.ubuntu.com/manpages/hardy/man1/gmetric.1.html
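The one-off command above can be wrapped for periodic publishing. Below is a minimal sketch that shells out to gmetric from Python; publish_metric is a hypothetical helper name, and the binary path and flags are taken from the example command above:

```python
import subprocess

def publish_metric(name, value, mtype="int32", units="",
                   conf="/etc/ganglia/gmond.conf", dry_run=False):
    """Build (and optionally run) a gmetric command line."""
    cmd = ["/usr/bin/gmetric", "-c", conf,
           "--name=%s" % name, "--type=%s" % mtype, "--value=%s" % value]
    if units:
        cmd.append("--units=%s" % units)
    if not dry_run:
        # broadcasts the metric using the local gmond configuration
        subprocess.check_call(cmd)
    return cmd

# equivalent of the command-line example above:
# publish_metric("test", 2, units="sec")
```

Calling this from a cron job or a daemon loop gives you the same effect as running gmetric by hand.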


In addition to invoking gmetric from the command line, you can use bindings for common languages such as Go, Ruby, Java, and Python; the relevant bindings are available on GitHub and only need to be imported:

Go: https://github.com/ganglia/ganglia_contrib/tree/master/ganglia-go
Ruby: https://github.com/igrigorik/gmetric/blob/master/lib/gmetric.rb
Java: https://github.com/ganglia/ganglia_contrib/tree/master/gmetric-java
Python: https://github.com/ganglia/ganglia_contrib/tree/master/gmetric-python

2. Use third-party tools built on gmetric

This article takes ganglia-logtailer as an example: https://github.com/ganglia/ganglia_contrib/tree/master/ganglia-logtailer

The tool builds on the logtail (Debian) / logcheck (CentOS) package to tail log files on a schedule; it then analyzes the log with the class named by --classname, computes custom metrics from the fields you care about, and broadcasts them via gmetric.

For example, to match the log format of our own nginx service, we modified NginxLogtailer.py as follows:

# -*- coding: utf-8 -*-
###
### This plugin for logtailer will crunch nginx logs and produce these metrics:
###   * hits per second
###   * GETs per second
###   * average query processing time
###   * ninetieth percentile query processing time
###   * number of HTTP 200, 300, 400, and 500 responses per second
###
### Note that this plugin depends on a certain nginx log format, documented in
### __init__.

import time
import threading
import re

# local dependencies
from ganglia_logtailer_helper import GangliaMetricObject
from ganglia_logtailer_helper import LogtailerParsingException, LogtailerStateException

class NginxLogtailer(object):
    # only used in daemon mode
    period = 30

    def __init__(self):
        '''This function should initialize any data structures or variables
        needed for the internal state of the line parser.'''
        self.reset_state()
        self.lock = threading.RLock()
        # this is what will match the nginx lines
        # log_format ganglia-logtailer
        #     '$host '
        #     '$server_addr '
        #     '$remote_addr '
        #     '- '
        #     '"$time_iso8601" '
        #     '$status '
        #     '$body_bytes_sent '
        #     '$request_time '
        #     '"$http_referer" '
        #     '"$request" '
        #     '"$http_user_agent" '
        #     '$pid';
        # NOTE: nginx 0.7 doesn't support $time_iso8601, use $time_local instead
        # original apache log format string:
        # %v %A %a %u %{%Y-%m-%dT%H:%M:%S}t %c %s %>s %B %D \"%{Referer}i\" \"%r\" \"%{User-Agent}i\" %P
        # host.com 127.0.0.1 127.0.0.1 - "2008-05-08T07:34:44" - 200 200 371 103918 - "-" "GET /path HTTP/1.0" "-" 23794
        # match keys: server_name, local_ip, remote_ip, date, status, size,
        #             req_time, referrer, request, user_agent, pid
        self.reg = re.compile('^(?P<remote_ip>[^ ]+) (?P<server_name>[^ ]+) (?P<hit>[^ ]+) \[(?P<date>[^\]]+)\] "(?P<request>[^"]+)" (?P<status>[^ ]+) (?P<size>[^ ]+) "(?P<referrer>[^"]+)" "(?P<user_agent>[^"]+)" "(?P<forward_to>[^"]+)" "(?P<req_time>[^"]+)"')
        # assume we're in daemon mode unless set_check_duration gets called
        self.dur_override = False

    # example function for parse line
    # takes one argument (text) line to be parsed
    # returns nothing
    def parse_line(self, line):
        '''This function should digest the contents of one line at a time,
        updating the internal state variables.'''
        self.lock.acquire()
        try:
            regMatch = self.reg.match(line)
            if regMatch:
                linebits = regMatch.groupdict()
                if '-' == linebits['request'] or 'file2get' in linebits['request']:
                    self.lock.release()
                    return
                self.num_hits += 1
                # capture GETs
                if 'GET' in linebits['request']:
                    self.num_gets += 1
                # capture HTTP response code
                rescode = float(linebits['status'])
                if (rescode >= 200) and (rescode < 300):
                    self.num_two += 1
                elif (rescode >= 300) and (rescode < 400):
                    self.num_three += 1
                elif (rescode >= 400) and (rescode < 500):
                    self.num_four += 1
                elif (rescode >= 500) and (rescode < 600):
                    self.num_five += 1
                # capture request duration
                dur = float(linebits['req_time'])
                self.req_time += dur
                # store for 90th % calculation
                self.ninetieth.append(dur)
            else:
                raise LogtailerParsingException, "regmatch failed to match"
        except Exception, e:
            self.lock.release()
            raise LogtailerParsingException, "regmatch or contents failed with %s" % e
        self.lock.release()

    # example function for deep copy
    # takes no arguments
    # returns one object
    def deep_copy(self):
        '''This function should return a copy of the data structure used to
        maintain state. This copy should be different from the object that is
        currently being modified so that the other thread can deal with it
        without fear of it changing out from under it. The format of this
        object is internal to the plugin.'''
        myret = dict( num_hits=self.num_hits,
                      num_gets=self.num_gets,
                      req_time=self.req_time,
                      num_two=self.num_two,
                      num_three=self.num_three,
                      num_four=self.num_four,
                      num_five=self.num_five,
                      ninetieth=self.ninetieth )
        return myret

    # example function for reset_state
    # takes no arguments
    # returns nothing
    def reset_state(self):
        '''This function resets the internal data structure to 0 (saving
        whatever state it needs). This function should be called
        immediately after deep copy with a lock in place so the internal
        data structures can't be modified in between the two calls. If the
        time between calls to get_state is necessary to calculate metrics,
        reset_state should store now() each time it's called, and get_state
        will use the time since that now() to do its calculations.'''
        self.num_hits = 0
        self.num_gets = 0
        self.req_time = 0
        self.num_two = 0
        self.num_three = 0
        self.num_four = 0
        self.num_five = 0
        self.ninetieth = list()
        self.last_reset_time = time.time()

    # example for keeping track of runtimes
    # takes no arguments
    # returns float number of seconds for this run
    def set_check_duration(self, dur):
        '''This function is only used if logtailer is in cron mode. If it is
        invoked, get_check_duration should use this value instead of calculating
        it.'''
        self.duration = dur
        self.dur_override = True

    def get_check_duration(self):
        '''This function should return the time since the last check. If called
        from cron mode, this must be set using set_check_duration(). If in
        daemon mode, it should be calculated internally.'''
        if self.dur_override:
            duration = self.duration
        else:
            cur_time = time.time()
            duration = cur_time - self.last_reset_time
            # the duration should be within 10% of period
            acceptable_duration_min = self.period - (self.period / 10.0)
            acceptable_duration_max = self.period + (self.period / 10.0)
            if (duration < acceptable_duration_min or duration > acceptable_duration_max):
                raise LogtailerStateException, "time calculation problem - duration (%s) > 10%% away from period (%s)" % (duration, self.period)
        return duration

    # example function for get_state
    # takes no arguments
    # returns a dictionary of (metric => metric_object) pairs
    def get_state(self):
        '''This function should acquire a lock, call deep copy, get the
        current time if necessary, call reset_state, then do its
        calculations. It should return a list of metric objects.'''
        # get the data to work with
        self.lock.acquire()
        try:
            mydata = self.deep_copy()
            check_time = self.get_check_duration()
            self.reset_state()
            self.lock.release()
        except LogtailerStateException, e:
            # if something went wrong with deep_copy or the duration, reset and continue
            self.reset_state()
            self.lock.release()
            raise e
        # crunch data to how you want to report it
        hits_per_second = mydata['num_hits'] / check_time
        gets_per_second = mydata['num_gets'] / check_time
        if mydata['num_hits'] != 0:
            avg_req_time = mydata['req_time'] / mydata['num_hits']
        else:
            avg_req_time = 0
        two_per_second = mydata['num_two'] / check_time
        three_per_second = mydata['num_three'] / check_time
        four_per_second = mydata['num_four'] / check_time
        five_per_second = mydata['num_five'] / check_time
        # calculate 90th % request time
        ninetieth_list = mydata['ninetieth']
        ninetieth_list.sort()
        num_entries = len(ninetieth_list)
        if num_entries != 0:
            ninetieth_element = ninetieth_list[int(num_entries * 0.9)]
        else:
            ninetieth_element = 0
        # package up the data you want to submit
        hps_metric = GangliaMetricObject('nginx_hits', hits_per_second, units='hps')
        gps_metric = GangliaMetricObject('nginx_gets', gets_per_second, units='hps')
        avgdur_metric = GangliaMetricObject('nginx_avg_dur', avg_req_time, units='sec')
        ninetieth_metric = GangliaMetricObject('nginx_90th_dur', ninetieth_element, units='sec')
        twops_metric = GangliaMetricObject('nginx_200', two_per_second, units='hps')
        threeps_metric = GangliaMetricObject('nginx_300', three_per_second, units='hps')
        fourps_metric = GangliaMetricObject('nginx_400', four_per_second, units='hps')
        fiveps_metric = GangliaMetricObject('nginx_500', five_per_second, units='hps')
        # return a list of metric objects
        return [ hps_metric, gps_metric, avgdur_metric, ninetieth_metric,
                 twops_metric, threeps_metric, fourps_metric, fiveps_metric ]
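The rate and percentile math in get_state above reduces to a few lines; here is a standalone sketch for clarity (function names are illustrative, not part of the plugin API):

```python
def per_second(count, check_time):
    """Convert a count over the check interval into a rate, as get_state does."""
    return count / float(check_time)

def ninetieth_percentile(durations):
    """90th-percentile request time: sort, then index at 90% of the length."""
    if not durations:
        return 0
    ordered = sorted(durations)
    return ordered[int(len(ordered) * 0.9)]
```

Note that this percentile is the simple nearest-rank style used by the plugin; with few samples it is coarse, which is fine for trend monitoring.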

After deploying ganglia-logtailer on the monitored machine, create a cron job with an entry like the following:

*/1 * * * * root   /usr/local/bin/ganglia-logtailer --classname NginxLogtailer --log_file /usr/local/nginx-video/logs/access.log  --mode cron --gmetric_options '-C test_cluster -g nginx_status'

Reload the crond service; after about a minute, the corresponding metrics appear on the Ganglia web UI.


For how to deploy ganglia-logtailer, see: https://github.com/ganglia/ganglia_contrib/tree/master/ganglia-logtailer

3. Write your own module in a supported language (Python in this article)

Ganglia lets users write their own Python modules. The brief introduction on GitHub reads:

Writing a Python module is very simple. You just need to write it following a template and put the resulting Python module (.py) in /usr/lib(64)/ganglia/python_modules.

A corresponding Python configuration (.pyconf) file needs to reside in /etc/ganglia/conf.d/.

For example, here is a sample Python module that checks the machine's temperature:

acpi_file = "/proc/acpi/thermal_zone/THRM/temperature"

def temp_handler(name):
    try:
        f = open(acpi_file, 'r')
    except IOError:
        return 0
    for l in f:
        line = l.split()
    return int(line[1])

def metric_init(params):
    global descriptors, acpi_file
    if 'acpi_file' in params:
        acpi_file = params['acpi_file']
    d1 = {'name': 'temp',
          'call_back': temp_handler,
          'time_max': 90,
          'value_type': 'uint',
          'units': 'C',
          'slope': 'both',
          'format': '%u',
          'description': 'Temperature of host',
          'groups': 'health'}
    descriptors = [d1]
    return descriptors

def metric_cleanup():
    '''Clean up the metric module.'''
    pass

# This code is for debugging and unit testing
if __name__ == '__main__':
    metric_init({})
    for d in descriptors:
        v = d['call_back'](d['name'])
        print 'value for %s is %u' % (d['name'], v)

Besides the module file itself, you also need a corresponding configuration file (placed at /etc/ganglia/conf.d/temp.pyconf) in the following format:

modules {
  module {
    name = "temp"
    language = "python"
    # The following params are examples only
    # They are not actually used by the temp module
    param RandomMax {
      value = 600
    }
    param ConstantValue {
      value = 112
    }
  }
}

collection_group {
  collect_every = 10
  time_threshold = 50
  metric {
    name = "temp"
    title = "Temperature"
    value_threshold = 70
  }
}

With these two files in place, the module is successfully added.
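gmond drives such a module through the metric_init / call_back / metric_cleanup contract shown above. The self-contained simulation below illustrates that lifecycle; collect_once is a hypothetical stand-in for gmond's internal collection loop, and the descriptor values are dummies:

```python
def metric_init(params):
    # gmond calls this once at startup, passing the params from the
    # .pyconf file; it must return a list of metric descriptors.
    d = {'name': 'demo',
         'call_back': lambda name: 42,   # stand-in handler returning a fixed value
         'time_max': 90, 'value_type': 'uint', 'units': '',
         'slope': 'both', 'format': '%u',
         'description': 'demo metric', 'groups': 'example'}
    return [d]

def metric_cleanup():
    # gmond calls this at shutdown
    pass

def collect_once(descriptors):
    # one collection pass, as gmond would do every collect_every seconds:
    # invoke each descriptor's call_back with the metric name
    return dict((d['name'], d['call_back'](d['name'])) for d in descriptors)
```

This mirrors what the `if __name__ == '__main__'` block in temp.py does for debugging.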

For many more user-contributed modules, see https://github.com/ganglia/gmond_python_modules

It includes modules for the metrics of common services such as Elasticsearch, filecheck, nginx_status, and MySQL. They are very practical, and with minor modifications can usually be adapted to your own needs.


Other useful user-contributed tools can be found in the ganglia_contrib repository linked above.

If you have questions, feel free to leave a comment.
