Thanos: a Prometheus clustering and multi-tenancy solution, tried out with docker-compose (Part 1)

Prometheus is a very good metrics monitoring solution, but it does not handle HA or multi-tenancy well. Several solutions currently address this:

Cortex
Thanos
Prometheus + InfluxDB
Timbala
M3DB
Below we use a docker-compose project from GitHub to learn about the Thanos clustering approach.

Thanos reference architecture diagram

Brief explanation

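Thanos consists of the sidecar, store API, query, and compact components. A sidecar is attached to each Prometheus instance, and each Prometheus is configured with different labels.
For a simple component communication diagram and a detailed explanation, see https://improbable.io/games/blog/thanos-prometheus-at-scale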
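To make the "different labels" point concrete, here is a minimal sketch of what prometheus1.yaml could look like. The label names and values (cluster, replica) and the scrape jobs are assumptions for illustration, not the contents of the original project's file. The second instance would use the same file with a different replica value.

```yaml
# prometheus1.yaml (illustrative sketch, not the original project's file)
global:
  scrape_interval: 15s
  # External labels identify this Prometheus instance to Thanos.
  # They must differ between prometheus1 and prometheus2.
  external_labels:
    cluster: demo      # hypothetical label
    replica: "1"       # assumed to pair with --query.replica-label=replica; use "2" in prometheus2.yaml

scrape_configs:
# Prometheus scraping itself
- job_name: prometheus
  static_configs:
  - targets: ['localhost:9090']
# The test exporter from the compose file; this only scrapes the
# exporter's own metrics endpoint (assumed to be /metrics).
- job_name: domain_exporter
  static_configs:
  - targets: ['domain_exporter:9222']
```

If the label used here matches the querier's --query.replica-label flag, the querier can treat the two instances as replicas of each other and merge their series when deduplication is enabled.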

Running with docker-compose

docker-compose file

version: '3.7'
services:
  # one prometheus and its sidecar
  prometheus1:
    image: prom/prometheus:v2.9.2
    command:
    - --web.enable-lifecycle
    - --config.file=/etc/prometheus/prometheus.yml
    - --storage.tsdb.path=/prometheus
    - --web.console.libraries=/usr/share/prometheus/console_libraries
    - --web.console.templates=/usr/share/prometheus/consoles
    - --storage.tsdb.min-block-duration=1m # small just to not wait hours to test :)
    - --storage.tsdb.max-block-duration=1m # small just to not wait hours to test :)
    volumes:
    - ./prometheus1.yaml:/etc/prometheus/prometheus.yml
    - ./data1:/prometheus
    ports:
    - "9090:9090"
    depends_on:
    - minio
  sidecar1:
    image: improbable/thanos:v0.4.0
    volumes:
    - ./data1:/var/prometheus
    - ./bucket_config.yaml:/bucket_config.yaml
    command:
    - sidecar
    - --tsdb.path=/var/prometheus
    - --prometheus.url=http://prometheus1:9090
    - --objstore.config-file=/bucket_config.yaml
    - --http-address=0.0.0.0:19191
    - --grpc-address=0.0.0.0:19090
    depends_on:
    - minio
  # another prometheus and its sidecar
  prometheus2:
    image: prom/prometheus:v2.9.2
    command:
    - --web.enable-lifecycle
    - --config.file=/etc/prometheus/prometheus.yml
    - --storage.tsdb.path=/prometheus
    - --web.console.libraries=/usr/share/prometheus/console_libraries
    - --web.console.templates=/usr/share/prometheus/consoles
    - --storage.tsdb.min-block-duration=1m # small just to not wait hours to test :)
    - --storage.tsdb.max-block-duration=1m # small just to not wait hours to test :)
    volumes:
    - ./prometheus2.yaml:/etc/prometheus/prometheus.yml
    - ./data2:/prometheus
    ports:
    - "9091:9090"
    depends_on:
    - minio
  sidecar2:
    image: improbable/thanos:v0.4.0
    volumes:
    - ./data2:/var/prometheus
    - ./bucket_config.yaml:/bucket_config.yaml
    command:
    - sidecar
    - --tsdb.path=/var/prometheus
    - --prometheus.url=http://prometheus2:9090
    - --objstore.config-file=/bucket_config.yaml
    - --http-address=0.0.0.0:19191
    - --grpc-address=0.0.0.0:19090
    depends_on:
    - minio
  grafana:
    image: grafana/grafana
    ports:
    - "3000:3000"
  # to search on old metrics
  storer:
    image: improbable/thanos:v0.4.0
    volumes:
    - ./bucket_config.yaml:/bucket_config.yaml
    command:
    - store
    - --data-dir=/var/thanos/store
    - --objstore.config-file=bucket_config.yaml
    - --http-address=0.0.0.0:19191
    - --grpc-address=0.0.0.0:19090
    depends_on:
    - minio
  # downsample metrics on the bucket
  compactor:
    image: improbable/thanos:v0.4.0
    volumes:
    - ./bucket_config.yaml:/bucket_config.yaml
    command:
    - compact
    - --data-dir=/var/thanos/compact
    - --objstore.config-file=bucket_config.yaml
    - --http-address=0.0.0.0:19191
    - --wait
    depends_on:
    - minio
  # querier component which can be scaled
  querier:
    image: improbable/thanos:v0.4.0
    labels:
    - "traefik.enable=true"
    - "traefik.port=19192"
    - "traefik.frontend.rule=PathPrefix:/"
    command:
    - query
    - --http-address=0.0.0.0:19192
    - --store=sidecar1:19090
    - --store=sidecar2:19090
    - --store=storer:19090
    - --query.replica-label=replica
  # s3 compatible storage
  minio:
    image: minio/minio
    ports:
    - 9000:9000
    environment:
    - MINIO_ACCESS_KEY=minio
    - MINIO_SECRET_KEY=miniostorage
    volumes:
    - ./minio_data:/data
    command: server /data
  # a simple exporter to test some metrics
  domain_exporter:
    image: caarlos0/domain_exporter:v1
    ports:
    - "9222:9222"
  # a load balancer to reverse-proxy to all queriers
  traefik:
    image: traefik
    restart: always
    ports:
      - 80:80
      - 8080:8080
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - ./traefik.toml:/traefik.toml
networks:
  default:
    driver: bridge

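Brief explanation
traefik provides a single access entry point, and each Prometheus has an associated Thanos sidecar. The stack also starts a simple domain_exporter to make testing easier.
Each Prometheus uses static configuration to add the services it monitors. Object storage is provided by the open source minio. Note that for store to start successfully, you need to go into minio and create the corresponding bucket.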
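The compose file mounts a bucket_config.yaml into the sidecar, store, and compact containers but the file itself is not shown. A minimal sketch for the minio service above might look like the following. The bucket name thanos is an assumption; whatever name is used here is the bucket that has to exist in minio.

```yaml
# bucket_config.yaml (illustrative sketch)
type: S3
config:
  bucket: thanos             # assumed bucket name; must be created in minio
  endpoint: minio:9000       # the minio service from the compose file
  access_key: minio          # matches MINIO_ACCESS_KEY
  secret_key: miniostorage   # matches MINIO_SECRET_KEY
  insecure: true             # plain HTTP, since minio runs without TLS here
```

The credentials are the ones set in the minio service's environment, and insecure: true tells the Thanos S3 client not to use HTTPS against the local endpoint.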

Start and test

Start

docker-compose up -d

Create the minio bucket
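The bucket can be created by hand in the minio web UI on port 9000. As an alternative, the step could be scripted with a one-shot service using the minio/mc client image. This is a sketch under assumptions: the service name createbucket, the alias local, and the bucket name thanos are hypothetical, and the file is assumed to be saved as docker-compose.override.yml next to the main file so that docker-compose merges it automatically.

```yaml
# docker-compose.override.yml (illustrative sketch)
version: '3.7'
services:
  # one-shot helper that creates the bucket and then exits
  createbucket:
    image: minio/mc
    depends_on:
    - minio
    # depends_on only orders startup, so retry until minio accepts connections
    entrypoint: >
      /bin/sh -c "
      until mc alias set local http://minio:9000 minio miniostorage; do sleep 1; done;
      mc mb --ignore-existing local/thanos;
      "
```

The bucket name passed to mc mb has to match the bucket configured in bucket_config.yaml.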

To make sure store starts successfully, run docker-compose up -d storer once more.

Result

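The unified entry point (provided by Thanos) looks much like the native Prometheus UI, but there are differences. For example, it offers deduplication, and the Status menu no longer contains the service discovery, rules, and targets pages
(Thanos handles these centrally, so they are not needed).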

The original Prometheus UIs (available on ports 9090 and 9091)

Grafana UI

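Note:
For the Grafana Prometheus data source, we no longer use the individual Prometheus instances and instead point it directly at the endpoint Thanos provides. The configuration is as follows: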
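The same data source setting can also be written as a Grafana provisioning file instead of being entered in the UI. The sketch below assumes the file is mounted into the grafana container under /etc/grafana/provisioning/datasources/ (the compose file above does not do this), and the data source name Thanos is arbitrary.

```yaml
# datasource.yaml (illustrative sketch)
apiVersion: 1
datasources:
- name: Thanos               # arbitrary display name
  type: prometheus           # the Thanos querier speaks the Prometheus HTTP API
  access: proxy              # Grafana's backend makes the requests
  url: http://querier:19192  # querier service and --http-address port from the compose file
  isDefault: true
```

Because access is set to proxy, the URL is resolved from inside the Docker network, which is why the service name querier works here.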

Metrics stored in the minio object store (convenient for long-term metrics storage and querying)

Notes

The above is a simple docker-compose deployment that gives a basic feel for how Thanos is used. For the details of each component, refer to the official documentation.

References

https://github.com/mattbostock/timbala
https://github.com/cortexproject/cortex
https://github.com/thanos-io/thanos
https://github.com/influxdata/influxdb
https://github.com/m3db/m3
https://github.com/rongfengliang/thanos-playground