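1. Basic architecture of Prometheus
Prometheus is a complete open-source monitoring solution that covers the whole monitoring workflow: data collection, querying, alerting, and visualization. The figure below shows the Prometheus architecture:
Official documentation: https://prometheus.io/docs/introduction/overview/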
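2. Components
The Prometheus ecosystem consists of multiple components, many of which are optional.
Prometheus server
Required. At its core it is a time-series database; it pulls, stores, and analyzes metrics and provides the PromQL query language.
Push Gateway
Optional. An intermediate gateway to which short-lived jobs can push their metrics.
exporters
Agents deployed on the client side, such as node_exporter and mysql_exporter.
An exporter is an HTTP endpoint that exposes information about a monitored component. Most components commonly used at internet companies already have a ready-made exporter, for example Varnish, HAProxy, Nginx, MySQL, and Linux system information (disk, memory, CPU, network, and so on). See: https://prometheus.io/docs/instrumenting/exporters/
alertmanager
Handles alerting. After analyzing the data, the Prometheus server sends the triggered alerts to alertmanager, which applies its own rules to send notifications (email or webhook).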
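3. Prometheus-Operator
Prometheus-Operator architecture diagram:
The figure above is the architecture diagram provided by the Prometheus-Operator project. The Operator is the core part: acting as a controller, it creates four CRD resource objects (Prometheus, ServiceMonitor, AlertManager, and PrometheusRule) and then continuously watches and maintains their state.
The Prometheus resource object represents a Prometheus Server, while a ServiceMonitor is an abstraction over exporters. As covered earlier, an exporter is a tool dedicated to exposing a metrics endpoint, and Prometheus pulls data from the metrics endpoints described by ServiceMonitors. Likewise, the Alertmanager resource object is the abstraction of AlertManager, and a PrometheusRule holds the alerting rules used by Prometheus instances.
With this in place, deciding what to monitor in the cluster becomes a matter of manipulating Kubernetes resource objects directly, which is much more convenient. The Service and ServiceMonitor in the figure are both Kubernetes resources: a ServiceMonitor can match a class of Services through a labelSelector, and Prometheus can in turn match multiple ServiceMonitors through a labelSelector.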
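To make the label-selector matching concrete, here is a minimal Python sketch of how an equality-based `matchLabels` selector picks objects. It is an illustration only, not the Operator's implementation; the Service names and labels are hypothetical.

```python
def matches(selector: dict, labels: dict) -> bool:
    """Return True when every key/value pair of the selector is present in labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def select(selector: dict, objects: list) -> list:
    """Return the names of the objects whose labels satisfy the selector."""
    return [obj["name"] for obj in objects if matches(selector, obj["labels"])]


# Hypothetical Services, used only for illustration.
services = [
    {"name": "orders", "labels": {"app": "shop", "tier": "backend"}},
    {"name": "payments", "labels": {"app": "shop", "tier": "backend"}},
    {"name": "frontend", "labels": {"app": "shop", "tier": "web"}},
]

selected = select({"app": "shop", "tier": "backend"}, services)
print(selected)
```

The same subset check applies one level up, where a Prometheus object selects ServiceMonitors by their labels.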
4. Deploying Prometheus-Operator
Official chart URL:
Search for the latest package and download it locally
# Search
helm search prometheus-operator
NAME CHART VERSION APP VERSION DESCRIPTION
stable/prometheus-operator 6.4.0 0.31.0 Provides easy monitoring definitions for Kubernetes servi...
# Fetch the chart and unpack it into ./prometheus-operator
helm fetch stable/prometheus-operator --untar
Install
# Create a namespace named monitoring
kubectl create ns monitoring
# Install
helm install -f ./prometheus-operator/values.yaml --name prometheus-operator --namespace=monitoring ./prometheus-operator
# Upgrade
helm upgrade -f prometheus-operator/values.yaml prometheus-operator ./prometheus-operator
Uninstall prometheus-operator
helm delete prometheus-operator --purge
# Delete the CRDs
kubectl delete customresourcedefinitions prometheuses.monitoring.coreos.com prometheusrules.monitoring.coreos.com servicemonitors.monitoring.coreos.com
kubectl delete customresourcedefinitions alertmanagers.monitoring.coreos.com
kubectl delete customresourcedefinitions podmonitors.monitoring.coreos.com
Edit the configuration file values.yaml
4.1. Email alerts
config:
  global:
    resolve_timeout: 5m
    smtp_smarthost: 'smtp.qq.com:465'
    smtp_from: '1xxx@qq.com'
    smtp_auth_username: '1xxx@qq.com'
    smtp_auth_password: 'xreqcqffrxtnieff'
    smtp_hello: '163.com'
    smtp_require_tls: false
  route:
    group_by: ['job','severity']
    group_wait: 30s
    group_interval: 1m
    repeat_interval: 12h
    receiver: default
    routes:
    - receiver: webhook
      match:
        alertname: TargetDown
  receivers:
  - name: default
    email_configs:
    - to: 'hejianlai@pcidata.cn'
      send_resolved: true
  - name: webhook
    email_configs:
    - to: 'xxx@xxx.cn'
      send_resolved: true
There is a pitfall here; see: https://www.cnblogs.com/Dev0ps/p/11320177.html
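The routing tree in this configuration sends TargetDown alerts to the webhook receiver and everything else to default. The Python sketch below models that decision in a simplified way: it handles only exact `match` rules, lets the first matching child route win, and assumes every route names its own receiver. It is not Alertmanager's actual code, and the HighCPU alert name is hypothetical.

```python
def pick_receiver(route: dict, alert_labels: dict) -> str:
    """Return the receiver for an alert by walking the routing tree.

    Simplified: only exact `match` rules are modeled, the first matching
    child wins, and every route is assumed to name its own receiver.
    """
    for child in route.get("routes", []):
        wanted = child.get("match", {})
        if all(alert_labels.get(key) == value for key, value in wanted.items()):
            # A matching child may itself have more specific children.
            return pick_receiver(child, alert_labels)
    # No child matched, so the alert stays with this route's receiver.
    return route["receiver"]


# The routing tree from the configuration above.
route = {
    "receiver": "default",
    "routes": [
        {"receiver": "webhook", "match": {"alertname": "TargetDown"}},
    ],
}

# HighCPU is a hypothetical alert name used only for illustration.
for labels in (
    {"alertname": "TargetDown", "job": "node"},
    {"alertname": "HighCPU", "job": "node"},
):
    print(labels["alertname"], "->", pick_receiver(route, labels))
```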
4.2. Persistent storage for Prometheus
storage:
  volumeClaimTemplate:
    spec:
      storageClassName: nfs-client
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 50Gi
4.3. Grafana persistence
Path: prometheus-operator/charts/grafana/values.yaml
persistence:
  enabled: true
  storageClassName: "nfs-client"
  accessModes:
  - ReadWriteOnce
  size: 10Gi
4.4. Automatic Service discovery
- job_name: 'kubernetes-service-endpoints'
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
    action: replace
    target_label: __scheme__
    regex: (https?)
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    action: replace
    target_label: __address__
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
  - action: labelmap
    regex: __meta_kubernetes_service_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_service_name]
    action: replace
    target_label: kubernetes_name
- job_name: 'kubernetes-pod'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: kubernetes_pod_name
- job_name: istio-mesh
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - api_server: null
    role: endpoints
    namespaces:
      names:
      - istio-system
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
    separator: ;
    regex: istio-telemetry;prometheus
    replacement: $1
    action: keep
- job_name: envoy-stats
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /stats/prometheus
  scheme: http
  kubernetes_sd_configs:
  - api_server: null
    role: pod
    namespaces:
      names: []
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_container_port_name]
    separator: ;
    regex: .*-envoy-prom
    replacement: $1
    action: keep
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    separator: ;
    regex: ([^:]+)(?::\d+)?;(\d+)
    target_label: __address__
    replacement: $1:15090
    action: replace
  - separator: ;
    regex: __meta_kubernetes_pod_label_(.+)
    replacement: $1
    action: labelmap
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_name]
    separator: ;
    regex: (.*)
    target_label: pod_name
    replacement: $1
    action: replace
  metric_relabel_configs:
  - source_labels: [cluster_name]
    separator: ;
    regex: (outbound|inbound|prometheus_stats).*
    replacement: $1
    action: drop
  - source_labels: [tcp_prefix]
    separator: ;
    regex: (outbound|inbound|prometheus_stats).*
    replacement: $1
    action: drop
  - source_labels: [listener_address]
    separator: ;
    regex: (.+)
    replacement: $1
    action: drop
  - source_labels: [http_conn_manager_listener_prefix]
    separator: ;
    regex: (.+)
    replacement: $1
    action: drop
  - source_labels: [http_conn_manager_prefix]
    separator: ;
    regex: (.+)
    replacement: $1
    action: drop
  - source_labels: [__name__]
    separator: ;
    regex: envoy_tls.*
    replacement: $1
    action: drop
  - source_labels: [__name__]
    separator: ;
    regex: envoy_tcp_downstream.*
    replacement: $1
    action: drop
  - source_labels: [__name__]
    separator: ;
    regex: envoy_http_(stats|admin).*
    replacement: $1
    action: drop
  - source_labels: [__name__]
    separator: ;
    regex: envoy_cluster_(lb|retry|bind|internal|max|original).*
    replacement: $1
    action: drop
- job_name: istio-policy
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - api_server: null
    role: endpoints
    namespaces:
      names:
      - istio-system
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
    separator: ;
    regex: istio-policy;http-monitoring
    replacement: $1
    action: keep
- job_name: istio-telemetry
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - api_server: null
    role: endpoints
    namespaces:
      names:
      - istio-system
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
    separator: ;
    regex: istio-telemetry;http-monitoring
    replacement: $1
    action: keep
- job_name: pilot
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - api_server: null
    role: endpoints
    namespaces:
      names:
      - istio-system
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
    separator: ;
    regex: istio-pilot;http-monitoring
    replacement: $1
    action: keep
- job_name: galley
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - api_server: null
    role: endpoints
    namespaces:
      names:
      - istio-system
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
    separator: ;
    regex: istio-galley;http-monitoring
    replacement: $1
    action: keep
- job_name: citadel
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - api_server: null
    role: endpoints
    namespaces:
      names:
      - istio-system
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
    separator: ;
    regex: istio-citadel;http-monitoring
    replacement: $1
    action: keep
- job_name: kubernetes-pods-istio-secure
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  kubernetes_sd_configs:
  - api_server: null
    role: pod
    namespaces:
      names: []
  tls_config:
    ca_file: /etc/istio-certs/root-cert.pem
    cert_file: /etc/istio-certs/cert-chain.pem
    key_file: /etc/istio-certs/key.pem
    insecure_skip_verify: true
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    separator: ;
    regex: "true"
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_pod_annotation_sidecar_istio_io_status, __meta_kubernetes_pod_annotation_istio_mtls]
    separator: ;
    regex: (([^;]+);([^;]*))|(([^;]*);(true))
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
    separator: ;
    regex: (http)
    replacement: $1
    action: drop
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    separator: ;
    regex: (.+)
    target_label: __metrics_path__
    replacement: $1
    action: replace
  - source_labels: [__address__]
    separator: ;
    regex: ([^:]+):(\d+)
    replacement: $1
    action: keep
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    separator: ;
    regex: ([^:]+)(?::\d+)?;(\d+)
    target_label: __address__
    replacement: $1:$2
    action: replace
  - separator: ;
    regex: __meta_kubernetes_pod_label_(.+)
    replacement: $1
    action: labelmap
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_name]
    separator: ;
    regex: (.*)
    target_label: pod_name
    replacement: $1
    action: replace
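Several of the jobs above rewrite `__address__` with the regex `([^:]+)(?::\d+)?;(\d+)` and keep only targets whose scrape annotation is `true`. The Python sketch below imitates those two relabel steps so the effect is easier to follow. It assumes, as the Prometheus documentation describes, that source labels are joined with `;` and that the regex must match the whole joined string. The function names and sample addresses are hypothetical.

```python
import re

# Same pattern as in the relabel rules; fullmatch() mimics Prometheus anchoring.
ADDRESS_RE = re.compile(r"([^:]+)(?::\d+)?;(\d+)")


def rewrite_address(address: str, port_annotation: str) -> str:
    """Imitate the `replace` rule with replacement `$1:$2`.

    The two source labels are joined with ';'. When the regex does not match
    (for example, the port annotation is missing), the address is unchanged.
    """
    joined = ";".join([address, port_annotation])
    found = ADDRESS_RE.fullmatch(joined)
    if found is None:
        return address
    return f"{found.group(1)}:{found.group(2)}"


def keep_target(scrape_annotation: str) -> bool:
    """Imitate `action: keep` with `regex: true` on the scrape annotation."""
    return re.fullmatch(r"true", scrape_annotation) is not None


# Hypothetical discovered targets: (address, port annotation, scrape annotation).
targets = [
    ("10.0.0.5:8080", "9102", "true"),
    ("10.0.0.6", "9102", "true"),
    ("10.0.0.7:8080", "", "true"),
    ("10.0.0.8:8080", "9102", ""),
]

kept = [rewrite_address(addr, port) for addr, port, scrape in targets if keep_target(scrape)]
print(kept)
```

A target without the port annotation keeps its discovered address, and a target without the scrape annotation is dropped before any rewriting happens.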
4.5. etcd
An etcd cluster usually has HTTPS certificate authentication enabled for security, so for Prometheus to access the etcd metrics it must present the corresponding certificates.
Because the demo environment here is a cluster built with kubeadm, we can use kubectl to find the certificate paths that etcd was started with:
[root@cn-hongkong ~]# kubectl get pod etcd-cn-hongkong.i-j6caps6av1mtyxyofmrw -n kube-system -o yaml
The certificates etcd uses all live under /etc/kubernetes/pki/etcd on the node, so we first store the certificates we need in the cluster as a Secret object (run this on the node where etcd runs):
1) Fetch the etcd metrics manually
curl --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/healthcheck-client.crt --key /etc/kubernetes/pki/etcd/healthcheck-client.key https://172.31.182.152:2379/metrics
2) Scrape with Prometheus (create the Secret that holds the certificates)
kubectl -n monitoring create secret generic etcd-certs --from-file=/etc/kubernetes/pki/etcd/healthcheck-client.crt --from-file=/etc/kubernetes/pki/etcd/healthcheck-client.key --from-file=/etc/kubernetes/pki/etcd/ca.crt
3) Add the kubeEtcd configuration to values.yaml
## Component scraping etcd
##
kubeEtcd:
  enabled: true
  ## If your etcd is not deployed as a pod, specify IPs it can be found on
  ##
  endpoints: []
  ## Etcd service. If using kubeEtcd.endpoints only the port and targetPort are used
  ##
  service:
    port: 2379
    targetPort: 2379
    selector:
      component: etcd
  ## Configure secure access to the etcd cluster by loading a secret into prometheus and
  ## specifying security configuration below. For example, with a secret named etcd-client-cert
  ##
  serviceMonitor:
    scheme: https
    insecureSkipVerify: true
    serverName: localhost
    caFile: /etc/prometheus/secrets/etcd-certs/ca.crt
    certFile: /etc/prometheus/secrets/etcd-certs/healthcheck-client.crt
    keyFile: /etc/prometheus/secrets/etcd-certs/healthcheck-client.key
4) Mount the etcd-certs Secret created above into Prometheus (very important)
## Secrets is a list of Secrets in the same namespace as the Prometheus object, which shall be mounted into the Prometheus Pods.
## The Secrets are mounted into /etc/prometheus/secrets/. Secrets changes after initial creation of a Prometheus object are not
## reflected in the running Pods. To change the secrets mounted into the Prometheus Pods, the object must be deleted and recreated
## with the new list of secrets.
##
secrets:
- etcd-certs
After installation, the certificates appear inside the Prometheus Pod under /etc/prometheus/secrets/etcd-certs/.
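The caFile, certFile, and keyFile values in the kubeEtcd block follow from two conventions: `kubectl create secret generic --from-file=<path>` uses the file's base name as the Secret key, and each Secret listed under secrets is mounted at /etc/prometheus/secrets/<secret-name>/. The Python sketch below simply derives those paths; the helper name `mounted_path` is hypothetical.

```python
import posixpath

# Mount root stated in the chart comment above.
SECRETS_ROOT = "/etc/prometheus/secrets"


def mounted_path(secret_name: str, source_file: str) -> str:
    """Return where a file added with --from-file appears inside the Prometheus Pod."""
    key = posixpath.basename(source_file)  # --from-file defaults the key to the base name
    return posixpath.join(SECRETS_ROOT, secret_name, key)


sources = [
    "/etc/kubernetes/pki/etcd/ca.crt",
    "/etc/kubernetes/pki/etcd/healthcheck-client.crt",
    "/etc/kubernetes/pki/etcd/healthcheck-client.key",
]

paths = [mounted_path("etcd-certs", src) for src in sources]
for path in paths:
    print(path)
```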
4.6. Scraping a custom Service
We need to create a ServiceMonitor. Setting any: true under namespaceSelector matches Services in all namespaces that carry the label app=sscp-transaction.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app: sscp-transaction
    release: prometheus-operator
  name: springboot
  namespace: monitoring
spec:
  endpoints:
  - interval: 15s
    path: /actuator/prometheus
    port: health
    scheme: http
  namespaceSelector:
    any: true
    # matchNames:
    # - sscp-dev
  selector:
    matchLabels:
      app: sscp-transaction
      # release: sscp
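The interaction between namespaceSelector and selector can be sketched in Python as follows. This is a simplified model, not the Operator's code. It assumes that an empty namespaceSelector restricts discovery to the ServiceMonitor's own namespace, which is how I understand the Prometheus Operator documentation. The lower-case `app` label key is assumed, and the sscp-prod namespace and the sample Services are hypothetical.

```python
def namespace_selected(namespace_selector: dict, monitor_ns: str, service_ns: str) -> bool:
    """Decide whether a Service's namespace is in scope for a ServiceMonitor."""
    if namespace_selector.get("any"):
        return True  # any: true searches every namespace
    names = namespace_selector.get("matchNames")
    if names:
        return service_ns in names
    # Assumption: with an empty selector only the monitor's own namespace is searched.
    return service_ns == monitor_ns


def service_selected(monitor: dict, service: dict) -> bool:
    """Combine the namespace check with equality-based matchLabels."""
    spec = monitor["spec"]
    wanted = spec["selector"].get("matchLabels", {})
    labels_ok = all(service["labels"].get(k) == v for k, v in wanted.items())
    ns_ok = namespace_selected(
        spec.get("namespaceSelector", {}), monitor["namespace"], service["namespace"]
    )
    return labels_ok and ns_ok


monitor = {
    "namespace": "monitoring",
    "spec": {
        "namespaceSelector": {"any": True},
        "selector": {"matchLabels": {"app": "sscp-transaction"}},
    },
}

# Hypothetical Services used only for illustration.
svc_dev = {"namespace": "sscp-dev", "labels": {"app": "sscp-transaction"}}
svc_other = {"namespace": "sscp-dev", "labels": {"app": "something-else"}}

print(service_selected(monitor, svc_dev), service_selected(monitor, svc_other))
```

Switching to the commented-out matchNames form narrows the search to the listed namespaces while the label check stays the same.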
Result:
Source: https://www.cnblogs.com/Dev0ps/p/11465819.html