How to estimate Prometheus local storage and memory consumption

1. Local storage capacity

Required disk size (GB) = retention time (seconds) * samples ingested per second * bytes per sample / 1024 / 1024 / 1024

where

Samples ingested per second: rate(prometheus_tsdb_head_samples_appended_total[1d])
Average size of a sample in bytes (averaged over the last day, matching the [1d] range): rate(prometheus_tsdb_compaction_chunk_size_bytes_sum[1d])/rate(prometheus_tsdb_compaction_chunk_samples_sum[1d])
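
As a rough sketch, the formula above can be written as a small Python function. The function name and the input values in the usage lines (100,000 samples per second, 1.5 bytes per sample, 15 days of retention) are hypothetical and only illustrate the arithmetic; substitute the results of the two queries above.

```python
def required_disk_gb(retention_seconds: float,
                     samples_per_second: float,
                     bytes_per_sample: float) -> float:
    """Disk size in GB = retention * ingestion rate * bytes per sample / 1024^3."""
    return retention_seconds * samples_per_second * bytes_per_sample / 1024 / 1024 / 1024


if __name__ == "__main__":
    # Hypothetical inputs, not measurements from this document.
    retention = 15 * 86400  # 15 days expressed in seconds
    print(round(required_disk_gb(retention, 100_000, 1.5), 2))
```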

The disk consumption for one day (86400 seconds) can be queried directly in Prometheus:

86400 * (rate(prometheus_tsdb_head_samples_appended_total[1d]) * (rate(prometheus_tsdb_compaction_chunk_size_bytes_sum[1d]) / rate(prometheus_tsdb_compaction_chunk_samples_sum[1d]))) / 1024 / 1024 / 1024

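For example, if the query returns {instance="localhost:9090", job="prometheus"} 4.437027408140867, the localhost:9090 instance consumes about 4.437 GB of storage per day. In addition, the instance keeps at least 3 WAL files that hold the raw data, each 128 MB in size.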
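
The daily figure can be turned into a total for a chosen retention period. The following is a minimal sketch, assuming the query is sent to the Prometheus HTTP API instant-query endpoint (/api/v1/query) and the JSON body is already in hand. The function names, the 15-day retention, and the sample timestamp are hypothetical. Treating the 3 WAL files as extra space on top of the blocks is also an assumption of this sketch, so the result is only a rough estimate.

```python
import json

WAL_SEGMENT_MB = 128   # size of one WAL file, from the text above
MIN_WAL_SEGMENTS = 3   # at least 3 WAL files are kept


def daily_gb_from_response(body: str) -> dict:
    """Map each instance label to its daily disk usage in GB.

    Expects the JSON body of an instant query that returns a vector,
    where each value is [timestamp, "number-as-string"].
    """
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError("query failed")
    return {
        item["metric"].get("instance", ""): float(item["value"][1])
        for item in payload["data"]["result"]
    }


def total_disk_gb(daily_gb: float, retention_days: int) -> float:
    """Daily usage times retention, plus the minimum WAL footprint."""
    wal_gb = MIN_WAL_SEGMENTS * WAL_SEGMENT_MB / 1024
    return daily_gb * retention_days + wal_gb


if __name__ == "__main__":
    # Body shaped like the example result above; the timestamp is made up.
    sample = json.dumps({
        "status": "success",
        "data": {"resultType": "vector", "result": [{
            "metric": {"instance": "localhost:9090", "job": "prometheus"},
            "value": [1700000000, "4.437027408140867"],
        }]},
    })
    for instance, gb in daily_gb_from_response(sample).items():
        # retention_days=15 is a hypothetical retention period
        print(instance, round(total_disk_gb(gb, 15), 2))
```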

2. Memory consumption

Memory consumption = memory used by Prometheus Server itself + memory used by data blocks + memory used by scraping metrics + memory used by queries

Memory used by Prometheus Server itself

On a freshly installed multi-node high-availability cluster, Prometheus Server uses about 500 MB of memory.

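Memory used by data blocks

This depends mainly on the following parameters:

- Samples ingested per second: rate(prometheus_tsdb_head_samples_appended_total[1d])
- Average number of labels per metric
- Total number of distinct label pairs
- Average size of a label pair
- Flush period of a data block to disk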

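Memory used by scraping metrics

This depends mainly on the following parameters:

- Samples ingested per second: rate(prometheus_tsdb_head_samples_appended_total[1d])
- Average size of a sample
- Scrape interval, usually 15s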

The calculator at https://www.robustperception.io/how-much-ram-does-prometheus-2-x-need-for-cardinality-and-ingestion/ can be used to estimate the two parts above.
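
To see how the listed parameters interact, here is a minimal sketch. It is not the formula used by the calculator linked above. It assumes every series is scraped at the same interval and that sample data stays in memory until the block is flushed. It also ignores label and index overhead, so it gives only a lower bound on the sample data. All names and input values are hypothetical, including the 2-hour flush period.

```python
def active_series(samples_per_second: float, scrape_interval_s: float) -> float:
    """Each series yields one sample per scrape, so series = rate * interval."""
    return samples_per_second * scrape_interval_s


def in_memory_sample_bytes(samples_per_second: float,
                           bytes_per_sample: float,
                           flush_period_s: float) -> float:
    """Bytes of sample data accumulated between two block flushes."""
    return samples_per_second * bytes_per_sample * flush_period_s


if __name__ == "__main__":
    # Hypothetical inputs; the 2-hour flush period is an assumption.
    rate = 100_000  # samples per second
    print(active_series(rate, 15))
    print(in_memory_sample_bytes(rate, 1.5, 2 * 3600) / 1024 / 1024)  # in MB
```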

Memory used by queries

When the queried data is not in memory, Prometheus loads it from disk into memory, which consumes additional memory.

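In production, the query avg(container_memory_working_set_bytes{image!="", container="prometheus-server"}) / 1024 / 1024, run across more than 40 clusters, shows an average memory consumption of 953 MB. Each cluster runs about 300 Pods on average.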