23/11/2023 • Leonardo Wolo

Prometheus High Availability with Thanos in Kubernetes

Prometheus is a popular monitoring and alerting tool in the Kubernetes ecosystem. However, as your infrastructure grows, a single Prometheus instance may struggle to handle the increasing workload and becomes a single point of failure.

In this blog post, we will explore how to configure Prometheus for high availability using Thanos in Kubernetes. We will leverage the Prometheus Operator for easy deployment and management.

If you're not familiar with Prometheus, we recommend going through their documentation first.

What is Thanos?

To implement Thanos successfully, it's important to understand its individual components and how they work together.

Thanos is an open-source project that enhances Prometheus' functionalities and offers a scalable, highly available, and long-term storage solution. It allows you to achieve horizontal scalability while maintaining a global querying view and deduplication across multiple Prometheus instances. Thanos acts as a middle layer between Prometheus and a long-term storage system like object storage or a distributed database. In the diagram [1] below, you can see how all the components work together.

Now that we have a global understanding of what Thanos is, let's look at each component in detail.

Thanos Sidecar

The Thanos sidecar runs alongside each Prometheus instance. It exposes the data of its Prometheus instance to the rest of the Thanos cluster via the Store API, and it can additionally upload completed data blocks to an object storage system or distributed database for long-term retention.

Thanos Query

The Thanos Querier facilitates querying across data stored in the various Prometheus instances and in the object storage or distributed database. It serves as a central query endpoint: it fans a query out to the sidecars and store gateways, then consolidates and deduplicates their responses into a single, global view of the data. With the Thanos Querier in place, there's no need to query individual Prometheus instances directly.

Thanos Store

The Thanos Store Gateway acts as an API gateway between your Thanos cluster and the object store. Its primary role is to serve read requests made by Thanos Query against the data blocks that have been uploaded to the underlying store, exposing that historical data through the same Store API the sidecars use for recent data. This allows Thanos Query to provide a consistent, global view over both recent and long-term data without talking to the object store directly.
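
Although we won't deploy the store gateway in this post, a minimal sketch of how its container is typically configured may help make its role concrete. The objstore.yml bucket configuration referenced here is an assumption and not part of this post's setup.

# Hypothetical Thanos Store Gateway container (not deployed in this walkthrough)
- name: thanos-store
  image: "quay.io/thanos/thanos:v0.31.0"
  args:
    - store
    - --data-dir=/var/thanos/store                     # local cache for index and chunk data
    - --objstore.config-file=/etc/thanos/objstore.yml  # bucket configuration (assumed)
    - --grpc-address=0.0.0.0:10901                     # Store API consumed by Thanos Query
    - --http-address=0.0.0.0:10902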

Thanos Compactor

The Thanos compactor is responsible for optimizing the long-term storage system by performing downsampling, deduplication, and compaction of data blocks. It periodically scans the blocks stored in the object storage or distributed database and applies these operations to enhance storage efficiency and reduce query response times.
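
The compactor is likewise out of scope for this walkthrough, but for illustration, a typical container configuration (again assuming an objstore.yml bucket config, with example retention values) could look like this:

# Hypothetical Thanos Compactor container (not deployed in this walkthrough)
- name: thanos-compact
  image: "quay.io/thanos/thanos:v0.31.0"
  args:
    - compact
    - --data-dir=/var/thanos/compact
    - --objstore.config-file=/etc/thanos/objstore.yml  # bucket configuration (assumed)
    - --wait                                  # keep running and compact continuously
    - --retention.resolution-raw=30d          # example retention per resolution
    - --retention.resolution-5m=120d
    - --retention.resolution-1h=1y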

Integrate Thanos with Prometheus

As we've seen, Thanos consists of several components, but we will only configure the sidecar and query components. We focus on these two because our goal is high availability rather than extensive long-term storage. The diagram [2] below shows how the components work together with Prometheus.

Before proceeding, make sure you have the following in place. 

  • Kubernetes cluster: Set up a Kubernetes cluster where you plan to deploy Prometheus and Thanos.
  • Prometheus: Have a Prometheus setup ready, including the necessary configuration files.

Note: Our Prometheus component is deployed using the Prometheus Operator. The Thanos sidecar is integrated through the Prometheus CRD, which makes it easy to manage both components together. Thanos components such as the Querier, Compactor, and Store Gateway must be deployed separately from the Prometheus Operator; the kube-thanos project provides useful starting points for deploying them.

1.   Prometheus Custom Resource with Thanos Sidecar

The Prometheus Custom Resource Definition (CRD) makes it possible to include a Thanos sidecar in the Prometheus Pod. Enabling the sidecar only requires setting a value in the thanos section of the spec; in the simplest case, specifying a valid container image version for the Thanos sidecar is enough.

...
spec:
  ...
  thanos:
    version: v0.31.0
...

After making this adjustment, the Prometheus Pods will be restarted with an additional Thanos sidecar container. At this point, you should also increase the number of replicas so that multiple Prometheus instances run in the cluster, as sketched below.
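
As a rough sketch, and assuming your Prometheus resource is named prometheus-prometheus (to match the operator.prometheus.io/name selector in the discovery Service further down) and lives in the prometheus namespace, the relevant parts of the Prometheus custom resource could look like this. We explicitly set replicaExternalLabelName to replica so that it matches the --query.replica-label flag passed to Thanos Query later; the Prometheus Operator's default replica label is prometheus_replica.

---
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus-prometheus        # assumed name, matching the discovery Service selector
  namespace: prometheus
spec:
  replicas: 2                        # run multiple Prometheus instances for high availability
  replicaExternalLabelName: replica  # must match --query.replica-label on Thanos Query
  thanos:
    version: v0.31.0                 # enables the Thanos sidecar container
  # ...keep the rest of your existing Prometheus spec (serviceAccountName,
  # serviceMonitorSelector, storage, resources, etc.) unchanged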

To enable our Thanos Query instance to discover and query the Thanos sidecars, we need a headless Service that exposes the ports of the Thanos sidecar.

---
apiVersion: v1
kind: Service
metadata:
  name: prometheus-thanos-discovery
  namespace: prometheus
  labels:
    app: prometheus-thanos-discovery
    app.kubernetes.io/part-of: prometheus
spec:
  type: ClusterIP
  clusterIP: None
  ports:
    - name: grpc
      port: 10901
      targetPort: grpc
    - name: http
      port: 10902
      targetPort: http
  selector:
    app.kubernetes.io/name: prometheus
    operator.prometheus.io/name: prometheus-prometheus

2.   Thanos Query in our cluster

After configuring the sidecar for all Prometheus instances, the next step is to leverage Thanos's global query layer to execute PromQL queries across all instances simultaneously.

The Querier component of Thanos is designed to be stateless and horizontally scalable, allowing for deployment with multiple replicas. Once connected to the Thanos Sidecar, it automatically identifies the relevant Prometheus servers to contact for a given PromQL query.

Additionally, the Thanos Querier implements Prometheus's official HTTP API, enabling its compatibility with external tools like Grafana. It also provides a modified version of Prometheus's user interface, allowing for ad-hoc querying and monitoring of the Thanos stores' status.

Let's start by deploying the Kubernetes Thanos-query ServiceAccount.

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: thanos-query
  namespace: prometheus
  labels:
    app.kubernetes.io/component: query-layer
    app.kubernetes.io/instance: thanos-query
    app.kubernetes.io/name: thanos-query
    app.kubernetes.io/version: v0.31.0

Next, let's deploy the thanos-query Service.

---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/component: query-layer
    app.kubernetes.io/instance: thanos-query
    app.kubernetes.io/name: thanos-query
    app.kubernetes.io/version: v0.31.0
  name: thanos-query
  namespace: prometheus
spec:
  ports:
    - name: grpc
      port: 10901
      targetPort: grpc
    - name: http
      port: 9090
      targetPort: http
  selector:
    app.kubernetes.io/component: query-layer
    app.kubernetes.io/instance: thanos-query
    app.kubernetes.io/name: thanos-query

Now the most crucial step lies ahead: deploying the Thanos Query application.

Note:

  • The container argument --endpoint=dnssrv+_grpc._tcp.prometheus-thanos-discovery.prometheus.svc.cluster.local discovers the Thanos Store API endpoints via DNS SRV lookups against the headless discovery Service we created earlier. Don't forget to adjust the service URL to match your setup.
  • The container argument --query.replica-label=replica specifies the label that is treated as a replica indicator; data is deduplicated along this label. Make sure it matches the replica external label applied to your Prometheus instances.

---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/component: query-layer
    app.kubernetes.io/instance: thanos-query
    app.kubernetes.io/name: thanos-query
    app.kubernetes.io/version: v0.31.0
  name: thanos-query
  namespace: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/component: query-layer
      app.kubernetes.io/instance: thanos-query
      app.kubernetes.io/name: thanos-query
  template:
    metadata:
      labels:
        app.kubernetes.io/component: query-layer
        app.kubernetes.io/instance: thanos-query
        app.kubernetes.io/name: thanos-query
        app.kubernetes.io/version: v0.31.0
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app.kubernetes.io/name
                      operator: In
                      values:
                        - thanos-query
                namespaces:
                  - prometheus
                topologyKey: kubernetes.io/hostname
              weight: 100
      serviceAccountName: thanos-query
      containers:
        - name: thanos-query
          image: "quay.io/thanos/thanos:v0.31.0"
          args:
            - query
            - --log.level=info
            - --grpc-address=0.0.0.0:10901
            - --http-address=0.0.0.0:9090
            - --query.replica-label=replica
            - --endpoint=dnssrv+_grpc._tcp.prometheus-thanos-discovery.prometheus.svc.cluster.local
          ports:
            - name: grpc
              containerPort: 10901
            - name: http
              containerPort: 9090
          livenessProbe:
            failureThreshold: 4
            httpGet:
              path: /-/healthy
              port: 9090
              scheme: HTTP
            periodSeconds: 30
          readinessProbe:
            failureThreshold: 20
            httpGet:
              path: /-/ready
              port: 9090
              scheme: HTTP
            periodSeconds: 5
          terminationMessagePolicy: FallbackToLogsOnError
          resources:
            requests:
              memory: 1.5Gi
            limits:
              memory: 1.5Gi
      terminationGracePeriodSeconds: 120

With all the essential components successfully operating in our cluster, we can now proceed to validate the functionality of querying metrics using the Thanos query. This can be achieved by port-forwarding to a pod.

# Get the thanos query pods
kubectl -n <namespace> get pods
 
# Port forward to a pod
kubectl -n <namespace> port-forward <pod-name> <local-port>:<remote-port>

By navigating to the Stores section in the Thanos GUI [3], we can validate whether all Prometheus instances have been discovered by the Thanos Query application.

In the Graph section [4], we can verify that deduplication works by running the same query with the deduplication option enabled and disabled and comparing the results.


With confirmation that our Thanos configuration is functioning properly, the final step is to configure Grafana to use Thanos Query as its endpoint for metrics.
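
As an illustration, a minimal Grafana datasource provisioning file for this could look like the sketch below. The datasource name and file location are assumptions; adjust the URL if your Service name or namespace differs.

# Hypothetical Grafana provisioning file, e.g. mounted at
# /etc/grafana/provisioning/datasources/thanos.yaml in the Grafana container
apiVersion: 1
datasources:
  - name: Thanos
    type: prometheus    # Thanos Query implements the Prometheus HTTP API
    access: proxy
    url: http://thanos-query.prometheus.svc.cluster.local:9090
    isDefault: true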

Conclusion

Configuring Prometheus for high availability is crucial to maintain uninterrupted monitoring and seamless data collection. The open-source project Thanos offers a solution for achieving scalability, high availability, and long-term storage. In this blog post, we've guided you through configuring Prometheus with Thanos in Kubernetes using the Prometheus Operator, allowing you to seamlessly deploy and manage Thanos for your Prometheus setup.

For more information, visit the official Thanos Documentation.
