Self-hosting Prometheus and Grafana
When deploying large language models (LLMs) on GPU clusters, having granular visibility into your system's performance is crucial. A local monitoring stack is ideal for this scenario, especially when working with sensitive applications where you don't want logging and monitoring data sent over the internet to third-party servers.
A popular combination of open-source tools for observability is Prometheus and Grafana. Prometheus excels at collecting, storing, and querying system metrics. Grafana can be used to build powerful dashboards to visualize those metrics. This blog post describes how we deployed both on a local Kubernetes cluster.
Configuration
This section describes how to deploy Prometheus and Grafana using Helm and Kustomize. Prometheus can be deployed using its community Helm Charts, while Grafana provides official Helm Charts. Below we explain how, and why, to modify these Charts for our deployment.
We start by creating two folders, one for each application: prometheus/ and grafana/.
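On the command line (the resources/ subfolders are used later in this post):
mkdir -p prometheus/resources grafana/resources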
Prometheus
Inside the prometheus/ folder, we create a file to kustomize the community Helm Charts, prometheus/kustomization.yml:
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: prometheus
helmCharts:
- name: prometheus
  releaseName: prometheus
  version: "27.0"
  repo: https://prometheus-community.github.io/helm-charts/
Namespace
To create the namespace, under the prometheus/resources/ folder, a file called namespace.yml is created:
apiVersion: v1
kind: Namespace
metadata:
  name: prometheus
This file is referenced in prometheus/kustomization.yml:
:
...
resources:
- resources/namespace.yml
URL
Since the Prometheus Helm chart uses prometheus-server as the default Service name, you can access it internally via http://prometheus-server.prometheus.svc.cluster.local.
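To quickly inspect Prometheus from your local machine, you can port-forward the Service, which the chart exposes on port 80 by default:
kubectl port-forward svc/prometheus-server -n prometheus 9090:80
Prometheus is then available at http://localhost:9090.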
Grafana
Inside the grafana/ folder, a Kustomize file is created for Grafana, grafana/kustomization.yml. Here we reference the official Helm Charts:
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: grafana
helmCharts:
- name: grafana
  releaseName: grafana
  version: "8.13.1"
  repo: https://grafana.github.io/helm-charts
Namespace
To create a namespace, under the grafana/resources/ folder, a file called namespace.yml is created:
apiVersion: v1
kind: Namespace
metadata:
  name: grafana
This file is referenced in the grafana/kustomization.yml:
...
resources:
- resources/namespace.yml
Secrets
Grafana requires an admin user and password to be configured. Of course, we don't want this sensitive information in our git history, so we use Sealed Secrets to create encrypted values. First, ensure the Sealed Secrets controller is running in your cluster and install the client tools on your local device:
brew install kubeseal yq
Then run the following command:
kubectl create secret generic admin-password \
  --namespace grafana \
  --dry-run=client -o json \
  --from-literal=username=<your_username_value> \
  --from-literal=password=<your_password_value> \
  | kubeseal --controller-namespace sealed-secrets --controller-name sealed-secrets \
  | yq -p json
Paste the result in grafana/resources/secrets.yml:
kind: SealedSecret
apiVersion: bitnami.com/v1alpha1
metadata:
  name: admin-password
  namespace: grafana
  creationTimestamp: null
spec:
  template:
    metadata:
      name: admin-password
      namespace: grafana
      creationTimestamp: null
  encryptedData:
    password: ...
    username: ...
Note that the password and username are now encrypted secrets.
Then reference the secrets in grafana/kustomization.yml:
helmCharts:
- ...
  valuesInline:
    admin:
      existingSecret: admin-password
      userKey: username
      passwordKey: password
resources:
- ...
- resources/secrets.yml
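Once applied, you can verify that the Sealed Secrets controller unsealed the value into a regular Secret:
kubectl get secret admin-password -n grafana -o jsonpath='{.data.username}' | base64 -d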
Configure Prometheus Datasource
To ingest data from Prometheus, configure Prometheus as a datasource in grafana/kustomization.yml:
...
helmCharts:
- ...
  valuesInline:
    ...
    datasources:
      datasources.yaml:
        apiVersion: 1
        datasources:
        - name: Prometheus
          type: prometheus
          url: http://prometheus-server.prometheus.svc.cluster.local
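Once Grafana is running, you can check that the datasource was provisioned through Grafana's HTTP API; a quick sketch, assuming the chart's default Service name grafana on port 80:
kubectl port-forward svc/grafana -n grafana 3000:80
curl -u <your_username_value>:<your_password_value> http://localhost:3000/api/datasources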
Ingress
To access Grafana from outside the cluster, enable ingress and configure the host:
helmCharts:
- ...
  valuesInline:
    ...
    ingress:
      enabled: true
      hosts:
      - grafana.my-server.internal
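Assuming your internal DNS points grafana.my-server.internal at the cluster's ingress controller, you can verify access with:
curl -I http://grafana.my-server.internal
Grafana typically answers with a redirect to /login.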
Persistent Storage
To maintain data (such as dashboards, user data & config, alerts and snapshots) after pods are restarted, configure persistant storage. First, create a Persistent Volume by creating a file grafana/resources/pv.yml
:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: grafana-pv
  labels:
    model: grafana-pv
spec:
  storageClassName: manual
  capacity:
    storage: 1Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: "/mnt/data/grafana"
The Persistent Volume Claim is configured through Grafana's Helm Chart values in grafana/kustomization.yml (remember to also reference resources/pv.yml under resources, as shown in the complete example below):
helmCharts:
- ...
  valuesInline:
    ...
    persistence:
      enabled: true
      storageClassName: manual
      volumeName: grafana-pv
      accessModes:
      - ReadWriteOnce
      size: 1Gi
Both are set to 1Gi according to Grafana's minimum hardware requirements.
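After deployment, both the volume and the claim should report a Bound status:
kubectl get pv grafana-pv
kubectl get pvc -n grafana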
Complete Example
To summarize, this results in the following files:
prometheus/
├── kustomization.yml
└── resources/
    └── namespace.yml

grafana/
├── kustomization.yml
└── resources/
    ├── namespace.yml
    ├── pv.yml
    └── secrets.yml
Prometheus
prometheus/kustomization.yml:
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: prometheus
helmCharts:
- name: prometheus
  releaseName: prometheus
  version: "27.0"
  repo: https://prometheus-community.github.io/helm-charts/
resources:
- resources/namespace.yml
Resources
prometheus/resources/namespace.yml:
apiVersion: v1
kind: Namespace
metadata:
  name: prometheus
Grafana
grafana/kustomization.yml:
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: grafana
helmCharts:
- name: grafana
  releaseName: grafana
  version: "8.13.1"
  repo: https://grafana.github.io/helm-charts
  valuesInline:
    ingress:
      enabled: true
      hosts:
      - grafana.my-server.internal
    admin:
      existingSecret: admin-password
      userKey: username
      passwordKey: password
    datasources:
      datasources.yaml:
        apiVersion: 1
        datasources:
        - name: Prometheus
          type: prometheus
          url: http://prometheus-server.prometheus.svc.cluster.local
    persistence:
      enabled: true
      storageClassName: manual
      volumeName: grafana-pv
      accessModes:
      - ReadWriteOnce
      size: 1Gi
resources:
- resources/namespace.yml
- resources/secrets.yml
- resources/pv.yml
Resources
grafana/resources/namespace.yml:
apiVersion: v1
kind: Namespace
metadata:
  name: grafana
grafana/resources/pv.yml:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: grafana-pv
  labels:
    model: grafana-pv
spec:
  storageClassName: manual
  capacity:
    storage: 1Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: "/mnt/data/grafana"
grafana/resources/secrets.yml:
kind: SealedSecret
apiVersion: bitnami.com/v1alpha1
metadata:
  name: admin-password
  namespace: grafana
  creationTimestamp: null
spec:
  template:
    metadata:
      name: admin-password
      namespace: grafana
      creationTimestamp: null
  encryptedData:
    password: ...
    username: ...
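With these files in place, both stacks can be rendered and applied with Kustomize's Helm support (the helmCharts field requires the --enable-helm flag):
kustomize build --enable-helm prometheus/ | kubectl apply -f -
kustomize build --enable-helm grafana/ | kubectl apply -f -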
Access and Test
In this section, we show how to test and use the deployment of Prometheus and Grafana.
Endpoints for Prometheus
Prometheus works by pulling metrics from endpoints that applications provide. How does Prometheus know where to scrape?
Some applications provide metadata about their metric endpoints by default in their Helm Charts. Take, for example, NVIDIA's gpu-operator, which deploys the Data Center GPU Manager (DCGM) exporter, or dcgm-exporter for short. This exporter exposes GPU metrics using the NVIDIA Management Library (NVML). In its Helm Chart, podAnnotations and additionalLabels are already set for Prometheus.
For other applications, you need to configure this yourself. We use vLLM to deploy large language models (LLMs) on our GPU cluster. To let Prometheus know where to scrape vLLM metrics, the Helm Chart of vLLM needs to be patched with the following metadata:
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: vllm
  name: vllm-deployment-router
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/path: "/metrics"
        prometheus.io/port: "8000"
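If vLLM is itself deployed with a Kustomization, such a patch can be applied with a patches entry; a sketch, assuming the manifest above is saved as patches/prometheus-annotations.yml:
patches:
- path: patches/prometheus-annotations.yml
  target:
    kind: Deployment
    name: vllm-deployment-router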
An alternative would be to configure the scrape targets on Prometheus' side. For vLLM, this would need to be added to your Prometheus configuration.
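For example, the community Prometheus chart accepts additional scrape jobs through its extraScrapeConfigs value; a minimal sketch in prometheus/kustomization.yml, where the vLLM Service name and port are assumptions for illustration:
helmCharts:
- ...
  valuesInline:
    extraScrapeConfigs: |
      - job_name: vllm
        static_configs:
        - targets:
          - vllm-router-service.vllm.svc.cluster.local:8000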
We opt for adding the metadata on the application side, as this makes it straightforward to add scraping when a new application is deployed: each application's scrape configuration lives alongside the application itself.
For testing, you can also extend the Grafana Helm Chart's testFramework, for example with load tests via k6.
Extensions
To professionalize this setup, the Prometheus and Grafana servers can be extended by:
- Exploring the Prometheus configuration
- Setting up alerting in Prometheus
- Configuring CA certificates for Grafana
- Adding CI/CD for Grafana dashboards
- Exposing more LLM metrics: vLLM does expose useful metrics (such as tokens per second) to its router pod, but these are not exposed to Prometheus. One way to expose them is to create an observability pod, as suggested here. We also suggest keeping track of vLLM's observability efforts to see if this is resolved.