Self-hosting Prometheus and Grafana
When deploying large language models (LLMs) on GPU clusters, having granular visibility into your system's performance is crucial. A local monitoring stack is ideal for this scenario, especially when working with sensitive applications where you don't want logging and monitoring data sent over the internet to third-party servers.
A popular combination of open-source tools for observability is Prometheus and Grafana. Prometheus excels at collecting, storing, and querying system metrics. Grafana can be used to build powerful dashboards to visualize those metrics. This blog post describes how we deployed both on a local Kubernetes cluster.
Configuration
This section describes how to deploy Prometheus and Grafana using Helm and Kustomize. Prometheus can be deployed using its community Helm Charts, while Grafana provides official Helm Charts. Below we explain how, and why, to modify these Charts for our deployment.
We start by creating two folders, one for each application: prometheus/ and grafana/.
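On the command line (the resources/ subfolders are used later in this post):
mkdir -p prometheus/resources grafana/resources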
Prometheus
Inside the prometheus/ folder, we create a file to kustomize the community Helm Charts, prometheus/kustomization.yml:
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: prometheus
helmCharts:
- name: prometheus
  releaseName: prometheus
  version: "27.0"
  repo: https://prometheus-community.github.io/helm-charts/
Namespace
To create the namespace, under the prometheus/resources/ folder, a file called namespace.yml is created:
apiVersion: v1
kind: Namespace
metadata:
  name: prometheus
This file is referenced in prometheus/kustomization.yml:
:
...
resources:
- resources/namespace.yml
URL
Since the Prometheus Helm chart uses prometheus-server as the default Service name, you can access it internally via http://prometheus-server.prometheus.svc.cluster.local.
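To quickly inspect Prometheus from your local machine, you can port-forward the Service, which the chart exposes on port 80 by default:
kubectl port-forward svc/prometheus-server -n prometheus 9090:80
Prometheus is then available at http://localhost:9090.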
Grafana
Inside the grafana/ folder, a Kustomize file is created for Grafana, grafana/kustomization.yml. Here we reference the official Helm Charts:
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: grafana
helmCharts:
- name: grafana
  releaseName: grafana
  version: "8.13.1"
  repo: https://grafana.github.io/helm-charts
Namespace
To create a namespace, under the grafana/resources/ folder, a file called namespace.yml is created:
apiVersion: v1
kind: Namespace
metadata:
  name: grafana
This file is referenced in the grafana/kustomization.yml:
...
resources:
- resources/namespace.yml
Secrets
Grafana requires an admin user and password to be configured. Of course, we don't want this sensitive information in our git history, so we use Sealed Secrets to create encrypted values. First, ensure the Sealed Secrets controller is running in your cluster and install the client tools on your local device:
brew install kubeseal yq
Then run the following command:
kubectl create secret generic admin-password \
  --namespace grafana \
  --dry-run=client -o json \
  --from-literal=username=<your_username_value> \
  --from-literal=password=<your_password_value> \
  | kubeseal --controller-namespace sealed-secrets --controller-name sealed-secrets \
  | yq -p json
Paste the result in grafana/resources/secrets.yml:
kind: SealedSecret
apiVersion: bitnami.com/v1alpha1
metadata:
  name: admin-password
  namespace: grafana
  creationTimestamp: null
spec:
  template:
    metadata:
      name: admin-password
      namespace: grafana
      creationTimestamp: null
  encryptedData:
    password: ...
    username: ...
Note that the password and username are now encrypted secrets.
Then reference the secrets in grafana/kustomization.yml:
helmCharts:
- ...
  valuesInline:
    admin:
      existingSecret: admin-password
      userKey: username
      passwordKey: password
resources:
- ...
- resources/secrets.yml
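Once applied, you can verify that the Sealed Secrets controller unsealed the value into a regular Secret:
kubectl get secret admin-password -n grafana -o jsonpath='{.data.username}' | base64 -d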
Configure Prometheus Datasource
To ingest data from Prometheus, configure Prometheus as a datasource in grafana/kustomization.yml:
...
helmCharts:
- ...
  valuesInline:
    ...
    datasources:
      datasources.yaml:
        apiVersion: 1
        datasources:
        - name: Prometheus
          type: prometheus
          url: http://prometheus-server.prometheus.svc.cluster.local
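Once Grafana is running, you can check that the datasource was provisioned through Grafana's HTTP API; a quick sketch, assuming the chart's default Service name grafana on port 80:
kubectl port-forward svc/grafana -n grafana 3000:80
curl -u <your_username_value>:<your_password_value> http://localhost:3000/api/datasources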
Ingress
To access Grafana from outside the cluster, enable ingress and configure the host:
helmCharts:
- ...
  valuesInline:
    ...
    ingress:
      enabled: true
      hosts:
      - grafana.my-server.internal
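Assuming your internal DNS points grafana.my-server.internal at the cluster's ingress controller, you can verify access with:
curl -I http://grafana.my-server.internal
Grafana typically answers with a redirect to /login.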
Persistent Storage
To maintain data (such as dashboards, user data & config, alerts and snapshots) after pods are restarted, configure persistant storage. First, create a Persistent Volume by creating a file grafana/resources/pv.yml
:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: grafana-pv
  labels:
    model: grafana-pv
spec:
  storageClassName: manual
  capacity:
    storage: 1Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: "/mnt/data/grafana"
The Persistent Volume Claim is configured through Grafana's Helm Chart values in grafana/kustomization.yml (remember to also reference resources/pv.yml under resources, as shown in the complete example below):
helmCharts:
- ...
  valuesInline:
    ...
    persistence:
      enabled: true
      storageClassName: manual
      volumeName: grafana-pv
      accessModes:
      - ReadWriteOnce
      size: 1Gi
Both are set to 1Gi according to Grafana's minimum hardware requirements.
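After deployment, both the volume and the claim should report a Bound status:
kubectl get pv grafana-pv
kubectl get pvc -n grafana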
Complete Example
To summarize, this results in the following files:
prometheus/
├── kustomization.yml
└── resources/
    └── namespace.yml

grafana/
├── kustomization.yml
└── resources/
    ├── namespace.yml
    ├── pv.yml
    └── secrets.yml
Prometheus
prometheus/kustomization.yml:
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: prometheus
helmCharts:
- name: prometheus
  releaseName: prometheus
  version: "27.0"
  repo: https://prometheus-community.github.io/helm-charts/
resources:
- resources/namespace.yml
Resources
prometheus/resources/namespace.yml:
apiVersion: v1
kind: Namespace
metadata:
  name: prometheus
Grafana
grafana/kustomization.yml:
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: grafana
helmCharts:
- name: grafana
  releaseName: grafana
  version: "8.13.1"
  repo: https://grafana.github.io/helm-charts
  valuesInline:
    ingress:
      enabled: true
      hosts:
      - grafana.my-server.internal
    admin:
      existingSecret: admin-password
      userKey: username
      passwordKey: password
    datasources:
      datasources.yaml:
        apiVersion: 1
        datasources:
        - name: Prometheus
          type: prometheus
          url: http://prometheus-server.prometheus.svc.cluster.local
    persistence:
      enabled: true
      storageClassName: manual
      volumeName: grafana-pv
      accessModes:
      - ReadWriteOnce
      size: 1Gi
resources:
- resources/namespace.yml
- resources/secrets.yml
- resources/pv.yml
Resources
grafana/resources/namespace.yml:
apiVersion: v1
kind: Namespace
metadata:
  name: grafana
grafana/resources/pv.yml:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: grafana-pv
  labels:
    model: grafana-pv
spec:
  storageClassName: manual
  capacity:
    storage: 1Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: "/mnt/data/grafana"
grafana/resources/secrets.yml:
kind: SealedSecret
apiVersion: bitnami.com/v1alpha1
metadata:
  name: admin-password
  namespace: grafana
  creationTimestamp: null
spec:
  template:
    metadata:
      name: admin-password
      namespace: grafana
      creationTimestamp: null
  encryptedData:
    password: ...
    username: ...
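With these files in place, both stacks can be rendered and applied with Kustomize's Helm support (the helmCharts field requires the --enable-helm flag):
kustomize build --enable-helm prometheus/ | kubectl apply -f -
kustomize build --enable-helm grafana/ | kubectl apply -f -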
Access and Test
In this section, we show how to test and use the deployment of Prometheus and Grafana.
Endpoints for Prometheus
Prometheus works by pulling metrics from endpoints that applications provide. How does Prometheus know where to scrape?
Some applications provide metadata about their metric endpoints by default in their Helm Charts. Take, for example, NVIDIA's gpu-operator, which deploys the Data Center GPU Manager (DCGM) exporter, or dcgm-exporter for short. This exporter exposes GPU metrics using the NVIDIA Management Library (NVML). In its Helm Chart, podAnnotations and additionalLabels are already set for Prometheus.
For other applications, you need to configure this yourself. We use vLLM to deploy large language models (LLMs) on our GPU cluster. To let Prometheus know where to scrape vLLM metrics, the Helm Chart of vLLM needs to be patched with the following metadata:
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: vllm
  name: vllm-deployment-router
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/path: "/metrics"
        prometheus.io/port: "8000"
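If vLLM is itself deployed with a Kustomization, such a patch can be applied with a patches entry; a sketch, assuming the manifest above is saved as patches/prometheus-annotations.yml:
patches:
- path: patches/prometheus-annotations.yml
  target:
    kind: Deployment
    name: vllm-deployment-router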
An alternative would be to configure the scrape targets on Prometheus' side. For vLLM, this would need to be added to your Prometheus configuration.
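For example, the community Prometheus chart accepts additional scrape jobs through its extraScrapeConfigs value; a minimal sketch in prometheus/kustomization.yml, where the vLLM Service name and port are assumptions for illustration:
helmCharts:
- ...
  valuesInline:
    extraScrapeConfigs: |
      - job_name: vllm
        static_configs:
        - targets:
          - vllm-router-service.vllm.svc.cluster.local:8000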
We opt for adding the metadata on the application side, as this makes it straightforward to add scraping when a new application is deployed: each application's scrape configuration lives alongside the application itself.
For testing, you can also extend the Grafana Helm Chart's testFramework, for example with load tests via k6.
Extensions
To professionalize this setup, the Prometheus and Grafana servers can be extended by:
- Exploring the Prometheus configuration
- Setting up alerting in Prometheus
- Configuring CA certificates for Grafana
- Adding CI/CD for Grafana dashboards
- Exposing more LLM metrics: vLLM does expose useful metrics (such as tokens per second) to its router pod, but these are not exposed to Prometheus. One way to expose them is to create an observability pod, as suggested here. We also suggest keeping track of vLLM's observability efforts to see if this is resolved.