Autoscaling Azure DevOps Pipelines Agents with KEDA

1. Introduction
As organizations scale their DevOps practices, the need for efficient resource management and automation becomes critical. One of the key challenges in large-scale CI/CD environments is managing the availability of build agents, especially when working with hundreds of specialized pipelines that require custom tools and configurations.
In our case, we manage hundreds of security scan pipelines that need to be executed regularly. These pipelines require a complex set of tools installed on self-hosted Azure DevOps agents. Managing the lifecycle of these agents manually or with static configurations can lead to inefficiencies, wasted resources, and increased costs. To address these challenges, our architect proposed using Kubernetes Event-Driven Autoscaling (KEDA) as an auto-scaling solution for our Azure DevOps Agent Pools.
In this article, we'll explore how KEDA can be integrated with an Azure Kubernetes Service (AKS) environment to efficiently manage Azure DevOps Agent Pools. We'll cover the differences between KEDA's ScaledObject and ScaledJob mechanisms and provide code examples with configurations to help you implement a robust, auto-scaling solution for your Azure DevOps agents.
2. Understanding KEDA: A Brief Overview
2.1 What exactly is KEDA?
KEDA (Kubernetes Event-Driven Autoscaling) is an open-source project that brings event-driven capabilities to Kubernetes, enabling applications to scale dynamically based on events, rather than just traditional CPU or memory metrics. It acts as a lightweight component that integrates seamlessly with Kubernetes, making it easy to scale workloads such as builds or deployments.
KEDA extends the standard Kubernetes Horizontal Pod Autoscaler (HPA) to support scaling based on a variety of external metrics, such as queue length in messaging systems (e.g., Azure Service Bus, RabbitMQ), database events, HTTP requests, and many more. This event-driven scaling mechanism allows Kubernetes workloads to respond instantly to changes in demand, ensuring resources are used efficiently and cost-effectively.
2.2 How does KEDA work?
KEDA operates by monitoring external event sources and metrics, which are defined through triggers. Each trigger defines a specific event or metric that KEDA watches to determine when to scale your Kubernetes deployments or jobs up or down. When a trigger threshold is met, KEDA adjusts the number of replicas or jobs accordingly.
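For example, a minimal ScaledObject (a KEDA resource described below) could scale a deployment on the length of an Azure Service Bus queue. This is purely illustrative: the deployment name, queue name, and environment variable are placeholder assumptions, not values from our setup:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: queue-consumer-scaler          # hypothetical name
spec:
  scaleTargetRef:
    name: queue-consumer               # hypothetical Deployment to scale
  minReplicaCount: 0                   # scale to zero when the queue is empty
  maxReplicaCount: 10
  triggers:
    - type: azure-servicebus
      metadata:
        queueName: orders                          # hypothetical queue name
        messageCount: "5"                          # target number of messages per replica
        connectionFromEnv: SERVICEBUS_CONNECTION   # env var on the target container holding the connection string

When the queue backs up beyond the target, KEDA raises the replica count; when the queue drains, KEDA scales the deployment back down.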

2.2.1 Core Components of KEDA
- KEDA Controller: Responsible for reconciling KEDA custom resources such as ScaledObject and ScaledJob. It monitors these resources and ensures the desired state is reflected in the Kubernetes cluster, managing the lifecycle of scaling resources based on the triggers defined in each ScaledObject or ScaledJob.
- Metrics Adapter: KEDA includes a built-in Metrics Server that extends the Kubernetes Metrics API. It provides external metrics to the Kubernetes Horizontal Pod Autoscaler (HPA), enabling scaling decisions based on custom metrics from external event sources like message queues, databases, or HTTP endpoints.
- ScaledObject: This object defines how a Kubernetes deployment or stateful set should scale based on a specific metric or event. It is suitable for long-running processes that need to scale based on load.
- ScaledJob: This object is designed for short-lived jobs or batch processing tasks. It creates Kubernetes Jobs based on external events, such as a message in a queue, and scales the number of jobs to match the demand.
- Scaler: A specialized component that communicates with external event sources or metrics systems (e.g., message queues, databases) to fetch metrics and events. Each type of scaler is responsible for interacting with a specific external service.
- Triggers: Define the external event sources or metrics that KEDA monitors to make scaling decisions. KEDA supports over 40 different triggers, including message queues (Azure Service Bus, RabbitMQ, Kafka), databases (Redis, PostgreSQL, MySQL), monitoring systems (Prometheus, Azure Monitor), and plain HTTP requests.
- Admission Webhooks: Validates and mutates KEDA-related resources before they are applied to the cluster.
2.3 Key Benefits of Using KEDA
- Event-Driven Scaling: KEDA provides the ability to scale Kubernetes workloads based on real-world events and metrics, such as the number of messages in a queue or the length of a processing backlog.
- Flexible Trigger Support: KEDA supports over 40 different event sources, including Azure Monitor, Prometheus, Kafka, Redis, and more, making it a versatile choice for various use cases.
- Seamless Integration with Kubernetes: KEDA extends Kubernetes-native capabilities without requiring major changes to your existing architecture. It works alongside the Kubernetes Horizontal Pod Autoscaler (HPA) to provide fine-grained control over scaling behaviour.
- Cost Efficiency: By scaling resources up and down based on actual demand, KEDA helps reduce costs associated with over-provisioning, especially in environments with fluctuating workloads.
2.4 Example Use Cases for KEDA
- CI/CD Pipelines: Automatically scale agents in response to job queues, as we’re doing with Azure DevOps Agent Pools.
- Message Processing: Scale consumers dynamically based on the number of messages in a queue (e.g., Azure Service Bus, Kafka).
- Scheduled Scaling: Scale up or down based on scheduled events, such as batch processing tasks or maintenance windows.
2.5 KEDA in the Context of Azure DevOps Agent Pools
For scenarios like ours, described in the introduction, KEDA allows us to dynamically provision agents based on the number of queued jobs in Azure DevOps. By leveraging either a ScaledObject or a ScaledJob configuration, KEDA can scale the agent pool up or down, ensuring that resources are available when needed and conserved when they are not.
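KEDA ships a dedicated azure-pipelines scaler for exactly this scenario. As a rough sketch (the pool name and environment variable names are placeholders), the trigger watching an agent pool's job queue could look like this:

triggers:
  - type: azure-pipelines
    metadata:
      poolName: "my-agent-pool"                # placeholder: your Azure DevOps agent pool
      organizationURLFromEnv: "AZP_URL"        # env var holding the organization URL
      personalAccessTokenFromEnv: "AZP_TOKEN"  # env var holding the PAT
      targetPipelinesQueueLength: "1"          # aim for one agent per queued job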
3. ScaledObject vs. ScaledJob in KEDA
KEDA provides two primary mechanisms for scaling Kubernetes workloads: ScaledObject and ScaledJob. Each serves a different purpose and has unique characteristics that make it suitable for specific use cases. In this section, we will explore the differences between the two, their use cases, and their application in managing Azure DevOps Agent Pools.
3.1 ScaledObject: Managing Long-Running Workloads
A ScaledObject is a KEDA custom resource that scales Kubernetes deployments, stateful sets, or other long-running resources based on external metrics or events. It works by integrating with the Kubernetes Horizontal Pod Autoscaler (HPA) to adjust the number of replicas of a target resource based on predefined triggers.
Key Characteristics:
- Suitable for Long-Running Processes: ScaledObjects are ideal for workloads that need to run continuously and adjust their scale in response to varying loads. Examples include web applications, APIs, or background services.
- Real-Time Scaling: They provide real-time scaling capabilities by monitoring metrics such as queue length, CPU usage, or other custom metrics.
- Integration with HPA: ScaledObjects directly modify the replica count of deployments or stateful sets using the native HPA mechanism, ensuring seamless integration with existing Kubernetes autoscaling features.
- Object requirement: A ScaledObject requires a target resource (such as a Deployment) to already be defined, because it scales the replica count of that resource rather than creating pods itself.
Example Use Case for Azure DevOps Agent Pools
In scenarios where Azure DevOps pipelines have a steady or predictable load of jobs, a ScaledObject can be used to scale the number of self-hosted agents up or down based on the length of the job queue. This ensures that there are always enough agents to handle incoming jobs without over-provisioning resources.
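As a sketch of what this could look like (the names, replica counts, and environment variables below are assumptions rather than a drop-in configuration), a ScaledObject targeting an agent Deployment might be defined as follows:

#####################
# scaledobject.yaml #
#####################
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: azdevops-agent-scaler
spec:
  scaleTargetRef:
    name: azdevops-agent        # hypothetical Deployment running the agent image
  minReplicaCount: 1            # keep one warm agent to avoid job start latency
  maxReplicaCount: 5
  pollingInterval: 30           # seconds between queue checks
  cooldownPeriod: 300           # seconds to wait before scaling back down
  triggers:
    - type: azure-pipelines
      metadata:
        poolName: "my-agent-pool"                # placeholder pool name
        organizationURLFromEnv: "AZP_URL"        # env vars expected on the agent container
        personalAccessTokenFromEnv: "AZP_TOKEN"

Keeping minReplicaCount at 1 trades a small amount of idle cost for immediate job pickup; set it to 0 if scaling to zero matters more than start-up latency.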
3.2 ScaledJob: Managing Short-Lived, Batch Workloads
A ScaledJob is a KEDA custom resource designed to create and manage Kubernetes Jobs based on external events. Unlike ScaledObject, which scales long-running processes, ScaledJob is focused on short-lived tasks that can be processed independently.
Key Characteristics
- Best for Batch Processing: ScaledJobs are ideal for scenarios where tasks are discrete and stateless, such as processing messages from a queue or executing scheduled jobs.
- Dynamic Job Creation: It dynamically creates Kubernetes Jobs to handle workload spikes, ensuring that each unit of work is processed as soon as resources become available.
- Automatic Cleanup: Jobs created by a ScaledJob are automatically cleaned up after they finish processing, based on the configured successfulJobsHistoryLimit and failedJobsHistoryLimit.
Example Use Case for Azure DevOps Agent Pools
For scenarios where the job load is highly variable or bursty, and each pipeline run can be considered an independent task, using a ScaledJob can dynamically create short-lived agents that register themselves to the Azure DevOps pool, execute the assigned task, and terminate. This approach is particularly useful when the pipelines are short-lived and do not require persistent agents.
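A hedged sketch of such a ScaledJob (the image tag, secret name, and limits are illustrative assumptions; the registry name matches the ACR created later in this article) might look like this:

##################
# scaledjob.yaml #
##################
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: azdevops-agent-job
spec:
  pollingInterval: 15
  successfulJobsHistoryLimit: 5   # keep a few finished Jobs around for inspection
  failedJobsHistoryLimit: 5
  maxReplicaCount: 10
  jobTargetRef:
    template:
      spec:
        restartPolicy: Never
        containers:
          - name: azdevops-agent
            image: uksbuildagentacrjdkedapoc.azurecr.io/azdevops-agent:latest  # assumed image tag
            env:
              - name: AZP_URL
                valueFrom:
                  secretKeyRef:
                    name: azdevops-secret   # hypothetical Secret holding the org URL and PAT
                    key: AZP_URL
              - name: AZP_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: azdevops-secret
                    key: AZP_TOKEN
  triggers:
    - type: azure-pipelines
      metadata:
        poolName: "my-agent-pool"
        organizationURLFromEnv: "AZP_URL"
        personalAccessTokenFromEnv: "AZP_TOKEN"

Each queued pipeline job then results in a fresh Kubernetes Job that registers an agent, runs the work, and exits.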
3.3 Choosing Between ScaledObject and ScaledJob
When deciding between ScaledObject and ScaledJob for managing Azure DevOps Agent Pools, consider the following factors:
Workload Type
- Use ScaledObject if the agents need to run continuously and handle a steady stream of tasks. This is suitable for long-running workloads where the agent is not terminated between jobs.
- Use ScaledJob if each pipeline job can be treated as an independent, short-lived task. This approach creates a new agent for each job, which is ideal for bursty or batch workloads.
Agent Lifecycle
- ScaledObject agents remain active even if no jobs are running, which is useful for reducing job start latency.
- ScaledJob agents are created and destroyed for each job, minimizing resource usage when no jobs are queued but potentially introducing a slight delay when starting new jobs.
Resource Optimization
- ScaledObject is better when you need to maintain a minimum number of agents to quickly handle new jobs.
- ScaledJob is more resource-efficient for intermittent or unpredictable workloads, as it only creates agents when jobs are present.
By understanding the unique features and capabilities of ScaledObject and ScaledJob, you can select the right approach for your specific Azure DevOps Agent Pool requirements.
4. Prerequisites and Setup for Implementing KEDA with Azure DevOps Agent Pools
We will cover the necessary prerequisites, including setting up Azure Kubernetes Service (AKS) and Azure Container Registry (ACR) with Terraform, along with building a Docker image for our agents.
Tools needed for this section:
- Helm (Kubernetes Package Manager) - [Install Guide]
- Azure CLI - [Install Guide]
- Kubernetes Command Line - [Install Guide]
- Terraform (Infrastructure as Code) - [Install Guide]
- Docker Desktop - [Install Guide]
4.1. Setting Up Azure Kubernetes Service (AKS) and Azure Container Registry (ACR)
In our example we will use the simplest configuration for our Kubernetes cluster, with only one node and no additional networking or security configuration, as this is only a proof of concept.
4.1.1. Terraform Configuration for AKS and ACR
The Terraform code below will create a few necessary resources in your Azure subscription:
- Resource Group (azurerm_resource_group): A resource group to hold all Azure resources.
- Azure Container Registry (azurerm_container_registry): The registry to store Docker images for your Azure DevOps agents.
- Azure Kubernetes Service (azurerm_kubernetes_cluster): An AKS cluster with a default node pool to deploy your workloads.
- Azure Role Assignment (azurerm_role_assignment): Necessary role assignment to ensure that the AKS cluster can pull images from the ACR.
Ensure your Terraform project has the following files:
├── main.tf
├── providers.tf
└── variables.tf (optional, if using variables)
################
# providers.tf #
################
terraform {
  required_version = ">=1.5"

  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~>4.0"
    }
    azuread = {
      source  = "hashicorp/azuread"
      version = "~>2.0"
    }
  }
}

provider "azurerm" {
  features {
    resource_group {
      prevent_deletion_if_contains_resources = false
    }
  }
}

provider "azuread" {
  # Azure AD provider can be used for advanced configuration if needed
}

###########
# main.tf #
###########
resource "azurerm_resource_group" "build-agent-rg" {
  name     = "uks-build-agent-rg"
  location = "UK South"
}

resource "azurerm_container_registry" "build-agent-registry" {
  name                = "uksbuildagentacrjdkedapoc"
  resource_group_name = azurerm_resource_group.build-agent-rg.name
  location            = azurerm_resource_group.build-agent-rg.location
  sku                 = "Basic"
  admin_enabled       = true
}

resource "azurerm_kubernetes_cluster" "build-agent-cluster" {
  name                = "uks-build-agent-aks"
  location            = azurerm_resource_group.build-agent-rg.location
  resource_group_name = azurerm_resource_group.build-agent-rg.name
  dns_prefix          = "uks-build-agent-aks"

  default_node_pool {
    name       = "default"
    node_count = 1
    vm_size    = "Standard_D2_v2"
  }

  identity {
    type = "SystemAssigned"
  }

  tags = {
    Environment = "Test"
  }
}

resource "azurerm_role_assignment" "acr-role-assignment" {
  principal_id                     = azurerm_kubernetes_cluster.build-agent-cluster.kubelet_identity[0].object_id
  role_definition_name             = "AcrPull"
  scope                            = azurerm_container_registry.build-agent-registry.id
  skip_service_principal_aad_check = true
}
Run the following commands in your Terraform directory to deploy the infrastructure:
# Initialize the Terraform working directory
terraform init

# Check plan configuration to be applied
terraform plan

# Apply the Terraform configuration to create the resources
terraform apply -auto-approve

Once all commands complete successfully, all the required resources should be created in Azure.


After the AKS cluster is deployed, run the command below to configure your local kubectl client to interact with it. The command retrieves the AKS credentials and merges them into your kubeconfig file, allowing you to use kubectl to manage the cluster:
az aks get-credentials --resource-group uks-build-agent-rg --name uks-build-agent-aks
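To confirm that the credentials were merged correctly, you can list the cluster nodes:

# Verify that kubectl can reach the new cluster
kubectl get nodes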
4.1.2. Docker Image for Azure DevOps Agent
The next step is preparing a custom Dockerfile for the image that Kubernetes will use as the source image for the agent containers. You can find examples directly on the Microsoft site, but in our case we will utilize an image definition from Martin Lakov, along with an edited start.sh script that has additional flags for removing unneeded pipeline agents and killing the container process after job completion.
##############
# Dockerfile #
##############
FROM ubuntu:20.04

# Set DEBIAN_FRONTEND and TARGETARCH environment variables
ENV DEBIAN_FRONTEND=noninteractive \
    TARGETARCH=linux-x64

# Combine apt-get update, upgrade, package installation, Azure CLI installation, and PowerShell installation into one RUN command
RUN apt-get update && \
    apt-get upgrade -y && \
    apt-get install -y -qq --no-install-recommends \
        apt-transport-https \
        apt-utils \
        ca-certificates \
        curl \
        git \
        iputils-ping \
        jq \
        lsb-release \
        software-properties-common \
        wget && \
    curl -sL https://aka.ms/InstallAzureCLIDeb | bash && \
    wget -q https://github.com/PowerShell/PowerShell/releases/download/v7.1.5/powershell-7.1.5-linux-x64.tar.gz -O /tmp/powershell.tar.gz && \
    mkdir -p /opt/microsoft/powershell/7 && \
    tar zxf /tmp/powershell.tar.gz -C /opt/microsoft/powershell/7 && \
    ln -s /opt/microsoft/powershell/7/pwsh /usr/bin/pwsh && \
    rm -rf /var/lib/apt/lists/* /tmp/powershell.tar.gz

# Set working directory
WORKDIR /azp

# Copy the startup script and ensure it's executable
COPY --chmod=755 ./start.sh .

# Set the entry point
ENTRYPOINT [ "./start.sh" ]
############
# start.sh #
############
#!/bin/bash
set -e

if [ -z "$AZP_URL" ]; then
  echo 1>&2 "error: missing AZP_URL environment variable"
  exit 1
fi

if [ -z "$AZP_TOKEN_FILE" ]; then
  if [ -z "$AZP_TOKEN" ]; then
    echo 1>&2 "error: missing AZP_TOKEN environment variable"
    exit 1
  fi
  AZP_TOKEN_FILE=/azp/.token
  echo -n "$AZP_TOKEN" > "$AZP_TOKEN_FILE"
fi

unset AZP_TOKEN

if [ -n "$AZP_WORK" ]; then
  mkdir -p "$AZP_WORK"
fi

export AGENT_ALLOW_RUNASROOT="1"

cleanup() {
  if [ -e config.sh ]; then
    print_header "Cleanup. Removing Azure Pipelines agent..."
    # If the agent still has running jobs, the configuration removal will fail,
    # so give it some time to finish the job before retrying.
    while true; do
      ./config.sh remove --unattended --auth PAT --token $(cat "$AZP_TOKEN_FILE") && break
      echo "Retrying in 30 seconds..."
      sleep 30
    done
  fi
}
print_header() {
lightcyan='