Building Resilient Public Networking on AWS - Part 5 - Managing HTTP Clients and TCP Persistent Connections for Failover

Jaime Navarro Santapau

Updated September 5, 2025

8 minutes

Welcome to the concluding chapter of our blog series. Over the course of this series, we’ve explored advanced networking strategies for regional evacuation, failover, and robust disaster recovery. Let’s recap our journey:

Revisiting Networking Concepts from the Client’s Perspective: We laid the foundation by exploring essential networking concepts, offering insights into client interactions and network design principles. Catch up here.
Deploy Secure Public Web Endpoints: Next, we demonstrated how to deploy public web endpoints securely on AWS, manage DNS with Route 53, and integrate with third-party DNS hosting providers. The detailed guide is here.
Region Evacuation with DNS Approach: We discussed deploying multi-region web server infrastructure and examined DNS-based region evacuation using AWS Route 53. Explore the details here.
Region Evacuation with Static Anycast IP Approach: In our fourth post, we explored a better regional evacuation approach, using AWS Global Accelerator with static anycast IPs, analyzing their advantages and trade-offs. Review it here.
In this final post, we turn our attention to a critical, often-overlooked factor: client TCP persistent connections. We’ll discuss how these persistent connections can impact the failover and disaster recovery strategies we’ve outlined and why addressing this behavior is important for ensuring seamless application availability.

As in previous posts, we have a GitHub repository to complement this blog series. It provides Infrastructure as Code (IaC) using AWS Cloud Development Kit (CDK), allowing you to deploy and manage the necessary infrastructure effortlessly.

Introduction

The two regional evacuation strategies explored in the previous posts improve resiliency, but they share an important limitation: they only take effect when HTTP clients establish new TCP connections, because only a new connection is routed to a healthy region.

Generally, failover works as expected when a failing backend responds slowly or not at all. The resulting timeouts break existing TCP connections, so HTTP clients open new connections, which are routed to a healthy region.

But what happens if the backend responds quickly with an error instead of timing out? In this scenario, the HTTP client may retry quickly, keeping the existing TCP connection alive. This creates a feedback loop that prevents the connection from being rerouted to another region, leading to degraded failover behavior.
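The behavior described above can be reproduced locally. The following is a minimal sketch, not the setup used in this series: it assumes a hypothetical local server that stands in for an unhealthy backend and answers every request immediately with a 503. The names FastErrorHandler and retry_on_one_connection are illustrative only.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class FastErrorHandler(BaseHTTPRequestHandler):
    """Hypothetical unhealthy backend: answers immediately with a 503."""

    protocol_version = "HTTP/1.1"  # enables keep-alive on the server side

    def do_GET(self):
        body = b"unavailable"
        self.send_response(503)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def retry_on_one_connection(retries=3):
    """Retry a failing request and record the client port used each time."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), FastErrorHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1])
    results = []
    try:
        for _ in range(retries):
            conn.request("GET", "/")
            response = conn.getresponse()
            response.read()  # drain the body so the socket can be reused
            # Same local port on every retry means the same TCP connection.
            results.append((response.status, conn.sock.getsockname()[1]))
    finally:
        conn.close()
        server.shutdown()
        server.server_close()
    return results
```

Every retry returns a 503 from the same client port, which means the error responses never force a new TCP connection. Behind a regional failover mechanism, such a client would keep talking to the failed region.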

Let’s closely examine this critical challenge and explore how to address it.

Why HTTP Clients Use TCP Persistent Connections

TCP persistent connections provide significant performance advantages by reducing latency for each HTTP request. Instead of opening a new connection for every request, clients reuse an existing connection, avoiding the overhead of repeatedly establishing new connections.

Each new connection introduces latency due to:

TCP Handshake: A three-step process required to establish a reliable connection between client and server.
SSL/TLS Handshake: For HTTPS, a secure handshake adds additional delay as cryptographic keys are exchanged.

By maintaining a persistent connection, HTTP clients skip these steps on subsequent requests, which lowers the latency of every request after the first.
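To get a feel for the saving, here is a rough back-of-the-envelope sketch. It assumes the TCP handshake costs one round trip before data can flow, and that a full TLS handshake adds one round trip with TLS 1.3 or two with TLS 1.2 (no session resumption). The 80 ms round-trip time in the usage note is a hypothetical value, not a measurement from this series.

```python
# Round trips needed before the first HTTP request can be sent.
TCP_HANDSHAKE_RTTS = 1
TLS_HANDSHAKE_RTTS = {"1.2": 2, "1.3": 1}  # full handshakes, no resumption


def connection_setup_ms(rtt_ms, tls_version="1.3"):
    """Estimated setup time for one new HTTPS connection."""
    return rtt_ms * (TCP_HANDSHAKE_RTTS + TLS_HANDSHAKE_RTTS[tls_version])


def total_setup_ms(requests, rtt_ms, tls_version="1.3", persistent=True):
    """Estimated setup time spent across a series of requests."""
    connections = 1 if persistent else requests
    return connections * connection_setup_ms(rtt_ms, tls_version)
```

With an 80 ms round trip and TLS 1.3, 100 requests on new connections spend about 16 seconds on handshakes in total, while a single persistent connection spends about 160 ms. This estimate ignores server processing time and packet loss.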

Common HTTP Clients For Testing

It’s also important to recognize that different HTTP clients behave differently, which can impact testing results. Understanding how each client manages TCP persistent connections helps ensure accurate evaluations of your failover strategies.

Below is a summary of the common HTTP clients used for testing:

Curl: A command-line HTTP client widely used for quick networking tests and validations. However, it does not keep TCP connections alive across separate invocations, because the operating system releases the associated resources, including sockets, when the command completes. Each new curl command therefore opens a new TCP connection.
Postman: A popular tool for API development and testing. It supports TCP persistent connections by keeping the process active between requests, which allows it to send multiple HTTP requests over the same TCP connection.

Consider the details described above when testing failover and disaster recovery. To better understand HTTP client behavior, revisit our first blog post in this series.

Demonstrating the Impact of TCP Persistence on Regional Failover

In this section, we will demonstrate how TCP persistent connections can disrupt the regional evacuation process.

For this example, we build on the infrastructure from our previous blog post, “Region Evacuation with Static Anycast IP Approach – AWS Global Accelerator.” You could also replicate this scenario using the DNS-based evacuation strategy described in our third post, “Region Evacuation with DNS Approach – AWS Route 53.”

Step 1 – Review which is your closest AWS region

To determine your closest region, you can use the following curl command, which sends a request through the AWS Global Accelerator and returns the region to which your request is routed:

curl https://global-accelerator.subdomain-2.subdomain-1.cloudns.ph 
 { 
  "MessageResponse": "Region: us-east-1. HTTP Response code: 200. Delay response: 0 seconds.", 
  "SocketRequest": "::ffff:10.0.206.182:16068", 
  "HeadersRequest": { 
    "x-forwarded-for": "79.117.226.67", 
    "x-forwarded-proto": "https", 
    "x-forwarded-port": "443", 
    "host": "global-accelerator.subdomain-2.subdomain-1.cloudns.ph", 
    "x-amzn-trace-id": "Root=1-6787c723-0294604c58cfa08c3fc1bf48", 
    "user-agent": "curl/8.7.1", 
    "accept": "*/*" 
  }, 
  "HeadersResponse": { 
    "x-powered-by": "Express" 
  } 
}

Once we know which region our requests are routed to, we can start the demo by creating an HTTP client that keeps a TCP connection alive, using this retry policy:

Retry every 30 seconds
Execute 120 retries

To create this HTTP client, we will use Postman Retry Policy for Testing. This setup simulates a client sending HTTP requests every 30 seconds for one hour.
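If you prefer a script over Postman, the same retry policy could be approximated with a few lines of Python. This is only a sketch under assumptions: run_retry_policy is a hypothetical helper, the connection_factory and sleep parameters exist solely so the loop can be exercised without a network, and the loop does not handle a connection that the server closes in the meantime.

```python
import http.client
import time


def run_retry_policy(host, retries=120, interval_s=30, path="/",
                     connection_factory=http.client.HTTPSConnection,
                     sleep=time.sleep):
    """Send `retries` GET requests over one persistent connection."""
    conn = connection_factory(host)
    statuses = []
    try:
        for attempt in range(retries):
            conn.request("GET", path)
            response = conn.getresponse()
            response.read()  # drain the body so the connection can be reused
            statuses.append(response.status)
            if attempt < retries - 1:
                sleep(interval_s)  # wait before the next retry
    finally:
        conn.close()
    return statuses
```

Calling run_retry_policy with the Global Accelerator hostname from Step 1 would send one request every 30 seconds for roughly an hour. Because 30 seconds is shorter than the default ALB idle timeout of 60 seconds, the connection is not expected to be closed for being idle.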

Step 2 – Simulating Failure (us-east-1)

Once the previous HTTP client is running, we can simulate a failure in a region. To do so, we will stop the containers in that region (in our case, us-east-1). As a result, our ALB will become unhealthy and start returning an HTTP 503 error code when attempting to reach our web server.

Use the AWS Console to update our Fargate task configuration:

Open deployment configuration for your Fargate task
Set the deployment config of “Desired tasks” to 0
Click the button “Update” at the bottom of the page.

Once the containers are stopped, the ALB health checks start to fail and the ALB is reported as unhealthy. This triggers AWS Global Accelerator to stop sending traffic to the unhealthy region and reroute it to the remaining healthy regions.
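The console steps above could also be scripted. The sketch below assumes boto3 is installed and AWS credentials are configured. The cluster and service names are placeholders you would replace with the ones created by your own deployment, and both function names are hypothetical.

```python
def evacuation_update(cluster, service):
    """Arguments for the ECS UpdateService call that scales to zero tasks."""
    return {"cluster": cluster, "service": service, "desiredCount": 0}


def stop_region_tasks(cluster, service, region="us-east-1"):
    """Scale the service down to simulate a failure (calls AWS)."""
    # boto3 is a third-party package; it is imported here so the helper
    # above can be used and tested without it.
    import boto3

    ecs = boto3.client("ecs", region_name=region)
    return ecs.update_service(**evacuation_update(cluster, service))
```

Setting the desired count back to its original value restores the region after the test.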

Step 3 – New TCP connections are routed to us-west-2

To confirm that AWS Global Accelerator has stopped sending traffic to the unhealthy region (us-east-1) and has started rerouting traffic to the healthy regions, we can use a simple curl command. As mentioned previously, each curl invocation creates a new TCP connection before sending its HTTP request.

curl https://global-accelerator.subdomain-2.subdomain-1.cloudns.ph   
{  
  "MessageResponse": "Region: us-west-2. HTTP Response code: 200. Delay response: 0 seconds.",  
  "SocketRequest": "::ffff:10.0.198.70:30538",  
  "HeadersRequest": {  
    "x-forwarded-for": "79.117.226.67",  
    "x-forwarded-proto": "https",  
    "x-forwarded-port": "443",  
    "host": "global-accelerator.subdomain-2.subdomain-1.cloudns.ph",  
    "x-amzn-trace-id": "Root=1-677fa9aa-16f2863a081d9dac08d359f2",  
    "user-agent": "curl/8.7.1",
    "accept": "*/*"  
  },  
  "HeadersResponse": {  
    "x-powered-by": "Express"  
  }  
}
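When repeating this check many times, it can help to extract the region from the response automatically. The following sketch assumes the response format shown above, where the region appears in the "MessageResponse" field; serving_region and fetch_serving_region are hypothetical helper names.

```python
import json
import re
import urllib.request

REGION_PATTERN = re.compile(r"Region: ([a-z0-9-]+)\.")


def serving_region(response_body):
    """Extract the region name from the demo endpoint's JSON response."""
    message = json.loads(response_body)["MessageResponse"]
    match = REGION_PATTERN.search(message)
    return match.group(1) if match else None


def fetch_serving_region(url):
    """Ask the endpoint which region answers a brand-new connection."""
    # urllib opens a new TCP connection for each call, like a fresh curl run.
    with urllib.request.urlopen(url, timeout=10) as response:
        return serving_region(response.read())
```

Calling fetch_serving_region with the Global Accelerator URL before and after the simulated failure should show the region changing, assuming the evacuation has completed.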

Step 4 – Old TCP connections still reach us-east-1

Even after traffic is rerouted to us-west-2, TCP persistent connections established before the failure may still attempt to reach us-east-1. In this example, Postman, configured with a retry policy every 30 seconds, continues to send requests to the unhealthy region for an hour after the failure occurred. This highlights the limitation of persistent connections during regional evacuations.

Possible solutions

To mitigate the issues caused by persistent TCP connections during regional evacuations, several strategies can help manage client behavior and improve failover responsiveness. Below are some recommended approaches.

Update ALB config

Reducing the idle timeout values in your Application Load Balancer (ALB) can help close connections more quickly, but this comes with trade-offs that may affect performance.

Key timeout configurations to consider:

Connection Idle Timeout (Default: 60 seconds): Determines how long an idle connection remains open. Any data sent or received resets the timer.
HTTP Client Keep-Alive Duration (Default: 1 hour): Specifies the maximum time a client connection can stay open, even if data is actively transferred.

Lowering these values enforces quicker connection closures, which is beneficial during failover. However, it can increase overhead because connection handshakes happen more frequently, potentially impacting performance under normal conditions.

It’s important to remember that with the default ALB settings, a TCP connection can persist for up to one hour (HTTP Client Keep-Alive Duration) as long as the client keeps sending requests before the idle timeout expires. Only after that will the client create a new connection. At that point, traffic will be routed to a healthy region, but waiting an hour might not be feasible if you need immediate redirection.
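As an illustration, both timeouts can be changed through the load balancer attributes. The sketch below assumes boto3 and configured credentials. The attribute keys idle_timeout.timeout_seconds and client_keep_alive.seconds and the value ranges in the checks reflect my understanding of the ALB documentation, so verify them against the current reference before relying on them. The function names and the example values are hypothetical.

```python
def alb_timeout_attributes(idle_timeout_s=60, client_keep_alive_s=3600):
    """Build the attribute list for the two ALB timeouts discussed above."""
    # Ranges below are assumptions based on the ALB documentation.
    if not 1 <= idle_timeout_s <= 4000:
        raise ValueError("idle timeout must be between 1 and 4000 seconds")
    if not 60 <= client_keep_alive_s <= 604800:
        raise ValueError("keep-alive must be between 60 and 604800 seconds")
    return [
        {"Key": "idle_timeout.timeout_seconds", "Value": str(idle_timeout_s)},
        {"Key": "client_keep_alive.seconds", "Value": str(client_keep_alive_s)},
    ]


def apply_alb_timeouts(load_balancer_arn, region, **timeouts):
    """Apply the timeouts to an existing ALB (calls AWS)."""
    import boto3  # third-party package, imported only when AWS is called

    elbv2 = boto3.client("elbv2", region_name=region)
    return elbv2.modify_load_balancer_attributes(
        LoadBalancerArn=load_balancer_arn,
        Attributes=alb_timeout_attributes(**timeouts),
    )
```

For example, a 300-second keep-alive would cap the lifetime of a persistent connection at five minutes instead of one hour, at the cost of more frequent handshakes.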

Force Connection Closures at the Backend Level

For situations where performance must remain optimal and permanently reducing idle timeout values is undesirable, dynamically forcing connection closures can be a more flexible solution.

Steps to Implement Dynamic Connection Closures:

Initiate Regional Evacuation: Begin the process of shifting traffic away from the impacted region.
Update ALB Listener Configuration in the affected region: Temporarily modify the listener to force TCP connection resets. One method is changing the listener port or using different configurations that cause connection disruptions.
Trigger New TCP Connections: Clients will be forced to establish new connections, which will be routed to the healthy region as per the failover mechanism.
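The listener update in the second step could be scripted roughly as follows. This is a sketch under assumptions, not a tested runbook: it uses the listener port change mentioned above, assumes boto3 and configured credentials, and the function names are hypothetical. How clients observe the change depends on the ALB, so try it in a non-production environment first.

```python
def listener_port_change(listener_arn, temporary_port):
    """Arguments for the ModifyListener call that moves the listener."""
    if not 1 <= temporary_port <= 65535:
        raise ValueError("port must be between 1 and 65535")
    return {"ListenerArn": listener_arn, "Port": temporary_port}


def move_listener(listener_arn, temporary_port, region):
    """Move the listener in the evacuated region to another port (calls AWS)."""
    import boto3  # third-party package, imported only when AWS is called

    elbv2 = boto3.client("elbv2", region_name=region)
    return elbv2.modify_listener(
        **listener_port_change(listener_arn, temporary_port)
    )
```

Remember to restore the original port once the region is healthy again, otherwise clients will not be able to reach it when traffic is shifted back.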

By combining these strategies, you can greatly enhance the resilience and responsiveness of your regional failover scenarios. The optimal approach depends on your control over client applications, your specific infrastructure requirements, and your tolerance for trade-offs in connection latency and performance.

Additional Resources

AWS CDK Getting Started: A comprehensive guide introducing core concepts of the AWS Cloud Development Kit (CDK) and providing step-by-step instructions for installation and configuration.
AWS Global Accelerator Documentation: Review the official documentation to learn more about AWS Global Accelerator, its capabilities, and configuration options.
GitHub Repository for Practical Examples: Access the GitHub repository featured in this blog series for hands-on examples, infrastructure-as-code templates, and detailed implementation guides.

Written by

Jaime Navarro Santapau

I'm a Senior DevOps Engineer with over 18 years of experience starting as a Software Engineer followed by Backend Software Development Lead. Currently, I manage complex infrastructure and deliver scalable solutions. Proficient in cloud platforms like AWS and Azure, containerization technologies (Docker, Kubernetes), and automation tools. Proven track record in driving successful DevOps implementations, streamlining workflows, and improving team collaboration. Strong expertise in CI/CD pipelines, monitoring, and scripting languages.
