In my last post, I showed how to connect AWS API Gateway directly to SNS using a service integration.
A few people asked me about the performance implications of this architecture.
Is it significantly faster than using a Lambda-based approach?
How does it compare to EC2 or ECS?
My answer: I don't know! But I know how to find out (sort of).
In this post, we do a performance bake-off of three ways to deploy the same HTTP endpoint in AWS:
- Using an API Gateway service proxy
- With the new hotness, AWS Lambda
- With the old hotness, Docker containers on AWS Fargate
We'll deploy our three services and throw 15,000 requests at each of them. Who will win?
If you're impatient, skip ahead to the full results below.
Before we review the results, let's set up the problem.
I wanted to keep our example as simple as possible so that the comparison is limited to the architecture itself rather than the application code. Further, I wanted an example that would work with the API Gateway service proxy so we could use it as a comparison as well.
I decided to set up a simple endpoint that receives an HTTP POST request and forwards the request payload into an AWS SNS topic.
Let's take a look at the architecture and deployment methods for each of our three approaches.
Go Serverless with AWS Lambda
A user will make an HTTP POST request to our endpoint, which will be handled by API Gateway. API Gateway will forward the request to our AWS Lambda function for processing. The Lambda function will send our request payload to the SNS topic before returning a response.
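The function itself can be just a few lines. Here's a rough sketch (the `TOPIC_ARN` environment variable and the exact response shape are my placeholders, not a canonical implementation):

```python
import json
import os


def publish_args(event):
    # Pure helper: turn the API Gateway proxy event into SNS publish kwargs.
    # API Gateway puts the raw request payload in event["body"].
    return {
        "TopicArn": os.environ.get("TOPIC_ARN", ""),
        "Message": event.get("body") or "",
    }


def handler(event, context):
    # boto3 is imported lazily here only so the sketch imports cleanly
    # outside AWS; in a real function you'd create the client at module
    # level so it's reused across invocations.
    import boto3

    boto3.client("sns").publish(**publish_args(event))
    return {"statusCode": 200, "body": json.dumps({"status": "sent"})}
```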
Skipping the middleman with API Gateway service proxy
The second approach is similar to the first, but we remove Lambda from the equation. We use an API Gateway service proxy integration to publish directly to our SNS topic from API Gateway:
Before doing any testing, my hunch is that this will be faster than the previous method, since we're cutting out a network hop in the middle. Check below for the full results. Note that API Gateway service proxies won't fit every use case, even where they are faster.
If you want additional details on how, when, and why to use this, check out my earlier post on using an API Gateway service proxy integration. It does a step-by-step walkthrough of setting up your first service proxy.
To deploy this example, there is a CloudFormation template here. This will let you quickly spin up the stack for testing.
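The heart of that template is a single `AWS::ApiGateway::Method` whose integration points straight at SNS. A sketch of what that resource looks like (resource names like `Api`, `Topic`, and `ApiGatewayRole` are placeholders; check the real template for the details):

```yaml
PublishMethod:
  Type: AWS::ApiGateway::Method
  Properties:
    RestApiId: !Ref Api
    ResourceId: !GetAtt Api.RootResourceId
    HttpMethod: POST
    AuthorizationType: NONE
    Integration:
      Type: AWS                                 # service integration -- no Lambda
      IntegrationHttpMethod: POST
      Uri: !Sub arn:aws:apigateway:${AWS::Region}:sns:path//
      Credentials: !GetAtt ApiGatewayRole.Arn   # role allowed to sns:Publish
      RequestParameters:
        integration.request.querystring.Action: "'Publish'"
        integration.request.querystring.TopicArn: !Sub "'${Topic}'"
        integration.request.querystring.Message: method.request.body
      IntegrationResponses:
        - StatusCode: 200
    MethodResponses:
      - StatusCode: 200
```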
Containerizing your workload with Docker and AWS Fargate
The final approach is to run our compute in Docker containers. There are a few different approaches for doing this on AWS, but I chose to use AWS Fargate.
The architecture will look as follows:
Users will make HTTP POST requests to an HTTP endpoint, which will be handled by an Application Load Balancer (ALB). This ALB will forward requests to our Fargate container instances. The application on our Fargate container instances will forward the request payload to SNS.
With Fargate, you can run tasks or services. A task is a one-off container that runs until it exits or finishes its work. A service is a managed set of task instances; Fargate ensures the desired number of instances stays running.
We'll use a service so that we can run a sufficient number of instances. Further, you can easily set up a load balancer for managing HTTP traffic across your service instances.
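The container application itself is tiny. Here's a sketch of the Flask version (the route, the `TOPIC_ARN` environment variable, and the app-factory shape are my guesses, not the exact code I ran):

```python
import json
import os


def publish_args(raw_body: bytes):
    # Pure helper: turn the raw POST body into SNS publish kwargs.
    return {
        "TopicArn": os.environ.get("TOPIC_ARN", ""),
        "Message": raw_body.decode("utf-8"),
    }


def create_app():
    # Flask and boto3 are imported lazily so the helper above can be
    # exercised without either library installed.
    from flask import Flask, request
    import boto3

    app = Flask(__name__)
    sns = boto3.client("sns")

    @app.route("/", methods=["POST"])
    def forward():
        sns.publish(**publish_args(request.get_data()))
        return json.dumps({"status": "sent"}), 200

    return app


if __name__ == "__main__":
    # In the container, this would sit behind the ALB (ideally via a
    # production server like gunicorn rather than the dev server).
    create_app().run(host="0.0.0.0", port=8080)
```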
Now that we know our architecture, let's jump into the bakeoff!
After I deployed all three of the architectures, I wanted to do testing in two phases.
First, I ran a small sample of 2000 requests to check the performance of new deploys. This was running at around 40 requests per second.
Then, I ran a larger test of 15,000 requests to see how each architecture performed once warmed up. For this larger test, I was sending around 100 requests per second.
Let's check the results in order.
Sidebar: Fargate performance tuning
When I ran my initial Fargate warmup, I got the following results:
Around 10% of my requests were failing altogether!
When I dug in, it looked like I was overwhelming my container instances, causing them to die.
I'm not a Docker or Flask performance expert, and that's not the goal of this exercise. To remedy this, I decided to bump the specs on my deployments.
The general goal for this bakeoff is to get a best-case outcome for each of these architectures, rather than an apples-to-apples comparison of cost vs performance.
For Fargate, this meant deploying 50 instances of my container with pretty beefy settings -- 8 GB of memory and 4 full CPU units per container instance.
For the Lambda function, I set the memory to its maximum of 3 GB.
For APIG service proxy, there are no knobs to tune. 🎉
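In task definition terms, the Fargate sizing above corresponds to the top-level `cpu` and `memory` fields (`cpu` is in CPU units, where 1024 = 1 vCPU, and `memory` is in MiB; the rest of the definition is elided):

```json
{
  "family": "sns-forwarder",
  "requiresCompatibilities": ["FARGATE"],
  "networkMode": "awsvpc",
  "cpu": "4096",
  "memory": "8192"
}
```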
With that out of the way, let's check the initial results.
Initial warmup results
For the first 2000 requests to each type of endpoint, the performance results are as follows:
Note: Chart using a log scale
The raw data for the results are:

[table: warmup latency percentiles for API Gateway service proxy, AWS Lambda, and Fargate]
Takeaways from the warmup test
- Fargate was consistently the fastest across all percentiles.
- AWS Lambda had the longest tail of the three. This is due to the cold-start problem.
- API Gateway service proxy outperformed AWS Lambda at the median, but performance in the upper-middle of the range (75% - 99%) was pretty similar between the two.
Now that we've done our warmup test, let's check out the results from the full performance test.
Full performance test results
For the main part of the performance test, I sent 15,000 requests to each of the three architectures. I planned to use 500 'users' in Locust to accomplish this, though, as noted below, I had to make some modifications for Fargate.
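For context, a Locust test along these lines looks roughly like this (the payload shape and wait times are my own placeholders, and this uses Locust's current `HttpUser` API):

```python
import json
import random


def make_payload():
    # Pure helper: the JSON body each simulated user POSTs.
    return json.dumps({"id": random.randint(1, 100000), "message": "hello"})


try:
    from locust import HttpUser, between, task

    class EndpointUser(HttpUser):
        # Each simulated user waits 0.5-1.5s between requests.
        wait_time = between(0.5, 1.5)

        @task
        def post_payload(self):
            self.client.post(
                "/",
                data=make_payload(),
                headers={"Content-Type": "application/json"},
            )
except ImportError:
    # Locust isn't needed just to import the payload helper.
    pass
```

Run it with `locust -f locustfile.py --host <your endpoint>` and set the user count from the web UI.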
First, let's check the results:
Note: Chart using a log scale
The raw data for the results are:

[table: full-test latency percentiles for API Gateway service proxy, AWS Lambda, and Fargate]
Takeaways from the full performance test
- Fargate was still the fastest across the board, though the gap narrowed. API Gateway service proxy was nearly as fast as Fargate at the median, and AWS Lambda wasn't far behind.
- The real differences show up between the 80th and 99th percentiles. Fargate's performance was much more consistent as it moved up the percentiles: its 98th-percentile request was less than double its median (130ms vs 69ms). In contrast, the 98th percentile for API Gateway service proxy was more than triple its median (250ms vs 73ms).
- AWS Lambda outperformed the API Gateway service proxy at some higher percentiles. Between the 95th and 99th percentiles, AWS Lambda was actually faster than the API Gateway service proxy. This was surprising to me.
Another sidebar on Fargate
I mentioned above that I wanted to use 500 Locust 'users' when testing the application. Both AWS Lambda and API Gateway service proxy handled 15,000+ requests without a single error.
With Fargate, I consistently had failed requests:
I finally throttled it down to 200 Locust users when testing Fargate, which got my error rate down to around 3% of overall requests. Still, this was infinitely higher than the error rate with AWS Lambda.
I'm not saying you can't deploy a Fargate service without failures. Rather, performance-tuning Docker containers would have taken more time than I wanted to spend on a quick performance test.
Updated notes on Fargate errors
I've gotten some pushback saying that the test is worthless due to the Fargate errors, or that I was way over-provisioned on Fargate.
A few notes on that:
First, Nathan Peck, an awesome and helpful container advocate at AWS, reached out to say the failures were likely around some system settings like the 'nofile' ulimit.
That sounds pretty reasonable to me, but I haven't taken the time to test it out. I don't have huge interest in digging deep into container performance tuning for this. If that's something you're into, let me know and I'll link to your results if they're interesting!
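For the record, raising that limit is a per-container setting in the ECS task definition. It would look something like this (the container name and limit values are illustrative):

```json
{
  "containerDefinitions": [
    {
      "name": "sns-forwarder",
      "ulimits": [
        {"name": "nofile", "softLimit": 65536, "hardLimit": 65536}
      ]
    }
  ]
}
```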
The key points on Fargate are:
- You can get much lower failure rates than I got. You'll just need to tune it.
- I didn't use 50 instances with a ton of CPU and memory because I thought Fargate needed it. I used them because I didn't want to think about resource exhaustion at all (even though I did end up hitting the open-file limits). I was going for a best-case scenario -- if the load balancer, container, and SNS are all humming, what kind of latency can we get?
- I don't think this invalidates the general results of what a basic 'optimistic-case' could look like with Fargate within these general constraints (multiple instances + Python + calling SNS).
If you're making a million dollar decision on this, you should run your own tests.
If you want a quick, fun read, these results should be directionally correct.
This was a fun and enlightening experience for me, and I hope it was helpful for you. There's not a clear right answer on which architecture you should use based on these performance results.
Here's how I think about it:
- Do you need high performance? Using dedicated instances with Fargate (or ECS/EKS/EC2) is your best bet. This will require more setup and infrastructure management, but that may be necessary for your use case.
- Is your business logic limited? If so, use API Gateway service proxy. API Gateway service proxy is a performant, low-maintenance way to stand up endpoints and forward data to another AWS service.
- In the vast majority of other situations, use AWS Lambda. Lambda is dead-simple to deploy (if you're using a deployment tool). It's reliable and scalable. You don't have to worry about tuning a bunch of knobs to get solid performance. And it's code, so you can do anything you want. I use it for almost everything.