Two options, one benchmark and bonus code
In this article, I'll show you how to forward events to private containers using serverless services and fan-out patterns.
I'll explore possible solutions within the AWS ecosystem, but they all apply regardless of the specific services or implementation.
Suppose you have a cluster of containers that you need to notify whenever a database record is inserted or changed, because those changes affect the application's internal state. A fairly common use case.
Let's say you have the following requirements:
Given these requirements, let's explore a few options.
Pros:
Cons:
In this kind of scenario, I definitely don't like polling.
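To make the comparison concrete, the polling option boils down to each task periodically querying the table for recent changes. Here is a hypothetical sketch (the table, index, and attribute names are mine, not taken from the repo):

// Hypothetical polling sketch: every container task runs this loop,
// querying the table even when nothing has changed.
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient, QueryCommand } from '@aws-sdk/lib-dynamodb';

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
let lastCheck = Date.now();

function applyToLocalState(item: Record<string, any>): void {
  // merge the change into the task's in-memory state
}

setInterval(async () => {
  // assumes a GSI with partition key "kind" and sort key "updatedAt" (illustrative names)
  const { Items } = await ddb.send(new QueryCommand({
    TableName: 'ItemsTable',
    IndexName: 'byUpdatedAt',
    KeyConditionExpression: '#kind = :kind AND updatedAt > :since',
    ExpressionAttributeNames: { '#kind': 'kind' },
    ExpressionAttributeValues: { ':kind': 'config', ':since': lastCheck },
  }));
  lastCheck = Date.now();
  for (const item of Items ?? []) {
    applyToLocalState(item);
  }
}, 5000);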
Let's try the opposite approach.
Instead of having tasks ask the database, let's have the database notify them of changes.
Before going into the pros and cons, I must say that it would be very hard, if not impossible, to implement this solution exactly as described, with the database notifying the containers directly. Instead, we can rely on a very popular pattern called fan-out.
This is the Wikipedia definition:
In message-oriented middleware solutions, fan-out is a messaging pattern used to model an information exchange that implies the delivery (or spreading) of a message to one or multiple destinations possibly in parallel, and not halting the process that executes the messaging to wait for any response to that message.
To make things a little more concrete, let's use some popular AWS services commonly used to implement this pattern:
DynamoDB Streams, to capture record changes as they happen
a Lambda function (the Stream Processor), to read the stream and publish each change
SNS, to fan the change out
SQS, with one queue per container task, so every task gets its own copy of the message
The solution looks like this:
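As a rough CDK v2 sketch of how these pieces could be wired together (construct names, runtime, and the number of per-task queues are illustrative, not necessarily what the repo uses):

import { Stack, StackProps } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import { DynamoEventSource } from 'aws-cdk-lib/aws-lambda-event-sources';
import * as sns from 'aws-cdk-lib/aws-sns';
import * as subs from 'aws-cdk-lib/aws-sns-subscriptions';
import * as sqs from 'aws-cdk-lib/aws-sqs';

export class FanOutStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // Table whose changes must reach every container task
    const table = new dynamodb.Table(this, 'ItemsTable', {
      partitionKey: { name: 'pk', type: dynamodb.AttributeType.STRING },
      stream: dynamodb.StreamViewType.NEW_AND_OLD_IMAGES,
    });

    // Topic used to fan the change events out
    const topic = new sns.Topic(this, 'ChangesTopic');

    // Stream Processor: reads the DynamoDB stream and publishes each change to SNS
    const streamProcessor = new lambda.Function(this, 'StreamProcessor', {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: 'index.handler',
      code: lambda.Code.fromAsset('lambda/stream-processor'),
      environment: { TOPIC_ARN: topic.topicArn },
    });
    streamProcessor.addEventSource(new DynamoEventSource(table, {
      startingPosition: lambda.StartingPosition.LATEST,
      batchSize: 1,
    }));
    topic.grantPublish(streamProcessor);

    // One queue per container task; each task polls its own queue
    const taskCount = 3;
    for (let i = 0; i < taskCount; i++) {
      const queue = new sqs.Queue(this, `TaskQueue${i}`);
      topic.addSubscription(new subs.SqsSubscription(queue));
    }
  }
}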
Now let's explore pros and cons:
Pros
Cons
Open points:
The main open point here, to me, was: is this fast enough? Let's verify it.
I couldn't find any official SLA covering latency for the services involved, nor any official AWS benchmark.
So I decided to run one myself, scripting a basic application with TypeScript and the CDK / SDK.
Here is the GitHub repo with the actual code and the details of how the system is implemented.
Before going ahead, bear in mind that I ran this benchmark to understand whether this combination of services and configuration could fit my specific context and use case. Your context may be different, and this configuration may not suit it.
Key system parameters:
Benchmark parameters
I used a basic Postman collection runner to send a mutation to AppSync every 5 seconds, for 720 iterations.
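For reference, the same traffic could be generated with a small script instead of Postman. Here is a sketch assuming an API-key-protected AppSync endpoint and a hypothetical putItem mutation (endpoint, key, and mutation shape are placeholders, not the repo's values):

// Sends one AppSync mutation every 5 seconds, 720 times (roughly one hour of traffic).
const APPSYNC_URL = process.env.APPSYNC_URL!;
const APPSYNC_API_KEY = process.env.APPSYNC_API_KEY!;

const mutation = `
  mutation PutItem($id: ID!, $payload: String!) {
    putItem(id: $id, payload: $payload) { id }
  }
`;

async function run(): Promise<void> {
  for (let i = 0; i < 720; i++) {
    const response = await fetch(APPSYNC_URL, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json', 'x-api-key': APPSYNC_API_KEY },
      body: JSON.stringify({
        query: mutation,
        variables: { id: `bench-${i}`, payload: new Date().toISOString() },
      }),
    });
    console.log(`iteration ${i}: HTTP ${response.status}`);
    await new Promise((resolve) => setTimeout(resolve, 5000));
  }
}

run().catch(console.error);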
Goal
The goal was to verify if containers would be updated within 2 seconds.
Measurements
I used the following CloudWatch-provided metrics:
and I created two custom metrics to measure the time taken by SNS and SQS.
Time-taken custom metrics are calculated from the SNS and SQS-provided attributes:
SNS Timestamp (from the AWS docs): "The time (GMT) when the notification was published."
ApproximateFirstReceiveTimestamp (from the AWS docs): "Returns the time the message was first received from the queue (epoch time in milliseconds)."
SentTimestamp (from the AWS docs): "Returns the time the message was sent to the queue (epoch time in milliseconds)."
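Note that these attributes are only returned if the consumer asks for them when receiving messages. A minimal receive sketch with the SDK v3 (the queue URL comes from an environment variable in this example) might look like this:

import { SQSClient, ReceiveMessageCommand } from '@aws-sdk/client-sqs';

const sqs = new SQSClient({});

// Long-poll the task's queue and ask SQS for the system attributes used in the snippet below
const { Messages } = await sqs.send(new ReceiveMessageCommand({
  QueueUrl: process.env.QUEUE_URL!,   // each task polls its own queue
  WaitTimeSeconds: 20,                // long polling
  MaxNumberOfMessages: 10,
  AttributeNames: ['All'],            // includes SentTimestamp and ApproximateFirstReceiveTimestamp
}));

for (const message of Messages ?? []) {
  // The SNS envelope (with its Timestamp field) is in the message body
  const messageBody = JSON.parse(message.Body!);
  // ...latency calculation as shown in the next snippet...
}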
The following code snippet shows how these attributes are used to calculate the SNS and SQS time taken in milliseconds:
// Despite the name, this is the ISO date the message was published to the SNS topic
let snsReceivedISODate = messageBody.Timestamp;
if (snsReceivedISODate && message.Attributes) {
  // Epoch millis at which the consumer first received the message from the queue
  clientReceivedTimestamp = +message.Attributes.ApproximateFirstReceiveTimestamp!;
  // Epoch millis at which the message was sent to the queue (i.e. delivered by SNS)
  sqsReceivedTimestamp = +message.Attributes.SentTimestamp!;
  let snsReceivedDate = new Date(snsReceivedISODate);
  snsReceivedTimestamp = snsReceivedDate.getTime();
  clientReceivedDate = new Date(clientReceivedTimestamp);
  sqsReceivedDate = new Date(sqsReceivedTimestamp);
  // Time spent in SNS: from publish to delivery to the queue
  snsTimeTakenInMillis = sqsReceivedTimestamp - snsReceivedTimestamp;
  // Time spent in SQS: from delivery to the queue to first receive by the consumer
  sqsTimeTakenInMillis = clientReceivedTimestamp - sqsReceivedTimestamp;
}
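These two values can then be pushed to CloudWatch as custom metrics. A minimal sketch with the SDK v3 follows; the namespace and metric names are illustrative, not necessarily the ones used in the repo:

import { CloudWatchClient, PutMetricDataCommand } from '@aws-sdk/client-cloudwatch';

const cloudwatch = new CloudWatchClient({});

// Publish both latency samples under a custom namespace
// (assumes snsTimeTakenInMillis and sqsTimeTakenInMillis from the previous snippet are in scope)
await cloudwatch.send(new PutMetricDataCommand({
  Namespace: 'FanOutBenchmark',
  MetricData: [
    { MetricName: 'SnsTimeTakenInMillis', Value: snsTimeTakenInMillis, Unit: 'Milliseconds' },
    { MetricName: 'SqsTimeTakenInMillis', Value: sqsTimeTakenInMillis, Unit: 'Milliseconds' },
  ],
}));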
I didn't measure the time taken by the client to parse the message, because that depends entirely on the logic the client applies when parsing it.
Results
Disclaimer: some latency measurements are calculated on the consumers' side, and we all know that synchronising clocks in a distributed system is a hard problem.
Still, measurements are performed by the same computing nodes.
Please consider the following latencies not as precise measurements, but as coarse indicators.
Here are screenshots from my CloudWatch dashboard:
Here is some key data, from the average numbers:
This solution has proven to be fast and reliable and requires little configuration to set up.
Since almost everything is managed, there is little room for tuning and improvements. In this particular configuration, I could simply give the Stream Processor Lambda more memory, but latency does not decrease proportionally as memory increases.
UPDATE: here is the benchmark of the same solution implemented with EventBridge.
Last but not least, keep in mind that AWS does not always include latency in the service SLA. I've run this benchmark a few times with comparable results, but I can't be sure that I will always get the same results over time. If your system requires stable and predictable performance over time, you can't go with services that don't include performance metrics in their SLA. You're better off taking control of the layers below, which means you should consider going to a restaurant or even making your own pizza at home.
In this article, I presented a solution I had to design as part of my work, along with my approach to solution development: clarifying the scope and context, evaluating different options, knowing the parts involved and the performance and quality attributes of the overall system, and writing code and benchmarking where necessary, always with the clear awareness that there are no perfect solutions.
I hope it was helpful to you, and here is the GitHub repo to deploy both versions of the solution.