Building event-driven microservices that scale

Welcome to 2020, the year which will mark more serious adoption of Kubernetes, DevOps practices and, with that, microservices!

Earlier, I spoke about how one can make microservices communicate. In this post, I want to dive a bit deeper into building scalable microservices-based architectures. Skipping all the pep talk, I’ll get right to it.

What really goes into building microservices

For me, microservices is just a fancy name for a bunch of web servers having cross dependencies on one another. However, there is one restriction; each web server or service we run must manage its own state.

While we can add more rules to the mix, the goal is to consume more and more APIs rather than rely on the state stored in databases.

Let’s be honest with ourselves. We are pretty good at writing simple HTTP servers. Bragging a little about how beautiful your API or project structure is wouldn’t be wrong at all. We have been doing it for years after all.

The tricky part, however, would be building APIs which do not break often. In other words, we need some form of versioning for our APIs, which can preserve backward compatibility. I have full faith in you for this too. I’m pretty sure you’ve nailed this aspect already. No need for me to go deeper into this.

What’s left is the deployment of these microservices (a topic for another day) and making these microservices communicate. I’m going to be talking about communications in this post.

Breaking down microservice communications

There are two fundamental ways of making two microservices talk:

Synchronous, also known as a brokerless design.
Asynchronous, also known as a broker-based design.

Let’s dissect them one by one.

Synchronous communication

This is request-response communication, which usually uses HTTP as the underlying protocol. The main point to note here is that the services talk to each other directly. Requests passing through a proxy, like an API gateway, are also classified as synchronous.

The core idea is that the client gets an absolute response.

Now, this might sound strange initially. All requests return a response, right? 🙄

Let’s take an example to understand this. Imagine if you send a command to insert a document in a database and the database returns an acknowledgement. Ever wondered what that acknowledgement means?

It’s intuitive to think that the ack means the database has successfully written your data to disk, and hence, your data is persisted. This does sound logical. Everything would become way more reliable if things worked this way.

This is a classic example of the synchronous mode of communication. Even if you had a proxy in front of your database (a connection pooler like PgBouncer, say, or a load balancer that retries requests on your behalf), the architecture still remains the same.

Asynchronous communication

What if I told you that the ack meant the database has accepted your command and will perform the actual write to disk later? 😈 I’m not talking about the concept of a WAL here. What I’m implying is that if the database failed, your document would be lost. Forever!

Imagine working in such an environment. It’s complete madness.

You would have no idea if the write has actually gone through or not. It seems there is no possible way to retry as well. This seems like the worst possible way to do things.

The perils of synchronous microservices

You’re probably right to have formed a negative opinion of async communications. And what has that got to do with event-driven microservices anyway?

I know it’s intuitive for us to think in terms of synchronous microservices. It’s way easier. Faster to implement.

Now listen to what I’m about to tell you.

Synchronous microservices are slowing you down!

That’s right. I said it. You can curse me if you want, but this is a fact written in all codebases throughout the globe.

Tightly coupled microservices are nothing but distributed monoliths, which are way harder to build, debug and maintain.

Let’s see why synchronous microservices aren’t always the best choice.

The Inefficiencies

First things first, synchronous communication is very inefficient. Certain I/O bound operations take time. A lot of time.

Let’s take the example of sending an email whenever a new user signs up. In this case, you would hit your mail server (to send an email) right after your database insert.

But what if the mail server is unavailable? Would you retry? For how long? What if the mail server is out for hours 😥?

In the synchronous world of microservices, your sign-up service would simply stop working if something like that were to happen.

Wouldn’t it be better if we could queue this operation somehow and try it later?
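
To make the difference concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the `users` list stands in for the database, `send_welcome_email` stands in for the call to the mail server, and the `mail_server_up` flag simulates an outage. It is not how any particular framework does it.

```python
from collections import deque


class MailServerDown(Exception):
    """Raised by the hypothetical mail client when the server is unreachable."""


def send_welcome_email(address: str, mail_server_up: bool) -> None:
    # Stand-in for a real SMTP or HTTP call to the mail service.
    if not mail_server_up:
        raise MailServerDown(address)


def sign_up_sync(users: list, address: str, mail_server_up: bool) -> None:
    users.append(address)                        # the database insert
    send_welcome_email(address, mail_server_up)  # a failure here fails the whole request


def sign_up_deferred(users: list, pending: deque, address: str) -> None:
    users.append(address)    # the database insert
    pending.append(address)  # remember the email; it gets sent later


def drain(pending: deque, mail_server_up: bool) -> int:
    """Try to send queued emails; stop at the first failure and keep the rest."""
    sent = 0
    while pending:
        try:
            send_welcome_email(pending[0], mail_server_up)
        except MailServerDown:
            break
        pending.popleft()
        sent += 1
    return sent
```

With `sign_up_sync`, an outage turns every sign-up into an error. With `sign_up_deferred`, the sign-up succeeds and `drain` can be called again once the mail server is back.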

Your microservices wouldn’t really remain that micro

The beauty of a microservice architecture is that it lets teams work on just their logic without having to worry about anything else. That’s where the agility comes from.

But look at the previous example again. The sign-up service is talking to the email service!

WHY?????

What if there were more operations you wanted to do whenever the user signs up? Everyone would have to modify the sign-up service! And what happens if any one of the services is down? Things won’t look good.

The poor microservice simply wanted to sign up new users. It was never meant to do so many things!

And this is just a single example. There could be several such instances.

Apparently, the only way out is asynchronous microservices.

Taming asynchronous microservices

So we need to improve our architecture such that the sign-up service can happily do its job without having to worry about anything else.

How do we do it?

Wouldn’t it be great if we could broadcast or publish a message whenever the user signs up? Anyone interested could keep an ear out and subscribe to that message. This is the pub-sub broker pattern.

What we do here is have the sign-up service publish an event to the broker (let’s say Kafka or RabbitMQ). The email microservice, being interested in this event, would subscribe to it.

What’s important to note here is that the sign-up service talks only to the broker, and the broker simply acknowledges the receipt of the message. There is no guarantee that the event has been processed.

I think that’s fair. It shouldn’t be the sign-up service’s problem anyway.

But someone needs to take the responsibility of delivering those events reliably! Luckily for us, the message brokers do it.
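
Here is a toy version of that flow in Python. `ToyBroker` is an in-process stand-in I made up for this post, not the Kafka or RabbitMQ client API, and the topic name `user.signed_up` is just an assumption. The point is only to show that the publisher gets its ack before any subscriber has run.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Handler = Callable[[dict], None]


class ToyBroker:
    """In-process stand-in for a broker; not the Kafka or RabbitMQ API."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)
        self._queue: List[Tuple[str, dict]] = []

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> bool:
        # The ack only means "message accepted", not "message processed".
        self._queue.append((topic, event))
        return True

    def deliver(self) -> None:
        # Delivery happens later, outside the publisher's request.
        while self._queue:
            topic, event = self._queue.pop(0)
            for handler in self._subscribers[topic]:
                handler(event)


broker = ToyBroker()
sent_emails: List[str] = []

# The email service subscribes; the sign-up service never references it.
broker.subscribe("user.signed_up", lambda e: sent_emails.append(e["email"]))

ack = broker.publish("user.signed_up", {"email": "a@example.com"})
emails_before_delivery = list(sent_emails)  # still empty at this point
broker.deliver()
```

Adding a second subscriber is one more `subscribe` call. The sign-up code does not change.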

Going event driven

I still have a few problems with this setup. But you can probably see the benefits already. Let’s take a look at a few areas where we can still improve.

My simple microservice needs to know about the existence of a broker

This might not be that big a deal for you. But why should the sign-up microservice publish events?

If you think about it, it all boils down to a database write operation. The sign-up event is triggered whenever a new document is inserted in the database. If we can somehow capture these events (using CDC or similar), our sign-up microservice doesn’t really need to publish events.

Unnecessarily adding new components adds points of coupling. Also, in my experience:

Most events are triggered via the data layer.

So making our data layer smarter can help us go a long way.
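
A rough sketch of what a "smarter" data layer could look like, again in Python. `EventfulCollection` is a hypothetical wrapper I am using to illustrate the idea. Real CDC tools typically read the database’s change log instead of wrapping inserts, so treat this as the shape of the idea rather than a recipe.

```python
from typing import Callable, List

Listener = Callable[[dict], None]


class EventfulCollection:
    """Hypothetical data-layer wrapper that emits a change event on every insert."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.documents: List[dict] = []
        self._listeners: List[Listener] = []

    def on_change(self, listener: Listener) -> None:
        self._listeners.append(listener)

    def insert(self, doc: dict) -> None:
        self.documents.append(doc)
        # The event is a side effect of the write, not something the caller publishes.
        event = {"type": "insert", "collection": self.name, "doc": doc}
        for listener in self._listeners:
            listener(event)


users = EventfulCollection("users")
captured: List[dict] = []
users.on_change(captured.append)


def sign_up(email: str) -> None:
    # The sign-up service only writes to its database; it knows nothing about a broker.
    users.insert({"email": email})


sign_up("a@example.com")
```

Notice that `sign_up` has no publish call at all. The event shows up in `captured` purely because a document was written.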

Maintaining a message broker is an added cost

Have you ever tried setting up Kafka? It’s difficult.

What about maintaining it in production? Total nightmare.

Initially, you don’t need to get into such heavy tools. We need to be careful not to add too much complexity early on, while still not compromising on scale and flexibility.

How long will a broker last?

This is an inherent problem of most pub-sub based message queues. What happens if the mail server is out for a few days straight? The messages will sit in the broker and eat up space. To make matters worse, if a huge surge of new users arrives at your doorstep, you are most likely to exhaust your memory.

Most brokers will probably give up at this point or suffer a major performance hit.

Mature ones like Kafka might be able to take the load, but even Kafka will start dropping events once they outlive its configured retention limits.

To be fair, message brokers are designed to handle transient failures, not week-long outages.
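
You can see the problem with a few lines of Python. The capacity of 3 and the drop-oldest behaviour are assumptions I picked for the sketch. Real brokers differ here: some block or reject publishers, others expire old messages.

```python
from collections import deque

# Hypothetical broker buffer that can hold only 3 undelivered events.
buffer = deque(maxlen=3)

for user_id in range(1, 6):  # five sign-ups arrive while the mail server is down
    buffer.append({"user_id": user_id})

surviving = [event["user_id"] for event in buffer]             # [3, 4, 5]
lost = [uid for uid in range(1, 6) if uid not in surviving]    # [1, 2]
```

Users 1 and 2 never get their welcome email, and nothing in the sign-up service knows about it.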

What does the final event-driven microservices architecture look like?

Let’s say we somehow figure out how to solve the problems I’ve mentioned above. We would then have an architecture where each microservice goes on minding its own business. Each one talks to its own share of databases and other microservices.

Events would automatically get generated as a side effect based on these interactions (e.g. writes to a database).

Interested microservices can express interest in these events and can consume them.

At the core of this lies the eventing system which is capturing events from various sources (databases, HTTP, etc.) and triggering the HTTP endpoints of the microservices which have shown interest in those events.

Let’s say you have another microservice interested in an event. Just add a webhook for that event in the eventing system, and there you go.

In a nutshell, the eventing system is responsible for:

Capturing events from various sources (database, object stores, HTTP, etc.).
Maintaining a configurable mapping of the webhooks to call for each event.
Hitting the endpoints with the respective payloads.
Handling automatic retries, timeouts and even rescheduling of the same event in case of long-lived outages.
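
The responsibilities above can be sketched roughly like this. `EventingSystem`, the webhook URLs and the `fake_post` function are all hypothetical names I made up for this example. The HTTP call is injected as a plain function so the sketch runs without a network, and timeouts are left out to keep it short.

```python
from typing import Callable, Dict, List, Tuple

# post(url, payload) -> True on success; a stand-in for an HTTP POST.
Post = Callable[[str, dict], bool]


class EventingSystem:
    """Toy eventing system: event type -> webhooks, with bounded retries."""

    def __init__(self, post: Post, max_attempts: int = 3) -> None:
        self._post = post
        self._max_attempts = max_attempts
        self._webhooks: Dict[str, List[str]] = {}
        self.rescheduled: List[Tuple[str, dict]] = []  # deliveries to try again later

    def register(self, event_type: str, url: str) -> None:
        self._webhooks.setdefault(event_type, []).append(url)

    def capture(self, event: dict) -> None:
        for url in self._webhooks.get(event["type"], []):
            self._deliver(url, event)

    def _deliver(self, url: str, event: dict) -> None:
        for _ in range(self._max_attempts):
            if self._post(url, event):
                return
        # Every attempt failed: park the event instead of dropping it.
        self.rescheduled.append((url, event))

    def retry_rescheduled(self) -> None:
        parked, self.rescheduled = self.rescheduled, []
        for url, event in parked:
            self._deliver(url, event)


EMAIL_HOOK = "http://email-service/hooks/welcome"        # hypothetical URL
ANALYTICS_HOOK = "http://analytics-service/hooks/signup"  # hypothetical URL

calls: List[str] = []
state = {"mail_up": False}  # simulates the email service being down


def fake_post(url: str, payload: dict) -> bool:
    calls.append(url)
    return state["mail_up"] or url != EMAIL_HOOK


system = EventingSystem(fake_post)
system.register("user.signed_up", EMAIL_HOOK)
system.register("user.signed_up", ANALYTICS_HOOK)
system.capture({"type": "user.signed_up", "email": "a@example.com"})
parked_during_outage = len(system.rescheduled)

state["mail_up"] = True      # the email service comes back
system.retry_rescheduled()
```

The analytics webhook gets its event right away, while the email delivery is parked during the outage and goes through on the later retry. Neither service had to know about the other.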

One added benefit of such an architecture is that it’s massively scalable. As long as your eventing system can take the load, you can process millions of events per second. I’ll leave it up to you to figure out why this is the case :P.

Having said that, it’s important to note when not to use an event-driven microservices architecture.

You should avoid using event-driven microservices architecture for transactional workloads requiring an absolute response.

Requests like sign in/up, fetching of billing history, etc. need to be synchronous because the data needs to be forwarded to the user. As a rule of thumb, you can default to using event stores for everything. It will then be evident which requests need to be synchronous.

This approach will make sure you don’t make all kinds of requests synchronous and, as I mentioned before, end up with a distributed monolith.

What’s Next

As promised, we’ll be jumping into building an event-driven microservice-based architecture using an event store in the next blog post. However, to keep things simple, we’ll not be using a messaging log or queue.

I hope this article helped shed some light on the need to adopt event-driven architectures.

Leave a comment below if I’ve missed something important or have got anything wrong. Would love to hear from you!