Nowadays, when it comes to choosing an enterprise application architecture, the most common options that people tend to think about come from two opposite ends of the spectrum: Monolith vs. Microservices.

In a Monolith architecture, all of an application’s functionality lives in one deployable module such as a WAR or an EAR file. A Microservices architecture, on the other hand, comprises smaller, self-contained modules that can be developed and deployed independently.

There are various reasons why you might want to choose Microservices over a Monolith architecture, including:

  • It’s not easy to add new functionality to a Monolith without having to retest the entire application;
  • With a Monolith it’s harder to split the work between several teams on different release cycles; and
  • It’s not easy to do Agile development on a Monolith, releasing early and often.

Despite the disadvantages outlined above, there are some key benefits of a Monolith Architecture, primarily the ability to investigate application logs. Since the entire application is one deployable module, all the log files tend to be in one place, which makes it much easier to determine where things have gone wrong.

A challenge with Microservices is that they often need to interact with each other, and these interactions bring with them the complexities of the Microservices architecture. Data distributed across several independent Microservices is harder to maintain and collect because the records no longer live in a single place. What you could easily have achieved with a database join now takes considerably more effort. The same complexity applies when it comes to troubleshooting and looking into application logs. As Martin Fowler puts it, “You must be this tall to use microservices.”

Microservices Diagram

Distributed Logs Problem

Imagine that, to fulfill a client request, your system has to call Microservice A, which in turn needs to call Microservice B, and so on down a chain of calls that ends at Microservice X. The following diagram illustrates this example scenario:

Diagram A

The above is a simple case; in reality, the situation can be far more complicated:

Diagram B

What happens if there’s a problem and you need to look into the logs to see where, when, and why something didn’t work as expected? How do you do that when the records are spread across multiple Microservices and you don’t know which ones to check?

Solution

Now imagine if you were able to tag every single request that comes into your system and trace it through all the different Microservices in your application. Fortunately, there are many tools and frameworks available for these kinds of common problems, and distributed log tracing is no exception.

Recently, we worked on an application composed of many Microservices built with Spring Boot, and we needed a distributed log tracing tool that would integrate easily with our Microservices and achieve our log tracing goals. We decided to use Spring Cloud Sleuth.

Spring Cloud Sleuth automatically instruments common communication channels (a minimal setup sketch follows this list), such as:

  • Requests sent over messaging technologies;
  • HTTP headers received at Spring MVC controllers;
  • Requests that pass through a Netflix Zuul micro proxy; and
  • Requests made with the RestTemplate, etc.
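
In practice, once the spring-cloud-starter-sleuth dependency is on the classpath, little to no extra code is needed. The sketch below is only an illustration of that setup; the class names, endpoints, port and downstream URL are invented for the example:

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import org.springframework.boot.SpringApplication;
    import org.springframework.boot.autoconfigure.SpringBootApplication;
    import org.springframework.boot.web.client.RestTemplateBuilder;
    import org.springframework.context.annotation.Bean;
    import org.springframework.web.bind.annotation.GetMapping;
    import org.springframework.web.bind.annotation.PathVariable;
    import org.springframework.web.bind.annotation.RestController;
    import org.springframework.web.client.RestTemplate;

    // Minimal sketch of a traced service; assumes spring-cloud-starter-sleuth is on the classpath.
    @SpringBootApplication
    public class ServiceAApplication {

        public static void main(String[] args) {
            SpringApplication.run(ServiceAApplication.class, args);
        }

        // Sleuth instruments RestTemplate beans, so outgoing calls carry the trace context in HTTP headers.
        @Bean
        public RestTemplate restTemplate(RestTemplateBuilder builder) {
            return builder.build();
        }
    }

    @RestController
    class OrderController {

        private static final Logger log = LoggerFactory.getLogger(OrderController.class);

        private final RestTemplate restTemplate;

        OrderController(RestTemplate restTemplate) {
            this.restTemplate = restTemplate;
        }

        @GetMapping("/orders/{id}")
        public String getOrder(@PathVariable String id) {
            // Log statements are automatically enriched with the current TraceID and SpanID.
            log.info("Looking up order {}", id);
            // Hypothetical downstream call to Microservice B; it continues the same trace.
            return restTemplate.getForObject("http://localhost:8081/payments/" + id, String.class);
        }
    }

Note that nothing in the controller references Sleuth directly; the instrumentation happens entirely through auto-configuration.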

The first time a request comes to a component in your system, Sleuth automatically assigns a TraceID to it. A TraceID is a unique identifier assigned to each request and is maintained for the whole journey of that request through your application.

Later on, as the request flows from one component (Microservice) to another, Sleuth automatically creates a SpanID. A SpanID is a unique identifier for the request in the context of one component. Every span may also carry tags or metadata that can be used later to contextualize the request within that particular span.
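
With Sleuth in place, these identifiers show up directly in each service’s log output. The lines below are purely illustrative (the service names, IDs and timestamps are made up, and the exact layout depends on the Sleuth version and logging pattern); notice that the TraceID stays the same across both services while each hop gets its own SpanID:

    2021-03-04 10:15:32.101  INFO [service-a,5e8c3e6b1f0a4d2e,5e8c3e6b1f0a4d2e,true] 7311 --- [nio-8080-exec-1] c.e.servicea.OrderController   : Looking up order 42
    2021-03-04 10:15:32.148  INFO [service-b,5e8c3e6b1f0a4d2e,9d21f3c07b6e41aa,true] 7312 --- [nio-8081-exec-3] c.e.serviceb.PaymentController : Charging payment for order 42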

The following diagram illustrates how these concepts apply to the example shown earlier:

Diagram C

Now, every request is identifiable by a TraceID throughout its whole journey through the system, and each stop in a Microservice captures enough metadata in the context of that Microservice to identify it. In a nutshell, we now have enough data in the logs to help troubleshoot or assess request patterns throughout our entire application.

What are we missing at this point?

We needed a proper tool to aggregate the logs from different Microservices, with appropriate reporting and querying capabilities. There are plenty of log aggregation tools out there; for Spring Cloud Sleuth, the popular companion is OpenZipkin. However, we decided to take advantage of Splunk, as it was already in use within our client’s organization.

Now, with Spring Cloud Sleuth and Splunk, we only have to look in one place to track when a request comes into our system, and we can use the TraceID to query all the logs and see all the different Microservices that were involved in fulfilling the request.
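
As an illustration, assuming the services forward their logs to Splunk under a hypothetical index named app_logs, a single search by TraceID pulls the matching events from every Microservice together in time order:

    index=app_logs "5e8c3e6b1f0a4d2e" | sort _time | table _time, source, _raw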

To summarize, to solve the distributed log tracing issue we combined Sleuth as the log tracing tool with Splunk on top for log aggregation. This combination enabled us to capture richer data in our logs, run queries, and easily perform meaningful analysis on the logged data.

Diagram D