Implementing a basic MDC based tracing system in Vert.X Web-API based Kotlin microservices

Vineeth Venudasan
6 min readJan 19, 2020

Tracing out a request flow is not unlike following ski-tracks on a big field of snow

Before I go ahead, I will say this: if the scope of your project requires a full-fledged distributed tracing system, and if that is affordable for you, don’t re-invent the wheel, spend the money and buy a distributed tracing product.

But if you belong to the other camp, here goes.

But first, what is request tracing?

In a distributed system, a user request (which might initiated by a customer clicking “Buy Now” on your website) — might go through multiple micro-services. You, as a developer and a maintainer would find yourself in scenarios where you might have to trace out the journey of that request.

This is usually done by sending a correlation-id in the request sent by the browser, usually as as part of the request header.

The correlation-id travels through all the downstream services

The backend services that accept this request would then in turn read this correlation-id from it and log it out. Every downstream service that the request goes through also does the same.

Note to trace out the journey that the request made, all you need is the correlation-id and an aggregated logging solution.

Basic Idea: Frontend

At the client, or the UI, we will send along a unique correlation-id in the HTTP request.

Assuming that you use a library like axios for making your requests, you can make use of an interceptor

Backend

At the backend, our approach to solve this problem piggybacking on the logging mechanism, using Logback’s (the logging framework used) MDC. More information on the MDC can be read from here.

The MDC is used to store contextual information (essentially it is a hash-map), which is then printed along with every log statement. One of the key-value pairs in the MDC will in our case be the correlation-id.

Why piggyback on logging? The only reason is because then developer does not have to explicitly log the correlation id in code.

And due to the manner in which MDC contents are logged, the log-aggregation tool can be configured to be able to query log entries by the correlation-id.

Challenges

Vert.X being an asynchronous framework, the use of MDC to store contextual information like a correlation-id is troublesome, because the MDC context is ThreadLocal in nature.

To show what could go wrong, lets have a look at the following code snippet:

Disregard the non-standard ApiEndpoint based code, that’s merely an abstraction I had for ease of readability. Let’s just focus on the two Handlerclasses.

The first one EnsureCorrelationIdIsAvailableHandler makes sure that there is a correlation-id in the HTTP request header, or it generates a new one. After that, it would populate the Logback MDC with the correlation-id, and execute the next handler in the chain, MakeACallToARemoteServiceHandler

The second handler makes a non-blocking call to an endpoint that responds after an ‘x’ number of seconds (based on a request header parameter value), and prints the response from the call once complete.

Once we make these changes,

  1. Run the application
  2. Make two requests, one after the other (use two terminal windows if you have to). The first request should take 10 seconds to complete, and the other with just 1 second
  3. Check the log messages
http localhost:8080/something delay:10
http localhost:8080/something delay:1

Note that in both the requests I made in step 2, I didn’t pass in a correlation-id, and hence the handler generated two different correlation ids for me. This is evident from the log messages starting with Missing X-CorrelationId in line 1 and 3.

However, once the second handler executes the second non-blocking remote call (with the 1 second delay) , and once that remote call is complete, it resumes on a completely different thread. See lines 4 and 5 in the screenshot above.

As such, the correlation-id in lines 4 and 5 is different, one starting with 49cc and the other with 0abfbut they should have been the same since they are the same request being executed on two different threads. We have this problem because the MDC is ThreadLocal.

Solving this problem

Our team already had a ground rule that we never did any business operation directly on the event-loop thread (even non-blocking ones) and instead used a Kotlin Coroutines instead. This solved this problem for us. The details of how this solved the problem is what we will see as we proceed.

Setting everything up

Let’s start off with a simple handler

Ensure that your logback.xml is set up correctly to print the MDC. For reference, it is the %X in the <pattern/>that makes this possible.

Run the application, and we see that the correlation-id is logged

However, as we mentioned earlier, we don’t do any business function in the event handler, and we start a new coroutine

But now if you were to run the application and check the logs again

The correlation-id is no longer logged.

This happens due to the same reason we had seen earlier. The code is running on two different threads, the vert.x-eventloop-thread-0 and DefaultDispatcher-worker-1 threads. As such, the MDC is not shared between them and we didn’t see the correlation-d in the second log message.

To fix this, start by adding the following dependency to build.gradle.kts

implementation("org.jetbrains.kotlinx:kotlinx-coroutines-slf4j:1.3.3")

Then, we use the MDCContext() in our code

Line 15 is where the magic happens

We then see that the correlation-id is back.

What does the MDCContext do?

The MDCContext also ensures that when the coroutines suspend, and may resume on a completely different thread, that the values of the MDC are preserved between the switch.

How it does this is pretty simple. Please see the original source here.

There is a gotcha here that I have seen people miss: Do NOT modify the MDC while you are within the MDCContext() scope.

The reason for this is that these new contents were will be lost once the coroutine suspends and is then restored. A look at the source-code in the link shared above will show why this is the case.

Fixing the initial example

Let’s go back the to the case where we make a time-delayed web-client call and solve the problem of overwritten correlation-ids using co-routines and the MDCContext()

We make the same two calls (one with a 10 second delay and another with a 1 second delay) and check the logs

And we see that despite the coroutine resuming on the same thread DefaultDispatcher-worker-1, the correlation-id is different in both cases. See the last two log lines.

Passing the correlation-id to downstream services

If you would make calls from your handlers to downstream services, you must add the X-CorrelationId header (or the request header name you had chosen) to the request.

For this, I had populated the RoutingContext map with the correlation-id in the EnsureCorrelationIdIsAvailableHandler with the correlation-id

Line 14 puts the correlation id in the RoutingContext

Since the RoutingContext instance is shared across handlers (source), I can read values from the context in handlers that come after the EnsureCorrelationIdIsAvailableHandler

In Line 11, to X-CorrelationId header is set up from the value in the RoutingContext

Yes, there is a risk of a miss-spelt header name if the same code in line 11 is repeated in multiple places, but maybe you could solve that by centralizing the request creating code.

Summary

I hope this was understandable, and please do let me know in case there are mistakes or problems you’ve faced, or if something wasn’t too clear. All this information was from a lot of googling, and I assumed, collating it might be useful for someone.

We used this technique in production successfully for about over a year, and saw no issues, and hence I can safely recommend it.

I hope this helped, if you’ve had issues, please let me know in the comments.

Free

Distraction-free reading. No ads.

Organize your knowledge with lists and highlights.

Tell your story. Find your audience.

Membership

Read member-only stories

Support writers you read most

Earn money for your writing

Listen to audio narrations

Read offline with the Medium app

Written by Vineeth Venudasan

Backend Engineer / Software Consultant. The views expressed here are my own and does not necessarily represent my employer’s positions, strategies or opinions

Responses (2)

Write a response

Hello! Great article. I have one question.
Should MDCs be cleared correctly?
In you example there is two lines
```MDC.remove("X-CorrelationId")```
and
```MDC.put("CorrelationId", headers.get("X-CorrelationId"))```...