Implementing a basic MDC based tracing system in Vert.X Web-API based Kotlin microservices

Before I go ahead, I will say this: if the scope of your project requires a full-fledged distributed tracing system, and if that is affordable for you, don’t re-invent the wheel, spend the money and buy a distributed tracing product.
But if you belong to the other camp, here goes.
But first, what is request tracing?
In a distributed system, a user request (which might initiated by a customer clicking “Buy Now” on your website) — might go through multiple micro-services. You, as a developer and a maintainer would find yourself in scenarios where you might have to trace out the journey of that request.
This is usually done by sending a correlation-id in the request sent by the browser, usually as as part of the request header.

The backend services that accept this request would then in turn read this correlation-id from it and log it out. Every downstream service that the request goes through also does the same.
Note to trace out the journey that the request made, all you need is the correlation-id and an aggregated logging solution.
Basic Idea: Frontend
At the client, or the UI, we will send along a unique correlation-id in the HTTP request.
Assuming that you use a library like axios for making your requests, you can make use of an interceptor
Backend
At the backend, our approach to solve this problem piggybacking on the logging mechanism, using Logback’s (the logging framework used) MDC. More information on the MDC can be read from here.
The MDC is used to store contextual information (essentially it is a hash-map), which is then printed along with every log statement. One of the key-value pairs in the MDC will in our case be the correlation-id.
Why piggyback on logging? The only reason is because then developer does not have to explicitly log the correlation id in code.
And due to the manner in which MDC contents are logged, the log-aggregation tool can be configured to be able to query log entries by the correlation-id.
Challenges
Vert.X being an asynchronous framework, the use of MDC to store contextual information like a correlation-id is troublesome, because the MDC context is ThreadLocal in nature.
To show what could go wrong, lets have a look at the following code snippet:
Disregard the non-standard ApiEndpoint
based code, that’s merely an abstraction I had for ease of readability. Let’s just focus on the two Handler
classes.
The first one EnsureCorrelationIdIsAvailableHandler
makes sure that there is a correlation-id in the HTTP request header, or it generates a new one. After that, it would populate the Logback MDC with the correlation-id, and execute the next handler in the chain, MakeACallToARemoteServiceHandler
The second handler makes a non-blocking call to an endpoint that responds after an ‘x’ number of seconds (based on a request header parameter value), and prints the response from the call once complete.
Once we make these changes,
- Run the application
- Make two requests, one after the other (use two terminal windows if you have to). The first request should take 10 seconds to complete, and the other with just 1 second
- Check the log messages
http localhost:8080/something delay:10
http localhost:8080/something delay:1

Note that in both the requests I made in step 2, I didn’t pass in a correlation-id, and hence the handler generated two different correlation ids for me. This is evident from the log messages starting with Missing X-CorrelationId
in line 1 and 3.
However, once the second handler executes the second non-blocking remote call (with the 1 second delay) , and once that remote call is complete, it resumes on a completely different thread. See lines 4 and 5 in the screenshot above.
As such, the correlation-id in lines 4 and 5 is different, one starting with 49cc
and the other with 0abf
but they should have been the same since they are the same request being executed on two different threads. We have this problem because the MDC is ThreadLocal.
Solving this problem
Our team already had a ground rule that we never did any business operation directly on the event-loop thread (even non-blocking ones) and instead used a Kotlin Coroutines instead. This solved this problem for us. The details of how this solved the problem is what we will see as we proceed.
Setting everything up
Let’s start off with a simple handler
Ensure that your logback.xml is set up correctly to print the MDC. For reference, it is the %X
in the <pattern/>
that makes this possible.
Run the application, and we see that the correlation-id is logged

However, as we mentioned earlier, we don’t do any business function in the event handler, and we start a new coroutine
But now if you were to run the application and check the logs again

The correlation-id is no longer logged.
This happens due to the same reason we had seen earlier. The code is running on two different threads, the vert.x-eventloop-thread-0
and DefaultDispatcher-worker-1
threads. As such, the MDC is not shared between them and we didn’t see the correlation-d in the second log message.
To fix this, start by adding the following dependency to build.gradle.kts
implementation("org.jetbrains.kotlinx:kotlinx-coroutines-slf4j:1.3.3")
Then, we use the MDCContext()
in our code
We then see that the correlation-id is back.

What does the MDCContext do?
The MDCContext
also ensures that when the coroutines suspend, and may resume on a completely different thread, that the values of the MDC are preserved between the switch.
How it does this is pretty simple. Please see the original source here.
There is a gotcha here that I have seen people miss: Do NOT modify the MDC while you are within the MDCContext() scope.
The reason for this is that these new contents were will be lost once the coroutine suspends and is then restored. A look at the source-code in the link shared above will show why this is the case.
Fixing the initial example
Let’s go back the to the case where we make a time-delayed web-client call and solve the problem of overwritten correlation-ids using co-routines and the MDCContext()
We make the same two calls (one with a 10 second delay and another with a 1 second delay) and check the logs

And we see that despite the coroutine resuming on the same thread DefaultDispatcher-worker-1
, the correlation-id is different in both cases. See the last two log lines.
Passing the correlation-id to downstream services
If you would make calls from your handlers to downstream services, you must add the X-CorrelationId header (or the request header name you had chosen) to the request.
For this, I had populated the RoutingContext
map with the correlation-id in the EnsureCorrelationIdIsAvailableHandler
with the correlation-id
Since the RoutingContext
instance is shared across handlers (source), I can read values from the context in handlers that come after the EnsureCorrelationIdIsAvailableHandler
Yes, there is a risk of a miss-spelt header name if the same code in line 11 is repeated in multiple places, but maybe you could solve that by centralizing the request creating code.
Summary
I hope this was understandable, and please do let me know in case there are mistakes or problems you’ve faced, or if something wasn’t too clear. All this information was from a lot of googling, and I assumed, collating it might be useful for someone.
We used this technique in production successfully for about over a year, and saw no issues, and hence I can safely recommend it.
I hope this helped, if you’ve had issues, please let me know in the comments.