Logging in a distributed system of Kotlin services

One reason for the complexity of modern systems is that they are split into collections of services instead of the monoliths they previously were. It can be debated whether these services should be called microservices or not, depending on their size, but in our use case this distinction has no significance, so we will use the term microservices for them.

While the goal of this article is not to compare microservices with monoliths, it can be agreed that one of the reasons microservices are more complex is that it is harder for developers to gain insight into an action that involves multiple services.

The goal of this article is to study how we can reduce the mental overhead needed when debugging and searching through logs in a system with multiple services.

In this case the system is implemented with Kotlin services that contain both synchronous and asynchronous operations. gRPC is used for the synchronous operations and SQS will be our asynchronous component.

The text assumes that the reader is familiar with what gRPC and SQS (and queues in general) are and how they work. If this is not the case, briefly reading about them is advised, as it will clarify the topics and solutions discussed below.

The system

The system that will be demonstrated has 4 services that allow us to buy a ticket for a concert. To fulfill this requirement the system must be able to perform the following operations:

  1. Search for tickets (Synchronous)
  2. Book the ticket (Synchronous)
  3. Send an email to the user after the payment has been processed (Asynchronous)

Each operation will be handled by a different service, and our entrypoint for performing them will be an API that uses these 3 services, which brings the number of services to 4.

Our goal is to be able to observe the events - expected or unexpected - across the services in a dedicated view, ordered by the time they have been logged.

It is assumed that a logging-id - a string - is generated in our entrypoint, the API. With this assumption, we move forward with building a mechanism that shares this string across multiple services. We will look at the need for this id in the chapter Using the logging id.
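As a minimal sketch - assuming a random UUID is enough to uniquely identify a request - the API could generate the id like this:

import java.util.UUID

// Generated once per incoming request at the entrypoint (the API)
val loggingId: String = UUID.randomUUID().toString()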

Synchronous operations

As stated previously, gRPC is the tool that will be used to perform synchronous calls across our services.

The definitions of the gRPC operations are as follows:

service TicketService {
  rpc Search(SearchParameters) returns (AvailabilityResult);
  rpc Book(BookingParameters) returns (BookingResult);
}

The 2 methods will be called by the API that the user interacts with. After the ticket has been successfully booked, an SQS message will be produced and then consumed by a different service that is responsible for notifications.

gRPC has 2 parts, a client and a server, which requires us to be able to use the logging-id both when sending a request from a client and when receiving a request on a server.

Apart from the request message, gRPC also supports metadata, which allows clients and servers to transfer more information than the request itself. Using the metadata we can pass the id from the client with the following approach:

import io.grpc.Metadata

// Key under which the logging-id travels in the request metadata
val key = Metadata.Key.of("logging-id", Metadata.ASCII_STRING_MARSHALLER)
val loggingId = "logging-id-1234"

val metadata = Metadata().apply {
  put(key, loggingId)
}

// grpc-kotlin stubs accept the metadata as an optional extra argument
ticketStub.search(searchParameters {  }, metadata)

Servers need to use interceptors in order to access the metadata of the request. The interceptor will extract the header from the call and place it somewhere the service implementation can access it.

Since we are using grpc-kotlin, which uses coroutines for the service's implementation, we need a solution for sharing data between the interceptor and the implementation, since the two can run in different coroutines. grpc-kotlin provides CoroutineContextServerInterceptor, a gRPC interceptor that offers the option to modify the coroutine context in which the gRPC service implementation will run.

Following this path, we can create such an interceptor with the code snippet below:

import kotlin.coroutines.AbstractCoroutineContextElement
import kotlin.coroutines.CoroutineContext

// Coroutine context element that carries the logging-id of the current call
class LoggingContextElement(val loggingId: String) :
    AbstractCoroutineContextElement(LoggingContextElement) {

    companion object Key : CoroutineContext.Key<LoggingContextElement>
}

import io.grpc.Metadata
import io.grpc.ServerCall
import io.grpc.kotlin.CoroutineContextServerInterceptor
import kotlin.coroutines.CoroutineContext

class LoggingInterceptor(private val context: CoroutineContext) :
    CoroutineContextServerInterceptor() {

    override fun coroutineContext(call: ServerCall<*, *>, headers: Metadata): CoroutineContext {
        // Reuse the id sent by the client, or create a fresh one if none was sent
        val loggingId =
            headers.get(Metadata.Key.of("logging-id", Metadata.ASCII_STRING_MARSHALLER)) ?: generateTraceId()
        return context + LoggingContextElement(loggingId)
    }
}
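The generateTraceId function used above is not provided by grpc-kotlin; it is a helper that has to be defined somewhere in the service. A minimal sketch, assuming a random UUID is acceptable as an id:

import java.util.UUID

// Fallback used when the incoming call carries no logging-id header
fun generateTraceId(): String = UUID.randomUUID().toString()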

and now we can use the interceptor:

// ctx is the base coroutine context for the handlers, e.g. EmptyCoroutineContext
private val server: Server = ServerBuilder.forPort(8000)
    .addService(ServerInterceptors.intercept(TicketService(), LoggingInterceptor(ctx)))
    .build()

and access the logging-id from the context using the coroutine context element:

import kotlin.coroutines.coroutineContext

class TicketService : TicketServiceGrpcKt.TicketServiceCoroutineImplBase() {
    override suspend fun search(request: SearchParameters): AvailabilityResult {
        // Reads the element placed in the context by LoggingInterceptor
        val loggingId = coroutineContext[LoggingContextElement]?.loggingId
        return availabilityResult {  }
    }

    override suspend fun book(request: BookingParameters): BookingResult {
        val loggingId = coroutineContext[LoggingContextElement]?.loggingId
        return bookingResult {  }
    }
}
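When a handler in turn calls another service, the id can be read from the coroutine context and placed in the outgoing metadata, mirroring the client snippet shown earlier. A minimal sketch - paymentStub and its charge method are hypothetical and only serve the example:

// Hypothetical: the book handler forwards the id to a downstream payment service
override suspend fun book(request: BookingParameters): BookingResult {
    val loggingId = coroutineContext[LoggingContextElement]?.loggingId ?: generateTraceId()
    val metadata = Metadata().apply {
        put(Metadata.Key.of("logging-id", Metadata.ASCII_STRING_MARSHALLER), loggingId)
    }
    // paymentStub and chargeParameters are assumed to exist for this example
    paymentStub.charge(chargeParameters {  }, metadata)
    return bookingResult {  }
}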

The generation of a new logging-id, in case an existing one is absent, covers the case where the server also needs to act as a client to a different service. The diagrams below demonstrate the 2 cases of a logging-id being used between 2 services: one having a pre-existing id passed from the client and one generating a new id when the client has not sent one.

Examples:

  1. With id
  2. Without id

Now the Search and Book methods are able to use the same logging-id generated in the API.

Asynchronous operations

In a synchronous operation we can correlate logs across services by looking at their timestamps. If the logs are grouped by time frames, we can assume that logs for a specific operation, within a specific time frame, are related. While it requires more mental overhead to manually associate logs, we can still do it up to a point.

On the contrary, correlating logs in systems that also have asynchronous operations becomes harder, since we cannot know when the async operation will be executed. Thus it is crucial for the observer to be able to investigate the system by grouping the logs with a different mechanism than time frames.

SQS also allows us to provide metadata in its messages, using message attributes. The approach is similar to the one described for gRPC: SQS producers and consumers make use of the attributes to pass more information between services. In our case this metadata will be the logging-id.

Producers can include the id when sending a message to SQS during the message build phase:

import com.amazonaws.services.sqs.model.MessageAttributeValue

// Attach the logging-id to the message as an SQS message attribute
val messageAttributes = mapOf(
    "logging-id" to MessageAttributeValue().withDataType("String").withStringValue("logging-id-123")
)
// ...
sendMessageRequest.withMessageAttributes(messageAttributes)

while consumers can access the attributes through message.messageAttributes. This allows the logging-id to be shared across async operations as well.
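Note that SQS only returns the message attributes that the consumer explicitly requests. A minimal sketch of the consumer side with the same AWS SDK for Java v1 used above, assuming the queue URL is already known:

import com.amazonaws.services.sqs.AmazonSQSClientBuilder
import com.amazonaws.services.sqs.model.ReceiveMessageRequest

val sqs = AmazonSQSClientBuilder.defaultClient()

// Explicitly ask for the logging-id attribute, otherwise SQS omits it
val receiveRequest = ReceiveMessageRequest(queueUrl)
    .withMessageAttributeNames("logging-id")

for (message in sqs.receiveMessage(receiveRequest).messages) {
    val loggingId = message.messageAttributes["logging-id"]?.stringValue
    // ... handle the message, using loggingId to correlate the logs
}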

Using the logging id

The previous chapters focused on transporting the logging-id across different services, both synchronously and asynchronously. So far the need for and the use of this id have not been discussed. This chapter does that.

Most applications have logs even if the developers do not log anything themselves; the libraries and frameworks they use usually do. For example, below we can see typical Java and Kotlin logs when a Spring Boot project runs.

2019-03-05 10:57:51.112  INFO 45469 --- [           main] org.apache.catalina.core.StandardEngine  : Starting Servlet Engine: Apache Tomcat/7.0.52
2019-03-05 10:57:51.253  INFO 45469 --- [ost-startStop-1] o.a.c.c.C.[Tomcat].[localhost].[/]       : Initializing Spring embedded WebApplicationContext
2019-03-05 10:57:51.253  INFO 45469 --- [ost-startStop-1] o.s.web.context.ContextLoader            : Root WebApplicationContext: initialization completed in 1358 ms
2019-03-05 10:57:51.698  INFO 45469 --- [ost-startStop-1] o.s.b.c.e.ServletRegistrationBean        : Mapping servlet: 'dispatcherServlet' to [/]
2019-03-05 10:57:51.702  INFO 45469 --- [ost-startStop-1] o.s.b.c.embedded.FilterRegistrationBean  : Mapping filter: 'hiddenHttpMethodFilter' to: [/*]

We call logs unstructured when they are meant for humans and are usually sequences of plain string lines. The challenge with such logs is that parsing and analyzing them is not very simple: since they lack structure, the tools that parse them have a hard time understanding what the logs mean and how they should be parsed.

The solution to this is structured logs: logs that follow a specific structure that tools understand, allowing them to be searched, filtered and analyzed. One such structure is JSON, which is what we will use to demonstrate our solution.

A JSON structure for logs means that the logs will be written in JSON format, where different fields are used for different parts of the log. An example could be the following:

{
  "message": "example",
  "level": "info",
  "timestamp": "2024-03-05 10:57:51.702"
}

The given example demonstrates a log that has an info level, a timestamp of 2024-03-05 10:57:51.702 and the message “example”. This structure allows us to enhance the log with more fields, such as the logging-id.

To be able to add more fields to a log, we need to use the Mapped Diagnostic Context (MDC). Since in this example we use kotlin-logging as our logging library, we can do that by using its withLoggingContext.

Thus, in our use case we can add the logging id with:

val loggingId = "12-logging-id"

withLoggingContext("logging-id" to loggingId) {
  logger.info { "my message" }
}

which will result into:

{
  "message": "my message",
  "level": "info",
  "timestamp": "2024-03-05 10:57:51.702",
  "logging-id": "12-logging-id"
}

Note: The use of an encoder is required for the logs to be written as JSON. In this example, logstash-logback-encoder was used. If an encoder is not used, the context will be populated but the extra fields will not be shown, since the logs will be plain strings.
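For reference, a minimal logback.xml sketch that wires the encoder to a console appender could look like the following (assuming logstash-logback-encoder is on the classpath):

<configuration>
  <appender name="json" class="ch.qos.logback.core.ConsoleAppender">
    <!-- LogstashEncoder writes each event, including MDC fields, as a JSON line -->
    <encoder class="net.logstash.logback.encoder.LogstashEncoder"/>
  </appender>
  <root level="INFO">
    <appender-ref ref="json"/>
  </root>
</configuration>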

Viewing logs

The last part of the solution is still missing. The logs are now written and they do contain the logging-id, but the user still cannot make use of them. This is because the logs need to be stored in order for other tools to be able to search and visualise them.

There are many tools that store and visualise logs; some of them are Grafana, Datadog, Kibana, and AppSignal.

While these tools differ, the solution for our use case remains the same across all of them: we need to store the logs somewhere these tools can retrieve them, and write a query (depending on the tool) that searches them using the logging-id.
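As an illustration, in Kibana such a query could look roughly like the following - the exact syntax and field-name escaping vary per tool:

logging-id : "12-logging-id"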

This query will return all the logs that have that logging-id, ordered by timestamp. The result is an ordered view of the logs across the different services, which is what this article set out to accomplish.