When we run our software, we want to see and understand what is happening and how well it performs. To achieve this, observability needs to be a key characteristic of our software. Observability is a measure of how well the internal states of a system can be inferred from knowledge of its external outputs. This definition, borrowed from control theory, implies that metrics, tracing, and logging are key topics to implement in your software system.
Two of these pillars, metrics and tracing, are essential for painting the complete picture. In this blog post, however, I will focus on the third pillar: getting the most out of your logging.
Structured, Centralized, and Normalized
Fortunately, I have never been in any environment where there was no logging implemented at all. Typically, at least some file-based, plaintext logging would exist to provide some insight into your application. An excellent start to improving this type of logging is to go from plaintext logging to structured logging. This can be done by:
- Writing the log statements as JSON objects.
- Capturing them on a centralized platform that is able to parse them.
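To make the first step concrete, here is a minimal sketch using Python's standard logging module. The field names (timestamp, severity, logger, message) and the build_logger helper are my own assumptions for illustration, not a prescribed schema; your centralized platform may expect different keys.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            # Field names are illustrative assumptions, not a standard.
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "severity": record.levelname.lower(),
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # Keep the stack trace inside the same JSON object.
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(name: str) -> logging.Logger:
    """Hypothetical helper: a logger that writes JSON lines to stdout."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers = [handler]
    logger.propagate = False
    return logger
```

Calling build_logger("orders").error("payment failed for %s", "order-1") would then write one JSON object per line to stdout, which a log shipper can forward to the centralized platform.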
Now that we have structured and centralized logging, we can effortlessly search, filter, and query the log statements, maybe even from multiple applications at the same time.
When we view data from various sources, we have another wish: normalized logs. To achieve this, all applications should structure their log statements in the same fashion. Consider using a reusable logging component or a default logging configuration that is scaffolded when setting up a new application.
Now that we have our basic instrumentation in place with structured, centralized, and normalized logging, we are well underway with observability. We can query all our applications simultaneously, search for specific exceptions, and filter for error logs.
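A reusable logging component could look roughly like the sketch below: one small shared module that every application calls at startup, so all log statements end up with the same top-level keys. The names configure_logging and NormalizedFormatter, and the service and environment fields, are assumptions for illustration.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class NormalizedFormatter(logging.Formatter):
    """Emit the same top-level keys for every application."""

    def __init__(self, service: str, environment: str) -> None:
        super().__init__()
        self.service = service
        self.environment = environment

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps(
            {
                "timestamp": datetime.fromtimestamp(
                    record.created, tz=timezone.utc
                ).isoformat(),
                "severity": record.levelname.lower(),
                # Identifies the source once logs from many apps are mixed.
                "service": self.service,
                "environment": self.environment,
                "logger": record.name,
                "message": record.getMessage(),
            }
        )


def configure_logging(
    service: str, environment: str, level: int = logging.INFO
) -> None:
    """Hypothetical shared entry point, called once at application startup."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(NormalizedFormatter(service, environment))
    root = logging.getLogger()
    root.handlers = [handler]
    root.setLevel(level)
```

Because the schema lives in one place, a query such as "severity is error, grouped by service" works across the whole landscape without per-application field mappings.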
If your environment is anything like the places I have been, you are probably overloaded with data of unclear importance. Just because you have structured logging does not mean it is worthwhile logging. Put differently, just because you installed observability tooling does not mean that you see more.
So, next up: how easy is it to gather real insights from all this data?
We need a strategy to get from structured logging to information. It may help to set a concrete goal, for example: “every log statement with severity error should be an actual error.” In other words, a log statement with severity error should indicate that something went wrong, i.e. an operation could not be executed successfully. Such a goal can prove valuable in many situations.
So, I set myself this goal in one of my latest projects. When going through the logs, I found that, besides actual errors, two other types of statements were logged with severity error:
- The statement should have been a warning. Something unexpected happened, but the application can recover from it.
- The statement points to a bug or a missed business flow. Nothing prevents the operation from executing successfully; there is just a mistake in the code.
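The distinction between a warning and an actual error can be sketched with a retry loop. This is an illustration under my own assumptions: TransientError and call_with_retry are hypothetical names, and your recoverable cases may look different. A failed attempt that will be retried is logged as a warning; only the final failure, where the operation really could not be executed, is logged as an error.

```python
import logging

logger = logging.getLogger("payments")


class TransientError(Exception):
    """Hypothetical error type for failures that are worth retrying."""


def call_with_retry(operation, attempts: int = 3):
    """Run operation, retrying on TransientError up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError as exc:
            if attempt < attempts:
                # Unexpected, but we can still recover: warning.
                logger.warning(
                    "attempt %d/%d failed, retrying: %s", attempt, attempts, exc
                )
            else:
                # The operation could not be executed: an actual error.
                logger.error(
                    "operation failed after %d attempts: %s", attempts, exc
                )
                raise
```

With this split, a flaky dependency that recovers on the second attempt no longer shows up in the error log at all.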
Now that I had set myself this goal, I needed a way to get there. I implemented a routine:
- Every morning, in the ~30 minutes between my morning coffee and the stand-up meeting, I tracked down the most common error pattern.
- Because we can only do so much in 30 minutes, we can either:
  - solve the error pattern within the 30 minutes, or
  - prepare and plan for further investigation. This would include more detailed log analysis and use of metrics and tracing to grasp the complete picture. In the latter case, I would typically create a ticket on the product backlog.
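Tracking down the most common error pattern is usually a query in your logging platform, but the idea can be sketched in a few lines. The sketch assumes JSON log lines with severity and message fields (an assumption, not a given), and groups messages by replacing numbers with a placeholder so that variations of the same error count together. The function names are hypothetical.

```python
import json
import re
from collections import Counter


def error_pattern(message: str) -> str:
    """Collapse numeric parts so similar messages group together."""
    return re.sub(r"\d+", "<n>", message)


def most_common_error_patterns(lines, top: int = 3):
    """Count error-level messages per pattern, most frequent first."""
    counts = Counter()
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # Skip lines that are not structured logs.
        if not isinstance(entry, dict):
            continue
        if entry.get("severity") == "error":
            counts[error_pattern(str(entry.get("message", "")))] += 1
    return counts.most_common(top)
```

The first entry of the result is the candidate for that morning's 30 minutes.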
The results of the routine? After only a few weeks, the number of error logs was significantly reduced. And after a few months, I reached my goal.
In conclusion, we can now use the error log to get valuable insights into the health of our application landscape. That makes it a pivotal step, and a tool in the bigger toolbox of progressive delivery.
Next steps, a.k.a. a word of warning
Observability is not just about the tooling; it is also about providing that tooling with useful data.
We cannot assume that once we have cleaned up our logs, they will stay that way. With every change we make, we need to take our observability needs into account. Answer the following questions: What logging do we need? What data or tags should be in it? What metrics should we add to obtain insights into our service level objectives?
If there are too many errors to investigate and all you see is a bunch of meaningless data, no one will ever look at it.