Browsing through the log files of any production system, you will be amazed at how many messages are logged on the Error level. It is not uncommon to find hundreds to thousands of error messages per day! When I point these out to developers or system administrators, the usual reply I get is: “Oh, that is normal”. Let me tell you: it is not normal. Any diagnostic message that is logged on the Error level is an indication of a failure in the system. When an error log message describes a situation that is not actually erroneous, that is an error in itself, and such messages may blind you to any real errors that are logged.
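As a minimal sketch of what this means in practice (using Python’s standard logging module; the scenario and names are hypothetical, not taken from any particular system): expected, recoverable conditions belong on the Warning or Info level, and the Error level is reserved for an actual failure to perform the requested function.

```python
import logging

logger = logging.getLogger("orders")

def load_optional_cache(path):
    """Hypothetical example: a missing optional cache file is expected, not a failure."""
    try:
        with open(path) as cache_file:
            return cache_file.read()
    except FileNotFoundError:
        # Expected, recoverable condition: log on Warning (or Info), not Error.
        logger.warning("Cache file %s not found, starting with an empty cache", path)
        return ""
    except OSError:
        # A genuine failure to perform the requested function: this belongs on Error.
        logger.error("Could not read cache file %s", path, exc_info=True)
        raise
```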
The fun part is that the application log is a very good indicator of the system’s maturity.
As an architect, I often need to define the system’s maturity. Maturity refers to the system quality concerning the frequency of failure of the software. So, how often are you willing to let the system fail? You can compare this to the question: “How often do you want your car to show a failure?” Once a day? Once a week? Once a month? Once a year? This is irrespective of the severity of the failure: it might be a warning light that indicates a condition that does not exist, or a complete breakdown. But as I know from experience with Italian cars: the higher the frequency of little failures, the greater the chance of big failures like a roadside breakdown. The relation between defect density and mean time between failures was also observed in computer systems by Capers Jones a long, long time ago.
So, why are the application logs a very good indicator of the system’s maturity?
The reasoning is quite simple: where people work, errors are made. When an error is made in the source code, a “defect” is inserted into the system. A defect is a potential error that under certain conditions might result in a failure. A failure is the inability of the system to perform the requested function according to specification. Some types of failures, like producing an incorrect result, will not show up in the log, but other failures will. The failures reported in a log file are a subset of all the failures in the system, which in turn are caused by a subset of all the defects in the system. Hence, the frequency of error messages in the log is a good indicator of how well built the system is!
When the logs show a very high frequency of error messages, one would expect many defects in the system. On the other hand, the total absence of errors in the log files would make me suspicious. It might be a very well built product, but it might also indicate the lack of another system quality: Analyzability. Analyzability is the quality of the system that refers to the effort needed to diagnose deficiencies or causes of failures.
Of course, analyzing the log file of a production system will not help you create a high-quality system. But when you are building a system, monitor the error frequency in the log files, pay proper attention to the exact cause of each error, and remove it if it is a structural error. In this way, the log file adds a cheap and easily available defect removal technique to your development process.
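As a sketch of what such monitoring could look like, assuming a hypothetical log format of “<date> <time> <LEVEL> <message>” and a file named application.log, the following counts error messages per day so that a rising trend is noticed while the system is still being built:

```python
import re
from collections import Counter

# Assumes a simple, hypothetical log format: "2024-05-01 12:00:00 ERROR some message".
LINE = re.compile(r"^(\d{4}-\d{2}-\d{2}) \S+ (\w+) (.*)$")

def error_frequency(path):
    """Count the number of ERROR lines per day in the given log file."""
    per_day = Counter()
    with open(path) as log:
        for line in log:
            match = LINE.match(line)
            if match and match.group(2) == "ERROR":
                per_day[match.group(1)] += 1
    return per_day

if __name__ == "__main__":
    for day, count in sorted(error_frequency("application.log").items()):
        print(f"{day}: {count} error messages")
```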
Mark van Holsteijn is a senior software systems architect at Xebia Cloud-native solutions. He is passionate about removing waste in the software delivery process and keeping things clear and simple.