by Anthony Vallone

How long does it take to find the root cause of a failure in your system? Five minutes? Five days? If you answered close to five minutes, it’s very likely that your production system and tests have great logging. All too often, seemingly unessential features like logging, exception handling, and (dare I say it) testing are an implementation afterthought. Like exception handling and testing, you really need to have a strategy for logging in both your systems and your tests. Never underestimate the power of logging. With optimal logging, you can even eliminate the necessity for debuggers. Below are some guidelines that have been useful to me over the years.

Channeling Goldilocks

Never log too much. Massive, disk-quota burning logs are a clear indicator that little thought was put in to logging. If you log too much, you’ll need to devise complex approaches to minimize disk access, maintain log history, archive large quantities of data, and query these large sets of data. More importantly, you’ll make it very difficult to find valuable information in all the chatter.

The only thing worse than logging too much is logging too little. There are normally two main goals of logging: help with bug investigation and event confirmation. If your log can’t explain the cause of a bug or whether a certain transaction took place, you are logging too little.

Good things to log:

  • Important startup configuration
  • Errors
  • Warnings
  • Changes to persistent data
  • Requests and responses between major system components
  • Significant state changes
  • User interactions
  • Calls with a known risk of failure
  • Waits on conditions that could take measurable time to satisfy
  • Periodic progress during long-running tasks
  • Significant branch points of logic and conditions that led to the branch
  • Summaries of processing steps or events from high level functions - Avoid logging every step of a complex process in low-level functions.

Bad things to log:

  • Function entry - Don’t log a function entry unless it is significant or logged at the debug level.
  • Data within a loop - Avoid logging from many iterations of a loop. It is OK to log from iterations of small loops or to log periodically from large loops.
  • Content of large messages or files - Truncate or summarize the data in some way that will be useful to debugging.
  • Benign errors - Errors that are not really errors can confuse the log reader. This sometimes happens when exception handling is part of successful execution flow.
  • Repetitive errors - Do not repetitively log the same or similar error. This can quickly fill a log and hide the actual cause. Frequency of error types is best handled by monitoring. Logs only need to capture detail for some of those errors.

There is More Than One Level

Don't log everything at the same log level. Most logging libraries offer several log levels, and you can enable certain levels at system startup. This provides a convenient control for log verbosity.

The classic levels are:

  • Debug - verbose and only useful while developing and/or debugging.
  • Info - the most popular level.
  • Warning - strange or unexpected states that are acceptable.
  • Error - something went wrong, but the process can recover.
  • Critical - the process cannot recover, and it will shutdown or restart.

Practically speaking, only two log configurations are needed:

  • Production - Every level is enabled except debug. If something goes wrong in production, the logs should reveal the cause.
  • Development & Debug - While developing new code or trying to reproduce a production issue, enable all levels.

Test Logs Are Important Too

Log quality is equally important in test and production code. When a test fails, the log should clearly show whether the failure was a problem with the test or production system. If it doesn't, then test logging is broken.

Test logs should always contain:

  • Test execution environment
  • Initial state
  • Setup steps
  • Test case steps
  • Interactions with the system
  • Expected results
  • Actual results
  • Teardown steps

Conditional Verbosity With Temporary Log Queues

When errors occur, the log should contain a lot of detail. Unfortunately, detail that led to an error is often unavailable once the error is encountered. Also, if you’ve followed advice about not logging too much, your log records prior to the error record may not provide adequate detail. A good way to solve this problem is to create temporary, in-memory log queues. Throughout processing of a transaction, append verbose details about each step to the queue. If the transaction completes successfully, discard the queue and log a summary. If an error is encountered, log the content of the entire queue and the error. This technique is especially useful for test logging of system interactions.

Failures and Flakiness Are Opportunities

When production problems occur, you’ll obviously be focused on finding and correcting the problem, but you should also think about the logs. If you have a hard time determining the cause of an error, it's a great opportunity to improve your logging. Before fixing the problem, fix your logging so that the logs clearly show the cause. If this problem ever happens again, it’ll be much easier to identify.

If you cannot reproduce the problem, or you have a flaky test, enhance the logs so that the problem can be tracked down when it happens again.

Using failures to improve logging should be used throughout the development process. While writing new code, try to refrain from using debuggers and only use the logs. Do the logs describe what is going on? If not, the logging is insufficient.

Might As Well Log Performance Data

Logged timing data can help debug performance issues. For example, it can be very difficult to determine the cause of a timeout in a large system, unless you can trace the time spent on every significant processing step. This can be easily accomplished by logging the start and finish times of calls that can take measurable time:

  • Significant system calls
  • Network requests
  • CPU intensive operations
  • Connected device interactions
  • Transactions

Following the Trail Through Many Threads and Processes

You should create unique identifiers for transactions that involve processing across many threads and/or processes. The initiator of the transaction should create the ID, and it should be passed to every component that performs work for the transaction. This ID should be logged by each component when logging information about the transaction. This makes it much easier to trace a specific transaction when many transactions are being processed concurrently.

Monitoring and Logging Complement Each Other

A production service should have both logging and monitoring. Monitoring provides a real-time statistical summary of the system state. It can alert you if a percentage of certain request types are failing, it is experiencing unusual traffic patterns, performance is degrading, or other anomalies occur. In some cases, this information alone will clue you to the cause of a problem. However, in most cases, a monitoring alert is simply a trigger for you to start an investigation. Monitoring shows the symptoms of problems. Logs provide details and state on individual transactions, so you can fully understand the cause of problems.

 

Optimal Logging的更多相关文章

  1. 读书摘要,Hackable Projects

    完整读完Google的三篇谈Hackable Projects的文章,以及一篇从Test Pyramid看UnitTest的比重.一篇谈Optimal Logging的文章,感觉这5篇在测试.日志两个 ...

  2. Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory

    学习架构探险,从零开始写Java Web框架时,在学习到springAOP时遇到一个异常: "C:\Program Files\Java\jdk1.7.0_40\bin\java" ...

  3. Oracle补全日志(Supplemental logging)

    Oracle补全日志(Supplemental logging)特性因其作用的不同可分为以下几种:最小(Minimal),支持所有字段(all),支持主键(primary key),支持唯一键(uni ...

  4. Java程序日志:java.util.logging.Logger类

    一.Logger 的级别 比log4j的级别详细,全部定义在java.util.logging.Level里面.各级别按降序排列如下:SEVERE(最高值)WARNINGINFOCONFIGFINEF ...

  5. python 学习笔记 -logging模块(日志)

    模块级函数 logging.getLogger([name]):返回一个logger对象,如果没有指定名字将返回root loggerlogging.debug().logging.info().lo ...

  6. python logging colorlog

    import logging LOG_LEVEL = logging.NOTSET LOGFORMAT = "[%(log_color)s%(levelname)s] [%(log_colo ...

  7. [转]ASP.NET Core 开发-Logging 使用NLog 写日志文件

    本文转自:http://www.cnblogs.com/Leo_wl/p/5561812.html ASP.NET Core 开发-Logging 使用NLog 写日志文件. NLog 可以适用于 . ...

  8. python 之 logging

    #coding=utf-8 import logging logging.basicConfig(level=logging.DEBUG, format='%(asctime)s %(filename ...

  9. Python Logging模块的简单使用

    前言 日志是非常重要的,最近有接触到这个,所以系统的看一下Python这个模块的用法.本文即为Logging模块的用法简介,主要参考文章为Python官方文档,链接见参考列表. 另外,Python的H ...

随机推荐

  1. Axis2 java调用.net webservice接口的问题(郑州就维)

    这是一个古老的问题,古老到从我若干年前遇到这样的问题就是一个解决之道:反复尝试.其实标准是什么,标准就是一个束缚,一种按既定规则的束缚,错点点,你的调用就可能不成功,不成功后你要花费大量的力气查找原因 ...

  2. c#调用带有安全认证的java webservice

    最近使用c#调用另外一个同事写的java webservice耽误了很多时间,网上资料不太完整,走了很多弯路,希望对大家有帮助. 基本思路是1.拼装soap使用http post ,主要将验证身份信息 ...

  3. C# 线程更新UI

    最方便的用法: private void ViewMsg(string msg)        { /* control.Invoke(new SetControlTextDelegate((ct,  ...

  4. SPOJ3267 D-query 离线+树状数组 在线主席树

    分析:这个题,离线的话就是水题,如果强制在线,其实和离线一个思路,然后硬上主席树就行了 离线的代码 #include <iostream> #include <stdio.h> ...

  5. 不同的jar里边相同的包名类名怎么区别导入

    今天在做项目的时候遇到了一个很有意思的问题,折磨了我很长时间,不过最终还是解决了,特留此文纪念一下. 遇到的问题: 同样一段代码,在同事那就好使,在我这就找不到一个方法.引用的包也都是相同的,这种问题 ...

  6. Shell的那些事儿

    日常工作中,哪种语言对你的帮助最大?我觉得非Shell莫属.最早接触Shell应该是在大学的时候,如做Linux文件系统裁减会用到一些命令,如find, tar, xargs, cp等等,并把它们通过 ...

  7. setTimeout中0毫秒延时

    先来看段代码,思考一下执行的结果. alert(1); setTimeout(function(){alert(2);}, 0); alert(3); 估计很多人认为执行结果为1,2,3,原因就是认为 ...

  8. 引入less报错解决方法以及浏览器设计不同的地方

    XMLHttpRequest cannot load file:///C:/Users/PAXST/Desktop/805/first.less. Cross origin requests are ...

  9. Java异常处理的误区和经验总结

    本文着重介绍了 Java 异常选择和使用中的一些误区,希望各位读者能够熟练掌握异常处理的一些注意点和原则,注意总结和归纳.只有处理好了异常,才能提升开发人员的基本素养,提高系统的健壮性,提升用户体验, ...

  10. SGU107——987654321 problem

    For given number N you must output amount of N-digit numbers, such, that last digits of their square ...