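While looking up debugging material recently, I came across a blog post by Tess on how to quickly analyze problems in complex scenarios. It seemed very practical, so I am bookmarking it here for reference.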

My 9 questions for a pretty thorough problem description

When I call up a customer to start working on an issue I am generally looking for them to answer 9 questions in their own words.

1. What is happening? 
2. What did you expect to happen? 
3. When is it happening? 
4. When did it start happening? 
5. How does this problem affect you? 
6. What do you think the problem is, and what data are you basing this on? 
7. What have you tried so far? 
8. What is the expected resolution? 
9. Is there anything that would prohibit certain troubleshooting steps or solutions?

1. What is happening?

Tip: describe the problem accurately.

Better answer:  Once or twice a day over the last two weeks we have gotten reports from customers that the login page is not responding.  Eventually if they wait long enough the page will time out.  We have confirmed this behavior by logging in at the time of the failure, but all other pages seem to be working well during the problem period.   The problem persists until we restart IIS.  We have not seen any events in the event log matching the time of the failure.

2. What did you expect to happen?

Tip: establish a baseline and collect the relevant logs.

For a performance issue we need to have a baseline response time.  Even if the problem we are trying to resolve is that pages are timing out, it is important to know if the expected response time for the page is 5 seconds or a few milliseconds, and also if the application has been stress tested and shown those results during testing.

In the best of cases we would have a solid set of performance logs, event logs and IIS logs from normal operation to compare against the ones we gather in the faulty state.
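As a rough illustration of comparing a faulty period against a baseline, here is a minimal Python sketch. It assumes response times have already been extracted from the logs as plain lists of milliseconds; the function names `summarize` and `compare_to_baseline` are hypothetical and not part of any tool mentioned in the post.

```python
from statistics import median, quantiles


def summarize(samples_ms):
    """Return the median and 95th-percentile response time of the samples."""
    # quantiles(n=20) yields 19 cut points; the last one is the 95th percentile.
    p95 = quantiles(samples_ms, n=20)[-1]
    return {"median": median(samples_ms), "p95": p95}


def compare_to_baseline(baseline_ms, faulty_ms):
    """Return how many times slower the faulty period is, per statistic."""
    base = summarize(baseline_ms)
    bad = summarize(faulty_ms)
    return {key: bad[key] / base[key] for key in base}
```

A ratio close to 1 suggests the page is behaving as it did during normal operation; a large ratio gives a concrete number to put in the problem description.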

3. When is it happening?

Tip: collect every kind of data in detail.

A solid repro scenario is of course optimal, but with production type issues this is seldom the case.   Things that I am looking for here are:

  • How often the problem reproduces (e.g. once or twice a day in the above example)
  • Do we know what actions people take, i.e. it happens when they try to log in?
  • Do we know if it is confined to one server, one application, one page, one action?
  • Do we know if it is reproducible in test?
  • Is it happening only for certain users?
  • Is it happening only under load?
  • Does it only happen when memory usage is high, or when CPU usage is over 80%?
  • Does it always happen at 8 am when the first people in the office start logging in?
  • etc. etc.  any and all conditions relating to the issue are interesting.
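One cheap way to check a condition such as "does it always happen at 8 am" is to bucket the reported failure times by hour of day. The sketch below assumes the failure times are available as ISO 8601 strings; the sample timestamps are made up purely for illustration, and `failures_by_hour` is a hypothetical helper name.

```python
from collections import Counter
from datetime import datetime


def failures_by_hour(timestamps):
    """Count reported failures per hour of day (0-23)."""
    return Counter(datetime.fromisoformat(ts).hour for ts in timestamps)


# Made-up sample data, not taken from a real incident.
reports = [
    "2009-09-01T08:02:00",
    "2009-09-02T08:10:00",
    "2009-09-03T14:45:00",
    "2009-09-04T08:05:00",
]

for hour, count in failures_by_hour(reports).most_common():
    print(f"{hour:02d}:00  {count} failure(s)")
```

The same grouping idea applies to the other conditions in the list (server, page, user), as long as each report carries that attribute.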

4. When did it start happening?

Knowing when it started happening is often one of the most important clues to finding root cause.

5. How does this problem affect you?

Tip: understand the scope of the impact, and choose a handling strategy based on the impact the problem causes.

When I work on an issue, I like to know what kind of impact the problem has on the customer and the users of the system.

The reason I ask is that if a problem prevents users from logging in, for example, and this is a critical application for the business, we should obviously start with a temporary fix rather than going straight into root cause analysis (like recycling when memory reaches 600 MB if the problem is a memory leak, to avoid an unexpected crash), and then prepare for root cause analysis.

It also tells me a little bit about the priority that troubleshooting is going to get.  If the issue is very severe, maybe it is ok to perform some troubleshooting actions that have a lot of impact on the system, if it is going to help us find the problem faster.

6. What do you think the problem is, and what data are you basing this on?

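Tip: think critically. Do not over-trust the information provided by whoever reported the problem, and do not let it lead you astray.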

This is especially interesting if you come in as an external troubleshooter, as you get a lot of insight into known problem areas of the application by asking what they think the problem might be.

There is a temptation to follow down the path of what is already guessed because it often sounds very plausible, but when I look at a problem I try to keep a very open mind to avoid getting stuck in a tunnel.

If someone believes that the problem is “x” then oftentimes, because of how our minds work, we tend to look at the data from that direction, even if it doesn’t fit. Part two of this question is: what data are you basing your theory on?  I try to look at that data with fresh eyes, to see if it really corroborates the theory or if there are some holes in it.   To be honest, if a person comes to me with an issue and already has a theory, the first thing I try to do is to disprove it.

7. What have you tried so far?

Tip: find out whether any verification has already been attempted.

This question serves two purposes.

1. Finding out what has been done so we don’t need to re-invent the wheel.    
2. Finding out the results of those actions as it gives us more data about the problem.

Keep in mind here as well that reliability is important.  If something was presumably done but there is no documentation of it or the results, it is probably worthwhile re-doing it, depending on how complex the task was.

8. What is the expected resolution?

Tip: work out which line of analysis matches the expected resolution.

This might sound like a repeat of question 2 (What did you expect to happen), but it is very different.

An expected resolution might be to “avoid the crash” which you can do by preemptive recycling, separating the app into a different application pool, moving to 64-bit, reverting back to a previous build etc.

Another expected resolution might be to “find the root cause to avoid that it happens again”.

And a third may be, “if we get the pages to consistently respond in less than 5 seconds under load we are good”.

Defining the expected resolution is crucial as this is what you will measure and verify the solution against to determine when you are done troubleshooting.  If you don’t have this, how will you know when you are done?

This expected resolution, just like requirements in a software project, should also be agreed upon so that all involved parties work against the same goal.
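An agreed resolution such as “pages consistently respond in less than 5 seconds under load” can be written down as a check that either passes or fails. This is only a sketch: it assumes the response times measured under load are available as a list of seconds, and `meets_expected_resolution` is a hypothetical name.

```python
def meets_expected_resolution(response_times_s, limit_s=5.0):
    """Return True only if there are samples and every one is below the limit."""
    # An empty measurement set proves nothing, so it does not count as a pass.
    return bool(response_times_s) and all(t < limit_s for t in response_times_s)
```

Treating "consistently" as "every sample" is a deliberately strict reading; the involved parties might instead agree on a percentile, which is exactly the kind of detail worth settling up front.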

9. Is there anything that would prohibit certain troubleshooting steps or solutions?

Tip: ask whether there is some way to avoid the problem occurring in the first place.

Knowing the limitations on data gathering or on certain solutions both determines what actions you can take and changes the expectations of what can be achieved.

On this question I would expect answers like “we cannot install any tools on the server without going through a process of 10 change requests”, or “we can reproduce this on a test machine, so you can live debug if you need to”.

then...

Summarizing everything

Once I have all the facts from the questions above I usually sit down and summarize everything in a format that looks something like this:

Problem description: 
=========================== 
… 
Expected resolution: 
=========================== 
… 
Troubleshooting done: 
============================ 
… 
Next steps: 
=========================== 
… 
Timeline for next steps: 
=========================== 
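If the summary is written often, the format above can be produced from structured notes. The following is a minimal sketch, not something from the original post: the `IssueSummary` class and its field names are assumptions chosen to mirror the five headings.

```python
from dataclasses import dataclass


@dataclass
class IssueSummary:
    problem_description: str
    expected_resolution: str
    troubleshooting_done: str
    next_steps: str
    timeline: str

    def render(self) -> str:
        """Render the five sections in the plain-text format shown above."""
        sections = [
            ("Problem description", self.problem_description),
            ("Expected resolution", self.expected_resolution),
            ("Troubleshooting done", self.troubleshooting_done),
            ("Next steps", self.next_steps),
            ("Timeline for next steps", self.timeline),
        ]
        rule = "=" * 27
        return "\n".join(f"{title}:\n{rule}\n{body}" for title, body in sections)
```

Calling `print(IssueSummary(...).render())` with the answers to the nine questions gives a text block that can be pasted into an email or ticket for all parties to agree on.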

From: https://blogs.msdn.microsoft.com/tess/2009/09/09/first-step-in-troubleshooting-complex-issues-define-and-scope-your-issue-properly/
