As a follow-up to my article Anatomy of a Successful Troubleshooter, I began thinking about the habits that I have found effective when troubleshooting. I was coaching a teammate on troubleshooting and put together this catalog of the habits I have found most valuable.
Problem Management and Root Cause Analysis
Problem Management and Root Cause Analysis are vast subjects, but I think they can be boiled down to performing troubleshooting more effectively and creating measurable results.
Here are some key takeaways from this topic:
- Reactive Problem Management is focused on finding an immediate work-around
- Proactive Problem Management is focused more on finding the root cause and prevention
- If we can properly identify a problem and understand it then we can probably do something to prevent it from occurring again
- Often the first problem identified is not the root cause and we must continue to explore and dig deeper (this is the fun part)
There is a lot written about these topics from a management perspective, which talks about various barriers and challenges. However, we don’t need management committees to tell us how to get results; we just need to continue to improve our own troubleshooting skills and use good judgement.
When particular problems arise, we know that performing some Root Cause Analysis can produce valuable information. This is where we discover how the system works, how to fix it, why that is the correct fix and, most importantly, how to prevent the problem from recurring. For this investment, the worst case is that we learn something new; the best case is that we produce measurable improvements.
Effective Troubleshooting Habits
Filter and Corroborate Advice you Find
During a Google search, you may be led down many wrong paths, and the solutions you find may be wrong or may not apply to your situation. Sometimes I find an error that has over a dozen possible causes and a dozen proposed solutions. Read carefully and do not blindly implement them one by one until one works (see Avoid Brute Force Methods without Prior Inspection). You’ll be more efficient in resolving the problem (and won’t risk introducing additional problems) when you take the time to filter and critique what you’re reading. Before implementing a solution, keep reading until you know enough about the problem and the various solutions to make an educated choice of action. Try to corroborate advice with more reputable sources.
We have to consider that the solution may come only from connecting several different clues that make sense when examined together, not individually. In other words, if we look at one symptom and find a solution to that symptom, we may not necessarily have found the correct solution to our specific problem. All aspects must be correlated and examined together to discover the correct solution.
The internet is also notorious for giving bad advice and selling it as good advice or even a best practice. There are many reasons for this. The poster’s solution may depend on their specific environment or on undisclosed factors that don’t apply to your situation. Sometimes the solution applied to older OS or SQL versions and is now either incorrect or no longer applicable. Try to find sources relevant to your versions and configuration. Attempt to verify that the symptoms described in their solution are the same symptoms you are observing and that they come from the same source problem. Look for posts where there is some common agreement and eliminate the fringe cases that probably don’t apply to your situation.
- Be careful of being led down wrong paths that won’t work
- When there are too many possible causes, keep researching to eliminate the ones that don’t apply
- Corroborate advice with more reputable sources
- Don’t risk introducing new problems by implementing solutions you don’t understand
- Attempt to verify the proposed solution matches the symptoms associated with your problem
- Filter and eliminate fringe cases that don’t apply
Observe the Obvious
Sometimes the clues are right in front of your face and it just takes stepping back to see them.
EXAMPLE: One time I was troubleshooting an FTP problem with a vendor. I logged on to their FTP server and confirmed our process had uploaded the files, but the vendor was having issues with the back-end processes that pick up the files and claimed that we were not sending them. In the same directory on their server where we uploaded our files, I noticed large files labeled ftp_activity_log. I opened one and found that their back-end software was logging API communication failures, in logs they had never bothered to inspect.
Correlate the Clues
Sometimes it takes someone asking, “Has anything else changed today?” It helps to find out what other teams have been doing recently with your network and servers. But don’t use this just to point blame at another team without actually finding a clue and taking the time to investigate further. Armed with preliminary information, you can begin to ask intelligent questions of other teams in a cooperative, resolution-focused manner.
EXAMPLE: One day an application that involved 4 SQL clusters started failing on all of them. The application lead spent half a day on the phone with the vendor troubleshooting to no avail. Later in the day I checked in with the busy application lead. I didn’t get much information, but with the little I got I started inspecting things myself and found in the application error logs that the problems all started during the lunch hour. I remembered getting a call during lunch from our SAN engineers, who were moving drives on a different cluster to the new SAN and had to recreate the MS DTC resources on that cluster. I went to the cluster and found the DTC service was not properly configured for remote access. When I fixed that, everything started working with the original application again. The application lead and vendor were unable to make this connection because they didn’t look outside their scope.
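The correlation in that example (when did the errors start, and what changed shortly before?) can be sketched in a few lines of Python. This is only an illustration: the log line format, the sample log lines, the change list and the function names `first_error_time` and `changes_near` are all assumptions made up for this sketch, not taken from any real system.

```python
from datetime import datetime, timedelta


def first_error_time(log_lines):
    """Return the timestamp of the earliest ERROR line, or None if there is none."""
    times = []
    for line in log_lines:
        # Assumed line format: "YYYY-MM-DD HH:MM:SS LEVEL message"
        stamp, sep, _ = line.partition(" ERROR ")
        if sep:
            times.append(datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"))
    return min(times) if times else None


def changes_near(start, changes, window=timedelta(hours=1)):
    """Return changes made within `window` before `start`, most recent first."""
    nearby = [(when, what) for when, what in changes
              if start - window <= when <= start]
    return sorted(nearby, reverse=True)


# Hypothetical sample data, loosely modeled on the example above.
log = [
    "2024-01-15 11:58:12 INFO Batch completed",
    "2024-01-15 12:10:05 ERROR Distributed transaction failed",
    "2024-01-15 12:10:09 ERROR Distributed transaction failed",
]
changes = [
    (datetime(2024, 1, 15, 9, 0), "Antivirus definitions updated"),
    (datetime(2024, 1, 15, 12, 5), "SAN migration: MS DTC resource recreated"),
]

started = first_error_time(log)
suspects = changes_near(started, changes)
for when, what in suspects:
    print(when, what)
```

A list like this only produces suspects. Each one still has to be investigated before raising it with the team that made the change.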
Avoid Brute Force Methods without Prior Inspection
In an effort to make problems go away and move on, don’t just fire off service restarts or server reboots without even looking at the problem. It is important to fix problems quickly, but collecting key data while the problem is occurring and testing your theories is also important for determining the root cause (especially for recurring problems). When you fail to collect key data beforehand, you can erase evidence or be unable to examine the problem afterwards. Sometimes mysterious problems are solved with a simple service restart or reboot, but if the problem is recurring, this is a prime opportunity to learn something new and rise above the average troubleshooter.
- Don’t just fire off service restarts or server reboots until you have inspected the problem
- Collect data beforehand to make it easier to duplicate or troubleshoot later
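One way to make this a habit is to script the evidence collection so it takes seconds to run before the restart. The following is a minimal sketch, not a finished tool: the function name `capture_evidence` and the sample collectors are hypothetical, and in practice each collector would wrap whatever status command or log query applies to your environment.

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path


def capture_evidence(collectors, out_dir):
    """Run each collector and save its result before any restart is attempted."""
    stamp = datetime.now().strftime("snapshot_%Y%m%d_%H%M%S")
    snapshot_dir = Path(out_dir) / stamp
    snapshot_dir.mkdir(parents=True, exist_ok=True)
    for name, collect in collectors.items():
        try:
            result = collect()
        except Exception as exc:
            # A failing collector is itself a clue, so record it and keep going.
            result = f"collector failed: {exc!r}"
        path = snapshot_dir / f"{name}.json"
        path.write_text(json.dumps(result, indent=2, default=str))
    return snapshot_dir


# Hypothetical collectors; replace with real checks for your environment.
collectors = {
    "service_status": lambda: {"service": "example-service", "state": "hung"},
    "recent_errors": lambda: ["12:10:05 Distributed transaction failed"],
    "broken_probe": lambda: 1 / 0,
}

saved = capture_evidence(collectors, tempfile.mkdtemp())
print(sorted(p.name for p in saved.iterdir()))
```

Writing each snapshot into a timestamped folder keeps the evidence from separate incidents apart, which makes it easier to compare them when a problem recurs.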
Get a Fresh Perspective
Sometimes you need to stop thinking about something to see it more clearly. When you spend too much time on a problem, your efforts become less effective and have diminishing returns. Sometimes you are troubleshooting a particularly evasive problem over many hours. Get up, take a break and do something different. Verbalizing the problem or drawing it on a whiteboard helps, and solutions become more obvious when you explain the problem to someone else – then BAM – something you weren’t seeing before comes into the picture and you’re back on track. Don’t be embarrassed to be wrong in your assumptions; we all go through a similar process and have similar blocks. It helps to have colleagues who can help each other when someone gets stuck.
A similar topic is brainstorming. When developing a new solution (or solving a problem) after spending many hours, I frequently find solutions come to me after taking a break and when I revisit them with a fresh perspective (an ah-ha moment that leads me in a new direction).
Document and Improve
After the troubleshooting is over and the fires are put out, take time to review the situation. The best time to document something is right after you developed it or just fixed it when all the details are still fresh in your mind.
Documentation
After troubleshooting something, it will usually come up again in some fashion. I can’t remember every detail of what I troubleshot months ago – heck, sometimes a week ago. And remember, someone else might benefit from the knowledge you gained, so you can be a hero by sharing it. Create a Visio diagram, write a procedure document, write a wiki or blog article, etc. Share it with your peers and incorporate their input.
This extra step requires investing a little more time, but it also reinforces what you learned. When you start to document what you learned you gain a more complete understanding of the subject.
Identify Things that Could be Improved
During your troubleshooting you may discover systems or processes that need to be upgraded or improved. I want to make things more efficient, which in turn makes my job and others’ jobs easier. This may mean putting in an automated task or suggesting a system or process improvement to management.
What habits have you found useful?