Troubleshooting
10 Steps to Finding and Fixing Technical Issues
Jonathan Marsden
jmarsden@fastmail.fm
13 February 2010
0. Contents
1. Be prepared (Be a Boy Scout)
2. First, do no harm (be a Doctor)
3. Get a description (be a counsellor)
4. Reproduce the issue (go forth and multiply?)
5. Preventive Maintenance (do what should already have been done)
6. Narrow it down (box it in)
7. Fix or replace what broke (Be Mr. Fixit, at last!)
8. Is it really gone? (Be a QA Dept)
9. Take pride in your work (Be happy!)
10. Prevent recurrence (Learn your lessons)
11. Summary and Questions
1. Be prepared (be a Boy Scout)
- Knowledge
- Know how Linux and its services log information
- Know how to use your tools (of all kinds)
- Know your own limitations
- 1.1 Tools
- 1.2 Information
- 1.3 Connectivity
- 1.4 Data Storage
- 1.5 Spare Parts
- 1.6 Personal Comfort Items
1.1 Tools
- Software
- LiveCDs are wonderful things:
- Ultimate Boot CD (hardware diagnostics) http://www.ultimatebootcd.com/
- Be prepared to handle 64-bit and 32-bit systems
- A bootable USB key (or a few of them!)
- If you are old-fashioned: a bootable floppy, tomsrtbt
- Use the command line tools that already exist:
- top, ps, netstat, vmstat, iostat, lsof, find, grep, nc, dd, tcpdump, ...
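Before you rely on these tools in an emergency, it helps to know which of them are actually installed on the machine in front of you. The following is a minimal sketch in Python, assuming only the standard library; the helper name `check_tools` is made up for illustration, and the tool list is simply copied from the bullet above.

```python
import shutil

# Tool names taken from the list above; adjust for your own toolkit.
TOOLS = ["top", "ps", "netstat", "vmstat", "iostat", "lsof",
         "find", "grep", "nc", "dd", "tcpdump"]


def check_tools(names):
    """Return a dict mapping each tool name to its full path, or None if missing."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in check_tools(TOOLS).items():
        print(f"{name:10s} {path or 'NOT FOUND'}")
```

Running this on a rescue LiveCD before you need it tells you which tools you would have to bring along yourself.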
1.2 More Tools
- Hardware
- Tools to disassemble a PC/other hardware
- Flashlight
- Compressed air
- Roll of shop towel (for cleaning systems, and you!)
- Keys (physical keys to gain access to premises, open cabinets, etc.)
- Network cable tester (cheapie, not a Fluke $x000 one!)
- PC power supply tester
- AC power tester
- Optional Extras to Consider
- Extension cord (long beefy one, short small one)
- Power strip (surge protected)
- VGA, USB and PS/2 extenders
- Multimeter??
- USB LED light on gooseneck
- Digital camera (screenshots, photos of cabling, etc)
- Multitool (for cutting things, and emergencies)
1.3 Even more Tools
- Information
- Pen and paper
- Paperwork re. this site/customer/system config
- May include usernames/passwords/alarm codes...
- Who to call (boss, experts, vendors, spouse, ...)
- System documentation (paper, CDs, URLs to vendor site)
- Books and reference materials?
- Connectivity
- Phone - charged and working cellphone (charger too?)
- Data - 3G card? Wifi card?
- Network cable
- Netbook or laptop?
- USB cables - different connectors!
- Serial cables and adaptors (if you still deal with this)
1.4 Yet More Tools
- Data Storage (for emergency backups)
- Writeable CDs/DVDs
- Spare USB key
- Maybe a USB/eSATA external hard drive
- Spare Hardware
- You know what commonly breaks on the systems you work on
- Fans
- Power supply
- DVD burner
- Networking gear
- Switch (small 5 or 8 port is fine)
- Router
- USB wifi adapter
- USB NIC
- PCI NIC (becoming less common)
- Keyboard and Mouse (PS/2 and USB)
1.5 The last set of tools!
- Personal Comfort
- Food, Water, Clothing
- Medicines, Cash
- Vehicle:
- Gas, Water, Oil
- Spare Tire, Toolkit
- First Aid Kit
2. First, do no harm (Be a Doctor)
- Make backups - because the value is in the data
- Backing up from a rescue CD
- Backing up over the network
- Verify your backups!
- Document the current state
- Connections, location, lights, switch states, ...
- Hardware configuration, OS version, application version
- Use paper, or a text file on an independent machine
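"Verify your backups" can be as simple as comparing checksums of the original and the copy. The sketch below is one possible approach in Python, not a full backup tool; the function names `sha256_of` and `backup_and_verify` are hypothetical, and it assumes a single file copied to a directory you can write to.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_and_verify(source, dest_dir):
    """Copy source into dest_dir, then confirm the copy has the same checksum."""
    source = Path(source)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    copy_path = dest_dir / source.name
    shutil.copy2(source, copy_path)  # copy2 also preserves timestamps
    if sha256_of(source) != sha256_of(copy_path):
        raise IOError(f"backup of {source} does not match the original")
    return copy_path
```

A read-back immediately after the copy may be served from cache, so for removable media it is probably worth unmounting, remounting and checking again.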
3. Get a full description (be a counsellor)
- Who
- (is the reporter? is the owner? is affected?)
- Where
- (are the affected people and systems?)
- What
- (parts of the system/network/business are affected?)
- (What has been done already to troubleshoot the issue?)
- (What exactly happens?)
- (What does the reporter think should happen?)
3.1 Get more description
- When
- (did this issue start?)
- (and what else happened around that time?)
- (by when do we absolutely need a working system?)
- (Until when will a workaround/kludge/manual approach be OK?)
- How
- (can the problem be duplicated?)
- (and is it always reproducible?)
- Why
- (does the reporter think this issue happened?) - if appropriate
4. Reproduce the issue
- Try it yourself
- Ask the reporter to reproduce it
- If appropriate, ask whoever first noticed it to reproduce it
- Correct your "How" description to reflect reality
- Document specific error messages/log entries/indicators
- Dealing with intermittent issues
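For intermittent issues, one option is to repeat the same check many times and write down every outcome, so that "it fails sometimes" becomes "it failed on attempts 4 and 8 out of 8". The following Python sketch assumes you can express the check as a function returning True when things work; `repeat_check` is an illustrative name, not an existing tool.

```python
import time


def repeat_check(check, attempts=20, delay_seconds=0.0):
    """Run check() repeatedly and record each outcome.

    check is any zero-argument callable that returns True when the system
    behaves correctly. An exception counts as a failure and its text is kept.
    Returns the list of (attempt, ok, detail) tuples and the failure rate.
    """
    results = []
    for attempt in range(1, attempts + 1):
        try:
            ok, detail = bool(check()), ""
        except Exception as exc:  # record the error rather than stopping
            ok, detail = False, repr(exc)
        results.append((attempt, ok, detail))
        if delay_seconds:
            time.sleep(delay_seconds)
    failures = [r for r in results if not r[1]]
    return results, len(failures) / attempts
```

The recorded failure rate also gives you something to compare against later, when you want to know whether a fix really worked.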
5. Preventive Maintenance
- Do only maintenance that is:
- quick
- safe
- likely to fix the issue
- If the hardware is dirty/dusty, consider cleaning it
- If it is noisy, replace noisy fan(s) if practical
- If it seems too hot, find out why, and cool it down
- If a software subsystem is far behind on updates, think about updating
- (be *sure* you have good backups first)
6. Narrow it down (box it in)
- My PC has 5000+ executables and 3300+ config files...
- Divide and conquer!
- Tests divide the problem space, leaving smaller sets of things which might be the cause
- Binary search (or close to it) is highly efficient
- Googling for error messages can help decide what tests to use
- Do not blindly follow the ideas you see "out there"!
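To see why binary search is so efficient here, consider the 3300+ config files mentioned above. The sketch below is a simplified model in Python, under some loud assumptions: the candidates can be put in a fixed order, you can enable just the first n of them, the problem is absent with none enabled and present with all enabled, and once it appears it stays present as more are added. The names `first_bad` and `fails_with` are hypothetical.

```python
def first_bad(count, fails_with):
    """Find the smallest n in 1..count for which fails_with(n) is True.

    fails_with(n) should enable only the first n candidates (config files,
    modules, changes) and report whether the problem appears.
    Returns the index of the culprit and the number of tests that were run.
    """
    low, high = 0, count  # invariant: works with low, fails with high
    tests_run = 0
    while high - low > 1:
        mid = (low + high) // 2
        tests_run += 1
        if fails_with(mid):
            high = mid  # culprit is among the first mid candidates
        else:
            low = mid   # culprit is somewhere after mid
    return high, tests_run
```

With 3300 candidates this needs at most 12 tests, compared with up to 3300 when trying them one at a time. Real problems rarely divide this cleanly, which is why the slide says "or close to it".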
6.1 Thoughts on narrowing it down
Test carefully, at *every* step
- Keep checking for error messages and log file entries
- Write down your tests and their results (legibly!)
- If in doubt, simplify rather than add complexity
- Increase verbosity (debug logging, for example)
- Best done when you know the subsystem or application
- To find a performance bottleneck:
- Get a repeatable baseline first
- Slow down the suspected bottleneck component and see whether overall performance drops
- Speeding it up to test the theory may be hard or expensive
- Don't let the problem get "outside your box"
- Don't "know that it must be X, or can't be Y..." - test!
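The "get a baseline, then slow down the suspect" idea can be illustrated with a toy model. This Python sketch assumes a workload whose parts run at the same time, so the slowest part sets the total; `baseline`, `slowed` and `run_in_parallel` are names invented for this example, and the `disk` and `network` stand-ins in the usage note are hypothetical.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def baseline(task, runs=5):
    """Time task() several times and return the median duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        task()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


def slowed(component, extra_seconds):
    """Wrap component so that every call takes extra_seconds longer."""
    def wrapper(*args, **kwargs):
        time.sleep(extra_seconds)
        return component(*args, **kwargs)
    return wrapper


def run_in_parallel(*stages):
    """Toy workload: all stages run at once, so the slowest sets the total."""
    with ThreadPoolExecutor(max_workers=len(stages)) as pool:
        futures = [pool.submit(stage) for stage in stages]
        for future in futures:
            future.result()
```

Suppose `disk` sleeps for 0.10 s and `network` for 0.02 s. Timing `run_in_parallel(disk, slowed(network, 0.05))` should give roughly the same result as the baseline, while `run_in_parallel(slowed(disk, 0.05), network)` should take noticeably longer, which points at `disk` as the bottleneck.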
7. Fix or replace what broke (Be Mr. Fixit, at last!)
- Follow safety guidelines
- (unplug stuff before opening it!)
- PC hardware replacement is usually simple
- Remember/document how it came apart
- Use that info to put it back together again
- Sometimes opening the PC case is the only hard part!
- Software config changes are often not so simple
- Back up the original files!
- Use package management tools
- Use version control if appropriate
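Where full version control is overkill, a pristine copy plus a diff gives you much of the benefit. The following is a rough Python sketch, assuming a plain-text config file; the `.orig` suffix and the function names `keep_original` and `describe_changes` are conventions chosen for this example only.

```python
import difflib
import shutil
from pathlib import Path


def keep_original(config_path):
    """Copy config_path to '<name>.orig' unless that copy already exists."""
    config_path = Path(config_path)
    original = config_path.with_name(config_path.name + ".orig")
    if not original.exists():  # never overwrite the pristine copy
        shutil.copy2(config_path, original)
    return original


def describe_changes(config_path):
    """Return a unified diff between the saved original and the current file."""
    config_path = Path(config_path)
    original = config_path.with_name(config_path.name + ".orig")
    before = original.read_text().splitlines(keepends=True)
    after = config_path.read_text().splitlines(keepends=True)
    return "".join(difflib.unified_diff(before, after,
                                        fromfile=str(original),
                                        tofile=str(config_path)))
```

Call `keep_original` before the first edit, then paste the output of `describe_changes` into your notes so the exact change is documented.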
8. Is it really gone? (Be a QA Dept)
- Test it yourself
- Have the reporter test it, and agree it is gone
- Try hard to make it happen. Multiple times.
- Reboot and retest, if appropriate.
- If it was intermittent:
- Ask the reporter to contact you if it returns
- Express your willingness to return
- Set expectations.
- Instruct and educate -- avoid "free return visits"!
- If appropriate, have someone sign something saying you fixed it
9. Take pride in your work (Be happy!)
- Troubleshooting is stressful; savour this moment!
- Take appropriate credit - be a hero for a minute
- Document the solution for your friends/colleagues
- Write it down now, while the details are still in your head
10. Prevent recurrence (Learn your lessons)
- Determine how this could/should have been prevented
- Take steps to ensure it won't happen again
- Apply lessons learned to other systems/sites/customers
- How could you have arrived at this result faster?
- Could someone have been alerted automatically?
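As one example of automatic alerting, a full disk is a common cause of trouble that is easy to detect early. The sketch below is a minimal Python check, assuming a 90% threshold picked purely for illustration; `disk_alerts` is a hypothetical helper, and how the warning reaches a person (email, pager, cron output) is left open.

```python
import shutil


def disk_alerts(paths, max_used_fraction=0.90):
    """Return a warning string for each path whose filesystem is too full."""
    alerts = []
    for path in paths:
        usage = shutil.disk_usage(path)  # named tuple: total, used, free
        if usage.total == 0:  # skip pseudo-filesystems that report no size
            continue
        used_fraction = usage.used / usage.total
        if used_fraction >= max_used_fraction:
            alerts.append(f"{path}: {used_fraction:.0%} used "
                          f"(limit {max_used_fraction:.0%})")
    return alerts


if __name__ == "__main__":
    for line in disk_alerts(["/"]):
        print(line)
```

Run from a scheduler, a check like this could warn someone days before the disk actually fills up.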
Summary
- You can systematically troubleshoot any technical issue
- It takes preparation and careful systematic testing
- Are there specific tools or ideas you would like demonstrated?