Troubleshooting: Fix Problems Systematically
Troubleshooting is systematic. Check one thing at a time. Find the problem. Fix it.
Here's the thing: Troubleshooting is a skill. Learn the approach. Use it.
The Troubleshooting Process
1. Understand the Problem
Ask:
- What's broken?
- When did it break?
- What changed?
- What should it do?
My take: Understand the problem first. Don't guess.
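A quick way to start on "What changed?" is to list files that were modified recently. This is a minimal sketch: the function name `recently_changed` is made up for illustration, and it only catches changes that touch file modification times.

```bash
# List regular files under a directory modified within the last N days.
# Usage: recently_changed <directory> [days]
# "recently_changed" is a hypothetical helper name, not a standard tool.
recently_changed() {
    local dir=$1
    local days=${2:-1}
    # -mtime -N matches files modified less than N days ago.
    # Errors from unreadable directories are hidden to keep the output clean.
    find "$dir" -type f -mtime "-$days" 2>/dev/null
}

# Example: configuration files touched in the last two days.
# recently_changed /etc 2
```

Run it with sudo if parts of the directory are not readable by your user.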
2. Reproduce the Problem
# Try to reproduce
# See if it happens consistently
My take: Reproduce the problem. If you can't reproduce it, you can't confirm that you fixed it.
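To see whether a failure is consistent or intermittent, you can run the failing command several times and count the failures. A minimal sketch, assuming the problem shows up as a non-zero exit status; `repro_count` and the health-check URL are hypothetical.

```bash
# Run a command several times and count how often it fails.
# Usage: repro_count <runs> <command> [args...]
# "repro_count" is a hypothetical helper name.
repro_count() {
    local runs=$1
    shift
    local failures=0
    local i=1
    while [ "$i" -le "$runs" ]; do
        # A non-zero exit status counts as one failure.
        if ! "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures/$runs runs failed"
}

# Example with a placeholder URL; substitute the command that fails for you.
# repro_count 20 curl -fsS http://localhost:8080/health
```

If every run fails, the problem is consistent. If only some fail, look at timing, load, or external dependencies.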
3. Check Logs
# System logs
journalctl -p err --since "1 hour ago"
# Service logs
journalctl -u service-name -n 100
# Application logs
tail -f /var/log/app.log
My take: Logs often tell you what's wrong. Check them before you change anything.
4. Check Status
# Service status
systemctl status service-name
# System resources
htop
df -h
My take: Check status. See what's running. See what's using resources.
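Reading `df` output by eye works, but a small filter makes full disks hard to miss. This sketch assumes the POSIX output format of `df -P` and mount points without spaces; `full_filesystems` is a hypothetical name and the 90% default is an arbitrary choice.

```bash
# Print mount points whose usage is at or above a threshold (default 90%).
# Reads `df -P` output on stdin, so saved output can be checked too.
# Assumes mount points contain no spaces.
full_filesystems() {
    local threshold=${1:-90}
    awk -v limit="$threshold" 'NR > 1 {
        use = $5              # capacity column, for example "95%"
        sub(/%/, "", use)     # strip the percent sign
        if (use + 0 >= limit + 0) print $6, $5
    }'
}

# Example:
# df -P | full_filesystems 90
```

No output means no filesystem is at or above the threshold.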
5. Isolate the Problem
# Test components separately
# Find what works and what doesn't
My take: Isolate the problem. Narrow it down. Find the root cause.
6. Fix and Test
# Make the fix
# Test that it works
# Verify the problem is solved
My take: Fix it. Test it. Verify it works.
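A service can take a moment to come up after a fix, so one check right after a restart can mislead. One way to verify is to poll a check a few times before deciding. This is a sketch only: `wait_for` is a hypothetical helper and `myservice` is a placeholder.

```bash
# Retry a check until it passes or the attempts run out.
# Usage: wait_for <tries> <delay-seconds> <command> [args...]
# "wait_for" is a hypothetical helper name.
wait_for() {
    local tries=$1
    local delay=$2
    shift 2
    local i=1
    while [ "$i" -le "$tries" ]; do
        if "$@" >/dev/null 2>&1; then
            echo "passed on attempt $i"
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    echo "still failing after $tries attempts"
    return 1
}

# Example: restart, then wait up to 10 seconds for the service to be active.
# sudo systemctl restart myservice
# wait_for 10 1 systemctl is-active --quiet myservice
```

A passing check only shows the service is up. Rerun whatever reproduced the original problem to confirm it is actually solved.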
Common Troubleshooting Areas
Services Not Starting
# Check status
systemctl status service-name
# Check logs
journalctl -u service-name -n 100
# Test manually
sudo -u service-user /path/to/command
My take: Services fail for a reason. Check status. Check logs. Test manually.
High Resource Usage
# Check CPU
top
htop
# Check memory
free -h
# Check disk I/O
sudo iotop
# Find the process
ps aux --sort=-%cpu | head -10
My take: High usage means a process is consuming CPU, memory, or I/O. Find which one. Fix it.
Network Problems
# Check interface
ip link show
# Check IP
ip addr show
# Test connectivity
ping -c 4 8.8.8.8
# Check DNS
nslookup google.com
My take: Network problems are systematic. Check one thing at a time.
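The connectivity and DNS checks above are most useful read together: if an IP address is reachable but a name does not resolve, the problem is likely DNS. The sketch below encodes that reasoning. The function names are hypothetical, and it assumes ICMP is not blocked between you and 8.8.8.8.

```bash
# Classify a network problem from two results (0 = success):
#   $1  exit status of pinging an IP address
#   $2  exit status of resolving a hostname
# Both function names here are hypothetical.
classify_network() {
    if [ "$1" -ne 0 ]; then
        echo "no IP connectivity: check interface, address, and route"
    elif [ "$2" -ne 0 ]; then
        echo "IP works but DNS fails: check resolver configuration"
    else
        echo "connectivity and DNS both work: look at the application"
    fi
}

# Run the two checks and classify the result.
network_triage() {
    local ip_status=0
    local dns_status=0
    ping -c 1 -W 2 8.8.8.8 >/dev/null 2>&1 || ip_status=$?
    nslookup google.com >/dev/null 2>&1 || dns_status=$?
    classify_network "$ip_status" "$dns_status"
}

# Example:
# network_triage
```

IP connectivity is tested first because DNS lookups usually fail too when there is no route out.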
Systematic Approach
Start Broad, Narrow Down
- Check if system is running
- Check if service is running
- Check if network is working
- Check specific component
- Fix the problem
My take: Start broad. Narrow down. Find the problem.
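The list above can be sketched as a chain of checks that stops at the first failure, so the first broken layer is the one you see. The helper names, `myservice`, and the health URL are placeholders, not part of any real setup.

```bash
# Run one named check quietly, print the result, and pass the status along.
# "step" and "broad_to_narrow" are hypothetical helper names.
step() {
    local name=$1
    shift
    if "$@" >/dev/null 2>&1; then
        echo "ok: $name"
    else
        echo "FAILED: $name"
        return 1
    fi
}

# Checks are ordered broad to narrow; && stops at the first failure.
# "myservice" and the URL are placeholders for your own service.
broad_to_narrow() {
    step "system is up" uptime &&
    step "service is active" systemctl is-active --quiet myservice &&
    step "network is reachable" ping -c 1 -W 2 8.8.8.8 &&
    step "component responds" curl -fsS http://localhost:8080/health
}

# Example:
# broad_to_narrow
```

Everything printed as ok is ruled out. Start digging at the first FAILED line.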
Common Mistakes (I've Made These)
- Jumping to conclusions: Don't guess. Check systematically.
- Not checking logs: Logs often tell you what failed and when. Check them.
- Changing too many things: Change one thing at a time. Test after each change.
- Not documenting: Document what you tried. Learn from it.
- Giving up too early: Troubleshooting takes time. Be patient.
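For the documenting habit, even a one-line helper that appends timestamped notes to a file is enough. A minimal sketch: the `note` function and the `TROUBLESHOOT_LOG` variable are made-up names, and the default path is an assumption.

```bash
# Append a timestamped note to a troubleshooting log.
# "note" and TROUBLESHOOT_LOG are hypothetical names; the log
# defaults to a file in the home directory.
note() {
    local logfile=${TROUBLESHOOT_LOG:-$HOME/troubleshooting.log}
    printf '%s  %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" >> "$logfile"
}

# Example: record each change and its result as you go.
# note "Restarted myservice; still fails to start"
# note "Fixed permissions on /usr/local/bin/myservice; service starts"
```

Writing one note per change also makes it harder to change several things at once without noticing.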
Real-World Examples
Service Won't Start
# 1. Check status
systemctl status myservice
# 2. Check logs
journalctl -u myservice -n 100
# 3. Test manually
sudo -u service-user /usr/local/bin/myservice
# 4. Check permissions
ls -la /usr/local/bin/myservice
# 5. Fix and restart
sudo systemctl restart myservice
System is Slow
# 1. Check resources
htop
df -h
# 2. Find resource hogs
ps aux --sort=-%cpu | head -10
# 3. Check I/O
sudo iotop
# 4. Fix the problem
What's Next?
Now that you understand the troubleshooting process, you can fix problems systematically. To get more out of your logs, review Logging.
Personal note: Troubleshooting used to frustrate me. Then I learned the systematic approach. Now I troubleshoot methodically. It works. Learn it.