Complete Troubleshooting Guide: Your Kubernetes Debugging Playbook

This is your complete troubleshooting playbook. Everything you need to debug Kubernetes issues. Systematic. Methodical. Effective.

🎯 The Big Picture

Think of this guide like a hotel operations manual. When something goes wrong, you have a process. You follow steps. You find the problem. That's troubleshooting.

This guide collects the core troubleshooting steps, commands, and scenarios into one reference. Use it. Reference it. Master it.

The Troubleshooting Framework

Follow this framework for any issue:

Understand the problem - What's the symptom?
Check the obvious - Cluster health, resources, events
Narrow down - Which namespace? Which pod? Which service?
Gather information - Describe, logs, events
Analyze - What's the pattern? What's the cause?
Form hypothesis - What's likely wrong?
Test fix - Try the solution
Verify - Did it work?

Quick Reference: Common Issues

Pod Issues

CrashLoopBackOff:

kubectl describe pod <pod-name>
kubectl logs <pod-name> --previous
# Check: Application errors, config issues, resource limits

ImagePullBackOff:

kubectl describe pod <pod-name>
# Check: Image name, registry, authentication

Pending:

kubectl describe pod <pod-name>
# Check: Resources, affinity, taints, storage
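The three statuses above can be folded into a small lookup so the first move is always the same. This is a minimal sketch, not an official tool: the function name triage_hint is hypothetical, the ErrImagePull alias and the fallback branch are additions of this example, and the hint text simply repeats the checks listed above.

```shell
# triage_hint STATUS
# Prints the first checks this guide suggests for a given pod status.
# Example use (status is the third column of `kubectl get pods`):
#   triage_hint CrashLoopBackOff
triage_hint() {
  case "$1" in
    CrashLoopBackOff)
      echo "describe pod, logs --previous: application errors, config issues, resource limits" ;;
    ImagePullBackOff|ErrImagePull)
      # ErrImagePull is the status shown before the back-off starts
      echo "describe pod: image name, registry, authentication" ;;
    Pending)
      echo "describe pod: resources, affinity, taints, storage" ;;
    *)
      # Fallback added for this sketch; not part of the list above
      echo "describe pod and read the Events section" ;;
  esac
}
```

Keeping the mapping in one place makes it easy to extend as you meet new failure statuses.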

Service Issues

Service not accessible:

kubectl get endpoints <service-name>
kubectl get svc <service-name> -o jsonpath='{.spec.selector}'
kubectl get pods --show-labels
# Check: Selectors, endpoints, ports
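A Service only gets endpoints when every key=value pair in its selector appears in a pod's labels. The sketch below checks that rule on plain strings. The function name selector_matches is hypothetical, and it assumes both arguments are comma-separated key=value lists, which is the form shown in the SELECTOR column of kubectl get svc -o wide and the LABELS column of kubectl get pods --show-labels.

```shell
# selector_matches SELECTOR LABELS
# Returns 0 when every selector pair is present in the pod's labels.
# Example:
#   selector_matches "app=web" "app=web,pod-template-hash=abc" && echo match
selector_matches() (
  [ -n "$1" ] || exit 1      # treat an empty selector as "matches nothing"
  set -f                     # no filename globbing while splitting
  IFS=','
  for pair in $1; do
    case ",$2," in
      *",$pair,"*) ;;        # this selector pair is present on the pod
      *) exit 1 ;;           # one missing pair means no match
    esac
  done
  exit 0
)
```

The function body runs in a subshell, so the IFS change does not leak into your session.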

Storage Issues

PVC pending:

kubectl describe pvc <pvc-name>
kubectl get storageclass
# Check: Storage class, provisioner, events
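To spot stuck claims quickly, you can filter the PVC list down to anything that is not Bound. This is a small sketch: unbound_pvcs is a hypothetical name, and it assumes single-namespace output from kubectl get pvc --no-headers, where the first two columns are NAME and STATUS.

```shell
# unbound_pvcs: reads `kubectl get pvc --no-headers` on stdin and
# prints each claim whose STATUS is not Bound.
# Example:
#   kubectl get pvc --no-headers | unbound_pvcs
unbound_pvcs() {
  awk 'NF >= 2 && $2 != "Bound" { print $1 " (" $2 ")" }'
}
```

Each name it prints is a candidate for kubectl describe pvc.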

Deployment Issues

Rollout stuck:

kubectl rollout status deployment/<deployment-name>
kubectl get pods -l app=<app-label>
kubectl describe pod <pod-name>
# Check: Readiness probe, resource constraints, image pull
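A stuck rollout usually shows up as a READY count below the desired count. The sketch below reads that column and reports the gap. The name unready_deployments is hypothetical, and it assumes single-namespace output from kubectl get deployments --no-headers, where the second column is READY in the form ready/desired.

```shell
# unready_deployments: reads `kubectl get deployments --no-headers`
# on stdin and prints deployments with fewer ready replicas than desired.
# Example:
#   kubectl get deployments --no-headers | unready_deployments
unready_deployments() {
  awk 'NF >= 2 {
    split($2, r, "/")                      # READY column looks like 1/3
    if (r[1] + 0 < r[2] + 0)
      print $1 ": " r[1] " of " r[2] " ready"
  }'
}
```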

Essential Commands Cheat Sheet

Information Gathering

# Get common workload resources (kubectl get all omits ConfigMaps, Secrets, PVCs, and Ingresses)
kubectl get all
kubectl get all -n <namespace>

# Describe resources
kubectl describe pod <pod-name>
kubectl describe node <node-name>
kubectl describe svc <service-name>
kubectl describe deployment <deployment-name>

# Get logs
kubectl logs <pod-name>
kubectl logs <pod-name> -f  # Follow
kubectl logs <pod-name> --previous  # Previous instance
kubectl logs <pod-name> -c <container>  # Multi-container

# Get events
kubectl get events
kubectl get events --sort-by='.lastTimestamp'
kubectl get events --field-selector involvedObject.name=<pod-name>

# Get the full object as YAML/JSON, or extra columns (node, IP) with -o wide
kubectl get pod <pod-name> -o yaml
kubectl get pod <pod-name> -o json
kubectl get pod <pod-name> -o wide
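When the event list is long, counting Warning events by reason shows the pattern faster than reading every line. This is a sketch under assumptions: warning_reasons is a hypothetical name, and it expects single-namespace output from kubectl get events --no-headers, where the columns are LAST SEEN, TYPE, REASON, OBJECT, MESSAGE.

```shell
# warning_reasons: reads `kubectl get events --no-headers` on stdin and
# prints "COUNT REASON" for Warning events, most frequent first.
# Example:
#   kubectl get events --no-headers | warning_reasons
warning_reasons() {
  awk '$2 == "Warning" { count[$3]++ }
       END { for (r in count) print count[r], r }' |
    sort -k1,1nr -k2,2
}
```

You can also let the API server do the filtering with kubectl get events --field-selector type=Warning and pipe that into the same function.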

Debugging

# Execute into pod
kubectl exec -it <pod-name> -- /bin/sh
kubectl exec <pod-name> -- <command>

# Port forward
kubectl port-forward pod/<pod-name> <local-port>:<pod-port>
kubectl port-forward svc/<service-name> <local-port>:<service-port>

# Resource usage (requires metrics-server in the cluster)
kubectl top nodes
kubectl top pods
kubectl top pods --containers

Service Debugging

# Check service
kubectl get svc
kubectl describe svc <service-name>

# Check endpoints
kubectl get endpoints <service-name>
kubectl describe endpoints <service-name>

# Check selectors
kubectl get svc <service-name> -o jsonpath='{.spec.selector}'
kubectl get pods --show-labels

Storage Debugging

# Check PVC
kubectl get pvc
kubectl describe pvc <pvc-name>

# Check PV
kubectl get pv
kubectl describe pv <pv-name>

# Check storage class
kubectl get storageclass
kubectl describe storageclass <sc-name>

Troubleshooting Scenarios

Scenario 1: Application Not Accessible

Symptoms:

Application not responding
Connection refused
Timeout errors

Debugging steps:

Check pods: kubectl get pods
Check service: kubectl get svc
Check endpoints: kubectl get endpoints
Check pod logs: kubectl logs <pod-name>
Check service selectors: Match pod labels?
Check ports: Service port vs pod port?
Test from pod: kubectl exec -it <pod> -- curl localhost:<pod-port> (requires curl in the container image)
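The endpoints check in these steps is often the quickest tell: a Service with no endpoints has no ready pods behind it. Here is a minimal sketch that lists such Services. The name services_without_endpoints is hypothetical, and it assumes single-namespace output from kubectl get endpoints --no-headers, where an empty ENDPOINTS column is printed as <none>.

```shell
# services_without_endpoints: reads `kubectl get endpoints --no-headers`
# on stdin and prints Services that currently have no endpoints.
# Example:
#   kubectl get endpoints --no-headers | services_without_endpoints
services_without_endpoints() {
  awk '$2 == "<none>" { print $1 }'
}
```

For any Service it prints, compare the selector with the pod labels and check that the pods are passing their readiness probes.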

Scenario 2: Pods Not Starting

Symptoms:

Pods in Pending or CrashLoopBackOff
Deployment not scaling
Replicas not matching

Debugging steps:

Check pod status: kubectl get pods
Describe pod: kubectl describe pod <pod-name>
Check events: Look for errors
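Check logs: kubectl logs <pod-name> --previous (only for pods that have started at least once; Pending pods have no logs)
Check resources: kubectl top nodes for current usage, kubectl describe node <node-name> for allocated requests (scheduling is based on requests)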
Check image: Is it correct?
Check storage: PVC bound?
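The first step, checking pod status, can be narrowed to just the pods that need attention. This is a sketch with assumptions: unhealthy_pods is a hypothetical name, and it expects single-namespace output from kubectl get pods --no-headers, where the third column is STATUS.

```shell
# unhealthy_pods: reads `kubectl get pods --no-headers` on stdin and
# prints "NAME STATUS" for pods that are neither Running nor Completed.
# Example:
#   kubectl get pods --no-headers | unhealthy_pods
unhealthy_pods() {
  awk 'NF >= 3 && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}
```

Each line it prints is a pod to describe next.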

Scenario 3: Deployment Not Updating

Symptoms:

Deployment not rolling out
Old pods still running
New pods not starting

Debugging steps:

Check rollout: kubectl rollout status deployment/<name>
Check replica sets: kubectl get replicasets
Check new pods: kubectl get pods -l app=<label>
Check pod status: Why aren't new pods ready?
Check readiness probe: Is it configured?
Check image: Is new image correct?
Check resources: Enough resources?
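During a stalled rollout, new pods are often Running but never become ready, which points at the readiness probe. The sketch below picks out those pods. The name not_ready_pods is hypothetical, and it assumes single-namespace output from kubectl get pods --no-headers, where the second column is READY in the form ready/total.

```shell
# not_ready_pods: reads `kubectl get pods --no-headers` on stdin and
# prints "NAME READY" for Running pods with containers that are not ready.
# Example:
#   kubectl get pods -l app=<label> --no-headers | not_ready_pods
not_ready_pods() {
  awk 'NF >= 3 && $3 == "Running" {
    split($2, r, "/")                      # READY column looks like 0/1
    if (r[1] + 0 < r[2] + 0) print $1, $2
  }'
}
```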

Real-World Troubleshooting Flow

Here's how I troubleshoot production issues:

Step 1: Understand the problem

What's the symptom?
When did it start?
What changed?

Step 2: Check cluster health

kubectl get nodes
kubectl get pods --all-namespaces
kubectl top nodes
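On a large cluster, the node list is easier to read when it is filtered to nodes that are not plainly Ready. This is a small sketch: nodes_needing_attention is a hypothetical name, and it assumes output from kubectl get nodes --no-headers, where the second column is STATUS. Cordoned nodes show as Ready,SchedulingDisabled and are included on purpose.

```shell
# nodes_needing_attention: reads `kubectl get nodes --no-headers` on stdin
# and prints "NAME STATUS" for every node whose status is not exactly Ready.
# Example:
#   kubectl get nodes --no-headers | nodes_needing_attention
nodes_needing_attention() {
  awk 'NF >= 2 && $2 != "Ready" { print $1, $2 }'
}
```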

Step 3: Narrow down

Which namespace?
Which pod?
Which service?

Step 4: Gather information

kubectl describe pod <pod-name>
kubectl logs <pod-name> --previous
kubectl get events --sort-by='.lastTimestamp'

Step 5: Analyze

What's the error?
What's the pattern?
What's likely wrong?

Step 6: Fix

Apply the fix
Monitor the result

Step 7: Verify

kubectl get pods
kubectl logs <pod-name>
# Test the application

My Take: The Troubleshooting Mindset

I used to panic when things broke. I'd try random fixes. I'd guess.

Then I learned the systematic approach:

Stay calm - Panic doesn't help
Follow the process - Systematic approach works
Gather evidence - Don't guess, investigate
Fix root cause - Not just symptoms
Verify - Make sure it works

Now I fix issues in minutes, not hours. The process works.

Memory Tip: The Detective Analogy

Troubleshooting is like being a detective:

Understand the crime (symptom)
Gather evidence (logs, events, describe)
Follow leads (check resources, configurations)
Form hypothesis (likely cause)
Test theory (try fix)
Verify solution (confirm it works)

Once you see it this way, troubleshooting becomes systematic.

Key Takeaways

Have a framework - Systematic approach works
Gather information - Don't guess, investigate
Use the tools - kubectl describe, logs, events
Fix root cause - Not just symptoms
Verify - Make sure it works

What's Next?

Now that you have the complete troubleshooting guide, you're ready to debug Kubernetes issues methodically. Practice. Use this guide. Master troubleshooting.

Remember: Troubleshooting isn't about knowing everything. It's about having a systematic approach. Follow the process. It works.
