Production Scenarios: Real-World Troubleshooting
Production is different. Real problems. Real pressure. Real solutions. That's production scenarios.
🎯 The Big Picture​
Think of production scenarios like emergency situations. Car breaks down on highway (production issue). You need to fix it. Fast. That's production troubleshooting.
Production scenarios are real problems. In production. Under pressure. With real solutions. This is how you handle them.
Scenario 1: Container Crash Loop​
Problem:
- Container keeps crashing
- Restarts continuously
- Application unavailable
Investigation:
# Check container status
docker ps -a
# Check logs
docker logs container-name
# Check recent logs
docker logs --tail 100 container-name
# Check exit code
docker inspect container-name | grep ExitCode
Common Causes:
- Application error
- Missing dependencies
- Configuration error
- Resource exhaustion
Solution:
# Fix the root cause
# - Fix application code
# - Fix configuration
# - Add missing dependencies
# - Increase resources
# Restart container
docker restart container-name
# Or rebuild
docker build -t my-app:1.1 .
docker run my-app:1.1
Prevention:
- Health checks
- Proper error handling
- Resource limits
- Testing before deploy
Scenario 2: High Memory Usage​
Problem:
- Containers using too much memory
- Host running out of memory
- Containers being killed
Investigation:
# Check memory usage
docker stats
# Check specific container
docker stats container-name
# Check host memory
free -h
# Check container limits
docker inspect container-name | grep Memory
Solution:
# Set memory limits
docker run -m 512m my-app
# Or in Compose
services:
app:
deploy:
resources:
limits:
memory: 512M
Optimization:
- Use smaller base images
- Remove unnecessary packages
- Optimize application
- Use multi-stage builds
Scenario 3: Network Connectivity Issues​
Problem:
- Containers can't communicate
- Services unreachable
- Connection timeouts
Investigation:
# Check networks
docker network ls
# Inspect network
docker network inspect network-name
# Test connectivity
docker exec container1 ping container2
# Check DNS
docker exec container-name nslookup service-name
Solution:
# Connect to same network
docker network create app-network
docker run --network app-network app
docker run --network app-network db
# Use service names
# app can connect to db using hostname "db"
Prevention:
- Use Docker networks
- Use service names
- Document network topology
- Test connectivity
The Emergency Situation Analogy​
Think of production scenarios like emergency situations:
Crash loop: Car breaks down High memory: Out of fuel Network issues: Can't communicate
Once you see it this way, production troubleshooting makes perfect sense.
Scenario 4: Disk Space Exhaustion​
Problem:
- No space left on device
- Can't build images
- Can't create containers
Investigation:
# Check disk usage
docker system df
# Check detailed usage
docker system df -v
# Check host disk
df -h
Solution:
# Clean up unused resources
docker system prune -a
# Remove unused volumes
docker volume prune
# Remove unused images
docker image prune -a
# Remove build cache
docker builder prune -a
Prevention:
- Regular cleanup
- Monitor disk usage
- Set up alerts
- Use .dockerignore
Scenario 5: Slow Performance​
Problem:
- Slow response times
- High CPU usage
- Poor user experience
Investigation:
# Monitor resources
docker stats
# Check CPU usage
docker stats --format "table {{.Container}}\t{{.CPUPerc}}"
# Profile application
docker exec container-name top
Solution:
# Set CPU limits
docker run --cpus="1.0" my-app
# Or in Compose
services:
app:
deploy:
resources:
limits:
cpus: '1.0'
Optimization:
- Use BuildKit
- Optimize Dockerfile
- Use cache mounts
- Multi-stage builds
Real-World Example: Complete Troubleshooting​
Problem: Production application is down
Step 1: Check status
docker ps -a
# Container status: Exited
Step 2: Check logs
docker logs --tail 100 container-name
# Error: Database connection failed
Step 3: Check network
docker network inspect app-network
# Database container not connected
Step 4: Check database
docker ps -a | grep db
# Database container stopped
Step 5: Restart database
docker start db-container
Step 6: Restart application
docker restart app-container
Step 7: Verify
docker logs app-container
# Application running
curl http://localhost:3000/health
# Healthy
That's production troubleshooting. Fast. Effective.
Best Practices​
1. Monitor Everything​
Set up monitoring:
- Container health
- Resource usage
- Logs
- Metrics
Why: Know what's happening. Detect issues early.
2. Have Runbooks​
Document procedures:
- Common issues
- Solutions
- Escalation paths
Why: Faster resolution. Consistent approach.
3. Test Scenarios​
Practice troubleshooting:
- Simulate failures
- Test recovery
- Document learnings
Why: Better prepared. Faster response.
4. Use Health Checks​
Monitor health:
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost/health"]
interval: 30s
Why: Automatic detection. Faster response.
5. Set Up Alerts​
Alert on issues:
- Container down
- High resource usage
- Errors in logs
Why: Know immediately. Faster response.
My Take: Production Strategy​
Here's what I do:
Prevention:
- Health checks
- Resource limits
- Monitoring
- Alerts
- Testing
When issues occur:
- Check status
- Check logs
- Investigate systematically
- Fix and verify
- Document
The key: Monitor. Alert. Respond. Document. Learn.
Memory Tip: The Emergency Situation Analogy​
Production scenarios = Emergency situations
Crash loop: Car breaks down High memory: Out of fuel Network issues: Can't communicate Disk space: No storage
Once you see it this way, production troubleshooting makes perfect sense.
Common Mistakes​
- No monitoring: Don't know what's happening
- No runbooks: Slow response
- No health checks: Don't detect issues
- No alerts: Issues go unnoticed
- Not documenting: Repeat same mistakes
Hands-On Exercise: Production Troubleshooting​
1. Create a problem:
docker run -d --name test nginx
docker stop test
# Container stopped
2. Investigate:
docker ps -a
docker logs test
docker inspect test
3. Fix:
docker start test
docker logs test
4. Verify:
curl http://localhost
# Should work
That's production troubleshooting. Practice. Learn.
Key Takeaways​
- Monitor everything - Know what's happening
- Have runbooks - Faster resolution
- Test scenarios - Better prepared
- Use health checks - Automatic detection
- Set up alerts - Know immediately
- Document everything - Learn from experience
What's Next?​
Congratulations! You've completed the Troubleshooting module. Now let's learn about best practices. Next: Image Best Practices.
Remember: Production scenarios are like emergency situations. Monitor. Alert. Respond. Document. Learn. Be prepared. Act fast.