Production Scenarios: Real-World Troubleshooting

Production is different. Real problems. Real pressure. Real solutions. That's production scenarios.

🎯 The Big Picture

Think of production scenarios like emergency situations. Your car breaks down on the highway (a production issue). You need to fix it. Fast. That's production troubleshooting.

This section walks through five common production problems: how each one shows up, how to investigate it, and how to fix it.

Scenario 1: Container Crash Loop

Problem:

Container keeps crashing
Restarts continuously
Application unavailable

Investigation:

# Check container status
docker ps -a

# Check logs
docker logs container-name

# Check recent logs
docker logs --tail 100 container-name

# Check exit code (prints just the number)
docker inspect --format '{{.State.ExitCode}}' container-name
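
The exit code often tells you where to look first. Here's a rough sketch of a lookup helper. `explain_exit_code` is a hypothetical name, not a Docker command, and the mapping assumes the usual convention that codes above 128 mean 128 plus a signal number. Your application may use its own codes.

```shell
# Hypothetical helper: map a container exit code to a likely cause.
# Assumes the common "128 + signal number" convention for signals.
explain_exit_code() {
  case "$1" in
    0)   echo "clean exit" ;;
    1)   echo "application error" ;;
    126) echo "command not executable" ;;
    127) echo "command not found" ;;
    137) echo "killed by SIGKILL" ;;
    139) echo "segmentation fault" ;;
    143) echo "terminated by SIGTERM" ;;
    *)   echo "unknown" ;;
  esac
}

# Example (needs Docker; container-name is a placeholder):
#   code=$(docker inspect --format '{{.State.ExitCode}}' container-name)
#   explain_exit_code "$code"
```

A 137 is worth a second look: it can mean the kernel killed the process for using too much memory, which points at Scenario 2.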

Common Causes:

Application error
Missing dependencies
Configuration error
Resource exhaustion

Solution:

# Fix the root cause
# - Fix application code
# - Fix configuration
# - Add missing dependencies
# - Increase resources

# Restart container (only helps once the root cause is fixed)
docker restart container-name

# Or rebuild and run the fixed image
docker build -t my-app:1.1 .
docker run -d my-app:1.1

Prevention:

Health checks
Proper error handling
Resource limits
Testing before deploy

Scenario 2: High Memory Usage

Problem:

Containers using too much memory
Host running out of memory
Containers being killed

Investigation:

# Check memory usage
docker stats

# Check specific container
docker stats container-name

# Check host memory
free -h

# Check container memory limit (in bytes; 0 means no limit)
docker inspect --format '{{.HostConfig.Memory}}' container-name
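
Watching `docker stats` by eye gets old fast. Below is a minimal sketch of a filter that prints only the containers above a memory threshold. `over_mem_threshold` is a made-up helper, and it assumes input lines of the form `name percent%`, which is what the `--format` string in the usage comment produces.

```shell
# Hypothetical helper: print container names whose memory percentage
# exceeds the given threshold. Reads "<name> <percent>%" lines on stdin.
over_mem_threshold() {
  awk -v limit="$1" '{
    pct = $2
    sub(/%/, "", pct)            # strip the trailing percent sign
    if (pct + 0 > limit + 0) print $1
  }'
}

# Example (needs Docker):
#   docker stats --no-stream --format '{{.Name}} {{.MemPerc}}' | over_mem_threshold 80
```

The percentage is relative to the container's limit, or to host memory when no limit is set.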

Solution:

# Set memory limits
docker run -m 512m my-app

# Or in Compose
services:
  app:
    deploy:
      resources:
        limits:
          memory: 512M

Optimization:

Use smaller base images
Remove unnecessary packages
Optimize application
Use multi-stage builds

Scenario 3: Network Connectivity Issues

Problem:

Containers can't communicate
Services unreachable
Connection timeouts

Investigation:

# Check networks
docker network ls

# Inspect network
docker network inspect network-name

# Test connectivity (requires ping in the image)
docker exec container1 ping -c 3 container2

# Check DNS (requires nslookup in the image)
docker exec container-name nslookup service-name
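
Two containers can only reach each other by name if they share a user-defined network. Here's a small sketch that compares two network lists and prints what they have in common. `shared_networks` is a hypothetical helper; the `docker inspect` template in the usage comment is one way to produce its input.

```shell
# Hypothetical helper: print the network names that appear in both lists.
# Each argument is a space-separated list of network names.
shared_networks() {
  for a in $1; do
    for b in $2; do
      if [ "$a" = "$b" ]; then
        echo "$a"
      fi
    done
  done
}

# Example (needs Docker; container1 and container2 are placeholders):
#   fmt='{{range $k, $v := .NetworkSettings.Networks}}{{$k}} {{end}}'
#   shared_networks "$(docker inspect --format "$fmt" container1)" \
#                   "$(docker inspect --format "$fmt" container2)"
```

Empty output means the containers share no network, which would explain the timeouts.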

Solution:

# Connect to same network, and give each container a name
docker network create app-network
docker run -d --name db --network app-network db
docker run -d --name app --network app-network app

# Use container names as hostnames
# app can connect to db using hostname "db"

Prevention:

Use Docker networks
Use service names
Document network topology
Test connectivity

The Emergency Situation Analogy

Think of production scenarios like emergency situations:

Crash loop: Car breaks down
High memory: Out of fuel
Network issues: Can't communicate

Once you see it this way, production troubleshooting makes perfect sense.

Scenario 4: Disk Space Exhaustion

Problem:

No space left on device
Can't build images
Can't create containers

Investigation:

# Check disk usage
docker system df

# Check detailed usage
docker system df -v

# Check host disk
df -h

Solution:

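# Clean up unused resources
# Careful: -a removes every image not used by a container, not just dangling ones
docker system prune -a

# Remove unused volumes (this permanently deletes their data)
docker volume prune

# Remove unused images
docker image prune -a

# Remove build cache
docker builder prune -a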
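
Pruning everything at once can be too blunt for production. A gentler sketch is to remove only things older than a cutoff, with a dry-run switch so you can see the commands first. `cleanup_older_than` and the `DRY_RUN` variable are assumptions made up for this example; the `until` filter itself is a standard prune filter.

```shell
# Hypothetical helper: prune stopped containers, unused images, and build
# cache older than N hours. Set DRY_RUN=1 to print the commands instead.
cleanup_older_than() {
  hours="$1"
  run="${DRY_RUN:+echo}"   # "echo" when DRY_RUN is set, empty otherwise
  $run docker container prune -f --filter "until=${hours}h"
  $run docker image prune -a -f --filter "until=${hours}h"
  $run docker builder prune -f --filter "until=${hours}h"
}

# Example: preview a one-week cleanup without touching anything
#   DRY_RUN=1 cleanup_older_than 168
```

Volumes are left out on purpose, since deleting them deletes data.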

Prevention:

Regular cleanup
Monitor disk usage
Set up alerts
Use .dockerignore

Scenario 5: Slow Performance

Problem:

Slow response times
High CPU usage
Poor user experience

Investigation:

# Monitor resources
docker stats

# Check CPU usage
docker stats --format "table {{.Container}}\t{{.CPUPerc}}"

# See which processes are busy inside the container (requires top in the image)
docker exec -it container-name top
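
With many containers, the first question is usually "who is eating the CPU?". Here's a quick sketch that ranks them. `top_cpu` is a hypothetical helper, and it assumes `name percent%` lines on stdin.

```shell
# Hypothetical helper: print the N containers with the highest CPU percentage.
# Reads "<name> <percent>%" lines on stdin.
top_cpu() {
  # LC_ALL=C makes sort treat "." as the decimal point
  LC_ALL=C sort -k2,2nr | head -n "$1"
}

# Example (needs Docker):
#   docker stats --no-stream --format '{{.Name}} {{.CPUPerc}}' | top_cpu 3
```

Values above 100% are normal on multi-core hosts, because each fully used core counts as 100%.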

Solution:

# Set CPU limits so one busy container cannot starve the others
docker run --cpus="1.0" my-app

# Or in Compose
services:
  app:
    deploy:
      resources:
        limits:
          cpus: '1.0'

Optimization (these speed up image builds rather than the running app):

Use BuildKit
Optimize Dockerfile
Use cache mounts
Multi-stage builds

Real-World Example: Complete Troubleshooting

Problem: Production application is down

Step 1: Check status

docker ps -a
# Container status: Exited

Step 2: Check logs

docker logs --tail 100 container-name
# Error: Database connection failed

Step 3: Check network

docker network inspect app-network
# Database container not connected

Step 4: Check database

docker ps -a | grep db
# Database container stopped

Step 5: Restart database

docker start db-container

Step 6: Restart application

docker restart app-container

Step 7: Verify

docker logs app-container
# Application running
curl http://localhost:3000/health
# Healthy
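
After a restart, the app may need a few seconds before the health endpoint answers. Instead of running `curl` by hand until it works, you can use a small retry loop. This is a sketch: `retry` is a hypothetical helper, and the attempt count and delay in the usage comment are arbitrary.

```shell
# Hypothetical helper: run a command until it succeeds or attempts run out.
# Usage: retry <attempts> <delay-seconds> <command...>
retry() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0            # command succeeded
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"      # wait before the next attempt
    fi
    i=$((i + 1))
  done
  return 1                # every attempt failed
}

# Example (needs curl and the app from the walkthrough above):
#   retry 10 3 curl -fsS http://localhost:3000/health
```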

That's production troubleshooting. Fast. Effective.

Best Practices

1. Monitor Everything

Set up monitoring:

Container health
Resource usage
Logs
Metrics

Why: Know what's happening. Detect issues early.

2. Have Runbooks

Document procedures:

Common issues
Solutions
Escalation paths

Why: Faster resolution. Consistent approach.

3. Test Scenarios

Practice troubleshooting:

Simulate failures
Test recovery
Document learnings

Why: Better prepared. Faster response.

4. Use Health Checks

Monitor health:

healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost/health"]
  interval: 30s

Why: Automatic detection. Faster response.

5. Set Up Alerts

Alert on issues:

Container down
High resource usage
Errors in logs

Why: Know immediately. Faster response.
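
Here's a bare-bones sketch of "errors in logs" alerting. `count_errors` and the patterns it matches are assumptions for this example. Real setups normally send logs to an aggregator and alert from there.

```shell
# Hypothetical helper: count log lines that look like errors.
# Reads log lines on stdin; the pattern list is only an example.
count_errors() {
  # grep -c exits non-zero when the count is 0, so ignore its status
  grep -ciE 'error|fatal|panic' || true
}

# Example (needs Docker; container-name is a placeholder):
#   errors=$(docker logs --since 10m container-name 2>&1 | count_errors)
#   if [ "$errors" -gt 0 ]; then echo "container-name: $errors error lines"; fi
```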

My Take: Production Strategy

Here's what I do:

Prevention:

Health checks
Resource limits
Monitoring
Alerts
Testing

When issues occur:

Check status
Check logs
Investigate systematically
Fix and verify
Document

The key: Monitor. Alert. Respond. Document. Learn.

Memory Tip: The Emergency Situation Analogy

Production scenarios = Emergency situations

Crash loop: Car breaks down
High memory: Out of fuel
Network issues: Can't communicate
Disk space: No storage

Once you see it this way, production troubleshooting makes perfect sense.

Common Mistakes

No monitoring: Don't know what's happening
No runbooks: Slow response
No health checks: Don't detect issues
No alerts: Issues go unnoticed
Not documenting: Repeat same mistakes

Hands-On Exercise: Production Troubleshooting

1. Create a problem:

docker run -d --name test -p 80:80 nginx
docker stop test
# Container stopped

2. Investigate:

docker ps -a
docker logs test
docker inspect test

3. Fix:

docker start test
docker logs test

4. Verify:

curl http://localhost
# Should work

That's production troubleshooting. Practice. Learn.

Key Takeaways

Monitor everything - Know what's happening
Have runbooks - Faster resolution
Test scenarios - Better prepared
Use health checks - Automatic detection
Set up alerts - Know immediately
Document everything - Learn from experience

What's Next?

Congratulations! You've completed the Troubleshooting module. Now let's learn about best practices. Next: Image Best Practices.

Remember: Production scenarios are like emergency situations. Monitor. Alert. Respond. Document. Learn. Be prepared. Act fast.
