Most network troubleshooting fails for one reason: you think you already know the problem.
You don't.
Good engineers don't guess. They eliminate.
This is the checklist I fall back on every time. It's boring, methodical, and it works.
1. Define the Problem (Not Your Theory)
"Internet is down" is not a problem. It's a complaint.
Ask:
- What exactly is failing?
- Since when?
- Who is affected?
- What changed?
If you skip this, you're debugging blind.
2. Scope It Fast
Before touching anything, figure out the blast radius.
- One user or everyone?
- One VLAN or entire network?
- One app or all traffic?
If it's one device → local issue. Many devices → network issue. Everyone → core or upstream.
This step alone eliminates most of your search space.
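The rule of thumb above is simple enough to write down as a function. This is only a sketch in Python: the name `blast_radius` and the returned labels are made up for illustration, and real scoping also looks at VLANs and applications, not just device counts.

```python
def blast_radius(affected: int, total: int) -> str:
    """Map how many devices are affected to where to look first."""
    if total <= 0 or not 0 <= affected <= total:
        raise ValueError("need total > 0 and 0 <= affected <= total")
    if affected == 0:
        return "nothing affected"
    if affected == total and total > 1:
        return "core or upstream"   # everyone is down
    if affected == 1:
        return "local issue"        # a single device
    return "network issue"          # many devices, but not all
```

For example, `blast_radius(1, 200)` points at the device itself, while `blast_radius(200, 200)` points at the core or the upstream link.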
3. Start at Layer 1 (Yes, Seriously)
Everyone wants to jump to BGP. Meanwhile the cable is unplugged.
Check link lights, cables, and interface status:
show interface status
No link = no network. Don't overthink it.
4. Validate IP Basics
No IP, no life. Check the address, subnet, and gateway:
ipconfig /all     (Windows)
ip addr show      (Linux)
ifconfig          (macOS, older Linux)
Wrong subnet or missing gateway kills everything silently.
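If you want to sanity-check an address and gateway pair without squinting at netmasks, Python's standard `ipaddress` module can do the math. A minimal sketch follows; the function name `check_ip_config` and the problem strings are my own, and the addresses in the usage note are placeholders.

```python
import ipaddress

def check_ip_config(address_with_prefix: str, gateway: str) -> list[str]:
    """Return a list of problems found in a host's address/gateway pair."""
    problems = []
    iface = ipaddress.ip_interface(address_with_prefix)
    if iface.ip.is_link_local:
        # 169.254.0.0/16 on IPv4 usually means no DHCP lease was obtained
        problems.append("link-local address: DHCP probably failed")
    if not gateway:
        problems.append("no default gateway configured")
    elif ipaddress.ip_address(gateway) not in iface.network:
        problems.append("gateway is outside the host's subnet")
    return problems
```

`check_ip_config("192.168.1.10/24", "192.168.1.1")` returns an empty list, while a gateway of `192.168.2.1` is flagged as outside the subnet.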
5. Test the Path (Hop by Hop)
Don't just ping Google like a tourist. Go step by step:
- Ping yourself
- Ping gateway
- Ping next hop
- Then external
ping <gateway>
traceroute 8.8.8.8     (tracert 8.8.8.8 on Windows)
Where it fails = where the problem lives.
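The hop-by-hop walk can be scripted. The sketch below assumes a system `ping` command is on the PATH and that its exit code reflects reachability, which is not reliable on every platform. The names `ping_once` and `first_failing_hop` are illustrative, and the hop list is something you supply yourself.

```python
import platform
import subprocess

def ping_once(host: str) -> bool:
    """Send one ICMP echo using the system ping command."""
    count_flag = "-n" if platform.system() == "Windows" else "-c"
    result = subprocess.run(
        ["ping", count_flag, "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def first_failing_hop(hops, probe=ping_once):
    """Probe (name, address) pairs in order; return the first name that fails."""
    for name, address in hops:
        if not probe(address):
            return name
    return None  # every hop answered
```

Call it with hops ordered from nearest to farthest, for example yourself, the gateway, the next hop, then something external. The `probe` argument is injectable so the logic can be exercised without touching the network.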
6. DNS Is Always Guilty (Until Proven Innocent)
If IP works but domain doesn't → DNS. Test it:
nslookup google.com
If DNS fails, users will say "internet is down" even when it's not. Classic.
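The "IP works, name doesn't" test is easy to encode. This sketch uses `socket.getaddrinfo` from the standard library for the lookup. The function name `classify_failure` and its return strings are hypothetical, and you still have to determine `ip_reachable` yourself, for example with a ping to a known address.

```python
import socket

def classify_failure(ip_reachable: bool, hostname: str,
                     resolver=socket.getaddrinfo) -> str:
    """Separate 'DNS is broken' from 'the network is broken'."""
    try:
        resolver(hostname, None)
        name_resolves = True
    except socket.gaierror:
        # raised by getaddrinfo when the name cannot be resolved
        name_resolves = False
    if not ip_reachable:
        return "connectivity problem"
    if not name_resolves:
        return "DNS problem"
    return "DNS and connectivity look fine"
```

The resolver is a parameter so the decision logic can be checked with a stand-in that always fails.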
7. Check Routing (Don't Assume It's There)
Routes disappear. Misconfig happens.
show ip route
Look for missing routes, wrong next hop, asymmetry. If traffic can't find the way back, it's dead.
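To see why a missing route is fatal, it helps to mimic what the router does: pick the most specific prefix that contains the destination. Here is a toy longest-prefix-match lookup. The table format and the name `lookup_route` are assumptions for illustration, not how any particular device stores routes.

```python
import ipaddress

def lookup_route(routes, destination):
    """Longest-prefix match: return (prefix, next_hop), or None if no route."""
    dest = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(prefix), next_hop)
        for prefix, next_hop in routes
        if dest in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return None  # no matching route and no default: the packet is dropped
    best = max(matches, key=lambda m: m[0].prefixlen)
    return str(best[0]), best[1]
```

With a table of `10.0.0.0/8` and `10.1.2.0/24` and no default route, a lookup for `10.1.2.7` picks the /24, and a lookup for `8.8.8.8` returns `None`. Run the same lookup in the reverse direction on the far side to catch a missing return path.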
8. ACLs and Firewalls (Silent Killers)
Packets don't complain. They just disappear. Check ACL rules and firewall logs.
If something "should work but doesn't", this is where you look.
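Most "should work but doesn't" cases come down to rule order. On many platforms an ACL is evaluated top-down, the first match wins, and anything unmatched hits an implicit deny. The sketch below models that behavior in Python. The rule tuple format and the name `evaluate_acl` are invented for illustration, so check your platform's actual semantics.

```python
import ipaddress

def evaluate_acl(rules, src, dst_port):
    """First matching rule wins; no match falls through to an implicit deny."""
    source = ipaddress.ip_address(src)
    for action, network, port in rules:
        port_matches = port is None or port == dst_port  # None means any port
        if source in ipaddress.ip_network(network) and port_matches:
            return action
    return "deny (implicit)"
```

Given the rules `[("deny", "10.0.5.0/24", 443), ("permit", "10.0.0.0/8", None)]`, a host at `10.0.5.20` is denied on port 443 even though the broader permit below would have allowed it.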
9. NAT (The Invisible Saboteur)
NAT breaks things quietly. Check translations, overload issues, and port exhaustion. Especially relevant when internal works but external doesn't.
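Port exhaustion is just arithmetic, so a rough estimate is easy to compute. The sketch below assumes each public IP offers ports 1024 through 65535 (64512 ports) for overloaded NAT. That figure is an assumption: real devices reserve different ranges and often track pools per protocol or per destination. The name `pat_port_utilization` is hypothetical.

```python
def pat_port_utilization(active_translations: int, public_ips: int,
                         ports_per_ip: int = 64512) -> float:
    """Fraction of the PAT port pool in use (1.0 means exhausted)."""
    if public_ips <= 0 or ports_per_ip <= 0:
        raise ValueError("need at least one public IP and one usable port")
    return active_translations / (public_ips * ports_per_ip)
```

Feed it the translation count from your device. A value creeping toward 1.0 during the outage window is a strong hint.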
10. Recent Changes (The Real Villain)
Most issues trace back to this. Ask:
- What changed?
- Who touched it?
- What was deployed?
No change? Someone forgot to mention it.
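When nobody admits to a change, let the configs answer. If you keep a last-known-good snapshot, Python's `difflib` will show what differs from the running config. This is a minimal sketch; the function name `config_diff` and the snapshot labels are my own.

```python
import difflib

def config_diff(known_good: str, running: str) -> list[str]:
    """Return only the added/removed lines between two config snapshots."""
    diff = difflib.unified_diff(
        known_good.splitlines(),
        running.splitlines(),
        fromfile="known_good",
        tofile="running",
        lineterm="",
    )
    # keep +/- content lines, drop the '+++' / '---' file header lines
    return [
        line for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]
```

An empty result means the two snapshots match, which at least rules this device out.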
11. Logs Don't Lie (People Do)
Stop guessing. Read logs: syslog, device logs, monitoring alerts. They usually tell you exactly what happened. You just ignored them.
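The useful log lines are usually the ones just before the incident started. The sketch below filters for that window. It assumes each line begins with an ISO 8601 timestamp followed by a space, which is a made-up format for illustration. Adapt the parsing to whatever your devices actually emit.

```python
from datetime import datetime, timedelta

def logs_before_incident(lines, incident_start, minutes=10):
    """Keep log lines stamped within `minutes` before the incident began."""
    window_start = incident_start - timedelta(minutes=minutes)
    selected = []
    for line in lines:
        stamp = line.split(" ", 1)[0]
        try:
            when = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # skip lines without a parseable leading timestamp
        if window_start <= when <= incident_start:
            selected.append(line)
    return selected
```

Widen `minutes` if the first pass turns up nothing. The trigger is often earlier than the first complaint.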
12. Reproduce the Issue
If you can't reproduce it, you don't understand it. Same device, same network, same conditions.
No reproduction = temporary fix at best.
13. Fix One Thing at a Time
Don't shotgun config changes.
Change → test → observe.
Otherwise you create a second problem while fixing the first.
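The same loop works if you script your changes. Here is a sketch of the idea: apply one change, run a health check, and roll back the first change that fails it. Everything here is hypothetical, including the name `apply_changes_one_at_a_time` and the shape of the change tuples. In practice the callables would wrap whatever automation you use.

```python
def apply_changes_one_at_a_time(changes, is_healthy):
    """Apply each change, test, and roll back the first one that breaks things."""
    applied = []
    for name, apply, rollback in changes:
        apply()
        if not is_healthy():
            rollback()
            return applied, name  # changes kept so far, and the one that failed
        applied.append(name)
    return applied, None  # every change passed the health check
```

Because only one change is in flight at a time, a failed check points at exactly one culprit.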
14. Document the Root Cause
Not "fixed issue". Write what failed, why it failed, and how you fixed it.
Future you will forget. Guaranteed.
Final Thought
Troubleshooting is not about being smart. It's about being disciplined.
Anyone can guess. Very few people can systematically eliminate.
That's the difference between "try restarting it" and actually knowing what you're doing.
Follow the checklist. Every time. Even when you're sure you already know the answer.
Especially then.