If you’re a network engineer, IT manager, or ops lead working with Juniper gear (specifically the MX480, SRX series, or a Mist AI deployment) and you’ve just discovered a critical performance issue, a faulty config push, or a total site outage, this checklist is for you.
I’ve been the person called in when “normal troubleshooting” takes too long. In my role coordinating network recoveries for a managed services company, I’ve handled 50+ emergency triage events in the last two years — including same-day turnarounds for financial services clients where downtime cost $15k per hour. This list has been refined through 3 major incidents in March 2024 alone.
It’s 5 steps. No fluff. Let’s go.
When everything goes dark, the monitoring stack lights up like a Christmas tree. The biggest mistake I see is chasing alarms. Don’t.
Ask two questions:
Say the issue is a silent failure on an aggregated Ethernet interface. The alarm might say “interface down,” but the real problem is a mismatched LACP mode. During one such incident, I said “check LACP status.” They heard “reload the switches.” Result: 45 extra minutes of downtime.
A quick fix: On the MX480, run show lacp interfaces before touching any port config. That one command has saved me three times this year.
This is the risk-weighing step. The upside of rushing a swap is fast recovery. The risk is breaking something else. I kept asking myself: is a 10-minute fix worth potentially corrupting the backup config?
I calculated the worst case: a complete redo of the config at 3 AM. Best case: it boots up clean in 60 seconds. The expected value said go for the fix, but the downside felt catastrophic, so I always double-check the candidate config. Always.
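If you’re using Junos, run commit check first, then commit confirmed 10. That’s the no-brainer move. The change goes live, and Junos rolls it back automatically after 10 minutes unless you confirm it with a follow-up commit, so a bad push undoes itself if things go sideways. Use it every single time.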
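If you script your changes, the same check-then-confirm pattern can be sketched in Python. This is a minimal sketch, not a definitive implementation: it assumes a config object that behaves like Junos PyEZ's Config utility, where commit_check() validates the candidate, commit(confirm=N) starts an N-minute rollback timer, and a plain commit() confirms. The guarded_commit helper, the verify callback, and the StubConfig stand-in are hypothetical names for illustration.

```python
def guarded_commit(cu, verify, confirm_minutes=10):
    """Commit with an automatic rollback unless verify() passes.

    `cu` is ASSUMED to behave like Junos PyEZ's Config utility.
    `verify` is a hypothetical callback that returns True when the
    network looks healthy after the change.
    """
    cu.commit_check()                    # validate the candidate config first
    cu.commit(confirm=confirm_minutes)   # live now, auto-rollback in N minutes
    if verify():
        cu.commit()                      # confirm: cancels the pending rollback
        return True
    # Not confirmed: the device rolls back by itself when the timer expires.
    return False


class StubConfig:
    """Stand-in for a real config object so the sketch runs without a device."""

    def __init__(self):
        self.calls = []

    def commit_check(self):
        self.calls.append("check")
        return True

    def commit(self, confirm=None):
        self.calls.append(("commit", confirm))
        return True


stub = StubConfig()
confirmed = guarded_commit(stub, verify=lambda: True)
print(confirmed, stub.calls)
```

Against a real device, you would pass a PyEZ Config object built inside an open Device session instead of the stub. The point of the design is that doing nothing is the safe path: if verify() fails or the script dies, nobody has to reach the box to undo the change.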
This is the step most people skip, and it’s the one that saves you hours.
If you have Mist AI active, pull up the relevant site in the Marvis virtual assistant. Ask it: “Why did the SRX drop all BGP sessions at 02:15 UTC?”
In Q4 2024, I used this on an SRX380 issue. The Mist AI had already correlated the BGP flap with a CPU spike caused by a specific DoS pattern — something we would have spent 90 minutes pinpointing manually. Instead, 8 minutes.
If you don’t have Mist, use request support information (RSI) from the device and grep for the offending error. But honestly, after seeing Mist in action on a full MX480 stack, it’s a game-changer for emergency responses.
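For the non-Mist path, here is a rough sketch of the "grep the RSI" step, assuming you have saved the RSI output to a text file and read it into a string. The scan_rsi helper and the sample lines are hypothetical. The sample is made-up text, not real Junos output.

```python
import re


def scan_rsi(text, pattern, context=1):
    """Return (line_number, snippet) pairs for lines matching `pattern`.

    `text` is saved RSI output; `context` is how many neighbouring
    lines to include on each side of a match.
    """
    lines = text.splitlines()
    rx = re.compile(pattern, re.IGNORECASE)
    hits = []
    for i, line in enumerate(lines):
        if rx.search(line):
            lo = max(0, i - context)
            hi = min(len(lines), i + context + 1)
            hits.append((i + 1, "\n".join(lines[lo:hi])))
    return hits


# Hypothetical sample text, NOT real Junos output.
sample = "\n".join([
    "line one: nothing interesting",
    "bgp peer 192.0.2.1 state changed to Idle",
    "line three: nothing interesting",
])

for lineno, snippet in scan_rsi(sample, r"bgp.*idle", context=0):
    print(lineno, snippet)
```

On the box itself, piping a show command through | match does the same kind of filtering without leaving the CLI.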
Now you know the cause. Is the fix a config patch or a hardware swap?
Here’s where the total cost of ownership mindset kicks in. A config patch is free, fast, and low risk. A hardware swap means carrier dispatch, repair costs, and a migration window.
But here’s the thing: The $0 patch might take 2 hours of testing and still fail. The $500 replacement PSU for the MX480, installed overnight, might solve it permanently. I now calculate TCO before making the call — even in emergencies. The cheapest fix upfront is not always the cheapest fix overall.
For example, we lost a $20,000 contract in 2023 because we tried to save $120 on a standard PSU instead of a cold spare swap. The client was a logistics company, and the extra 3 hours of partial outage broke their SLA. That’s when we implemented a 48-hour cold spare policy for all MX series routers.
So in a crisis: ask yourself what the full cost of a false fix is. Sometimes, the expensive route is the real bargain.
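To make that concrete, here is a back-of-the-envelope sketch using the 2023 numbers above. The true_cost function is a hypothetical helper, and the hourly downtime rate for that client is not something I quoted, so it is set to 0 here, which understates the cheap route's real cost.

```python
def true_cost(upfront_usd, outage_hours, downtime_usd_per_hour, penalty_usd=0):
    """Upfront spend, plus outage cost, plus any contract or SLA penalty."""
    return upfront_usd + outage_hours * downtime_usd_per_hour + penalty_usd


# Costs are relative to each other. The client's hourly rate is unknown,
# so it is excluded (0) rather than guessed.
cheap_route = true_cost(upfront_usd=0, outage_hours=3,
                        downtime_usd_per_hour=0, penalty_usd=20_000)
cold_spare = true_cost(upfront_usd=120, outage_hours=0,
                       downtime_usd_per_hour=0)

print(f"cheap route: ${cheap_route:,}")
print(f"cold spare:  ${cold_spare:,}")
print(f"difference:  ${cheap_route - cold_spare:,}")
```

Even with the outage hours priced at zero, the $120 saving cost $19,880 more than the spare would have.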
Once the fix is applied, don’t walk away. Verify that the symptom is gone, the alarm is cleared, and the users are back to normal. Then, document exactly what happened.
I use a simple three-line log format:
This sounds boring, but I promise: a 10-minute post-incident note will save you two hours six months later when a similar issue comes up. And be honest in front of the client. Say what went wrong.
Three things I see repeatedly:
One thing I regret: not setting up automated rollback commit scripts sooner. We lost a core switch for 3 hours in January 2024 because someone pushed a BGP filter that was too restrictive. If we had used commit confirmed 1 as a safety net, it would have been a 60-second rollback. That was a painful lesson.
Pricing as of March 2025: Emergency spare dispatch for an SRX1500 runs about $700 (verify current rates with your carrier). A cold spare MX480 PSU is around $1,500. In the context of a $15k/hour downtime, it’s noise. Budget for it.
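If you need to justify that budget line, a quick break-even calculation helps: how many minutes of avoided downtime pay for the spare? This sketch uses the prices and the $15k/hour figure quoted above. The break_even_minutes helper is a hypothetical name.

```python
def break_even_minutes(spare_cost_usd, downtime_usd_per_hour):
    """Minutes of avoided downtime needed to pay for the spare."""
    return spare_cost_usd * 60 / downtime_usd_per_hour


RATE = 15_000  # USD per hour of downtime, as quoted in this article

psu = break_even_minutes(1_500, RATE)      # cold spare MX480 PSU
dispatch = break_even_minutes(700, RATE)   # emergency SRX1500 spare dispatch

print(f"PSU pays for itself after {psu:.1f} minutes of avoided downtime")
print(f"Dispatch pays for itself after {dispatch:.1f} minutes")
```

At that rate the PSU is covered by 6 minutes of avoided outage and the dispatch by under 3.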